Question:
How do I flatten a Spark struct field into multiple rows?


Problem:

I've come across documentation and other questions about flattening Spark structs into columns on the same row, but what I need is to flatten the data into separate rows. Say we have the following data:


ID   | struct_field
3333 | [{"code": 123, "description": "Business Email"}, {"code": 789, "description": "Primary Email"}]
9234 | [{"code": 123, "description": "Business Email"}, {"code": 789, "description": "Primary Email"}, {"code": 456, "description": "Secondary Email"}]


And I want it in this format:

ID   | Description
3333 | Business Email
3333 | Primary Email
9234 | Business Email
9234 | Primary Email
9234 | Secondary Email


I've been trying a variety of solutions but haven't been able to get anywhere with doing this all in PySpark.


Solution:

You can explode the array and then expand the resulting struct with ".*":

from pyspark.sql.functions import explode

df.withColumn("struct_field", explode("struct_field")).select("ID", "struct_field.*").show(truncate=False)


Result: one row per array element, with ID plus the struct's code and description fields expanded into their own columns.
FYI:

  • explode(): turns each element of an array column into its own row, duplicating the other columns.

  • ".*": expands a struct column into one column per field of that struct.


