Flatten, Explode, and Shred JSON– Transform, Manage, and Prepare Data

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ on the Overview blade, click the Open link in the Open Synapse Studio tile ➢ select the Develop hub ➢ hover over Notebooks ➢ click the ellipse (…) ➢ select Notebook ➢ and then load one of the brain wave JSON files from a previous exercise (for example, by using a similar syntax to the following, replacing the * with your ADLS container and account, and by replacing the directory path and filename):
    %%pyspark
    from pyspark.sql.functions import explode, col
    df = spark.read.option(‘multiline’, ‘true’) \
    .json(‘abfss://@.dfs.core.windows.net/Json/PlayingGuitar/POW/*‐0914 .json’)
    flatten = df.select(‘Session’)
    flatten.show()
  2. Add the following code to the notebook. This gets you a little bit deeper into the structure of the JSON document and closer to the data you need.
    flatten = df.select(‘Session’, ‘Session.Scenario’, 'Session.POWReading', 'Session.POWReading') flatten.show()
  3. Add the following code, which uses the explode() method on the POWReading array, to the notebook. Consider also adding the printSchema() method to see the structure of the data loaded into the DataFrame.
    exploded = df.select(col(‘Session.Scenario’).alias(‘SCENARIO’),
    explode(‘Session.POWReading’).alias(‘READING’))
    exploded.printSchema()
    exploded.show(5)
  4. To shred to JSON so that the result resembles the same structure as you achieved in Exercise 5.7, add and execute the following syntax. You can find all the queries in the pysparkFlattenExplodeShred.txt file on GitHub, in the directory Chapter05/Ch05Ex08.
    shredded = exploded.select(‘SCENARIO’,
  5. Rename, save, and commit the notebook to your GitHub repository.

The first snippet of code imports the explode() and col() methods from the pyspark.sql.functions class. Then the JSON file is loaded into a DataFrame with an option stipulating that the file is multiline as opposed to a single line. The top‐level property in the JSON document, Session, is selected and shown. You might consider passing the truncate=False parameter to the show() method, if you want to see the entire file. By default, the first 20 characters of the data are rendered to the result window. The following snippet of code illustrates how you can access the different properties and fields within the JSON document:
flatten = df.select(‘Session’, ‘Session.Scenario’,

                'Session.POWReading.ReadingDate', 'Session.POWReading')

Both the Session and Session.POWReading properties are entities that contain data within them, i.e., they have no direct value. Session.Scenario and Session.POWReading.ReadingDate do return a value when referenced from the select() method. You should notice that when reviewing the results of the query. The third code snippet is where you see the explode() method, which breaks out each reading that contains all electrodes and all their associated frequencies. At this point the data you are after is getting very close to the structure you can use to further analyze it.
exploded = df.select(col(‘Session.Scenario’).alias(‘SCENARIO’),

                   explode('Session.POWReading').alias('READING'))

exploded.printSchema()

This might be a good place to review the structure of the data that is currently loaded into the DataFrame. This is done by calling the printSchema() method on the DataFrame, as follows:
root

The schema gives you a map to navigate the data. For example, by reviewing the schema, you can see how to retrieve the ALPHA frequency value from the AF4 electrode. This is done by using dot notation like READING.AF4.ALPHA, which results in the value for that field being returned. The final code snippet of Exercise 5.8 used that approach to develop the syntax that resulted in the JSON shredding. Remember that shredding occurs when you have transformed the data from a valid JSON‐structured document into a new form. The shredded form can be stored on a Spark table or view (for example, df.createOrReplaceTempView()), stored in Parquet format, or placed into a relational database for further transformation or data analytics.

Raymond Gallardo

Learn More →

Leave a Reply

Your email address will not be published. Required fields are marked *