Azure Cosmos DB—Shred JSON– Transform, Manage, and Prepare Data

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ and then, on the Overview blade, click the Open link in the Open Synapse Studio. If you have already completed Exercise 3.8, skip to step 4; otherwise, continue to step 2.
  2. Select the Manage Hub ➢ select Linked Services found within the External Connections section ➢ click the + New button (or reuse the linked service from Exercise 3.8) ➢ select Azure Cosmos DB (SQL API) from the New Linked Service window ➢ and then click Continue.
  3. Enter a name (I used BrainjammerCosmosDb) ➢ add a description ➢ choose Account Key from the Authentication Type drop‐down list box ➢ select Connection String ➢ select the From Azure Subscription radio button ➢ select the Azure subscription into which you provisioned the Azure Cosmos DB in Exercise 2.2 ➢ select the Azure Cosmos DB account name ➢ select the database name ➢ and then click Commit. Refer to Figure 3.50 for an example of how the configuration should look.
  4. Select the Develop hub ➢ hover over SQL Scripts, click the ellipse (…) and select New SQL Script ➢ and then enter the syntax available in the shredJSON.txt file on GitHub, in the Chapter05/Ch05Ex07 directory. The result looks something similar to Figure 5.23.

FIGURE 5.23 Shredding JSON with Azure Cosmos DB

  1. Rename, save, and commit the SQL script to your GitHub repository.

The query you executed in step 4 begins with a SELECT, which is followed by the OPENROWSET that contains information about the PROVIDER, CONNECTION, and OBJECT.
SELECT TOP 10 Scenario, ReadingDate, AF3THETA, …

FROM OPENROWSET(PROVIDER = ‘CosmosDB’,
CONNECTION = ‘Account=csharpguitar;Database=brainjammer;Key=sMxCN… ko7cJxyQ==’,
OBJECT = ‘sessions’)

The PROVIDER is CosmosDB, which the platform uses to load the necessary code and drivers to make a connection with an Azure Cosmos DB. The CONNECTION contains the details for making the connection, like endpoint and credentials, and the OBJECT is the container ID within the Azure Cosmos DB. The WITH clause provides a window to perform a query within a query, which relies on the CROSS APPLY statements that follow it.
WITH ( Scenario varchar(max) ‘$.Session.Scenario’,
POWReading varchar(max) ‘$.Session.POWReading’) AS readings
CROSS APPLY OPENJSON(readings.POWReading) AS reading
CROSS APPLY OPENJSON(reading.[value])

Notice the result of the WITH clause is defined with the name readings. The first CROSS APPLY uses the readings object to reference the array of brain wave readings within the POWReading array. That POWReading array also contains an array of electrodes, frequencies, and reading values, which are references to again using a second CROSS APPLY. Finally, the last WITH clause provides a reference to the reading.[value] property, which then links into the electrode array and each of its captured frequency’s value.
WITH ( ReadingDate varchar(50),AF3THETA decimal(7,3) ‘$.AF3[0].THETA’, AF3ALPHA decimal(7,3) ‘$.AF3[0].ALPHA’, AF3BETA_L decimal(7,3) ‘$.AF3[0].BETA_L’,…) AS brainwave

Once the JSON document is in this format, i.e., shredded, you can load it into a temporary table, save it to a file, or transform it again into a relational database or star schema. Shredding a JSON document is likely something you have seen before, especially if you have worked with data in this format. Perhaps you have just never heard of it called shredding specifically, but you might remember from Chapter 2, where this kind of activity was discussed in the context of the explode() method. If you want to review what was covered in this book related to this, reread Chapter 2, specifically around Figure 2.29. Then read on to get some hands‐on experience with the explode() method in the context of a DataFrame and an Azure Spark pool.

ADLS/Apache Spark Pool
When working in the context of an Azure Synapse Analytics Spark pool, it is common to work with files residing in your ADLS container. In Exercise 5.8 you will flatten, explode, and shred a JSON file loaded into a DataFrame using PySpark.

Raymond Gallardo

Learn More →

Leave a Reply

Your email address will not be published. Required fields are marked *