
Transform Data Using Apache Spark
Apache Spark can be used in a few products running on Azure: Azure Synapse Analytics Spark pools, Azure Databricks Spark clusters, Azure HDInsight Spark clusters, and Azure Data Factory. The one you choose to work with depends on many things, but the two most important are the following. First, what is the current state of your Big Data solution? If you are starting fresh, then the recommended choice is Azure Synapse Analytics. If you already have existing on‐premises Big Data solutions that run on Databricks or HDInsight, then those Azure products would make the most sense to use. The benefit of moving an on‐premises Big Data solution running on Databricks or HDInsight to Azure is that Microsoft provides the infrastructure and much of the administration of those technologies, allowing you more time to focus on data analytics instead of keeping the platform running. The other important dependency has to do with the skill set of your team, your company, and yourself. If you, your team, and your company have a large pool of HDInsight experience, or it is something you are striving to standardize on, then by all means choose that platform. The product you choose needs to be one you can support, configure, and optimize, so choose the one you have the skills and experience to work with best. Complete Exercise 5.3, where you will transform some brainjammer brain wave data using a Spark pool in Azure Synapse Analytics.

Azure Synapse Analytics
Azure Synapse Analytics is the suite of tools Microsoft recommends for running Big Data on the Azure platform.

EXERCISE 5.3 Transform Data Using Apache Spark—Azure Synapse Analytics

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ on the Overview blade, click the Open link in the Open Synapse Studio tile ➢ select the Manage hub ➢ select Apache Spark Pools from the menu list ➢ review the node size family ➢ review the size ➢ select the Develop hub ➢ hover over Data Flows ➢ click the ellipsis (…) ➢ select New Data Flow ➢ enter an output stream name (I used BrainwavesReading) ➢ and then select the dataset you created earlier that retrieves data from the [dbo].[READING] table on your Azure SQL database. Review Exercise 4.13 and Figure 4.36 for further details of a similar dataset.
  2. To add a sink, click the + on the lower‐right corner of the Source module ➢ select Sink ➢ enter an output stream name (I used brainjammerTmpReading) ➢ select the + New link to the right of the Dataset drop‐down list box ➢ choose ADLS ➢ choose Parquet ➢ enter a name (I used BrainjammerBrainwavesTmpParquet) ➢ select WorkspaceDefaultStorage from the Linked Service drop‐down list box ➢ enable Interactive Authoring, if not already enabled ➢ select the folder icon to the right of the File Path text boxes ➢ and then navigate to the location where you want to store the Parquet file, for example:
    EMEA\brainjammer\in\2022\04\28\16
  3. Leave the defaults ➢ click the OK button ➢ consider renaming the data flow (for example, Ch05Ex3) ➢ click the Commit button ➢ select the Integrate hub ➢ hover over the Pipelines group ➢ click the ellipsis (…) ➢ select New Pipeline ➢ drag and drop a Data Flow activity from the Move & Transform Activities list ➢ on the General tab enter a name (I used MoveToTmpReading) ➢ select the Settings tab ➢ select the data flow you just created from the Data Flow drop‐down list box ➢ consider renaming the pipeline (I used IngestTransformBrainwaveReadingsSpark) ➢ click the Commit button ➢ click the Validate button ➢ and then click Publish.
  4. Select the Develop hub ➢ hover over Notebooks ➢ click the ellipsis (…) ➢ select New Notebook ➢ select the Spark pool from the Attach To drop‐down box ➢ and then enter the following PySpark code syntax:
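     The complete listing is referenced in the next step; the snippet below is only a minimal sketch of the kind of code that belongs here. It reads the temporary Parquet data written by the pipeline in step 3 into a DataFrame. The storage account, container, and path are placeholders (assumptions), so substitute the values you used for the sink in step 2.

     %%pyspark
     # Minimal sketch -- <container> and <account> are placeholders; replace them
     # with your ADLS Gen2 container and storage account, and adjust the path to
     # match the location you chose for the Parquet sink in step 2.
     path = 'abfss://<container>@<account>.dfs.core.windows.net/EMEA/brainjammer/in/2022/04/28/16'
     # Read the Parquet data produced by the pipeline into a DataFrame
     df = spark.read.parquet(path)
     # Quick sanity checks on the loaded readings
     print(df.count())
     df.show(5)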
  5. The complete PySpark syntax is in a file named transformApacheSpark.txt, in the directory Chapter05/Ch05Ex03, on GitHub at https://github.com/benperk/ADE. Enter the following snippet to load five reference tables required for transformation; these CSV files are from Exercise 4.13; see the directory Chapter04/Ch04Ex12 on GitHub for instructions on how to get them.
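     The exact file names and folder for these reference CSVs depend on how you completed Exercise 4.13, so treat the snippet below as a sketch: it assumes the five files are named MODE.csv, ELECTRODE.csv, FREQUENCY.csv, SCENARIO.csv, and SESSION.csv and that they sit together in a single reference folder on your ADLS Gen2 container. Adjust the path and names to match your environment; the authoritative code is in transformApacheSpark.txt.

     %%pyspark
     # Sketch only -- the folder and file names below are assumptions; point them
     # at the location where you placed the Exercise 4.13 reference CSV files.
     base = 'abfss://<container>@<account>.dfs.core.windows.net/EMEA/brainjammer/reference/'
     # Load each reference (dimension) table into its own DataFrame, using the
     # CSV header row for column names
     df_mode      = spark.read.option('header', 'true').csv(base + 'MODE.csv')
     df_electrode = spark.read.option('header', 'true').csv(base + 'ELECTRODE.csv')
     df_frequency = spark.read.option('header', 'true').csv(base + 'FREQUENCY.csv')
     df_scenario  = spark.read.option('header', 'true').csv(base + 'SCENARIO.csv')
     df_session   = spark.read.option('header', 'true').csv(base + 'SESSION.csv')
     # Expose the reference data to Spark SQL so it can be joined during the transformation
     df_mode.createOrReplaceTempView('MODE')
     df_electrode.createOrReplaceTempView('ELECTRODE')
     df_frequency.createOrReplaceTempView('FREQUENCY')
     df_scenario.createOrReplaceTempView('SCENARIO')
     df_session.createOrReplaceTempView('SESSION')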
