Now that you have performed some data transformation exercises, it is a good time to read about some applicable transformation and data management concepts.

Transformation

As you progressed through the exercises that transformed the...
Normalize and Denormalize Values – Transform, Manage, and Prepare Data
Normalization and denormalization can be approached in two contexts. The first context concerns the deduplication of data and query speed on database tables in a relational database. The other context has...
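The relational context can be sketched in a few lines. The following is a minimal pure-Python illustration, not code from the chapter; the table names, keys, and values are hypothetical. Normalized storage keeps each lookup value once and references it by key; denormalizing joins that lookup back into every row, trading duplicated storage for join-free reads.

```python
# Normalized: each mode name is stored exactly once in a lookup table
# and referenced by key, which removes duplication.
modes = {1: "POW", 2: "META"}
readings = [
    {"reading_id": 10, "mode_id": 1, "value": 2.254},
    {"reading_id": 11, "mode_id": 2, "value": 0.957},
]

def denormalize(readings, modes):
    """Join the lookup table into each row, duplicating the mode name
    so that queries no longer need to perform the join."""
    return [
        {"reading_id": r["reading_id"],
         "mode": modes[r["mode_id"]],
         "value": r["value"]}
        for r in readings
    ]

flat = denormalize(readings, modes)
print(flat[0])  # each row now carries the repeated mode text
```

Queries against `flat` are faster because no join is needed, at the cost of repeating the mode text on every row; updating a mode name then requires touching every duplicated copy.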
Azure Cosmos DB—Shred JSON – Transform, Manage, and Prepare Data
FIGURE 5.23 Shredding JSON with Azure Cosmos DB

The query you executed in step 4 begins with a SELECT, which is followed by the OPENROWSET that contains information about the PROVIDER, CONNECTION, and OBJECT. SELECT...
Split Data – Transform, Manage, and Prepare Data
FIGURE 5.21 Splitting the data source—Projection tab

FIGURE 5.22 Splitting the data sink—Optimize tab

In Exercise 5.6 you created a data flow that contains a source to import a large CSV file from ADLS....
Azure HDInsight – Transform, Manage, and Prepare Data
If you provision an Azure HDInsight Apache Spark cluster, it includes an interactive Jupyter notebook environment, which is accessible by URL. If your HDInsight cluster is named brainjammer, for example, the...
Cleanse Data – Transform, Manage, and Prepare Data
%%pyspark
df = spark.read \
    .load('abfss://*@*.dfs.core.windows.net/SessionCSV/BRAINWAVES_WITH_NULLS.csv',
          format='csv', header=True)

The final action to take after cleansing the data is perhaps to save it to a temporary table, using the saveAsTable(tableName) method, or into the Parquet file format....
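The null-dropping step at the heart of cleansing can be sketched without Spark. The following is a minimal pure-Python analogue of dropping rows that contain nulls (what Spark's DataFrame.dropna() does); the in-memory sample standing in for BRAINWAVES_WITH_NULLS.csv, and its column names, are hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for the CSV loaded from ADLS above;
# empty fields represent the nulls that need cleansing.
raw = """SESSION_DATETIME,READING_DATETIME,VALUE
2021-07-30 09:35:00,2021-07-30 09:35:01,2.254
2021-07-30 09:35:00,,0.957
2021-07-30 09:35:00,2021-07-30 09:35:03,
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Keep only rows where every column has a value, the analogue of dropna().
cleansed = [r for r in rows if all(v not in (None, "") for v in r.values())]

print(len(cleansed))  # only the fully populated row survives
```

In Spark the same effect comes from `df.dropna()`, after which the cleansed DataFrame can be saved via saveAsTable() or written as Parquet.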
Shred JSON – Transform, Manage, and Prepare Data
When you shred something, the object being shredded is torn into small pieces; in many respects, the resulting pieces are as small as possible. In this...
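Applied to JSON, shredding means decomposing a nested document into its smallest pieces: flat column/value pairs. A minimal pure-Python sketch of the idea follows; the sample document and its field names are illustrative, not taken from the chapter.

```python
import json

# A small nested document to be shredded into flat columns.
doc = json.loads("""
{"Session": {"Scenario": "ClassicalMusic",
             "POWReading": {"AF3": {"THETA": 44.254, "ALPHA": 5.479}}}}
""")

def shred(node, path=""):
    """Recursively tear a nested structure into (dotted-path, scalar) pairs."""
    if isinstance(node, dict):
        pairs = {}
        for key, value in node.items():
            pairs.update(shred(value, f"{path}.{key}" if path else key))
        return pairs
    return {path: node}

columns = shred(doc)
print(columns["Session.POWReading.AF3.THETA"])  # 44.254
```

Each leaf value ends up addressable by a flat dotted path, which maps naturally onto the columns of a relational table.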
Flatten, Explode, and Shred JSON – Transform, Manage, and Prepare Data
The first snippet of code imports the explode() and col() functions from the pyspark.sql.functions module. Then the JSON file is loaded into a DataFrame with an option stipulating that the file is multiline, as...
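Spark's explode() turns each element of an array-valued column into its own row, repeating the other columns alongside it. The following pure-Python analogue, using hypothetical rows rather than the chapter's data, shows the effect without requiring a Spark cluster.

```python
# Rows with an array-valued field, as a DataFrame might hold before explode().
rows = [
    {"Scenario": "ClassicalMusic", "Readings": [44.254, 5.479]},
    {"Scenario": "Meditation", "Readings": [11.343]},
]

# Analogue of pyspark.sql.functions.explode(): one output row per array
# element, with the scalar columns duplicated onto each new row.
exploded = [
    {"Scenario": row["Scenario"], "Reading": value}
    for row in rows
    for value in row["Readings"]
]

print(len(exploded))  # 3 rows: two from the first source row, one from the second
```

In Spark the equivalent is `df.select(col("Scenario"), explode(col("Readings")).alias("Reading"))`.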
Storing Prepared, Trained, and Modeled Data – Data Sources and Ingestion
All data, regardless of the Big Data stage it is in, must be stored. Data not yet ingested into the pipeline is stored someplace as well, just not yet on Azure. You can use...
Perform Exploratory Data Analysis – Transform, Manage, and Prepare Data
The previous queries are in the preliminaryEDA.sql file in the Chapter05/Ch05Ex11 folder, on GitHub at https://github.com/benperk/ADE.

FIGURE 5.31...