- Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ on the Overview blade, click the Open link in the Open Synapse Studio tile ➢ select the Develop hub ➢ hover over the Notebook folder ➢ click the ellipsis (…) ➢ select New Notebook ➢ select your Apache Spark pool from the Attach To drop-down list box ➢ download the BRAINWAVES_WITH_NULLS.csv file from the Chapter05/Ch05Ex0C directory ➢ upload the file into your data lake (aka the ADLS container you created in Exercise 3.1) ➢ replace the * placeholders with your container and ADLS endpoint URL ➢ and then execute the following PySpark syntax to load the data file into a DataFrame and identify rows with null values:
%%pyspark
df = spark.read \
    .load('abfss://*@*.dfs.core.windows.net/SessionCSV/BRAINWAVES_WITH_NULLS.csv',
          format='csv', header=True)
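The step also calls for identifying the rows that contain null values. The exercise's own null-finding code is not reproduced here, but a minimal sketch, which simply displays every row in which any column is null, could look like the following:
from functools import reduce
from pyspark.sql.functions import col

# Build a condition that is true when any column in the row is null,
# then display the matching rows (a sketch, not the exercise's exact code)
null_rows = df.filter(reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns]))
null_rows.show()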
- Execute the following syntax, and then check the total number of rows and columns that are loaded into the DataFrame:
print((df.count(), len(df.columns)))
(35, 8)
- Remove the rows that contain columns with null values ➢ execute the following code snippet ➢ and then check the number of rows and columns again:
df = df.na.drop("any")
print((df.count(), len(df.columns)))
df.show()
- Remove the data rows with corrupt data by executing the following code snippet:
df = df.filter(df.VALUE != '2X.n66') \
    .filter(df.VALUE != '4.c1..') \
    .filter(df.READING_DATETIME != '7/40/2021 12:41:00 PM')
print((df.count(), len(df.columns)))
(21, 6)
- Remove the SESSION_DATETIME and READING_DATETIME columns, as follows:
df = df.drop(df.SESSION_DATETIME, df.READING_DATETIME)
print((df.count(), len(df.columns)))
(21, 4)
- Rename the notebook to a friendly name (for example, Ch05Ex05), and then commit the code to your GitHub repository. The source code is located on GitHub at https://github.com/benperk/ADE, in the Chapter05/Ch05Ex05 directory.
The final action to take after cleansing the data is to save it to a table, using the saveAsTable(tableName) method, or to write it out in the Parquet file format. When you are ready to perform your analytics on this data, you know the data is clean, with no compromises, which means the conclusions drawn from the data will be valid.
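A minimal sketch of both options follows, using the same * placeholders as the exercise for the container and ADLS endpoint; the table name cleansed_brainwaves and the output folder are illustrative, not values from the exercise:
# Persist the cleansed DataFrame as a table in the Spark metastore...
df.write.mode("overwrite").saveAsTable("cleansed_brainwaves")

# ...or write it back to the data lake in Parquet format
df.write.mode("overwrite") \
    .parquet("abfss://*@*.dfs.core.windows.net/SessionCSV/cleansed")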
When a column that contains no data is loaded into a DataFrame, the result is null. To find those occurrences, the first action was to search the content loaded into the DataFrame for nulls and display the matching rows in the output window. The rows in which any column contained a null were then filtered out, which resulted in the removal of nine rows. The drop() method was passed a parameter of any, which means that a row is dropped when any of its columns contains a null value. There is also an all parameter, which removes a row only when all columns in that row contain null.
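For comparison, a one-line sketch of the all variant, which keeps a row unless every one of its columns is null:
# Drop a row only when all of its columns contain null
df_all_dropped = df.na.drop("all")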
The next action was to remove some corrupt data. There are certainly more sophisticated techniques for validating and removing corrupt or missing data, but this example shows one approach and makes the point that you need to know the data in order to determine whether or not it is correct. For example, if you know that the value in VALUE must be a decimal, you could attempt a try_cast() conversion on each value; if the conversion fails, you would then filter that row out. The same goes for any timestamp: attempt to cast it, and if the cast fails, filter the row out. That might not be feasible for massive amounts of data, but in a scenario like this one, figuring out how to implement something like that may be worthwhile.
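In PySpark, one way to approximate that cast-and-check approach is to cast the column and treat rows where the cast comes back null as corrupt. The following is a minimal sketch under that assumption, checking only the VALUE column; a similar check could be applied to READING_DATETIME with to_timestamp() and the date format your data actually uses:
from pyspark.sql.functions import col

# A value that cannot be cast to a double comes back as null, so keeping only
# the rows where the cast is not null removes rows with a corrupt VALUE field.
df_validated = df.filter(col("VALUE").cast("double").isNotNull())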
Finally, the last action was to remove two columns from the DataFrame. It is likely that you would use the same data in different formats to gain numerous insights. Some of those transformations might need all the data from the data source, such as a CSV file, while other analyses might not need every column. In that case, it makes sense to remove unnecessary data from the final dataset used for gathering business insights. Again, you can use the drop() method, but this time passing the column names as the parameters results in those columns being removed from the DataFrame.
Split Data
If you have not yet completed Exercise 5.2, consider doing so, as you will need the 391 MB CSV file created there in order to perform Exercise 5.6. The file in the Chapter04/Ch04Ex01 directory on GitHub is in the same format as the one created in Exercise 5.2, but it contains much less data. When you have a large file that needs to be loaded into a table on a dedicated SQL pool, it is more efficient to split the file into multiple files. The most efficient number of files is based on the selected performance level of your dedicated SQL pool. Table 5.1 summarizes the recommended splits based on performance level, which is measured in data warehouse units (DWUs). A DWU is a combination of CPU, memory, and I/O resources allocated to a machine that processes your data queries and movements.
Notice in Table 5.1 the presence of the law of 60, where the number of files divided by the maximum number of nodes for the given DWU size is 60. Complete Exercise 5.6, where you will split the single 391 MB file into 60 separate files; a brief sketch of one way to perform the split follows Table 5.1.
TABLE 5.1 Data file split recommendation
DWU | Number of Files | Maximum Nodes
--- | --- | ---
DW100c | 60 | 1
DW300c | 60 | 1
DW500c | 60 | 1
DW1000c | 120 | 2
DW1500c | 180 | 3
DW2500c | 300 | 5
DW5000c | 600 | 10
DW7500c | 900 | 15
DW10000c | 1,200 | 20
DW15000c | 1,800 | 30
DW30000c | 3,600 | 60
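As mentioned before Table 5.1, here is a minimal sketch of one way to perform the split in a PySpark notebook, assuming the 391 MB CSV from Exercise 5.2 has been uploaded to your container; the file and folder names are placeholders, and the * placeholders follow the same convention as earlier in the chapter:
%%pyspark
# Load the single large CSV file, repartition it into 60 partitions, and write it
# back out; each partition is written as its own file, producing 60 smaller files.
df = spark.read \
    .load('abfss://*@*.dfs.core.windows.net/SessionCSV/brainwaves.csv',
          format='csv', header=True)
df.repartition(60) \
    .write.mode('overwrite') \
    .csv('abfss://*@*.dfs.core.windows.net/SessionCSV/split', header=True)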