Perform Exploratory Data Analysis– Transform, Manage, and Prepare Data

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ on the Overview blade, click the Open link in the Open Synapse Studio tile ➢ navigate to the Manage hub ➢ start your dedicated SQL pool ➢ navigate to the Develop hub ➢ create a new SQL script ➢ and then execute the following SQL statement:
    SELECT COUNT(*) FROM brainwaves.FactREADING
  2. Execute the following SQL syntax:
    SELECT COUNT(*)
    FROM brainwaves.FactREADING
    WHERE [VALUE] < 0.274 OR [VALUE]> 392
  3. Execute the following SQL syntax:
    SELECT * INTO brainwaves.FactREADINGHighLow
    FROM brainwaves.FactREADING
    WHERE [VALUE] < 0.274 OR [VALUE]> 392
    SELECT COUNT(*) FROM brainwaves.FactREADINGHighLow
  4. Execute the following SQL syntax:
    DELETE FROM brainwaves.FactREADING WHERE [VALUE] < 0.274 DELETE FROM brainwaves.FactREADING WHERE [VALUE]> 392
    SELECT COUNT(*) FROM brainwaves.FactREADING
  5. Execute the SQL syntax to calculate the statistical summary again, which renders results similar to the following:

The previous queries are in the preliminaryEDA.sql file in the Chapter05/Ch05Ex11 folder, on GitHub at https://github.com/benperk/ADE.

  1. Execute the following SQL syntax. Consider adding a WHERE clause to the SELECT statement that projects only values for SCENARIO = ‘TikTok’. The results are shown and match those in the “Normalization” section.
    CREATE TABLE [brainwaves].[SCENARIO_FREQUENCY]

The previous queries are in the preliminaryEDA.sql file in the Chapter05/Ch05Ex11 folder, on GitHub at https://github.com/benperk/ADE.

  1. Select the Export Results drop‐down menu item on the Results tab, and then select CSV, as shown in Figure 5.31. Note where the file is stored, as you will use this in Exercise 5.12.

FIGURE 5.31 Performing exploratory data analysis—visualizing data in Power BI

  1. Consider renaming your SQL script (for example, Ch05Ex11), and then click the Commit button to save the notebook to your source code repository.

As you look over the output of the EDA data you created in step 6, it is hard to gain insights from that. This is why tools like Power BI and other graphing tools exist: to illustrate those numbers in a more consumable format. This is done in Exercise 5.12, but first a few points need to be discussed concerning Exercise 5.11. The approach used for the removal of outlying brain waves reading values was to remove 1 percent of the values from the bottom and top ranges, which is about 50,000 rows each. Through trial and error, it was determined that there were about 50,000 rows below a brain wave reading value of 0.274 and above 392. These constraints were then used to form the following SQL query and corresponding DELETE statements:
SELECT COUNT(*)

FROM brainwaves.FactREADING
WHERE [VALUE] < 0.274 OR [VALUE]> 392
DELETE FROM brainwaves.FactREADING WHERE [VALUE] < 0.274 DELETE FROM brainwaves.FactREADING WHERE [VALUE]> 392

Once the outliers were removed, the query to calculate the statical summary was re‐executed. The resulting brain reading values for MEAN, MIN, 25%, 50%, 75%, and MAX fell into a more expected range. For ease of reference, both tables are provided here again for side‐by‐side review.

The query used to perform EDA on the brain wave reading values was simply a step into the unknown. Perhaps this is why it is considered exploratory. The word exploratory itself implies some kind of an adventure into the unknown, where new experiences, discoveries, and insights can be realized. From this point there is no textbook path to finding or discovering something; it is now based on your creativity, training, and experience in data analysis and with the context in which the data has been collected. One approach taken for further analysis here is based on the following query, which was executed in step 6 of Exercise 5.11:
SELECT * FROM [brainwaves].[SCENARIO_FREQUENCY] ORDER BY SCENARIO, FREQUENCY

As you learned at the end of Chapter 2, each brain frequency is linked to a trait. Here is the list again:
• ALPHA = Relaxation
• BETA_H = Concentration
• BETA_L = Problem‐solving
• GAMMA = Learning
• THETA = Creativity
A hypothesis you might consider is that ALPHA values should be high while meditating and GAMMA values should be low while watching TikTok. Can you determine this when looking at the output of the SELECT statement from step 6?

To create a visual representation of this data, complete Exercise 5.12. Once completed, use the visualization to determine whether the aforementioned hypothesis is true or false.

Raymond Gallardo

Learn More →

Leave a Reply

Your email address will not be published. Required fields are marked *