In Exercise 5.4 you used the Scala language to perform data transformation. You received a data file in Parquet format, transformed it into a more queryable form, and stored it in a delta lake. At first glance, you might find it difficult to differentiate Scala from PySpark. Consider the following example; the first line is Scala, and the second is PySpark:
var df = spark.read.parquet("/FileStore/business-data/2022/04/30")
df = spark.read.parquet('/FileStore/business-data/2022/04/30')
The only visible difference is in the declaration of the df variable. The significant differences between the two are realized only when you begin programming and coding large enterprise applications. Scala, whose name derives from "scalable language," is an object-oriented programming language that runs in a Java virtual machine (JVM). The JVM is a runtime host responsible for executing code, and running on it gives Scala interoperability with Java code and libraries. PySpark is an API that provides an entry point into Apache Spark using Python. PySpark is considered a tool, not a programming language; it is useful for performing data science activities, but it is not necessarily the preferred choice for creating high-scale applications. Consider the following Scala code.
case class Mode(id: Int, name: String)
To achieve a similar outcome using PySpark, you would use something like the following code snippet. This example uses the Row class from Spark SQL, which requires the inclusion of an import statement.
from pyspark.sql import Row
Mode = Row('id', 'name')  # a Row-based stand-in for the Scala case class
It is not difficult to see the difference, and most of what you can do using Scala can also be done using PySpark. If you are most comfortable with the Python language, then it makes sense for you to use PySpark, as its syntax and methodology are like those found in Python. However, Scala provides some benefits that do not exist in PySpark. For example, it is type-safe, it reports type errors at compile time, and it is well supported by an IDE named IntelliJ. Notice in the previous Scala code snippet that when you define the class, you must also define the data type of each value, which is not required in the PySpark example. This can avoid unexpected exceptions during the execution of your code. If your code expects an integer and receives a string, there will be an exception; however, if your class does not allow the string to be entered into the class at all, then the runtime exception is avoided. This leads to the second benefit: compile-time checking. Because the id in the Scala code snippet is typed as an Int, a subsequent line of code that attempted to load a string into that field would cause a compile-time error. The code would not compile, and therefore would not run, until the value being loaded matched the declared type, which prevents the exception from ever occurring at runtime. Finally, there is IntelliJ, an IDE targeted toward Java developers. IntelliJ is a very useful software tool for coding and troubleshooting Java code, which can then be converted into Scala.
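As a minimal sketch of the type safety and compile-time checking just described, consider the following snippet, which reuses the Mode case class from earlier (the values are hypothetical):

case class Mode(id: Int, name: String)

val ok = Mode(1, "work")          // compiles: argument types match the declaration
// val bad = Mode("one", "work")  // would fail to compile: found String, required Int

Uncommenting the second assignment causes the compiler to reject the program before it ever runs, which is precisely the protection described above.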
Perform Exploratory Data Analysis
Exploratory data analysis (EDA) is a process of performing preliminary investigation on data. Ideally, when performing EDA, you would detect patterns or discover anomalies that produce previously unknown insights into the data source, such as the most common day and hour that Microsoft stock increases, or that the range of THETA frequency brain wave reading values is very high in the PlayingGuitar scenario versus other scenarios. Additionally, EDA is a useful activity for checking assumptions and testing hypotheses, for example, that BETA frequency brain wave reading values should not be low in a work scenario. A low BETA reading is linked to daydreaming, which isn't something you should be doing while working.
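A minimal sketch of how such a hypothesis check might look with the Spark DataFrame API follows; the file path, the SCENARIO, FREQUENCY, and VALUE column names, and the WorkMeeting scenario label are all assumptions for illustration, and spark refers to the session a notebook provides:

import org.apache.spark.sql.functions._

// Load the readings; this path is hypothetical.
val df = spark.read.parquet("/FileStore/business-data/brainwaves")

// Hypothesis: the average BETA reading in a work scenario should not be low.
df.filter(col("SCENARIO") === "WorkMeeting" && col("FREQUENCY") === "BETA")
  .agg(avg("VALUE").alias("avgBetaWhileWorking"))
  .show()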
Before proceeding with the EDA exercise in Exercise 5.11, consider the following statistical summary, which provides some insights into the data you are about to analyze. The statisticalSummary.txt file in the Chapter05 directory on GitHub contains the query that renders this output. The data being analyzed is the brainjammer brain wave reading value, grouped by the scenario in which the value was generated.
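The actual query is the one in statisticalSummary.txt; as a rough Spark Scala equivalent, assuming Spark 3.1 or later for percentile_approx and the same hypothetical df and columns as above, a comparable summary could be computed like this:

import org.apache.spark.sql.functions._

// Per-scenario MIN, quartiles, MAX, and MEAN of the reading values
df.groupBy("SCENARIO")
  .agg(
    min("VALUE").alias("MIN"),
    percentile_approx(col("VALUE"), lit(0.25), lit(10000)).alias("25%"),
    percentile_approx(col("VALUE"), lit(0.50), lit(10000)).alias("50%"),
    percentile_approx(col("VALUE"), lit(0.75), lit(10000)).alias("75%"),
    max("VALUE").alias("MAX"),
    avg("VALUE").alias("MEAN"))
  .show()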
There are two important findings to call out in the summarized statistical information. The first is the great distance between the MIN and MAX brainjammer brain wave reading values, which has a big impact on the MEAN value. The second is that there are major gaps between the MEAN, which is the average, and the three quartile distributions (25%, 50%, and 75%). This means there are some extreme values, aka outliers, in the dataset. None of this is expected or desired; therefore, some action needs to be taken to ignore or remove these values from the dataset used for EDA. To avoid this skewing of the data, approximately 2 percent of the data on either end will be removed in steps 2–4 of Exercise 5.11. Note that you must complete Exercise 5.1 before you can complete Exercise 5.11; the brain wave data that exists in the [brainwaves].[FactREADING] table is required.
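As a hedged sketch of that trimming step (not the exercise's own code), continuing with the hypothetical df and VALUE column from above, the 2 percent cutoffs could be found and applied like this:

import org.apache.spark.sql.functions.col

// Compute the 2nd and 98th percentile cutoffs (a relative error of 0.0
// requests exact quantiles), then keep only the values between them.
val Array(lower, upper) = df.stat.approxQuantile("VALUE", Array(0.02, 0.98), 0.0)
val trimmed = df.filter(col("VALUE") >= lower && col("VALUE") <= upper)

With the extreme outliers on both ends excluded, the MEAN and the quartile values move much closer together, which is the behavior the exercise is after.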