Transformation and Data Management Concepts – Transform, Manage, and Prepare Data

Now that you have performed some data transformation exercises, it is a good time to read about some applicable transformation and data management concepts.

Transformation

As you progressed through the exercises that transformed the brainjammer brain wave data, the first action you took was to learn something about the data. This phase is often called data discovery. To transform data, you need to know its structure and its characteristics, such as data types and relationships. Once you have discovered the shape and purpose of your data, you can begin a phase called data mapping, in which you consider the current state of the data and decide what you want the end state to be. You need to identify each piece of data and map it to its final state, which includes identifying the activities, such as aggregation, filtering, joining, or modifying, that will transform the data. Next, you create the code that performs the transformation. Languages such as PySpark, Scala, C#, and T-SQL are commonly used in a Big Data context. Executing the code against the raw data and reviewing the outcome are the final two steps. Figure 5.37 is a visual representation of the data transformation process.

FIGURE 5.37 The data transformation process
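The phases described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the column names (scenario, frequency, value) and the sample rows are hypothetical and do not reflect the actual brainjammer schema, and in practice the same steps would more likely be written in PySpark or T-SQL against the real data.

```python
from collections import defaultdict

# Hypothetical raw readings; column names and values are illustrative
# assumptions, not the actual brainjammer schema.
raw_rows = [
    {"scenario": "Meditation", "frequency": "ALPHA", "value": "4.0"},
    {"scenario": "Meditation", "frequency": "ALPHA", "value": "5.0"},
    {"scenario": "Meditation", "frequency": "BETA", "value": "2.0"},
    {"scenario": "TikTok", "frequency": "ALPHA", "value": None},
    {"scenario": "TikTok", "frequency": "BETA", "value": "3.0"},
]


def discover(rows):
    """Data discovery: report the Python types observed in each column."""
    profile = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            profile[column].add(type(value).__name__)
    return {column: sorted(types) for column, types in profile.items()}


def transform(rows):
    """Apply the mapping: filter missing values, cast, then aggregate."""
    grouped = defaultdict(list)
    for row in rows:
        if row["value"] is None:  # filtering
            continue
        key = (row["scenario"], row["frequency"])
        grouped[key].append(float(row["value"]))  # modifying (type cast)
    # Aggregation: mean value per (scenario, frequency) pair.
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


def review(result):
    """Data review: a basic sanity check on the transformed output."""
    return all(value >= 0 for value in result.values())


profile = discover(raw_rows)   # discovery
result = transform(raw_rows)   # mapping decisions expressed as code, then executed
is_valid = review(result)      # review
```

Here the discovery step reveals that the value column holds both strings and missing entries, which is what motivates the filter and the cast in the mapping. If the review fails, you would adjust the mapping and run the process again.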

The data review phase does not necessarily mean that the data transformation process is complete. There can be numerous iterations of this process. As you experienced while performing the exercises in this chapter, one transformation converted data stored in an Azure SQL database to a Parquet file. Another transformation referenced dimension tables to create a version of all available brainjammer brain wave readings that was easier to use. A common phase that follows a data review is data enrichment.

Enrichment

Each time you iterate through the data transformation process, you expect the data to get better. That improvement is called enrichment. Removing null values, normalizing the data, and removing data outliers are all types of data enrichment that improve usability, accuracy, and understanding. In Exercise 5.13, you will perform data enrichment to aggregate and normalize data using Azure Synapse pipeline activities and data flow transformations. The completed Azure Synapse pipeline will look something like Figure 5.38.

FIGURE 5.38 Transforming and enriching the data pipeline
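As a rough illustration of the three enrichment types named above, the following plain-Python sketch removes null values, drops outliers, and then normalizes what remains. The function name, the outlier rule (values more than three median absolute deviations from the median), and the choice of min-max normalization are all assumptions made for this example. They are not the method used in Exercise 5.13, which relies on Azure Synapse pipeline activities and data flow transformations.

```python
from statistics import median


def enrich(values, max_deviations=3.0):
    """Drop nulls, drop outliers, then min-max normalize to the range [0, 1].

    The outlier rule is an assumption for illustration: a value is kept when
    it lies within max_deviations median absolute deviations of the median.
    """
    # Remove null values.
    present = [v for v in values if v is not None]
    if not present:
        return []

    # Remove outliers using the median absolute deviation (MAD).
    mid = median(present)
    mad = median(abs(v - mid) for v in present)
    if mad > 0:
        limit = max_deviations * mad
        present = [v for v in present if abs(v - mid) <= limit]

    # Normalize with min-max scaling.
    low, high = min(present), max(present)
    if high == low:
        return [0.0 for _ in present]
    return [(v - low) / (high - low) for v in present]


readings = [1.0, 2.0, None, 3.0, 4.0, 5.0, 100.0]  # hypothetical values
normalized = enrich(readings)
```

With these sample readings, the missing entry and the value 100.0 are dropped before scaling, so the extreme value no longer compresses the remaining readings toward zero. The order of the steps matters for that reason: normalizing before removing outliers would let a single extreme reading dominate the scale.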

Raymond Gallardo
