Storing Prepared, Trained, and Modeled Data – Data Sources and Ingestion

All data, regardless of the Big Data stage it is in, must be stored. The data not yet ingested into the pipeline is stored someplace too, but not yet on Azure. You can use Azure products like Azure Stream Analytics, Event Hubs, Apache Spark, Azure Databricks, and Azure Synapse Analytics to handle the ingestion of data onto the Azure platform. As the data is passed through any of those products, it finds its way into your data lake, which is most commonly an ADLS container.

Figure 4.42 demonstrates the ingestion of data, its storage on ADLS, and then its transformation through the different DLZs. It also shows which data roles would most likely use the data in the given DLZ state. Because an ADLS container has a globally discoverable endpoint, any product you find useful for your Big Data analytics can access the data and use it as required.

FIGURE 4.42 Storing prepared, trained, and modeled data

Summary

In this chapter you learned about the difference between physical and logical data storage structures, including concepts such as compression, partitioning, and sharding, which help to improve performance. Adhering to privacy compliance and reducing storage costs were also discussed. You loaded a lot of data into an ADLS container and ingested it into an Azure Synapse Analytics workspace. Then you manipulated the data using the Develop hub, SQL pools, and Spark pools. One of the highlights is that you created your first Azure Synapse pipeline and loaded data from an Azure SQL database into the raw DLZ of your ADLS container. Slowly changing dimension (SCD) tables, fact tables, and external tables should all be very clear in their purpose, constructs, and use cases.

The serving layer is one of the three components of the lambda architecture, the other two being the batch layer, which is discussed in Chapter 6, and the streaming layer, which is discussed in Chapter 7. You learned that the serving layer receives data from both the cold path (i.e., the batch layer) and the hot path (i.e., the streaming layer), and both are available to data consumers who have different expectations of the data's state. Finally, you read about how data can be stored and accessed across DLZs from Azure Databricks and Azure HDInsight. Because all Azure storage products have a globally identifiable endpoint, any product you use for performing data analytics can access, read, and process the data stored on it.

Exam Essentials

Physical vs. logical storage. The difference between physical and logical storage is that the former involves objects you can touch, while the latter is a virtual construct. Physical storage structures include objects like a hard disk or a tape containing a backup. Logical storage structures include extents, segments, and tablespaces, which reside on the physical disk.

Compression, partitioning, and sharding. The amount of storage consumed on Azure is what drives the cost. Therefore, the less storage you consume, the less you pay. Compression is an effective way to reduce size and cost. Common formats include ZIP and GZIP; a TAR file is an uncompressed archive that is typically compressed with GZIP to produce a .tar.gz file. Sharding refers to horizontal partitioning: a table's rows are split across multiple tables, for example by ranges of a sequential key like READING_ID or by the value of an attribute like SCENARIO. Vertical partitioning instead splits a table's columns, moving subsets of fields from one table into different tables, with each partition retaining the key so the rows can be rejoined.
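The distinction between the two partitioning styles can be sketched in a few lines of Python. This is a minimal illustration using hypothetical reading data, not code from the chapter's exercises:

```python
# Hypothetical brain-reading rows used to contrast the two partitioning styles.
readings = [
    {"READING_ID": 1, "SCENARIO": "ClassicalMusic", "VALUE": 2.3},
    {"READING_ID": 2, "SCENARIO": "Meditation",     "VALUE": 1.7},
    {"READING_ID": 3, "SCENARIO": "ClassicalMusic", "VALUE": 2.9},
]

# Horizontal partitioning (sharding): whole ROWS are distributed across
# tables -- here, one shard per SCENARIO value.
shards = {}
for row in readings:
    shards.setdefault(row["SCENARIO"], []).append(row)

# Vertical partitioning: COLUMNS are split across tables; each partition
# keeps the key (READING_ID) so rows can be rejoined later.
ids_and_scenarios = [
    {"READING_ID": r["READING_ID"], "SCENARIO": r["SCENARIO"]} for r in readings
]
ids_and_values = [
    {"READING_ID": r["READING_ID"], "VALUE": r["VALUE"]} for r in readings
]

print(sorted(shards))           # shard keys, one per SCENARIO
print(len(ids_and_values[0]))   # vertical partition keeps only key + VALUE
```

Each shard holds complete rows for a subset of the data, while each vertical partition holds a subset of the fields for every row.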

Slowly changing dimension. Slowly changing dimension (SCD) tables enable you to maintain a history of the changes to your data. As you progress from Type 1 through Types 2, 3, and 6, a greater amount of data history is retained. Type 6 SCD is a combination of all capabilities found in Types 1, 2, and 3. Temporal tables are supported on Azure SQL and SQL Server but not on a SQL pool in Azure Synapse Analytics. Temporal tables target the Type 4 SCD pattern.
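The Type 2 pattern, which preserves the fullest row-level history of the basic types, can be sketched as follows. The table layout and customer data here are hypothetical, chosen only to show the expire-and-insert mechanic:

```python
from datetime import date

# Hypothetical SCD Type 2 dimension: each row carries validity dates
# and a current-row flag.
dim_customer = [
    {"CUSTOMER_ID": 42, "CITY": "Munich", "START_DATE": date(2020, 1, 1),
     "END_DATE": None, "IS_CURRENT": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """Type 2 change: expire the current row, then append a new current row."""
    for row in dim:
        if row["CUSTOMER_ID"] == customer_id and row["IS_CURRENT"]:
            row["END_DATE"] = change_date
            row["IS_CURRENT"] = False
    dim.append({"CUSTOMER_ID": customer_id, "CITY": new_city,
                "START_DATE": change_date, "END_DATE": None,
                "IS_CURRENT": True})

apply_scd2(dim_customer, 42, "Berlin", date(2023, 6, 1))
# The dimension now holds two rows for customer 42: the expired
# Munich row and the current Berlin row, so no history is lost.
```

A Type 1 change, by contrast, would simply overwrite CITY in place, keeping no history at all.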

The serving layer and the star schema. The serving layer is where consumers retrieve data. It is fed by both the batch and streaming layers, which together make up the lambda architecture. The transformation that happens in and around the serving layer requires reference tables, which are referred to as dimension tables in the Big Data context. The star schema can be visualized as a fact table surrounded by, and holding references to, a handful of dimension tables. The table relationships radiate outward in a star pattern.
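The fact-and-dimension relationship can be illustrated with a toy example. The table names, keys, and values below are hypothetical, meant only to show how a fact row stores measures plus keys into the surrounding dimensions:

```python
# Hypothetical star schema: three small dimension tables keyed by
# surrogate keys, and a fact table that references them.
dim_date     = {1: "2023-06-01"}
dim_product  = {10: "Widget"}
dim_customer = {42: "Contoso"}

fact_sales = [
    # Each fact row holds the measure (AMOUNT) plus one key per dimension.
    {"DATE_KEY": 1, "PRODUCT_KEY": 10, "CUSTOMER_KEY": 42, "AMOUNT": 99.50},
]

# Resolving a fact row through its dimensions -- the lookups a star-join
# query performs at the serving layer.
row = fact_sales[0]
print(dim_date[row["DATE_KEY"]],
      dim_product[row["PRODUCT_KEY"]],
      dim_customer[row["CUSTOMER_KEY"]],
      row["AMOUNT"])
```

Drawn on paper, the fact table sits in the center with one line out to each dimension table, which is where the "star" name comes from.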

Metadata. Without a way to organize and structure data, it is difficult, if not impossible, to find value in it. A data file that contains a bunch of numbers or a database table that has a generic name or generic column names has little to no value. Many define metadata as data about data. Valid table names, schemas, views, file names, and file directory structures all provide some insights into what the data within them means.
