Storing Data Using Azure HDInsight – Data Sources and Ingestion

As with most other Azure Big Data analytics products, an Azure Storage account is provisioned along with the compute nodes and platform of an Azure HDInsight cluster. When you provision the cluster, the selected Azure storage account contains the default filesystem. The storage account defaults to LRS, is intended for storing job history and logs, does not support the Premium tier (only up to Standard), and is not recommended for storing business data. You therefore need another location for business data, and the recommended solution is to create a separate container or storage account. There are three options for storing and accessing data from an Azure HDInsight cluster (see Figure 4.41).

Data stored in the Azure Storage account that is created when the Azure HDInsight cluster is provisioned is accessible from both the head and worker nodes. When you connect to a node over SSH, the default filesystem, which is a container in the default storage account, is accessible using the following URI:

hdfs://<nodename>/<path>

Any additional container that exists within the default storage account is accessible using the following syntax. The same syntax is used for accessing publicly accessible storage endpoints, that is, Azure Storage endpoints that allow anonymous listing of containers and reading of the blobs contained within them. This scenario is reflected in Figure 4.41 as “unlinked.” The first snippet is for accessing blob data, and the second is for accessing an account that supports a hierarchical namespace, such as ADLS.

wasbs://<container>@<account>.blob.core.windows.net/<path>
abfs://<container>@<account>.dfs.core.windows.net/<path>
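
The two URI forms above can be assembled programmatically. The following is a minimal sketch in Python, not part of the HDInsight tooling: the function names (wasbs_uri, abfs_uri) and the container and account names (raw, examplestore) are hypothetical placeholders chosen for illustration.

```python
def _storage_uri(scheme: str, endpoint: str, container: str, account: str, path: str) -> str:
    # Follows the pattern <scheme>://<container>@<account>.<endpoint>/<path>.
    # A leading slash on the path is dropped so the result has exactly one.
    return f"{scheme}://{container}@{account}.{endpoint}/{path.lstrip('/')}"


def wasbs_uri(container: str, account: str, path: str = "") -> str:
    """Build a wasbs:// URI for blob data (hypothetical helper)."""
    return _storage_uri("wasbs", "blob.core.windows.net", container, account, path)


def abfs_uri(container: str, account: str, path: str = "") -> str:
    """Build an abfs:// URI for a hierarchical-namespace account (hypothetical helper)."""
    return _storage_uri("abfs", "dfs.core.windows.net", container, account, path)


# "raw" and "examplestore" are placeholder names, not real resources.
print(wasbs_uri("raw", "examplestore", "logs/app.log"))
print(abfs_uri("raw", "examplestore", "/logs/app.log"))
```

Keeping the endpoint suffix in one place makes it harder to mix up the blob and dfs endpoints when the same container is addressed both ways.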

FIGURE 4.41 Storing data using Azure HDInsight

It is possible to access an additional Azure storage account secured by a SAS key or managed identities (MI), although the details of how to configure this aspect of an Azure HDInsight cluster are outside the scope of this book. The configuration is, however, very similar to that required for an Azure Databricks cluster: you add a key‐value pair to the configuration of the Azure HDInsight cluster, something like the following, which uses a SAS key:

fs.azure.sas.<container>.<account>.blob.core.windows.net <SAS-key>
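
As a rough illustration of the key‐value pair above, the following Python sketch builds the property name from a container and account and renders it in the generic Hadoop property XML format. The function names, the placeholder names (raw, examplestore), and the dummy SAS value are all assumptions for illustration; where and how the pair is applied to a real cluster is not covered here.

```python
from typing import Tuple
from xml.sax.saxutils import escape


def sas_property(container: str, account: str, sas_token: str) -> Tuple[str, str]:
    """Return the (name, value) pair for a container-level SAS key (hypothetical helper)."""
    name = f"fs.azure.sas.{container}.{account}.blob.core.windows.net"
    return name, sas_token


def as_hadoop_property_xml(name: str, value: str) -> str:
    """Render a name/value pair as a Hadoop-style <property> element."""
    # SAS tokens contain '&', which must be escaped inside XML.
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )


# Placeholder values only; never hard-code a real SAS token in source files.
name, value = sas_property("raw", "examplestore", "sv=PLACEHOLDER&sig=PLACEHOLDER")
print(name)
print(as_hadoop_property_xml(name, value))
```

Because a SAS token grants access to the container, treat the value as a secret and load it from a secure store rather than from code or version control.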

There are many options for storing data securely, including redundancy and aligning the storage structure with DLZ staging principles. Keep the storage accounts in the same region as the cluster to reduce costs. The Azure storage account can also be modified to use GRS or RA‐GRS replication, but note the associated costs of this replication and make sure your business requirements justify the potentially significant expense.

Raymond Gallardo
