In a recent Doppler article, “Big Data on Microsoft Azure: from Insights to Action”, we discussed how batch movement and real-time movement pipelines can be run independently or in tandem, giving an organization the ability to generate insights from multiple data paths. In this article, we discuss the steps involved and the opportunities to be leveraged from an Azure data environment.
The Batch Process
The data deluge. Data comes at us from every direction, fast and furiously! Websites, mobile apps and Internet of Things (IoT)-enabled devices such as smartphones send a constant stream of new inputs. Data collections include internal structured and unstructured data as well as external data from outside the boundaries of the organization. The ingestion step may require a transformation to refine the data, using extract-transform-load (ETL) techniques and tools, or direct ingestion of structured data from relational database management systems (RDBMS) using tools like Sqoop.
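The ETL pattern mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration of the extract-transform-load steps, not any particular tool's API; the record fields and cleaning rules are hypothetical.

```python
# Minimal extract-transform-load (ETL) sketch in plain Python.
# The record fields and cleaning rules here are hypothetical examples.

def extract(rows):
    """Extract: yield raw records from any iterable source (file, API, queue)."""
    for row in rows:
        yield row

def transform(record):
    """Transform: normalize field names and trim whitespace before loading."""
    return {key.strip().lower(): value.strip() for key, value in record.items()}

def load(records, target):
    """Load: append refined records to a target store (a list stands in here)."""
    for record in records:
        target.append(record)

raw = [{" Name ": " Ada ", " City ": " London "}]
refined = []
load((transform(r) for r in extract(raw)), refined)
print(refined)  # [{'name': 'Ada', 'city': 'London'}]
```

In a real pipeline the extract source would be a feed or database and the load target a lake or warehouse, but the three-step shape stays the same.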
Your answer is only as good as your data. The veracity of the data determines the correctness of the insights derived from it. The discoverability and organization of the data in an enterprise are the precursors to having a sustainable enterprise data architecture. Data quality and data registration (e.g., for identifying sensitive PCI data in the payments sector) can be powered by enterprise grade technologies such as Informatica’s Enterprise Data Catalog (EDC) and Big Data Management (BDM) and Big Data Quality (BDQ), or the Azure Data Catalog.
Data lake – swamp or swimming pool? Collecting raw and refined data sets and storing them in a single location creates the need for a data lake. Azure Data Lake Storage (ADLS) provides unlimited storage for structured and unstructured data, and is built to support massive read throughput for querying, analyzing and storing large analytic loads. ADLS is compatible with the Hadoop Distributed File System (HDFS) APIs, so MapReduce, Hive and other analytic frameworks can interface with it directly for processing and storing analytic data. Data organization, and the processes around data access, refinement and archiving, will determine the return on investment (ROI) of the data lake.
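One common way to keep a lake from becoming a swamp is a disciplined zone layout. The sketch below shows one hypothetical path convention for raw, refined and archive zones; the zone names and path scheme are illustrative, not an ADLS API.

```python
# Hypothetical zone layout for a data lake; the zone names and path scheme
# are illustrative conventions, not an ADLS API.
from datetime import date

ZONES = ("raw", "refined", "archive")

def lake_path(zone, domain, dataset, day):
    """Build a conventional folder path, e.g. /refined/sales/orders/2018/06/01."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"/{zone}/{domain}/{dataset}/{day:%Y/%m/%d}"

print(lake_path("refined", "sales", "orders", date(2018, 6, 1)))
# /refined/sales/orders/2018/06/01
```

Partitioning by date like this also makes archiving a matter of moving whole folders between zones.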
So much processing on big data! Workload processing on large data sets uses HDInsight, an Azure PaaS for the distributed processing of big data using the Hortonworks Data Platform (HDP). The processing frameworks supported in an HDInsight deployment are Hadoop, Spark, Storm, HBase and Interactive Query.
These clusters come by default with other open source components, such as Ambari (management and monitoring), Hive (SQL interfaces on large data sets), MapReduce (parallelized processing of large data sets in a two-step programming framework), Pig (language that makes it easy to parallelize processing of large data sets), Sqoop (transfer tool between databases and Hadoop), Tez (supports a complex directed acyclic graph of tasks), Oozie (workflow scheduler for Hadoop jobs) and ZooKeeper (coordination service for distributed systems such as Hadoop and Kafka).
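MapReduce's two-step model is easiest to see in a toy word count. The sketch below runs the map and reduce phases in a single process; on a real cluster, each phase would be distributed across nodes, with a shuffle in between.

```python
# A toy word count illustrating MapReduce's two-step model in plain Python;
# a real cluster would distribute the map and reduce phases across nodes.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: after shuffling (sorting) by key, sum the counts per word."""
    shuffled = sorted(pairs, key=itemgetter(0))
    return {word: sum(c for _, c in group)
            for word, group in groupby(shuffled, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data", "big insights"]))
print(counts)  # {'big': 2, 'data': 1, 'insights': 1}
```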
Edge nodes are usually Linux virtual machines with client tools installed and configured, providing a place to access the cluster and to test and host client applications. They are usually deployed separately to the HDInsight cluster, allowing a maintenance lifecycle independent of the cluster itself.
Data on data – call it metadata. Business users need information about key data attributes, such as a customer ID: where it came from (data provenance) and how it was modified (data lineage). Technical metadata tracks the processes that modify the key attributes. Together, these business and technical views allow a 360-degree perspective on the key data attributes of the enterprise, and are stored in and searchable from a SQL metadata store.
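A metadata record for one such attribute might look like the sketch below. The field names are hypothetical, and a real metadata store would be a SQL database rather than an in-memory object.

```python
# Sketch of a business/technical metadata record for a key attribute;
# field names are hypothetical, and a real store would be a SQL database.
from dataclasses import dataclass, field

@dataclass
class AttributeMetadata:
    name: str         # business name of the attribute, e.g. "customer_id"
    provenance: str   # where the data came from
    lineage: list = field(default_factory=list)  # processes that modified it

    def record_step(self, process):
        """Technical metadata: append each process that touches the attribute."""
        self.lineage.append(process)

meta = AttributeMetadata(name="customer_id", provenance="crm_export")
meta.record_step("dedupe_job")
meta.record_step("mask_pii")
print(meta.lineage)  # ['dedupe_job', 'mask_pii']
```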
Fast and faster queries. A Hive-LLAP cluster type can support Hive services for fast Hive queries. LLAP caches data in its long-running containers, so subsequent queries are served from memory rather than from disk. Hive-LLAP, as a cluster type separate from the main HDInsight cluster, allows data to be optimized for the fast queries that support analytic workbenches. Both clusters connect to the same metastore and storage, allowing metadata to be shared.
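The warm-versus-cold behavior can be illustrated with a toy result cache. Note this is only an analogy: real LLAP caches column data inside long-running daemons rather than whole query results.

```python
# Toy in-memory result cache analogous to what LLAP does for Hive queries;
# real LLAP caches column data in long-running daemons, not whole results.

class QueryCache:
    def __init__(self, execute):
        self.execute = execute  # function that actually runs a query
        self.cache = {}
        self.hits = 0

    def run(self, query):
        if query in self.cache:       # warm path: served from memory
            self.hits += 1
            return self.cache[query]
        result = self.execute(query)  # cold path: read from disk/storage
        self.cache[query] = result
        return result

cache = QueryCache(execute=lambda q: f"rows for {q!r}")
cache.run("SELECT count(*) FROM sales")
cache.run("SELECT count(*) FROM sales")
print(cache.hits)  # 1
```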
Data warehouse – the reliable workhorse. Data warehousing is a mature technology with wide adoption, and its support for structured and unstructured data in the cloud has led to cloud-native data warehousing technologies. Azure Cosmos DB is globally distributed, scales throughput and storage independently, and supports various commonly used APIs – among them SQL, MongoDB, Cassandra, Tables and Gremlin. Data is stored at different levels of availability: a hot storage path – e.g., Cosmos DB – and a cold storage path using SQL Data Warehouse.
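The hot/cold split can be sketched as routing records by age. The 30-day cutoff and the store names here are assumptions for illustration, not a Cosmos DB or SQL Data Warehouse API.

```python
# Illustrative hot/cold routing by record age; the 30-day cutoff and the
# store names are assumptions, not a Cosmos DB or SQL Data Warehouse API.
from datetime import date, timedelta

HOT_WINDOW = timedelta(days=30)

def route(record, today, hot_store, cold_store):
    """Send recent records to the hot path, older ones to the cold path."""
    if today - record["event_date"] <= HOT_WINDOW:
        hot_store.append(record)   # e.g. Cosmos DB (low-latency reads)
    else:
        cold_store.append(record)  # e.g. SQL Data Warehouse (cheap at scale)

hot, cold = [], []
today = date(2018, 7, 1)
route({"id": 1, "event_date": date(2018, 6, 25)}, today, hot, cold)
route({"id": 2, "event_date": date(2017, 1, 5)}, today, hot, cold)
print(len(hot), len(cold))  # 1 1
```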
Data views. You can create easy-to-understand views of data from different data sources by modeling it for business users in the semantic layer of Azure Analysis Services.
Data –> Insights. Power BI in Azure is the last mile of the journey from data to insights and actions. Embedding visualizations and advanced analytics into organizational applications and portals provides better return on investment in the big data ecosystem.
The Real-Time, or IoT, Process
Data also moves in bursts from new inputs, such as IoT-enabled devices. Here is a summary of the steps in Figure 1 relating to IoT reference data on Microsoft Azure.
Why is the milk warm? In IoT applications, devices send data that is used to create insights – e.g., a truck full of dairy products with sensors sending temperature data to Walmart. The data is used to verify that the dairy products are being kept at the required temperature. These derived insights, of course, have business implications.
- Azure IoT Hub provides bi-directional communications, and the ability to send commands to the device at IoT scale across gateways. It supports standard HTTP and AMQP protocols, and device authentication with commonly used standards.
- Azure Databricks is an Apache Spark-based analytics platform used for correlating streaming data, event feeds and historical data to derive business insights. The output from Azure Databricks is written into Cosmos DB, which supports time-series data modeling.
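The dairy-truck check above can be sketched as a simple threshold rule over streaming readings. The safe range and message fields are hypothetical, not an Azure IoT Hub payload schema.

```python
# Hypothetical temperature check for the dairy-truck example; the threshold
# and message fields are illustrative, not an Azure IoT Hub payload schema.

SAFE_RANGE = (0.0, 4.0)  # assumed safe storage range for dairy, in Celsius

def check_telemetry(readings):
    """Flag any reading outside the safe range for follow-up action."""
    alerts = []
    for reading in readings:
        temp = reading["temperature_c"]
        if not (SAFE_RANGE[0] <= temp <= SAFE_RANGE[1]):
            alerts.append({"device": reading["device_id"], "temperature_c": temp})
    return alerts

stream = [
    {"device_id": "truck-17", "temperature_c": 3.2},
    {"device_id": "truck-17", "temperature_c": 6.8},  # too warm
]
print(check_telemetry(stream))  # [{'device': 'truck-17', 'temperature_c': 6.8}]
```

In production this rule would run over the event stream (e.g., in Databricks) rather than over an in-memory list.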
Data derived from the event feed and from the warm storage path of Cosmos DB can be enriched in serverless Azure Functions. From there, the data goes to analysts or data science users to derive insights and business transformations. In activities such as shipping and logistics, tracking and analyzing vehicle data provides visibility into how route operations can be optimized, with a lower risk of accidents.
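The enrichment step described above amounts to joining an incoming event with reference data from the warm store. In this sketch a dict stands in for Cosmos DB, and the function shape is illustrative rather than the actual Azure Functions signature.

```python
# Sketch of the enrichment step: join an event with reference data from a
# warm store. The lookup dict stands in for Cosmos DB, and the function
# shape is illustrative, not the Azure Functions trigger signature.

WARM_STORE = {"truck-17": {"route": "route-42", "carrier": "Acme Dairy"}}

def enrich(event, store):
    """Attach reference attributes from the warm store to an incoming event."""
    reference = store.get(event["device_id"], {})
    return {**event, **reference}  # event fields plus reference fields

event = {"device_id": "truck-17", "temperature_c": 6.8}
print(enrich(event, WARM_STORE))
# {'device_id': 'truck-17', 'temperature_c': 6.8, 'route': 'route-42', 'carrier': 'Acme Dairy'}
```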
The low cost of retaining business data indefinitely, coupled with cloud-based data processing, can uncover hidden insights that create business operating efficiencies and support decisioning and predictive analysis to mitigate risks.