“It is a capital mistake to theorize before one has data.”
More than 100 years after it was written, this observation by fictional sleuth Sherlock Holmes still rings true. Businesses today rarely act – or even contemplate an action – without analyzing troves of data.
They can make good use of data, in part, because technology has democratized the processes that bring information to the end user. Data is no longer gathered only by research companies and delivered in expensive, structured reports. Now, data gathering has been commoditized to the point where companies both small and large can ingest huge amounts of unstructured data from mobile, web and IoT interfaces and manage it in the cloud.
Historically, the major big data vendors (e.g., Cloudera) have had their own stacks to help companies capture, process and manage data, usually on an Infrastructure as a Service (IaaS) cloud platform. AWS has case studies showing companies using its data analytics: GE Power, for example, uses it to help power plant customers save money, and C-SPAN uses it to identify when individual speakers appear on screen. Lahey Health uses Google Cloud analytics to help improve patient outcomes.
In recent months we have been working on a number of big data projects on the Microsoft Azure platform. Microsoft has been aggressive on the data front, creating a range of products tailored for specific customer groups – small, midsize and enterprise businesses. The big difference is in the shape of the offerings. Microsoft is providing Azure cloud-based data capabilities through a stack that resembles a Platform as a Service (PaaS). It has forged tight integrations, right out of the box, between the different data sources and the stack itself, giving customers a lot of flexibility in their deployments.
With Data Comes Flexibility
Cloud-based data projects are gaining momentum in the marketplace. Once set up, they are information powerhouses – capturing, storing and processing volumes of data companies can use to gain a business advantage. Organizations can dial the systems up or down, to ingest big chunks of information, or stream it in little by little. They can do custom projects on demand, and save time and money on bigger projects they previously outsourced to data providers.
The data pipelines may look complicated, fusing together a series of steps to bring data to its ultimate end point: the desks of business intelligence professionals. To simplify things, data movement in the pipelines can be segmented into two categories: batch and real-time.
Batch movement is applied to large data sets, usually for mainframe processing or to make business operations more cost effective. Credit card settlement is a good example. When a cardholder makes a purchase with a merchant, the issuing bank sends an authorization code to the merchant. To settle, or close, those transactions, the merchant usually compiles all the authorization codes received during the day and sends them to the payment processor in a single batch, because that is cheaper than sending them one at a time. The processor sorts the codes and forwards them to the issuing banks. At settlement, the issuing banks release the funds to the processor, which deposits them into the merchant account, paying for the transactions; each issuing bank then shows the purchase on the cardholder's next statement. For the issuing bank, batch data movements result from asynchronous batch processing, where payloads are published at different times, and the journey to statement generation proceeds through stepwise processing: refining, enriching, formatting and joining the data.
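To make the batch pattern concrete, here is a minimal Python sketch of the end-of-day compile step. The record fields, payload shape and function names are invented for illustration, not drawn from any real payment system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class Authorization:
    """One approved card transaction captured during the day (hypothetical schema)."""
    auth_code: str
    merchant_id: str
    issuing_bank: str
    amount_cents: int


def build_settlement_batch(auths: List[Authorization], business_day: date) -> Dict:
    """Compile the day's authorizations into one batch payload.

    Mirrors the stepwise processing described above: records are refined
    (invalid rows dropped), joined (grouped by issuing bank), then
    enriched and formatted into a single payload for the processor.
    """
    clean = [a for a in auths if a.amount_cents > 0]          # refine
    by_bank: Dict[str, List[str]] = {}                        # join by issuing bank
    for a in clean:
        by_bank.setdefault(a.issuing_bank, []).append(a.auth_code)
    return {                                                  # enrich + format
        "business_day": business_day.isoformat(),
        "total_cents": sum(a.amount_cents for a in clean),
        "auth_codes_by_bank": by_bank,
    }


# At end of day the merchant sends one payload instead of many small ones,
# which is what makes the batch path cost effective.
```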
Real Time Movement
Companies such as LinkedIn, Twitter and Facebook have event-driven business models, and their systems are built to generate, process and analyze streaming real-time data. Every tweet, click and profile edit is a real-time update, captured as it streams (e.g., by Kafka at LinkedIn), and such updates can total more than a billion events per day. These streaming data platforms, developed to capture real-time updates, enable businesses to respond to events in real time when combined with historical views of the data.
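For a rough sense of the streaming pattern, a producer publishing click events might look like the following sketch, using the open source kafka-python client. The broker address, topic name and event fields are made up for illustration.

```python
import json
import time

from kafka import KafkaProducer  # kafka-python client

# Connect to a local broker; in production this would be the cluster's
# bootstrap servers. The topic name "clickstream" is illustrative.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def publish_click(user_id: str, page: str) -> None:
    """Publish one click event as it happens, rather than batching it."""
    event = {"user_id": user_id, "page": page, "ts": time.time()}
    producer.send("clickstream", value=event)


publish_click("user-123", "/pricing")
producer.flush()  # make sure the event is actually sent before exiting
```

Each event is pushed the moment it occurs; downstream consumers read the topic continuously instead of waiting for an end-of-day file.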
To give you an idea of how a PaaS data system operates and how data travels from place to place, Figure 1 offers a look inside Microsoft Azure’s cloud-based big data movement pattern.
Flowing Through Different Pipelines
Data follows two separate and distinct pipelines depending on how it is captured.
Existing business processes usually follow batch movements and the associated extract, transform and load (ETL) steps, which clean and de-dupe the data for on-premises capabilities and products. Batch movements can bring in bundles of structured, semi-structured and external data. Big data systems must handle the variety, velocity and volume of the data being collected, processed, transformed and managed in order to derive relevant, meaningful insights.
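As a minimal sketch of that transform step, assuming a CSV extract with hypothetical column names, cleaning and de-duping a batch with pandas might look like this:

```python
import pandas as pd

# Load a batch extract; the file path and column names are illustrative.
df = pd.read_csv("daily_extract.csv")

# Clean: normalize casing and drop rows missing required fields.
df["email"] = df["email"].str.strip().str.lower()
df = df.dropna(subset=["customer_id", "email"])

# De-dupe: keep only the most recent record per customer.
df = (df.sort_values("updated_at")
        .drop_duplicates(subset="customer_id", keep="last"))

# Load: write the cleaned batch for downstream consumption.
df.to_parquet("daily_extract_clean.parquet", index=False)
```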
Then there are the new streams of data that move through the system in bursts. Event-based, stream-based and IoT-based data capture and processing is exploding in the data ecosystem, along with the associated architectures and cloud services. Insights can be derived from live streams, interactive sessions and website clickstream logs, processed in real time.
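On Azure, this streaming leg is often fronted by a service such as Event Hubs. Below is a minimal consumer sketch using the azure-eventhub Python SDK; the connection string and hub name are placeholders.

```python
from azure.eventhub import EventHubConsumerClient

# Placeholder connection details; substitute your own namespace and hub.
CONN_STR = "<event-hubs-connection-string>"


def on_event(partition_context, event):
    """Handle one streamed event, e.g. a website click, as it arrives."""
    print(f"partition {partition_context.partition_id}: {event.body_as_str()}")
    # Record progress; a checkpoint store is needed for this to persist.
    partition_context.update_checkpoint(event)


client = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name="clickstream"
)
with client:
    # Read from the start of each partition; blocks until interrupted.
    client.receive(on_event=on_event, starting_position="-1")
```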
Cloud-native warehouses are a new breed of products that take advantage of decoupled storage and compute in the cloud, which delivers scalability, elasticity and cost effectiveness. The decoupled storage (e.g., S3 in AWS or Blob in Azure) can persist and grow independently, while the compute can autoscale or be paused and resumed. Cloud-native warehouses also replicate across regions, providing reliability and availability.
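As an example of working against that decoupled storage layer, persisting a file to Azure Blob storage with the azure-storage-blob Python SDK might look like this sketch. The connection string, container name and file name (reusing the cleaned batch file from the earlier sketch) are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string for the storage account.
CONN_STR = "<storage-account-connection-string>"

service = BlobServiceClient.from_connection_string(CONN_STR)
blob = service.get_blob_client(container="warehouse-staging",
                               blob="daily_extract_clean.parquet")

# Storage persists and grows independently of compute: the file stays
# put while warehouse compute is scaled up, paused or resumed.
with open("daily_extract_clean.parquet", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```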
The recommended approach for exploring the Azure big data landscape on PaaS is usually to begin by implementing batch data movement patterns, which at full maturity can support a business-driven use case. Companies can build big data platform pipelines incrementally, in proof of concept (PoC) or development (DEV) environments, with a team of dedicated developers. Figure 1 shows the two parallel flows: batch movement and real-time movement.
Prepping for a Data Project
There are several prerequisites for a company setting up a big data services environment. First, you need people who understand the data stack. Having the technology in house is a big step, but you will also need access to a strong developer community or a consulting service, because the right components need to be mapped to the data. Having the right level of knowledge in house will enable you to deploy applications, along with the automation scripts that support ongoing integration with the data pipelines.
Second, you need big data infrastructure. Infrastructure dependencies, such as networked environments, private domains, and security and authentication services, are traditionally managed by a dedicated infrastructure team. In the cloud, you will need people who understand the infrastructure setup on which these data movement pipelines run. But as the pipeline matures, you will not need a whole team to maintain it; the job can generally be handled by a few cloud infrastructure experts.
Third, you need tools. The primary tools for automating big data applications have been Azure Resource Manager (ARM) templates and PowerShell for cluster deployments. Azure DevOps can serve as the primary orchestration tool for continuous integration and delivery across all applications, providing agile boards, a collaboration wiki, Git repositories, and build and release management with manual and scheduled automation pipelines.
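As a sketch of what that automation can look like in code, the following uses the azure-identity and azure-mgmt-resource Python SDKs to kick off an ARM template deployment. The subscription ID, resource group, deployment name and template stub are all placeholders; a real pipeline would supply a full template and typically run this step from a CI/CD trigger rather than by hand.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholder subscription; credentials come from the environment.
SUBSCRIPTION_ID = "<subscription-id>"
client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Stub template; cluster resources would be declared under "resources".
template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [],
}

# Start an incremental deployment of the template into a resource group;
# an Azure DevOps release pipeline would typically own this step.
poller = client.deployments.begin_create_or_update(
    "bigdata-rg",
    "cluster-deployment",
    {"properties": {"mode": "Incremental", "template": template}},
)
poller.result()  # block until the deployment completes
```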
Now you are ready to run the operation. Batch and real-time movement pipelines can be run independently or in tandem, giving an organization the ability to generate insights from multiple data paths. The steps involved, and the opportunities an Azure data environment opens up, are laid out here.
Data has come a long way. It is no longer a hard-to-get, hard-to-process resource confined to back rooms. Advanced technologies have put data in the hands of a wider array of people, giving organizations the ability to run more projects and rapidly generate more insights that are useful to their business. Companies are not theorizing anymore; they are gathering data and acting on it. Cloud platforms such as Microsoft Azure are bringing data into the modern age.