First generation big data platforms are typically on-premises deployments on physical servers, often running a software stack from Cloudera, Hortonworks, DataStax or similar vendors. These old world platforms enable many companies to begin analyzing data in new ways by giving analysts a unique set of capabilities. These game-changing big data platforms allow companies to compete in their markets in new ways.
While these systems provide new capabilities, they often come at a high cost. Costs include both capital expenses to purchase unproven capabilities and operational expenses to keep these complex platforms functioning. Compared to the operational requirements of public cloud services, many on-premises deployments are held back by a three-fold increase in the number of operational resources needed to maintain the platform, manage capacity, respond to incidents, manage upgrades and educate platform users. Every resource consumed by platform maintenance (Figure 1) is a resource that is not supporting the application developers and data scientists looking to extract value from the data stored in these big data platforms.
Because of these operational challenges, many customers are now actively moving their big data workloads to next generation platforms in order to leverage native services and capabilities provided by the major public cloud providers. AWS, Google and Azure all have rapidly maturing and highly competitive managed services and PaaS capabilities. These enable organizations to focus their resources on core business needs and advanced capabilities such as data integration, data analysis and reporting.
Even with next generation big data platforms, the challenges around data quality, model accuracy and data integration will continue. But now staff can focus on resolving those challenges and aligning the data with business needs, rather than on the operational aspects of deploying services and keeping the platform stable.
Cloud vendors provide two primary models for big data services:
Managed – Services like AWS Elastic MapReduce (EMR) run on instances, but also provide pre-packaged OS images with all the necessary software tested and integrated. Managed services provide a level of flexibility because admins and users can still get to the host operating system to manage some configurations, while upgrades are handled by the platform vendor. This limited level of interaction and configuration gives operations teams and analysts deeper visibility into the platform, which is often what highly regulated environments require.
PaaS – Platform as a Service offerings, such as BigQuery from Google, provide an interface for users to connect to and submit queries, with no need for the user to take on administrative functions. There is no ability to log in to the instances running the service, and no user visibility into its inner workings as they relate to scalability and reliability. All aspects of scalability, data movement and upgrades are handled by the provider, allowing users to focus entirely on consuming the service rather than operating it. PaaS offerings provide a valuable balance of service capability without the need for operations teams focused on service deployment and operations. This level of abstraction is very valuable for companies with small or no operations teams.
Both managed and PaaS services provide advantages over first generation big data platforms through their reduced requirements for operations, management and administration. The rapid pace of innovation from AWS, Google and Azure (Figure 2) only amplifies these benefits, with a constant stream of new features and capabilities deployed at scale and often available for a minimal increase in cost.
First generation big data platforms made big data analysis a reality. But if you own a first generation big data platform and find yourself mired in the day-to-day burden of operations and stability maintenance, at the expense of supporting data integration and analysis, then consider public cloud. Through the use of managed and PaaS services, your data analysis team can focus on service consumption and quality of results, while operations teams can focus on ensuring best practices and user satisfaction.