This is the first article in a multi-part series discussing the strategic considerations and crucial technical details that senior managers and CxOs need to consider in an enterprise-wide analytics infrastructure modernization strategy.
The Old Octopus and Its Tentacles
If we walked into a Fortune 500 company today and investigated the analytics infrastructure that supports its financial reporting, OLAP slicing and dicing for business intelligence, and advanced analytics and dashboards for the CxOs, we would most likely find a massive, clunky old data warehouse churning noisily at the heart of it all. The legacy data warehouse, like an old octopus, extends its tentacles into the deepest corners of the organization, either feeding on or spewing out data in various shapes and forms. Most enterprises pay millions in annual licensing fees and employ hundreds of ETL developers, DBAs and report writers to support, maintain and modify all the data feeds going in and coming out, along with the thousands of static and dynamic reports hungrily consumed by business teams all over the company.
If we sat down with the business users who are the supposed beneficiaries of such a large recurring investment, we would most likely hear a long list of usability, performance and time-to-market issues, as users openly discuss their deep dissatisfaction with their dependence on the IT organization.
So, why do businesses still keep this extremely expensive, lumbering, coughing and wheezing data warehouse that has clearly outlived its purpose?
Barriers to Modernization
Here are the three most powerful factors that play a dominant role in maintaining the data warehouse status quo.
Lack of a Defining Event
Discrete pockets of dissatisfaction rarely coalesce into a voice powerful enough to force the organization to start thinking about disrupting a legacy system that has become an unruly octopus. In our experience, we rarely see a push toward data warehouse modernization without a defining event–either an enterprise license renewal/true-up or an impending end-of-life event for one of the core technologies. In these situations, it is not too difficult to make a convincing financial case. An ROI analysis will often demonstrate the potential reduction in total cost of ownership (TCO), as a result of annual cost savings amounting to millions of dollars.
Organizational Inertia
The many large teams and highly skilled people who continue to care for and feed the data warehouse octopus have no real interest in venturing into the unknown. Senior and mid-level management are uncomfortable with starting an enormous undertaking termed “data warehouse modernization.” Without diligent education, awareness-building and change management, this resistance is hard to overcome.
Fear of Disruption
For many years, the tentacles of the data warehouse octopus have reached far and deep into the organization. Some important processes, such as the labor-intensive, multi-week madness of generating financial reports, have been built around the limitations of the data warehouse. Without a clear strategy, the mere possibility of change, and the disruption it would bring, makes seasoned IT directors go weak at the knees.
Organizations that have embarked with us on such a journey first take a strategic pause and invest in building a careful roadmap that anticipates potential challenges and mitigates risks.
A Case for Modernization: Beyond the Obvious
Gerry Fierling, a colleague from my days as a Microsoft Senior Product Manager for Business Intelligence, could distil the essence of a complex situation into a short, intriguing sentence. He used to say:
“The biggest reason behind the failure of a data warehouse is its success.”
What Gerry meant was that as more people utilize the reports and feeds from the data warehouse, more innovative usage requirements crop up. Many of these new requirements can’t be supported by our unruly old octopus because it’s almost impossible to design a data warehouse that anticipates all possible future access patterns. The moment you commit to a structure–a relational data model, to be specific–you’ve planted the seed for future dissatisfied users, whose requirements won’t be supported without sticking some amount of plaster and glue onto the solution.
“Schema on Read” Rather than “Schema on Write”
The concept of “schema on write,” which means committing up front to a structure in which to store data, inherently limits you to the initial use cases and whatever future ones that structure happens to accommodate, even within a well-designed data warehouse. Yet modern enterprises need to support a wide variety of evolving analytics patterns and workloads. Precommitting to a relational structure and treating it as a panacea almost always leads to unhappy end users.
A data design and storage approach that stores data in raw form, without committing to a structure, enables different tools to impose a structure, or schema, when the data is read. This “schema on read” approach is not tied up front to a specific model: every tool that uses all or part of the data can apply its own schema to give the data the specific meaning its analysis pattern requires.
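As a minimal illustration, the sketch below uses PySpark (the bucket path and field names are hypothetical) to read the same raw JSON files twice, imposing a narrow reporting schema in one case and letting Spark infer a structure in the other; the stored files themselves are never rewritten.

```python
# Schema-on-read sketch: two consumers, two schemas, one set of raw files.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# A reporting consumer imposes a narrow, typed schema at read time ...
orders_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("created_at", TimestampType()),
])
orders = spark.read.schema(orders_schema).json("s3://acme-datalake/raw/orders/")

# ... while an exploratory consumer lets Spark infer the full structure
# of the very same raw files.
orders_full = spark.read.json("s3://acme-datalake/raw/orders/")

orders.groupBy("customer_id").sum("amount").show()
```

The structure lives with each consumer, not with the stored data, which is the essence of the schema-on-read argument above.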
Many Tools for Many Workloads Rather than Standardization
I will use another Gerry Fierling quote that points to the ultimate limitations of all business intelligence tools when it comes to supporting self-service BI.
“The last mile of business intelligence is always Microsoft Excel.”
You may have invested in a multi-million dollar data warehouse appliance; enterprise licenses for Tableau, MicroStrategy and Business Objects; a corporate performance management software suite for planning, budgeting and forecasting; and a large onsite and/or offshore team that is furiously pumping data from your SAP or Oracle applications into your data warehouse. But if you roamed the office corridors in disguise, peeking at the computers of your most prolific business analysts, you would find them all hitting the “export to Excel” button in a hurry. Excel, notwithstanding all its inadequacies, still frees users from the shackles and idiosyncrasies of the fancy tools that you’ve invested in.
Why is “export to Excel” the most popular button in a BI tool?
A typical list of analytics activity in a large enterprise may look like this:
- Monthly data mining computation that involves running large-scale neural networks on a twenty-node cluster.
- Filtering, joining and summarizing terabytes of data over the weekend for Monday’s CxO dashboard.
- A nightly fuzzy deduplication and record-linkage process that crawls through multiple data feeds, connecting and grouping related data.
- Full-text searches against terabytes of text that require sub-second response time.

It is simply not possible to standardize on a small set of tools that gracefully serves all these masters without running into performance issues.
If we constrain users with enterprise standards, they start generating hundreds of feeds out of the data warehouse to run specific workloads, mostly in Excel. We’ve seen a large enterprise use Business Objects mainly as a data feeder to Excel. Dependence on IT grows, self-service business intelligence remains an aspiration and Excel worksheets proliferate at every level of the organization. To enable innovation across the organization, the analytics infrastructure should support a variety of front-end analysis patterns and a range of tools.
Polyglot Persistence Rather than Relational Models
James Serra defines polyglot persistence in one of his blogs as follows:
“Polyglot Persistence is a fancy term to mean that when storing data, it is best to use multiple data storage technologies, chosen based upon the way data is being used by individual applications or components of a single application.”
Martin Fowler in his blog on polyglot persistence proclaims:
“I’m confident to say that if you’re starting a new strategic enterprise application, you should no longer be assuming that your persistence should be relational. The relational option might be the right one–but you should seriously look at other alternatives.”
You need to match each workload to an execution engine tailored to it. You could run full-text searches on your data warehouse, or even on your MongoDB cluster, but that approach is not going to beat the performance of an Elasticsearch engine. Imagine a world where the organization is supported by: a Hadoop Distributed File System (HDFS) backed data lake; a massively parallel processing data warehouse appliance for the very hungry, join-intensive queries; Apache Hive on Tez with LLAP for batch SQL queries; Apache Spark for streaming analytics and machine learning activities; an Elasticsearch cluster for search-based analytics; and a MongoDB-based product catalog.
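To make “matching the workload to the engine” concrete, here is a small sketch using the Python Elasticsearch client (assuming a recent 8.x-style client; the index name and fields are hypothetical) to run a relevance-ranked full-text query, the kind of request an inverted index answers in milliseconds but a LIKE scan in a relational warehouse handles poorly.

```python
# Hypothetical full-text search against an Elasticsearch cluster.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="support_tickets",                         # hypothetical index
    query={
        "match": {
            "body": "intermittent login failure"     # analyzed, relevance-ranked
        }
    },
    size=10,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["subject"])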
Developing the infrastructure, processes, and skills to build on and support such a diverse set of technologies requires a fundamental strategic shift and a long-term commitment to that shift and its price tag on the part of the enterprise.
Cloud Rather than On-premises
This brings us to three questions that any analytics modernization strategy must answer:
- How do we modernize our analytics infrastructure to support a broad spectrum of workloads and enable business users to extract maximum value?
- How does the IT department enable self-service and get out of the way of innovation?
- How do we do this with the best possible economics?
Enterprises need to seriously consider migrating their analytics workloads to the cloud, which offers self-healing, auto-scaling infrastructure with multiple clusters that support a variety of tools and workloads and enable self-service analytics. Done well, such a migration can significantly lower downtime and TCO while improving performance and tightening security.
At CTP, we have started this journey with multiple organizations by first making a careful assessment of the analytics application portfolio and building a TCO model that highlights the benefits of cloud economics.
Key Considerations for Modernization
We strongly believe that it is going to be increasingly difficult for an enterprise to build, maintain and evolve an on-premises analytics infrastructure that supports the complex and varying needs for data, analysis and reports across the organization. Consequently, we make the critical assumption that, as an organization, you are committed to taking full advantage of cloud technology for data warehouse modernization. Here are some strategic ways to proceed.
Even for organizations that have wholeheartedly committed to adopting the cloud, we recommend a hybrid approach to start. The first step is to clearly segment the current system into a set of well-defined workloads mapped to specific constituents. It is not advisable to move all workloads to the cloud in one phase, even if you are considering simple “lift and shift” operations. At CTP, our Cloud Adoption Program provides a prescriptive roadmap that details how to take your workloads to the cloud systematically. We recommend initially staying deliberately hybrid as you learn, educate and manage the change.
Enterprise Data Lake
Our earlier discussion of schema on read (with multiple analytics engines imposing a schema of their choice at the time the data is read) naturally leads to the concept of an enterprise data lake: a place to collect and store enterprise data, both structured and unstructured, without first imposing any particular structure on it.
The enterprise data lake is typically built on a Hadoop Distributed File System (HDFS) that enables parallel and distributed computation on massive data sets, and scales with the growth of the enterprise and its data assets.
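The ingestion step into such a lake can be deliberately simple. The sketch below, assuming Amazon S3 as the lake’s storage layer (as in the AWS pattern discussed later) and hypothetical bucket and path names, lands a raw source extract untouched under a date-partitioned prefix using boto3.

```python
# Minimal sketch of landing a raw extract in a cloud data lake untouched,
# partitioned by ingestion date (bucket, prefix and file paths are hypothetical).
import datetime
import boto3

s3 = boto3.client("s3")
today = datetime.date.today()

s3.upload_file(
    Filename="/exports/crm_accounts.json",   # raw export from a source system
    Bucket="acme-datalake",
    Key=f"raw/crm/accounts/dt={today:%Y-%m-%d}/crm_accounts.json",
)
```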
When migrating from large on-premises clusters with big MPP machines to a cloud-based infrastructure, we should not assume long-running, always-on clusters unless we absolutely need them. For most advanced uses of enterprise data, especially data science workloads, we are only interested in the end results of the analysis. The cloud offers convenience, and the associated cost savings, by allowing you to automatically spin up a massive cluster, compute the result set and shut the cluster down when the job is done. The result set can then be consumed by reporting or dashboarding tools for further analysis or executive reporting.
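As a rough sketch of this pattern, the boto3 call below starts a transient EMR cluster that runs a single Spark step and terminates itself when the step completes; the cluster name, roles, instance types and S3 paths are all placeholders.

```python
# Transient (auto-terminating) EMR cluster that runs one Spark job and shuts down.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-scoring",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.2xlarge",
        "InstanceCount": 20,
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate when the step finishes
    },
    Steps=[{
        "Name": "score-customers",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://acme-analytics/jobs/score_customers.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started transient cluster", response["JobFlowId"])
```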
Same Load, Many Clusters
In the same way that we need to start thinking about clusters that come alive to do their intense computing and then go back to sleep, we should also break the habit of considering only a single cluster and start thinking about many clusters supporting different workloads. Once you are used to the development, test and deployment patterns associated with ephemeral clusters, the natural next step is to run separate clusters for the various consumption patterns: for example, one or more for data ingestion, one or more for fast queries and one for data science.
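One lightweight way to operationalize this is to keep a profile per consumption pattern and generate cluster requests from it; the values below are purely illustrative placeholders, not recommendations.

```python
# Illustrative cluster profiles: one cluster definition per consumption pattern
# instead of a single do-everything cluster (all values are hypothetical).
CLUSTER_PROFILES = {
    "ingestion":    {"instance_type": "m5.2xlarge", "count": 6,  "transient": True},
    "fast_queries": {"instance_type": "r5.4xlarge", "count": 12, "transient": False},
    "data_science": {"instance_type": "p3.2xlarge", "count": 4,  "transient": True},
}

def instances_for(workload: str) -> dict:
    """Translate a workload profile into the Instances block of an EMR request."""
    profile = CLUSTER_PROFILES[workload]
    return {
        "MasterInstanceType": profile["instance_type"],
        "SlaveInstanceType": profile["instance_type"],
        "InstanceCount": profile["count"],
        "KeepJobFlowAliveWhenNoSteps": not profile["transient"],
    }
```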
We’ve mentioned that the favorite technology to build a data lake is HDFS. If you commit to HDFS, natural choices are Apache Oozie for workflow management, Apache Pig for scripting and Apache Hive for batch and some interactive queries.
Apache Spark has gained popularity in recent years and should be a serious contender for streaming analytics and machine learning workloads. We are also seeing Redshift, DynamoDB and Elasticsearch clusters co-exist with the Hadoop ecosystem in Amazon Web Services deployments. All tools come with certain limitations, so careful upfront analysis is required to make sure must-have features are supported, or on the roadmap to be supported in the near term.
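For the streaming side, a minimal Spark Structured Streaming sketch might look like the following; the Kafka topic, broker address and windowing choice are hypothetical, and the Kafka source requires the spark-sql-kafka package on the cluster.

```python
# Minimal Structured Streaming sketch: windowed page-view counts from a topic.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("clickstream-demo").getOrCreate()

clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load())

counts = (clicks
          .selectExpr("CAST(value AS STRING) AS page", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"), col("page"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```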
Automation of Data Ingestion
In our discussions, we repeatedly hear one concern about polyglot persistence: the complexity of back-end data integration. Multiple processing engines do require more ingestion code and the associated development, maintenance and modification costs. But by splitting the workloads across multiple engines, we can simplify the data structures in each individual engine. A very typical pattern is to write ingestion code for the data lake, and then use transient AWS EMR clusters or AWS Lambda functions to trigger automatic data updates to the other persistence engines.
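As one sketch of the Lambda variant of this pattern, the handler below reacts to a new raw file landing in the lake and pushes its contents into another persistence engine, here a DynamoDB table; the bucket, table name and file layout are hypothetical.

```python
# Lambda handler triggered by an S3 object-created event: propagate the new
# raw file into a downstream persistence engine (a hypothetical DynamoDB table).
import json
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("product_catalog")

def handler(event, context):
    for record in event["Records"]:                  # one record per created object
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for item in json.loads(body):                # assumes a small JSON array per file
            table.put_item(Item=item)
```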
A reference diagram in the AWS big data documentation illustrates this pattern: data is loaded from on-premises systems into S3, transformed by a transient EMR cluster, and the results are written back to S3 for downstream consumption.
File Formats and Performance
In the world of distributed computing on clusters, the choice of file format can be crucial. The core idea is to use splittable, compressible file formats, so that data can be divided across nodes for processing and transferred compressed over the network. Avro, Parquet and ORC are now familiar file format names, but not all file formats are equal. In our experience, Apache Hive performs much better when the data is stored in ORC format; Apache Impala, on the other hand, is partial to Parquet.
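As a small sketch, the PySpark job below (paths are hypothetical) converts raw CSV into both ORC and compressed Parquet copies, so that each downstream engine can read the format it prefers.

```python
# Convert raw CSV in the lake into splittable, compressed columnar formats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-conversion").getOrCreate()

raw = spark.read.option("header", "true").csv("s3://acme-datalake/raw/orders/")

# ORC tends to pair well with Hive ...
raw.write.mode("overwrite").orc("s3://acme-datalake/curated/orders_orc/")

# ... while Parquet is the usual choice for Impala and Spark itself.
raw.write.mode("overwrite").option("compression", "snappy").parquet(
    "s3://acme-datalake/curated/orders_parquet/"
)
```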
Opportunity in Complexity
The transformative power of cloud technologies brings enormous value to the advanced analytics solutions that benefit the modern enterprise. But with that power comes responsibility: to carefully analyze the complexity of the legacy data warehousing estate; to form a solid hybrid cloud strategy for evolving and modernizing the analytics infrastructure; to manage the change effectively; and, in the process, to save millions of dollars for your organization.
In subsequent articles in this series, we will go deeper into specific technologies and solutions.