Big Data Warehousing
Data Warehousing is Dead? Or Long Live Data Warehousing?
Every large organization has an enormous amount of historical data tied up in relational databases in the form of data warehouses and data marts. These data warehouses are the workhorses behind business intelligence reporting and analytics. Even when an organization is embarking on a big data journey in the cloud, recreating the data warehouse on the cloud may not be advisable. Most organizations are instead creating use case-specific slices of the legacy data warehouse when relational data warehouse-like denormalized structures are required, and then enabling them on the cloud through additional technologies. If your organization has strong resources well-trained in SQL (and it probably does!), you will want to consider Apache Hive.
Apache Hive to the Rescue
Apache Hive, initially developed by Facebook, is a popular big data warehouse solution. It provides a SQL interface for querying data stored in the Hadoop Distributed File System (HDFS) or in Amazon S3, which AWS exposes through an HDFS-like abstraction layer called EMRFS (the EMR File System).
Apache Hive on EMR Clusters
Amazon Elastic MapReduce (EMR) provides a cluster-based managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. Apache Hive runs on Amazon EMR clusters and interacts with data stored in Amazon S3. A typical EMR cluster has a master node, one or more core nodes, and optional task nodes, running a set of software solutions capable of distributed parallel processing of data at scale.
Software Selection in EMR Cluster and Apache Tez
You can choose whether to install Apache Tez on an EMR cluster alongside Apache Hive. Traditionally, Hive processing is done through the MapReduce execution engine, which must keep writing data back to disk as it traverses the computation graph, incurring a disk I/O performance penalty. With Apache Tez, intermediate data is passed directly to the next node in the computation graph and no intermediate data is written to disk. If you install Apache Tez along with Hive, Tez becomes the default execution engine. We recommend using Tez with Hive because it will generally improve Hive query performance. We discuss Apache Tez further in the performance optimization section of this article.
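The execution engine can also be inspected or switched per session. As a minimal sketch (these are standard Hive configuration properties):

```sql
-- Show or set the execution engine for the current Hive session.
SET hive.execution.engine;         -- display the current engine
SET hive.execution.engine=tez;     -- use Tez (the default when Tez is installed)
SET hive.execution.engine=mr;      -- fall back to classic MapReduce if needed
```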
Hive Data Storage Considerations
The recommended best practice for data storage in an Apache Hive implementation on AWS is S3, with Hive tables built on top of the S3 data files. This separation of compute and storage enables the possibility of transient EMR clusters and allows the data stored in S3 to be used for other purposes. The two most important considerations for an AWS-based Apache Hive data storage design are: 1) the Hive storage structure and 2) storage format of the files in the S3 buckets.
Hive Storage Structure
Under the top-level S3 bucket, we should organize the data files in a folder structure that allows a query engine to optimize data access by avoiding scans of large tables (files) and by optimizing joins of multiple tables. The two strategies we typically employ to organize the data on S3 are: 1) Hive-based partitioning and 2) bucketing.
A partition in Hive is a directory: the partition key value is stored in the partition directory name, and the partition key appears as a virtual column in the table. With bucketing, by contrast, each bucket is a file holding the subset of the data assigned to it by a hash of the bucketing column; bucketing does not add a virtual column to the table. An optimal partitioning strategy results in faster query response through partition elimination, while bucketing improves response times through join optimization.
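Both strategies are declared in the table DDL. The following is an illustrative sketch (the bucket name, S3 path, and column names are hypothetical):

```sql
-- Partitioned, bucketed external table over data files in S3.
-- Bucket name, paths, and columns are illustrative, not prescriptive.
CREATE EXTERNAL TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (order_date STRING)             -- partition key becomes a virtual column
CLUSTERED BY (customer_id) INTO 32 BUCKETS     -- rows hashed on customer_id into 32 files
STORED AS ORC
LOCATION 's3://my-warehouse-bucket/sales/';

-- A filter on the partition column lets Hive skip whole S3 prefixes:
SELECT SUM(amount) FROM sales WHERE order_date = '2020-01-15';
```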
Hive Storage Format
Items to be considered while choosing a file format for storage include:
- Support for columnar storage
- Schema evolution
- Indexing capabilities
We recommend ORC as the starting point when choosing a file format for Apache Hive. ORC files are divided into stripes that are independent of each other. We can build indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows for high-performance reads. Within each stripe, the columns are separated from each other so the reader can read only the columns that are required.
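The stripe and index behavior described above can be controlled through table properties. A hedged sketch, using the commonly cited defaults (table and column names are illustrative):

```sql
-- ORC table with explicit compression, stripe size, and row-index stride.
-- The values shown are the commonly documented defaults.
CREATE TABLE customer_orc (
  customer_id BIGINT,
  name        STRING,
  city        STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'         = 'ZLIB',
  'orc.stripe.size'      = '67108864',   -- 64 MB stripes
  'orc.row.index.stride' = '10000',      -- index entry every 10,000 rows
  'orc.create.index'     = 'true'
);
```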
Slowly Changing Dimensions in Hive
Managing slowly changing dimensions is one of the most important processes in a data warehouse. The canonical example of a slowly changing dimension is a customer who has recently moved to a new address: the data warehouse holds a record for the customer with the old address, and a record for the same customer with the new address has just arrived from the transactional system.
The two most popular approaches to handling such a situation in analytical systems are called SCD (slowly changing dimension) Type I and Type II.
A Type I strategy overwrites an existing dimensional attribute with new information. In our example, we will overwrite the old address with the new address and won’t worry about keeping historical information about the customer.
A Type II change writes a record with the new attribute information and preserves a record of the old dimensional data. So we insert a record with the new address information, make the record the active customer record and deactivate the previous customer record. At any point in time, we have a complete history of the customer’s address changes.
Strategy to Implement SCD Type II in Hive
Hive and Hadoop are optimized for write-once, read-many patterns. Any design, such as SCD Type II, that requires updates is typically not a great candidate for such systems, but over time features have been added to support these scenarios.
In order to implement SCD II, we have to enable ACID transactions in Hive. Currently, ORC is the only file format that supports ACID transactions in Hive. As ORC is a write-once file format, changes are implemented using base files and delta files where insert, update, and delete operations are recorded. When the number of deltas exceeds a threshold, a minor compaction will automatically run and merge a set of changes into a single delta. When these compacted deltas get large enough, a major compaction will rewrite the base to incorporate these larger deltas.
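Putting the pieces together, a Type II update against an ACID table can be expressed with Hive's MERGE statement (available in Hive 2.2+). This is a hedged sketch: the table, the `customer_updates` staging table, and all column names are hypothetical, and a second INSERT pass is still needed to add the new versions of changed customers.

```sql
-- Enable ACID transactions for the session (standard Hive properties).
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ACID tables must currently be bucketed ORC with 'transactional' set.
CREATE TABLE customer_dim (
  customer_id BIGINT,
  address     STRING,
  is_current  BOOLEAN,
  valid_from  DATE,
  valid_to    DATE
)
CLUSTERED BY (customer_id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Close out changed records and insert brand-new customers in one pass.
-- (New versions of changed customers require a follow-up INSERT.)
MERGE INTO customer_dim AS d
USING customer_updates AS u
ON d.customer_id = u.customer_id AND d.is_current = true
WHEN MATCHED AND d.address <> u.address THEN
  UPDATE SET is_current = false, valid_to = current_date()
WHEN NOT MATCHED THEN
  INSERT VALUES (u.customer_id, u.address, true, current_date(), NULL);
```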
Hive Security and Ranger
Apache Hive currently provides two methods of authorization: storage-based authorization and SQL standard authorization. SQL standard authorization provides grant/revoke functionality at database and table level, using commands that would be familiar to a DBA. If more detailed control is needed, we can use Apache Ranger, which provides more granular access control.
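The grant/revoke syntax is indeed DBA-familiar. A minimal sketch (role and user names are illustrative):

```sql
-- SQL standard authorization: database/table-level grants via roles.
CREATE ROLE analyst;
GRANT ROLE analyst TO USER alice;
GRANT SELECT ON TABLE sales TO ROLE analyst;
REVOKE SELECT ON TABLE sales FROM ROLE analyst;
```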
Apache Ranger Centralized Security and Audit Framework
Apache Ranger offers a centralized security framework to manage fine-grained access control over Hadoop, Hive and other related components such as HBase. Using the Apache Ranger administration console, users can easily manage policies controlling access to a Hive database, table, or column for a particular set of users and/or groups. For deeper control of the environment, Apache Ranger also allows for audit tracking and policy analytics.
Apache Ranger policy control consists of two major parts:
- Specification of resources for which the policy is applicable (such as Hive databases/tables/columns)
- Specification of conditions, such as users/groups, access types, and custom conditions, for which access should be allowed
Hive Performance Optimization
We have already discussed three important elements of an Apache Hive implementation that need to be considered carefully to get optimal performance:
- Make sure Tez is installed on the EMR cluster and used as the Hive execution engine
- Partition the data to avoid table scans
- Use ORC as the underlying storage file format
In this section, we want to discuss Tez in more detail, and mention three more performance levers that can significantly improve query performance in Hive: vectorized query execution, the cost-based optimizer (CBO), and Live Long and Process (LLAP).
Apache Hive on Tez
In Urdu, the word 'Tez' means fast, swift, intelligent. Apache Tez has become the new paradigm for Hive execution, enabling query performance that was not possible in the MapReduce world. MapReduce is still supported for Hive execution, but Tez is now the default engine when running Hive jobs in Hadoop. As mentioned before, Tez reduces disk I/O by avoiding expensive shuffles and sorts, while leveraging more efficient map-side joins. In a typical execution pattern, data flows from node to node of an execution graph (like Apache Spark, Tez represents the computation as a directed acyclic graph); a reducer's intermediate data is passed to the next reducer without any disk writes. Consequently, Apache Tez benefits from more memory (a larger HiveServer heap size) and careful tuning of memory parameters. For example, one best practice is to set the size of Tez containers as a multiple of the YARN container size. There are many discussions and guides on Tez performance tuning on the web. We recommend that practitioners absorb the details, understand the underlying concepts, and experiment with real data.
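The container-sizing advice above can be sketched as session settings. The specific numbers here are an assumption (a 4096 MB YARN container, with heap and sort buffers sized as rough fractions of it); your cluster's values will differ and should be derived from experimentation:

```sql
-- Illustrative Tez memory tuning; values assume a 4096 MB YARN container.
SET hive.tez.container.size=4096;   -- multiple of yarn.scheduler.minimum-allocation-mb
SET hive.tez.java.opts=-Xmx3276m;   -- heap at roughly 80% of the container size
SET tez.runtime.io.sort.mb=1638;    -- sort buffer at roughly 40% of the container size
```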
Vectorized Query Execution
Hive’s default query execution engine processes one row at a time. This requires multiple layers of virtual method calls within the nested loop, which is very inefficient from a CPU perspective. Vectorized query execution is a Hive feature that aims to eliminate these inefficiencies by reading rows in batches of 1,024 and applying each operation to the entire batch of records at a time rather than row by row. This vectorized mode of execution has proven to be an order of magnitude faster for typical query operations such as scans, filters, aggregations, and joins. To use vectorized query execution, you must store your data in ORC, set the hive.vectorized.execution.enabled property to true, and run the query against the ORC-backed tables. Because vectorized query execution is not currently enabled by default on EMR clusters, it is necessary to enable this behavior in Hive manually.
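Enabling it is a one-line session setting (the reduce-side property is an additional, optional knob):

```sql
-- Enable vectorized execution; requires ORC-backed tables.
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;  -- also vectorize the reduce side
```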
Cost-Based Optimizer
The idea of a cost-based optimizer (CBO) in Apache Hive is very similar to that in the relational database world. We gather statistics, such as the number of rows in a table or partition and histograms of particularly interesting columns, that serve as input to the query optimizer's cost functions, so the CBO can compare different query execution plans and choose the one with the lowest "cost". In some cases, a very fast response can be achieved by answering a query from the stored statistics alone rather than firing a long-running execution plan.
The CBO engine in Hive uses statistics in the Hive Metastore to produce optimal query plans. There are two types of statistics that are used for optimization: table stats (which include the uncompressed size of the table, number of rows, and number of files used to store the data) and column statistics.
The downside of the CBO is that you must gather and maintain accurate statistics about your tables for the cost-based optimization engine to be effective. Unfortunately, collecting table statistics is an expensive operation, but once it is done, all subsequent queries involving the table benefit from the collected statistics.
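In practice this means enabling the CBO properties and running ANALYZE on the tables involved. A hedged sketch (the table name is illustrative, and partitioned tables need a PARTITION clause in the ANALYZE statements):

```sql
-- Enable the cost-based optimizer and the statistics it relies on.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;   -- answer some queries from stats alone
SET hive.stats.fetch.column.stats=true;

-- Gather the statistics the CBO uses.
ANALYZE TABLE sales COMPUTE STATISTICS;              -- table-level stats
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;  -- column-level stats
```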
Live Long and Process (LLAP)
Hive has become more and more performant as the demand for performance has increased and solution components such as Tez and cost-based optimization have been added. LLAP, a long-lived daemon that replaces direct interactions with the HDFS DataNode, takes Hive to the next level of maturity through, among other things, pre-fetching and caching of column chunks. An LLAP daemon runs on the worker nodes of the cluster and handles I/O, caching, and query fragment execution. Overall execution is scheduled and monitored by an existing Hive execution engine (such as Tez). The result of the work performed by an LLAP daemon can either form part of the result of a Hive query or be passed on to external Hive tasks, depending on the query.
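Once the LLAP daemons are running, routing queries to them is a configuration choice. As a hedged sketch of the relevant session properties:

```sql
-- Route query execution through LLAP daemons (assumes LLAP is already running).
SET hive.execution.mode=llap;       -- 'container' = plain Tez; 'llap' = use the daemons
SET hive.llap.execution.mode=all;   -- run all query fragments inside LLAP
```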
One of the key requirements in many environments is fine-grained, column-level access control. Because LLAP daemons can be used by other applications, and the daemon is also open through optional APIs, one could implement fine-grained access control for another data processing framework using LLAP. A great example of this is the ability to load data from Apache Hive into Apache Spark SQL DataFrames using LLAP. With Apache Ranger, one can then provide row- and column-level fine-grained access controls that Apache Spark by itself cannot provide.
Apache Hive has matured significantly over the years into a platform that can support the big data warehouse needs of a large enterprise. In the early years, any query workload that needed quick response times was not suitable for Hive. But today, with a properly tuned Apache Hive implementation on Tez with LLAP, sub-second response times are possible. An appropriate use case-driven, well-designed Apache Hive based data warehouse solution is worthy of serious consideration for today's enterprise.