New in Cloudera Enterprise 6: Apache Hive 2.1
We recently released Cloudera Enterprise 6.0 featuring significant improvements across a number of core components. In this blog post, we’re going to focus on Apache Hive 2.1.
Hive’s Approach to Rebase: Stability and Quality Most Important
Prior to the release of Cloudera Enterprise 6.0, Cloudera’s supported platform included Apache Hive 1.1 augmented with numerous features, enhancements and fixes from the later Apache Hive releases—all of which were included only after rigorous quality criteria were met. As Hive forms the foundation of the data infrastructure stack in our customer base, customers have routinely highlighted quality and stability as the most important aspects when it comes to Hive. Therefore, as in pre-Enterprise 6.0 releases, we will continue to rebase to a stable upstream release in Enterprise 6.0 (Apache Hive 2.1) and then include relevant features, enhancements, and fixes from later Apache Hive releases after the requisite quality bars are met.
Usability Improvements
Apart from stability and quality, we are focusing on making our platform easier to use. To that end, we have included numerous SQL enhancements in Enterprise 6.0 that make it easier and faster for end users to transform and process data using Hive, such as support for UNION DISTINCT.
We take a holistic view of usability, and we consider the ability to resolve issues with Hive workloads efficiently to be an important part of it. Cloudera advocates self-service analytics, and we see our customers steadily adopting it so that line-of-business users can run their analytics with minimal support from central IT. Our improvements help both end users and cluster administrators identify issues as they happen and give them the insight they need to decide on the best course of action. Specifically, for Hive-on-Spark (HoS) we have made debugging easier by showing the session ID, query ID, and DAG ID in the Spark UI, improved logging (removing unnecessary stack traces, providing better error messages, and so on), and added more metrics.
Numerous partners and customers also use Hive APIs to build applications. In the past, it was not always clear which APIs are internal, and therefore subject to change, and which are stable and intended for public consumption. That confusion has resulted in broken applications. We are leading a community-wide effort to standardize public APIs by adding annotations to the HMS, SerDe, and StorageHandler APIs. Developers who use these APIs can now consume them with greater confidence. See HIVE-17129 for more details.
Efficiency Gains
We are also focusing on efficiency across our platform. While on-premises platform efficiency helps manage costs in the long run, the immediate benefits of in-cloud deployments are realized by reducing total cost of ownership (TCO). To meet this goal, we introduced Hive-on-Spark two years ago in collaboration with Intel, our strategic partner, with whom we have long worked to optimize Cloudera’s stack on Intel architecture for our customers’ benefit.
In Enterprise 6.0, we build on our strategic partnership with Intel for further efficiency gains in Hive by introducing a major performance and efficiency enhancement in HoS called Parquet Vectorization. Instead of processing one row at a time, the HoS engine batches rows together into column vectors and has each operator work on those column vectors. This leads to better utilization of CPU caches and achieves higher instructions per cycle by using the CPU instruction pipeline efficiently. In addition, we include numerous other performance improvements. For example, Hive often scans a given table multiple times during self-joins, self-unions, or shared subqueries. To address this, dynamic RDD caching in HoS reuses a single scan across all of these operations. Similarly, when the same subquery is used repeatedly, HoS executes it only once instead of separately for each invocation. Overall, with all these enhancements, Hive in Enterprise 6.0 can be up to 2.2X faster than Hive on the latest Enterprise 5.x release. The majority of these gains can be attributed to Parquet Vectorization for Hive-on-Spark.
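To make the row-at-a-time versus column-vector distinction concrete, here is a minimal Python sketch. It is not Hive's implementation. The function names, the `price` and `qty` columns, and the tiny batch size are all hypothetical, chosen only to show how a filter, a projection, and an aggregate can each work on a whole batch of column values rather than being invoked once per row.

```python
from typing import Dict, Iterator, List

Row = Dict[str, int]
ColumnBatch = Dict[str, List[int]]

BATCH_SIZE = 4  # illustrative only; real engines use much larger batches


def sum_row_at_a_time(rows: List[Row], threshold: int) -> int:
    """Row-oriented execution: each operator is applied once per row."""
    total = 0
    for row in rows:
        if row["price"] > threshold:            # filter operator
            total += row["price"] * row["qty"]  # projection + aggregation
    return total


def to_batches(rows: List[Row], size: int = BATCH_SIZE) -> Iterator[ColumnBatch]:
    """Pivot consecutive rows into batches holding one vector per column."""
    for start in range(0, len(rows), size):
        chunk = rows[start:start + size]
        yield {
            "price": [r["price"] for r in chunk],
            "qty": [r["qty"] for r in chunk],
        }


def filter_gt(column: List[int], threshold: int) -> List[int]:
    """Filter operator: scan one column vector, return surviving positions."""
    return [i for i, value in enumerate(column) if value > threshold]


def multiply(left: List[int], right: List[int], selected: List[int]) -> List[int]:
    """Projection operator: multiply two column vectors at selected positions."""
    return [left[i] * right[i] for i in selected]


def sum_vectorized(rows: List[Row], threshold: int) -> int:
    """Column-vector execution: each operator is applied once per batch."""
    total = 0
    for batch in to_batches(rows):
        selected = filter_gt(batch["price"], threshold)
        total += sum(multiply(batch["price"], batch["qty"], selected))
    return total
```

Both functions return the same result for the same input. The difference is that the vectorized version runs each operator in a tight loop over contiguous values of a single column, which is the access pattern that tends to suit CPU caches and instruction pipelines.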
The Fine Print
While we have gone to great lengths to maintain backward compatibility, users will encounter a few minor backward incompatibilities in Enterprise 6.0. Some of these, such as the changed semantics of UNION, were unavoidable, while others were introduced to increase the pace of backporting enhancements and fixes into the new release. These incompatibilities are documented along with the corrective actions required to address them. As always, customers must refer to that documentation before upgrading to Enterprise 6.0 Hive. In addition, a special Hive Change Guide (PDF) has been developed to support your upgrade to Enterprise 6.0 Hive.
Overall, this major version change brings many all-around improvements to Hive while minimizing breaking changes for customers. That said, it is just the foundation, with a handful of features and enhancements supported in Enterprise 6.0. In laying this foundation, we focused on stability, quality, usability, and efficiency while maintaining backward compatibility as much as possible. Building on it, future minor releases will deliver new features and enhancements at a rapid pace to significantly improve the end-user experience with Hive 2.1, which is now available in Cloudera Enterprise 6.