CUBE Keyword in Apache Hive

By Rajat Venkatesh. Published June 19, 2015. Updated July 13th, 2018.

Introduction

As part of a recent project, I had to experiment with the CUBE functionality in Hive. It was added to Hive fairly recently (version 0.10) and is an advanced use case. Perhaps for these reasons, it is difficult to find examples other than the one in the Hive Wiki (Enhanced Aggregation, Cube, Grouping and Rollup). In this post, I document some of my experiments in setting up a CUBE on TPC-DS. I hope this is useful for other users new to Hive and/or cubes in Hive.

Data Model

For my experiments I used the 500 GB scale TPC-DS data set, a 10-node Hadoop cluster, and Hive 0.13.1 running on Qubole Data Service (QDS).

My goal was to calculate various measures on store sales. More specifically, I wanted to calculate:

Total Extended Price
Total Sales Price
Total Net Profit
Total Wholesale Cost
Total Coupon Amt
Total List Price

These measures then need to be broken down by many dimensions. For example, we can drill down along the following dimensions (levels in parentheses):

Date (Year, Quarter, Month, Day)
Store Information (Store Id)
Household Demographics (Number of Dependents, Buy Potential)
Customer Demographics (Gender, Marital Status, Education Status)
Ad Channel (TV, Event, Email)
Time (Hour, Minute)

The ER Diagram for the relevant tables is shown below. It’s a classic star schema.

[Figure: Cube-Rollup-Example, ER diagram of the store sales star schema]

Introduction to Cubes

This example is a typical dimensional data model found in OLAP. The data model describes the measures and the dimensions that make the data useful. A cube is the physical implementation of a dimensional data model. It captures the structure in the data model and organizes measures and dimensions in a layout optimized for aggregate queries. Queries on cubes are highly efficient and can support online applications and dashboards.

Build the Cube

Preprocess the data
First, I filtered the store_sales table to contain data from 2002 onwards to keep execution times reasonable for my experiments.

-- 2452276 is the id in date_dim for the row of Jan 1 2002
create table store_sales_2002_plus as select * from tpcds_orc_500.store_sales where ss_sold_date_sk >= 2452276;

select count(*) from store_sales_2002_plus;
278035965

Create Cube

I created a cube to store the dimensions and measures I am interested in.

create table store_sales_cube as select sum(ss_ext_sales_price) as sum_extended_price, 
             sum(ss_sales_price) as sum_sales_price, sum(ss_net_profit) as sum_net_profit,
             sum(ss_wholesale_cost) as sum_wholesale_cost, sum(ss_coupon_amt) as sum_coupon_amt,
             sum(ss_list_price) as sum_list_price, 
             d_year, d_qoy, d_moy, d_date, s_store_id,
             cd_gender, cd_marital_status, cd_education_status, grouping__id
      from store_sales_2002_plus join item on ss_item_sk = i_item_sk 
             join customer on ss_customer_sk = c_customer_sk 
             join date_dim on ss_sold_date_sk = d_date_sk 
             join customer_demographics on ss_cdemo_sk = cd_demo_sk 
             join promotion on ss_promo_sk = p_promo_sk 
             join household_demographics on ss_hdemo_sk = hd_demo_sk 
             join store on ss_store_sk = s_store_sk 
             join time_dim on ss_sold_time_sk = t_time_sk
        group by d_year, d_qoy, d_moy, d_date, s_store_id,
             cd_gender, cd_marital_status, cd_education_status
        with cube;

select count(*) from store_sales_cube;
1586304

The above query generates aggregates for all possible combinations of group by columns.
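
To make "all possible combinations" concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how Hive executes the query: the function name cube and the three toy rows are hypothetical and are not taken from TPC-DS. Columns that are left out of a grouping set come back as None, which plays the role of NULL in the cube table.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dims, measure):
    """Sum `measure` over every subset of `dims`, similar in spirit to
    GROUP BY ... WITH CUBE. Columns outside a grouping set are None."""
    out = []
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] if d in kept else None for d in dims)
                totals[key] += row[measure]
            out.extend(totals.items())
    return out


# Hypothetical toy data for illustration only.
rows = [
    {"d_year": 2002, "cd_gender": "M", "ss_sales_price": 10.0},
    {"d_year": 2002, "cd_gender": "F", "ss_sales_price": 20.0},
    {"d_year": 2003, "cd_gender": "M", "ss_sales_price": 5.0},
]

result = cube(rows, ["d_year", "cd_gender"], "ss_sales_price")
for key, total in result:
    print(key, total)

# The group by clause above lists eight columns, so it has 2**8 grouping sets.
n_sets = 2 ** 8
print(n_sets)  # 256
```

With two dimensions the sketch produces four grouping sets: the grand total, one per year, one per gender, and one per (year, gender) pair. The number of grouping sets doubles with every extra group by column, which is why the cube table has far more rows than a single group by would produce.
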
The schema of store_sales_cube is:

Column	Data Type
sum_extended_price	double
sum_sales_price	double
sum_net_profit	double
sum_wholesale_cost	double
sum_coupon_amt	double
sum_list_price	double
d_year	int
d_qoy	int
d_moy	int
d_date	timestamp
s_store_id	string
cd_gender	string
cd_marital_status	string
cd_education_status	string
grouping__id	string

A few example rows are shown below.

d_year	d_qoy	d_moy	d_date	s_store_id	s_store_name	cd_gender	cd_marital_status	cd_education_status	grouping__id	Total Sales Price
2002	3	NULL	NULL	NULL	NULL	NULL	NULL	NULL	3	2.712...E9
NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	0	1.00...E10
NULL	NULL	NULL	NULL	NULL	NULL	M	NULL	NULL	64	5.01..E9

The first row stores measures for d_year=2002, d_qoy=3 only. It is one of the rows in the result of

select d_year, d_qoy, sum(ss_sales_price) -- plus the other aggregates
from store_sales_2002_plus
join date_dim on ss_sold_date_sk = d_date_sk
group by d_year, d_qoy;

The second row stores the measures for the complete data set.

The third row stores the measures for cd_gender=M only. It is one of the rows in the result of

select cd_gender, sum(ss_sales_price) -- plus the other aggregates
from store_sales_2002_plus
join customer_demographics on ss_cdemo_sk = cd_demo_sk 
group by cd_gender;

Grouping ID

Let’s say an analyst is interested in finding sum_sales_price by gender (cd_gender). How does the analyst find the rows that store the measures for cd_gender?
grouping__id is useful for selecting rows based on the dimensions of interest. It is a column generated by Hive when the CUBE keyword is used. I included it in the select list so that I can use it in subsequent queries. The grouping ID is a bit vector of the dimensions in a cube, stored as a base-10 integer. It is generated by listing the dimensions from right to left in the same order as the group by columns in the SQL that creates the cube. A bit is set to 1 when its dimension occurs in a row, that is, when the row is grouped on that dimension.
For rows that have measures for cd_gender only, the bit vector is 001000000. Note that the bit positions in this section count nine dimensions, including the s_store_name column that appears in the example rows between s_store_id and cd_gender. The table below has a couple more examples.

Group By Columns	grouping__id	cd_marital_status	cd_gender	d_moy	d_qoy	d_year
cd_gender	64	0	1	0	0	0
d_year - d_qoy	3	0	0	0	1	1
d_year - d_moy	5	0	0	1	0	1
d_year - d_qoy - cd_marital_status	131	1	0	0	1	1
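
The rule above can be sketched in a few lines of Python. This is a rough illustration of the grouping ID behaviour described in this post for Hive 0.13.1, so check it against the Hive version you run. The helper names grouping_id and bit_vector are hypothetical, and the column order is an assumption: it is the group by list with s_store_name placed between s_store_id and cd_gender, as in the example rows, which is what the 9-bit vectors in this section imply.

```python
# Assumed column order: the group by list, plus s_store_name as it appears
# in the example rows. The first column maps to the least significant bit.
CUBE_COLUMNS = [
    "d_year", "d_qoy", "d_moy", "d_date", "s_store_id", "s_store_name",
    "cd_gender", "cd_marital_status", "cd_education_status",
]


def grouping_id(present, columns=CUBE_COLUMNS):
    """Return the base-10 grouping ID: bit i is 1 when columns[i] is grouped on."""
    unknown = set(present) - set(columns)
    if unknown:
        raise ValueError(f"not a cube column: {sorted(unknown)}")
    return sum(1 << i for i, col in enumerate(columns) if col in present)


def bit_vector(present, columns=CUBE_COLUMNS):
    """Return the same ID as a binary string, first group by column rightmost."""
    return format(grouping_id(present, columns), f"0{len(columns)}b")


print(grouping_id({"cd_gender"}))                              # 64
print(bit_vector({"cd_gender"}))                               # 001000000
print(grouping_id({"d_year", "d_qoy"}))                        # 3
print(grouping_id({"d_year", "d_moy"}))                        # 5
print(grouping_id({"d_year", "d_qoy", "cd_marital_status"}))   # 131
```

The printed values match the grouping IDs in the table above, which is a quick way to confirm a bit vector before putting it in a where clause.
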

Let's look at the rows for the cd_gender dimension.

select cd_gender, sum_sales_price
   from store_sales_cube where 
   `grouping__id` = conv("001000000", 2, 10);

cd_gender	sum_sales_price
M	5.017321159230397E9
F	5.01904028465792E9

conv is a Hive function that converts a number, given as a string, from one base (in this case 2) to another (in this case 10). It takes the string, the base of the number in the string, and the base of the result as arguments, and returns the result as a string.
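
For readers who want to check a bit vector outside Hive, the conversion can be approximated in Python. This is a minimal sketch, not Hive's implementation: the name conv_sketch is hypothetical, and it only handles non-negative numbers and bases from 2 to 36.

```python
import string

DIGITS = string.digits + string.ascii_uppercase


def conv_sketch(num, from_base, to_base):
    """Convert the string `num` from `from_base` to `to_base`; returns a string."""
    if not (2 <= from_base <= 36 and 2 <= to_base <= 36):
        raise ValueError("bases must be between 2 and 36")
    value = int(num, from_base)  # parse the digits in the source base
    if value == 0:
        return "0"
    out = []
    while value > 0:
        value, rem = divmod(value, to_base)
        out.append(DIGITS[rem])
    return "".join(reversed(out))


print(conv_sketch("001000000", 2, 10))  # 64
print(conv_sketch("010000011", 2, 10))  # 131
print(conv_sketch("131", 10, 2))        # 10000011
```
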

Total Sales Price for each quarter

Let us look at queries on the raw data and on the cube to calculate the measures. I will use the number of bytes read as a proxy for speed.

Query on raw data:
select d_year, d_qoy, sum(ss_sales_price)
from store_sales_2002_plus
join date_dim on ss_sold_date_sk = d_date_sk
group by d_year, d_qoy;

Bytes Read: 776,170,833

Query on cube:
select d_year, d_qoy, sum_sales_price
   from store_sales_cube where 
   `grouping__id` = conv("000000011", 2, 10);
Bytes Read: 11,783,322

Query on the cube scanned 1.5% of the data compared to the query on raw data tables.

Total Sales for each quarter to married customers

Let's look at another example. The following query adds another dimension, cd_marital_status, and filters the results on it:

select d_year, d_qoy, cd_marital_status, sum_sales_price
   from store_sales_cube where 
   `grouping__id` = conv("010000011", 2, 10) and cd_marital_status = "M";

GROUPING Functions in other databases

Grouping functions are important for choosing the right cells in a cube. Other databases have similar functions. For example, refer to the GROUPING functions in Oracle or GROUPING_ID in SQL Server.

Summary

In summary, we looked at an example of multidimensional data generated by the CUBE keyword in Apache Hive. We also saw how to use grouping__id to select the right cells in a cube.
