Hive, the Hadoop Data Warehouse Framework: Common Business Operations in Practice
By 阿新 · Published 2020-12-27
1. Create a test database and switch to it
[root@master ~]# $HIVE_HOME/bin/hive
Logging initialized using configuration in jar:file:/usr/local/src/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive> create database dzw;
OK
Time taken: 1.6 seconds
hive> use dzw;
OK
Time taken: 0.052 seconds
2. Create the orders and trains tables
2.1 Field analysis
Create the table
hive> create table orders(
> order_id string,
> user_id string,
> eval_set string,
> order_number string,
> order_dow string,
> order_hour_of_day string,
> days_since_prior_order string
> )
> row format delimited fields terminated by ','
> lines terminated by '\n'
-- skip the first line of the file (the CSV header)
> tblproperties("skip.header.line.count"="1");
OK
Time taken: 0.112 seconds
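The `tblproperties("skip.header.line.count"="1")` setting tells Hive to ignore the first line of each data file when reading the table. A minimal plain-Python sketch of the same idea, on a toy stand-in for orders.csv (not Hive itself):

```python
import csv
import io

# Toy stand-in for orders.csv: the first line is the header that
# skip.header.line.count="1" makes Hive ignore.
raw = io.StringIO(
    "order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order\n"
    "2539329,1,prior,1,2,08,\n"
    "2398795,1,prior,2,3,07,15.0\n"
)

reader = csv.reader(raw)
next(reader)  # skip the header line, as skip.header.line.count="1" does
rows = list(reader)
print(len(rows))    # 2 data rows remain
print(rows[0][0])   # first order_id: 2539329
```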
Load the data
hive> load data local inpath '/usr/local/src/apache-hive-1.2.2-bin/data/orders.csv'
> into table orders;
Loading data to table dzw.orders
Table dzw.orders stats: [numFiles=1, totalSize=108968645]
OK
Time taken: 3.548 seconds
Query the data
hive> select * from orders limit 10;
OK
2539329 1 prior 1 2 08
2398795 1 prior 2 3 07 15.0
473747 1 prior 3 3 12 21.0
2254736 1 prior 4 4 07 29.0
431534 1 prior 5 4 15 28.0
3367565 1 prior 6 2 07 19.0
550135 1 prior 7 1 09 20.0
3108588 1 prior 8 1 14 14.0
2295261 1 prior 9 1 16 0.0
2550362 1 prior 10 4 08 30.0
Time taken: 0.121 seconds, Fetched: 10 row(s)
Field descriptions:
order_id: order ID
user_id: user ID
eval_set: which set the order belongs to (prior history vs. training)
order_number: the sequence number of this order within the user's purchase history
order_dow: order day of week, the day of the week the order was placed (0-6)
order_hour_of_day: the hour of the day the order was placed (0-23)
days_since_prior_order: the number of days since the user's previous order
2.2 Create the trains table
Field descriptions:
order_id: order ID
product_id: product ID
add_to_cart_order: the position at which the product was added to the cart
reordered: whether the user had ordered this product before (1 = yes, 0 = no)
Create the table
hive> create table trains(
> order_id string,
> product_id string,
> add_to_cart_order string,
> reordered string
> )
> row format delimited fields terminated by ','
> lines terminated by '\n';
OK
Time taken: 0.878 seconds
Load the data and query the first 10 rows
hive> load data local inpath '/usr/local/src/apache-hive-1.2.2-bin/data/order_products__train.csv'
> into table trains;
Loading data to table dzw.trains
Table dzw.trains stats: [numFiles=1, totalSize=24680147]
OK
Time taken: 1.633 seconds
hive> select * from trains limit 10;
OK
order_id product_id add_to_cart_order reordered
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1
1 13176 6 0
1 47209 7 0
1 22035 8 1
36 39612 1 0
Time taken: 0.552 seconds, Fetched: 10 row(s)
Unlike orders, the trains table was created without skip.header.line.count, so the CSV header was loaded as a data row (visible above). Filter it out and verify the result:
hive> insert overwrite table trains
> select * from trains where order_id !='order_id';
Query ID = root_20201225175017_e0e740eb-b3cb-4e02-8eeb-3043eb75b946
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1608254818743_0028, Tracking URL = http://master:8088/proxy/application_1608254818743_0028/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0028
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-25 17:50:28,317 Stage-1 map = 0%, reduce = 0%
2020-12-25 17:50:39,025 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.37 sec
MapReduce Total cumulative CPU time: 4 seconds 370 msec
Ended Job = job_1608254818743_0028
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/data/hive/warehouse/dzw.db/trains/.hive-staging_hive_2020-12-25_17-50-17_429_2849462889173378151-1/-ext-10000
Loading data to table dzw.trains
Table dzw.trains stats: [numFiles=1, numRows=1384617, totalSize=24680099, rawDataSize=23295482]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.37 sec HDFS Read: 24684298 HDFS Write: 24680176 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 370 msec
OK
Time taken: 23.199 seconds
hive> select * from trains limit 10;
OK
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1
1 13176 6 0
1 47209 7 0
1 22035 8 1
36 39612 1 0
36 19660 2 1
Time taken: 0.122 seconds, Fetched: 10 row(s)
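The cleanup above is simply a self-overwrite with a filter: keep every row whose order_id is not the literal header text. A minimal Python sketch of the same filter, on toy rows:

```python
# Toy trains rows, including the stray header row that was loaded as data.
trains = [
    ("order_id", "product_id", "add_to_cart_order", "reordered"),  # header row
    ("1", "49302", "1", "1"),
    ("1", "11109", "2", "1"),
]

# Equivalent of: insert overwrite table trains
#                select * from trains where order_id != 'order_id';
cleaned = [row for row in trains if row[0] != "order_id"]
print(len(cleaned))  # 2 rows survive; the header row is gone
```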
3. Common business operations
3.1 How many orders does each user have?
Analysis
Requirement: user_id, order_id => user_id, order_cnt
Approach: this is a grouping problem. Group by user_id, then count the rows in each group.
Because the dataset is fairly large, we only pull ten rows for inspection; otherwise the query runs very slowly.
In real work you should likewise validate the logic on a small sample first rather than running against the full data right away, which wastes a great deal of time.
hive> select user_id,count(order_id)
> from orders
> group by user_id
> limit 10;
Query ID = root_20201225180531_9bfc67bd-df69-425a-847d-8b753c8ef72e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0029, Tracking URL = http://master:8088/proxy/application_1608254818743_0029/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0029
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-25 18:05:41,044 Stage-1 map = 0%, reduce = 0%
2020-12-25 18:05:51,752 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.89 sec
2020-12-25 18:05:59,383 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.03 sec
MapReduce Total cumulative CPU time: 7 seconds 30 msec
Ended Job = job_1608254818743_0029
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.03 sec HDFS Read: 108976722 HDFS Write: 80 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 30 msec
OK
1 11
10 6
100 6
1000 8
10000 73
100000 10
100001 67
100002 13
100003 4
100004 9
Time taken: 29.954 seconds, Fetched: 10 row(s)
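The group-by-and-count logic can be sketched in plain Python on toy data (the IDs below are illustrative, not from the real dataset):

```python
from collections import Counter

# Toy (user_id, order_id) pairs from orders.
orders = [
    ("1", "2539329"), ("1", "2398795"), ("1", "473747"),
    ("2", "1000001"), ("2", "1000002"),
]

# Equivalent of: select user_id, count(order_id) from orders group by user_id;
order_cnt = Counter(user_id for user_id, _ in orders)
print(order_cnt["1"])  # 3
print(order_cnt["2"])  # 2
```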
3.2 On average, how many products does each user buy per order?
(1) How many products are in each order?
hive> select order_id,count(product_id) pro_cnt
> from trains
> group by order_id
> limit 10;
Query ID = root_20201225201808_59c0d4b7-4aab-4715-bf12-747a502a4bdc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0037, Tracking URL = http://master:8088/proxy/application_1608254818743_0037/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0037
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-25 20:18:19,039 Stage-1 map = 0%, reduce = 0%
2020-12-25 20:18:26,685 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.35 sec
2020-12-25 20:18:34,173 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.16 sec
MapReduce Total cumulative CPU time: 5 seconds 160 msec
Ended Job = job_1608254818743_0037
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.16 sec HDFS Read: 24687727 HDFS Write: 95 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 160 msec
OK
1 8
100000 15
1000008 7
1000029 8
100003 2
1000046 32
1000080 7
1000162 22
1000197 2
1000209 4
Time taken: 26.807 seconds, Fetched: 10 row(s)
(2) Join the per-order product counts back to each user
hive> select od.user_id,t.pro_cnt
> from orders od
> inner join (select order_id,count(product_id) pro_cnt
> from trains
> group by order_id
> limit 100
> ) t
> on od.order_id=t.order_id
> limit 10;
Query ID = root_20201225194141_0f7f1456-3b80-424b-8f39-19763b884c1b
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0030, Tracking URL = http://master:8088/proxy/application_1608254818743_0030/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-25 19:41:49,190 Stage-1 map = 0%, reduce = 0%
2020-12-25 19:41:57,591 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.54 sec
2020-12-25 19:42:06,041 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.38 sec
MapReduce Total cumulative CPU time: 5 seconds 380 msec
Ended Job = job_1608254818743_0030
Launching Job 2 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0031, Tracking URL = http://master:8088/proxy/application_1608254818743_0031/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0031
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-12-25 19:42:19,516 Stage-2 map = 0%, reduce = 0%
2020-12-25 19:42:24,912 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2020-12-25 19:42:32,300 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.5 sec
MapReduce Total cumulative CPU time: 2 seconds 500 msec
Ended Job = job_1608254818743_0031
Stage-8 is selected by condition resolver.
Stage-9 is filtered out by condition resolver.
Stage-3 is filtered out by condition resolver.
Execution log at: /tmp/root/root_20201225194141_0f7f1456-3b80-424b-8f39-19763b884c1b.log
2020-12-25 19:42:46 Starting to launch local task to process map join; maximum memory = 518979584
2020-12-25 19:42:48 Dump the side-table for tag: 1 with group count: 100 into file: file:/tmp/root/91d22153-3518-4d43-b107-ce821689c861/hive_2020-12-25_19-41-41_039_8931438817336216656-1/-local-10005/HashTable-Stage-5/MapJoin-mapfile01--.hashtable
2020-12-25 19:42:48 Uploaded 1 File to: file:/tmp/root/91d22153-3518-4d43-b107-ce821689c861/hive_2020-12-25_19-41-41_039_8931438817336216656-1/-local-10005/HashTable-Stage-5/MapJoin-mapfile01--.hashtable (2953 bytes)
2020-12-25 19:42:48 End of local task; Time Taken: 1.936 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1608254818743_0032, Tracking URL = http://master:8088/proxy/application_1608254818743_0032/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0032
Hadoop job information for Stage-5: number of mappers: 1; number of reducers: 0
2020-12-25 19:42:56,816 Stage-5 map = 0%, reduce = 0%
2020-12-25 19:43:05,282 Stage-5 map = 100%, reduce = 0%, Cumulative CPU 2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1608254818743_0032
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.38 sec HDFS Read: 24687287 HDFS Write: 2696 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.5 sec HDFS Read: 6902 HDFS Write: 2696 SUCCESS
Stage-Stage-5: Map: 1 Cumulative CPU: 2.85 sec HDFS Read: 19802604 HDFS Write: 80 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 730 msec
OK
6681 6
8137 6
9099 2
10892 5
11849 5
12918 1
20729 6
23867 20
24627 10
29165 15
Time taken: 85.301 seconds, Fetched: 10 row(s)
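A minimal Python sketch of the subquery-plus-inner-join, on toy data (all names and IDs here are made up for illustration):

```python
from collections import Counter

# orders: order_id -> user_id; trains: (order_id, product_id) pairs.
orders = {"1": "u1", "36": "u2", "99": "u3"}
trains = [("1", "49302"), ("1", "11109"), ("36", "39612")]

# Subquery: count products per order.
pro_cnt = Counter(order_id for order_id, _ in trains)

# Inner join on order_id: keep only orders that appear on both sides,
# and map each order's product count to its user.
joined = [(orders[oid], cnt) for oid, cnt in pro_cnt.items() if oid in orders]
print(sorted(joined))  # [('u1', 2), ('u2', 1)]
```

Order "99" has no rows in trains, so it drops out of the inner join, just as orders without train products drop out of the Hive query.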
(3) Compute each user's average number of products per order
hive> select od.user_id,sum(t.pro_cnt)/count(*),avg(t.pro_cnt)
> from orders od
> inner join (select order_id,count(product_id) pro_cnt
> from trains
> group by order_id
> limit 100
> ) t
> on od.order_id=t.order_id
> group by od.user_id
> limit 10;
Query ID = root_20201225200037_7d1b3577-822a-4983-9d0a-cc33eca009e1
Total jobs = 6
Launching Job 1 out of 6
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0033, Tracking URL = http://master:8088/proxy/application_1608254818743_0033/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0033
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-25 20:00:44,401 Stage-1 map = 0%, reduce = 0%
2020-12-25 20:00:52,652 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.11 sec
2020-12-25 20:01:00,065 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.77 sec
MapReduce Total cumulative CPU time: 4 seconds 770 msec
Ended Job = job_1608254818743_0033
Launching Job 2 out of 6
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0034, Tracking URL = http://master:8088/proxy/application_1608254818743_0034/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0034
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2020-12-25 20:01:12,082 Stage-2 map = 0%, reduce = 0%
2020-12-25 20:01:17,429 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.86 sec
2020-12-25 20:01:25,707 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.42 sec
MapReduce Total cumulative CPU time: 2 seconds 420 msec
Ended Job = job_1608254818743_0034
Stage-9 is selected by condition resolver.
Stage-10 is filtered out by condition resolver.
Stage-3 is filtered out by condition resolver.
Execution log at: /tmp/root/root_20201225200037_7d1b3577-822a-4983-9d0a-cc33eca009e1.log
2020-12-25 20:01:41 Starting to launch local task to process map join; maximum memory = 518979584
2020-12-25 20:01:44 Dump the side-table for tag: 1 with group count: 100 into file: file:/tmp/root/91d22153-3518-4d43-b107-ce821689c861/hive_2020-12-25_20-00-37_002_3179557432139060992-1/-local-10006/HashTable-Stage-6/MapJoin-mapfile21--.hashtable
2020-12-25 20:01:44 Uploaded 1 File to: file:/tmp/root/91d22153-3518-4d43-b107-ce821689c861/hive_2020-12-25_20-00-37_002_3179557432139060992-1/-local-10006/HashTable-Stage-6/MapJoin-mapfile21--.hashtable (2953 bytes)
2020-12-25 20:01:44 End of local task; Time Taken: 3.012 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 6
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1608254818743_0035, Tracking URL = http://master:8088/proxy/application_1608254818743_0035/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0035
Hadoop job information for Stage-6: number of mappers: 1; number of reducers: 0
2020-12-25 20:01:53,119 Stage-6 map = 0%, reduce = 0%
2020-12-25 20:02:01,607 Stage-6 map = 100%, reduce = 0%, Cumulative CPU 3.8 sec
MapReduce Total cumulative CPU time: 3 seconds 800 msec
Ended Job = job_1608254818743_0035
Launching Job 5 out of 6
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0036, Tracking URL = http://master:8088/proxy/application_1608254818743_0036/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0036
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 1
2020-12-25 20:02:13,693 Stage-4 map = 0%, reduce = 0%
2020-12-25 20:02:19,861 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 0.78 sec
2020-12-25 20:02:27,212 Stage-4 map = 100%, reduce = 100%, Cumulative CPU 2.53 sec
MapReduce Total cumulative CPU time: 2 seconds 530 msec
Ended Job = job_1608254818743_0036
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.77 sec HDFS Read: 24687372 HDFS Write: 2696 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.42 sec HDFS Read: 6902 HDFS Write: 2696 SUCCESS
Stage-Stage-6: Map: 1 Cumulative CPU: 3.8 sec HDFS Read: 108975489 HDFS Write: 4063 SUCCESS
Stage-Stage-4: Map: 1 Reduce: 1 Cumulative CPU: 2.53 sec HDFS Read: 11337 HDFS Write: 148 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 520 msec
OK
102695 1.0 1.0
106588 5.0 5.0
10892 5.0 5.0
112108 8.0 8.0
113656 2.0 2.0
115438 8.0 8.0
115520 1.0 1.0
116188 7.0 7.0
117294 3.0 3.0
11849 5.0 5.0
Time taken: 111.316 seconds, Fetched: 10 row(s)
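Since the join produces exactly one row per order, sum(t.pro_cnt)/count(*) and avg(t.pro_cnt) are the same quantity, which is why the two output columns agree. A toy Python sketch of the per-user averaging:

```python
from collections import defaultdict

# Toy (user_id, pro_cnt) rows: one row per order after joining
# orders to the per-order product counts.
joined = [("u1", 8), ("u1", 4), ("u2", 5)]

sums = defaultdict(int)
counts = defaultdict(int)
for user_id, pro_cnt in joined:
    sums[user_id] += pro_cnt
    counts[user_id] += 1

# sum(pro_cnt)/count(*) == avg(pro_cnt)
avg = {u: sums[u] / counts[u] for u in sums}
print(avg["u1"])  # 6.0
print(avg["u2"])  # 5.0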
4. Distribution of each user's orders across the days of the week
Approach: pivot rows into columns with conditional aggregation, one sum(case when ...) column per weekday. Note that the query below selects 'user_id' in quotes, making it a string literal; that is why every output row shows the literal text user_id in the first column. Use the unquoted column name user_id to see the actual IDs.
hive> select
> 'user_id'
> , sum(case when order_dow='0' then 1 else 0 end) dow0
> , sum(case when order_dow='1' then 1 else 0 end) dow1
> , sum(case when order_dow='2' then 1 else 0 end) dow2
> , sum(case when order_dow='3' then 1 else 0 end) dow3
> , sum(case when order_dow='4' then 1 else 0 end) dow4
> , sum(case when order_dow='5' then 1 else 0 end) dow5
> , sum(case when order_dow='6' then 1 else 0 end) dow6
> from orders
> where user_id in ('1','2','3')
> group by user_id;
Query ID = root_20201225222212_b6fe7699-1cff-4bd7-879f-1ca76a628b39
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0045, Tracking URL = http://master:8088/proxy/application_1608254818743_0045/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-25 22:22:22,282 Stage-1 map = 0%, reduce = 0%
2020-12-25 22:22:30,755 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.5 sec
2020-12-25 22:22:37,263 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.77 sec
MapReduce Total cumulative CPU time: 4 seconds 770 msec
Ended Job = job_1608254818743_0045
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.77 sec HDFS Read: 108982479 HDFS Write: 66 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 770 msec
OK
user_id 0 3 2 2 4 0 0
user_id 0 6 5 2 1 1 0
user_id 6 2 1 3 0 1 0
Time taken: 25.924 seconds, Fetched: 3 row(s)
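The conditional-aggregation pivot can be sketched in Python on toy data: one counter per user, with one slot per weekday standing in for the seven sum(case when ...) columns.

```python
from collections import defaultdict

# Toy (user_id, order_dow) rows.
orders = [("1", "2"), ("1", "3"), ("1", "3"), ("2", "0"), ("2", "6")]

# Equivalent of the sum(case when order_dow='k' then 1 else 0 end) columns:
# one counter per user, indexed by day of week 0-6.
pivot = defaultdict(lambda: [0] * 7)
for user_id, dow in orders:
    pivot[user_id][int(dow)] += 1

print(pivot["1"])  # [0, 0, 1, 2, 0, 0, 0]
print(pivot["2"])  # [1, 0, 0, 0, 0, 0, 1]
```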
5. See which products each user bought during a given hour of the day
Join the orders table with the order-products (trains) table, filtering on the hour:
hive> select od.user_id,tr.product_id
> from orders od
> inner join trains tr
> on od.order_id=tr.order_id
> where od.order_hour_of_day='16'
> limit 30;
Query ID = root_20201225211222_d681738f-a98f-4646-a9a1-b87cdeae46ca
Total jobs = 2
Stage-6 is selected by condition resolver.
Stage-1 is filtered out by condition resolver.
Execution log at: /tmp/root/root_20201225211222_d681738f-a98f-4646-a9a1-b87cdeae46ca.log
2020-12-25 21:12:36 Starting to launch local task to process map join; maximum memory = 518979584
2020-12-25 21:12:42 Dump the side-table for tag: 1 with group count: 131209 into file: file:/tmp/root/07adb778-9b83-4c21-8ca0-a04988cf34fd/hive_2020-12-25_21-12-22_556_466250092893260946-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2020-12-25 21:12:43 Uploaded 1 File to: file:/tmp/root/07adb778-9b83-4c21-8ca0-a04988cf34fd/hive_2020-12-25_21-12-22_556_466250092893260946-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (17754343 bytes)
2020-12-25 21:12:43 End of local task; Time Taken: 7.667 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1608254818743_0041, Tracking URL = http://master:8088/proxy/application_1608254818743_0041/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0041
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2020-12-25 21:12:53,310 Stage-3 map = 0%, reduce = 0%
2020-12-25 21:13:01,842 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 3.53 sec
MapReduce Total cumulative CPU time: 3 seconds 530 msec
Ended Job = job_1608254818743_0041
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 3.53 sec HDFS Read: 97802 HDFS Write: 282 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 530 msec
OK
24 31222
52 46149
52 3798
52 24135
52 16797
52 43352
52 14032
52 39275
52 8048
52 30450
52 27839
52 30720
83 2186
83 16254
83 26856
128 25005
128 11512
128 40198
128 30949
128 43643
128 43713
128 18152
128 43798
128 48220
128 15984
176 24852
176 31958
176 20995
176 44632
176 13639
Time taken: 41.451 seconds, Fetched: 30 row(s)
6. Find the oldest and the most recent timestamps
Approach: first create a table and load the data,
then apply the aggregate functions min() and max().
hive> create table `udata`(
> `user_id` string,
> `item_id` string,
> `rating` string,
> `timestamp` string
> )
> row format delimited fields terminated by '\t'
> lines terminated by '\n';
OK
Time taken: 0.06 seconds
hive> load data local inpath '/usr/local/src/apache-hive-1.2.2-bin/data/ml-100k/u.data'
> into table udata;
Loading data to table dzw.udata
Table dzw.udata stats: [numFiles=1, totalSize=1979173]
OK
Time taken: 0.317 seconds
hive> select * from udata limit 10;
OK
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
Time taken: 0.092 seconds, Fetched: 10 row(s)
hive> select min(`timestamp`) min_tim, max(`timestamp`) max_min
> from udata;
Query ID = root_20201225214757_845d7cc9-80fd-41c3-b554-2c9eda0332ee
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0043, Tracking URL = http://master:8088/proxy/application_1608254818743_0043/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0043
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-25 21:48:03,904 Stage-1 map = 0%, reduce = 0%
2020-12-25 21:48:10,313 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.64 sec
2020-12-25 21:48:17,542 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.09 sec
MapReduce Total cumulative CPU time: 3 seconds 90 msec
Ended Job = job_1608254818743_0043
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.09 sec HDFS Read: 1986910 HDFS Write: 20 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 90 msec
OK
874724710 893286638
Time taken: 21.516 seconds, Fetched: 1 row(s)
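Note that `timestamp` is stored as a string here, so min()/max() compare lexicographically; that happens to give the right answer in this dataset because every timestamp has the same number of digits, but casting to a numeric type is safer in general. A toy sketch of the safer numeric comparison:

```python
# Toy timestamps from udata (strings, as the table stores them).
timestamps = ["881250949", "891717742", "878887116"]

# min()/max() over the column; compare as integers, since string
# comparison would break if timestamps differed in length.
min_tim = min(int(t) for t in timestamps)
max_tim = max(int(t) for t in timestamps)
print(min_tim, max_tim)  # 878887116 891717742
```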
7. Determine on which days each user was most active
hive> select
> user_id, collect_list(cast(days as int)) as day_list
> from
> (select
> user_id
> , (cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (24*60*60) * rating as days
> from udata
> ) t
> group by user_id
> limit 10;
Query ID = root_20201225215327_0f6867b5-be7b-4e08-870c-fdee5a8f41c0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1608254818743_0044, Tracking URL = http://master:8088/proxy/application_1608254818743_0044/
Kill Command = /usr/local/src/hadoop-2.6.5/bin/hadoop job -kill job_1608254818743_0044
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-25 21:53:33,180 Stage-1 map = 0%, reduce = 0%
2020-12-25 21:53:41,420 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.68 sec
2020-12-25 21:53:47,711 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.24 sec
MapReduce Total cumulative CPU time: 4 seconds 240 msec
Ended Job = job_1608254818743_0044
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.24 sec HDFS Read: 1989361 HDFS Write: 4098 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 240 msec
OK
1 [682,158,682,843,271,1054,204,682,341,636,843,667,1060,853,758,632,682,1054,569,853,91,843,1054,682,40,843,843,843,338,843,632,632,1054,948,379,210,421,1054,632,853,843,203,758,682,170,341,1054,208,511,569,338,208,511,758,682,682,843,1054,379,341,834,1054,170,843,210,843,682,1054,835,569,210,203,1044,632,853,1060,632,853,569,843,682,843,758,511,948,1054,848,853,843,204,417,170,758,421,170,511,843,417,569,632,848,632,948,610,170,338,682,632,183,170,632,511,814,632,853,1060,843,569,682,421,843,1054,569,569,843,1054,853,511,208,853,843,843,511,632,81,632,948,511,853,1060,843,170,632,848,843,843,843,948,1060,170,208,1054,569,511,341,848,914,263,511,511,569,263,758,682,758,163,626,1054,210,511,758,417,626,632,1060,1054,853,682,1018,341,843,135,948,632,511,341,1060,569,848,421,170,1060,843,569,948,758,853,379,682,170,682,758,843,853,1060,843,853,122,853,843,848,835,341,853,421,421,170,1054,1060,853,843,203,1044,1054,843,843,948,569,341,1060,843,1054,569,203,682,1054,948,511,632,843,341,682,1060,511,853,1054,843,569,682,341,948,421,632,189,1054,682,1054,135,1054,170,271,948,1054,204,843,682,1054,843,1060,626,853,421]
10 [712,712,534,891,712,890,534,891,891,890,712,597,712,712,712,712,890,653,712,712,712,891,712,712,534,891,712,490,891,712,534,712,891,891,712,712,712,712,712,712,891,891,712,891,891,712,890,712,890,891,712,712,534,712,712,712,712,891,712,712,712,712,712,712,712,712,712,712,712,712,534,890,890,712,712,712,712,712,712,891,891,891,891,890,712,712,534,712,712,712,712,890,712,891,891,891,712,712,534,891,891,712,890,891,712,712,712,712,712,891,712,712,712,712,712,712,712,712,891,712,891,891,712,534,891,534,712,712,712,712,534,712,712,534,712,712,891,712,534,712,712,891,712,712,891,712,712,712,891,712,712,891,712,712,712,712,890,712,712,712,890,712,712,534,891,712,891,712,890,712,712,712,891,712,891,890,712,891,890,712,712,712,712,534]
100 [88,44,66,22,88,88,44,44,88,66,66,66,66,88,110,66,66,66,66,88,66,66,110,44,44,88,88,88,88,66,66,66,22,88,22,22,66,66,88,22,66,66,66,88,88,88,66,88,88,22,44,44,44,44,66,88,66,88,110]
101 [560,560,560,560,560,747,373,560,560,373,373,373,560,560,373,560,373,747,373,747,560,747,373,747,186,934,373,186,373,373,747,560,560,747,560,747,747,747,373,560,747,560,747,560,560,373,747,747,373,373,560,373,373,560,373,560,560,560,186,560,747,747,560,560,560,560,373]
102 [220,155,155,51,155,347,331,103,155,103,604,155,331,155,231,231,216,103,103,155,155,207,3,6,441,10,3,155,437,207,103,155,51,13,207,155,155,103,331,155,155,155,155,493,207,104,533,155,155,103,347,155,6,150,155,103,10,103,103,103,207,604,207,207,155,6,207,103,155,103,441,155,441,103,103,220,6,103,6,10,155,6,155,103,103,103,103,155,347,155,155,155,155,155,155,103,103,51,103,103,103,331,10,10,155,194,103,10,103,144,220,155,155,155,136,155,51,13,805,51,155,6,6,331,155,188,103,103,331,220,103,347,103,103,51,155,10,207,10,51,103,220,155,103,103,231,231,207,103,103,155,155,6,51,657,220,231,155,155,103,6,441,10,155,103,43,805,155,155,10,103,103,10,6,155,144,6,115,10,6,155,155,103,284,231,155,6,155,155,6,207,155,155,103,207,6,155,51,103,6,155,155,480,10,103,207,263,331,103,220,136,179,207,155,103,155]
103 [595,595,446,595,595,744,446,446,744,744,595,595,446,446,297,148,446,446,595,595,744,446,446,595,595,446,595,744,446]
104 [56,111,112,223,111,168,111,168,55,55,167,167,55,112,167,280,56,167,223,55,223,168,167,55,55,112,223,167,167,167,111,55,167,224,55,111,112,111,167,279,168,167,112,168,56,168,167,168,223,223,223,111,167,55,223,168,280,224,224,223,112,111,224,278,112,223,167,55,168,167,167,167,223,55,167,111,167,55,56,224,112,280,223,224,111,279,56,111,111,55,55,111,112,278,56,167,167,224,112,56,168,167,168,167,112,112,111,168,111,167,111]
105 [188,235,188,94,188,141,141,188,188,235,94,141,94,94,235,141,188,235,188,94,94,141,141]
106 [547,547,547,435,410,547,547,217,410,547,547,548,410,410,684,547,684,326,548,547,547,547,548,547,410,435,548,410,435,212,410,684,684,547,547,410,326,547,326,547,547,684,217,544,547,410,410,435,547,410,548,435,547,410,684,547,684,411,435,410,410,410,685,547]
107 [70,46,93,46,93,23,117,117,93,93,46,23,46,117,70,46,70,93,70,23,70,70]
Time taken: 21.771 seconds, Fetched: 10 row(s)
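A toy Python sketch of the derived `days` expression and the collect_list() aggregation (893286638 is the max timestamp found in the previous section; the rows below are illustrative):

```python
from collections import defaultdict

MAX_TS = 893286638  # latest timestamp in udata, from the min/max query

# Toy (user_id, timestamp, rating) rows.
udata = [
    ("1", "881250949", 3),
    ("1", "891717742", 2),
    ("2", "878887116", 1),
]

# Equivalent of:
#   (893286638 - cast(timestamp as bigint)) / (24*60*60) * rating as days
# followed by collect_list(cast(days as int)) grouped by user_id.
day_list = defaultdict(list)
for user_id, ts, rating in udata:
    days = (MAX_TS - int(ts)) / (24 * 60 * 60) * rating
    day_list[user_id].append(int(days))  # cast as int truncates, as in Hive

print(day_list["1"])  # [417, 36]
print(day_list["2"])  # [166]
```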
8. Users who purchased at least 100 distinct products
(The query also reads from a prior table holding the historical order-products data, assumed to have been created and loaded the same way as trains.)
hive> select
> user_id, count(distinct product_id) pro_cnt
> from
> (
> -- order training data; scenario: merging data from two (old and new) systems
> select
> a.user_id,b.product_id
> from orders as a
> left join trains b
> on a.order_id=b.order_id
> union all
> -- order history data
> select
> a.user_id,b.product_id
> from orders as a
> left join prior b
> on a.order_id=b.order_id
> ) t
> group by user_id
> having pro_cnt >= 100
> limit 10;
Or, equivalently, using a CTE:
with user_pro_cnt_tmp as (
select * from
(-- order training data
select
a.user_id,b.product_id
from orders as a
left join trains b
on a.order_id=b.order_id
union all
-- order history data
select
a.user_id,b.product_id
from orders as a
left join prior b
on a.order_id=b.order_id
) t
)
select
user_id
, count(distinct product_id) pro_cnt
from user_pro_cnt_tmp
group by user_id
having pro_cnt >= 100
limit 10;
Result
100001 211
100010 119
100038 177
10009 125
100092 152
100114 115
100123 170
100146 154
100173 106
100187 154
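A toy Python sketch of the union-all, distinct-count, and having steps (the threshold is lowered from 100 to 3 so the toy data yields a non-empty result):

```python
# Toy (user_id, product_id) rows from the trains and prior join branches.
train_rows = [("u1", "p1"), ("u1", "p2")]
prior_rows = [("u1", "p2"), ("u1", "p3"), ("u2", "p1")]

# union all keeps duplicates; count(distinct product_id) de-duplicates
# per user via a set.
combined = train_rows + prior_rows
products = {}
for user_id, product_id in combined:
    products.setdefault(user_id, set()).add(product_id)

pro_cnt = {u: len(s) for u, s in products.items()}

# Equivalent of: having pro_cnt >= 100 (threshold shrunk for toy data).
heavy = {u for u, c in pro_cnt.items() if c >= 3}
print(pro_cnt["u1"], heavy)  # 3 {'u1'}
```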