Hive Practice Exercises
1. The log format is:
pin|-|request_tm|-|url|-|sku_id|-|amount
The field delimiter is '|-|'.
Sample data:
Assume the local data file is sample.txt. Load it into table t_sample in Hive's test database, then compute each user's total purchase amount. Show the full procedure, including the table schema.
A first attempt declares the raw delimiter directly (note that Hive's default SerDe only honors a single-character field delimiter, so '|-|' will not actually split the fields correctly):
create external table t_sample
(pin string,
request_tm string,
url string,
sku_id string,
amount string)
row format delimited fields terminated by '|-|';
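As an alternative to cleaning the file, Hive ships a contrib MultiDelimitSerDe (since 0.14) that can parse multi-character delimiters directly. A sketch, assuming the contrib jar is on the classpath (the class was relocated to org.apache.hadoop.hive.serde2 in later Hive versions, so adjust to your release); the table name t_sample_multi is illustrative:

```sql
-- sketch: parse the multi-character '|-|' delimiter directly (hypothetical table name)
create external table t_sample_multi
(pin string,
request_tm string,
url string,
sku_id string,
amount string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties ("field.delim"="|-|");
```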
First clean the data by converting the delimiter to '\t' and saving the result to the local file jd.txt (note that Hive's split() takes a regular expression, so the delimiter must be escaped there as well, e.g. split(line, '\\|-\\|')).
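The cleaning step happens outside Hive; a minimal sketch with GNU sed (file names as in the text, with a one-line demo input standing in for the real log):

```shell
# demo input: one log line in the original '|-|' format
printf 'u1|-|2021-01-01 12:00:00|-|/home/|-|100|-|9.9\n' > sample.txt
# replace every multi-character '|-|' delimiter with a tab
sed 's/|-|/\t/g' sample.txt > jd.txt
```

After this, jd.txt is tab-delimited and ready for a plain `fields terminated by '\t'` table.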
After cleaning, rebuild the table against the tab-delimited file (drop the earlier definition first, since it reuses the name) and load the cleaned jd.txt rather than the raw sample.txt:
drop table if exists t_sample;
create external table t_sample
(pin string,
request_tm string,
url string,
sku_id int,
amount double)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/jd.txt' into table t_sample;
select * from t_sample;
select pin, sum(amount) as total_amount from t_sample group by pin;
2. Given an order-detail table ord_det (order_id: order number, sku_id: product ID, sale_qtty: sale quantity, dt: date partition), compute the Top 100 products by sales volume on 2016-01-01, sorted by volume in descending order.
Sample data:
123456 111111 100
234567 222222 200
345678 333333 300
456789 444444 400
567890 555555 500
DDL:
create table ord_det(order_id string,sku_id string,sale_qtty int)
partitioned by (dt string)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/ord_det.txt' into table ord_det partition(dt='20160101');
select sku_id,sum(sale_qtty) sale_count from ord_det where dt="20160101" group by sku_id order by sale_count desc limit 100;
3. Scenario: analyzing student scores for Beijing (a very large data set).
Score record format: year, school, grade, name, subject, score. Sample data:
2013,北大,1,裘容絮,語文,97
2013,北大,1,慶眠拔,語文,52
2013,北大,1,烏灑籌,語文,85
2012,清華,0,欽堯,英語,61
2015,北理工,3,冼殿,物理,81
2016,北科,4,況飄索,化學,92
2014,北航,2,孔須,數學,70
2012,清華,0,王脊,英語,59
2014,北航,2,方部盾,數學,49
2018,北航,2,東門雹,數學,77
2018,北大,1,裘容絮,語文,97
2018,北大,1,慶眠拔,語文,52
2013,北大,1,烏灑籌,語文,85
2017,清華,0,欽堯,英語,61
2015,北理工,3,冼殿,物理,81
2017,北科,4,況飄索,化學,92
2014,北航,2,孔須,數學,70
2018,清華,0,王脊,英語,59
2014,北航,2,方部盾,數學,49
2014,北航,2,東門雹,數學,77
... ...
Questions:
(1) How should a table storing this data be designed? Write the DDL:
create table score_ori
(year int,
school string,
class string,
name string,
subject string,
score double)
row format delimited fields terminated by ",";
Load the data:
load data local inpath "/opt/module/datas/score.txt" into table score_ori;
Enable dynamic partitioning (nonstrict mode is also required here, because the insert below supplies no static partition):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Create the partitioned table:
create table score_partition
(school string,
class string,
name string,
subject string,
score double)
partitioned by (year string)
row format delimited fields terminated by "\t";
Populate it via query (the dynamic partition column year must come last in the select list):
insert into table score_partition partition (year) select school,class,name,subject,score,year from score_ori;
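To confirm the dynamic-partition insert worked, the generated partitions can be listed; a routine sanity check, not part of the original exercise:

```sql
-- one partition per distinct year in score_ori is expected
show partitions score_partition;
```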
(2) For the current year, select the top three scores in each subject for every school and grade.
select *
from (
    select
        school,
        class,
        subject,
        score,
        row_number() over (partition by school, class, subject order by score desc) rank_code
    from score_partition
    where year = '2018'
) t
where t.rank_code <= 3;
(3) For the current year, find the Tsinghua (清華) grade-1 students whose total score exceeds 200, together with the number of such students.
select
    school, class, name,
    sum(score) as total_score,
    count(1) over (partition by school, class) nct
from score_partition
where year = '2018' and school = '清華' and class = '1'
group by school, class, name
having sum(score) > 200;
4. There is a very large table, TRLOG, with data as follows:
PLATFORM USER_ID CLICK_TIME CLICK_URL
WEB 12332321 2013-03-21 13:48:31.324 /home/
WEB 12332321 2013-03-21 13:48:32.954 /selectcat/er/
WEB 12332321 2013-03-21 13:48:46.365 /er/viewad/12.html
WEB 12332321 2013-03-21 13:48:53.651 /er/viewad/13.html
Create the source table:
CREATE TABLE trlog
(platform string,
user_id int,
click_time string,
click_url string)
row format delimited fields terminated by "\t";
Load the data:
load data local inpath "/opt/module/datas/log.txt" into table trlog;
Target table for the transformed click paths:
CREATE TABLE allog
(platform string,
user_id int,
seq int,
from_url string,
to_url string)
row format delimited fields terminated by "\t";
Populate it via query. Each output row describes one page visit: from_url is the previous page (lag(), NULL for the first click) and to_url is the next page (lead(), defaulting to 'Exit!' for the last click), which matches the result shown below:
insert into table allog
select
    platform,
    user_id,
    row_number() over (partition by user_id order by click_time) seq,
    lag(click_url, 1) over (partition by user_id order by click_time) as from_url,
    lead(click_url, 1, 'Exit!') over (partition by user_id order by click_time) as to_url
from trlog;
Result:
select * from allog;
+-----------+-----------+------+---------------------+---------------------+--+
| platform | user_id | seq | from_url | to_url |
+-----------+-----------+------+---------------------+---------------------+--+
| WEB | 12332321 | 1 | NULL | /selectcat/er/ |
| WEB | 12332321 | 2 | /home/ | /er/viewad/12.html |
| WEB | 12332321 | 3 | /selectcat/er/ | /er/viewad/13.html |
| WEB | 12332321 | 4 | /er/viewad/12.html | Exit! |
+-----------+-----------+------+---------------------+---------------------+--+