Hive for Multi-Dimensional Statistical Analysis: Applications and Tips
Multi-dimensional statistics generally come in two flavors; let's look at how each is handled in Hive:
1. Multi-dimensional combinations over the same attributes
(1) Problem: we have the following data, whose fields are url, catePath0, catePath1, catePath2, unitparams:
```
https://cwiki.apache.org/confluence          0  1  8    {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058  0  1  23   {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto       0  1  25   {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
https://cwiki.apache.org/confluence          0  5  18   {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058  0  5  118  {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto       0  3  98   {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto       0  3  8    {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058  0  5  81   {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto       0  9  8    {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
```
(2) Requirement: for every combination of the three dimensions catePath0, catePath1, catePath2, compute the pv and uv of each url (host), e.g.:

```
0    1    23   1  1
0    1    25   1  1
0    1    8    1  1
0    1    ALL  3  3
0    3    8    1  1
0    3    98   1  1
0    3    ALL  2  1
0    5    118  1  1
0    5    18   1  1
0    5    81   1  1
0    5    ALL  3  2
0    ALL  ALL  8  3
ALL  ALL  ALL  8  3
```

(3) Approach: in Hive, multi-dimensional statistics over the same attributes are usually solved by generating every dimension combination with union all and then aggregating with group by:
```sql
-- note: the field delimiter is a real tab character ('\t')
create EXTERNAL table IF NOT EXISTS t_log (
    url string, c0 string, c1 string, c2 string, unitparams string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/tmp/decli/1';

select c0, c1, c2, count(host) PV, count(distinct host) UV
from (
    select host, c0, c1, c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
    select host, c0, c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
    select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
    select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test
group by c0, c1, c2;
```
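As an aside, Hive 0.10 and later offer GROUPING SETS, which can express the same rollups in one group by without repeating the union all branch for each combination. A sketch under that version assumption; note that the rolled-up columns come back as NULL rather than the literal 'ALL':

```sql
-- Sketch (requires Hive 0.10+): one scan produces all four dimension combinations.
-- Rolled-up columns are NULL instead of 'ALL' in the output.
select c0, c1, c2, count(host) pv, count(distinct host) uv
from t_log t0
    LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
group by c0, c1, c2
grouping sets ((c0, c1, c2), (c0, c1), (c0), ());
```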
2. Multi-dimensional combinations over different attributes
For this scenario we generally use Multi Table/File Inserts. The following is from Programming Hive, p. 124:

> Making Multiple Passes over the Same Data
>
> Hive has a special syntax for producing multiple aggregations from a single pass through a source of data, rather than rescanning it for each aggregation. This change can save considerable processing time for large input data sets. We discussed the details previously in Chapter 5.
>
> For example, each of the following two queries creates a table from the same source table, history:
>
> ```sql
> hive> INSERT OVERWRITE TABLE sales
>     > SELECT * FROM history WHERE action='purchased';
> hive> INSERT OVERWRITE TABLE credits
>     > SELECT * FROM history WHERE action='returned';
> ```
>
> This syntax is correct, but inefficient. The following rewrite achieves the same thing, but using a single pass through the source history table:
>
> ```sql
> hive> FROM history
>     > INSERT OVERWRITE sales SELECT * WHERE action='purchased'
>     > INSERT OVERWRITE credits SELECT * WHERE action='returned';
> ```
```sql
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;
```
https://cwiki.apache.org/confluence/display/Hive/Tutorial
Notes and a few tips:
1. union all in Hive: it is not supported at the top level (wrap it in a subquery), and every select branch must have exactly matching column names and types.
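For instance, a top-level union all like the commented statement below is rejected by older Hive and must be wrapped in a subquery (the table names t_a and t_b here are hypothetical):

```sql
-- Rejected by older Hive: a top-level union all
-- select key, cnt from t_a union all select key, cnt from t_b;

-- Works: wrap the union in a subquery; each branch's
-- column names and types must match exactly
select * from (
    select key, cnt from t_a
    union all
    select key, cnt from t_b
) tmp;
```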
2. Ordering of the results: you can add your own marker characters to control the sort order.
3. A multi-insert, like union all, also scans the source only once, but because it inserts into multiple partitions it does a lot of extra work and can take far longer; it spawns multiple jobs, whereas a union all by itself runs as a single job.

On running the multiple jobs produced by insert overwrite in parallel:

```sql
set hive.exec.parallel=true;              -- enable concurrent job execution
set hive.exec.parallel.thread.number=16;  -- max parallelism for one SQL statement; default is 8
```

http://superlxw1234.iteye.com/blog/1703713
4. Hive currently does not support a subquery inside not in; an HQL statement like the following (find rows whose key is in table a but not in table b) is not supported:

```sql
select a.key from a where key not in (select key from b);  -- not supported in Hive
```

You can rewrite it with a left outer join (assuming table b contains another field, key1):

```sql
select a.key from a left outer join b on a.key = b.key where b.key1 is null;
```
5. left outer join cannot be chained three or more times in a row; wrap the joins up two at a time, as in:
```sql
select p.ssi, p.pv, p.uv, p.nuv, p.visits, '2012-06-19 17:00:00' from (
    select * from (
        select * from (
            select ssi, count(1) pv, sum(visits) visits from FactClickAnalysis
            where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00'
            group by ssi
        ) p1
        left outer join (
            select ssi, count(1) uv from (
                select ssi, cookieid from FactClickAnalysis
                where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00'
                group by ssi, cookieid
            ) t1 group by ssi
        ) p2 on p1.ssi = p2.ssi
    ) p3
    left outer join (
        select ssi, count(1) nuv from FactClickAnalysis
        where logTime = insertTime
          and logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00'
        group by ssi
    ) p4 on p3.ssi = p4.ssi
) p
```
6. Running MapReduce locally in Hive
http://superlxw1234.iteye.com/blog/1703546
7. An error hit when creating too many dynamic partitions in Hive
http://superlxw1234.iteye.com/blog/1677938
8. Making clever use of greedy regex matching in Hive
http://superlxw1234.iteye.com/blog/1751216
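The linked post is about Hive's Java-style regexes being greedy by default; a tiny illustration with regexp_extract (the literal input string here is made up for the example):

```sql
-- Non-greedy (.*?) stops as soon as \d+ can match: group 1 is 'hello'
select regexp_extract('hello123world456', '^(.*?)\\d+', 1);

-- Greedy (.*) grabs as much as possible, leaving only one digit
-- for \d+: group 1 is 'hello123world45'
select regexp_extract('hello123world456', '^(.*)\\d+', 1);
```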
9. Matching fields that are entirely Chinese in Hive
Just use Java's regex for matching Chinese characters:

```sql
name rlike '^[\u4e00-\u9fa5]+$'
```

To check whether a field is all digits:

```sql
select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\d+$' limit 50;
```
10. Using the SQL window functions LAG/LEAD/FIRST/LAST in Hive
http://superlxw1234.iteye.com/blog/1600323
http://www.shaoqun.com/a/18839.aspx
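In Hive 0.11 and later these are built in as windowing functions; a sketch under that version assumption (the table and column names are hypothetical):

```sql
-- Requires Hive 0.11+: previous and next login time per user
select userid, logtime,
       lag(logtime, 1)  over (partition by userid order by logtime) as prev_logtime,
       lead(logtime, 1) over (partition by userid order by logtime) as next_logtime
from t_login_log;
```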
11. Hive optimization: controlling the number of map and reduce tasks in a Hive job
http://superlxw1234.iteye.com/blog/1582880
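The usual knobs discussed in posts of that era (old pre-YARN mapred.* parameter names) look roughly like this:

```sql
-- Influence the number of mappers by bounding split sizes (bytes)
set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Influence the number of reducers
set hive.exec.reducers.bytes.per.reducer=500000000;  -- target data volume per reducer
set mapred.reduce.tasks=15;                          -- or fix the count outright
```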
12. Escaping $ and other special characters in Hive
http://superlxw1234.iteye.com/blog/1568739
13. Date handling:
Get the date N days ago:

```sql
select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400, 'yyyyMMdd') from t_lxw_test1 limit 1;
```

Get the number of days/seconds/minutes, etc. between two dates:

```sql
select ( unix_timestamp('2011-11-02','yyyy-MM-dd') - unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400 from t_lxw_test limit 1;
```
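In reasonably recent Hive versions the built-in date UDFs can replace the unix_timestamp arithmetic above; a sketch:

```sql
-- Hive 0.13+ allows select without a from clause; on older versions
-- tack on "from some_table limit 1" as in the examples above.
select datediff('2011-11-02', '2011-11-01');  -- days between two dates: 1
select date_sub('2011-11-02', 7);             -- 7 days earlier: '2011-10-26'
```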
14. Cleaning up Hive temp files under hive.exec.scratchdir
http://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576
REF:
http://superlxw1234.iteye.com/blog/1536440
http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/
http://blog.csdn.net/azhao_dn/article/details/6921429