
Hive for Multidimensional Statistical Analysis: Applications and Tips

Multidimensional statistics generally come in two kinds. Let's look at how to handle each in Hive:

1. Multidimensional combination statistics over the same attribute

(1) Problem: given the data below, where the fields are, in order: url, catePath0, catePath1, catePath2, unitparams

https://cwiki.apache.org/confluence 0 1 8 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 1 23 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 1 25 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
https://cwiki.apache.org/confluence 0 5 18 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 118 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 98 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 3 8 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058 0 5 81 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto 0 9 8 {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}

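(2) Requirement: compute the pv and uv of the urls under each combination of the three dimensions catePath0, catePath1, catePath2, for example (columns: catePath0, catePath1, catePath2, pv, uv):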

0 1 23 1 1
0 1 25 1 1
0 1 8 1 1
0 1 ALL 3 3
0 3 8 1 1
0 3 98 1 1
0 3 ALL 2 1
0 5 118 1 1
0 5 18 1 1
0 5 81 1 1
0 5 ALL 3 2
0 ALL ALL 8 3
ALL ALL ALL 8 3

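(3) Approach: in Hive, same-attribute multidimensional statistics are usually solved by using union all to assemble every dimension combination and then applying group by: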

create EXTERNAL table IF NOT EXISTS t_log (
	url string, c0 string, c1 string, c2 string, unitparams string
)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/tmp/decli/1';

select * from (
		select host, c0, c1, c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, c0, c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test;

select c0, c1, c2, count(host) PV, count(distinct(host)) UV from (
		select host, c0, c1, c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, c0, c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
	union all
		select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0 
		LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host 
		where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test group by c0, c1, c2;
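The four union all branches above enumerate exactly the levels (c0, c1, c2), (c0, c1), (c0) and the grand total, which is what a rollup produces. As a hedged alternative, not the method used in this post, the same shape can be sketched with GROUP BY ... WITH ROLLUP, assuming a Hive version that supports it (0.10 or later, as far as I know). The aliases dim0/dim1/dim2 and the subquery name filtered are made up for illustration. Rolled-up levels come back as NULL, so coalesce maps them to 'ALL'; this would also relabel genuine NULLs in the data, and how count(distinct) interacts with rollup may vary by version, so verify on your own cluster.

```sql
-- Sketch only: assumes the t_log table defined above and a Hive version
-- that supports GROUP BY ... WITH ROLLUP.
SELECT
    coalesce(c0, 'ALL')  AS dim0,   -- rolled-up levels are returned as NULL
    coalesce(c1, 'ALL')  AS dim1,
    coalesce(c2, 'ALL')  AS dim2,
    count(host)          AS pv,
    count(DISTINCT host) AS uv
FROM (
    -- parse the host and apply the filter once instead of once per branch
    SELECT host, c0, c1, c2
    FROM t_log t0
    LATERAL VIEW parse_url_tuple(url, 'HOST') t1 AS host
    WHERE get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) filtered
GROUP BY c0, c1, c2 WITH ROLLUP;
```

Compared with the union all version, the source is read and filtered in one place, so a change to the where clause only has to be made once.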

2. Multidimensional combination statistics over different attributes

For this scenario we generally choose Multi Table/File Inserts. The following is taken from "Programming Hive", p. 124:

Making Multiple Passes over the Same Data

Hive has a special syntax for producing multiple aggregations from a single pass through a source of data, rather than rescanning it for each aggregation. This change can save considerable processing time for large input data sets. We discussed the details previously in Chapter 5.

For example, each of the following two queries creates a table from the same source table, history:

hive> INSERT OVERWRITE TABLE sales
    > SELECT * FROM history WHERE action='purchased';
hive> INSERT OVERWRITE TABLE credits
    > SELECT * FROM history WHERE action='returned';

This syntax is correct, but inefficient. The following rewrite achieves the same thing, but using a single pass through the source history table:

hive> FROM history
    > INSERT OVERWRITE sales   SELECT * WHERE action='purchased'
    > INSERT OVERWRITE credits SELECT * WHERE action='returned';

FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.gender

    INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
        SELECT pv_users.age, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.age;

https://cwiki.apache.org/confluence/display/Hive/Tutorial

Notes and a few tips:

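1. Using union all in Hive: it is not supported at the top level of a query (this applies to Hive versions before 0.13; wrap it in a subquery as in the examples above), and the column names and types of every select must match exactly.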

2. Result ordering: you can add your own characters to the values to control the sort order.
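One way to read this tip, sketched here under assumptions: the dimension columns are strings, so '118' sorts before '18', and 'ALL' sorts after digits only because of its character code. Adding an explicit sort key makes the order deliberate. The table t_log_stats (holding the c0, c1, c2, pv, uv result of the query in section 1) and the column c2_sort are hypothetical names.

```sql
-- Sketch only: t_log_stats is a hypothetical table with the statistics result.
SELECT c0, c1, c2, pv, uv, c2_sort
FROM (
    SELECT c0, c1, c2, pv, uv,
           -- 'ALL' rows get the largest int so they sort after every real value;
           -- real values are compared numerically instead of as strings
           if(c2 = 'ALL', 2147483647, cast(c2 AS int)) AS c2_sort
    FROM t_log_stats
) keyed
ORDER BY c0, c1, c2_sort;
```

The sort key is kept in the select list because some older Hive versions only accept order by on selected columns.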

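3. Like union all, a multi-insert scans the source only once. But because it has to insert into several partitions it does a lot of additional work, which makes it very slow: it produces multiple jobs, whereas union all itself needs only one job.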

On running the multiple jobs produced by insert overwrite in parallel:

-- enable parallel execution of jobs
set hive.exec.parallel=true;
-- maximum degree of parallelism allowed for a single SQL statement; the default is 8
set hive.exec.parallel.thread.number=16;

http://superlxw1234.iteye.com/blog/1703713

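4. Hive currently does not support a subquery inside not in. An HQL statement like the following, which looks for rows whose key is in table a but not in table b, is not supported:

select a.key from a where key not in(select key from b)

Use a left outer join instead (assuming table b contains another column, key1; testing b.key is null works as well and does not need the extra column):

select a.key from a left outer join b on a.key=b.key where b.key1 is null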
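For newer Hive versions the restriction is looser. As far as I know, Hive 0.13 and later accept subqueries in the where clause, including correlated NOT EXISTS, so the anti-join can be sketched as below. This assumes the same tables a and b with a key column as in the tip, and it is an alternative to the left outer join, not a replacement for it on older versions.

```sql
-- Sketch only: assumes Hive 0.13 or later and the tables a and b from the tip.
-- Returns the keys of a that have no matching key in b.
SELECT a.key
FROM a
WHERE NOT EXISTS (
    SELECT 1
    FROM b
    WHERE b.key = a.key   -- the subquery must be correlated with the outer table
);
```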

5. left outer join cannot be used three or more times in a row; the joins must be wrapped in pairs, two to a group.

select p.ssi,p.pv,p.uv,p.nuv,p.visits,'2012-06-19 17:00:00' from (
	select * from (
		select * from (select ssi,count(1) pv,sum(visits) visits from FactClickAnalysis  
		where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi ) p1
		left outer join 
		(
		select ssi,count(1) uv from (select ssi,cookieid from FactClickAnalysis 
		where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi,cookieid ) t1 group by ssi 
		) p2 on p1.ssi=p2.ssi
	) p3
	left outer join
	(
		select ssi, count(1) nuv from FactClickAnalysis 
		where logTime = insertTime and logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi 
	) p4 on p3.ssi=p4.ssi
) p
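When every branch reads the same table over the same time window, as in the query above, the joins can sometimes be avoided altogether. The following is a hedged sketch of that idea using conditional aggregation; it is not the author's query, and it differs in two small ways: count(distinct cookieid) ignores NULL cookie ids, and nuv is 0 rather than NULL for an ssi with no matching rows. The alias stat_time is made up for illustration.

```sql
-- Sketch only: one pass over FactClickAnalysis instead of nested left outer joins.
SELECT
    ssi,
    count(1)                            AS pv,
    count(DISTINCT cookieid)            AS uv,        -- distinct cookies per ssi
    sum(if(logTime = insertTime, 1, 0)) AS nuv,       -- rows where logTime = insertTime
    sum(visits)                         AS visits,
    '2012-06-19 17:00:00'               AS stat_time  -- hypothetical alias
FROM FactClickAnalysis
WHERE logTime >= '2012-06-19 17:00:00'
  AND logTime <= '2012-06-19 18:00:00'
GROUP BY ssi;
```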

6. Running MapReduce locally in Hive

http://superlxw1234.iteye.com/blog/1703546

7. An error encountered when Hive creates too many dynamic partitions

http://superlxw1234.iteye.com/blog/1677938

8. Making good use of greedy regular-expression matching in Hive

http://superlxw1234.iteye.com/blog/1751216

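9. Matching fields that consist entirely of Chinese characters in Hive

The Java regular expression for matching Chinese characters works as is: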

name rlike '^[\u4e00-\u9fa5]+$'

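Checking whether a field consists entirely of digits (the backslash has to be doubled inside a Hive string literal):

select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\\d+$' limit 50;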

10. Using the SQL window functions LAG/LEAD/FIRST/LAST in Hive

http://superlxw1234.iteye.com/blog/1600323

http://www.shaoqun.com/a/18839.aspx
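For readers on a Hive version with built-in windowing (0.11 or later, as far as I know), a minimal sketch of these functions follows. It reuses the FactClickAnalysis table and its cookieid and logTime columns from tip 5 purely as an example; the output aliases are made up, and the linked posts may describe a different approach.

```sql
-- Sketch only: assumes built-in window functions and the FactClickAnalysis
-- table from tip 5.
SELECT
    cookieid,
    logTime,
    -- previous and next click time of the same cookie (NULL at the edges)
    lag(logTime, 1)  OVER (PARTITION BY cookieid ORDER BY logTime) AS prev_log_time,
    lead(logTime, 1) OVER (PARTITION BY cookieid ORDER BY logTime) AS next_log_time,
    -- earliest click time of the same cookie
    first_value(logTime) OVER (PARTITION BY cookieid ORDER BY logTime) AS first_log_time
FROM FactClickAnalysis;
```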

11. Hive optimization: controlling the number of map and reduce tasks in a Hive job

http://superlxw1234.iteye.com/blog/1582880

12. Escaping $ and other special characters in Hive

http://superlxw1234.iteye.com/blog/1568739

13. Date handling:

Get the date N days ago:

select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400,'yyyyMMdd') from t_lxw_test1 limit 1;  

Get the number of days, seconds, minutes and so on between two dates:

select ( unix_timestamp('2011-11-02','yyyy-MM-dd')-unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400  from t_lxw_test limit 1;  
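When the dates are already in yyyy-MM-dd form, Hive's built-in date_sub and datediff express the same two calculations without the 86400 arithmetic, which also sidesteps off-by-one-hour effects around daylight-saving changes. A minimal sketch, using 7 as a stand-in for N and the same test tables as above:

```sql
-- Date N days ago; the input must be in yyyy-MM-dd form, 7 stands in for N.
SELECT date_sub('2011-11-02', 7) FROM t_lxw_test1 LIMIT 1;            -- 2011-10-26

-- Whole days between two dates (first argument minus second).
SELECT datediff('2011-11-02', '2011-11-01') FROM t_lxw_test LIMIT 1;  -- 1
```

For yyyyMMdd input such as '20111102', or for differences in seconds or minutes, the unix_timestamp approach shown above is still the one to use.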

14. Deleting Hive temporary files (hive.exec.scratchdir)

http://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576

REF:

http://superlxw1234.iteye.com/blog/1536440

http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/

http://blog.csdn.net/azhao_dn/article/details/6921429

http://superlxw1234.iteye.com/category/228899