Hive for Multi-Dimensional Statistical Analysis: Applications and Tips
Multi-dimensional statistics generally come in two flavors; let's look at how each is handled in Hive:
1. Multi-dimensional combinations over the same attributes
(1) Problem: we have the following data, whose fields are url, catePath0, catePath1, catePath2, unitparams:
```
https://cwiki.apache.org/confluence          0  1  8    {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058  0  1  23   {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto       0  1  25   {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
https://cwiki.apache.org/confluence          0  5  18   {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058  0  5  118  {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto       0  3  98   {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto       0  3  8    {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://my.oschina.net/leejun2005/blog/83058  0  5  81   {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
http://www.hao123.com/indexnt.html?sto       0  9  8    {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
```
(2) Requirement: for every combination of the three dimensions catePath0, catePath1, catePath2, compute the pv and uv of each url (host), e.g.:

```
0    1    23   1  1
0    1    25   1  1
0    1    8    1  1
0    1    ALL  3  3
0    3    8    1  1
0    3    98   1  1
0    3    ALL  2  1
0    5    118  1  1
0    5    18   1  1
0    5    81   1  1
0    5    ALL  3  2
0    ALL  ALL  8  3
ALL  ALL  ALL  8  3
```

(3) Approach: in Hive, multi-dimensional statistics over the same attributes are usually solved by generating every dimension combination with union all and then aggregating with group by:
```sql
-- note: the field delimiter is a real tab character ('\t')
create EXTERNAL table IF NOT EXISTS t_log (
    url string, c0 string, c1 string, c2 string, unitparams string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/tmp/decli/1';

select c0, c1, c2, count(host) PV, count(distinct host) UV
from (
    select host, c0, c1, c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
    select host, c0, c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
    select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
    select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test
group by c0, c1, c2;
```
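As an aside, Hive 0.10 and later offer GROUPING SETS, which can express the same rollups in one group by without repeating the union all branch for each combination. A sketch under that version assumption; note that the rolled-up columns come back as NULL rather than the literal 'ALL':

```sql
-- Sketch (requires Hive 0.10+): one scan produces all four dimension combinations.
-- Rolled-up columns are NULL instead of 'ALL' in the output.
select c0, c1, c2, count(host) pv, count(distinct host) uv
from t_log t0
    LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
group by c0, c1, c2
grouping sets ((c0, c1, c2), (c0, c1), (c0), ());
```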
2. Multi-dimensional combinations over different attributes
For this scenario we generally use Multi Table/File Inserts. The following is from Programming Hive, p. 124:

> Making Multiple Passes over the Same Data
>
> Hive has a special syntax for producing multiple aggregations from a single pass through a source of data, rather than rescanning it for each aggregation. This change can save considerable processing time for large input data sets. We discussed the details previously in Chapter 5.
>
> For example, each of the following two queries creates a table from the same source table, history:
>
> ```sql
> hive> INSERT OVERWRITE TABLE sales
>     > SELECT * FROM history WHERE action='purchased';
> hive> INSERT OVERWRITE TABLE credits
>     > SELECT * FROM history WHERE action='returned';
> ```
>
> This syntax is correct, but inefficient. The following rewrite achieves the same thing, but using a single pass through the source history table:
>
> ```sql
> hive> FROM history
>     > INSERT OVERWRITE sales SELECT * WHERE action='purchased'
>     > INSERT OVERWRITE credits SELECT * WHERE action='returned';
> ```
```sql
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;
```
https://cwiki.apache.org/confluence/display/Hive/Tutorial
Notes and a few tips:
1. union all in Hive: it is not supported at the top level (wrap it in a subquery), and every select branch must have exactly matching column names and types.
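For instance, a top-level union all like the commented statement below is rejected by older Hive and must be wrapped in a subquery (the table names t_a and t_b here are hypothetical):

```sql
-- Rejected by older Hive: a top-level union all
-- select key, cnt from t_a union all select key, cnt from t_b;

-- Works: wrap the union in a subquery; each branch's
-- column names and types must match exactly
select * from (
    select key, cnt from t_a
    union all
    select key, cnt from t_b
) tmp;
```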
2. Ordering of the results: you can add your own marker characters to control the sort order.
3. A multi-insert, like union all, also scans the source only once, but because it inserts into multiple partitions it does a lot of extra work and can take far longer; it spawns multiple jobs, whereas a union all by itself runs as a single job.

On running the multiple jobs produced by insert overwrite in parallel:

```sql
set hive.exec.parallel=true;              -- enable concurrent job execution
set hive.exec.parallel.thread.number=16;  -- max parallelism for one SQL statement; default is 8
```

http://superlxw1234.iteye.com/blog/1703713
4. Hive currently does not support a subquery inside not in; an HQL statement like the following (find rows whose key is in table a but not in table b) is not supported:

```sql
select a.key from a where key not in (select key from b);  -- not supported in Hive
```

You can rewrite it with a left outer join (assuming table b contains another field, key1):

```sql
select a.key from a left outer join b on a.key = b.key where b.key1 is null;
```
5. left outer join cannot be chained three or more times in a row; wrap the joins up two at a time, as in:
```sql
select p.ssi, p.pv, p.uv, p.nuv, p.visits, '2012-06-19 17:00:00' from (
    select * from (
        select * from (
            select ssi, count(1) pv, sum(visits) visits from FactClickAnalysis
            where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00'
            group by ssi
        ) p1
        left outer join (
            select ssi, count(1) uv from (
                select ssi, cookieid from FactClickAnalysis
                where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00'
                group by ssi, cookieid
            ) t1 group by ssi
        ) p2 on p1.ssi = p2.ssi
    ) p3
    left outer join (
        select ssi, count(1) nuv from FactClickAnalysis
        where logTime = insertTime
          and logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00'
        group by ssi
    ) p4 on p3.ssi = p4.ssi
) p
```
6. Running MapReduce locally in Hive
http://superlxw1234.iteye.com/blog/1703546
7. An error hit when creating too many dynamic partitions in Hive
http://superlxw1234.iteye.com/blog/1677938
8. Making clever use of greedy regex matching in Hive
http://superlxw1234.iteye.com/blog/1751216
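The linked post is about Hive's Java-style regexes being greedy by default; a tiny illustration with regexp_extract (the literal input string here is made up for the example):

```sql
-- Non-greedy (.*?) stops as soon as \d+ can match: group 1 is 'hello'
select regexp_extract('hello123world456', '^(.*?)\\d+', 1);

-- Greedy (.*) grabs as much as possible, leaving only one digit
-- for \d+: group 1 is 'hello123world45'
select regexp_extract('hello123world456', '^(.*)\\d+', 1);
```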
9. Matching fields that are entirely Chinese in Hive
Just use Java's regex for matching Chinese characters:

```sql
name rlike '^[\u4e00-\u9fa5]+$'
```

To check whether a field is all digits:

```sql
select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\d+$' limit 50;
```
10. Using the SQL window functions LAG/LEAD/FIRST/LAST in Hive
http://superlxw1234.iteye.com/blog/1600323
http://www.shaoqun.com/a/18839.aspx
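In Hive 0.11 and later these are built in as windowing functions; a sketch under that version assumption (the table and column names are hypothetical):

```sql
-- Requires Hive 0.11+: previous and next login time per user
select userid, logtime,
       lag(logtime, 1)  over (partition by userid order by logtime) as prev_logtime,
       lead(logtime, 1) over (partition by userid order by logtime) as next_logtime
from t_login_log;
```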
11. Hive optimization: controlling the number of map and reduce tasks in a Hive job
http://superlxw1234.iteye.com/blog/1582880
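The usual knobs discussed in posts of that era (old pre-YARN mapred.* parameter names) look roughly like this:

```sql
-- Influence the number of mappers by bounding split sizes (bytes)
set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Influence the number of reducers
set hive.exec.reducers.bytes.per.reducer=500000000;  -- target data volume per reducer
set mapred.reduce.tasks=15;                          -- or fix the count outright
```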
12. Escaping $ and other special characters in Hive
http://superlxw1234.iteye.com/blog/1568739
13. Date handling:
Get the date N days ago:

```sql
select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400, 'yyyyMMdd') from t_lxw_test1 limit 1;
```

Get the number of days/seconds/minutes, etc. between two dates:

```sql
select ( unix_timestamp('2011-11-02','yyyy-MM-dd') - unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400 from t_lxw_test limit 1;
```
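In reasonably recent Hive versions the built-in date UDFs can replace the unix_timestamp arithmetic above; a sketch:

```sql
-- Hive 0.13+ allows select without a from clause; on older versions
-- tack on "from some_table limit 1" as in the examples above.
select datediff('2011-11-02', '2011-11-01');  -- days between two dates: 1
select date_sub('2011-11-02', 7);             -- 7 days earlier: '2011-10-26'
```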
14. Cleaning up Hive temp files under hive.exec.scratchdir
http://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576
REF:
http://superlxw1234.iteye.com/blog/1536440
http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/
http://blog.csdn.net/azhao_dn/article/details/6921429