Some Pig examples (syntax I use often)
In Pig, dump and store each launch their own MapReduce job; the two are not combined into a single run.
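1: Loading files with glob patterns in the path:
LOAD '/user/wizad/data/wizad/raw/2014-0{6,7-0,7-1,7-2,7-3,8}*/3_1/adwords*'
Or define a parameter: %default cleanedLog /user/wizad/data/wizad/cleaned/2014-11-{0[3-9],1[0-8]}/*/part* (this form is correct),
whereas %default cleanedLog /user/wizad/data/wizad/cleaned/2014-11-{0[3-9],[10-18]}/*/part* is wrong. Running hadoop fs -ls /user/wizad/data/wizad/cleaned/2014-11-{0[3-9],[10-18]}/ shows that [10-18] does not work as a range of numbers, so 1[0-8] has to be used instead. The reason is that a bracket expression matches exactly one character, so [10-18] means "one of 0, 1 or 8". In my test, 0[10-18] matched only the two files 01 and 08, and [100-108] behaved the same way, matching the three files 10, 11 and 18 (presumably after a leading 1). Brackets can therefore only describe a single digit. Writing it as {[10-18]} makes no difference.
Note: path patterns do not support full regular expressions; you can only use what Hadoop's glob syntax supports. Hadoop 2.0 supports: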
?
*
[abc] or [^abc]
[a-z] or [^a-z]
\c: escapes the character c; \d stands for a digit from 0 to 9
{ab,cd}
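The one-character rule for brackets can be checked outside Hadoop. The following is a minimal sketch using Python's fnmatch, whose character classes follow the same rule; it is an analogy rather than Hadoop's own glob code, and the list of day strings is made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical day-of-month directory suffixes, 01 through 20.
days = [f"{d:02d}" for d in range(1, 21)]

def matching(pattern):
    """Return the day strings that the glob pattern matches."""
    return [d for d in days if fnmatchcase(d, pattern)]

# [10-18] is ONE character from the set {'1', '0'..'1', '8'} = {0, 1, 8},
# so 0[10-18] can only match 00, 01 and 08 (00 is not in the sample list).
print(matching("0[10-18]"))   # ['01', '08']

# The working form: a literal 1 followed by one character in 0..8.
print(matching("1[0-8]"))     # ['10', '11', '12', ..., '18']
```

Note that fnmatch has no {a,b} alternation, so only the bracket part is modelled here.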
2: A few simple uses of filter:
Filtering by value
FILTER clickDate_all BY log_type=='2';
FILTER mapping_table BY mapping_ad_network_id=='3' AND mapping_type=='5';
test = FILTER allRow BY (ad_id=='14997' OR ad_id=='14998' OR ad_id=='14999') AND log_type==2;
test = FILTER allRow BY (INDEXOF(ad_id,'14997')==0 OR INDEXOF(ad_id,'14998')==0 OR INDEXOF(ad_id,'14999')==0) AND log_type==2;
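Combined with the SIZE function
FILTER count_imei BY (SIZE(cimei)>14 AND SIZE(cimei)<17);
With regular expressions
FILTER cimei2 BY NOT cimei MATCHES '^[0-9]*$';
FILTER cmac2 BY cmac MATCHES '[A-F0-9]{2}:[A-F0-9]{2}:[A-F0-9]{2}:[A-F0-9]{2}:[A-F0-9]{2}:[A-F0-9]{2}';
(MATCHES compares the whole value against a Java regular expression, so the pattern is written without /.../ delimiters; a leading or trailing slash would be treated as a literal character.)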
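Pig's MATCHES tests the entire value against a Java regular expression. A rough Python equivalent of that whole-string behaviour is re.fullmatch; the sketch below uses it to show why a MAC-address pattern wrapped in slashes never matches. The sample addresses are made up.

```python
import re

# Six colon-separated pairs of uppercase hex digits, e.g. 00:1A:2B:3C:4D:5E.
MAC = re.compile(r"[A-F0-9]{2}(:[A-F0-9]{2}){5}")
# The same pattern wrapped in slashes: the slashes are literal characters.
MAC_WITH_SLASHES = re.compile(r"/[A-F0-9]{2}(:[A-F0-9]{2}){5}/")

def is_mac(value, pattern=MAC):
    """Whole-string match, mirroring how MATCHES tests the entire value."""
    return pattern.fullmatch(value) is not None

print(is_mac("00:1A:2B:3C:4D:5E"))                    # True
print(is_mac("00:1A:2B:3C:4D"))                       # False: only five pairs
print(is_mac("00:1A:2B:3C:4D:5E", MAC_WITH_SLASHES))  # False: no slashes in the value
```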
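3: Sorting
ORDER province_count BY $2 DESC;
Note: when ordering input made of several files, e.g. part00000 and part00001 on HDFS, the order step produced a single output file in my runs, because merging everything into one file can only be done by a single reducer. The result can therefore be one very large file.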
4: CONCAT
Useful for generating a standalone column, for example putting a label column in front of a count
FOREACH origin_cleaned_data GENERATE CONCAT('<-_','->') AS cou,guid,log_type;
read_social_14 = FOREACH metadata_social_14 GENERATE CONCAT('14','=='),guid_social;
all_id = FOREACH allRow GENERATE id,CONCAT('_','-') AS cc;
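5: Handling empty values: replace an empty value with unknown.
Use of the conditional expression "(condition) ? a : b", applied directly to a column
origin_historical = FOREACH origin_cleaned_data GENERATE wizad_ad_id,guid,log_type,
((province_region_id == '') ? 'unknown' : province_region_id);
Also note: to test whether a value is null in Pig, use is null (is not null). Do not rely on == null or != null: any comparison with null evaluates to null rather than true or false, so such a filter passes no rows.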
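The difference between an empty string and null matters for the conditional expression above. The sketch below models, in Python, the null rules as I understand them from the Pig documentation (a comparison with null yields null, and a conditional whose condition is null yields null); the helper names are hypothetical and None stands in for Pig's null.

```python
def pig_eq(a, b):
    """Model of Pig's ==: the result is null (None) if either operand is null."""
    if a is None or b is None:
        return None
    return a == b

def bincond(cond, if_true, if_false):
    """Model of (cond ? a : b): a null condition yields null."""
    if cond is None:
        return None
    return if_true if cond else if_false

def empty_to_unknown(value):
    # Mirrors ((x == '') ? 'unknown' : x): nulls pass through as null.
    return bincond(pig_eq(value, ""), "unknown", value)

def empty_or_null_to_unknown(value):
    # Mirrors ((x is null OR x == '') ? 'unknown' : x).
    return "unknown" if value is None or value == "" else value

print(empty_to_unknown(""))            # unknown
print(empty_to_unknown("31"))          # 31
print(empty_to_unknown(None))          # None
print(empty_or_null_to_unknown(None))  # unknown
```

If null values should also become unknown, the condition needs an explicit is null test, as in the second helper.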
6: Splitting into subsets by value:
SPLIT geelyTuiGuang INTO android IF os_id==1, ios IF os_id==2;
SPLIT ios INTO ios6 IF (INDEXOF(os_version,'7')!=0), ios7 IF INDEXOF(os_version,'7')==0;
SPLIT allCleaned INTO log_42 IF (
((chararray)$34=='1' OR (chararray)$34=='2' OR (chararray)$34=='3' OR (chararray)$34=='1' OR (chararray)$34=='4')
AND
(INDEXOF((chararray)$35,'.')>0)
AND
((chararray)$36=='1' OR (chararray)$36=='')
),
log_43 IF (
((chararray)$34=='1' OR (chararray)$34=='2')
AND
((chararray)$35=='1' OR (chararray)$35=='2' OR (chararray)$35=='3' OR (chararray)$35=='1' OR (chararray)$35=='4')
AND
(INDEXOF((chararray)$36,'.')>0)
);
7: Replacing values with the REPLACE function
FOREACH ios6 GENERATE imei,mac_address AS cmac,REPLACE(idfa,'null','');
8: Filtering through a stream
en_guid = STREAM duimei THROUGH `awk -F"," '{if($3 == "null") print $1","$2","; else print $0}'`;
9: Casting:
cleaned_data_42 = FOREACH log_42 GENERATE
(chararray)$1 AS wizad_ad_id:chararray,
(chararray)$2 AS guid:chararray,
(chararray)$6 AS log_type:chararray,
(chararray)$18 AS imei:chararray,
(chararray)$22 AS idfa:chararray,
(chararray)$23 AS mac_address:chararray;
10: The built-in function REGEX_EXTRACT, using regular expressions:
allAdId = FOREACH allRow GENERATE REGEX_EXTRACT((chararray)$3,'(.*) (.*)',1) AS time,REGEX_EXTRACT((chararray)$0,'(.*)_(.*)',1) AS adn,$6 AS ad_id;
allAdId = FOREACH allRow GENERATE REGEX_EXTRACT(create_time,'(.*) (.*)',1) AS time,ad_id;
11: SUBSTRING(aa,0,n) extracts characters 0 through n-1:
split jn_data into same_prov if (SUBSTRING(province,0,2) == SUBSTRING(province_ad,0,2)), diff_prov if (SUBSTRING(province,0,2) != SUBSTRING(province_ad,0,2));
Extracting the minutes from a timestamp for later arithmetic
log_data = foreach click_log generate log_type,guid,ip,SUBSTRING(create_time,0,13) as time,SUBSTRING(create_time,14,16) as minute2,os_id,os_version,device_type;
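The SUBSTRING index convention (start inclusive, stop exclusive) is the same as Python slicing, which makes the offsets easy to check. A small sketch, assuming create_time looks like "YYYY-MM-DD HH:MM:SS"; that format is a guess based on the offsets used above, not something confirmed by the data.

```python
def substring(s, start, stop):
    """Characters start..stop-1, like Pig's SUBSTRING(s, start, stop)."""
    return s[start:stop]

create_time = "2014-11-03 12:34:56"   # assumed format, made-up value

hour_key = substring(create_time, 0, 13)       # date plus hour
minute = int(substring(create_time, 14, 16))   # minutes as a number

print(hour_key)  # 2014-11-03 12
print(minute)    # 34
```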
12: ABS, checking that two times are within 5 minutes of each other:
minute_compare = foreach join_data generate log_type,cookie_id,guid,(int)minute1,(int)minute2,time_extract::os_version,log_data::os_version;
same_users = filter minute_compare by (ABS(minute1-minute2) <= 5);
13: Counting rows
grp_diff_city = group diff_city all;
count_diff_city = foreach grp_diff_city generate COUNT_STAR($1);
dump count_diff_city;
14: join by multiple columns (fields)
join_data = join time_extract by (ip,time,os_id), log_data by (ip,time,os_id);
The key columns are compared pairwise, from left to right.
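To make the multi-column join and the 5-minute filter from item 12 concrete, here is a small Python model of the same logic: an inner join on the tuple (ip, time, os_id), followed by the ABS check on the minutes. The rows are made up and join_on is a hypothetical helper, not a Pig API.

```python
from collections import defaultdict

def join_on(left, right, keys):
    """Inner join of two lists of dicts on a tuple of key fields."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append((row, match))
    return joined

# Made-up rows; field names follow the statements above.
time_extract = [
    {"ip": "1.2.3.4", "time": "2014-11-03 12", "os_id": "1", "minute1": 30},
    {"ip": "1.2.3.4", "time": "2014-11-03 13", "os_id": "1", "minute1": 2},
]
log_data = [
    {"ip": "1.2.3.4", "time": "2014-11-03 12", "os_id": "1", "minute2": 33},
    {"ip": "1.2.3.4", "time": "2014-11-03 12", "os_id": "2", "minute2": 31},
    {"ip": "1.2.3.4", "time": "2014-11-03 12", "os_id": "1", "minute2": 50},
]

pairs = join_on(time_extract, log_data, ("ip", "time", "os_id"))
same_users = [(a, b) for a, b in pairs if abs(a["minute1"] - b["minute2"]) <= 5]

print(len(pairs))       # 2
print(len(same_users))  # 1
```

Because the hour is part of the join key, two events a few minutes apart but on opposite sides of an hour boundary never meet in the join, so the minute comparison only applies within the same hour.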