Reading Data in Hive with Regular Expressions
阿新 • Published: 2018-11-11
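The data loaded into Hive in the previous post (Basic Hive operations: https://blog.csdn.net/Chris_MZJ/article/details/83713882) was well formed: fields were cleanly delimited, no field contained dirty data, and every field was meaningful. Real-world data is rarely that tidy. Suppose, for example, that Hive has to read a Tomcat log in the following format: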
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
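As we can see, the log data is irregular: no single delimiter splits a line into fields. Instead, we can read the data with a regular expression. Create a table named logtbl: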
CREATE TABLE logtbl (
  host STRING,
  identity STRING,
  t_user STRING,
  time STRING,
  request STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE;
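The property "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)" is the regular expression Hive uses to read the irregular data: each capture group, in order, supplies one column. With this log format, the last two groups capture the HTTP status code and the response size, even though the table names those columns referer and agent.
org.apache.hadoop.hive.serde2.RegexSerDe is the SerDe class that applies the regular expression; it is a class from Hive's SerDe library rather than a jar in its own right.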
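To make the group-to-column mapping concrete, here is a minimal sketch that applies the same pattern to one log line outside Hive. It uses Python's `re` module as a stand-in for the Java regex engine Hive runs on, and it assumes the SerDe requires the whole line to match. The names `LOG_PATTERN`, `COLUMNS`, and `parse_line` are illustrative only and are not part of Hive.

```python
import re

# The same pattern as input.regex, written as it looks after Hive unescapes
# the string literal (\\[ becomes \[ and \" becomes ").
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*)'
)

# Column names copied from the CREATE TABLE statement, in declaration order.
COLUMNS = ["host", "identity", "t_user", "time", "request", "referer", "agent"]


def parse_line(line):
    """Return a column-to-value dict, or None if the whole line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    # Capture group i supplies the value of column i.
    return dict(zip(COLUMNS, match.groups()))


sample = '192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -'
row = parse_line(sample)
print(row)
```

The brackets around the timestamp and the quotes around the request sit outside their capture groups, which is why they do not appear in the column values.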
Create the logerr file:
# vim logerr
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
Load logerr into logtbl and query it:
hive> load data local inpath "/root/logerr" into table logtbl;
Loading data to table test.logtbl
Table test.logtbl stats: [numFiles=1, totalSize=1739]
OK
Time taken: 2.085 seconds
hive> select * from logtbl;
OK
logtbl.host logtbl.identity logtbl.t_user logtbl.time logtbl.request logtbl.referer logtbl.agent
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /bg-upper.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /bg-nav.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /asf-logo.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /bg-button.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /bg-middle.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET / HTTP/1.1 200 11217
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET / HTTP/1.1 200 11217
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.css HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /asf-logo.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-middle.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-button.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-nav.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-upper.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET / HTTP/1.1 200 11217
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.css HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET / HTTP/1.1 200 11217
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.css HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-button.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-upper.png HTTP/1.1 304 -
Time taken: 0.21 seconds, Fetched: 22 row(s)
hive>
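With the regular expression in place, Hive reads the irregular data in logerr.
Note: the raw dirty data in the file is never changed. Hive only cleans it up as it reads, then displays the result.
For example, create a file containing the following record: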
# vim err
192.168.57.4 - - 123
Then load the err file into the logtbl table created above:
hive> load data local inpath "/root/err" into table logtbl;
Loading data to table test.logtbl
Table test.logtbl stats: [numFiles=2, totalSize=1760]
OK
Time taken: 0.465 seconds
hive>
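This step succeeds, because load simply copies the file into the table's storage directory without inspecting its contents. Next, query the records in the table: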
hive> select * from logtbl;
OK
logtbl.host logtbl.identity logtbl.t_user logtbl.time logtbl.request logtbl.referer logtbl.agent
NULL NULL NULL NULL NULL NULL NULL
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /bg-upper.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /bg-nav.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /asf-logo.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /bg-button.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:35 +0800 GET /bg-middle.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET / HTTP/1.1 200 11217
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET / HTTP/1.1 200 11217
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.css HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /asf-logo.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-middle.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-button.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-nav.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-upper.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET / HTTP/1.1 200 11217
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.css HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET / HTTP/1.1 200 11217
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.css HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /tomcat.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-button.png HTTP/1.1 304 -
192.168.57.4 - - 29/Feb/2016:18:14:36 +0800 GET /bg-upper.png HTTP/1.1 304 -
Time taken: 0.126 seconds, Fetched: 23 row(s)
hive>
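The first record is all NULL. Hive reads data against the regular-expression template and cannot parse the line 192.168.57.4 - - 123, so every field comes back as NULL.
The takeaway: when Hive reads data through a regular expression, it checks the data format at read time, not at write time.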
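The read-time behaviour can be imitated in a few lines. This is a rough sketch rather than Hive's actual implementation: it assumes that a line failing to match the whole pattern yields one row in which every column is NULL (`None` here), as in the query output above. The names `read_rows`, `stored_lines`, and `NUM_COLUMNS` are made up for the illustration.

```python
import re

# Same pattern as the table's input.regex (after string-literal unescaping).
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*)'
)
NUM_COLUMNS = 7  # logtbl declares seven columns


def read_rows(lines):
    """Parse each stored line at read time; unparseable lines become all-None rows."""
    for line in lines:
        match = LOG_PATTERN.fullmatch(line)
        if match:
            yield match.groups()
        else:
            yield (None,) * NUM_COLUMNS


# "Loading" stores the lines untouched; nothing is validated at this point.
stored_lines = [
    '192.168.57.4 - - 123',
    '192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217',
]

rows = list(read_rows(stored_lines))
for r in rows:
    print(r)
```

The stored lines are never modified; only the rows produced while reading reflect the pattern, which is what checking at read time means here.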