1. 程式人生 > >Hive使用正則表示式讀取資料

Hive使用正則表示式讀取資料

上一篇部落格中hive中載入的資料都是比較規整的(Hive的基本操作:https://blog.csdn.net/Chris_MZJ/article/details/83713882),欄位與

欄位之間都是分割好的,每一個欄位都不是髒資料,並且每一個欄位都是有意義的但是在真實場景中不見得這個盡人意。比如hive要讀取以下格式的tomcat的執行日誌:

192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -

我們可以看到,日誌中的資料是不規則的,不能憑某一個分隔符就能將資料分割成欄位,為此,可以通過使用正則表示式來讀取資料,建立一個logtbl表:

 CREATE TABLE logtbl (
    host STRING,
    identity STRING,
    t_user STRING,
    time STRING,
    request STRING,
    referer STRING,
    agent STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
  )
  STORED AS TEXTFILE;

input.regex" = "([^ ]) ([^ ]) ([^ ]) \[(.)\] "(.)" (-|[0-9]) (-|[0-9]*)"就是通過正則表示式來讀取不規則資料,

org.apache.hadoop.hive.serde2.RegexSerDe是引用的jar包。

建立logerr檔案:

[[email protected] ~]# vim logerr

192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-nav.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-upper.png HTTP/1.1" 304 -

將logerr載入到logtbl中並查詢:

hive> load data local inpath "/root/logerr" into table logtbl;
Loading data to table test.logtbl
Table test.logtbl stats: [numFiles=1, totalSize=1739]
OK
Time taken: 2.085 seconds
hive> select * from logtbl;
OK
logtbl.host	logtbl.identity	logtbl.t_user	logtbl.time	logtbl.request	logtbl.referer	logtbl.agent
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-upper.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-nav.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /asf-logo.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-middle.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /asf-logo.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-middle.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-nav.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-upper.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-upper.png HTTP/1.1	304	-
Time taken: 0.21 seconds, Fetched: 22 row(s)
hive> 

通過正則表示式,讀取到了logerr中的不規則資料。

注意:檔案中原始髒資料不會變的,只是hive在讀的時候,將髒資料清理掉再顯示出來。

例如,建立包含下面這條資料的檔案:

[[email protected] ~]# vim err

192.168.57.4 - - 123

然後將err檔案載入到上面已經建立好的logtbl表中:

hive> load data local inpath "/root/err" into table logtbl;
Loading data to table test.logtbl
Table test.logtbl stats: [numFiles=2, totalSize=1760]
OK
Time taken: 0.465 seconds
hive> 

這一步沒有問題,因為load就是將資料拷貝到工作目錄區中,接下來查詢表中的記錄:

hive> select * from logtbl;
OK
logtbl.host	logtbl.identity	logtbl.t_user	logtbl.time	logtbl.request	logtbl.referer	logtbl.agent
NULL	NULL	NULL	NULL	NULL	NULL	NULL
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-upper.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-nav.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /asf-logo.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:35 +0800	GET /bg-middle.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /asf-logo.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-middle.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-nav.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-upper.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET / HTTP/1.1	200	11217
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.css HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /tomcat.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-button.png HTTP/1.1	304	-
192.168.57.4	-	-	29/Feb/2016:18:14:36 +0800	GET /bg-upper.png HTTP/1.1	304	-
Time taken: 0.126 seconds, Fetched: 23 row(s)
hive> 

可以看到第一條記錄全是NULL,那是因為hive因為根據正則表示式的模板來讀資料,讀不懂192.168.57.4 - - 123這條資料,所以欄位全是NULL。

由此可以得出:hive根據正則表示式讀資料的時候,是讀時檢查資料格式對不對,而不是寫時檢查。