Using Data Lake Analytics to quickly analyze log files on OSS
阿新 • Published: 2018-12-14
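Background
Data Lake Analytics (hereafter DLA) is a serverless, cloud-based interactive query and analysis service. Users can run standard SQL statements to quickly query and analyze data stored on OSS, OTS, RDS, and other storage media.
Log files play a pivotal role in big data analysis. For any service, the log files usually record the full details of how it runs. Troubleshooting, status monitoring, and predictive alerting all depend on querying and analyzing log files. Because OSS is cost-effective, more and more users choose to store large volumes of log files in OSS. DLA can query and analyze log files on OSS directly, without moving them.
This article describes how to use DLA to query log files in common formats.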
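Querying log files with DLA
A log file must meet the following conditions for DLA to analyze it:
- The log file is in plain-text format, and each line can be mapped to one record in a table;
- Every line follows a fixed pattern that can be matched by a single regular expression.
Log file support is currently limited to OSS data sources, so users need to store their log files in OSS beforehand.
When creating a table over a log file, the most troublesome step is writing the regular expression. The sections below take common log files as examples and give a regular expression for each file type for reference.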
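Because both conditions come down to whether one regular expression covers every line, it can help to try a candidate pattern locally before creating the table. The following is only a minimal sketch in Python, not part of DLA: the helper name unmatched_lines, the three-field pattern, and the sample lines are all hypothetical. It assumes the whole line has to match, which is how Hive's RegexSerDe applies input.regex. Python's re module and Java's regex engine agree on the simple constructs used here, but they are not identical in general.

```python
import re


def unmatched_lines(pattern, lines):
    """Return the lines that the pattern does not match in full."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.fullmatch(line) is None]


# Hypothetical three-field format: "<LEVEL> <code> <message>"
sample_pattern = r"([A-Z]+) ([0-9]+) (.*)"
sample_lines = [
    "INFO 200 request served",
    "WARN 404 page not found",
    "this line does not follow the format",
]

# Lines listed here would not map cleanly onto table columns.
bad = unmatched_lines(sample_pattern, sample_lines)
print(bad)
```

Running a check like this over a few hundred real lines is usually enough to show whether the pattern needs an optional group or a looser character class.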
Apache WebServer logs
File content
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"
Regular expression
([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?
CREATE TABLE statement
CREATE EXTERNAL TABLE webserver_log(
host STRING,
identity STRING,
userName STRING,
time STRING,
request STRING,
status STRING,
size INT,
referer STRING,
agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
LOCATION 'oss://mybucket/datasets/path/to/webserver.log';
Query result
mysql> select * from webserver_log;
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| host | identity | userName | time | request | status | size | referer | agent |
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| 127.0.0.1 | - | frank | [10/Oct/2000:13:55:36 -0700] | "GET /apache_pb.gif HTTP/1.0" | 200 | 2326 | NULL | NULL |
| 127.0.0.1 | - | - | [26/May/2009:00:00:00 +0000] | "GET /someurl/?track=Blabla(Main) HTTP/1.1" | 200 | 5864 | - | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19" |
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
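The NULL values in the first row come from the optional trailing group in the regular expression: the first log line has no referer or agent, so those two capture groups stay empty. This can be reproduced locally with a short Python sketch. The names APACHE_REGEX, COLUMNS, and parse_line are illustrative only, and the pattern is written without the SQL string escaping (a single backslash instead of a double one, plain double quotes).

```python
import re

# Same pattern as in the CREATE TABLE statement, with SQL escaping removed.
APACHE_REGEX = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)
COLUMNS = ["host", "identity", "userName", "time", "request",
           "status", "size", "referer", "agent"]


def parse_line(line):
    """Map one log line to a column dict, or None if it does not match."""
    match = APACHE_REGEX.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


short = parse_line(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
full = parse_line(
    '127.0.0.1 - - [26/May/2009:00:00:00 +0000] '
    '"GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - '
    '"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 '
    '(KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"'
)

# Unmatched optional groups are None in Python, shown as NULL in the query result.
print(short["referer"], short["agent"])
print(full["referer"])
```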
Nginx access logs
Take the format given in the NGINX documentation as an example: https://docs.nginx.com/nginx/admin-guide/monitoring/logging/#
File content
127.0.0.1 - - [14/May/2018:21:58:04 +0800] "GET /?stat HTTP/1.1" 200 182 "-" "aliyun-sdk-java/2.6.0(Linux/2.6.32-220.23.2.ali927.el5.x86_64/amd64;1.6.0_24)" "-"
127.0.0.1 - - [14/May/2018:21:58:04 +0800] "GET /?prefix=&delimiter=%2F&max-keys=100&encoding-type=url HTTP/1.1" 200 7202 "https://help.aliyun.com/product/70174.html" "aliyun-sdk-java/2.6.0(Linux/2.6.32-220.23.2.ali927.el5.x86_64/amd64;1.6.0_24)" "-"
Regular expression
([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))? ([^ \"]*|\"[^\"]*\")
CREATE TABLE statement
CREATE EXTERNAL TABLE ngnix_log(
remote_address STRING,
identity STRING,
remote_user STRING,
time_local STRING,
request STRING,
status STRING,
body_bytes_sent INT,
http_referer STRING,
http_user_agent STRING,
gzip_ratio STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))? ([^ \"]*|\"[^\"]*\")"
)
STORED AS TEXTFILE
LOCATION 'oss://mybucket/datasets/path/to/ngnix_log';
Query result
mysql> select * from ngnix_log;
+----------------+----------+-------------+------------------------------+-----------------------------------------------------------------------+--------+-----------------+----------------------------------------------+---------------------------------------------------------------------------------+------------+
| remote_address | identity | remote_user | time_local | request | status | body_bytes_sent | http_referer | http_user_agent | gzip_ratio |
+----------------+----------+-------------+------------------------------+-----------------------------------------------------------------------+--------+-----------------+----------------------------------------------+---------------------------------------------------------------------------------+------------+
| 127.0.0.1 | - | - | [14/May/2018:21:58:04 +0800] | "GET /?stat HTTP/1.1" | 200 | 182 | "-" | "aliyun-sdk-java/2.6.0(Linux/2.6.32-220.23.2.ali927.el5.x86_64/amd64;1.6.0_24)" | "-" |
| 127.0.0.1 | - | - | [14/May/2018:21:58:04 +0800] | "GET /?prefix=&delimiter=%2F&max-keys=100&encoding-type=url HTTP/1.1" | 200 | 7202 | "https://help.aliyun.com/product/70174.html" | "aliyun-sdk-java/2.6.0(Linux/2.6.32-220.23.2.ali927.el5.x86_64/amd64;1.6.0_24)" | "-" |
+----------------+----------+-------------+------------------------------+-----------------------------------------------------------------------+--------+-----------------+----------------------------------------------+---------------------------------------------------------------------------------+------------+
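The value of "input.regex" is a SQL string literal, so the pattern is written there with doubled backslashes and escaped double quotes. The sketch below shows one way to recover the actual pattern and test it against the first sample line. It assumes the engine un-escapes \\ to \ and \" to " the way Hive string literals do; the helper ddl_literal_to_regex is hypothetical and handles only those two escapes.

```python
import re

# The literal exactly as it appears between the outer quotes in the DDL.
DDL_LITERAL = (
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))? '
    r'([^ \"]*|\"[^\"]*\")'
)


def ddl_literal_to_regex(literal):
    """Undo the two escapes used here: backslash-backslash and backslash-quote."""
    return re.sub(r'\\([\\"])', r'\1', literal)


nginx_regex = ddl_literal_to_regex(DDL_LITERAL)

line = (
    '127.0.0.1 - - [14/May/2018:21:58:04 +0800] "GET /?stat HTTP/1.1" 200 182 '
    '"-" "aliyun-sdk-java/2.6.0(Linux/2.6.32-220.23.2.ali927.el5.x86_64/amd64;1.6.0_24)" "-"'
)
match = re.fullmatch(nginx_regex, line)

print(nginx_regex)
# Group 10 is the last column, gzip_ratio.
print(match.group(7), match.group(10))
```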
Apache Log4j logs
Take the log files that Hadoop generates by default as an example.
File content
2018-11-27 17:45:23,128 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Minimum allocation = <memory:1024, vCores:1>
2018-11-27 17:45:23,128 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Maximum allocation = <memory:8192, vCores:4>
2018-11-27 17:45:23,154 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: max alloc mb per queue for root is undefined
2018-11-27 17:45:23,154 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: max alloc vcore per queue for root is undefined
Regular expression
^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}.\\d{2}.\\d{2}.\\d{3})\\s+(\\S+)\\s+(\\S+)\\s+(.*)$
CREATE TABLE statement
CREATE EXTERNAL TABLE log4j_log(
date STRING,
time STRING,
level STRING,
class STRING,
details STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}.\\d{2}.\\d{2}.\\d{3})\\s+(\\S+)\\s+(\\S+)\\s+(.*)$"
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-beijing-for-openanalytics-test-2/datasets/jinluo/nginx/log4j_sample.log';
Query result
mysql> select * from log4j_log;
+------------+--------------+-------+--------------------------------------------------------------------------------------------------+-------------------------------------------------+
| date | time | level | class | details |
+------------+--------------+-------+--------------------------------------------------------------------------------------------------+-------------------------------------------------+
| 2018-11-27 | 17:45:23,128 | INFO | org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: | Minimum allocation = <memory:1024, vCores:1> |
| 2018-11-27 | 17:45:23,128 | INFO | org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: | Maximum allocation = <memory:8192, vCores:4> |
| 2018-11-27 | 17:45:23,154 | INFO | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: | max alloc mb per queue for root is undefined |
| 2018-11-27 | 17:45:23,154 | INFO | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration: | max alloc vcore per queue for root is undefined |
+------------+--------------+-------+--------------------------------------------------------------------------------------------------+-------------------------------------------------+
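Two details of this result follow directly from the pattern: the class column keeps its trailing colon because (\S+) captures everything up to the next whitespace, and the time field uses an unescaped dot, which matches any character rather than only ":" and ",". The Python sketch below illustrates both points. STRICT_REGEX and the malformed line are hypothetical, offered as one possible tightening rather than as a required change.

```python
import re

# Pattern from the CREATE TABLE statement, with SQL escaping removed.
LOG4J_REGEX = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s+(\d{2}.\d{2}.\d{2}.\d{3})\s+(\S+)\s+(\S+)\s+(.*)$"
)
# Hypothetical stricter variant: literal ':' and ',' inside the time field.
STRICT_REGEX = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2},\d{3})\s+(\S+)\s+(\S+)\s+(.*)$"
)

line = (
    "2018-11-27 17:45:23,128 INFO "
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: "
    "Minimum allocation = <memory:1024, vCores:1>"
)
# Made-up line with a malformed time field.
odd = "2018-11-27 17x45y23z128 INFO some.Class: message"

fields = LOG4J_REGEX.fullmatch(line).groups()
print(fields[1], fields[3][-1])

# The original pattern accepts the malformed time; the strict one rejects it.
loose_accepts = LOG4J_REGEX.fullmatch(odd) is not None
strict_accepts = STRICT_REGEX.fullmatch(odd) is not None
print(loose_accepts, strict_accepts)
```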
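Summary
When writing a regular expression for a log file:
- Each field in the regular expression is delimited by (); fields in a log line are usually separated by spaces.
- The number of columns defined in the CREATE TABLE statement must exactly match the number of fields in the regular expression.
- Typically, numbers can be matched with ([0-9]*) or (-|[0-9]*), and strings with ([^ ]*) or (".*?").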
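The column-count rule can be checked mechanically before running the DDL. The sketch below, in Python, counts the capturing groups of the Apache pattern and compares the count with the column list. The helper column_count_matches is hypothetical. Note that a non-capturing group written as (?: ... ) does not count as a field, which is why this pattern has ten pairs of parentheses but nine columns.

```python
import re

# Apache WebServer pattern from above, with SQL escaping removed.
APACHE_PATTERN = (
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)
APACHE_COLUMNS = ["host", "identity", "userName", "time", "request",
                  "status", "size", "referer", "agent"]


def column_count_matches(pattern, columns):
    """True when the number of capturing groups equals the number of columns."""
    return re.compile(pattern).groups == len(columns)


group_count = re.compile(APACHE_PATTERN).groups
ok = column_count_matches(APACHE_PATTERN, APACHE_COLUMNS)
# Dropping a column makes the table definition inconsistent with the pattern.
missing_one = column_count_matches(APACHE_PATTERN, APACHE_COLUMNS[:-1])
print(group_count, ok, missing_one)
```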