Misused regular expressions in Hive make jobs extremely slow

阿新 • • Published: 2019-02-19

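The business assurance department needed to use Hive to compute the previous hour's data in near real time: for example, at 12:00 the job has to process the 11:00 data, and it must finish within one hour. When they implemented this in Hive, a single map task alone ran for more than an hour, which could not meet the requirement at all. They later phoned me to help optimize it. The optimization process is described below: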

1. The HQL statement:

CREATE TABLE weibo_mobile_nginx AS SELECT
	split(split(log, '`') [ 0 ], '\\|')[ 0 ] HOST,
	split(split(log, '`') [ 0 ], '\\|')[ 1 ] time,
	substr(
		split(
			split(split(log, '`') [ 2 ], '\\?')[ 0 ], ' '
		)[ 0 ], 2
	)request_type,
	split(
		split(split(log, '`') [ 2 ], '\\?')[ 0 ], ' '
	)[ 1 ] interface,
	regexp_extract(
		log,
		'.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__[^&]*',
		3
	)version,
	regexp_extract(
		log,
		'.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__.*',
		1
	)systerm,
	regexp_extract(
		log,
		'.*&networktype=([^&%]*).*',
		1
	)net_type,
	split(log, '`')[ 4 ] STATUS,
	split(log, '`')[ 5 ] client_ip,
	split(log, '`')[ 6 ] uid,
	split(log, '`')[ 8 ] request_time,
	split(log, '`')[ 12 ] request_uid,
	split(log, '`')[ 13 ] http_host,
	split(log, '`')[ 15 ] upstream_response_time,
	split(log, '`')[ 16 ] idc
FROM
	ods_wls_wap_base_orig
WHERE
	dt = '20150311'
AND HOUR = '08'
AND(
	split(log, '`')[ 13 ]= 'api.weibo.cn'
	OR split(log, '`')[ 13 ]= 'mapi.weibo.cn');

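This HQL is actually simple: it reads from ods_wls_wap_base_orig, a table with a single column, extracts the required fields from each row with split or regular-expression matching, and creates the weibo_mobile_nginx table from the output.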

A row of ods_wls_wap_base_orig looks like this:

web043.mweibo.yhg.sinanode.com|[11/Mar/2015:00:00:01 +0800]`-`"GET /2/remind/unread_count?v_f=2&c=android&wm=9847_0002&remind_version=0&with_settings=1&unread_message=1&from=1051195010&lang=zh_CN&skin=default&with_page_group=1&i=4acbdd0&s=6b2cd11c&gsid=4uQ15a2b3&ext_all=0&idc=&ua=OPPO-R8007__weibo__5.1.1__android__android4.3&oldwm=9893_0028 HTTP/1.1"`"R8007_4.3_weibo_5.1.1_android"`200`[121.60.78.23]`3226234350`"-"`0.063`351`-`121.60.78.23`1002792675011956002`api.weibo.cn`-`0.063`yhg20150311 00

There is only one column, named log.
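Counting backtick-separated fields by eye is error-prone, so here is a minimal Java sketch that splits the sample row the same way the HQL does and prints each index next to its value. The class name LogFieldDemo and its helper methods are made up for illustration, and the query string in SAMPLE is abbreviated from the row above; the backtick layout is unchanged.

```java
class LogFieldDemo {
    // Abbreviated copy of the sample row: the query string is shortened,
    // the backtick-separated layout is the same as in the article.
    static final String SAMPLE =
        "web043.mweibo.yhg.sinanode.com|[11/Mar/2015:00:00:01 +0800]`-`"
        + "\"GET /2/remind/unread_count?v_f=2&c=android"
        + "&ua=OPPO-R8007__weibo__5.1.1__android__android4.3&oldwm=9893_0028 HTTP/1.1\"`"
        + "\"R8007_4.3_weibo_5.1.1_android\"`200`[121.60.78.23]`3226234350`\"-\"`0.063`351`-`"
        + "121.60.78.23`1002792675011956002`api.weibo.cn`-`0.063`yhg20150311 00";

    // Same top-level split as split(log, '`') in the HQL.
    static String[] fields(String line) {
        return line.split("`");
    }

    // Field 2 is the quoted request line; drop the leading quote to get the method.
    static String requestType(String[] f) {
        return f[2].split("\\?")[0].split(" ")[0].substring(1);
    }

    // The path before the '?' is what the HQL calls "interface".
    static String interfaceUrl(String[] f) {
        return f[2].split("\\?")[0].split(" ")[1];
    }

    public static void main(String[] args) {
        String[] f = fields(SAMPLE);
        for (int i = 0; i < f.length; i++) {
            System.out.println(i + " -> " + f[i]);
        }
        System.out.println("request_type = " + requestType(f));
        System.out.println("interface    = " + interfaceUrl(f));
    }
}
```

Running it shows, for instance, that index 13 holds http_host (api.weibo.cn), which is the field the WHERE clause filters on.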

2. Since the HQL was slow, my first optimization attempt was to write a MapReduce job

The map code is as follows:

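import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, Text> {

  private Text outputKey = new Text();
  private Text outputValue = new Text();

  // Original (slow) patterns: note the leading ".*".
  Pattern p_per_client = Pattern
      .compile(".*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__[^&]*");
  Pattern net_type_parent = Pattern.compile(".*&networktype=([^&%]*).*");

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    // Fields are separated by backticks; arr[13] is http_host.
    String[] arr = value.toString().split("`");
    // Skip short lines so the arr[16] access below cannot go out of bounds.
    if (arr.length > 16
        && (arr[13].equals("api.weibo.cn") || arr[13].equals("mapi.weibo.cn"))) {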
      Matcher matcher = p_per_client.matcher(value.toString());
      String host = "";
      String time = "";
      String request_type = "";
      String interface_url = "";
      String version = "";
      String systerm = "";
      String net_type = "";
      String status = "";
      String client_ip = "";
      String uid = "";
      String request_time = "0";
      String request_uid = "";
      String http_host = "";
      String upstream_response_time = "0";
      String idc = "";

      host = arr[0].split("\\|")[0];
      time = arr[0].split("\\|")[1];
      request_type = arr[2].split("\\?")[0].split(" ")[0].substring(1);
      interface_url = arr[2].split("\\?")[0].split(" ")[1];

      if (matcher.find()) {
        version = matcher.group(1);
        systerm = matcher.group(2);
      }

      Matcher matcher_net = net_type_parent.matcher(value.toString());
      if (matcher_net.find()) {
        net_type = matcher_net.group(1);
      }

      status = arr[4];
      client_ip = arr[5];
      uid = arr[6];
      if (!arr[8].equals("-")) {
        request_time = arr[8];
      }
      request_uid = arr[12];
      http_host = arr[13];
      if (!arr[15].equals("-")) {
        upstream_response_time = arr[15];
      }
      idc = arr[16];

      outputKey.set(host + "\t" + time + "\t" + request_type + "\t"
          + interface_url + "\t" + version + "\t" + systerm + "\t" + net_type
          + "\t" + status + "\t" + client_ip + "\t" + uid + "\t" + request_uid
          + "\t" + http_host + "\t" + idc);
      outputValue.set(request_time + "\t" + upstream_response_time);

      context.write(outputKey, outputValue);
    }

  }
}

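The Java code is also simple, so I won't go into detail. After I packaged and submitted the job, the slowest map ran for 40 minutes and the average map time was 30 minutes. The whole job did finish within an hour, but it was still slow; evidently this was not a problem that rewriting in Java could fix.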

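3. Finally, checking the regular expressions

The MapReduce job written in Java was also slow, so the cause had to lie elsewhere. I looked at the regular expressions in the HQL and changed a few places:

Before: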

	regexp_extract(
		log,
		'.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__[^&]*',
		3
	)version,
	regexp_extract(
		log,
		'.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__.*',
		1
	)systerm,
	regexp_extract(
		log,
		'.*&networktype=([^&%]*).*',
		1
	)net_type,

After:

	regexp_extract(
		log,
		'&ua=[^_]*__[^_]*__([^_]*)__[^_]*__',
		1
	)version,
	regexp_extract(
		log,
		'&ua=[^_]*__[^_]*__[^_]*__([^_]*)__',
		1
	)systerm,
	regexp_extract(
		log,
		'&networktype=([^&%]*)',
		1
	)net_type,

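The match targets are well defined, so I removed the ".*" at the beginning and end of the regular expressions, dropped the unnecessary groups, and changed every group index to 1.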
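As a quick sanity check, the sketch below runs the rewritten patterns against the ua fragment of the sample row. The class UaRegexDemo and its extract helper are hypothetical names; the helper uses Matcher.find(), on the assumption that this is close enough to how regexp_extract searches a string (the article does not confirm Hive's internals).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UaRegexDemo {
    // Query-string fragment taken from the sample log row.
    static final String QUERY =
        "&ext_all=0&idc=&ua=OPPO-R8007__weibo__5.1.1__android__android4.3&oldwm=9893_0028";

    // Rewritten patterns from the "After" listing: no leading ".*", one group each.
    static final String VERSION_RE  = "&ua=[^_]*__[^_]*__([^_]*)__[^_]*__";
    static final String SYSTERM_RE  = "&ua=[^_]*__[^_]*__[^_]*__([^_]*)__";
    static final String NET_TYPE_RE = "&networktype=([^&%]*)";

    // Hypothetical helper: returns the requested group, or "" when nothing matches.
    static String extract(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : "";
    }

    public static void main(String[] args) {
        System.out.println(extract(VERSION_RE, QUERY, 1));   // 5.1.1
        System.out.println(extract(SYSTERM_RE, QUERY, 1));   // android
        // The sample row has no networktype parameter, so this prints an empty line.
        System.out.println(extract(NET_TYPE_RE, QUERY, 1));
    }
}
```

Because find() already scans for the first position where the pattern can match, the unanchored patterns locate "&ua=" without any ".*" in front of them.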

The regular expressions in the Java code were changed as well:

  Pattern p_per_client = Pattern
      .compile("&ua=[^_]*__[^_]*__([^_]*)__([^_]*)__");
  Pattern net_type_parent = Pattern.compile("&networktype=([^&%]*)");

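I submitted both versions for testing and they flew: the modified HQL and MapReduce jobs each completed in 6 minutes, with an average map time of 2 minutes. The speedup was large and met their requirement.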

Summary:

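1. When a regular expression begins with ".*", the engine first consumes the whole line and then backtracks character by character looking for the rest of the pattern; on lines that do not match, a find-style search repeats this from every starting position, so it is extremely slow on long log lines. When the match target is well defined, remove the ".*".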

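2. When you run into this kind of problem in the future, be sure to check whether the regular expressions are badly written. Keep this in mind.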
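The cost of the leading ".*" can be observed in isolation with a rough timing sketch like the one below. It is not a rigorous benchmark: the class LeadingDotStarTiming, the synthetic input built by lineWithoutKey, and the line length and repeat count are all assumptions chosen for illustration, and absolute timings will vary by JVM and machine.

```java
import java.util.regex.Pattern;

class LeadingDotStarTiming {
    // The networktype pattern before and after the fix.
    static final Pattern SLOW = Pattern.compile(".*&networktype=([^&%]*).*");
    static final Pattern FAST = Pattern.compile("&networktype=([^&%]*)");

    // Hypothetical input: a long query string that never contains "networktype".
    static String lineWithoutKey(int length) {
        StringBuilder sb = new StringBuilder(length);
        while (sb.length() < length) {
            sb.append("&k=v");
        }
        return sb.toString();
    }

    static boolean found(Pattern p, String line) {
        return p.matcher(line).find();
    }

    // Crude wall-clock measurement of repeated find() calls on the same line.
    static long nanosFor(Pattern p, String line, int repeats) {
        int hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < repeats; i++) {
            if (found(p, line)) {
                hits++;
            }
        }
        long elapsed = System.nanoTime() - start;
        if (hits < 0) {
            throw new IllegalStateException("unreachable"); // keeps hits in use
        }
        return elapsed;
    }

    public static void main(String[] args) {
        String line = lineWithoutKey(4000);
        System.out.println("with leading .* : " + nanosFor(SLOW, line, 5) + " ns");
        System.out.println("without         : " + nanosFor(FAST, line, 5) + " ns");
    }
}
```

On a non-matching line the first pattern does work that grows roughly with the square of the line length, while the second makes a single pass, which is consistent with the drop in map time reported above.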
