
Hive Custom UDFs (Repost)

(Reposted from) https://www.cnblogs.com/yfb918/p/10644262.html

JSON Parsing in Hive (Plain JSON and JSON Arrays)

I. Data Preparation

Prepare the raw JSON data (test.json) as follows:

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}

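Now load this data into Hive. The goal is to extract the JSON fields (movie, rate, timeStamp, uid) as separate columns.

This can be done with either the built-in function (get_json_object) or a custom function.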

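II. get_json_object(string json_string, string path)

Return type: String

Description: parses the JSON string json_string and returns the content specified by path. If the input JSON string is invalid, it returns NULL. Each call can return only one data item.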

0: jdbc:hive2://hadoop3:10000> select get_json_object('{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}','$.movie');

1. Create the json table and load the data

0: jdbc:hive2://master:10000> create table json(data string);
No rows affected (0.572 seconds)
0: jdbc:hive2://master:10000> load data local inpath '/home/hadoop/json.txt' into table json;
No rows affected (1.046 seconds)

0: jdbc:hive2://master:10000> select get_json_object(data,'$.movie') as movie from json;

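III. json_tuple(jsonStr, k1, k2, ...)

Takes a JSON string and a set of keys k1, k2, ... and returns a tuple of the corresponding values. It is more efficient than get_json_object because several keys can be extracted in a single call.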

0: jdbc:hive2://master:10000> select b.b_movie,b.b_rate,b.b_timeStamp,b.b_uid from json a lateral view 
json_tuple(a.data,'movie','rate','timeStamp','uid') b as b_movie,b_rate,b_timeStamp,b_uid;

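Note:

  The advantage of json_tuple over get_json_object is that it parses several JSON fields in one call. However, if we have a JSON array, neither function can turn it into one row per element.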
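
The difference in call shape between the two functions can be modelled in a few lines of plain Java. This is only a sketch under strong assumptions: FlatJsonSketch, getOne and getMany are hypothetical names, not Hive APIs, and the regular expression only handles flat objects whose values are plain strings without escaped quotes (like the rows of test.json). It says nothing about how Hive implements either function internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: hypothetical names, not part of Hive.
class FlatJsonSketch {
    // Matches "key":"value" pairs. Assumes a flat object whose values are
    // plain strings without escaped quotes.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    // Scans the text once and collects every key/value pair it finds.
    static Map<String, String> parse(String json) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(json);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    // get_json_object-like shape: one key in, one value out per call.
    static String getOne(String json, String key) {
        return parse(json).get(key);
    }

    // json_tuple-like shape: several keys in, one tuple of values out.
    static String[] getMany(String json, String... keys) {
        Map<String, String> fields = parse(json);
        String[] values = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            values[i] = fields.get(keys[i]); // null when the key is absent
        }
        return values;
    }

    public static void main(String[] args) {
        String row = "{\"movie\":\"1193\",\"rate\":\"5\",\"timeStamp\":\"978300760\",\"uid\":\"1\"}";
        System.out.println(getOne(row, "movie"));
        System.out.println(String.join("\t", getMany(row, "movie", "rate", "timeStamp", "uid")));
    }
}
```

In this model, fetching four fields with getOne scans the text four times, while getMany scans it once.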

IV. Parsing JSON Arrays

1. Parsing a JSON array with Hive's built-in functions

Hive's built-in explode() function takes an array or a map as input and outputs each of its elements as a separate row. It can be used together with LATERAL VIEW.

hive> select explode(array('A','B','C'));
OK
A
B
C
Time taken: 4.879 seconds, Fetched: 3 row(s)
hive> select explode(map('A',10,'B',20,'C',30));
OK
A       10
B       20
C       30
Time taken: 0.261 seconds, Fetched: 3 row(s)
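
As a rough mental model of what explode does when combined with LATERAL VIEW, the following plain-Java sketch repeats each input row once per element of its array column. ExplodeSketch and lateralViewExplode are hypothetical names chosen for illustration, and the "table" is just a map from a row label to its array column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names, not part of Hive.
class ExplodeSketch {
    // Models "FROM t LATERAL VIEW explode(t.arr) x AS element":
    // each input row is emitted once per element of its array column.
    // A row with an empty array produces no output, as with a plain
    // (non-OUTER) LATERAL VIEW.
    static List<String[]> lateralViewExplode(Map<String, List<String>> table) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : table.entrySet()) {
            for (String element : row.getValue()) {
                rows.add(new String[] { row.getKey(), element });
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("row1", Arrays.asList("A", "B", "C"));
        table.put("row2", Arrays.asList("D"));
        for (String[] r : lateralViewExplode(table)) {
            System.out.println(r[0] + "\t" + r[1]);
        }
    }
}
```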

explode is relevant to parsing JSON data: we can use it to output the elements of a JSON array one per row:

hive> SELECT explode(split(regexp_replace(regexp_replace('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;'));
OK
{"website":"www.baidu.com","name":"百度"}
{"website":"google.com","name":"谷歌"}
Time taken: 0.14 seconds, Fetched: 2 row(s)

Explanation:

SELECT explode(split(
    regexp_replace(
        regexp_replace(
            '[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]',
            '\\[|\\]', ''),  -- strip the square brackets around the JSON array
        '\\}\\,\\{',         -- replace the commas between array elements with semicolons
        '\\}\\;\\{'),
    '\\;'));                 -- split on the semicolon
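
The same three steps can be reproduced outside Hive to see exactly what each one does. The sketch below is a plain-Java imitation, not Hive code: JsonArraySplitSketch and splitJsonArray are hypothetical names, the sample values are ASCII stand-ins for the ones used above, and Java's String.replaceAll and String.split play the roles of regexp_replace and split.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical names, not part of Hive.
class JsonArraySplitSketch {
    // Mirrors the Hive expression step by step; the returned list
    // corresponds to the rows that explode() would emit.
    static List<String> splitJsonArray(String jsonArray) {
        // Step 1: strip the square brackets around the array.
        String noBrackets = jsonArray.replaceAll("\\[|\\]", "");
        // Step 2: turn the "},{" between elements into "};{".
        String separated = noBrackets.replaceAll("\\}\\,\\{", "};{");
        // Step 3: split on the semicolon.
        return Arrays.asList(separated.split("\\;"));
    }

    public static void main(String[] args) {
        String input = "[{\"website\":\"www.baidu.com\",\"name\":\"Baidu\"},"
                     + "{\"website\":\"google.com\",\"name\":\"Google\"}]";
        for (String row : splitJsonArray(input)) {
            System.out.println(row);
        }
    }
}
```

The sketch also makes the limits of the approach visible: any string value that itself contains a square bracket, a semicolon, or the sequence },{ will be split or altered incorrectly.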

Combine this with get_json_object or json_tuple to parse the fields inside each element:

hive> select json_tuple(json, 'website', 'name') from (SELECT explode(split(regexp_replace(regexp_replace('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) as json) test;
OK
www.baidu.com   百度
google.com      谷歌
Time taken: 0.283 seconds, Fetched: 2 row(s)

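2. Parsing a JSON array with a custom function

Although Hive's built-in functions can parse JSON arrays, they are cumbersome to use. Hive provides a powerful user-defined function (UDF) interface, which we can use to write a UDF that parses JSON arrays. The test procedure is as follows: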

<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONArray;
import org.json.JSONException;

import java.util.ArrayList;

@Description(name = "json_array",
        value = "_FUNC_(array_string) - Convert a string of a JSON-encoded array to a Hive array of strings.")
public class JsonArray extends UDF {
    // Returns each top-level element of the JSON array as a string,
    // or null when the input is null or is not a valid JSON array.
    public ArrayList<String> evaluate(String jsonString) {
        if (jsonString == null) {
            return null;
        }
        try {
            JSONArray extractObject = new JSONArray(jsonString);
            ArrayList<String> result = new ArrayList<String>();
            for (int ii = 0; ii < extractObject.length(); ++ii) {
                result.add(extractObject.get(ii).toString());
            }
            return result;
        } catch (JSONException e) {
            return null;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}

Compile and package the code above; the resulting jar is named HiveJsonTest-1.0-SNAPSHOT.jar.

hive> add jar /mnt/HiveJsonTest-1.0-SNAPSHOT.jar;
Added [/mnt/HiveJsonTest-1.0-SNAPSHOT.jar] to class path
Added resources: [/mnt/HiveJsonTest-1.0-SNAPSHOT.jar]
hive> create temporary function json_array as 'JsonArray';
OK
Time taken: 0.111 seconds
hive> select explode(json_array('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]'));
OK
{"website":"www.baidu.com","name":"百度"}
{"website":"google.com","name":"谷歌"}
Time taken: 10.427 seconds, Fetched: 2 row(s)
hive> select json_tuple(json, 'website', 'name') from (SELECT explode(json_array('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]')) as json) test;
OK
www.baidu.com   百度
google.com      谷歌
Time taken: 0.265 seconds, Fetched: 2 row(s)
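
One reason a parser-based UDF is more robust than the regular-expression approach is that it knows which commas and braces sit inside string values. The sketch below illustrates that idea with the standard library only: it splits an array by tracking nesting depth and quote state. TopLevelArraySplitter is a hypothetical name, the code assumes well-formed input and does not validate JSON, and it returns the raw text of each element. It is not how org.json or the UDF above is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical name, simplified, no JSON validation.
class TopLevelArraySplitter {
    // Returns the raw text of each top-level element of a JSON array,
    // or null when the text is not wrapped in square brackets.
    static List<String> split(String jsonArray) {
        String s = jsonArray.trim();
        if (s.length() < 2 || s.charAt(0) != '[' || s.charAt(s.length() - 1) != ']') {
            return null;
        }
        List<String> elements = new ArrayList<>();
        int depth = 0;            // nesting depth inside the outer array
        boolean inString = false; // true while inside a quoted string
        int start = 1;            // start of the current element
        for (int i = 1; i < s.length() - 1; i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;                  // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
            } else if (c == ',' && depth == 0) {
                // A comma at depth 0 outside a string separates two elements.
                elements.add(s.substring(start, i).trim());
                start = i + 1;
            }
        }
        String last = s.substring(start, s.length() - 1).trim();
        if (!last.isEmpty()) {
            elements.add(last);
        }
        return elements;
    }

    public static void main(String[] args) {
        // The first value contains "},{" and ";", which defeat the regex approach.
        String input = "[{\"name\":\"a},{b;c\"},{\"name\":\"d\"}]";
        for (String element : split(input)) {
            System.out.println(element);
        }
    }
}
```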

3. Parsing a JSON object with a custom function

package com.laotou;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.JSONTokener;

/**
 * UDF for parsing a JSON object.
 *
 * Usage:
 *   add jar jar/bdp_udf_demo-1.0.0.jar;
 *   create temporary function getJsonObject as 'com.laotou.JsonObjectParsing';
 *
 * @Author:
 * @Date: 2019/8/9
 */
public class JsonObjectParsing extends UDF {
    public static String evaluate(String jsonStr, String keyName) throws JSONException {
        // Blank JSON or blank key: nothing to look up.
        if (StringUtils.isBlank(jsonStr) || StringUtils.isBlank(keyName)) {
            return null;
        }
        JSONObject jsonObject = new JSONObject(new JSONTokener(jsonStr));
        // opt() returns null for a missing key; get() would throw JSONException
        // instead, which would make the null check below unreachable.
        Object objValue = jsonObject.opt(keyName);
        if (objValue == null) {
            return null;
        }
        return objValue.toString();
    }
}

3.1 Prepare the data

3.2 Test