
Hive Notes: Custom UDFs

1. Define your own UDF

package com.hihi.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class HelloWord extends UDF {
    // Prefix the input string with "HelloWord:", passing nulls through.
    // Hive invokes evaluate() once per row, locating it by reflection.
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text("HelloWord:" + s.toString());
    }
}
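
Because Hive resolves evaluate() by reflection, one UDF class may declare several overloads, and Hive picks the one whose signature matches the argument types at the call site. A minimal sketch; the class name and the IntWritable variant are hypothetical and not part of the original example:

package com.hihi.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class HelloOverload extends UDF {
    // Matches string arguments, e.g. hello_overload('abc')
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text("HelloWord:" + s.toString());
    }

    // Matches int arguments, e.g. hello_overload(42)
    public Text evaluate(final IntWritable n) {
        if (n == null) { return null; }
        return new Text("HelloWord:" + n.get());
    }
}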
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>study-hadoop</groupId>
  <artifactId>hive</artifactId>
  <version>1.0</version>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
    <hive.version>1.1.0-cdh5.7.0</hive.version>
  </properties>

  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>${hive.version}</version>
    </dependency>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.10</version>
    </dependency>
  </dependencies>
</project>

2. Package the code and upload the jar to the server
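
First build the jar with Maven (a typical invocation, assuming the standard layout; the artifact lands in target/hive-1.0.jar):

mvn clean package

Then upload it to the server: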

[root@hadoop001 jar]# rz
rz waiting to receive.
Starting zmodem transfer.  Press Ctrl+C to cancel.
Transferring hive-1.0.jar...
  100%       2 KB       2 KB/sec    00:00:01       0 Errors  

[root@hadoop001 jar]# pwd
/home/hadoop/jar

3. Enter Hive and run the following commands. This creates a temporary function, which is only usable in the current session.
hive> add jar /home/hadoop/jar/hive-1.0.jar;
Added [/home/hadoop/jar/hive-1.0.jar] to class path
Added resources: [/home/hadoop/jar/hive-1.0.jar]
hive> create temporary function my_hello as 'com.hihi.hive.HelloWord';
OK
Time taken: 0.016 seconds

hive> select ename, my_hello(ename) from emp_dept_partition limit 3;
OK
SMITH   HelloWord:SMITH
JONES   HelloWord:JONES
SCOTT   HelloWord:SCOTT
Time taken: 0.124 seconds, Fetched: 3 row(s)

hive> list jars;
/home/hadoop/jar/hive-1.0.jar
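
Within the same session, the temporary function can also be removed again with standard HiveQL:

hive> drop temporary function my_hello;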

4. Querying the metastore shows no record of the function
mysql> select * from funcs;
Empty set (0.00 sec)

5. Start a new session; since the function just created was temporary, the call now fails
hive> select ename, my_hello(ename) from emp_dept_partition limit 3;
FAILED: SemanticException [Error 10011]: Line 1:14 Invalid function 'my_hello'

6. Try creating a permanent function
hive> add jar /home/hadoop/jar/hive-1.0.jar;
Added [/home/hadoop/jar/hive-1.0.jar] to class path
Added resources: [/home/hadoop/jar/hive-1.0.jar]
hive> create function my_hello as 'com.hihi.hive.HelloWord';
OK
Time taken: 0.016 seconds

7. Query the metastore again: the FUNCS table now has a row for the function, but the FUNC_RU table is still empty.
mysql> select * from funcs;
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
| FUNC_ID | CLASS_NAME              | CREATE_TIME | DB_ID | FUNC_NAME | FUNC_TYPE | OWNER_NAME | OWNER_TYPE |
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
|       6 | com.hihi.hive.HelloWord |  1515675864 |     1 | my_hello  |         1 | NULL       | USER       |
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
1 row in set (0.00 sec)
mysql> select * from func_ru;
Empty set (0.00 sec)

8. Start a new session; the call still fails. FUNCS records only the class name, and with FUNC_RU empty the new session has no way to locate the jar, so the class cannot be loaded.
hive> select ename, my_hello(ename) from emp_dept_partition limit 3;
FAILED: SemanticException [Error 10011]: Line 1:14 Invalid function 'my_hello'

9. Try creating the function with the jar read from HDFS
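The jar first needs to be copied to HDFS, and the my_hello registered in step 6 presumably has to be dropped before the name can be reused. Roughly (the /jar directory is an assumption):

[root@hadoop001 jar]# hdfs dfs -mkdir -p /jar
[root@hadoop001 jar]# hdfs dfs -put hive-1.0.jar /jar/

hive> drop function my_hello;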
CREATE FUNCTION my_hello AS 'com.hihi.hive.HelloWord' USING JAR 'hdfs://hadoop001:9000/jar/hive-1.0.jar';

10. Check the metastore: FUNC_RU now has a row for my_hello. Does that mean that each time the function is called, Hive reads the metastore, reloads the jar, and recreates the function?
mysql> select * from func_ru;
+---------+---------------+----------------------------------------+-------------+
| FUNC_ID | RESOURCE_TYPE | RESOURCE_URI                           | INTEGER_IDX |
+---------+---------------+----------------------------------------+-------------+
|      11 |             1 | hdfs://hadoop001:9000/jar/hive-1.0.jar |           0 |
+---------+---------------+----------------------------------------+-------------+
1 row in set (0.00 sec)

mysql> select * from funcs;
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
| FUNC_ID | CLASS_NAME              | CREATE_TIME | DB_ID | FUNC_NAME | FUNC_TYPE | OWNER_NAME | OWNER_TYPE |
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
|      11 | com.hihi.hive.HelloWord |  1515676179 |     1 | my_hello  |         1 | NULL       | USER       |
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
1 row in set (0.00 sec)

11. Log in to a new session and check whether the jar is loaded before calling the function. The first call reloads the jar automatically; where to load it from is recorded in the metastore's FUNC_RU table.
hive> list jar;
hive> select ename, my_hello(ename) from emp_dept_partition limit 3;
converting to local hdfs://hadoop001:9000/jar/hive-1.0.jar
Added [/tmp/9da42cea-1284-46f1-9969-74dc80ed05fe_resources/hive-1.0.jar] to class path
Added resources: [hdfs://hadoop001:9000/jar/hive-1.0.jar]
OK
SMITH   HelloWord:SMITH
JONES   HelloWord:JONES
SCOTT   HelloWord:SCOTT
Time taken: 1.252 seconds, Fetched: 3 row(s)

hive> list jar;
/tmp/9da42cea-1284-46f1-9969-74dc80ed05fe_resources/hive-1.0.jar

Drawbacks: show functions does not list this function, and having to reload the jar on every use is still quite a hassle, so I will keep looking for a better solution.
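
One common workaround for the temporary-function route is the Hive CLI's ~/.hiverc file, which is executed at the start of every session; putting the registration there makes my_hello available automatically. A minimal sketch, assuming the jar path used above:

-- ~/.hiverc: run by the Hive CLI when a session starts
add jar /home/hadoop/jar/hive-1.0.jar;
create temporary function my_hello as 'com.hihi.hive.HelloWord';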

[From @若澤大資料]