Extracting Chinese Substrings with an Impala UDF

阿新 • Published: 2019-01-26

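At present, Impala's substr and substring functions do not handle Chinese text correctly, because they work on bytes rather than characters. A UDF is therefore needed. It should behave on Chinese text exactly as substr behaves on English text, namely:

    SUBSTR("abcde",3)    = cde
    SUBSTR("abcde",-2)   = de
    SUBSTR("abcde",3,2)  = cd
    SUBSTR("abcde",-4,2) = bc

Impala UDFs can be written in C++ or Java, but C++ is usually chosen for performance (https://www.cloudera.com/documentation/enterprise/5-5-x/topics/search_prepare_install_search.html and http://blog.csdn.net/yu616568/article/details/52746332). Both the two-argument and the three-argument form must be supported here, so the UDF needs two overloads. The steps are as follows.

1. Download the impala-udf-devel package: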

    > git clone https://github.com/laserson/impala-udf-devel.git
    > cd impala-udf-devel/
    > cmake .

2. In the impala-udf-devel directory, create the two files udf-substr.cc and udf-substr.h. You can start by copying udf.cc and udf.h from the udf subdirectory into the parent directory. The source is as follows.

udf-substr.cc

    #include "udf-substr.h"  // expected to pull in the Impala UDF header and declare both overloads
    #include <algorithm>
    #include <cstdlib>
    #include <cstring>
    using namespace std;

    const unsigned char kFirstBitMask = 128;  // 10000000
    const unsigned char kSecondBitMask = 64;  // 01000000
    const unsigned char kThirdBitMask = 32;   // 00100000
    const unsigned char kFourthBitMask = 16;  // 00010000
    const unsigned char kFifthBitMask = 8;    // 00001000

    // Byte length of the UTF-8 character that starts with firstByte.
    int utf8_char_len(char firstByte) {
      int offset = 1;
      if (firstByte & kFirstBitMask) {     // above 127: beyond the ASCII range
        if (firstByte & kThirdBitMask) {   // 224 or above: at least a three-octet code point
          if (firstByte & kFourthBitMask)  // 240 or above: a four-octet code point
            offset = 4;
          else
            offset = 3;
        } else {
          offset = 2;
        }
      }
      return offset;
    }

    // Number of UTF-8 characters (not bytes) in str.
    int getStringLength(const StringVal& str) {
      int index = 0;
      int cnt = 0;
      while (index < str.len) {
        index += utf8_char_len(static_cast<char>(str.ptr[index]));
        cnt++;
      }
      return cnt;
    }

    // Three-argument form: start is a 1-based character position (negative
    // values count from the end), offsets is the number of characters to return.
    StringVal udf_substr(FunctionContext* context, const StringVal& str,
                         const IntVal& start, const IntVal& offsets) {
      if (str.is_null || start.is_null || offsets.is_null ||
          start.val == 0 || offsets.val < 0) {
        return "";
      }
      // Compare start against the character count, not the byte length.
      int nchars = getStringLength(str);
      if (abs(start.val) > nchars) {
        return "";
      }
      // startp begins as a 0-based character index and endp as a character
      // count; the loop widens both into byte units.
      int startp = start.val < 0 ? nchars + start.val : start.val - 1;
      int endp = min(offsets.val, str.len);
      const char* data = reinterpret_cast<const char*>(str.ptr);
      int j = 0;
      while (j < str.len) {
        int i = utf8_char_len(data[j]);
        if (j < startp) {
          startp += i - 1;  // character before the start position
        } else if (j < startp + endp) {
          endp += i - 1;    // character inside the requested range
        } else {
          break;
        }
        j += i;
      }
      if (startp >= str.len) {
        return "";
      }
      endp = min(endp, str.len - startp);  // never read past the end of the input
      StringVal result(context, endp);
      memcpy(result.ptr, data + startp, endp);
      return result;
    }

    // Two-argument form: everything from start to the end of the string.
    StringVal udf_substr(FunctionContext* context, const StringVal& str,
                         const IntVal& start) {
      return udf_substr(context, str, start, IntVal(str.len));
    }

3. Edit the CMakeLists file. The part that needs changing is this section:

    # Build the UDA/UDFs into a shared library. You can have multiple UDFs per
    # file, and/or specify multiple files here.
    add_library(substr_udf SHARED udf-substr.cc)

    # The resulting LLVM IR module will have the same name as the .cc file
    if (CLANG_EXECUTABLE)
      COMPILE_TO_IR(udf-substr.cc)
      # COMPILE_TO_IR(my-udf-file-2.cc)
    endif(CLANG_EXECUTABLE)

4. Build: run cmake . and then make. A libsubstr_udf.so file is generated in the build directory.

5. Upload the .so file to an HDFS directory (choose the directory to suit your environment):

    > hadoop fs -put libsubstr_udf.so hdfs://namenode-or-nameservice/tmp/nrpt/

6. Register the UDF from impala-shell or from the Impala query editor in Hue. The symbol must match the name of the C++ function, udf_substr:

    -- three-argument form
    create function udf_substr(string, int, int) returns string
      location 'hdfs://namenode-or-nameservice/tmp/nrpt/libsubstr_udf.so'
      symbol='udf_substr';

    -- two-argument form
    create function udf_substr(string, int) returns string
      location 'hdfs://namenode-or-nameservice/tmp/nrpt/libsubstr_udf.so'
      symbol='udf_substr';

7. To unload or delete a UDF, run:

    drop function udf_substr(string, int, int);

8. Test the result:

    select udf_substr("adbw我當時aad",-4);    -- output: 時aad
    select udf_substr("adbw我當時aad",-6,3);  -- output: 我當時
    select udf_substr("adbw我當時aad",6);     -- output: 當時aad
    select udf_substr("adbw我當時aad",6,3);   -- output: 當時a
    select udf_substr("adbw我當時aad",16,3);  -- output: ""
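The character-counting logic can be checked off-cluster before the shared library is built. The following is a minimal standalone sketch, not the UDF itself: it uses std::string instead of Impala's StringVal, and the names utf8_len, utf8_substr and check_article_examples are hypothetical helpers introduced here for illustration. It assumes well-formed UTF-8 input and a source file saved as UTF-8.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence introduced by lead byte c.
// Assumption: input is well-formed; invalid bytes are stepped over one at a time.
static int utf8_len(unsigned char c) {
  if (c < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;
}

// Hypothetical helper: character-based substring. start is 1-based,
// negative values count from the end; count is a number of characters.
std::string utf8_substr(const std::string& s, int start, int count) {
  // Record the byte offset at which each character begins.
  std::vector<std::size_t> offs;
  for (std::size_t i = 0; i < s.size();
       i += utf8_len(static_cast<unsigned char>(s[i]))) {
    offs.push_back(i);
  }
  offs.push_back(s.size());  // sentinel: one past the last character
  const int n = static_cast<int>(offs.size()) - 1;  // character count

  if (start == 0 || start > n || start < -n || count < 0) return "";
  const int first = start < 0 ? n + start : start - 1;
  const int last = std::min(n, first + std::min(count, n));
  return s.substr(offs[first], offs[last] - offs[first]);
}

// Two-argument form: everything from start to the end of the string.
std::string utf8_substr(const std::string& s, int start) {
  return utf8_substr(s, start, static_cast<int>(s.size()));
}

// Reproduces the expected results listed in the article.
void check_article_examples() {
  assert(utf8_substr("abcde", 3) == "cde");
  assert(utf8_substr("abcde", -4, 2) == "bc");
  assert(utf8_substr("adbw我當時aad", -4) == "時aad");
  assert(utf8_substr("adbw我當時aad", -6, 3) == "我當時");
  assert(utf8_substr("adbw我當時aad", 6, 3) == "當時a");
  assert(utf8_substr("adbw我當時aad", 16, 3) == "");
}
```

Calling check_article_examples() from a small main or a test harness exercises the same cases as the test queries in step 8. Precomputing the character offsets trades a little memory for simpler index arithmetic than the in-place byte walk used in the UDF.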
