Spark (13): User-Defined Functions (UDF) and Window Functions in Spark SQL

Axin • Published: 2021-08-03

1. User-Defined Functions (UDF)

Spark SQL also supports Hive-style user-defined functions. They fall roughly into three kinds:

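UDF (User-Defined Function): the basic scalar function that maps one input row to one output value, similar to to_char or to_date.
UDAF (User-Defined Aggregate Function): a user-defined aggregate that reduces many rows to one value, similar to sum or avg applied after group by.
UDTF (User-Defined Table-Generating Function): a user-defined generator that turns one input row into multiple output rows, somewhat like flatMap on a stream.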
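The three shapes can be illustrated without Spark. The following is a minimal plain-Java sketch built on java.util.stream, not on any Spark or Hive API; the class name UdfShapes and the sample data are made up for illustration. Here map stands in for a UDF, a reduction for a UDAF, and flatMap for a UDTF.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java streams, not Spark or Hive API.
public class UdfShapes {

    // UDF shape: one value in, one value out, applied row by row.
    static List<String> upperEach(List<String> rows) {
        return rows.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    // UDAF shape: many rows in, a single aggregated value out.
    static int sumAll(List<Integer> rows) {
        return rows.stream().mapToInt(Integer::intValue).sum();
    }

    // UDTF shape: one row in, zero or more rows out.
    static List<String> explode(List<String> rows) {
        return rows.stream()
                .flatMap(row -> Arrays.stream(row.split(",")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(upperEach(Arrays.asList("a", "b")));   // [A, B]
        System.out.println(sumAll(Arrays.asList(1, 2, 3)));       // 6
        System.out.println(explode(Arrays.asList("a,b", "c")));   // [a, b, c]
    }
}
```

The row counts are the point: the first call returns as many rows as it receives, the second collapses them to one value, and the third returns more rows than it receives.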

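The example below is a UDAF rather than a plain UDF: it extends the UserDefinedAggregateFunction class and implements its eight methods (inputSchema, bufferSchema, dataType, deterministic, initialize, update, merge, evaluate). Note that this class has been deprecated since Spark 3.0 in favor of Aggregator.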

Example

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}

object GetDistinctCityUDF extends UserDefinedAggregateFunction{
  /**
    * Schema of the input data
    */
  override def inputSchema: StructType = StructType(
    StructField("status", StringType, true) :: Nil
  )

  /**
    * Schema of the intermediate aggregation buffer
    */
  override def bufferSchema: StructType = {
    StructType(
      Array(
        StructField("buffer_city_info", StringType, true)
      )
    )
  }

  /**
    * Type of the final result
    */
  override def dataType: DataType = StringType

  /**
    * Whether the function always returns the same result for the same input
    */
  override def deterministic: Boolean = true

  /**
    * Initialize the aggregation buffer
    */
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, "")
  }
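  /**
    * Update the aggregation buffer with one input row
    */
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    // Skip null inputs; getString would return null and break the checks below
    if (!input.isNullAt(0)) {
      // Value accumulated so far
      var last_str = buffer.getString(0)
      // Value of the current row
      val current_str = input.getString(0)
      // Compare against whole comma-separated entries rather than substrings,
      // so a city whose name is a substring of another one is not dropped
      if (!last_str.split(",").contains(current_str)) {
        // First value: assign it; otherwise append it after a comma
        if (last_str.equals("")) {
          last_str = current_str
        } else {
          last_str += "," + current_str
        }
      }
      buffer.update(0, last_str)
    }
  }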
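  /**
    * Merge the partial results of two partitions.
    * buffer1 and buffer2 hold results computed on different partitions,
    * for example on hosts hadoop1 and hadoop2.
    */
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    var buf1 = buffer1.getString(0)
    val buf2 = buffer2.getString(0)
    // Split buf2 on commas and append every entry that buf1 does not have yet
    for (s <- buf2.split(",")) {
      if (s.nonEmpty && !buf1.split(",").contains(s)) {
        if (buf1.equals("")) {
          buf1 = s
        } else {
          // Keep the comma separator (the original appended s without it)
          buf1 += "," + s
        }
      }
    }
    buffer1.update(0, buf1)
  }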
  /**
    * Produce the final result
    */
  override def evaluate(buffer: Row): Any = {
    buffer.getString(0)
  }
}
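To make the initialize / update / merge / evaluate lifecycle easier to follow, here is a minimal plain-Java sketch that replays it on ordinary strings. It is not Spark code: the class DistinctCityBuffer, its method names, and the city values are hypothetical, and the two lists merely stand in for two partitions. It assumes, as the example above does, that the buffer is a comma-separated string of distinct values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: mimics the initialize/update/merge/evaluate
// lifecycle with a plain String buffer; no Spark classes are involved.
public class DistinctCityBuffer {

    // initialize: start from an empty buffer.
    static String initialize() {
        return "";
    }

    // True if value is already one of the comma-separated entries of buffer.
    static boolean hasEntry(String buffer, String value) {
        return Arrays.asList(buffer.split(",")).contains(value);
    }

    // update: fold one input value into a partition-local buffer.
    static String update(String buffer, String value) {
        if (value == null || value.isEmpty() || hasEntry(buffer, value)) {
            return buffer;
        }
        return buffer.isEmpty() ? value : buffer + "," + value;
    }

    // merge: combine two partial buffers, keeping the comma separator.
    static String merge(String buffer1, String buffer2) {
        String result = buffer1;
        for (String s : buffer2.split(",")) {
            result = update(result, s);
        }
        return result;
    }

    // Run update over all rows of one partition.
    static String aggregate(List<String> partition) {
        String buffer = initialize();
        for (String row : partition) {
            buffer = update(buffer, row);
        }
        return buffer;
    }

    public static void main(String[] args) {
        String p1 = aggregate(Arrays.asList("beijing", "shanghai", "beijing"));
        String p2 = aggregate(Arrays.asList("shanghai", "shenzhen"));
        // evaluate: the merged buffer is the final result.
        System.out.println(merge(p1, p2)); // beijing,shanghai,shenzhen
    }
}
```

Comparing whole entries instead of calling contains on the raw string matters when one value is a substring of another, and merge has to reinsert the separator or adjacent values run together.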

Registering the custom functions as temporary functions

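import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

def main(args: Array[String]): Unit = {
    /**
      * Step 1: create the program entry point
      */
    val conf = new SparkConf().setAppName("AralHotProductSpark")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    // Register the aggregate function as a temporary function
    hiveContext.udf.register("get_distinct_city", GetDistinctCityUDF)
    // Register a scalar UDF that extracts the product_status value
    hiveContext.udf.register("get_product_status", (str: String) => {
      var status = 0
      for (s <- str.split(",")) {
        if (s.contains("product_status")) {
          status = s.split(":")(1).toInt
        }
      }
      // Return the parsed value; without this line the lambda returns Unit
      status
    })
}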
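The parsing done by get_product_status can be tried outside Spark. The sketch below is plain Java, not the registered UDF itself; the class name ProductStatusParser and the sample inputs are hypothetical, and the "key:value,key:value" input format is an assumption inferred from the split(",") and split(":") calls above. It adds two guards the lambda does not have, for a null input and for an entry with no value after the colon.

```java
// Hypothetical illustration only: the "key:value,key:value" input format is an
// assumption inferred from the split(",") and split(":") calls in the lambda.
public class ProductStatusParser {

    // Returns the integer after "product_status:", or 0 if the key is absent.
    static int getProductStatus(String str) {
        int status = 0;
        if (str == null) {
            return status;
        }
        for (String s : str.split(",")) {
            if (s.contains("product_status")) {
                String[] kv = s.split(":");
                if (kv.length > 1) {
                    status = Integer.parseInt(kv[1].trim());
                }
            }
        }
        return status;
    }

    public static void main(String[] args) {
        System.out.println(getProductStatus("product_name:phone,product_status:1")); // 1
        System.out.println(getProductStatus("product_name:phone"));                  // 0
    }
}
```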

2. Window Functions

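The row_number() window function numbers the rows within each group of one column, ordered by another column. Filtering on that number returns the first few rows of every group, which amounts to a per-group top N.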

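In the Spark 1.x API used here, any SQL statement that uses a window function must be executed through a HiveContext, and by default a HiveContext cannot be created when running locally. Since Spark 2.0, SparkSession supports window functions without Hive.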

Window function syntax:

row_number() over (partition by XXX order by XXX)

Java:

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.DataFrame;
 import org.apache.spark.sql.hive.HiveContext;

 SparkConf conf = new SparkConf();
 conf.setAppName("windowfun");
 JavaSparkContext sc = new JavaSparkContext(conf);
 HiveContext hiveContext = new HiveContext(sc);
 hiveContext.sql("use spark");
 hiveContext.sql("drop table if exists sales");
 // Columns: riqi (date), leibie (category), jine (amount)
 hiveContext.sql("create table if not exists sales (riqi string,leibie string,jine Int) "
       + "row format delimited fields terminated by '\t'");
 hiveContext.sql("load data local inpath '/root/test/sales' into table sales");
 /**
  * Window function syntax:
  * row_number() over (partition by XXX order by XXX)
  */
 // Keep the three largest amounts in each category
 DataFrame result = hiveContext.sql("select riqi,leibie,jine "
       + "from ("
       + "select riqi,leibie,jine,"
       + "row_number() over (partition by leibie order by jine desc) rank "
       + "from sales) t "
       + "where t.rank<=3");
 result.show();
 sc.stop();

Scala:

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.hive.HiveContext

 val conf = new SparkConf()
 conf.setAppName("windowfun")
 val sc = new SparkContext(conf)
 val hiveContext = new HiveContext(sc)
 hiveContext.sql("use spark")
 hiveContext.sql("drop table if exists sales")
 // Columns: riqi (date), leibie (category), jine (amount)
 hiveContext.sql("create table if not exists sales (riqi string,leibie string,jine Int) "
   + "row format delimited fields terminated by '\t'")
 hiveContext.sql("load data local inpath '/root/test/sales' into table sales")
 /**
  * Window function syntax:
  * row_number() over (partition by XXX order by XXX)
  */
 // Keep the three largest amounts in each category
 val result = hiveContext.sql("select riqi,leibie,jine "
   + "from ("
   + "select riqi,leibie,jine,"
   + "row_number() over (partition by leibie order by jine desc) rank "
   + "from sales) t "
   + "where t.rank<=3")
 result.show()
 sc.stop()
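What the query computes can be reproduced on an in-memory list. The following is a minimal plain-Java sketch of the same partition, sort, and number steps, not Spark code; the class TopNPerGroup, the nested Sale class, and the sample rows are hypothetical. Only the field names riqi, leibie, and jine come from the table definition above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: emulates
//   row_number() over (partition by leibie order by jine desc) ... where rank <= n
// on an in-memory list; no Spark classes are involved.
public class TopNPerGroup {

    static class Sale {
        final String riqi;   // date
        final String leibie; // category
        final int jine;      // amount
        Sale(String riqi, String leibie, int jine) {
            this.riqi = riqi; this.leibie = leibie; this.jine = jine;
        }
        @Override public String toString() {
            return riqi + "|" + leibie + "|" + jine;
        }
    }

    static List<Sale> topN(List<Sale> sales, int n) {
        // partition by leibie (LinkedHashMap keeps first-seen category order)
        Map<String, List<Sale>> partitions = new LinkedHashMap<>();
        for (Sale s : sales) {
            partitions.computeIfAbsent(s.leibie, k -> new ArrayList<>()).add(s);
        }
        List<Sale> result = new ArrayList<>();
        for (List<Sale> part : partitions.values()) {
            // order by jine desc
            part.sort(Comparator.comparingInt((Sale s) -> s.jine).reversed());
            // row_number() is the 1-based position; keep rows with rank <= n
            for (int rank = 1; rank <= part.size() && rank <= n; rank++) {
                result.add(part.get(rank - 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
            new Sale("d1", "A", 10), new Sale("d2", "A", 40),
            new Sale("d3", "A", 30), new Sale("d4", "A", 20),
            new Sale("d5", "B", 5));
        System.out.println(topN(sales, 3)); // [d2|A|40, d3|A|30, d4|A|20, d5|B|5]
    }
}
```

Unlike rank(), row_number() gives tied rows distinct consecutive numbers, so each category contributes at most n rows.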

Reposted from: https://www.cnblogs.com/frankdeng/p/9301712.html
