Spark SQL Function Operations

阿新 • Published: 2018-12-31

Spark built-in functions

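You can analyze data with the built-in functions of Spark SQL. Unlike the rest of the Spark SQL API, a built-in function applied to a DataFrame returns a Column object, and a DataFrame is by definition "A distributed collection of data organized into named columns." This gives complex analysis a solid foundation and makes it very convenient: built-in functions can be called at any point while operating on a DataFrame to do whatever processing the business requires. When building complex business logic this removes a lot of unnecessary work (the code is essentially a direct mapping of the actual model) and lets us focus on the data analysis itself, which is very valuable for engineer productivity.

Spark 1.5.x and later provide a large number of built-in functions, including max, mean, min, sum, avg, explode, size, sort_array, day, to_date, abs, acos, asin and atan.
Broadly, the built-in functions fall into the following categories: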

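Aggregate functions, e.g. countDistinct, sumDistinct
Collection functions, e.g. sort_array, explode
Date and time functions, e.g. hour, quarter, next_day
Math functions, e.g. asin, atan, sqrt, tan, round
Window functions, e.g. rowNumber (renamed row_number in Spark 1.6)
String functions, e.g. concat, format_number, regexp_extract
Other functions, e.g. isNaN, sha, randn, callUDF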

Word count in Hive with a single statement

select t.wd ,count(t.wd) as count from (select explode(split(line," ")) as wd from word) t group by t.wd;
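
The same word count can also be expressed with the built-in functions on a DataFrame. The following is a minimal sketch, assuming Spark 1.6.x; the object name WordCountWithFunctions and the two sample lines are hypothetical stand-ins for the Hive table word.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, explode, split}

object WordCountWithFunctions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountWithFunctions").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical sample data standing in for the Hive table `word` (one column: line).
    val lines = Seq(Tuple1("hello spark"), Tuple1("hello hive"))
    val word = sqlContext.createDataFrame(lines).toDF("line")

    // split(line, " ") yields an array of words; explode turns it into one row per word.
    // groupBy + count then gives the number of occurrences of each word.
    val counts = word
      .select(explode(split(col("line"), " ")).as("wd"))
      .groupBy("wd")
      .count()

    counts.show()
    sc.stop()
  }
}
```

Each call here (split, explode, as) returns a Column, which is why the calls can be nested directly inside select.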

When calling these functions from program code, remember to import functions:

import org.apache.spark.sql.types.{DataTypes, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, functions}

object SparkSQLFunctionOps {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkSQLFunctionOps").setMaster("local");
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val linesRDD = sc.textFile("E:/test/scala/sql-rdd-source.txt")
    val rowRDD = linesRDD.map(line => {
      val splits = line.split(",")
      Row(splits(0).trim.toInt,splits(1).trim,splits(2).trim.toInt,splits(3).trim.toInt)
    })
    val structType = StructType(Array(
      DataTypes.createStructField("id",DataTypes.IntegerType,true),
      DataTypes.createStructField("name",DataTypes.StringType,true),
      DataTypes.createStructField("age",DataTypes.IntegerType,true),
      DataTypes.createStructField("height",DataTypes.IntegerType,true)
    ))

    val df = sqlContext.createDataFrame(rowRDD,structType)
    df.registerTempTable("person")
    df.show()
    /**
      * Query the data in df:
      * first the average and maximum age,
      * then the total height.
      * */
    sqlContext.sql("select avg(age) from person").show()
    sqlContext.sql("select max(age) from person").show()
    sqlContext.sql("select sum(height) from person").show()
    sc.stop()
  }
}
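
The program above imports functions but only runs SQL strings. As a hedged sketch of the DataFrame-side equivalent, assuming Spark 1.6.x, the same three aggregates can be computed in one pass with avg, max and sum from org.apache.spark.sql.functions. The object name AggWithFunctions and the sample rows are hypothetical; the real program reads them from sql-rdd-source.txt.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, max, sum}

object AggWithFunctions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AggWithFunctions").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical rows with the same schema as the `person` table: id, name, age, height.
    val df = sqlContext.createDataFrame(Seq(
      (1, "zhangsan", 13, 175),
      (2, "lisi", 14, 180),
      (3, "wangwu", 15, 168)
    )).toDF("id", "name", "age", "height")

    // Each function returns a Column; agg evaluates all three in a single job.
    df.agg(avg("age"), max("age"), sum("height")).show()

    sc.stop()
  }
}
```

Compared with three separate sqlContext.sql calls, a single agg scans the data once and does not require registering a temporary table.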

Java version

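import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkSQLFunctionJava {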
public static void main(String[] args) {
    SparkConf conf = new SparkConf( ).setAppName(SparkSQLFunctionJava.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);
   JavaRDD<String> linesRDD = sc.textFile("E:/test/scala/sql-rdd-source.txt");

    JavaRDD<Row> rowRDD = linesRDD.map(new Function<String, Row>() {
        @Override
        public Row call(String s) throws Exception {
            String splits[] = s.split(",");
            return  RowFactory.create(Integer.valueOf(splits[0].trim()),splits[1].trim(),Integer.valueOf(splits[2].trim()),Integer.valueOf(splits[3].trim()));
        }
    });
    StructField structFields[] = new StructField[4];
    structFields[0] = DataTypes.createStructField("id",DataTypes.IntegerType,true);
    structFields[1] = DataTypes.createStructField("name",DataTypes.StringType,true);
    structFields[2] = DataTypes.createStructField("age",DataTypes.IntegerType,true);
    structFields[3] = DataTypes.createStructField("height",DataTypes.IntegerType,true);
    StructType structType = new StructType(structFields);
    DataFrame dataFrame = sqlContext.createDataFrame(rowRDD, structType);
    dataFrame.show();
    dataFrame.registerTempTable("person");
    /**
     * Query the data in the DataFrame:
     * first the average and maximum age,
     * then the total height.
     * */
    sqlContext.sql("select avg(age) from person").show();
    sqlContext.sql("select max(age) from person").show();
    sqlContext.sql("select sum(height) from person").show();
    sc.close();
}
}

Changing the Spark runtime log level

cp conf/log4j.properties.template conf/log4j.properties
vim conf/log4j.properties    # change the INFO level to ERROR
The Spark cluster must be restarted for the change to take effect.
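
For a single application, the log level can also be set in code without touching the cluster configuration. This is a minimal sketch, assuming Spark 1.4 or later, where SparkContext.setLogLevel is available; the object name QuietLogs is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object QuietLogs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QuietLogs").setMaster("local")
    val sc = new SparkContext(conf)

    // Overrides the log4j level for this application only.
    // Messages logged before this call still use the level from log4j.properties.
    sc.setLogLevel("ERROR")

    // Any job run from here on logs only errors.
    println(sc.parallelize(1 to 10).count())

    sc.stop()
  }
}
```

This only affects the running application, so no cluster restart is needed; editing log4j.properties remains the way to change the default for every job.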

Showing column headers in Hive

set hive.cli.print.header=true;

Spark SQL window functions

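1. In Spark 1.5.x and later, window functions are available in both Spark SQL and the DataFrame API. The classic example is row_number(), which lets us take the top N rows within each group.

2. The example below computes a top N with a Spark window function. You may recall that we computed top N much earlier on, and it was quite tedious at the time. With Spark SQL it becomes very convenient.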

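import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.DataTypes

/**
  * Spark SQL window functions: row_number
  * */
object SparkSQLOpenWindowFunction {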
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkSQLOpenWindowFunction").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    val linesRDD = sc.textFile("E:/test/scala/topn.txt")
    val rowRDD = linesRDD.map(line =>{
      val splits = line.split(" ")
      Row(splits(0).trim,splits(1).trim.toInt)
    })
    val structType = DataTypes.createStructType(Array(
      DataTypes.createStructField("class",DataTypes.StringType,true),
      DataTypes.createStructField("score",DataTypes.IntegerType,true)
    ))
    val df = sqlContext.createDataFrame(rowRDD,structType)
    df.registerTempTable("stu_score")
    /**
      * Query:
      * partition the rows by class, then take the top 3 scores within each class
      * */
    val topNDF = sqlContext.sql("select temp.* from (select *, row_number() over(partition by class order by score desc) rank from stu_score ) temp where temp.rank < 4")
    topNDF.show()
    sc.stop()
  }
}

Java version

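import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkSQLOpenWindowFunctionJava {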
public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName(SparkSQLOpenWindowFunctionJava.class.getSimpleName()).setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    HiveContext sqlContext = new HiveContext(sc);
    JavaRDD<String> linesRDD = sc.textFile("E:/test/scala/topn.txt");

    JavaRDD<Row> rowRDD = linesRDD.map(new Function<String, Row>() {
        @Override
        public Row call(String s) throws Exception {
            return RowFactory.create(s.split(" ")[0].trim(),Integer.valueOf(s.split(" ")[1].trim()));
        }
    });
    StructField structFields[] = new StructField[2];
    structFields[0] = DataTypes.createStructField("class",DataTypes.StringType,true);
    structFields[1] = DataTypes.createStructField("score",DataTypes.IntegerType,true);
    StructType structType = new StructType(structFields);
    DataFrame df = sqlContext.createDataFrame(rowRDD, structType);
    df.registerTempTable("stu_score");
    /**
     * Query:
     * partition the rows by class, then take the top 3 scores within each class
     * */
    DataFrame dataFrame = sqlContext.sql("select temp.* from (select *, row_number() over(partition by class order by score desc) rank from stu_score ) temp where temp.rank < 4");
    dataFrame.show();
    sc.close();
}
}
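
The same top-3 query can be written with the DataFrame window API instead of a SQL string. The following is a sketch under stated assumptions: Spark 1.6.x (where the function is named row_number; in 1.4 and 1.5 it was rowNumber), a HiveContext as in the programs above, and hypothetical sample rows in place of topn.txt. The object name TopNWithWindowApi is also hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
import org.apache.spark.sql.hive.HiveContext

object TopNWithWindowApi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopNWithWindowApi").setMaster("local")
    val sc = new SparkContext(conf)
    // In Spark 1.x, window functions are resolved only through a HiveContext.
    val sqlContext = new HiveContext(sc)

    // Hypothetical rows with the same schema as stu_score: class, score.
    val df = sqlContext.createDataFrame(Seq(
      ("class1", 90), ("class1", 87), ("class1", 95), ("class1", 76),
      ("class2", 88), ("class2", 92), ("class2", 60), ("class2", 71)
    )).toDF("class", "score")

    // Equivalent of: over(partition by class order by score desc)
    val byClass = Window.partitionBy("class").orderBy(col("score").desc)

    // Number the rows inside each class, then keep ranks 1 to 3.
    df.withColumn("rank", row_number().over(byClass))
      .where(col("rank") < 4)
      .show()

    sc.stop()
  }
}
```

Because row_number assigns consecutive numbers even to tied scores, exactly three rows per class survive the filter; rank or dense_rank would be the choice if ties should share a position.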

Running in spark-sql:

First create a table in Hive:
create table topn (class string, score int) row format delimited fields terminated by ' ';
Then load the data into the table:
load data local inpath '/opt/data/spark/topn.txt' into table topn;
Now the window function can be used to rank the rows within each group:
select temp.* from (select *, row_number() over(partition by class order by score desc) rank from topn) temp where temp.rank < 4;

UDF: user-defined functions

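UDF stands for User Defined Function.
A UDF in the usual sense is a one-to-one mapping: for each input row it takes its arguments and returns a single output value.
Steps to create a UDF:
1. Define a custom function func
2. Register it with sqlContext.udf.register("some name", func _)
3. Use it directly in SQL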

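import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{DataTypes, StructType}

object SparkSQLUDFOps {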
def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkSQLUDFOps").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val linesRDD = sc.textFile("E:/test/scala/sql-rdd-source.txt")
    val rowRDD = linesRDD.map(line => {
        // Split on the bare comma, as in the earlier examples; each field is trimmed below
        val splits = line.split(",")
        Row(splits(0).trim.toInt, splits(1).trim, splits(2).trim.toInt, splits(3).trim.toInt)
    })
    val structType = StructType(Array(
        DataTypes.createStructField("id", DataTypes.IntegerType, true),
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("age", DataTypes.IntegerType, true),
        DataTypes.createStructField("height", DataTypes.IntegerType, true)
    ))
    val df = sqlContext.createDataFrame(rowRDD, structType)
    df.registerTempTable("person")
    // 2. Register the custom UDFs
    /**
      * Two ways to register: from a named method, or from an anonymous function
      */
    sqlContext.udf.register("myLen", myLen _)
    sqlContext.udf.register("len", (str:String, len:Int) => str.length > len)
    // 3. Use them in SQL
    sqlContext.sql("select id, name, myLen(name) as len from person where len(name, 5)").show()
    sc.stop()
}

// 1. Define a custom function that returns the length of a string
def myLen(str: String) = str.length

}
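
Registered UDFs are only visible to SQL strings. To use the same logic from the DataFrame API, a function can be wrapped with udf from org.apache.spark.sql.functions. This is a minimal sketch, assuming Spark 1.6.x; the object name UdfWithDataFrameApi, the value name longerThan and the sample rows are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, lit, udf}

object UdfWithDataFrameApi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UdfWithDataFrameApi").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical rows with the same schema as the `person` table.
    val df = sqlContext.createDataFrame(Seq(
      (1, "zhangsan", 13, 175),
      (2, "lisi", 14, 180),
      (3, "wangwu", 15, 168)
    )).toDF("id", "name", "age", "height")

    // udf(...) wraps a Scala function so that it accepts and returns Columns.
    val myLen = udf((str: String) => str.length)
    val longerThan = udf((str: String, len: Int) => str.length > len)

    // Same query as the SQL version: keep names longer than 5 characters
    // and show their length. lit(5) turns the constant into a Column.
    df.where(longerThan(col("name"), lit(5)))
      .select(col("id"), col("name"), myLen(col("name")).as("len"))
      .show()

    sc.stop()
  }
}
```

Neither form handles null names: str.length throws on null, so a null check inside the function would be needed if the column is nullable in practice.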
