
Spark SQL

1) DataFrame and Dataset

DataFrame and Dataset are the programming models of Spark SQL. You can think of either of them as a two-dimensional MySQL table: a table name, a header with field names and field types, and the data rows. An RDD can also be viewed as a two-dimensional table, but compared with DataFrame and Dataset it is missing something: an RDD carries only the data.

Dataset is an API introduced in Spark 1.6; DataFrame appeared in 1.3. In its early days DataFrame was called SchemaRDD: compared with an ordinary RDD, it carries an extra schema (table name, field names, field types), which is the so-called metadata.
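The difference can be seen in a few lines. This is a minimal sketch using the modern `SparkSession` entry point (the code later in this article uses the older `HiveContext`); the object and column names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object SchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SchemaDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // An RDD carries only the data -- no column names, no column types.
    val rdd = spark.sparkContext.parallelize(Seq(("hello", 1), ("world", 2)))

    // Converting it to a DataFrame attaches a schema (field names + types),
    // i.e. exactly the metadata a plain RDD lacks.
    val df = rdd.toDF("word", "count")
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```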

2) Spark SQL optimization: cache
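A minimal sketch of table caching, assuming the same `hiveContext` and `test` table used in the skew example below: once a table is cached, repeated SQL over it reads the in-memory columnar copy instead of rescanning the source.

```scala
// Cache the "test" table in memory (materialized on first use).
hiveContext.cacheTable("test")

// The first query populates the cache; later queries are served from it.
hiveContext.sql("select count(*) from test").show()
hiveContext.sql("select count(*) from test").show()

// Release the memory when the table is no longer needed.
hiveContext.uncacheTable("test")
```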

3) Spark SQL optimization: data skew in group by (solution: local aggregation with a random prefix first, then global aggregation)

// Step 1: local aggregation -- prepend a random prefix (0..3) to each word
// so a skewed key is spread across 4 partial keys.
val sql2 =
  """
    |select t2.prefix_word, count(1) countz from (
    |  select
    |    concat_ws("_", cast(floor(rand() * 4) as string), t1.word) as prefix_word
    |  from (
    |    select explode(split(line, " ")) word
    |    from test
    |  ) t1
    |) t2
    |group by t2.prefix_word
  """.stripMargin
hiveContext.sql(sql2).show()
// Step 2: global aggregation -- strip the random prefix, then sum the
// partial counts to get the true count per word.
val sql3 =
  """
    |select t4.up_word, sum(t4.countz) from (
    |  select substr(t3.prefix_word, instr(t3.prefix_word, '_') + 1) up_word, t3.countz
    |  from (
    |    select t2.prefix_word, count(1) countz from (
    |      select
    |        concat_ws("_", cast(floor(rand() * 4) as string), t1.word) as prefix_word
    |      from (
    |        select explode(split(line, " ")) word
    |        from test
    |      ) t1
    |    ) t2
    |    group by t2.prefix_word
    |  ) t3
    |) t4
    |group by t4.up_word
  """.stripMargin
hiveContext.sql(sql3).show()