
Spark SQL

1) DataFrame and Dataset

DataFrame and Dataset are the programming models of Spark SQL. You can think of either of them as a two-dimensional MySQL table: a table name, a header with field names and field types, and the data rows. An RDD can also be viewed as a two-dimensional table, but compared with DataFrame and Dataset it is missing something: an RDD carries only the data.

Dataset is an API introduced in Spark 1.6; DataFrame appeared in 1.3. In its early days DataFrame was called SchemaRDD: compared with an ordinary RDD, it carries an extra schema (table name, field names, field types), which is the so-called metadata.
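The difference can be seen in a few lines. This is a minimal sketch using the modern `SparkSession` entry point (the code later in this article uses the older `HiveContext`); the object and column names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object SchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SchemaDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // An RDD carries only the data -- no column names, no column types.
    val rdd = spark.sparkContext.parallelize(Seq(("hello", 1), ("world", 2)))

    // Converting it to a DataFrame attaches a schema (field names + types),
    // i.e. exactly the metadata a plain RDD lacks.
    val df = rdd.toDF("word", "count")
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```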

2) Spark SQL optimization: cache
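A minimal sketch of table caching, assuming the same `hiveContext` and `test` table used in the skew example below: once a table is cached, repeated SQL over it reads the in-memory columnar copy instead of rescanning the source.

```scala
// Cache the "test" table in memory (materialized on first use).
hiveContext.cacheTable("test")

// The first query populates the cache; later queries are served from it.
hiveContext.sql("select count(*) from test").show()
hiveContext.sql("select count(*) from test").show()

// Release the memory when the table is no longer needed.
hiveContext.uncacheTable("test")
```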

3) Spark SQL optimization: data skew in group by (solution: local aggregation with a random prefix first, then global aggregation)

// Step 1: local aggregation -- prepend a random prefix (0..3) to each word
// so a skewed key is spread across 4 partial keys.
val sql2 =
  """
    |select t2.prefix_word, count(1) countz from (
    |  select
    |    concat_ws("_", cast(floor(rand() * 4) as string), t1.word) as prefix_word
    |  from (
    |    select explode(split(line, " ")) word
    |    from test
    |  ) t1
    |) t2
    |group by t2.prefix_word
  """.stripMargin
hiveContext.sql(sql2).show()
// Step 2: global aggregation -- strip the random prefix, then sum the
// partial counts to get the true count per word.
val sql3 =
  """
    |select t4.up_word, sum(t4.countz) from (
    |  select substr(t3.prefix_word, instr(t3.prefix_word, '_') + 1) up_word, t3.countz
    |  from (
    |    select t2.prefix_word, count(1) countz from (
    |      select
    |        concat_ws("_", cast(floor(rand() * 4) as string), t1.word) as prefix_word
    |      from (
    |        select explode(split(line, " ")) word
    |        from test
    |      ) t1
    |    ) t2
    |    group by t2.prefix_word
    |  ) t3
    |) t4
    |group by t4.up_word
  """.stripMargin
hiveContext.sql(sql3).show()