
Training Series 12 -- Registering a Spark DataFrame as a Hive Temporary Table

1. Obtain the initial data and build a DataFrame

import spark.implicits._   // needed for rdd.toDF(...); spark-shell imports this automatically

// Read the raw listing files as lines of text.
val ny = sc.textFile("data/new_york/")

// Drop the header row and any malformed lines (a valid row has 14 fields).
val header = ny.first
val filterNY = ny.filter(listing => {
  listing.split(",").size == 14 && listing != header
})

// Keep only the id, room type, price and neighborhood fields.
val nyMap = filterNY.map(listing => {
  val listingInfo = listing.split(",")
  (listingInfo(0).toInt, listingInfo(2), listingInfo(9).toFloat, listingInfo(4))
})
nyMap.take(20).foreach(println)

// Turn the RDD of tuples into a DataFrame with named columns.
val nyDF = nyMap.toDF("Room_ID", "Room_Type", "Price", "Neighborhood")

nyDF.show
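
Note that a hand-rolled split(",") breaks on quoted fields that themselves contain commas. As a more robust alternative, Spark 2.x's built-in CSV reader can produce an equivalent DataFrame directly. This is a minimal sketch, and the source column names (id, room_type, price, neighbourhood) are assumptions about the file's header row:

// Sketch: load the same listings via the CSV data source instead of split(",").
// The selected source column names are assumed, not taken from the original post.
val nyCsv = spark.read
  .option("header", "true")        // first line holds column names
  .option("inferSchema", "true")   // derive numeric types automatically
  .csv("data/new_york/")
val nyDF2 = nyCsv
  .select("id", "room_type", "price", "neighbourhood")
  .toDF("Room_ID", "Room_Type", "Price", "Neighborhood")
nyDF2.printSchema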

2. Register it as a temporary table

// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is the current name
nyDF.createOrReplaceTempView("listings")

// If this step fails, copy Hive's configuration into Spark's conf directory
// so that Spark can reach the Hive metastore:

cp /etc/hive/conf/hive-site.xml /etc/spark/conf
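
To confirm that the view is actually visible to the session, the catalog can be queried (a quick check in spark-shell; "listings" should appear with isTemporary = true):

// List the tables and views visible in the current session.
spark.catalog.listTables().show(false)
// or, equivalently:
spark.sql("show tables").show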

3. Run HQL against the temporary table: SELECT, aggregate functions, GROUP BY, ORDER BY, WHERE, and so on

spark.sql("select * from listings limit 10").show

The result of executing SQL is itself a DataFrame, so operations such as show can be applied to it.
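
For example, the returned DataFrame can be inspected or transformed further with the DataFrame API before being displayed (a small sketch that reuses the listings view registered above):

// The sql() result is an ordinary DataFrame: check its schema or keep filtering it.
val result = spark.sql("select Room_ID, Price from listings")
result.printSchema
result.filter($"Price" < 100).show(5)   // listings cheaper than 100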

scala> spark.sql("select Neighborhood, count(Price) as cnt, round(avg(Price), 2) as avgp from listings where Room_Type != '' group by Neighborhood order by cnt desc").show
+------------------+-----+------+
|      Neighborhood|  cnt|  avgp|
+------------------+-----+------+
|      Williamsburg|77973|148.53|
|Bedford-Stuyvesant|54667|108.34|
|            Harlem|50551|122.45|
|      East Village|43979|196.45|
|   Upper West Side|42466| 215.9|
|          Bushwick|37670| 87.97|
|   Upper East Side|36950|205.24|
|    Hell's Kitchen|34202|217.42|
|     Crown Heights|28033|106.34|
|           Chelsea|25551|258.95|
|   Lower East Side|22982|184.73|
|           Midtown|22746|286.51|
|       East Harlem|22393|133.19|
|        Greenpoint|21167|144.98|
|      West Village|20370|285.79|
|Washington Heights|18276| 99.08|
|           Astoria|15586|110.39|
|      Clinton Hill|12244|191.38|
|          Flatbush|11430| 96.74|
|        Park Slope|11294|173.55|
+------------------+-----+------+
only showing top 20 rows

These are all basic HQL/SQL constructs, so they need no further explanation.
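
For reference, the same aggregation can also be expressed with the DataFrame API instead of HQL; this sketch is equivalent to the query above:

import org.apache.spark.sql.functions.{avg, count, desc, round}

// GROUP BY / aggregate / ORDER BY written as DataFrame operations on nyDF.
nyDF.filter($"Room_Type" =!= "")
  .groupBy("Neighborhood")
  .agg(count("Price").as("cnt"), round(avg("Price"), 2).as("avgp"))
  .orderBy(desc("cnt"))
  .show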