Spark: How to replace sc.parallelize(List(item1,item2)).collect().foreach(row=>{}) with parallel execution?

阿新 • Published: 2018-03-04

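Scenario:

1) Several data scenarios are predefined. The job iterates over all of them, computes statistics over the data that satisfies each scenario's conditions in turn, and writes the results to Hive.

2) The existing code is as follows: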

    case class IndoorOTTCalibrateBuildingVecotrLegend(oid: Int, minHeight: Int, maxHeight: Int, minGridIDCount: Int, maxGridIDCount: Int, heightType: Int) extends Serializable

    // Instantiate the building ranges: scenarios are split by grid count (area) and building height (malls and similar)
    val buildingHeightLegends = List(
      IndoorOTTCalibrateBuildingVecotrLegend(1, 1, 30, 1, 21, BuildingCalibrateHeightType.HeightType1.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(2, 1, 30, 21, 45, BuildingCalibrateHeightType.HeightType2.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(3, 1, 30, 45, 100, BuildingCalibrateHeightType.HeightType3.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(4, 30, 50, 1, 21, BuildingCalibrateHeightType.HeightType4.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(5, 30, 50, 21, 45, BuildingCalibrateHeightType.HeightType5.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(6, 30, 50, 45, 100, BuildingCalibrateHeightType.HeightType6.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(7, 50, 5000, 1, 100, BuildingCalibrateHeightType.HeightType7.toString.toInt)
    )

    spark.sparkContext.parallelize(buildingHeightLegends).collect().foreach(buildingHeightLegend => {
      generateSampleBySenceType(spark, p_city, p_hour_start, p_hour_end, p_fpb_day, p_day_sample, linkLossCalibrateParameter, buildingHeightLegend)
    })

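Note: collect() returns the list to the driver, so the foreach above runs the scenarios one after another on the driver rather than in parallel.

generateSampleBySenceType() internally contains: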

spark.sql(s"""
|xxx
|where t10.height>=${buildingHeightLegend.minHeight} and t10.height<${buildingHeightLegend.maxHeight}
|and t10.gridcount>=${buildingHeightLegend.minGridIDCount} and t10.gridcount<${buildingHeightLegend.maxGridIDCount}
|""".stripMargin)

If the code is changed to:

    val buildingHeightLegends_df = spark.sqlContext.createDataFrame(buildingHeightLegends)
    buildingHeightLegends_df.createOrReplaceTempView("temp_buildingheightlegends")

    spark.sql(s"""|select * from temp_buildingheightlegends""".stripMargin).repartition(buildingHeightLegends.length).foreachPartition(rows => {
      // This closure runs on the executors, not on the driver.
      for (row <- rows) {
        // Row.getAs looks fields up by exact name, so use the case class field names.
        val buildingHeightLegend = new IndoorOTTCalibrateBuildingVecotrLegend(
          row.getAs[Int]("oid"),
          row.getAs[Int]("minHeight"),
          row.getAs[Int]("maxHeight"),
          row.getAs[Int]("minGridIDCount"),
          row.getAs[Int]("maxGridIDCount"),
          row.getAs[Int]("heightType"))
        generateSampleBySenceType(spark, p_city, p_hour_start, p_hour_end, p_fpb_day, p_day_sample, linkLossCalibrateParameter, buildingHeightLegend)
      }
    })

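then the job fails: the SQL call inside generateSampleBySenceType() throws an exception because the SparkSession is null. The closure passed to foreachPartition runs on the executors, and a SparkSession can only be used on the driver.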
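One alternative that keeps the per-scenario loop on the driver is to submit the scenarios from separate driver threads, since Spark can run jobs submitted from different threads of the same application concurrently. The sketch below is not the approach this post adopts; it only shows the threading pattern with Scala Futures. `DriverSideParallelSketch`, `runScene` and `sceneIds` are hypothetical names, and `runScene` is a stub standing in for generateSampleBySenceType(), so no Spark is needed to run it.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DriverSideParallelSketch {
  // Hypothetical stand-in for generateSampleBySenceType(spark, ..., legend).
  // In a real job this would call spark.sql(...) on the driver.
  def runScene(oid: Int): String = s"scene-$oid done"

  // One id per scenario, matching oid 1..7 in buildingHeightLegends.
  val sceneIds: List[Int] = (1 to 7).toList

  // Start one Future per scenario, then wait for all of them.
  // Future.sequence keeps the results in the order of the input list.
  def runAll(): List[String] = {
    val futures = sceneIds.map(id => Future(runScene(id)))
    Await.result(Future.sequence(futures), 1.minute)
  }

  def main(args: Array[String]): Unit =
    runAll().foreach(println)
}
```

Whether this actually shortens the run depends on the resources available to the application and on its scheduler settings, so treat it as something to measure rather than a guaranteed speed-up.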

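Fix:

Register buildingHeightLegends as the temporary table temp_buildingHeightLegends, remove the outer foreach, and then, inside generateSampleBySenceType(), cross join temp_buildingHeightLegends with the other result sets. All scenarios are then evaluated in a single query instead of one query per scenario.

The test code is as follows: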

-- Scenario table
CREATE TABLE [dbo].[test_senceitems](
    [sencetype] [int] NULL,
    [minheight] [int] NULL,
    [maxheight] [int] NULL,
    [mingridcount] [int] NULL,
    [maxgridcount] [int] NULL
)
INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (1, 1, 30, 1, 21)
INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (2, 1, 30, 21, 45)
INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (3, 1, 30, 45, 100)
INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (4, 30, 50, 1, 21)
INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (5, 30, 50, 21, 45)
INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (6, 30, 50, 45, 100)
INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (7, 50, 5000, 1, 100)

-- Business table to be filtered and aggregated
CREATE TABLE [dbo].[test_grid](
    [gridid] [nvarchar](50) NULL,
    [height] [int] NULL,
    [gridcount] [int] NULL
) 

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g1', 8, 23)
INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g2', 3, 87)
INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g3', 4, 34)
INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g4', 30, 54)
INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g5', 32, 32)
INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g6', 32, 20)
INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g7', 120, 34)
INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g8', 89, 54)
INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g9', 9, 16)

The sql(s"""|""".stripMargin) code inside generateSampleBySenceType() is then replaced with something like the following:

select t10.*,t11.* 
from test_grid t10 
cross join test_senceitems t11
where t10.height>=t11.minheight and t10.height<t11.maxheight
and t10.gridcount>=t11.mingridcount and t10.gridcount<t11.maxgridcount
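To see what this join returns on the test data, the same logic can be sketched with plain Scala collections. This is only an illustration of the join semantics, not Spark code: the `Scene` and `Grid` case classes and the `SceneJoinSketch` object are hypothetical names, and the rows are copied from the INSERT statements above.

```scala
// Pure-Scala stand-in for the cross join query; no Spark required.
case class Scene(sencetype: Int, minheight: Int, maxheight: Int, mingridcount: Int, maxgridcount: Int)
case class Grid(gridid: String, height: Int, gridcount: Int)

object SceneJoinSketch {
  // Rows of test_senceitems.
  val scenes: List[Scene] = List(
    Scene(1, 1, 30, 1, 21),
    Scene(2, 1, 30, 21, 45),
    Scene(3, 1, 30, 45, 100),
    Scene(4, 30, 50, 1, 21),
    Scene(5, 30, 50, 21, 45),
    Scene(6, 30, 50, 45, 100),
    Scene(7, 50, 5000, 1, 100)
  )

  // Rows of test_grid.
  val grids: List[Grid] = List(
    Grid("g1", 8, 23), Grid("g2", 3, 87), Grid("g3", 4, 34),
    Grid("g4", 30, 54), Grid("g5", 32, 32), Grid("g6", 32, 20),
    Grid("g7", 120, 34), Grid("g8", 89, 54), Grid("g9", 9, 16)
  )

  // Pair every grid with every scene (the cross join), then keep the pairs
  // where both half-open ranges [min, max) contain the grid's values.
  val matches: List[(Grid, Scene)] = for {
    g <- grids
    s <- scenes
    if g.height >= s.minheight && g.height < s.maxheight
    if g.gridcount >= s.mingridcount && g.gridcount < s.maxgridcount
  } yield (g, s)

  def main(args: Array[String]): Unit =
    matches.foreach { case (g, s) => println(s"${g.gridid} -> scene ${s.sencetype}") }
}
```

Because the seven scenario ranges do not overlap, each grid row is paired with exactly one scenario, so the single joined result carries the scenario type that the original loop passed in one call at a time.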
