
Spark RDD Word Count: Source Code Analysis

1 Hand-written word count

// Set the application name; "local" runs Spark in local mode
val conf = new SparkConf().setAppName("WC").setMaster("local")
// SparkContext is the entry point to the Spark cluster
val sc = new SparkContext(conf)
// The full word-count pipeline (input path in args(0), output path in args(1)):
// sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false).saveAsTextFile(args(1))
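As a sketch for reference, the same pipeline can be written out step by step so that each transformation lines up with the RDDs in the debug output below. The val names and the output path are illustrative only; the input path matches the one shown in the lineage:

// Each transformation named, with the RDDs it produces noted on the right
val lines  = sc.textFile("file:///c:/tools/test/data/a.txt")  // HadoopRDD[0] -> MapPartitionsRDD[1]
val words  = lines.flatMap(_.split(" "))                      // MapPartitionsRDD[2]
val pairs  = words.map((_, 1))                                // MapPartitionsRDD[3]
val counts = pairs.reduceByKey(_ + _)                         // ShuffledRDD[4]
val sorted = counts.sortBy(_._2, false)                       // MapPartitionsRDD[5], ShuffledRDD[6], MapPartitionsRDD[7]
sorted.saveAsTextFile("file:///c:/tools/test/out")            // action: triggers the whole lineage (output path is a placeholder)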

2 Local debug information

3 Local debug output (RDD lineage)

(1) MapPartitionsRDD[7] at sortBy at SparkWordCount.scala:21 []
 |  ShuffledRDD[6] at sortBy at SparkWordCount.scala:21 []
 +-(1) MapPartitionsRDD[5] at sortBy at SparkWordCount.scala:21 []
    |  ShuffledRDD[4] at reduceByKey at SparkWordCount.scala:21 []
    +-(1) MapPartitionsRDD[3] at map at SparkWordCount.scala:21 []
       |  MapPartitionsRDD[2] at flatMap at SparkWordCount.scala:21 []
       |  MapPartitionsRDD[1] at textFile at SparkWordCount.scala:20 []
       |  file:///c:/tools/test/data/a.txt HadoopRDD[0] at textFile at SparkWordCount.scala:20 []
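The lineage above is what toDebugString prints for the final RDD. A minimal way to reproduce it, reusing the sorted RDD from the sketch in section 1:

// Print the RDD lineage before running an action.
// Lines joined by "|" belong to the same stage (narrow dependencies);
// "+-" marks a shuffle boundary introduced by a ShuffledRDD.
println(sorted.toDebugString)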


Source code analysis

1. sc.textFile(...) produces two RDDs: HadoopRDD[0] and MapPartitionsRDD[1]

// textFile produces two RDDs

/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString)
}
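In other words, textFile is just hadoopFile with TextInputFormat plus a map that keeps only the line text. A roughly equivalent call (a sketch, dropping the byte-offset key) would be:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Read (byte offset, line) pairs and keep only the line text, like textFile does
val linesByHand = sc
  .hadoopFile("file:///c:/tools/test/data/a.txt",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map(pair => pair._2.toString)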

The first RDD, HadoopRDD, mainly wraps the input path, the broadcast Hadoop configuration, and the file input type (InputFormat):

/**
 * Get an RDD for a Hadoop file with an arbitrary InputFormat
 *
 * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
 * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
 * operation will create many references to the same object.
 * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
 * copy them using a `map` function.
 */
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
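The note about the RecordReader reusing the same Writable object matters if you cache or aggregate the raw pairs directly. A sketch of the safe pattern, which copies the value out first (exactly what textFile's map(pair => pair._2.toString) does):

// Do NOT cache the raw (LongWritable, Text) pairs: the Text instance is reused per record.
val safeLines = sc
  .hadoopFile("file:///c:/tools/test/data/a.txt",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) => text.toString } // materialize an immutable String copy
  .cache()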

The second RDD comes from calling map on the (offset, line) pairs to extract just the values, i.e. the line text:

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

2. flatMap(_.split(" ")) produces MapPartitionsRDD[2]

flatMap cleans (checks) the user function and then delegates to the Scala iterator's flatMap:

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
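A quick sketch of what flatMap does to the word-count input, using inline sample data:

// flatMap splits each line and flattens the per-line arrays into one RDD of words
val sampleWords = sc.parallelize(Seq("hello spark", "hello world"))
  .flatMap(_.split(" "))
// sampleWords.collect() => Array(hello, spark, hello, world)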

3. map((_, 1)) produces MapPartitionsRDD[3]

map pairs each element with 1, returning (word, 1) tuples:

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
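Continuing the small sketch above, map((_, 1)) pairs every word with an initial count of 1:

// Each word becomes a (word, 1) tuple, ready for key-based aggregation
val samplePairs = sampleWords.map((_, 1))
// samplePairs.collect() => Array((hello,1), (spark,1), (hello,1), (world,1))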

4. reduceByKey(_ + _) produces ShuffledRDD[4]

reduceByKey creates a ShuffledRDD that aggregates in two steps: (1) local, map-side combining within each partition, then (2) fetching the map outputs and merging them into the global result.

/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce.
 */
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

/**
 * :: Experimental ::
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C.
 * Note that V and C can be different -- for example, one might group an RDD of type
 * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
 *
 *  - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
 *  - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
 *  - `mergeCombiners`, to combine two C's into a single one.
 *
 * In addition, users can control the partitioning of the output RDD, and whether to perform
 * map-side aggregation (if a mapper can produce multiple items with the same key).
 */
@Experimental
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
  }
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  if (self.partitioner == Some(partitioner)) {
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
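The three functions in the doc comment map directly onto word count: reduceByKey(_ + _) passes the identity as createCombiner and _ + _ as both mergeValue and mergeCombiners. A sketch using the public combineByKey API on the pairs RDD from section 1 makes the three roles explicit:

// Word count written with combineByKey; equivalent to pairs.reduceByKey(_ + _)
val countsByCombine = pairs.combineByKey(
  (v: Int) => v,                 // createCombiner: the first value for a key becomes its combiner
  (c: Int, v: Int) => c + v,     // mergeValue: fold another value into the partition-local combiner
  (c1: Int, c2: Int) => c1 + c2  // mergeCombiners: merge combiners from different partitions after the shuffle
)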

5. sortBy(_._2, false) produces three RDDs: MapPartitionsRDD[5], ShuffledRDD[6], and MapPartitionsRDD[7]

/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}
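So sortBy(_._2, false) is just keyBy + sortByKey + values. Written out by hand on the counts RDD from section 1, it yields the same three RDDs (a sketch):

// What sortBy(_._2, false) expands to
val sortedByHand = counts
  .keyBy(_._2)       // MapPartitionsRDD[5]: (count, (word, count))
  .sortByKey(false)  // ShuffledRDD[6]: range-partitioned, sorted by count descending
  .values            // MapPartitionsRDD[7]: back to (word, count)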

keyBy wraps each element with its sort key by calling map, producing MapPartitionsRDD[5]:

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

sortByKey builds a RangePartitioner and returns a new ShuffledRDD (ShuffledRDD[6]):

/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

Finally, .values calls map once more, producing MapPartitionsRDD[7]:

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}