map and flatMap in Spark
阿新 • Published: 2018-10-06
map applies a function to every element of the dataset and returns a new distributed dataset (RDD).
Source of the map function:
def map(self, f, preservesPartitioning=False):
    """
    Return a new RDD by applying a function to each element of this RDD.

    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> sorted(rdd.map(lambda x: (x, 1)).collect())
    [('a', 1), ('b', 1), ('c', 1)]
    """
    def func(_, iterator):
        return map(fail_on_stopiteration(f), iterator)
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
map applies func to each input element and returns exactly one corresponding object, forming a new RDD; as the source shows, map is a thin wrapper over mapPartitionsWithIndex that applies Python's built-in map to each partition's iterator. In the docstring example, rdd.map(lambda x: (x, 1)) --> [('a', 1), ('b', 1), ('c', 1)].
flatMap first performs the same map operation, then flattens all of the resulting objects into one; the return value is a single sequence.
Source of flatMap:
def flatMap(self, f, preservesPartitioning=False):
    """
    Return a new RDD by first applying a function to all elements of this
    RDD, and then flattening the results.

    >>> rdd = sc.parallelize([2, 3, 4])
    >>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
    [1, 1, 1, 2, 2, 3]
    >>> sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())
    [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
    """
    def func(s, iterator):
        # chain is itertools.chain, imported at the top of pyspark/rdd.py
        return chain.from_iterable(map(fail_on_stopiteration(f), iterator))
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
Note: when flatMap applies func to an input, the object that func returns must be iterable, as the sketch below shows.
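The reason is visible in the source: chain.from_iterable(map(f, iterator)) iterates over every value f returns. A minimal plain-Python sketch of this requirement (no SparkContext needed):

from itertools import chain

data = [2, 3, 4]

# Each range(1, x) is iterable, so the results can be flattened:
print(list(chain.from_iterable(map(lambda x: range(1, x), data))))
# [1, 1, 2, 1, 2, 3]

# An int is not iterable, so flattening fails:
try:
    list(chain.from_iterable(map(lambda x: x * 2, data)))
except TypeError as e:
    print("TypeError:", e)  # 'int' object is not iterable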
The difference between map and flatMap:
from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)


def func_map():
    data = ["hello world", "hello fly"]
    data_rdd = sc.parallelize(data)
    map_rdd = data_rdd.map(lambda s: s.split(" "))  # one list per input string
    print("map print:{}".format(map_rdd.collect()))


def func_flat_map():
    data = ["hello world", "hello fly"]
    data_rdd = sc.parallelize(data)
    flat_rdd = data_rdd.flatMap(lambda s: s.split(" "))  # one flat list of words
    print("flatMap print:{}".format(flat_rdd.collect()))


func_map()
func_flat_map()
sc.stop()
Output:
map print:[['hello', 'world'], ['hello', 'fly']]
flatMap print:['hello', 'world', 'hello', 'fly']
As you can see, map maps the two inputs "hello world" and "hello fly" to ['hello', 'world'] and ['hello', 'fly'] respectively, while flatMap adds a flattening step on top of map, merging the results into the single list ['hello', 'world', 'hello', 'fly']. This is what makes flatMap so well suited to word counting.
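Building on that, here is a minimal word-count sketch (func_word_count is a hypothetical name; it assumes the same SparkContext sc as in the example above and must run before sc.stop()):

def func_word_count():
    data = ["hello world", "hello fly"]
    data_rdd = sc.parallelize(data)
    counts = (data_rdd
              .flatMap(lambda s: s.split(" "))   # ['hello', 'world', 'hello', 'fly']
              .map(lambda w: (w, 1))             # [('hello', 1), ('world', 1), ...]
              .reduceByKey(lambda a, b: a + b))  # sum the 1s for each word
    print(counts.collect())  # e.g. [('hello', 2), ('world', 1), ('fly', 1)]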