Spark Python API Functions: pyspark API (2)

阿新 • Published: 2018-11-06

Table of Contents

•   1 sortBy
•   2 glom
•   3 cartesian
•   4 groupBy
•   5 pipe
•   6 foreach
•   7 foreachPartition
•   8 collect
•   9 reduce
•   10 fold
•   11 aggregate
•   12 max
•   13 min
•   14 sum
•   15 count

sortBy

# sortBy
x = sc.parallelize(['Cat','Apple','Bat'])
def keyGen(val):
    return val[0]  # sort key: the first character of each string
y = x.sortBy(keyGen)
print(y.collect())

['Apple', 'Bat', 'Cat']

glom

# glom
x = sc.parallelize(['C','B','A'], 2)
y = x.glom()
print(x.collect()) 
print(y.collect())

['C', 'B', 'A']
[['C'], ['B', 'A']]
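
The partition layout that glom reveals can be reproduced with a small pure-Python sketch. It assumes the list is cut at positions i * n // k for k partitions, a rule that matches the outputs printed in this article. `slice_into_partitions` is a hypothetical helper, not a Spark function, and Spark's own slicing may behave differently for other inputs.

```python
def slice_into_partitions(items, num_partitions):
    """Cut a list into contiguous chunks (assumed rule: boundaries at i * n // k)."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

# Same inputs as the glom and foreachPartition examples in this article
print(slice_into_partitions(['C', 'B', 'A'], 2))
print(slice_into_partitions([1, 2, 3], 5))
```

When there are more partitions than elements, some chunks come out empty, which is why the foreachPartition example further down prints empty lists.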

cartesian

# cartesian
x = sc.parallelize(['A','B'])
y = sc.parallelize(['C','D'])
z = x.cartesian(y)
print(x.collect())
print(y.collect())
print(z.collect())

['A', 'B']
['C', 'D']
[('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]

groupBy

# groupBy
x = sc.parallelize([1,2,3])
y = x.groupBy(lambda n: 'A' if (n % 2 == 1) else 'B')  # odd -> 'A', even -> 'B'
print(x.collect())
# each value in y is an iterable, so convert it to a list before printing
print([(key, list(values)) for key, values in y.collect()])

[1, 2, 3]
[('A', [1, 3]), ('B', [2])]
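
For comparison, the same grouping can be sketched locally in plain Python. This is only an illustration of the idea: `group_by` is a hypothetical helper, it runs on a single machine, and it sorts the keys for a stable printout, which the RDD version does not promise.

```python
from collections import defaultdict

def group_by(items, key_func):
    """Collect items into lists keyed by key_func(item), like a local groupBy."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    # Sort by key only so the printed order is stable
    return sorted(groups.items())

print(group_by([1, 2, 3], lambda n: 'A' if n % 2 == 1 else 'B'))
```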

pipe

# pipe
x = sc.parallelize(['A', 'Ba', 'C', 'AD'])
y = x.pipe('grep -i "A"') # calls out to grep, may fail under Windows 
print(x.collect())
print(y.collect())

['A', 'Ba', 'C', 'AD']
['A', 'Ba', 'AD']

foreach

# foreach
from __future__ import print_function
x = sc.parallelize([1,2,3])
def f(el):
    '''side effect: append the current RDD element to a file'''
    # the context manager closes the file after each write
    with open("./foreachExample.txt", 'a+') as f1:
        print(el, file=f1)

# first clear the file contents
open('./foreachExample.txt', 'w').close()

y = x.foreach(f)  # writes into foreachExample.txt; line order depends on task scheduling

print(x.collect())
print(y)  # foreach returns 'None'
# print the contents of foreachExample.txt
with open("./foreachExample.txt", "r") as foreachExample:
    print(foreachExample.read())
    
[1, 2, 3]
None
3
1
2

foreachPartition

# foreachPartition
from __future__ import print_function
x = sc.parallelize([1,2,3],5)
def f(partition):
    '''side effect: append the current RDD partition contents to a file'''
    # the context manager closes the file after each write
    with open("./foreachPartitionExample.txt", 'a+') as f1:
        print([el for el in partition], file=f1)

# first clear the file contents
open('./foreachPartitionExample.txt', 'w').close()

y = x.foreachPartition(f)  # writes into foreachPartitionExample.txt

print(x.glom().collect())
print(y)  # foreachPartition returns 'None'
# print the contents of foreachPartitionExample.txt
with open("./foreachPartitionExample.txt", "r") as foreachPartitionExample:
    print(foreachPartitionExample.read())

[[], [1], [], [2], [3]]
None
[]
[]
[1]
[2]
[3]

collect

# collect
x = sc.parallelize([1,2,3])
y = x.collect()
print(x)  # distributed
print(y)  # not distributed

ParallelCollectionRDD[87] at parallelize at PythonRDD.scala:382
[1, 2, 3]

reduce

# reduce
x = sc.parallelize([1,2,3])
y = x.reduce(lambda obj, accumulated: obj + accumulated)  # sums all elements; the function should be commutative and associative
print(x.collect())
print(y)

[1, 2, 3]
6
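
Why the function passed to reduce should be associative can be seen in a pure-Python emulation. The sketch below assumes each partition is reduced on its own and the partial results are then reduced again. `reduce_emulated` and the hand-written partition lists are hypothetical, and a real cluster may merge partial results in a different order.

```python
from functools import reduce

def reduce_emulated(partitions, op):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)

add = lambda a, b: a + b
sub = lambda a, b: a - b

print(reduce_emulated([[1, 2, 3]], add))    # one partition
print(reduce_emulated([[1], [2, 3]], add))  # two partitions, same sum
print(reduce_emulated([[1, 2, 3]], sub))    # (1 - 2) - 3
print(reduce_emulated([[1], [2, 3]], sub))  # 1 - (2 - 3), a different answer
```

Addition gives the same answer for either layout, while subtraction depends on how the data happens to be partitioned.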

fold

# fold
x = sc.parallelize([1,2,3])
neutral_zero_value = 0  # 0 for sum, 1 for multiplication
y = x.fold(neutral_zero_value, lambda obj, accumulated: accumulated + obj)  # sums all elements, starting from the zero value
print(x.collect())
print(y)

[1, 2, 3]
6
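
The zero value needs to be neutral because it is used more than once. The pure-Python sketch below assumes the zero value starts the fold of every partition and is applied once more when the partial results are merged. `fold_emulated` and the hand-written partition lists are hypothetical stand-ins for what happens on a cluster.

```python
from functools import reduce

def fold_emulated(partitions, zero_value, op):
    """Fold each partition from zero_value, then fold the partial results from zero_value."""
    partials = [reduce(op, part, zero_value) for part in partitions]
    return reduce(op, partials, zero_value)

add = lambda a, b: a + b

print(fold_emulated([[1], [2, 3]], 0, add))  # neutral zero value: plain sum
print(fold_emulated([[1], [2, 3]], 1, add))  # zero value 1 is added three times
print(fold_emulated([[1, 2, 3]], 1, add))    # one partition: added twice
```

With a zero value that is not neutral, the result changes with the number of partitions, so the same call can return different answers on different cluster configurations.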

aggregate

# aggregate
x = sc.parallelize([2,3,4])
neutral_zero_value = (0, 1)  # sum: x + 0 = x, product: 1 * x = x
# seqOp folds one element into a partition's running (sum, product)
seqOp = (lambda aggregated, el: (aggregated[0] + el, aggregated[1] * el))
# combOp merges the (sum, product) pairs of two partitions
combOp = (lambda agg1, agg2: (agg1[0] + agg2[0], agg1[1] * agg2[1]))
y = x.aggregate(neutral_zero_value, seqOp, combOp)  # computes (sum, product)
print(x.collect())
print(y)

[2, 3, 4]
(9, 24)
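
The division of labor between seqOp and combOp can be sketched in plain Python. The emulation assumes seqOp folds the elements of one partition into an accumulator and combOp then merges the per-partition accumulators. `aggregate_emulated`, the hand-written partition lists, and the (sum, count) variant used to get a mean are all illustrative additions, not part of the example above.

```python
from functools import reduce

def aggregate_emulated(partitions, zero_value, seq_op, comb_op):
    """Apply seq_op within each partition, then comb_op across the partial results."""
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, partials, zero_value)

partitions = [[2], [3, 4]]  # hypothetical layout of [2, 3, 4] over two partitions

# (sum, product), as in the example above
sum_prod = aggregate_emulated(
    partitions, (0, 1),
    lambda acc, el: (acc[0] + el, acc[1] * el),
    lambda a, b: (a[0] + b[0], a[1] * b[1]),
)
print(sum_prod)

# (sum, count), from which a mean can be derived
total, count = aggregate_emulated(
    partitions, (0, 0),
    lambda acc, el: (acc[0] + el, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
print(total, count, total / count)
```

Unlike reduce and fold, the accumulator here has a different type from the elements (a tuple rather than an int), which is the main reason to reach for aggregate.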

max

# max
x = sc.parallelize([1,3,2])
y = x.max()
print(x.collect())
print(y)

[1, 3, 2]
3

min

# min
x = sc.parallelize([1,3,2])
y = x.min()
print(x.collect())
print(y)

[1, 3, 2]
1

sum

# sum
x = sc.parallelize([1,3,2])
y = x.sum()
print(x.collect())
print(y)

[1, 3, 2]
6

count

# count
x = sc.parallelize([1,3,2])
y = x.count()
print(x.collect())
print(y)

[1, 3, 2]
3
