pyspark - Spark Programming Guide
阿新 · Published 2019-01-08
References:
1、http://spark.apache.org/docs/latest/rdd-programming-guide.html
2、https://github.com/apache/spark/tree/v2.2.0
Spark Programming Guide
Linking with Spark
from pyspark import SparkContext, SparkConf
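A standalone application that starts with this import is usually launched through spark-submit; the file name MyScript.py below is only the example name used later in this guide:

$ ./bin/spark-submit --master local[4] MyScript.py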
Initializing Spark
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
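As a minimal concrete sketch (the app name "MyApp" and the master URL local[2] are placeholder values, not from the original guide):

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("MyApp").setMaster("local[2]")  # placeholder name and local master
sc = SparkContext(conf=conf)
# ... build and act on RDDs with sc ...
sc.stop()  # shut the context down when the job is finished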
Using the Shell
$ ./bin/pyspark --master local[4]
$ ./bin/pyspark --master local[4] --py-files code.py
Resilient Distributed Datasets (RDDs)
Parallelized Collections
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
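parallelize also accepts the number of partitions to cut the dataset into; the partition count of 10 below is only an illustrative choice:

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data, 10)   # 10 partitions; Spark would otherwise pick a default
distData.getNumPartitions()           # -> 10
distData.reduce(lambda a, b: a + b)   # -> 15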
External Datasets
# Text file RDDs can be created using SparkContext's textFile method.
# This method takes a URI for the file (either a local path on the machine,
# or a hdfs://, s3n://, etc URI) and reads it as a collection of lines.
>>> distFile = sc.textFile("data.txt")

# Directories, wildcards and compressed files also work:
# textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz")

# SparkContext.wholeTextFiles lets you read a directory containing multiple small text files.
# RDD.saveAsPickleFile and SparkContext.pickleFile support saving an RDD in a simple format
# consisting of pickled Python objects.
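A short sketch of the two APIs mentioned above; the directory /my/small-files and the output path tmp_pickle are made-up examples:

# wholeTextFiles returns one (filename, content) pair per file in the directory
pairs = sc.wholeTextFiles("/my/small-files")

# round-trip an RDD through the pickled-object format
rdd = sc.parallelize(range(5))
rdd.saveAsPickleFile("tmp_pickle")
sorted(sc.pickleFile("tmp_pickle").collect())   # -> [0, 1, 2, 3, 4]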
Saving and Loading SequenceFiles
>>> rdd = sc.parallelize(range(1, 4)).map(lambda x: (x, "a" * x))
>>> rdd.saveAsSequenceFile("path/to/file")
>>> sorted(sc.sequenceFile("path/to/file").collect())
[(1, u'a'), (2, u'aa'), (3, u'aaa')]
Saving and Loading Other Hadoop Input/Output Formats
$ ./bin/pyspark --jars /path/to/elasticsearch-hadoop.jar

>>> conf = {"es.resource" : "index/type"}  # assume Elasticsearch is running on localhost defaults
>>> rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",
...                          "org.apache.hadoop.io.NullWritable",
...                          "org.elasticsearch.hadoop.mr.LinkedMapWritable",
...                          conf=conf)
>>> rdd.first()  # the result is a MapWritable that is converted to a Python dict
(u'Elasticsearch ID',
 {u'field1': True,
  u'field2': u'Some Text',
  u'field3': 12345})
RDD Operations
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
lineLengths.persist()
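The map above is lazy: nothing is computed until the reduce action runs, and persist() only helps if it is requested before the first action. A small sketch of that ordering (the second reduce is just an assumed re-use of the data):

lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))                 # lazy, nothing runs yet
lineLengths.persist()                                     # mark for caching before any action
totalLength = lineLengths.reduce(lambda a, b: a + b)      # first action: computes and caches
maxLength = lineLengths.reduce(lambda a, b: max(a, b))    # second action: served from the cache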
Passing Functions to Spark
"""MyScript.py""" if __name__ == "__main__": def myFunc(s): words = s.split(" ") return len(words) sc = SparkContext(...) sc.textFile("file.txt").map(myFunc) class MyClass(object): def func(self, s): return s def doStuff(self, rdd): return rdd.map(self.func) class MyClass(object): def __init__(self): self.field = "Hello" def doStuff(self, rdd): return rdd.map(lambda s: self.field + s) def doStuff(self, rdd): field = self.field return rdd.map(lambda s: field + s)
Understanding Closures
Example
counter = 0
rdd = sc.parallelize(data)

# Wrong: Don't do this!!
def increment_counter(x):
    global counter
    counter += x

rdd.foreach(increment_counter)

print("Counter value: ", counter)
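In local mode this may appear to work, but on a cluster each executor increments its own copy of counter and the driver's value stays 0. The documented fix is an accumulator; a minimal sketch (the sample data list is assumed):

data = [1, 2, 3, 4, 5]                    # assumed sample data
rdd = sc.parallelize(data)

counter = sc.accumulator(0)               # driver-visible accumulator instead of a global
rdd.foreach(lambda x: counter.add(x))
print("Counter value: ", counter.value)   # -> 15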
Working with Key-Value Pairs
lines = sc.textFile("data.txt")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
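The resulting pair RDD supports the usual follow-up actions; the two calls below (ordering by key and taking the most frequent lines) are illustrative additions, not part of the original snippet:

counts.sortByKey().collect()               # (line, count) pairs in key order
counts.sortBy(lambda kv: -kv[1]).take(5)   # the five most frequent lines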
Shared Variables
Broadcast Variables
>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>
>>> broadcastVar.value  # a plain Python list
[1, 2, 3]
>>> da = sc.parallelize(broadcastVar.value)  # --> back to an RDD
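Broadcast values are meant to be read inside tasks instead of being captured in every closure; the small lookup table below is an assumed illustration, not from the guide:

lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})     # shipped to each executor once

words = sc.parallelize(["a", "b", "c", "a"])
words.map(lambda w: lookup.value.get(w, 0)).sum()   # -> 7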
Accumulators
>>> accum = sc.accumulator(0)
>>> accum
Accumulator<id=0, value=0>

>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

>>> accum.value  # <type 'int'>
10
from pyspark.accumulators import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, initialValue):
        return Vector.zeros(initialValue.size)
    def addInPlace(self, v1, v2):
        v1 += v2
        return v1

# Then, create an Accumulator of this type:
vecAccum = sc.accumulator(Vector(...), VectorAccumulatorParam())
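Vector above is a hypothetical class; the following self-contained variant keeps the same structure but uses plain Python lists (an illustration, not from the original guide):

from pyspark.accumulators import AccumulatorParam

class ListAccumulatorParam(AccumulatorParam):
    def zero(self, initialValue):
        # a zero vector of the same length as the initial value
        return [0.0] * len(initialValue)
    def addInPlace(self, v1, v2):
        # element-wise addition of two equally sized lists
        return [a + b for a, b in zip(v1, v2)]

vecAccum = sc.accumulator([0.0, 0.0, 0.0], ListAccumulatorParam())
sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]).foreach(lambda v: vecAccum.add(v))
vecAccum.value   # -> [5.0, 7.0, 9.0]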
accum = sc.accumulator(0)

def g(x):
    accum.add(x)
    return f(x)

data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
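Once any action runs, the map is evaluated and the accumulator is updated; a continuation of the snippet, with data and f assumed for illustration:

data = sc.parallelize([1, 2, 3, 4])   # assumed input
accum = sc.accumulator(0)

def f(x):
    return x                          # placeholder for the real f

def g(x):
    accum.add(x)
    return f(x)

mapped = data.map(g)   # still lazy: accum.value is 0 here
mapped.count()         # an action forces the map to run
accum.value            # -> 10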