A Spark Example Written in Python: the Error and How to Fix It
By 阿新 · Published 2019-02-05
The corresponding environment variables:
#java
export JAVA_HOME=/usr/local/jdk1.8.0_181
export PATH=$JAVA_HOME/bin:$PATH
#python
export PYTHON_HOME=/usr/local/python3
export PATH=$PYTHON_HOME/bin:$PATH
#spark
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
#add spark to python
export PYTHONPATH=/usr/local/spark/python
#add pyspark to jupyter
# Since two versions of Python are installed, PYSPARK_PYTHON must be set explicitly;
# otherwise running programs through pyspark will fail.
export PYSPARK_PYTHON=/usr/local/python3/bin/python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --allow-root'
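After adding these lines to the shell profile, a quick sanity check (a minimal sketch, assuming the variables were placed in ~/.bashrc):

# Reload the profile and confirm the variables took effect
source ~/.bashrc
echo $SPARK_HOME     # should print /usr/local/spark
echo $PYTHONPATH     # should print /usr/local/spark/python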
A Spark example written in Python:
# -*- coding: utf-8 -*-
from __future__ import print_function
from pyspark import *

if __name__ == '__main__':
    sc = SparkContext("local[4]")
    sc.setLogLevel("WARN")
    # the input string is already split into words here
    rdd = sc.parallelize("hello Pyspark world".split(" "))
    # map each word to a (word, 1) pair and sum the counts per key
    counts = rdd \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.foreach(print)
    sc.stop()
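The script can be run directly with the Python 3 interpreter (assuming it is saved as test1.py, the file name that appears in the traceback below). Once the py4j issue described next is resolved, it prints the word counts, though the order may vary across runs:

python3 test1.py
# ('hello', 1)
# ('Pyspark', 1)
# ('world', 1)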
Running it produces the following error:
Traceback (most recent call last):
  File "test1.py", line 3, in <module>
    from pyspark import *
  File "/usr/local/spark/python/pyspark/__init__.py", line 46, in <module>
    from pyspark.context import SparkContext
  File "/usr/local/spark/python/pyspark/context.py", line 29, in <module>
    from py4j.protocol import Py4JError
ImportError: No module named py4j.protocol
Solution: PySpark depends on py4j, which Spark bundles as a zip under /usr/local/spark/python/lib but which is not on the Python module path. Make it importable by copying it into site-packages and unzipping it:
# Enter Python's site-packages directory
cd /usr/local/python3/lib/python3.6/site-packages
# Copy the bundled py4j package here
cp /usr/local/spark/python/lib/py4j-0.10.7-src.zip ./
# Unzip it
unzip py4j-0.10.7-src.zip
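Two lighter alternatives also work (sketches, not part of the original steps): Python can import modules directly from a zip archive, so it is enough to put the bundled zip on PYTHONPATH; or py4j can be installed from PyPI, matching the version Spark bundles.

# Alternative 1: import py4j straight from the bundled zip
export PYTHONPATH=$PYTHONPATH:/usr/local/spark/python/lib/py4j-0.10.7-src.zip
# Alternative 2: install py4j from PyPI (match Spark's bundled version)
pip3 install py4j==0.10.7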