Spark Installation and Usage Example
2017-06-09
Install Java OpenJDK 1.8
If the Java environment is not installed yet, download and install it first.
[[email protected] ~]$ yum search java | grep openjdk
[[email protected] ~]$ sudo yum install java-1.8.0-openjdk-devel.x86_64
[[email protected] ~]$ sudo yum install java-1.8.0-openjdk-src
On CentOS, yum installs OpenJDK into the /usr/lib/jvm/ directory.
Configure the Java environment
[[email protected] ~]$ vi /etc/profile
export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
[[email protected] ~]$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b12)
OpenJDK 64-Bit Server VM (build 25.131-b12, mixed mode)
[[email protected] ~]$ ls -l /usr/bin/java
lrwxrwxrwx 1 root root 22 Jun 5 17:38 /usr/bin/java -> /etc/alternatives/java
[[email protected] ~]$ javac -version
javac 1.8.0_131
Test the environment with a Java program
[[email protected] java]$ cat HelloWorld.java
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World! ");
    }
}
[[email protected] java]$ javac HelloWorld.java
[[email protected] java]$ java HelloWorld
Hello, World!
Download Spark
Here I clone the latest source code directly with git.
[[email protected] java]$ git clone git://github.com/apache/spark.git
# stable branch: git clone git://github.com/apache/spark.git -b branch-2.1
[[email protected] spark]$ ls
appveyor.yml CONTRIBUTING.md external mllib R spark-warehouse
assembly core graphx mllib-local README.md sql
bin data hadoop-cloud NOTICE repl streaming
build dev launcher pom.xml resource-managers target
common docs LICENSE project sbin tools
conf examples licenses python scalastyle-config.xml work
[[email protected] spark]$ mvn install -DskipTests
[INFO] BUILD SUCCESS
Set the Spark environment variables
[[email protected] ~]$ vi .bashrc
# edit .bashrc or .zshrc
# spark
export SPARK_HOME="${HOME}/java/spark"
export PATH="$SPARK_HOME/bin:$PATH"
[[email protected] ~]$ source .bashrc
Example
Estimating Pi with map/reduce
from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession


if __name__ == "__main__":
    """
    Usage: pi [partitions]
    Monte Carlo method: estimate Pi from the ratio of points falling inside
    the circle to points falling inside the enclosing square.
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    # Number of partitions, taken from the command line; defaults to 2.
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    # Map function: pick a random point in [-1, 1] x [-1, 1]; return 1 if it
    # falls inside the unit circle, 0 otherwise. The circle (area pi) catches
    # count points and the square (area 4) catches n points, so
    # count / n = pi / 4, i.e. pi = 4 * count / n.
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
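This listing is the bundled example examples/src/main/python/pi.py from the source tree, so it can also be submitted as a standalone job (the same command appears again in the Python shell section below):
./bin/spark-submit examples/src/main/python/pi.py 10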
Run the example
[[email protected] spark]$ ./bin/run-example SparkPi 10
17/06/06 19:41:03 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.814962 s
Pi is roughly 3.142123142123142
Start the shell
Spark can run flexibly in several different modes: local, on Mesos, on YARN, or on its own standalone cluster manager (the Standalone Scheduler).
Scala shell
[[email protected] conf]$ cp log4j.properties.template log4j.properties
[[email protected] spark]$ ./bin/spark-shell --master local[2]
--master specifies the cluster URL to connect to; local runs Spark locally with a single thread, and local[n] runs locally with n worker threads.
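The master can also be set in code when the session is built. A minimal sketch (the app name MasterDemo and the choice of local[2] are only illustrative); other master URL forms include spark://HOST:PORT for a standalone cluster, mesos://HOST:PORT, and yarn:
from pyspark.sql import SparkSession

# Run locally with 2 worker threads, matching --master local[2] above.
spark = SparkSession.builder \
    .appName("MasterDemo") \
    .master("local[2]") \
    .getOrCreate()

sc = spark.sparkContext
print(sc.master)              # local[2]
print(sc.defaultParallelism)  # 2 -- one task slot per local thread
spark.stop()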
Python shell
[[email protected] ~]$ pyspark --master local[2]
./bin/spark-submit examples/src/main/python/pi.py 10
Start IPython or Jupyter Notebook
The old IPYTHON and IPYTHON_OPTS variables have been deprecated; use PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.
[[email protected] spark]$ PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
In [4]: lines = sc.textFile("README.md")
In [5]: lines.count()
Out[5]: 103
In [6]: lines.first()
Out[6]: '# Apache Spark'
In [10]: type(lines)
Out[10]: pyspark.rdd.RDD
In [15]: pylines = lines.filter(lambda line: "Python" in line)
In [16]: pylines.first()
Out[16]: 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that'
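The same RDD operations work outside the interactive shell; a standalone script has to create its own SparkSession (the shell pre-defines sc for you) and is launched with spark-submit. A minimal sketch, assuming it is saved as readme_lines.py (an illustrative name) and run from the Spark directory so README.md is found:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadmeLines").getOrCreate()
sc = spark.sparkContext

# Same operations as the interactive session above.
lines = sc.textFile("README.md")
print("total lines: %d" % lines.count())
pylines = lines.filter(lambda line: "Python" in line)
print("first line mentioning Python: %s" % pylines.first())

spark.stop()
Run it with ./bin/spark-submit readme_lines.py.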
#notebook
[[email protected] spark]$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=10.6.0.200" ./bin/pyspark
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://10.6.0.200:8888/?token=69456fd93a5ce196b3b3f7ee5a983a40115da9cef982e35f
Here --ip binds the notebook to a LAN-reachable address; otherwise it is only accessible from the local machine. You can launch it with nohup as a background process so it stays reachable remotely.
R shell
./bin/sparkR --master local[2]
./bin/spark-submit examples/src/main/r/dataframe.R