
Installing and Using Spark: A Worked Example

2017-06-09


Install Java OpenJDK 1.8

If a Java environment is not yet installed, download and install it first.

[user@host ~]$ yum search java | grep openjdk
[user@host ~]$ sudo yum install java-1.8.0-openjdk-devel.x86_64
[user@host ~]$ sudo yum install java-1.8.0-openjdk-src

On CentOS, yum installs OpenJDK under the /usr/lib/jvm/ directory.

Configure the Java environment

[user@host ~]$ vi /etc/profile
export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

[user@host ~]$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b12)
OpenJDK 64-Bit Server VM (build 25.131-b12, mixed mode)

[user@host ~]$ ls -l /usr/bin/java
lrwxrwxrwx 1 root root 22 Jun  5 17:38 /usr/bin/java -> /etc/alternatives/java

[user@host ~]$ javac -version
javac 1.8.0_131

Test the Java environment with a small program

[user@host java]$ cat HelloWorld.java
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World! ");
    }
}
[user@host java]$ javac HelloWorld.java
[user@host java]$ java HelloWorld
Hello, World!

Download Spark

Here I clone the latest source code directly with git.

[user@host java]$ git clone git://github.com/apache/spark.git
# stable branch: git clone git://github.com/apache/spark.git -b branch-2.1

[user@host spark]$ ls
appveyor.yml  CONTRIBUTING.md  external      mllib        R                      spark-warehouse
assembly      core             graphx        mllib-local  README.md              sql
bin           data             hadoop-cloud  NOTICE       repl                   streaming
build         dev              launcher      pom.xml      resource-managers      target
common        docs             LICENSE       project      sbin                   tools
conf          examples         licenses      python       scalastyle-config.xml  work

[user@host spark]$ mvn install -DskipTests
[INFO] BUILD SUCCESS

Set the Spark environment variables


[user@host ~]$ vi .bashrc
# edit .bashrc or .zshrc

# spark
export SPARK_HOME="${HOME}/java/spark"
export PATH="$SPARK_HOME/bin:$PATH"


[user@host ~]$ source .bashrc

Examples

Estimating Pi with map and reduce

from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
        Usage: pi [partitions]
        蒙特卡羅演算法, 求落在圓內的點和正方形內的點的比值求PI
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()
    
    # number of partitions, taken from the first command-line argument (default 2)
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions
    
    # map function: draw a random point with x, y in (-1, 1); return 1 if it
    # falls inside the unit circle, 0 otherwise.
    # The circle (area pi) catches count points and the square (area 4)
    # catches n points, so pi / count = 4 / n, i.e. pi = 4 * count / n.
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

Run the example

[user@host spark]$ ./bin/run-example SparkPi 10
17/06/06 19:41:03 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.814962 s
Pi is roughly 3.142123142123142

Starting a shell

Spark can run in several different modes: locally, on Mesos, on YARN, or on its own Standalone Scheduler in a cluster.

Scala shell

[user@host conf]$ cp log4j.properties.template log4j.properties

[user@host spark]$ ./bin/spark-shell --master local[2]

--master specifies the cluster URL; local runs with a single local thread, and local[n] runs with n local worker threads.
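
As a quick sketch (the app name "MasterUrlDemo" below is just illustrative, not from the original post), the same master URL can also be set programmatically when a SparkSession is built, which is handy for small test scripts:

from pyspark.sql import SparkSession

# Sketch: set the master programmatically instead of via --master.
# "local[2]" mirrors the command above; a cluster URL such as
# "spark://host:7077", "yarn", or "mesos://host:port" selects the other modes.
spark = (SparkSession.builder
         .appName("MasterUrlDemo")
         .master("local[2]")
         .getOrCreate())

print(spark.sparkContext.master)  # prints the effective master URL
spark.stop()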

Python shell

[user@host ~]$ pyspark --master local[2]

./bin/spark-submit examples/src/main/python/pi.py 10

Start IPython or a Jupyter notebook

The old IPYTHON and IPYTHON_OPTS variables are deprecated; use PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.

[user@host spark]$ PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
In [4]: lines = sc.textFile("README.md")

In [5]: lines.count()
Out[5]: 103

In [6]: lines.first()
Out[6]: '# Apache Spark'

In [10]: type(lines)
Out[10]: pyspark.rdd.RDD
In [15]: pylines = lines.filter(lambda line: "Python" in line)

In [16]: pylines.first()
Out[16]: 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that'


# notebook
[user@host spark]$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=10.6.0.200" ./bin/pyspark
Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://10.6.0.200:8888/?token=69456fd93a5ce196b3b3f7ee5a983a40115da9cef982e35f

Here --ip binds the notebook to a LAN-accessible address; otherwise it is only reachable from localhost. You can start it with nohup as a background process to keep it available for remote access.
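
The interactive session above can also be written as a standalone script and submitted with spark-submit. A minimal sketch (assuming README.md is in the current working directory; the app name "ReadmeLines" is illustrative):

from pyspark.sql import SparkSession

# Sketch of the interactive RDD session above as a standalone script.
spark = SparkSession.builder.appName("ReadmeLines").getOrCreate()

lines = spark.sparkContext.textFile("README.md")        # load the file as an RDD of lines
print(lines.count())                                    # total number of lines
print(lines.first())                                    # first line: '# Apache Spark'
pylines = lines.filter(lambda line: "Python" in line)   # keep lines mentioning Python
print(pylines.first())

spark.stop()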

R shell

./bin/sparkR --master local[2]
./bin/spark-submit examples/src/main/r/dataframe.R

Unless marked as a repost, this is original content. This site follows the Creative Commons (CC) license; please credit the source when reposting.