Setting up a Hadoop + PySpark environment
1. Deploy the Hadoop environment
Set up a Hadoop pseudo-distributed environment in which all services run on a single node.
1.1 Install the JDK
The JDK is installed from the precompiled binary package; see the download page.
- Download the JDK
$ cd /opt/local/src/
$ curl -o jdk-8u171-linux-x64.tar.gz http://download.oracle.com/otn-pub/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.tar.gz?AuthParam=1529719173_f230ce3269ab2fccf20e190d77622fe1
- Extract the archive and configure environment variables
### Extract to the target location
$ tar -zxf jdk-8u171-linux-x64.tar.gz -C /opt/local
### Create a symlink
$ cd /opt/local/
$ ln -s jdk1.8.0_171 jdk
### Configure environment variables: add the following to the current user's ~/.bashrc
$ tail ~/.bashrc
# Java
export JAVA_HOME=/opt/local/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
- Reload the environment variables
$ source ~/.bashrc
### Verify that it took effect; Java version info in the output means it works
$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
1.2 Configure /etc/hosts
### Configure /etc/hosts to map the hostname to the IP address
$ head -n 3 /etc/hosts
# ip --> hostname or domain
192.168.20.10 node
### Verify
$ ping node -c 2
PING node (192.168.20.10) 56(84) bytes of data.
64 bytes from node (192.168.20.10): icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from node (192.168.20.10): icmp_seq=2 ttl=64 time=0.040 ms
1.3 Set up passwordless SSH login
- Generate an SSH key
### Generate an SSH key
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
- Add the public key to authorized_keys
### A password is required here
$ ssh-copy-id node
### Verify the login; success means no password is required
$ ssh node
1.4 Install and configure Hadoop
- Download Hadoop
### Download Hadoop 2.7.6
$ cd /opt/local/src/
$ wget -c http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.7.6/hadoop-2.7.6.tar.gz
- Create the Hadoop data directories
$ mkdir -p /opt/local/hdfs/{namenode,datanode,tmp}
$ tree /opt/local/hdfs/
/opt/local/hdfs/
├── datanode
├── namenode
└── tmp
- Extract the Hadoop archive
### Extract to the target location
$ cd /opt/local/src/
$ tar -zxf hadoop-2.7.6.tar.gz -C /opt/local/
### Create a symlink
$ cd /opt/local/
$ ln -s hadoop-2.7.6 hadoop
1.5 Configure Hadoop
1.5.1 Configure core-site.xml
$ vim /opt/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/opt/local/hdfs/tmp/</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
1.5.2 Configure hdfs-site.xml
$ vim /opt/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/local/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/local/hdfs/datanode</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
1.5.3 Configure mapred-site.xml
### mapred-site.xml must be copied from the template and then modified
$ cp /opt/local/hadoop/etc/hadoop/mapred-site.xml.template /opt/local/hadoop/etc/hadoop/mapred-site.xml
$ vim /opt/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>node:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/history/done</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/history/done_intermediate</value>
</property>
</configuration>
1.5.4 Configure yarn-site.xml
$ vim /opt/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>node:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>node:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>node:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node:8088</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
1.5.5 Configure slaves
$ cat /opt/local/hadoop/etc/hadoop/slaves
node
1.5.6 Configure master
$ cat /opt/local/hadoop/etc/hadoop/master
node
1.5.7 Configure hadoop-env
$ vim /opt/local/hadoop/etc/hadoop/hadoop-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk
1.5.8 Configure yarn-env
$ vim /opt/local/hadoop/etc/hadoop/yarn-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk
1.5.9 Configure mapred-env
$ vim /opt/local/hadoop/etc/hadoop/mapred-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk
1.5.10 Configure Hadoop environment variables
- Add the Hadoop settings
Add the following Hadoop environment variables to ~/.bashrc:
# hadoop
export HADOOP_HOME=/opt/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
- Apply the configuration
$ source ~/.bashrc
### Verify
$ hadoop version
Hadoop 2.7.6
Subversion https://[email protected]/repos/asf/hadoop.git -r 085099c66cf28be31604560c376fa282e69282b8
Compiled by kshvachk on 2018-04-18T01:33Z
Compiled with protoc 2.5.0
From source with checksum 71e2695531cb3360ab74598755d036
This command was run using /opt/local/hadoop-2.7.6/share/hadoop/common/hadoop-common-2.7.6.jar
1.6 Format the HDFS filesystem
### Format HDFS; use with caution if data already exists, as this deletes the existing data
$ hadoop namenode -format
### The namenode storage directory now contains data
$ ls /opt/local/hdfs/namenode/
current
1.7 Start Hadoop
Starting Hadoop mainly involves HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager). Everything can be started with start-all.sh and stopped with stop-all.sh, or each daemon can be started individually.
1.7.1 Start DFS
Starting DFS covers the NameNode and DataNode services. They can be started together with start-dfs.sh; below they are started one at a time.
1.7.1.1 Start the NameNode
### Start the namenode
$ hadoop-daemon.sh start namenode
starting namenode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-namenode-node.out
### Check processes
$ jps
7547 Jps
7500 NameNode
### Start the SecondaryNameNode
$ hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-secondarynamenode-node.out
### Check processes
$ jps
10001 SecondaryNameNode
10041 Jps
9194 NameNode
1.7.1.2 Start the DataNode
### Start the datanode
$ hadoop-daemon.sh start datanode
starting datanode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-datanode-node.out
### Check processes
$ jps
7607 DataNode
7660 Jps
7500 NameNode
10001 SecondaryNameNode
1.7.2 Start YARN
Starting YARN covers the ResourceManager and NodeManager. They can be started together with start-yarn.sh; below they are started one at a time.
1.7.2.1 Start the ResourceManager
### Start the resourcemanager
$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/local/hadoop-2.7.6/logs/yarn-hadoop-resourcemanager-node.out
### Check processes
$ jps
7607 DataNode
7993 Jps
7500 NameNode
7774 ResourceManager
10001 SecondaryNameNode
1.7.2.2 Start the NodeManager
### Start the nodemanager
$ yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/local/hadoop-2.7.6/logs/yarn-hadoop-nodemanager-node.out
### Check processes
$ jps
7607 DataNode
8041 NodeManager
8106 Jps
7500 NameNode
7774 ResourceManager
10001 SecondaryNameNode
1.7.3 Start the HistoryServer
### Start the historyserver
$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/local/hadoop/logs/mapred-hadoop-historyserver-node.out
### Check processes
$ jps
8278 JobHistoryServer
7607 DataNode
8041 NodeManager
7500 NameNode
8317 Jps
7774 ResourceManager
10001 SecondaryNameNode
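Once the JobHistoryServer is up, finished MapReduce jobs can also be queried over its REST API, which is served on the mapreduce.jobhistory.webapp.address configured above (node:19888). A minimal Python sketch (the script name history_jobs.py is just an example, not part of the Hadoop distribution):
#### history_jobs.py - hypothetical helper; assumes the JobHistoryServer REST API at node:19888
import json
import urllib2  # Python 2, matching the Python 2.7 environment used later in this guide

resp = urllib2.urlopen("http://node:19888/ws/v1/history/mapreduce/jobs")
data = json.loads(resp.read())

# "jobs" is null until at least one MapReduce job has finished
jobs = (data.get("jobs") or {}).get("job") or []
for job in jobs:
    print("%s  %s  state=%s" % (job["id"], job["name"], job["state"]))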
1.7.4 Hadoop components
After Hadoop has started, the main components are:
- HDFS: NameNode, SecondaryNameNode, DataNode
- YARN: ResourceManager, NodeManager
- HistoryServer: JobHistoryServer
1.8 Basic Hadoop operations
1.8.1 Common hadoop commands
Command | Description |
---|---|
hadoop fs -mkdir | Create an HDFS directory |
hadoop fs -ls | List an HDFS directory |
hadoop fs -copyFromLocal | Copy a local file to HDFS |
hadoop fs -put | Copy a local file to HDFS; put can also read from stdin |
hadoop fs -cat | Print the contents of an HDFS file |
hadoop fs -copyToLocal | Copy a file from HDFS to the local filesystem |
hadoop fs -get | Copy a file from HDFS to the local filesystem |
hadoop fs -cp | Copy HDFS files |
hadoop fs -rm | Delete an HDFS file or directory (with -R) |
1.8.2 hadoop command operations
1.8.2.1 Basic command operations
- Create a directory
$ hadoop fs -mkdir /user/hadoop
- Create multiple directories
$ hadoop fs -mkdir -p /user/hadoop/{input,output}
- List HDFS directories
$ hadoop fs -ls /
Found 2 items
drwxrwx--- - hadoop supergroup 0 2018-06-23 12:20 /history
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:20 /user
$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:20 /user/hadoop
- List all directories recursively
$ hadoop fs -ls -R /
drwxrwx--- - hadoop supergroup 0 2018-06-23 12:20 /history
drwxrwx--- - hadoop supergroup 0 2018-06-23 12:20 /history/done
drwxrwxrwt - hadoop supergroup 0 2018-06-23 12:20 /history/done_intermediate
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:20 /user
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:24 /user/hadoop
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:24 /user/hadoop/input
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:24 /user/hadoop/output
- Upload a local file to HDFS
$ hadoop fs -copyFromLocal /opt/local/hadoop/README.txt /user/hadoop/input
- View the contents of a file on HDFS
$ hadoop fs -cat /user/hadoop/input/README.txt
- Download a file from HDFS to the local filesystem
$ hadoop fs -get /user/hadoop/input/README.txt ./
- Delete files or directories
### Deleting a file prints a message
$ hadoop fs -rm /user/hadoop/input/examples.desktop
18/06/23 13:47:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop/input/examples.desktop
### Delete a directory
$ hadoop fs -rm -R /user/hadoop
18/06/23 13:48:17 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop
1.8.2.2 Run a MapReduce job
Use Hadoop's built-in wordcount program to count words.
- Run the job
$ hadoop fs -put /opt/local/hadoop/README.txt /user/input
$ cd /opt/local/hadoop/share/hadoop/mapreduce
#### hadoop jar <jar file> <class> <input path> <output dir>
$ hadoop jar hadoop-mapreduce-examples-2.7.6.jar wordcount /user/input/ /user/output/wordcount
- Check the running job
#### Check the current job status; it can also be viewed at http://node:8088
$ yarn application -list
18/06/23 13:55:34 INFO client.RMProxy: Connecting to ResourceManager at node/192.168.20.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1529732240998_0001 word count MAPREDUCE hadoop default RUNNING UNDEFINED 5% http://node:41713
- View the results
#### _SUCCESS indicates success; the files whose names start with part contain the results
$ hadoop fs -ls /user/output/wordcount
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2018-06-23 13:55 /user/output/wordcount/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 1306 2018-06-23 13:55 /user/output/wordcount/part-r-00000
#### View the contents
$ hadoop fs -cat /user/output/wordcount/part-r-00000|tail
uses 1
using 2
visit 1
website 1
which 2
wiki, 1
with 1
written 1
you 1
your 1
1.9 Hadoop web UIs
- The Hadoop NameNode HDFS web UI shows the current state of HDFS and the DataNodes
http://node:50070
- The Hadoop ResourceManager web UI shows the status of the nodes, running applications, and task execution
http://node:8088
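Because dfs.webhdfs.enabled is set to true in hdfs-site.xml, the NameNode also exposes the WebHDFS REST API on the same port as its web UI (50070), so HDFS can be browsed over plain HTTP. A minimal Python sketch that lists a directory (the /user path and the script name webhdfs_ls.py are just examples):
#### webhdfs_ls.py - hypothetical helper; assumes WebHDFS on the NameNode web port 50070
import json
import urllib2  # Python 2, matching the environment used in this guide

path = "/user"  # HDFS directory to list; adjust as needed
resp = urllib2.urlopen("http://node:50070/webhdfs/v1%s?op=LISTSTATUS" % path)
data = json.loads(resp.read())

for entry in data["FileStatuses"]["FileStatus"]:
    print("%-10s %-20s %d bytes" % (entry["type"], entry["pathSuffix"], entry["length"]))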
2. Deploy Spark
2.1 Scala overview and installation
2.1.1 Scala overview
Spark is written in Scala (official site: https://www.scala-lang.org/), so Scala must be installed first. Scala has the following characteristics:
- Scala compiles to Java bytecode, so it runs on the JVM (Java Virtual Machine) and is cross-platform;
- Existing Java libraries can be used directly, so the rich Java open-source ecosystem remains available;
- Scala is a functional language: functions are values, on the same footing as integers and strings, and can be passed as arguments to other functions;
- Scala is a purely object-oriented language: everything is an object and every operation is a method;
2.1.2 Install Scala
Scala can be downloaded from https://www.scala-lang.org/files/archive/.
Since Spark 2.0, Spark is built with Scala 2.11 by default, so download a scala-2.11 release.
- Download Scala
$ cd /opt/local/src/
$ wget -c https://www.scala-lang.org/files/archive/scala-2.11.11.tgz
- Extract the Scala archive
#### Extract to the target location and create a symlink
$ tar -zxf scala-2.11.11.tgz -C /opt/local/
$ cd /opt/local/
$ ln -s scala-2.11.11 scala
- Configure Scala environment variables
#### Add the following to ~/.bashrc
$ tail -n 5 ~/.bashrc
# scala
export SCALA_HOME=/opt/local/scala
export PATH=$PATH:$SCALA_HOME/bin
#### Apply the configuration
$ source ~/.bashrc
#### Verify
$ scala -version
Scala code runner version 2.11.11 -- Copyright 2002-2017, LAMP/EPFL
2.2 Install Spark
2.2.1 Download Spark
The Spark download page is http://spark.apache.org/downloads.html; choose the build for Hadoop 2.7 and later.
$ cd /opt/local/src/
$ wget -c http://mirror.bit.edu.cn/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
2.2.2 Extract and configure Spark
- Extract Spark to the target directory and create a symlink
$ tar zxf spark-2.3.1-bin-hadoop2.7.tgz -C /opt/local/
$ cd /opt/local/
$ ln -s spark-2.3.1-bin-hadoop2.7 spark
- Configure Spark environment variables
$ tail -n 5 ~/.bashrc
# spark
export SPARK_HOME=/opt/local/spark
export PATH=$PATH:$SPARK_HOME/bin
- Apply the environment variables
$ source ~/.bashrc
2.3 Run pyspark
2.3.1 Run pyspark locally
Typing pyspark in a terminal starts Spark's Python interface; on startup it prints the Python and Spark versions being used.
With pyspark --master local[4], local[N] means running locally with N threads; local[*] uses as many CPU cores as possible.
- Start pyspark locally
$ pyspark
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-23 19:25:00 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.3.1
/_/
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
- Check the current run mode
>>> sc.master
u'local[*]'
- Read a local file
>>> textFile=sc.textFile("file:/opt/local/spark/README.md")
>>> textFile.count()
103
- Read a file from HDFS
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
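The same RDD API can do more than count lines. For example, a small word count run in the same pyspark shell against the README.md read above (sc is the SparkContext the shell already provides; the top-ten query is just an illustration):
>>> textFile = sc.textFile("hdfs://node:9000/user/input/README.md")
>>> counts = (textFile.flatMap(lambda line: line.split())   # split lines into words
...                   .map(lambda word: (word, 1))          # pair each word with 1
...                   .reduceByKey(lambda a, b: a + b))     # sum the counts per word
>>> counts.takeOrdered(10, key=lambda x: -x[1])             # ten most frequent words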
2.3.2 Run Spark on Hadoop YARN
Spark can run on Hadoop YARN, letting YARN handle resource management. Start it with:
HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop sets the Hadoop configuration directory;
pyspark is the program being run;
--master yarn --deploy-mode client sets the run mode to YARN client.
$ HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-23 20:27:48 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-06-23 20:27:52 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.3.1
/_/
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
- Check the current run mode
>>> sc.master
u'yarn'
- Read a file from HDFS
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
- Check running applications with yarn
#### Can also be viewed in the web UI: http://node:8088
$ yarn application -list
18/06/23 20:34:40 INFO client.RMProxy: Connecting to ResourceManager at node/192.168.20.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1529756801315_0001 PySparkShell SPARK hadoop default RUNNING UNDEFINED 10% http://node:4040
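Besides the interactive shell, a standalone Python script can be submitted to YARN with spark-submit in the same way. A minimal sketch (the file name wordcount_yarn.py is just an example):
#### wordcount_yarn.py - hypothetical example script for spark-submit
#### Submit with:
####   HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop spark-submit --master yarn --deploy-mode client wordcount_yarn.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount_yarn").getOrCreate()
sc = spark.sparkContext

# Count the lines of the README.md on HDFS that mention Spark
lines = sc.textFile("hdfs://node:9000/user/input/README.md")
print("lines mentioning Spark: %d" % lines.filter(lambda line: "Spark" in line).count())

spark.stop()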
2.3.3 Run Spark on a Spark Standalone Cluster
Set up a pseudo-distributed Spark Standalone Cluster in which all services run on a single node.
2.3.3.1 Configure spark-env.sh
- Create spark-env.sh by copying the template
$ cp /opt/local/spark/conf/spark-env.sh.template /opt/local/spark/conf/spark-env.sh
- Configure spark-env.sh
$ tail -n 6 /opt/local/spark/conf/spark-env.sh
#### Spark Standalone Cluster
export JAVA_HOME=/opt/local/jdk
export SPARK_MASTER_HOST=node
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=512m
export SPARK_WORKER_INSTANCES=1
2.3.3.2 Configure slaves
#### Edit the file directly, or copy the template file
$ tail /opt/local/spark/conf/slaves
node
2.3.3.3 Run pyspark on the Spark Standalone Cluster
2.3.3.3.1 Start the Spark Standalone Cluster
The Spark Standalone Cluster can be started with a single script, ${SPARK_HOME}/sbin/start-all.sh, which launches all services; the master and the slaves can also be started separately.
- Start the master
$ /opt/local/spark/sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/local/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-node.out
$ jps
4185 Master
- Start the slaves
$ /opt/local/spark/sbin/start-slaves.sh
node: starting org.apache.spark.deploy.worker.Worker, logging to /opt/local/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node.out
$ jps
4185 Master
4313 Worker
- Check the cluster status at http://node:8080
$ w3m http://node:8080/
[spark-logo] 2.3.1 Spark Master at spark://node:7077
• URL: spark://node:7077
• REST URL: spark://node:6066 (cluster mode)
• Alive Workers: 1
• Cores in use: 1 Total, 0 Used
• Memory in use: 256.0 MB Total, 0.0 B Used
• Applications: 0 Running, 0 Completed
• Drivers: 0 Running, 0 Completed
• Status: ALIVE
Workers (1)
Worker Id                                  Address              State  Cores       Memory
worker-20180624102100-192.168.20.10-42469  192.168.20.10:42469  ALIVE  1 (0 Used)  512.0 MB (0.0 B Used)
Running Applications (0)
Application ID  Name  Cores  Memory per Executor  Submitted Time  User  State  Duration
Completed Applications (0)
Application ID  Name  Cores  Memory per Executor  Submitted Time  User  State  Duration
2.3.3.3.2 Run pyspark on the Spark Standalone Cluster
- Run pyspark
$ pyspark --master spark://node:7077 --num-executors 1 --total-executor-cores 1 --executor-memory 512m
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-24 10:39:09 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.3.1
/_/
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
- Check the current run mode
>>> sc.master
u'spark://node:7077'
- Read a local file
>>> textFile=sc.textFile("file:/opt/local/spark/README.md")
>>> textFile.count()
103
- Read a file from HDFS
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
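The DataFrame API is available in the same shell through the SparkSession object. A short sketch reading the same file as a DataFrame (spark is the SparkSession the shell already provides):
>>> df = spark.read.text("hdfs://node:9000/user/input/README.md")
>>> df.count()                                      # same 103 lines as the RDD API
>>> df.filter(df.value.contains("Spark")).show(5)   # first five lines mentioning Spark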
2.4 Summary
Spark can run in several ways: a standalone cluster, a YARN cluster, a Mesos cluster, or local mode.
master value | Description |
---|---|
spark://host:port | Spark standalone cluster; default port 7077 |
yarn | YARN cluster; when running on YARN, set the HADOOP_CONF_DIR environment variable to the Hadoop configuration directory so the cluster can be found |
mesos://host:port | Mesos cluster; default port 5050 |
local | Local mode with 1 core |
local[n] | Local mode with n cores |
local[*] | Local mode using as many cores as possible |
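The same master values can also be set programmatically when building a SparkSession in a script instead of passing --master on the command line. A minimal sketch (the script and application names are arbitrary; when using "yarn", HADOOP_CONF_DIR still has to be set in the environment):
#### master_example.py - hypothetical example script
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("master-example")
         .master("local[*]")   # or "spark://node:7077", "yarn", "mesos://host:5050"
         .getOrCreate())

print(spark.sparkContext.master)
spark.stop()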