Building a Spark Environment with Docker: Theory
1. Image Build Plan
We will use Docker to build a cluster running Hadoop, Spark, Hive and MySQL. First an image is built with a Dockerfile: the software packages are copied into agreed-upon directories, and the configuration files are prepared outside the image in advance, then copied or moved into the Hadoop, Spark and Hive configuration directories during docker build / docker run. One point to note: for Spark to read data stored in Hive, the configuration file hive-site.xml must be copied into Spark's conf directory (when Spark reads a Hive table, it uses hive-site.xml to know how to communicate with Hive). In addition, so that MySQL can be reached from the other nodes (MySQL stores the Hive metastore), its access privileges must be configured accordingly.
If we instead edited configuration files inside a running container, everything would be lost once the container is removed with docker rm, unless the changes had first been committed back to the image with docker commit.
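As a small illustration of that caveat (the image tag my-spark:patched below is a placeholder):
docker exec -it hadoop-maste bash              # edit configuration inside the running container
docker commit hadoop-maste my-spark:patched    # persist those edits into a new image
docker rm -f hadoop-maste                      # without the commit above, the edits would now be lost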
2. Overall Cluster Architecture
There are five nodes in total, i.e. five containers are started. hadoop-maste, hadoop-node1 and hadoop-node2 run the Hadoop and Spark cluster, the hadoop-hive container runs Hive, and the hadoop-mysql container runs the MySQL database.
In Spark, calling enableHiveSupport() on the SparkSession builder enables operations on Hive data-warehouse tables. MySQL is used to store Hive's metadata. A Spark DataFrame can of course also write its data into MySQL via its write API.
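A minimal PySpark sketch of both operations (the Hive table, MySQL database and target table names below are placeholders; the JDBC URL and credentials mirror the hive-site.xml shown in section 8):
from pyspark.sql import SparkSession

# Hive support only works if hive-site.xml is visible in $SPARK_HOME/conf.
spark = (SparkSession.builder
         .appName("hive-mysql-demo")
         .enableHiveSupport()
         .getOrCreate())

# Read a Hive table (placeholder table name).
df = spark.sql("SELECT * FROM demo_db.demo_table")

# Write the DataFrame into MySQL over JDBC (placeholder target database/table).
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://hadoop-mysql:3306/test")
   .option("dbtable", "demo_table_copy")
   .option("user", "root")
   .option("password", "root")
   .mode("overwrite")
   .save())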
3. Cluster Network Planning and Subnet Configuration
The network is set up through Docker's built-in networking support. A subnet is created with the docker network create command; here we create the following subnet:
docker network create --subnet=172.16.0.0/16 spark
--subnet specifies the network segment of the subnet, and we name this subnet spark.
Next, within the spark subnet we just created, each container in the cluster is assigned a fixed IP address.
Note: the hostnames of all five containers start with hadoop-*, because passwordless SSH between the containers is handled by the Host hadoop-* rule in the ssh_conf file described in section 5, rather than by generating id_rsa.pub on every node and copying it around.
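A hedged sketch of how a container is then attached to this subnet with a fixed address when it is started (the IP address and image tag below are illustrative placeholders; the real values follow the allocation plan and the start_containers.sh script described later):
docker run -itd --name hadoop-maste --hostname hadoop-maste \
    --net spark --ip 172.16.0.2 \
    spark-hadoop:latest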
4. Software Versions
Spark: 2.3.0 (the latest release at the time)
Hadoop: the stable hadoop-2.7.3
Hive: the latest stable hive-2.3.2
Scala: Scala-2.11.8
JDK: jdk-8u101-linux-x64
MySQL: mysql-5.5.45-linux2.6-x86_64
JDBC driver used by Hive and Spark to connect to MySQL: mysql-connector-java-5.1.37-bin.jar
5. Passwordless SSH Configuration
We do not take the usual approach of generating id_rsa.pub with ssh-keygen -t rsa -P '' on every node and then copying each node's id_rsa.pub into the other nodes' authorized_keys files. Instead, an ssh_conf file is placed in the .ssh directory; ssh_conf holds the SSH communication rules.
ssh_conf contents:
Host localhost
StrictHostKeyChecking no
Host 0.0.0.0
StrictHostKeyChecking no
Host hadoop-*
StrictHostKeyChecking no
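Once the containers are running, a quick check from the master container confirms that the rule works (hostname per the network plan):
ssh hadoop-node1 hostname    # should print hadoop-node1 with no password prompt and no host-key confirmation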
6. Hadoop, HDFS and YARN Configuration Files
Hadoop's configuration files live under HADOOP_HOME/etc/hadoop. The important ones are these nine files:
core-site.xml
hadoop-env.sh
hdfs-site.xml
mapred-env.sh
mapred-site.xml
yarn-env.sh
yarn-site.xml
master
slaves
Among them, core-site.xml configures Hadoop's default file-system URI, along with the proxy users and user groups allowed to access the file system. core-site.xml is as follows:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-maste:9000/</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>
</configuration>
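fs.defaultFS is the URI every HDFS client will use; once the NameNode has been formatted and started, a quick sanity check could be:
hdfs dfs -ls hdfs://hadoop-maste:9000/    # list the root of the distributed file system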
hadoop-env.sh
This file configures the JDK that Hadoop runs on, together with a few JVM parameters. Apart from the JDK path, nothing here needs to be changed. Its contents:
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.
# This is the one setting that must be changed here: point JAVA_HOME at the JDK
export JAVA_HOME=/usr/local/jdk1.8.0_101
# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
if [ "$HADOOP_CLASSPATH" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
else
export HADOOP_CLASSPATH=$f
fi
done
# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""
# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"
# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol. This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
# Where log files are stored. $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""
###
# Advanced Users Only!
###
# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
Next comes hdfs-site.xml. It mainly configures where the HDFS NameNode and DataNode store their data, and the replication factor for data blocks.
<?xml version="1.0"?>
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop2.7/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop2.7/dfs/data</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
mapred-env.sh and mapred-site.xml configure the MapReduce framework's runtime environment parameters and network settings. We will not actually use MapReduce, since its compute performance is inferior to Spark's, but the files are configured anyway.
mapred-site.xml:
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<!-- Set this to the actual master hostname and port -->
<value>hadoop-maste:10020</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/stage</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
</property>
</configuration>
YARN is configured through yarn-env.sh and yarn-site.xml. YARN is Hadoop's resource-scheduling system; as the file names suggest, the two files configure YARN's runtime environment and its network settings respectively. yarn-env.sh reads the JAVA_HOME environment variable and sets some default JDK parameters, so under normal circumstances it does not need to be modified.
yarn-site.xml:
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-maste</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-maste:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-maste:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-maste:8035</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop-maste:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-maste:8088</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>5</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>22528</value>
<description>Memory available on each node, in MB</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>4096</value>
<description>Minimum memory a single task may request; default 1024 MB</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>16384</value>
<description>Maximum memory a single task may request; default 8192 MB</description>
</property>
</configuration>
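After the cluster is started, the ResourceManager addresses above can be verified from any node, for example:
yarn node -list        # should show hadoop-node1 and hadoop-node2 as registered NodeManagers
# The ResourceManager web UI is then reachable at http://hadoop-maste:8088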
Finally, there are the master and slaves files. Hadoop is a master-slave distributed system, and which node is the master and which nodes are slaves is decided by these two files.
master file:
hadoop-maste
That is, the master node runs in the container whose hostname, per the network plan, is hadoop-maste.
slaves file:
hadoop-node1
hadoop-node2
That is, the slave nodes are hadoop-node1 and hadoop-node2; these two containers will run the HDFS DataNode processes and the NodeManager processes started by the YARN resource-management system.
7. Spark Configuration Files
The main files are:
masters
slaves
spark-defaults.conf
spark-env.sh
masters:
hadoop-maste
slaves:
hadoop-node1
hadoop-node2
spark-defaults.conf:
spark.executor.memory=2G
spark.driver.memory=2G
spark.executor.cores=2
#spark.sql.codegen.wholeStage=false
#spark.memory.offHeap.enabled=true
#spark.memory.offHeap.size=4G
#spark.memory.fraction=0.9
#spark.memory.storageFraction=0.01
#spark.kryoserializer.buffer.max=64m
#spark.shuffle.manager=sort
#spark.sql.shuffle.partitions=600
spark.speculation=true
spark.speculation.interval=5000
spark.speculation.quantile=0.9
spark.speculation.multiplier=2
spark.default.parallelism=1000
spark.driver.maxResultSize=1g
#spark.rdd.compress=false
spark.task.maxFailures=8
spark.network.timeout=300
spark.yarn.max.executor.failures=200
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=4
spark.dynamicAllocation.maxExecutors=8
spark.dynamicAllocation.executorIdleTimeout=60
#spark.serializer=org.apache.spark.serializer.JavaSerializer
#spark.sql.adaptive.enabled=true
#spark.sql.adaptive.shuffle.targetPostShuffleInputSize=100000000
#spark.sql.adaptive.minNumPostShufflePartitions=1
##for spark2.0
#spark.sql.hive.verifyPartitionPath=true
#spark.sql.warehouse.dir
spark.sql.warehouse.dir=/spark/warehouse
spark-env.sh:
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos
# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE Run the proposed command in the foreground. It will not output a PID file.
SPARK_MASTER_WEBUI_PORT=8888
export SPARK_HOME=$SPARK_HOME
export HADOOP_HOME=$HADOOP_HOME
export MASTER=spark://hadoop-maste:7077
export SCALA_HOME=$SCALA_HOME
export SPARK_MASTER_HOST=hadoop-maste
export JAVA_HOME=/usr/local/jdk1.8.0_101
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
#SPARK_LOCAL_DIRS=/home/spark/softwares/spark/local_dir
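With these files in place, jobs can be submitted either to the standalone master exported above or to YARN; a hedged example (the application path is a placeholder):
# Standalone mode, using the MASTER address from spark-env.sh
spark-submit --master spark://hadoop-maste:7077 /root/my_app.py
# Or on YARN, which picks up HADOOP_CONF_DIR from spark-env.sh
spark-submit --master yarn --deploy-mode client /root/my_app.py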
8. Hive Configuration Files
Hive is a data warehouse that supports SQL. Earlier versions of Spark SQL used Hive's SQL parser and optimizer underneath, so Spark naturally supports reading and writing Hive tables, provided enableHiveSupport is used in Spark.
Note again that Hive's configuration file hive-site.xml must be placed in the $SPARK_HOME/conf directory, so that Spark can find Hive's communication addresses when it operates on Hive. The important Hive configuration files are hive-site.xml and hive-env.sh.
hive-site.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/home/hive/warehouse</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop-hive:9083</value>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
<property>
<name>hive.server2.transport.mode</name>
<value>http</value>
</property>
<property>
<name>hive.server2.thrift.http.port</name>
<value>10001</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop-mysql:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
</configuration>
In this file, the javax.jdo.option.ConnectionURL option specifies the MySQL database in which Hive stores its metadata, javax.jdo.option.ConnectionDriverName specifies the JDBC driver, and hive.metastore.warehouse.dir specifies where the warehouse is kept in HDFS. hive.metastore.uris specifies the address at which the Hive metastore is reached, using the Thrift protocol. javax.jdo.option.ConnectionUserName specifies the database user name and javax.jdo.option.ConnectionPassword the database password.
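Because hive.server2.transport.mode is http, a JDBC client such as beeline would connect with an HTTP-mode URL along these lines (a sketch; the httpPath assumes HiveServer2's default cliservice endpoint):
beeline -u "jdbc:hive2://hadoop-hive:10001/default;transportMode=http;httpPath=cliservice" -n root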
As for hive-env.sh: Hive's data has to be stored in HDFS, so how does Hive talk to Hadoop? Hive's solution is to put the Hadoop path into hive-env.sh; Hive then looks for the configuration files under that Hadoop path, finds the information needed to talk to HDFS there, and can thus communicate with Hadoop.
hive-env.sh:
HADOOP_HOME=/usr/local/hadoop-2.7.3
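With HADOOP_HOME set and the MySQL connector jar in Hive's lib directory, initializing and starting the metastore on the hadoop-hive container typically looks like the sketch below (presumably roughly what the init_hive.sh script referenced in the Dockerfile takes care of; treat it as an assumption, not the script's actual contents):
schematool -dbType mysql -initSchema        # create the metastore tables in MySQL
hive --service metastore &                  # serve the metastore on thrift://hadoop-hive:9083
hive --service hiveserver2 &                # serve HiveServer2 over HTTP on port 10001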
9. Other Configuration
Douban pip mirror
This configuration speeds up pip for hosts inside mainland China. Create a pip.conf file containing:
[global]
index-url = http://pypi.douban.com/simple
trusted-host = pypi.douban.com
Place pip.conf at ~/.pip/pip.conf. When building the image, first create the ~/.pip directory, then move the pre-configured pip.conf from the config directory into ~/.pip.
profile
This file configures the system environment variables. It lives at /etc/profile, and the environment variables for hadoop, spark, hive, jdk, scala and mysql are all set here.
# /etc/profile: system-wide .profile file for the Bourne shell (sh(1))
# and Bourne compatible shells (bash(1), ksh(1), ash(1), ...).
if [ "$PS1" ]; then
if [ "$BASH" ] && [ "$BASH" != "/bin/sh" ]; then
# The file bash.bashrc already sets the default PS1.
# PS1='\h:\w\$ '
if [ -f /etc/bash.bashrc ]; then
. /etc/bash.bashrc
fi
else
if [ "`id -u`" -eq 0 ]; then
PS1='# '
else
PS1='$ '
fi
fi
fi
if [ -d /etc/profile.d ]; then
for i in /etc/profile.d/*.sh; do
if [ -r $i ]; then
. $i
fi
done
unset i
fi
export JAVA_HOME=/usr/local/jdk1.8.0_101
export SCALA_HOME=/usr/local/scala-2.11.8
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export SPARK_HOME=/usr/local/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/usr/local/apache-hive-2.3.2-bin
export MYSQL_HOME=/usr/local/mysql
export PATH=$HIVE_HOME/bin:$MYSQL_HOME/bin:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH
JAVA_HOME, SCALA_HOME, HADOOP_HOME, SPARK_HOME, HIVE_HOME and MYSQL_HOME are all added to PATH. When the image is built, this profile file is copied to /etc/profile; the point of the profile is to guard against environment variables set with the ENV instruction in the Dockerfile not taking effect.
restart_containers.sh, start_containers.sh, stop_containers.sh
These scripts start, restart and stop the containers; they bring the whole container cluster up, restart it or shut it down with a single command.
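The exact contents of these scripts are not listed here; a hedged sketch of what start_containers.sh might look like, following the network plan above (the IP addresses and image tag are placeholders):
#!/bin/bash
# Create the subnet if it does not exist yet, then start the five containers on it.
docker network create --subnet=172.16.0.0/16 spark 2>/dev/null
docker run -itd --name hadoop-maste --hostname hadoop-maste --net spark --ip 172.16.0.2 spark-hadoop:latest
docker run -itd --name hadoop-node1 --hostname hadoop-node1 --net spark --ip 172.16.0.3 spark-hadoop:latest
docker run -itd --name hadoop-node2 --hostname hadoop-node2 --net spark --ip 172.16.0.4 spark-hadoop:latest
docker run -itd --name hadoop-hive  --hostname hadoop-hive  --net spark --ip 172.16.0.5 spark-hadoop:latest
docker run -itd --name hadoop-mysql --hostname hadoop-mysql --net spark --ip 172.16.0.6 spark-hadoop:latest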
Dockerfile: the core file for building the image
FROM ubuntu
MAINTAINER reganzm [email protected]
ENV BUILD_ON 2018-03-04
COPY config /tmp
#RUN mv /tmp/apt.conf /etc/apt/
RUN mkdir -p ~/.pip/
RUN mv /tmp/pip.conf ~/.pip/pip.conf
RUN apt-get update -qqy
RUN apt-get -qqy install netcat-traditional vim wget net-tools iputils-ping openssh-server python-pip libaio-dev apt-utils
RUN pip install pandas numpy matplotlib sklearn seaborn scipy tensorflow gensim
#Add the JDK
ADD ./jdk-8u101-linux-x64.tar.gz /usr/local/
#Add hadoop
ADD ./hadoop-2.7.3.tar.gz /usr/local
#Add scala
ADD ./scala-2.11.8.tgz /usr/local
#Add spark
ADD ./spark-2.3.0-bin-hadoop2.7.tgz /usr/local
#Add mysql
ADD ./mysql-5.5.45-linux2.6-x86_64.tar.gz /usr/local
RUN mv /usr/local/mysql-5.5.45-linux2.6-x86_64 /usr/local/mysql
ENV MYSQL_HOME /usr/local/mysql
#Add hive
ADD ./apache-hive-2.3.2-bin.tar.gz /usr/local
ENV HIVE_HOME /usr/local/apache-hive-2.3.2-bin
RUN echo "HADOOP_HOME=/usr/local/hadoop-2.7.3" | cat >> /usr/local/apache-hive-2.3.2-bin/conf/hive-env.sh
#Add mysql-connector-java-5.1.37-bin.jar to Hive's lib directory
ADD ./mysql-connector-java-5.1.37-bin.jar /usr/local/apache-hive-2.3.2-bin/lib
RUN cp /usr/local/apache-hive-2.3.2-bin/lib/mysql-connector-java-5.1.37-bin.jar /usr/local/spark-2.3.0-bin-hadoop2.7/jars
#Add the JAVA_HOME environment variable
ENV JAVA_HOME /usr/local/jdk1.8.0_101
#hadoop environment variable
ENV HADOOP_HOME /usr/local/hadoop-2.7.3
#scala environment variable
ENV SCALA_HOME /usr/local/scala-2.11.8
#spark environment variable
ENV SPARK_HOME /usr/local/spark-2.3.0-bin-hadoop2.7
#Add the above to the system PATH
ENV PATH $HIVE_HOME/bin:$MYSQL_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$PATH
RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
chmod 600 ~/.ssh/authorized_keys
COPY config /tmp
#Move the configuration files into their proper locations
RUN mv /tmp/ssh_config ~/.ssh/config && \
mv /tmp/profile /etc/profile && \
mv /tmp/masters $SPARK_HOME/conf/masters && \
cp /tmp/slaves $SPARK_HOME/conf/ && \
mv /tmp/spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf && \
mv /tmp/spark-env.sh $SPARK_HOME/conf/spark-env.sh && \
cp /tmp/hive-site.xml $SPARK_HOME/conf/hive-site.xml && \
mv /tmp/hive-site.xml $HIVE_HOME/conf/hive-site.xml && \
mv /tmp/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh && \
mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \
mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
mv /tmp/master $HADOOP_HOME/etc/hadoop/master && \
mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves && \
mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
mkdir -p /usr/local/hadoop2.7/dfs/data && \
mkdir -p /usr/local/hadoop2.7/dfs/name && \
mv /tmp/init_mysql.sh ~/init_mysql.sh && chmod 700 ~/init_mysql.sh && \
mv /tmp/init_hive.sh ~/init_hive.sh && chmod 700 ~/init_hive.sh && \
mv /tmp/restart-hadoop.sh ~/restart-hadoop.sh && chmod 700 ~/restart-hadoop.sh
RUN echo $JAVA_HOME
#Set the working directory
WORKDIR /root
#Start the sshd service
RUN /etc/init.d/ssh start
#Change the permissions of start-hadoop.sh to 700
RUN chmod 700 start-hadoop.sh
#Change the root password
RUN echo "root:111111" | chpasswd
CMD ["/bin/bash"]
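Assuming the software tarballs and the config directory sit next to this Dockerfile, the image is built and the cluster brought up roughly as follows (the image tag spark-hadoop:latest is a placeholder):
docker build -t spark-hadoop:latest .          # build the image from this Dockerfile
docker network create --subnet=172.16.0.0/16 spark
./start_containers.sh                          # start the five containers on the spark subnet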