Building a Spark Environment on Docker: Theory

1. Image Build Plan

We will use Docker to build a Hadoop, Spark, Hive, and MySQL cluster. The image is built with a Dockerfile: the software packages are copied into agreed-upon directories, the configuration files are prepared outside the image in advance, and during docker build / docker run they are copied (moved) into the Hadoop, Spark, and Hive configuration directories. One point to note: for Spark to read data stored in Hive, the hive-site.xml configuration file must be copied into Spark's conf directory (when Spark reads Hive tables, it uses hive-site.xml to find and talk to the Hive metastore). In addition, MySQL's access permissions must be configured so that MySQL can be reached from the other nodes (MySQL is used to store Hive's metadata).
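A minimal sketch of those two steps (the GRANT statement and the root/root credentials are assumptions based on the hive-site.xml shown later; the article only states that the permissions must be configured):

# 1) Let Spark find the Hive metastore by sharing hive-site.xml:
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
# 2) Allow MySQL to be reached from the other containers (one common way; assumed):
mysql -u root -p -e "GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root'; FLUSH PRIVILEGES;"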

If configuration files are edited inside a container, and the container is later removed with docker rm without first committing the changes back into an image with docker commit, all of the configuration inside the container is lost.
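For illustration, a container's changes can be persisted to a new image before it is removed (the container and image names below are just examples):

# Persist the container's filesystem changes into a new image, then remove it.
docker commit hadoop-maste spark-hadoop:configured
docker rm -f hadoop-maste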

2. Overall Cluster Architecture

There are five nodes in total, i.e. five containers are started. The three containers hadoop-maste, hadoop-node1, and hadoop-node2 run the Hadoop and Spark cluster, the hadoop-hive container runs Hive, and the hadoop-mysql container runs the MySQL database.

[Figure: overall cluster architecture]

In Spark, calling the enableHiveSupport() method on the SparkSession builder enables support for operating on Hive warehouse tables. MySQL is used to store Hive's metadata. A Spark DataFrame can of course also be written to MySQL through its write method.
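As a rough smoke test once everything is up (the table and database names below are placeholders; the spark session pre-built by spark-shell already has Hive support when hive-site.xml is in place, while enableHiveSupport() is what you would call when building your own SparkSession in an application):

$SPARK_HOME/bin/spark-shell --master spark://hadoop-maste:7077 <<'EOF'
spark.sql("show databases").show()
// Placeholder table and target database, for illustration only; the MySQL
// connector jar is already copied into $SPARK_HOME/jars by the Dockerfile.
spark.table("default.some_table")
  .write.format("jdbc")
  .option("url", "jdbc:mysql://hadoop-mysql:3306/test")
  .option("dbtable", "some_table_copy")
  .option("user", "root")
  .option("password", "root")
  .mode("overwrite")
  .save()
EOF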

3. Cluster Network Planning and Subnet Configuration

The network is set up using Docker's networking support. First the network is created: in Docker a subnet is created with docker network create, and here we create the following subnet with this command:

docker network create --subnet=172.16.0.0/16 spark

--subnet specifies the subnet's address range, and the subnet is given the name spark.

Next, within the spark subnet we just created, we plan the IP address of each container in the cluster. The IP allocation is as follows:

[Figure: container IP address allocation]
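Since the allocation table is an image, here is an illustrative docker run for a single container showing how a fixed address in the spark subnet is assigned (the concrete IP and the image name are assumptions; only the 172.16.0.0/16 subnet and the hostnames come from the text):

# Illustrative only: attach the master container to the 'spark' network with a
# fixed IP (any address inside 172.16.0.0/16) and its planned hostname.
docker run -itd --name hadoop-maste --hostname hadoop-maste \
    --net spark --ip 172.16.0.2 spark-hadoop:latest /bin/bash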

Note: the hostnames of all five containers start with hadoop-*. This matters because we will configure SSH logins between the containers: rather than generating an id_rsa.pub on every node and exchanging the public keys, we rely on SSH host-matching rules (see section 5) to let the containers communicate with one another.

4. Software Versions

Spark: the latest version, 2.3.0

Hadoop: the stable hadoop-2.7.3

Hive: the latest stable release, hive-2.3.2

Scala: scala-2.11.8

JDK: jdk-8u101-linux-x64

MySQL: mysql-5.5.45-linux2.6-x86_64

JDBC driver used by Hive and Spark to connect to MySQL: mysql-connector-java-5.1.37-bin.jar

5. SSH Login Rules Without Key Exchange

Instead of generating id_rsa.pub with ssh-keygen -t rsa -P '' on every node and then copying each node's id_rsa.pub into the other nodes' authorized_keys, we place an ssh_conf file in the .ssh directory; ssh_conf configures the SSH communication rules. (Because every container is built from the same image, they already share a single key pair; the rules below simply disable strict host-key checking so that the first connection between containers does not stop at an interactive prompt.)

Contents of ssh_conf (installed as ~/.ssh/config by the Dockerfile):

Host localhost
	StrictHostKeyChecking no
	
Host 0.0.0.0
	StrictHostKeyChecking no
	
Host hadoop-*
	StrictHostKeyChecking no

6. Hadoop, HDFS, and YARN Configuration Files

Hadoop's configuration files live under HADOOP_HOME/etc/hadoop. The important ones are these nine: core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-env.sh, mapred-site.xml, yarn-env.sh, yarn-site.xml, master, and slaves.

core-site.xml configures Hadoop's default file system URI, plus settings such as which users and groups may access the file system (the proxy-user entries). core-site.xml is configured as follows:

<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-maste:9000/</value>
    </property>
	<property>
         <name>hadoop.tmp.dir</name>
         <value>file:/usr/local/hadoop/tmp</value>
    </property>
	<property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
	<property>
        <name>hadoop.proxyuser.oozie.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.oozie.groups</name>
        <value>*</value>
    </property>
</configuration>
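Once the cluster is running, fs.defaultFS can be verified from any container with a quick listing (a verification step added here for convenience, not part of the original walkthrough):

# Both forms should list the same HDFS root once the NameNode is up.
hdfs dfs -ls /
hdfs dfs -ls hdfs://hadoop-maste:9000/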

The hadoop-env.sh file configures the JDK environment that Hadoop depends on at runtime, along with some JVM options. Apart from the JAVA_HOME path, nothing needs to be changed. Its contents are:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
# This is the one setting that must be customized: point JAVA_HOME at our JDK
export JAVA_HOME=/usr/local/jdk1.8.0_101

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by 
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

Next comes hdfs-site.xml, which mainly configures where the HDFS NameNode and DataNodes store their data, and the replication factor for data blocks.

<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop2.7/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop2.7/dfs/data</value>
    </property>
	<property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
	<property>
	    <name>dfs.permissions.enabled</name>
		<value>false</value>
	</property>
 </configuration>

mapred-env.sh and mapred-site.xml configure the runtime parameters and network settings of the MapReduce framework. We will not dwell on them, because we will not be using MapReduce: its compute performance does not match Spark's.

mapred-site.xml:

<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <!-- The actual master hostname and port -->
        <value>hadoop-maste:10020</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>8192</value>
    </property>
	<property>
      <name>yarn.app.mapreduce.am.staging-dir</name>
      <value>/stage</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.done-dir</name>
      <value>/mr-history/done</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.intermediate-done-dir</name>
      <value>/mr-history/tmp</value>
    </property>
</configuration>

YARN is configured by two files, yarn-env.sh and yarn-site.xml. YARN is Hadoop's resource scheduling system; as the file names suggest, the two files configure its runtime environment and its network settings respectively. yarn-env.sh reads the JAVA_HOME environment variable and sets some default JDK parameters, so under normal circumstances we do not need to modify it.

yarn-site.xml:

<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-maste</value>
    </property>
	<property>
        <name>yarn.resourcemanager.address</name>
        <value>hadoop-maste:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hadoop-maste:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hadoop-maste:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>hadoop-maste:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>hadoop-maste:8088</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
   </property>
    <property>
       <name>yarn.nodemanager.vmem-pmem-ratio</name>
       <value>5</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>22528</value>
        <description>Memory available on each node, in MB</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>4096</value>
        <description>Minimum memory a single task can request; default 1024 MB</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>16384</value>
        <description>Maximum memory a single task can request; default 8192 MB</description>
    </property>
</configuration>
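Once YARN is running, a quick way to confirm that the NodeManagers registered with the ResourceManager (again a verification step, not part of the original text):

# hadoop-node1 and hadoop-node2 should show up here; the ResourceManager web UI
# configured above is at http://hadoop-maste:8088.
yarn node -list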

Finally, there are the master and slaves files. Hadoop is a master-slave distributed system; which node is the master and which nodes are slaves is determined by these two files.

The master file:

hadoop-maste

That is, the master node runs in the container whose hostname is hadoop-maste in our network plan.

The slaves file:

hadoop-node1
hadoop-node2

That is, hadoop-node1 and hadoop-node2 are the slave nodes; these two containers will run the HDFS DataNode processes and the NodeManager processes started by the YARN resource management system.
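The start-hadoop.sh script referenced later in the Dockerfile is not listed in this article; a hypothetical minimal sketch of what such a script does on hadoop-maste, given the directory layout configured above, is:

#!/bin/bash
# Format the NameNode only on the first run, then start HDFS and YARN; the
# slaves file decides which nodes get DataNode/NodeManager processes.
if [ ! -d /usr/local/hadoop2.7/dfs/name/current ]; then
    $HADOOP_HOME/bin/hdfs namenode -format
fi
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh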

7. Spark Configuration Files

The main ones are masters, slaves, spark-defaults.conf, and spark-env.sh.

masters:

hadoop-maste

slaves:

hadoop-node1
hadoop-node2

spark-defaults.conf:

spark.executor.memory=2G
spark.driver.memory=2G
spark.executor.cores=2
#spark.sql.codegen.wholeStage=false
#spark.memory.offHeap.enabled=true
#spark.memory.offHeap.size=4G
#spark.memory.fraction=0.9
#spark.memory.storageFraction=0.01
#spark.kryoserializer.buffer.max=64m
#spark.shuffle.manager=sort
#spark.sql.shuffle.partitions=600
spark.speculation=true
spark.speculation.interval=5000
spark.speculation.quantile=0.9
spark.speculation.multiplier=2
spark.default.parallelism=1000
spark.driver.maxResultSize=1g
#spark.rdd.compress=false
spark.task.maxFailures=8
spark.network.timeout=300
spark.yarn.max.executor.failures=200
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=4
spark.dynamicAllocation.maxExecutors=8
spark.dynamicAllocation.executorIdleTimeout=60
#spark.serializer=org.apache.spark.serializer.JavaSerializer
#spark.sql.adaptive.enabled=true
#spark.sql.adaptive.shuffle.targetPostShuffleInputSize=100000000
#spark.sql.adaptive.minNumPostShufflePartitions=1
##for spark2.0
#spark.sql.hive.verifyPartitionPath=true
#spark.sql.warehouse.dir
spark.sql.warehouse.dir=/spark/warehouse

spark-env.sh:

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.


SPARK_MASTER_WEBUI_PORT=8888

export SPARK_HOME=$SPARK_HOME
export HADOOP_HOME=$HADOOP_HOME
export MASTER=spark://hadoop-maste:7077
export SCALA_HOME=$SCALA_HOME
export SPARK_MASTER_HOST=hadoop-maste


export JAVA_HOME=/usr/local/jdk1.8.0_101

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

#SPARK_LOCAL_DIRS=/home/spark/softwares/spark/local_dir
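With masters, slaves, and spark-env.sh in place, the standalone Spark cluster is brought up from hadoop-maste using the standard scripts that ship with Spark 2.3.0 (not spelled out in the original):

# Start the Spark master, then the workers listed in $SPARK_HOME/conf/slaves.
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh
# The master web UI is on port 8888, as set by SPARK_MASTER_WEBUI_PORT above.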

8. Hive Configuration Files

Hive is a data warehouse that supports SQL statements. Earlier versions of Spark SQL used Hive's SQL parser and optimizer under the hood, so Spark naturally supports reading and writing Hive tables, provided the enableHiveSupport directive is used in Spark.

Note that Hive's configuration file hive-site.xml must also be placed in the $SPARK_HOME/conf directory, so that Spark can find the corresponding Hive communication address when it operates on Hive. Hive's important configuration files are hive-site.xml and hive-env.sh.

hive-site.xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 
<configuration> 
        <property>
                <name>hive.metastore.warehouse.dir</name>
                <value>/home/hive/warehouse</value>
        </property>
        <property>
                <name>hive.exec.scratchdir</name>
                <value>/tmp/hive</value>
        </property>
        <property> 
                <name>hive.metastore.uris</name> 
                <value>thrift://hadoop-hive:9083</value> 
                <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description> 
        </property>
        <property>
                <name>hive.server2.transport.mode</name>
                <value>http</value>
        </property>
        <property>
                <name>hive.server2.thrift.http.port</name>
                <value>10001</value>
        </property>
        
        <property> 
                <name>javax.jdo.option.ConnectionURL</name> 
                <value>jdbc:mysql://hadoop-mysql:3306/hive?createDatabaseIfNotExist=true</value> 
        </property> 
        <property> 
                <name>javax.jdo.option.ConnectionDriverName</name> 
                <value>com.mysql.jdbc.Driver</value> 
        </property> 
        <property> 
                <name>javax.jdo.option.ConnectionUserName</name> 
                <value>root</value> 
        </property> 
        <property> 
                <name>javax.jdo.option.ConnectionPassword</name> 
                <value>root</value> 
        </property> 
        <property> 
                <name>hive.metastore.schema.verification</name> 
                <value>false</value> 
        </property>
        <property>
               <name>hive.server2.authentication</name>
               <value>NONE</value>
        </property> 
</configuration>

In this configuration file, javax.jdo.option.ConnectionURL points to the MySQL database that stores Hive's metadata, javax.jdo.option.ConnectionDriverName specifies the JDBC driver, and hive.metastore.warehouse.dir specifies where the warehouse lives in HDFS. hive.metastore.uris is the address clients use to reach the Hive metastore, over the thrift protocol. javax.jdo.option.ConnectionUserName gives the user name for the database connection, and javax.jdo.option.ConnectionPassword gives its password.
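Before Spark or the Hive CLI can use this metastore, its schema has to exist in MySQL and the metastore service has to be listening on port 9083; this is presumably what the init_hive.sh script mentioned in the Dockerfile takes care of. A sketch of those steps on the hadoop-hive container:

# Create the metastore schema in MySQL, then start the metastore
# (thrift://hadoop-hive:9083) and HiveServer2 in the background.
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
nohup $HIVE_HOME/bin/hive --service metastore > /tmp/metastore.log 2>&1 &
nohup $HIVE_HOME/bin/hive --service hiveserver2 > /tmp/hiveserver2.log 2>&1 &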

As for hive-env.sh: Hive stores its data in HDFS, so how does Hive talk to Hadoop? Hive's solution is to set the Hadoop path in hive-env.sh; Hive then looks for the configuration files under that path, finds the information needed to communicate with HDFS, and thereby completes the link between Hive and Hadoop.

hive-env.sh:

HADOOP_HOME=/usr/local/hadoop-2.7.3

9. Other Configuration

Douban pip mirror

This configuration speeds up pip for hosts inside China.

Create a pip.conf file containing:

[global]
index-url = http://pypi.douban.com/simple
trusted-host = pypi.douban.com

Place pip.conf at ~/.pip/pip.conf: when building the image, first create the ~/.pip directory, then mv the pre-configured pip.conf from the config directory into ~/.pip.

profile

This configures the system-wide environment variables. The profile file is located at /etc/profile; the environment variables for hadoop, spark, hive, the JDK, scala, and mysql are all configured here.

# /etc/profile: system-wide .profile file for the Bourne shell (sh(1))
# and Bourne compatible shells (bash(1), ksh(1), ash(1), ...).

if [ "$PS1" ]; then
  if [ "$BASH" ] && [ "$BASH" != "/bin/sh" ]; then
    # The file bash.bashrc already sets the default PS1.
    # PS1='\h:\w\$ '
    if [ -f /etc/bash.bashrc ]; then
      . /etc/bash.bashrc
    fi
  else
    if [ "`id -u`" -eq 0 ]; then
      PS1='# '
    else
      PS1='$ '
    fi
  fi
fi

if [ -d /etc/profile.d ]; then
  for i in /etc/profile.d/*.sh; do
    if [ -r $i ]; then
      . $i
    fi
  done
  unset i
fi

export JAVA_HOME=/usr/local/jdk1.8.0_101
export SCALA_HOME=/usr/local/scala-2.11.8
export HADOOP_HOME=/usr/local/hadoop-2.7.3
export SPARK_HOME=/usr/local/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/usr/local/apache-hive-2.3.2-bin
export MYSQL_HOME=/usr/local/mysql

export PATH=$HIVE_HOME/bin:$MYSQL_HOME/bin:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH


JAVA_HOME, SCALA_HOME, HADOOP_HOME, SPARK_HOME, HIVE_HOME, and MYSQL_HOME are each added to PATH. When building the image, this profile file is COPYed to /etc/profile. The profile acts as a safeguard in case the environment variables set via ENV in the Dockerfile do not take effect (ENV values are generally not visible in sessions opened over SSH, for example).

restart_containers.sh, start_containers.sh, stop_containers.sh

These scripts start, restart, and stop the containers; they bring the whole container cluster up or down with a single command.
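The scripts themselves are not listed in this article; a minimal sketch of what start_containers.sh could look like, using the subnet from section 3 (the IP addresses and the image name are assumptions):

#!/bin/bash
# Create the subnet if it does not exist yet, then start the five containers
# with fixed IPs and the planned hostnames.
docker network inspect spark >/dev/null 2>&1 || \
    docker network create --subnet=172.16.0.0/16 spark

run_node() {  # run_node <hostname> <ip>
    docker run -itd --name "$1" --hostname "$1" --net spark --ip "$2" \
        spark-hadoop:latest /bin/bash
}

run_node hadoop-maste 172.16.0.2
run_node hadoop-node1 172.16.0.3
run_node hadoop-node2 172.16.0.4
run_node hadoop-hive  172.16.0.5
run_node hadoop-mysql 172.16.0.6

stop_containers.sh and restart_containers.sh can then be little more than docker stop / docker restart over the same container names.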

The Dockerfile, the core file used to build the image:

FROM ubuntu
MAINTAINER reganzm [email protected]

ENV BUILD_ON 2018-03-04

COPY config /tmp
#RUN mv /tmp/apt.conf /etc/apt/
RUN mkdir -p ~/.pip/
RUN mv /tmp/pip.conf ~/.pip/pip.conf

RUN apt-get update -qqy

RUN apt-get -qqy install netcat-traditional vim wget net-tools  iputils-ping  openssh-server python-pip libaio-dev apt-utils

RUN pip install pandas  numpy  matplotlib  sklearn  seaborn  scipy tensorflow  gensim
# Add the JDK
ADD ./jdk-8u101-linux-x64.tar.gz /usr/local/
# Add Hadoop
ADD ./hadoop-2.7.3.tar.gz  /usr/local
# Add Scala
ADD ./scala-2.11.8.tgz /usr/local
# Add Spark
ADD ./spark-2.3.0-bin-hadoop2.7.tgz /usr/local
# Add MySQL
ADD ./mysql-5.5.45-linux2.6-x86_64.tar.gz /usr/local
RUN mv /usr/local/mysql-5.5.45-linux2.6-x86_64  /usr/local/mysql
ENV MYSQL_HOME /usr/local/mysql

# Add Hive
ADD ./apache-hive-2.3.2-bin.tar.gz /usr/local
ENV HIVE_HOME /usr/local/apache-hive-2.3.2-bin
RUN echo "HADOOP_HOME=/usr/local/hadoop-2.7.3"  | cat >> /usr/local/apache-hive-2.3.2-bin/conf/hive-env.sh
# Add mysql-connector-java-5.1.37-bin.jar to Hive's lib directory
ADD ./mysql-connector-java-5.1.37-bin.jar /usr/local/apache-hive-2.3.2-bin/lib
RUN cp /usr/local/apache-hive-2.3.2-bin/lib/mysql-connector-java-5.1.37-bin.jar /usr/local/spark-2.3.0-bin-hadoop2.7/jars

# JAVA_HOME environment variable
ENV JAVA_HOME /usr/local/jdk1.8.0_101
# Hadoop environment variable
ENV HADOOP_HOME /usr/local/hadoop-2.7.3 
# Scala environment variable
ENV SCALA_HOME /usr/local/scala-2.11.8
# Spark environment variable
ENV SPARK_HOME /usr/local/spark-2.3.0-bin-hadoop2.7
# Add these environment variables to the system PATH
ENV PATH $HIVE_HOME/bin:$MYSQL_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$PATH

RUN ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
    chmod 600 ~/.ssh/authorized_keys

COPY config /tmp
# Move the configuration files into place
RUN mv /tmp/ssh_config    ~/.ssh/config && \
    mv /tmp/profile /etc/profile && \
    mv /tmp/masters $SPARK_HOME/conf/masters && \
    cp /tmp/slaves $SPARK_HOME/conf/ && \
    mv /tmp/spark-defaults.conf $SPARK_HOME/conf/spark-defaults.conf && \
    mv /tmp/spark-env.sh $SPARK_HOME/conf/spark-env.sh && \ 
    cp /tmp/hive-site.xml $SPARK_HOME/conf/hive-site.xml && \
    mv /tmp/hive-site.xml $HIVE_HOME/conf/hive-site.xml && \
    mv /tmp/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh && \
    mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml && \ 
    mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml && \
    mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml && \
    mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml && \
    mv /tmp/master $HADOOP_HOME/etc/hadoop/master && \
    mv /tmp/slaves $HADOOP_HOME/etc/hadoop/slaves && \
    mv /tmp/start-hadoop.sh ~/start-hadoop.sh && \
    mkdir -p /usr/local/hadoop2.7/dfs/data && \
    mkdir -p /usr/local/hadoop2.7/dfs/name && \
    mv /tmp/init_mysql.sh ~/init_mysql.sh && chmod 700 ~/init_mysql.sh && \
    mv /tmp/init_hive.sh ~/init_hive.sh && chmod 700 ~/init_hive.sh && \
    mv /tmp/restart-hadoop.sh ~/restart-hadoop.sh && chmod 700 ~/restart-hadoop.sh
RUN echo $JAVA_HOME
# Set the working directory
WORKDIR /root
# Start the sshd service
RUN /etc/init.d/ssh start
# Make start-hadoop.sh executable (mode 700)
RUN chmod 700 start-hadoop.sh
# Set the root password
RUN echo "root:111111" | chpasswd
CMD ["/bin/bash"]
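Once the Dockerfile, the config directory, and the downloaded packages listed in section 4 sit in the same build context, the image is built in the usual way (the tag name here is just an example):

# Build the image from the directory containing the Dockerfile.
docker build -t spark-hadoop:latest .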