
Win10 + CentOS 7 + Hadoop cluster environment setup

I. Preliminary preparation

1. VMware Workstation Pro 16

Download from the official site: https://www.vmware.com/

License key: ZF3R0-FHED2-M80TY-8QYGC-NPKYF (if it has expired, search online for a current one)

2. Xshell and Xftp, downloaded from the official site (registration required)

3. Download CentOS from a domestic mirror site (this guide uses CentOS 7), e.g. the Tsinghua, Alibaba or Huawei mirrors:

https://mirrors.tuna.tsinghua.edu.cn , https://developer.aliyun.com/mirror/ , https://mirrors.huaweicloud.com/

4. Required packages

hadoop-2.7.3, jdk-1.8.0

II. Install VMware Workstation Pro 16

III. Install Xshell and Xftp

IV. VMware network configuration

Why: putting the Windows host and the virtual machines on the same subnet lets you open the Hadoop cluster's file-management page (i.e. master:50070) from a browser on Windows, and it also makes it easier to connect IDEA to the cluster later. Fixing the virtual machines' IP addresses simplifies the remaining steps.

1. In VMware: Edit -> Virtual Network Editor -> select the VMnet8 (NAT) network

2. Click NAT Settings

3. Click DHCP Settings for the network selected in step 1

4. Set the VMnet8 address

5. On Windows, right-click the VMnet8 adapter -> Properties

V. Create the virtual machine instances

I created three virtual machines, with the hostnames master, slave1 and slave2.

During creation, choose NAT as the network type.

To change the hostname, see: https://www.cnblogs.com/HusterX/articles/13425074.html
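
On CentOS 7 this can also be done directly with hostnamectl; a minimal sketch (run the matching command on each machine, then re-log-in and check with hostname):

hostnamectl set-hostname master    # on the machine that will be the master
hostnamectl set-hostname slave1    # on the first worker
hostnamectl set-hostname slave2    # on the second worker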

Turn off the firewall (or open the required ports)

firewall-cmd --state
systemctl stop firewalld.service
systemctl disable firewalld.service
CloseFirewalld

Add a user (a dedicated user for managing hadoop; for a purely local setup this step can be skipped, as long as everything is done as root)

UserAdd
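
A minimal sketch of this step, assuming the dedicated user is called hadoop (the name is a placeholder):

useradd hadoop               # create the user
passwd hadoop                # set its password interactively
usermod -aG wheel hadoop     # optional: allow sudo via the wheel group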

The final layout:

IP address        Hostname    Main roles
192.168.47.131    master      NameNode, ResourceManager
192.168.47.132    slave1      DataNode, NodeManager
192.168.47.130    slave2      DataNode, NodeManager

Edit the /etc/hosts file on master:

192.168.47.131 master
192.168.47.132 slave1
192.168.47.130 slave2
master's hosts
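
A quick way to confirm that the names resolve from master (optional):

ping -c 1 slave1
ping -c 1 slave2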

VI. CentOS 7 system environment setup

Perform the following steps after connecting to master with Xshell, or directly inside master.

PS: everything below is done as root (if you use the hadoop user instead, watch out for permission issues).

1. Use Xftp to upload the JDK and Hadoop archives to a directory on master (this guide uses /opt)
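
If you prefer the command line to Xftp, scp from any machine that holds the archives works as well; a sketch, with placeholder file names (use whatever versions you downloaded):

scp jdk-8u***-linux-x64.tar.gz root@192.168.47.131:/opt/
scp hadoop-2.7.3.tar.gz        root@192.168.47.131:/opt/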

2. Set up the Java environment

1. Check whether Java is already installed
  java -version

2. If an OpenJDK is present, remove it
   list it:
   rpm -qa | grep openjdk
   remove it:
   rpm -e --nodeps [the packages listed]

3. Extract the JDK archive uploaded to /opt
   tar -zxvf jdk1.8*****
   rename it
   mv  jdk1.8*****  jdk8

4. Add the environment variables (as root)
   vim /etc/profile

   add:
   export JAVA_HOME=/opt/jdk8
   export PATH=$PATH:$JAVA_HOME/bin

5. Apply the changes
   source /etc/profile

6. Test
   java -version
CentOS 7 Java environment setup

3. Set up the Hadoop environment

1. Extract the Hadoop archive uploaded to /opt
  tar -zxf hadoop-2.7.3.tar.gz 
  rename it
  mv  hadoop-2.7.3 hadoop

2. Configure the Hadoop environment variables (as root)
   vim /etc/profile

   add:
   export HADOOP_HOME=/opt/hadoop

   change the PATH line to:
   export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

3. Test
   hadoop version

[root@master ~]# hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.7.3.jar
CentOS 7 Hadoop environment setup

4. Passwordless SSH login

1. Generate the key pair
[root@master ~]# ssh-keygen -t rsa 
the keys are written to /root/.ssh
[root@master ~]# ls /root/.ssh/
id_rsa  id_rsa.pub

2. Add the public key to the trusted list
[root@master ~]# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
[root@master ~]# ls /root/.ssh/
authorized_keys  id_rsa  id_rsa.pub

3. Set the permissions
[root@master ~]# chmod 600 /root/.ssh/authorized_keys

4. Repeat steps 1-3 on the other CentOS machines
 
5. Distribute the key to the other hosts in the cluster
    usage: ssh-copy-id [-i [identity_file]] [user@]machine
[root@master ~]# ssh-copy-id  [user]@[IP]
Passwordless SSH setup
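
Once the keys have been distributed, logging in from master to a slave should no longer prompt for a password; a quick check:

[root@master ~]# ssh slave1
[root@slave1 ~]# exit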

5. Create the working directories under /opt/hadoop/

(1) create hdfs

(2) under hdfs, create the name, data and tmp directories, as sketched below
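
A minimal sketch of this step:

mkdir -p /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data /opt/hadoop/hdfs/tmp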

6. Hadoop configuration files (under /opt/hadoop/etc/hadoop/)

In hadoop-env.sh, add export JAVA_HOME=/opt/jdk8

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by 
#       the user that will run the hadoop daemons.  Otherwise there is the
#       potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
export JAVA_HOME=/opt/jdk8
hadoop-env.sh
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <!-- IP address of master -->
    <!-- hdfs://master:9000 works here as well -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.47.131:9000</value>
    </property>
</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- dfs.replication is the number of block replicas; for a pseudo-distributed setup it should be 1 -->
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
<!-- dfs.namenode.name.dir is a local directory where the fsimage files are stored -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/hdfs/name</value>
    </property>
<!-- dfs.datanode.data.dir is a local directory where HDFS stores its blocks -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/hdfs/data</value>
    </property>

</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
mapred-site.xml
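
Note: the stock 2.7.3 distribution usually ships this file only as a template, so it typically has to be created first (run inside /opt/hadoop/etc/hadoop):

cp mapred-site.xml.template mapred-site.xml

The slaves file below simply lists the worker hostnames.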
slave1
slave2
slaves

PS: the contents of the slaves file depend on your own deployment; I deployed two slaves.

7. The configuration on master is now complete

Copy /opt/jdk8 and /opt/hadoop from master to the /opt directory on slave1 and slave2.

Copy /etc/hosts and /etc/profile from master to the corresponding locations on slave1 and slave2 (a sketch with scp follows).
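
A minimal sketch of both copy steps, assuming passwordless SSH to the slaves is already in place:

for host in slave1 slave2; do
    scp -r /opt/jdk8 /opt/hadoop root@$host:/opt/
    scp /etc/hosts   root@$host:/etc/hosts
    scp /etc/profile root@$host:/etc/profile
done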

8. On slave1 and slave2, run

source /etc/profile

For reference, my /etc/profile file:

# /etc/profile

# System wide environment and startup programs, for login setup
# Functions and aliases go in /etc/bashrc

# It's NOT a good idea to change this file unless you know what you
# are doing. It's much better to create a custom.sh shell script in
# /etc/profile.d/ to make custom changes to your environment, as this
# will prevent the need for merging in future updates.

pathmunge () {
    case ":${PATH}:" in
        *:"$1":*)
            ;;
        *)
            if [ "$2" = "after" ] ; then
                PATH=$PATH:$1
            else
                PATH=$1:$PATH
            fi
    esac
}


if [ -x /usr/bin/id ]; then
    if [ -z "$EUID" ]; then
        # ksh workaround
        EUID=`/usr/bin/id -u`
        UID=`/usr/bin/id -ru`
    fi
    USER="`/usr/bin/id -un`"
    LOGNAME=$USER
    MAIL="/var/spool/mail/$USER"
fi

# Path manipulation
if [ "$EUID" = "0" ]; then
    pathmunge /usr/sbin
    pathmunge /usr/local/sbin
else
    pathmunge /usr/local/sbin after
    pathmunge /usr/sbin after
fi

HOSTNAME=`/usr/bin/hostname 2>/dev/null`
HISTSIZE=1000
if [ "$HISTCONTROL" = "ignorespace" ] ; then
    export HISTCONTROL=ignoreboth
else
    export HISTCONTROL=ignoredups
fi

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL

# By default, we want umask to get set. This sets it for login shell
# Current threshold for system reserved uid/gids is 200
# You could check uidgid reservation validity in
# /usr/share/doc/setup-*/uidgid file
if [ $UID -gt 199 ] && [ "`/usr/bin/id -gn`" = "`/usr/bin/id -un`" ]; then
    umask 002
else
    umask 022
fi

for i in /etc/profile.d/*.sh /etc/profile.d/sh.local ; do
    if [ -r "$i" ]; then
        if [ "${-#*i}" != "$-" ]; then 
            . "$i"
        else
            . "$i" >/dev/null
        fi
    fi
done

unset i
unset -f pathmunge
export JAVA_HOME=/opt/jdk8
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
/etc/profile

VII. Start the Hadoop cluster

1. Format the file system
hadoop namenode -format    (or: hdfs namenode -format)
(note: this only needs to be run once; later startups must not format again,
unless the master/slave layout has changed)

2. Start hadoop (from the /opt/hadoop directory)
   sbin/start-all.sh

3. Verify with jps; processes on master:
   [root@master hadoop]# jps
   28448 ResourceManager
   31777 Jps
   28293 SecondaryNameNode
   28105 NameNode
   processes on slave1:
   [root@slave1 ~]# jps
   22950 Jps
   18665 NodeManager
   18558 DataNode

4. View in a browser
    http://master:50070
     If Windows and the virtual machines are on the same subnet, this page can be opened from a browser on Windows (see the hosts-file note after this block); otherwise it can still be reached with extra configuration. It can also be opened from a browser inside master.

5. Stop the hadoop cluster (from /opt/hadoop)
    sbin/stop-all.sh
Startup commands
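
For the master:50070 address to resolve on the Windows side, the same name-to-IP mappings can be added to the Windows hosts file (C:\Windows\System32\drivers\etc\hosts, edited as administrator), using the IPs from this guide:

192.168.47.131 master
192.168.47.132 slave1
192.168.47.130 slave2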

VIII. Test with a program

1. In the /opt/hadoop directory, create a small test file
echo "this is a test case, loading, please wait a minit" >> test

2. Create the input directory with an hdfs command
   hadoop fs -mkdir /input

3. Put the test file into /input with an hdfs command 
   hadoop fs -put test /input

4. Run the wordcount example that ships with hadoop
   hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output

5. Check the output
   hadoop fs -ls /output
   hadoop fs -cat /output/part-r-00000
TestCase

PS: these directories all live on HDFS, so you will not find them on the local disk. Also, the output directory must not exist before the job runs. If the test file changes and the job has to be re-run, delete both /input and /output and recreate them (or use two fresh directories). And if you ever need to run hadoop namenode -format again, be sure to delete the old log and temporary files first.

Delete a directory
hadoop fs -rmr [/targetDir]        (on 2.x, hadoop fs -rm -r is the non-deprecated form)

List the files in a directory
hadoop fs -ls [/targetDir]

Put a local file onto hdfs
hadoop fs -put localFile remoteFilePath
HDFS commands

IX. Run WordCount from IDEA connected to the hadoop cluster

See: https://www.cnblogs.com/HusterX/p/14162985.html
