
2. Installing HDFS and YARN


Download the Hadoop tarball
Set Hadoop environment variables
Set HDFS environment variables
Set YARN environment variables
Set MapReduce environment variables
Modify the Hadoop configuration
Configure core-site.xml
Configure hdfs-site.xml
Configure yarn-site.xml
Configure mapred-site.xml
Set up the slaves file
Distribute the configuration
Start HDFS
Format the NameNode
Start HDFS
Check that HDFS started
Start YARN
Test an MR job
Hadoop native libraries
HDFS, YARN, and MapReduce parameters

Download the Hadoop tarball

Download the hadoop-2.8.0 tarball from the Hadoop website onto hadoop1, then place it under /opt and extract it:

$ gunzip hadoop-2.8.0.tar.gz
$ tar -xvf hadoop-2.8.0.tar

Then change the ownership of the hadoop-2.8.0 directory so that both hdfs and yarn can read and write it:

# chown -R hdfs:hadoop /opt/hadoop-2.8.0

Set Hadoop environment variables

Edit /etc/profile:

export HADOOP_HOME=/opt/hadoop-2.8.0
export HADOOP_PREFIX=/opt/hadoop-2.8.0
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server
export PATH=${HADOOP_HOME}/bin:$PATH

Set HDFS environment variables

Edit /opt/hadoop-2.8.0/etc/hadoop/hadoop-env.sh

#export JAVA_HOME=/usr/local/java/jdk1.8.0_121
#export HADOOP_HOME=/opt/hadoop/hadoop-2.7.3
# Maximum heap size for Hadoop daemons (namenode/datanode/secondarynamenode, etc.); default 1000 MB
#export HADOOP_HEAPSIZE=
# Initial heap size for the namenode; defaults to the value above, allocate as needed
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""
# Extra JVM startup options; empty by default
#export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# Memory can also be configured per component:
#export HADOOP_NAMENODE_OPTS=
#export HADOOP_DATANODE_OPTS
#export HADOOP_SECONDARYNAMENODE_OPTS
# Hadoop log directory; default is under $HADOOP_HOME
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
export HADOOP_LOG_DIR=/var/log/hadoop/

Set these parameters according to your own capacity planning. Note that the NameNode's block map and namespace both live in its heap, so production clusters need a large heapsize.

Also watch the total memory used by all components: in production, leave 5-15% of memory (typically around 10 GB) for the Linux system itself.
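As a rough sketch of that sizing rule, hadoop-env.sh on a NameNode host might look like the following. The 64 GB machine and both numbers are hypothetical, not values from this article's cluster:

```shell
# Hypothetical sizing for a 64 GB NameNode host: leave ~10 GB to Linux,
# give the NameNode most of the rest, since the block map and namespace
# both live in its heap.
export HADOOP_HEAPSIZE=4096                   # MB; cap for the other daemons
export HADOOP_NAMENODE_INIT_HEAPSIZE="49152"  # MB; large initial NameNode heap
```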

Set YARN environment variables

Edit /opt/hadoop-2.8.0/etc/hadoop/yarn-env.sh

#export JAVA_HOME=/usr/local/java/jdk1.8.0_121
#JAVA_HEAP_MAX=-Xmx1000m
#YARN_HEAPSIZE=1000   # heap size for the YARN daemons
#export YARN_RESOURCEMANAGER_HEAPSIZE=1000 # heap size just for the ResourceManager
#export YARN_TIMELINESERVER_HEAPSIZE=1000  # heap size just for the TimelineServer (job history)
#export YARN_RESOURCEMANAGER_OPTS=     # JVM options just for the ResourceManager
#export YARN_NODEMANAGER_HEAPSIZE=1000 # heap size just for the NodeManager
#export YARN_NODEMANAGER_OPTS=         # JVM options just for the NodeManager
export YARN_LOG_DIR=/var/log/yarn # YARN log directory

Nothing else is changed here; in production, pay attention to the JVM parameters and log file locations.

Set MapReduce environment variables

Edit /opt/hadoop-2.8.0/etc/hadoop/mapred-env.sh

# export JAVA_HOME=/home/y/libexec/jdk1.6.0/
#export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=1000
#export HADOOP_MAPRED_ROOT_LOGGER=INFO,RFA
#export HADOOP_JOB_HISTORYSERVER_OPTS=
#export HADOOP_MAPRED_LOG_DIR="" # Where log files are stored.  $HADOOP_MAPRED_HOME/logs by default.
#export HADOOP_JHS_LOGGER=INFO,RFA # Hadoop JobSummary logger.
#export HADOOP_MAPRED_PID_DIR= # The pid files are stored. /tmp by default.
#export HADOOP_MAPRED_IDENT_STRING= #A string representing this instance of hadoop. $USER by default
#export HADOOP_MAPRED_NICENESS= #The scheduling priority for daemons. Defaults to 0.
export HADOOP_MAPRED_LOG_DIR=/var/log/yarn

Nothing else is changed here; in production, pay attention to the JVM parameters and log file locations.

Modify the Hadoop configuration

All of the settings below can be found in the official documentation: http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/ClusterSetup.html

Configure core-site.xml

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop1:9000</value>
                <description>HDFS NameNode address and port</description>
        </property>
        <property>
                <name>io.file.buffer.size</name>
                <value>131072</value>
        </property>
        <property>
                <name>fs.trash.interval</name>
                <value>1440</value>
                <description>Enable the HDFS trash; deleted files are kept for 1440 minutes</description>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/opt/hadoop-2.8.0/tmp</value>
                <description>Defaults to /tmp/hadoop-${user.name}; changed to a persistent directory</description>
        </property>
</configuration>

core-site.xml has many parameters, but modifying just these is enough to get started; see the official documentation for the rest.

Configure hdfs-site.xml

Only the following parameters are set here:

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
                <description>Number of replicas per block; 3 is recommended in production</description>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>/opt/hadoop-2.8.0/namenodedir</value>
                <description>Directory for NameNode metadata; put it on RAID in production</description>
        </property>
        <property>
                <name>dfs.blocksize</name>
                <value>134217728</value>
                <description>Block size, 128 MB; tune for your workload, and use a larger value if you have many large files</description>
        </property>
        <property>
                <name>dfs.namenode.handler.count</name>
                <value>100</value>
                <description>Number of RPC handler threads on the NameNode; use a larger value on big clusters</description>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>/opt/hadoop-2.8.0/datadir</value>
                <description>Directories where the DataNode stores data; in production list a path per disk, RAID is not recommended</description>
        </property>
</configuration>
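In production, `dfs.datanode.data.dir` takes a comma-separated list with one entry per physical disk. A sketch of what that looks like (the mount points below are hypothetical):

```xml
<property>
        <name>dfs.datanode.data.dir</name>
        <value>/data1/hdfs,/data2/hdfs,/data3/hdfs</value>
        <description>One directory per disk lets HDFS spread I/O without RAID</description>
</property>
```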

Configure yarn-site.xml

Only the following parameters are set here; see the official documentation for the rest.

<configuration>
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>hadoop1</value>
                <description>The ResourceManager node</description>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
                <description>Auxiliary service for the NodeManager</description>
        </property>
        <property>
                <name>yarn.scheduler.minimum-allocation-mb</name>
                <value>32</value>
                <description>Minimum size of each container, in MB</description>
        </property>
        <property>
                <name>yarn.scheduler.maximum-allocation-mb</name>
                <value>128</value>
                <description>Maximum size of each container, in MB</description>
        </property>
        <property>
                <name>yarn.nodemanager.resource.memory-mb</name>
                <value>1024</value>
                <description>Maximum memory allocated to the NodeManager, in MB</description>
        </property>
        <property>
                <name>yarn.nodemanager.local-dirs</name>
                <value>/home/yarn/nm-local-dir</value>
                <description>NodeManager local directory</description>
        </property>
        <property>
                <name>yarn.nodemanager.resource.cpu-vcores</name>
                <value>1</value>
                <description>CPUs available on each NodeManager machine. Defaults to -1, which tells YARN to auto-detect the CPU count, but the current version cannot detect it; this machine actually has 8</description>
        </property>
</configuration>
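As a quick sanity check on these numbers, a sketch using the values above: each NodeManager can host at most 8 maximum-size containers, or 32 minimum-size ones.

```shell
# Values taken from the yarn-site.xml above.
NM_MEM_MB=1024       # yarn.nodemanager.resource.memory-mb
MIN_ALLOC_MB=32      # yarn.scheduler.minimum-allocation-mb
MAX_ALLOC_MB=128     # yarn.scheduler.maximum-allocation-mb

# Containers that fit on one NodeManager at each extreme.
echo $(( NM_MEM_MB / MAX_ALLOC_MB ))   # → 8 containers of the maximum size
echo $(( NM_MEM_MB / MIN_ALLOC_MB ))   # → 32 containers of the minimum size
```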

In production, also set the following.

ResourceManager parameters:

yarn.resourcemanager.address: host:port for clients to submit jobs. If set, overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.scheduler.address: host:port for ApplicationMasters to talk to the Scheduler to obtain resources. If set, overrides yarn.resourcemanager.hostname.
yarn.resourcemanager.resource-tracker.address: host:port that NodeManagers report to. If set, overrides yarn.resourcemanager.hostname.
yarn.resourcemanager.admin.address: host:port for administrative commands. If set, overrides yarn.resourcemanager.hostname.
yarn.resourcemanager.webapp.address: ResourceManager web UI host:port; has a default value.
yarn.resourcemanager.hostname: the ResourceManager host. A single hostname that can be set in place of all the yarn.resourcemanager*address parameters, which then use their default ports.
yarn.resourcemanager.scheduler.class: the scheduler class; CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler.
yarn.scheduler.minimum-allocation-mb: minimum memory, in MB, allocated to each container request at the ResourceManager.
yarn.scheduler.maximum-allocation-mb: maximum memory, in MB, allocated to each container request at the ResourceManager.
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path: lists of permitted/excluded NodeManagers; use these files to control which nodes the ResourceManager manages.

NodeManager parameters:

yarn.nodemanager.resource.memory-mb: total physical memory, in MB, on the NodeManager made available to running containers.
yarn.nodemanager.vmem-pmem-ratio: the ratio by which a task's virtual memory usage may exceed its physical memory limit; tasks that exceed it are killed.
yarn.nodemanager.local-dirs: comma-separated list of local paths where intermediate MR data is written; multiple paths on different disks help spread I/O.
yarn.nodemanager.log-dirs: comma-separated list of local paths where task logs are written; multiple paths on different disks help spread I/O.
yarn.nodemanager.log.retain-seconds: default 10800; how long, in seconds, to retain log files on the NodeManager. Only applies when log aggregation is disabled.
yarn.nodemanager.remote-app-log-dir: default /logs; the HDFS directory where application logs are moved on application completion (needs appropriate permissions). Only applies when log aggregation is enabled.
yarn.nodemanager.aux-services: mapreduce_shuffle; the shuffle service required by MapReduce applications.
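The last two entries only matter once log aggregation is turned on. A minimal yarn-site.xml fragment to enable it (assuming you want the aggregated logs under /logs in HDFS):

```xml
<property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
        <description>Move container logs to HDFS when an application finishes</description>
</property>
<property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>/logs</value>
        <description>HDFS directory that receives the aggregated logs</description>
</property>
```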

Configure mapred-site.xml

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
                <description>Use YARN to manage MR jobs</description>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>hadoop2:10020</value>
                <description>Address of the JobHistory host</description>
        </property>
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>hadoop2:19888</value>
                <description>Address of the JobHistory web UI</description>
        </property>
        <property>
                <name>mapreduce.jobhistory.intermediate-done-dir</name>
                <value>/opt/hadoop/hadoop-2.8.0/mrHtmp</value>
                <description>Directory for monitoring data of in-progress MR jobs</description>
        </property>
        <property>
                <name>mapreduce.jobhistory.done-dir</name>
                <value>/opt/hadoop/hadoop-2.8.0/mrhHdone</value>
                <description>Directory for monitoring data of completed MR jobs</description>
        </property>
</configuration>

Set up the slaves file

List the worker nodes in /opt/hadoop-2.8.0/etc/hadoop/slaves:
hadoop3
hadoop4
hadoop5

Distribute the configuration

Copy /etc/profile and /opt/* to the other nodes:

$ scp [email protected]:/etc/profile /etc
$ scp -r [email protected]:/opt/* /opt/

It's recommended to compress first, then transfer.

Start HDFS

Format the NameNode

$HADOOP_HOME/bin/hdfs namenode -format

Start HDFS

As the hdfs user, run $HADOOP_HOME/sbin/start-dfs.sh to start the whole HDFS cluster, or start single daemons:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode # start a single namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode # start a single datanode

Startup logs are written under $HADOOP_HOME/logs; the log path can be changed in hadoop-env.sh.

Check that HDFS started

Open http://hadoop1:50070, or run hdfs dfs -mkdir /test as a quick check.

Start YARN

Start the ResourceManager on hadoop1 (as the yarn user):

yarn $ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Start the NodeManagers on hadoop3, hadoop4, and hadoop5:

yarn $ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

If the slaves file is set up and passwordless SSH is configured for the yarn user, you can instead run start-yarn.sh on any single node to start the whole cluster.
Then open the ResourceManager web page.
If startup fails, look for the cause in the logs under the YARN_LOG_DIR set in yarn-env.sh, and check the permissions on the directories YARN uses.

Test an MR job

[[email protected] hadoop-2.8.0]$ hdfs dfs -mkdir -p /user/hdfs/input
[[email protected] hadoop-2.8.0]$ hdfs dfs -put etc/hadoop/ /user/hdfs/input
[[email protected] hadoop-2.8.0]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep input output 'dfs[a-z.]+'
17/06/27 04:16:45 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hdfs/.staging/job_1498507021248_0003
java.io.IOException: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=128
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:279)

So it fails: the job requested 1536 MB and the "maximum memory" is 128 MB. 1536 MB is the default minimum resource an MR job requests. But is the maximum really 128 MB? The cluster clearly has 3 GB of resources. The message is misleading: the error is raised when a single container's maximum size cannot satisfy the minimum memory a task asks for, and what it reports is the container size limit, not the cluster's total memory. With the current configuration each container may use between 32 MB and 128 MB, while a map task by default requests at least 1024 MB.
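The scheduler's validation boils down to a single comparison. A sketch using the values from the error message above:

```shell
# Sketch of the scheduler's check: a single request larger than
# yarn.scheduler.maximum-allocation-mb is rejected outright, no matter
# how much total memory the cluster has.
MAX_ALLOC_MB=128     # yarn.scheduler.maximum-allocation-mb
REQUESTED_MB=1536    # what this MR job asked for
if [ "$REQUESTED_MB" -gt "$MAX_ALLOC_MB" ]; then
    echo "rejected: requestedMemory=$REQUESTED_MB, maxMemory=$MAX_ALLOC_MB"
fi
```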
Now, lower the minimum resources that map and reduce tasks request.
Modify mapred-site.xml, adding:

        <property>
                <name>mapreduce.map.memory.mb</name>
                <value>128</value>
                <description>Minimum memory a map task uses</description>
        </property>
        <property>
                <name>mapreduce.reduce.memory.mb</name>
                <value>128</value>
                <description>Minimum memory a reduce task uses</description>
        </property>
        <property>
                <name>yarn.app.mapreduce.am.resource.mb</name>
                <value>128</value>
                <description>Memory for the MR ApplicationMaster container</description>
        </property>

Run it again:

[[email protected] hadoop-2.8.0]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep input output 'dfs[a-z.]+'
17/06/27 05:04:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
……..
17/06/27 05:04:36 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hdfs/.staging/job_1498510574463_0006
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://hadoop1:9000/user/hdfs/grep-temp-2069706136
……..

The resource problem is solved, which confirms the theory above. But now there's a different error: a missing directory. I tested the same thing on 2.6.3 and 2.7.3 without hitting this problem, so I'll set it aside for now; I suspect the jar is at fault and will verify MR's usability another way later.

Hadoop native libraries

You may have noticed that every hdfs command prints:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

This happens because the native libraries cannot be loaded. Hadoop relies on native Linux libraries such as zlib for efficiency (you can list which natives are available with hadoop checknative -a).
For more on the native libraries, see my other article:

HDFS, YARN, and MapReduce parameters

I'll cover the more important parameters in a separate post.

Next up: configuring HDFS HA.



