Installing a Hadoop pseudo-cluster, and basic concepts
Overview
A pseudo-cluster here means installing Hadoop on a few machines without high availability or fault tolerance, which is good enough for a development environment.
First, download the Hadoop installation package. I am using the CDH release 5.14.0, which you can find at that URL.
Let us first look at how Hadoop's configuration files are categorized:
Hadoop configuration files come in two types.
One type is the read-only defaults: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
The other type is the site-specific files we edit ourselves: conf/core-site.xml, conf/hdfs-site.xml, conf/yarn-site.xml and conf/mapred-site.xml.
In addition, we can configure these two scripts: conf/hadoop-env.sh and yarn-env.sh.
Detailed configuration
Let us start with conf/hadoop-env.sh and yarn-env.sh.
Step one: set JAVA_HOME.
Step two (optional): administrators can tune each daemon individually through the following environment variables:
Daemon | Environment Variable |
---|---|
NameNode | HADOOP_NAMENODE_OPTS |
DataNode | HADOOP_DATANODE_OPTS |
Secondary NameNode | HADOOP_SECONDARYNAMENODE_OPTS |
ResourceManager | YARN_RESOURCEMANAGER_OPTS |
NodeManager | YARN_NODEMANAGER_OPTS |
WebAppProxy | YARN_PROXYSERVER_OPTS |
Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_OPTS |
For example, to enable parallelGC on the NameNode:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
Some other useful settings (a sketch follows the table below):
HADOOP_LOG_DIR / YARN_LOG_DIR - where the daemon log files are stored; they are created automatically if they do not exist.
HADOOP_HEAPSIZE / YARN_HEAPSIZE - the maximum heap size, in MB. Individual daemons can override it with the variables below.
Daemon | Environment Variable |
---|---|
ResourceManager | YARN_RESOURCEMANAGER_HEAPSIZE |
NodeManager | YARN_NODEMANAGER_HEAPSIZE |
WebAppProxy | YARN_PROXYSERVER_HEAPSIZE |
Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_HEAPSIZE |
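As a rough sketch, the additions to conf/hadoop-env.sh and conf/yarn-env.sh might look like this; the paths and sizes are illustrative assumptions, not required values:
# hadoop-env.sh / yarn-env.sh (illustrative values)
export HADOOP_LOG_DIR=/var/log/hadoop          # daemon logs; created automatically if missing
export YARN_LOG_DIR=/var/log/hadoop-yarn
export HADOOP_HEAPSIZE=1000                    # default max heap for Hadoop daemons, in MB
export YARN_RESOURCEMANAGER_HEAPSIZE=1000      # per-daemon override for the ResourceManager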
Now we can move on to the detailed configuration of the daemons themselves:
conf/core-site.xml
Parameter | Value | Notes |
---|---|---|
fs.defaultFS | NameNode URI | hdfs://host:port/ |
io.file.buffer.size | 131072 | Size of read/write buffer used in SequenceFiles. |
fs.defaultFS: the URI through which clients access the distributed file system.
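Once you have written your core-site.xml (see the basic configuration further below), you can check which value a client actually resolves; a quick sanity check, assuming the hadoop binaries are on your PATH:
hdfs getconf -confKey fs.defaultFS    # prints the NameNode URI, e.g. hdfs://node-master:9000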
conf/hdfs-site.xml
- Configurations for NameNode:
Parameter | Value | Notes |
---|---|---|
dfs.namenode.name.dir | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
- Configurations for DataNode:
Parameter | Value | Notes |
---|---|---|
dfs.datanode.data.dir | Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |
conf/yarn-site.xml
- Configurations for ResourceManager and NodeManager:
Parameter | Value | Notes |
---|---|---|
yarn.acl.enable | true / false | Enable ACLs? Defaults to false. |
yarn.admin.acl | Admin ACL | ACL to set admins on the cluster. ACLs are of the form comma-separated-users space comma-separated-groups. Defaults to the special value of *, which means anyone. The special value of just space means no one has access. |
yarn.log-aggregation-enable | false | Configuration to enable or disable log aggregation. |
- Configurations for ResourceManager:
Parameter | Value | Notes |
---|---|---|
yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.webapp.address | ResourceManager web-ui host:port. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.hostname | ResourceManager host. | host. Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components. |
yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler |
yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the Resource Manager. | In MBs |
yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the Resource Manager. | In MBs |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers. |
- Configurations for NodeManager:
Parameter | Value | Notes |
---|---|---|
yarn.nodemanager.resource.memory-mb | Resource i.e. available physical memory, in MB, for given NodeManager | Defines total available resources on the NodeManager to be made available to running containers |
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which virtual memory usage of tasks may exceed physical memory | The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio. |
yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled. |
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for Map Reduce applications. |
- Configurations for History Server (Needs to be moved elsewhere):
Parameter | Value | Notes |
---|---|---|
yarn.log-aggregation.retain-seconds | -1 | How long to keep aggregation logs before deleting them. -1 disables. Be careful, set this too small and you will spam the name node. |
yarn.log-aggregation.retain-check-interval-seconds | -1 | Time between checks for aggregated log retention. If set to 0 or a negative value then the value is computed as one-tenth of the aggregated log retention time. Be careful, set this too small and you will spam the name node. |
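If log aggregation is enabled, the aggregated logs of a finished application can be fetched from HDFS with the yarn CLI; the application id below is only a placeholder:
yarn logs -applicationId application_1234567890123_0001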
conf/mapred-site.xml
- Configurations for MapReduce Applications:
Parameter | Value | Notes |
---|---|---|
mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN. |
mapreduce.map.memory.mb | 1536 | Larger resource limit for maps. |
mapreduce.map.java.opts | -Xmx1024M | Larger heap-size for child jvms of maps. |
mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces. |
mapreduce.reduce.java.opts | -Xmx2560M | Larger heap-size for child jvms of reduces. |
mapreduce.task.io.sort.mb | 512 | Higher memory-limit while sorting data for efficiency. |
mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files. |
mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from very large number of maps. |
- Configurations for MapReduce JobHistory Server:
Parameter | Value | Notes |
---|---|---|
mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020. |
mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server Web UI host:port | Default port is 19888. |
mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs. |
mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server. |
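If you run the JobHistory Server, its history directories need to exist in HDFS with suitable permissions; a minimal sketch, assuming the /mr-history paths from the table above and that HDFS is already up:
hdfs dfs -mkdir -p /mr-history/tmp /mr-history/done
hdfs dfs -chmod 1777 /mr-history/tmp           # world-writable with sticky bit so jobs can drop history files
mr-jobhistory-daemon.sh start historyserver    # Hadoop 2.x helper script; web UI on port 19888 by default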
Basic configuration
My job is to present the options; how you set them is up to you. Below is a basic configuration that yields a working environment:
# hadoop-env.sh: set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
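To confirm the path is correct, ask that JVM for its version (a quick sanity check; adjust the path if your JDK is installed elsewhere):
$JAVA_HOME/bin/java -version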
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node-master:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
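Before formatting HDFS, it does not hurt to create the local directories referenced above and make sure the user that runs Hadoop can write to them (a small precaution; the daemons can usually create missing directories themselves):
mkdir -p /home/hadoop/data/nameNode /home/hadoop/data/dataNode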
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>256</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
slaves
node01
node02
node03
Explanation
slaves
We need to explain what this configuration file is for:
Typically, you choose one machine in the cluster as the NameNode and one machine as the ResourceManager. The remaining machines act as both DataNode and NodeManager and are called slaves. This file lists the hostnames of the nodes that will run the DataNode and NodeManager daemons.
Memory
Next, the memory-related settings in yarn-site.xml and mapred-site.xml:
Two types of processes run on a YARN cluster:
The ApplicationMaster (AM), which monitors the application and coordinates its distributed executors across the cluster.
The executors, which are created by the AM and actually run the job. For MapReduce jobs they perform the map and reduce operations in parallel. Note that YARN can run much more than MapReduce; in the end, a MapReduce program is just one kind of application that runs on a YARN cluster.
Both run inside containers on the slave nodes. Each slave node runs a NodeManager daemon that is responsible for creating containers on that node. The whole cluster is managed by the ResourceManager, which schedules container allocation across all slave nodes according to capacity requirements and current load.
All of this is much clearer in a picture:
The official explanation is as follows:
- How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node. This value is configured in yarn-site.xml with yarn.nodemanager.resource.memory-mb.
- How much memory a single container can consume and the minimum memory allocation allowed. A container will never be bigger than the maximum, or else allocation will fail, and will always be allocated as a multiple of the minimum amount of RAM. Those values are configured in yarn-site.xml with yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb.
- How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit in the container maximum size. This is configured in mapred-site.xml with yarn.app.mapreduce.am.resource.mb.
- How much memory will be allocated to each map or reduce operation. This should be less than the maximum size. This is configured in mapred-site.xml with the properties mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
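To see how the sample values above fit together: each NodeManager offers 1536 MB (yarn.nodemanager.resource.memory-mb), containers are granted between 128 MB and 1536 MB, the ApplicationMaster requests 512 MB (yarn.app.mapreduce.am.resource.mb), and each map or reduce task requests 256 MB. One node can therefore hold the AM plus four tasks (512 + 4 × 256 = 1536 MB), and every request is a multiple of the 128 MB minimum.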
Running the cluster
Now we can start the cluster. Like a new disk in an operating system, HDFS has to be formatted before it is used for the first time:
hdfs namenode -format
Start HDFS:
start-dfs.sh
Start YARN:
start-yarn.sh
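At this point you can check that all the daemons came up and run a small test job; a sketch, assuming the examples jar sits under $HADOOP_HOME/share/hadoop/mapreduce (the exact path differs between Apache and CDH packaging):
jps                    # should list NameNode, DataNode, ResourceManager and NodeManager
hdfs dfsadmin -report  # live DataNodes and their capacity
yarn node -list        # NodeManagers registered with the ResourceManager
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 4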