Installing Hadoop 3.0 and an Introduction to Its New Features
Apache Hadoop 3.0.0 incorporates a number of significant enhancements over the previous major release line (hadoop-2.x).
Cluster layout:
192.168.18.160 cdh1
192.168.18.161 cdh2
192.168.18.162 cdh3
192.168.18.163 cdh4
These mappings should be present in /etc/hosts on every node; the configuration below refers to the hosts as cdh1 through cdh4.
1. Java 8 is required
All Hadoop JARs are now compiled against a Java 8 runtime. Users still running Java 7 or earlier must upgrade to Java 8.
2. HDFS support for erasure coding
Compared with replication, erasure coding is a more space-efficient way of durably storing data. A standard encoding such as Reed-Solomon(10,4) has a 1.4x space overhead, whereas standard 3x HDFS replication has a 3x overhead. Because erasure coding imposes extra overhead during reconstruction and mostly performs remote reads, it has traditionally been used for cold data, i.e. data that is rarely accessed. Users should weigh the network and CPU overhead of erasure coding when deploying this feature.
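To make the cost comparison concrete, here is a quick back-of-the-envelope sketch (the hdfs ec commands in the comments are the Hadoop 3 CLI for managing policies; /cold-data is a hypothetical path):

```shell
# Quick arithmetic behind the overhead claim above.
# RS(10,4): every 10 data blocks are stored together with 4 parity blocks.
data=10
parity=4
ec_pct=$(( (data + parity) * 100 / data ))   # storage as a % of logical size
rep_pct=$(( 3 * 100 ))                       # 3x replication
echo "EC RS(10,4) stores ${ec_pct}% of logical size (1.4x)"
echo "3x replication stores ${rep_pct}% of logical size (3x)"

# On a live cluster, erasure coding policies are managed with the
# hdfs ec subcommand, e.g. (not run here):
#   hdfs ec -listPolicies
#   hdfs ec -setPolicy -path /cold-data -policy RS-10-4-1024k
```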
3. MapReduce task-level native optimization
MapReduce adds a native implementation of the map output collector. For shuffle-intensive jobs, this can improve performance by 30% or more. See MAPREDUCE-2841 for details.
4. Support for more than two NameNodes
The original HDFS NameNode high-availability implementation provided a single active NameNode and a single standby NameNode. By replicating the edit log to three JournalNodes, that architecture can tolerate the failure of any one node. Some deployments, however, require a higher degree of fault tolerance. This feature enables that by allowing users to run multiple standby NameNodes. For example, with three NameNodes and five JournalNodes configured, the cluster can tolerate the failure of two nodes rather than just one. The HDFS high-availability documentation has been updated with instructions on how to configure more than two NameNodes.
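As a sketch of what that configuration might look like, the relevant hdfs-site.xml fragment could be (nameservice name mycluster and hosts nn1-nn3, jn1-jn5 are placeholders, not values from this cluster):

```xml
<!-- Sketch only: nameservice and host names below are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485;jn4:8485;jn5:8485/mycluster</value>
</property>
```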
5. Default ports of multiple services changed
Previously, the default ports of several Hadoop services fell within the Linux ephemeral port range (32768-61000), which meant a service could fail to start because of a port conflict with another application. These conflict-prone ports have now been moved out of the ephemeral range; the change affects the NameNode, Secondary NameNode, DataNode, and KMS. The documentation has been updated accordingly; see HDFS-9427 and HADOOP-12811 for details.
6. Intra-DataNode balancer
A single DataNode manages multiple disks, and during normal write operations the disks fill up evenly. However, adding or replacing a disk can leave the data across that DataNode's disks severely skewed. The existing HDFS balancer cannot handle this situation; it is addressed by the new intra-DataNode balancing feature, invoked via the hdfs diskbalancer CLI. See the HDFS Commands Guide for details.
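A typical workflow might look like the following sketch; it requires a running cluster with dfs.disk.balancer.enabled set to true in hdfs-site.xml, cdh2 stands in for the DataNode being rebalanced, and the plan file path is illustrative:

```shell
# Compute a move plan for the DataNode (written to HDFS as a JSON file).
hdfs diskbalancer -plan cdh2
# Execute the generated plan (path below is an example of where plans land).
hdfs diskbalancer -execute /system/diskbalancer/<timestamp>/cdh2.plan.json
# Check progress of the rebalancing on that node.
hdfs diskbalancer -query cdh2
```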
7. Reworked daemon and task heap management
A series of changes have been made to heap management for Hadoop daemons and MapReduce tasks.
HADOOP-10950 introduces new methods for configuring daemon heap sizes. Notably, auto-tuning based on the memory size of the host is now possible, and the HADOOP_HEAPSIZE variable has been deprecated.
MAPREDUCE-5785 simplifies the configuration of map and reduce task heap sizes: the desired heap size no longer needs to be specified in both the task configuration and as a Java option. Existing configurations that already specify both are not affected by this change.
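As a sketch, the new hadoop-env.sh variables that replace the deprecated HADOOP_HEAPSIZE look like this (the sizes are illustrative, not recommendations):

```shell
# hadoop-env.sh -- illustrative values; tune per host.
# These replace the deprecated HADOOP_HEAPSIZE and accept unit suffixes.
export HADOOP_HEAPSIZE_MAX=4g
export HADOOP_HEAPSIZE_MIN=1g
```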
8. HDFS Router-Based Federation
HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of multiple HDFS namespaces. It is similar to the existing ViewFs and HDFS Federation features, except that the mount table is maintained server-side by the routing layer rather than on the client. This simplifies access to a federated cluster for existing HDFS clients. See HDFS-10467 for details.
9. YARN resource types
The YARN resource model has been generalized to support user-defined countable resource types beyond just CPU and memory. For example, a cluster administrator can define resources such as GPUs, software licenses, or locally-attached storage, and YARN tasks can then be scheduled based on the availability of these resources. See YARN-3926 for details.
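As a sketch, a custom resource could be declared in resource-types.xml like this (yarn.io/gpu is only a sample resource name; note that the job log later in this post shows the harmless "resource-types.xml not found" message when no such file exists):

```xml
<!-- resource-types.xml -- sketch; yarn.io/gpu is a sample resource name. -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
```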
10. API-based configuration of Capacity Scheduler queues
OrgQueue extends the Capacity Scheduler with a programmatic way to change queue configurations via a REST API. This enables automation of queue configuration management by administrators in the queue's administer_queue ACL. See YARN-5734 for details.
Installation:
1. Disable the firewall
service iptables stop
On systemd-based distributions such as CentOS 7, the equivalent is systemctl stop firewalld.
2. Set up passwordless SSH
ssh-keygen -t rsa (there are plenty of guides for this online, so it is not covered in detail here)
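For completeness, a minimal sketch of the key setup and distribution, assuming it is run as the same user on cdh1 and that the other nodes are reachable:

```shell
# Generate a key pair without a passphrase (run on cdh1).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Push the public key to each slave node.
for host in cdh2 cdh3 cdh4; do
  ssh-copy-id "$host"
done
```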
3. Unpack Hadoop
[elk@cdh1 ~]$ tar -zxvf hadoop-3.0.0.tar.gz
4. Hadoop configuration
The files to configure for Hadoop 3.0 are core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, hadoop-env.sh, and workers.
Edit core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://cdh1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:///opt/hadoop3/tmp</value>
</property>
</configuration>
Edit hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop3/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop3/hdfs/data</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>cdh2:9001</value>
</property>
</configuration>
List the slave nodes in the workers file, one hostname per line:
cdh2
cdh3
cdh4
mapred-site.xml configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>
/opt/hadoop-3.0.0/etc/hadoop,
/opt/hadoop-3.0.0/share/hadoop/common/*,
/opt/hadoop-3.0.0/share/hadoop/common/lib/*,
/opt/hadoop-3.0.0/share/hadoop/hdfs/*,
/opt/hadoop-3.0.0/share/hadoop/hdfs/lib/*,
/opt/hadoop-3.0.0/share/hadoop/mapreduce/*,
/opt/hadoop-3.0.0/share/hadoop/mapreduce/lib/*,
/opt/hadoop-3.0.0/share/hadoop/yarn/*,
/opt/hadoop-3.0.0/share/hadoop/yarn/lib/*
</value>
</property>
</configuration>
The mapreduce.application.classpath property above was not configured initially, which caused MapReduce jobs to fail with:
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
yarn-site.xml configuration:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>cdh1:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>cdh1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>cdh1:8040</value>
</property>
</configuration>
Configure JAVA_HOME in hadoop-env.sh:
export JAVA_HOME=/opt/jdk1.8.0_111
Format the NameNode:
[elk@cdh1 bin]$ hdfs namenode -format
If the output contains a line saying the storage directory "has been successfully formatted", the format succeeded.
Start HDFS and YARN (in Hadoop 3, start-all.sh is deprecated in favor of start-dfs.sh and start-yarn.sh, but it still works):
[elk@cdh1 sbin]$ ./start-all.sh
Test it. The commands are essentially the same as in Hadoop 2:
[elk@cdh1 sbin]$ hadoop fs -ls /
[elk@cdh1 sbin]$ hadoop fs -mkdir /user
[elk@cdh1 sbin]$ hadoop fs -ls /
drwxr-xr-x - elk supergroup 0 2017-12-26 23:24 /user
Running a MapReduce job initially failed:
[2017-12-26 23:36:47.058]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
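Following that hint, the concrete properties added to mapred-site.xml look like this, using the /opt/hadoop-3.0.0 install path from earlier:

```xml
<!-- Added to mapred-site.xml; path matches the install location above. -->
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.0.0</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.0.0</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=/opt/hadoop-3.0.0</value>
</property>
```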
After adding the suggested properties to mapred-site.xml, the job runs successfully:
[elk@cdh1 mapreduce]$ hadoop jar hadoop-mapreduce-examples-3.0.0.jar wordcount /user/passwd /output
2017-12-26 23:43:58,173 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-12-26 23:43:59,210 INFO client.RMProxy: Connecting to ResourceManager at cdh1/192.168.18.160:8040
2017-12-26 23:43:59,817 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/elk/.staging/job_1514302988215_0002
2017-12-26 23:44:01,017 INFO input.FileInputFormat: Total input files to process : 1
2017-12-26 23:44:01,198 INFO mapreduce.JobSubmitter: number of splits:1
2017-12-26 23:44:01,238 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2017-12-26 23:44:01,387 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1514302988215_0002
2017-12-26 23:44:01,389 INFO mapreduce.JobSubmitter: Executing with tokens: []
2017-12-26 23:44:01,608 INFO conf.Configuration: resource-types.xml not found
2017-12-26 23:44:01,608 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2017-12-26 23:44:01,890 INFO impl.YarnClientImpl: Submitted application application_1514302988215_0002
2017-12-26 23:44:01,944 INFO mapreduce.Job: The url to track the job: http://cdh1:8088/proxy/application_1514302988215_0002/
2017-12-26 23:44:01,945 INFO mapreduce.Job: Running job: job_1514302988215_0002
2017-12-26 23:44:11,098 INFO mapreduce.Job: Job job_1514302988215_0002 running in uber mode : false
2017-12-26 23:44:11,101 INFO mapreduce.Job: map 0% reduce 0%
2017-12-26 23:44:19,223 INFO mapreduce.Job: map 100% reduce 0%
2017-12-26 23:44:25,269 INFO mapreduce.Job: map 100% reduce 100%
2017-12-26 23:44:25,290 INFO mapreduce.Job: Job job_1514302988215_0002 completed successfully
2017-12-26 23:44:25,468 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=1963
FILE: Number of bytes written=415199
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1758
HDFS: Number of bytes written=1741
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4962
Total time spent by all reduces in occupied slots (ms)=3408
Total time spent by all map tasks (ms)=4962
Total time spent by all reduce tasks (ms)=3408
Total vcore-milliseconds taken by all map tasks=4962
Total vcore-milliseconds taken by all reduce tasks=3408
Total megabyte-milliseconds taken by all map tasks=5081088
Total megabyte-milliseconds taken by all reduce tasks=3489792
Map-Reduce Framework
Map input records=35
Map output records=55
Map output bytes=1885
Map output materialized bytes=1963
Input split bytes=93
Combine input records=55
Combine output records=54
Reduce input groups=54
Reduce shuffle bytes=1963
Reduce input records=54
Reduce output records=54
Spilled Records=108
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=100
CPU time spent (ms)=2130
Physical memory (bytes) snapshot=523571200
Virtual memory (bytes) snapshot=5573931008
Total committed heap usage (bytes)=443023360
Peak Map Physical memory (bytes)=302100480
Peak Map Virtual memory (bytes)=2781454336
Peak Reduce Physical memory (bytes)=221470720
Peak Reduce Virtual memory (bytes)=2792476672
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1665
File Output Format Counters
Bytes Written=1741