[Hadoop] Hadoop Pitfalls I Have Hit Over the Years
1. DataNode fails to start
1.1 Cause
This problem is usually caused by formatting the NameNode two or more times: each format generates a new clusterID, while the DataNode keeps its old one. The jps command showed no DataNode running, so I checked the log under Hadoop's log directory (/opt/hadoop-2.7.2/logs/hadoop-xiaosi-datanode-Qunar.log); the log file name will differ on each machine:
2016-06-12 20:01:31,374 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /home/xiaosi/config/hadoop/tmp/dfs/data: namenode clusterID = CID-67134f3c-0dcd-4e29-a629-a823d6c04732; datanode clusterID = CID-cf2f0387-3b3b-4bd8-8b10-6f5baecccdcf
2016-06-12 20:01:31,375 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to localhost/127.0.0.1:9000. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:724)
2016-06-12 20:01:31,377 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to localhost/127.0.0.1:9000
2016-06-12 20:01:31,388 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
2016-06-12 20:01:33,389 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2016-06-12 20:01:33,391 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2016-06-12 20:01:33,392 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at Qunar/127.0.0.1
************************************************************/
2016-06-13 12:56:00,753 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
The word Incompatible ("not compatible") in the log shows that the DataNode's clusterID no longer matches the NameNode's, which is why the DataNode shut down.
1.2 Solution
Check the hdfs-site.xml configuration file under the Hadoop directory (/opt/hadoop-2.7.2/etc/hadoop/hdfs-site.xml):
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/xiaosi/config/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/xiaosi/config/hadoop/tmp/dfs/data</value>
</property>
</configuration>
As we can see, the DataNode and NameNode directories are not the defaults but custom paths. Following the configured path, go into the current directory under the DataNode's dfs.datanode.data.dir and edit the VERSION file there:
#Wed May 25 11:19:08 CST 2016
storageID=DS-92ce5ab0-115f-45ef-b7f1-cf6540cc8bfa
#clusterID=CID-cf2f0387-3b3b-4bd8-8b10-6f5baecccdcf
clusterID=CID-67134f3c-0dcd-4e29-a629-a823d6c04732
cTime=0
datanodeUuid=261d557d-4f5b-4006-9a64-39c544b6b962
storageType=DATA_NODE
layoutVersion=-56
Change the clusterID so that it matches the NameNode's clusterID shown in /opt/hadoop-2.7.2/logs/hadoop-xiaosi-datanode-Qunar.log.
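The edit can also be scripted. A minimal sketch of the fix (in a real setup the two VERSION files live under the dfs.namenode.name.dir and dfs.datanode.data.dir paths from hdfs-site.xml; temp files with the clusterIDs from the log above stand in here so the snippet runs standalone):

```shell
# Sketch: copy the NameNode's clusterID into the DataNode's VERSION file.
# Real paths would be e.g.
#   /home/xiaosi/config/hadoop/tmp/dfs/name/current/VERSION
#   /home/xiaosi/config/hadoop/tmp/dfs/data/current/VERSION
# Temp files stand in for them so this runs standalone.
NN_VERSION=$(mktemp)
DN_VERSION=$(mktemp)
printf 'clusterID=CID-67134f3c-0dcd-4e29-a629-a823d6c04732\n' > "$NN_VERSION"
printf 'storageType=DATA_NODE\nclusterID=CID-cf2f0387-3b3b-4bd8-8b10-6f5baecccdcf\n' > "$DN_VERSION"

# Read the NameNode's clusterID and rewrite the DataNode's to match.
CID=$(grep '^clusterID=' "$NN_VERSION" | cut -d= -f2)
sed -i "s/^clusterID=.*/clusterID=${CID}/" "$DN_VERSION"
grep '^clusterID=' "$DN_VERSION"
```

If the HDFS data is disposable (e.g. a test cluster), an alternative is to delete the contents of dfs.datanode.data.dir and let the DataNode pick up the new clusterID on restart.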
Finally, restart Hadoop:
[email protected]:/opt/hadoop-2.7.2/sbin$ ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: namenode running as process 3689. Stop it first.
localhost: starting datanode, logging to /opt/hadoop-2.7.2/logs/hadoop-xiaosi-datanode-Qunar.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 4131. Stop it first.
starting yarn daemons
resourcemanager running as process 7192. Stop it first.
localhost: nodemanager running as process 7331. Stop it first.
Check the final result:
[email protected]:/opt/hadoop-2.7.2/sbin$ jps
4131 SecondaryNameNode
7192 ResourceManager
7331 NodeManager
3689 NameNode
9409 Jps
8989 DataNode
7818 RunJar
As shown above, the DataNode is now running.
2. NameNode fails to start
2.1 Cause
2016-12-04 14:50:39,879 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/xiaosi/tmp/hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:327)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:215)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:585)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:645)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:812)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:796)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1493)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1559)
2.2 Solution
After finishing the configuration and before running Hadoop, initialize the HDFS filesystem by running the following command from the Hadoop installation directory:
./bin/hdfs namenode -format
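Two caveats are worth knowing here. First, the InconsistentFSStateException above can also simply mean the directory is missing or not writable by the Hadoop user, so it is worth checking that before reaching for a format; second, formatting generates a new clusterID, which can reintroduce the DataNode mismatch from section 1. A minimal pre-format check (a temp directory stands in for the /home/xiaosi/tmp/hadoop/dfs/name path from the error so the sketch runs anywhere):

```shell
# Sketch: verify the dfs.namenode.name.dir exists and is writable
# before running hdfs namenode -format. A temp dir stands in for the
# real path (/home/xiaosi/tmp/hadoop/dfs/name) so this runs standalone.
NAME_DIR=$(mktemp -d)/dfs/name
mkdir -p "$NAME_DIR"
if [ -d "$NAME_DIR" ] && [ -w "$NAME_DIR" ]; then
    echo "ok: $NAME_DIR is writable"
else
    # Typical fixes: create the directory, or chown it to the Hadoop user.
    echo "problem: $NAME_DIR missing or not writable"
fi
```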
3. NodeManager fails to start
3.1 Cause
Check the Hadoop log:
2017-01-23 14:28:53,279 FATAL org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Failed to initialize mapreduce.shuffle
java.lang.IllegalArgumentException: The ServiceName: mapreduce.shuffle set in yarn.nodemanager.aux-services is invalid.The valid service name should only contain a-zA-Z0-9_ and can not start with numbers
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:114)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:245)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:261)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:495)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
From the exception above we can see that the value mapreduce.shuffle configured for yarn.nodemanager.aux-services is invalid: a service name may only contain a-z, A-Z, 0-9 and _, and must not start with a digit. The original configuration:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
3.2 Solution
Modify the yarn-site.xml configuration file as follows:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Then restart the NodeManager.
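The rule quoted in the exception boils down to: letters, digits, and underscores only, and not starting with a digit. A rough shell approximation of that check, useful for eyeballing a value before restarting (the regex is my paraphrase of the message, not YARN's exact code):

```shell
# Rough approximation of YARN's aux-service name rule: only a-z, A-Z,
# 0-9 and _, and the name must not start with a digit.
check_service_name() {
    echo "$1" | grep -Eq '^[A-Za-z_][A-Za-z0-9_]*$' \
        && echo "valid: $1" \
        || echo "invalid: $1"
}
check_service_name mapreduce.shuffle   # prints "invalid: ..." ('.' not allowed)
check_service_name mapreduce_shuffle   # prints "valid: ..."
```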
4. Unhealthy Node: local-dirs are bad
4.1 Cause
While running a job, it hung at the following output and made no progress:
17/01/23 21:43:21 INFO mapreduce.Job: Running job: job_1485165672363_0004
1/1 local-dirs are bad: /tmp/hadoop-hduser/nm-local-dir;
1/1 log-dirs are bad: /usr/local/hadoop/logs/userlogs
4.2 Solution
The most common cause of local-dirs are bad is that disk usage on the node exceeded yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (default 90.0%).
Either free up disk space on the unhealthy node or raise the configured threshold:
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>98.5</value>
</property>
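Before raising the threshold, it is worth confirming that disk utilization really is the problem. A quick check of the filesystem holding the NodeManager local dir (the nm-local-dir above lives under /tmp; /tmp is queried directly here so the sketch runs anywhere; df --output is GNU coreutils):

```shell
# Sketch: report use% of the filesystem holding the NodeManager local
# dir; YARN marks local-dirs bad once this crosses the configured
# threshold (default 90.0%). /tmp stands in for
# /tmp/hadoop-hduser/nm-local-dir here.
USED=$(df --output=pcent /tmp | tail -1 | tr -dc '0-9')
echo "used: ${USED}%"
```

If the number is above the threshold, cleaning up the disk is the better fix; raising the threshold (as above) only buys time until the disk actually fills.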