Error starting NodeManager-Service NodeManager failed in state INITED; cause: java.lang.Null
阿新 • • 發佈:2019-02-01
版本:CDH 5.2
1、異常日誌
2014-11-25 14:21:27,873 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Adding container_1415870304563_7717_01_000011 to recently stopped containers 2014-11-25 14:21:27,874 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) 2014-11-25 14:21:27,883 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-25 14:21:27,884 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) 2014-11-25 14:21:27,886 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system... 2014-11-25 14:21:27,886 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped. 2014-11-25 14:21:27,886 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete. 2014-11-25 14:21:27,886 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) 2014-11-25 14:21:27,889 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG: /************************************************************
2、分析:
levelDB document:
LevelDB never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). LevelDB recovery code uses checksums to
detect this and will skip the incomplete records.
Based on above information, if the incomplete record is the CONTAINER_REQUEST_KEY_SUFFIX record used to store container startRequest, this issue will happen. NM can't protect OS crash. This means we must add the error handling code to avoid NM shutdown due to NPE. This justify the patch.
3、解決方案 刪除/tmp/hadoop-yarn/ 後重新啟動NodeManager