1. 程式人生 > >Error starting NodeManager-Service NodeManager failed in state INITED; cause: java.lang.Null

Error starting NodeManager-Service NodeManager failed in state INITED; cause: java.lang.Null

版本:CDH 5.2

1、異常日誌

2014-11-25 14:21:27,873 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Adding container_1415870304563_7717_01_000011 to recently stopped containers
2014-11-25 14:21:27,874 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
2014-11-25 14:21:27,883 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2014-11-25 14:21:27,884 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
2014-11-25 14:21:27,886 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2014-11-25 14:21:27,886 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2014-11-25 14:21:27,886 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2014-11-25 14:21:27,886 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
        at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
2014-11-25 14:21:27,889 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************


2、分析:

levelDB document:
LevelDB never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). LevelDB recovery code uses checksums to detect this and will skip the incomplete records.

Based on above information, if the incomplete record is the CONTAINER_REQUEST_KEY_SUFFIX record used to store container startRequest, this issue will happen. NM can't protect OS crash. This means we must add the error handling code to avoid NM shutdown due to NPE. This justify the patch.

3、解決方案 刪除/tmp/hadoop-yarn/ 後重新啟動NodeManager
4、參考資料