Troubleshooting in production: NotServingRegionException (Region ... is not online) when writing data to HBase
阿新 • Published: 2018-11-15
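Today we ran into a problem in production: the CPU usage of one server stayed persistently high. Investigation showed that it was caused by one of our Java application processes, which kept logging the following exception while writing data to HBase: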
org.apache.hadoop.hbase.NotServingRegionException: Region iot_flow_cdr_201811,4379692584601-2101152593-20181115072326-355,1536703383699.82804f639798d0502dd64e6e47d75d84. is not online on shqz-ps-iot3-cdr-dn01,60020,1524812940505
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2921)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1053)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2096)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
    at java.lang.Thread.run(Thread.java:745)
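When a log is flooded with this exception, a useful first step is to find out which region and which RegionServer it names, for example to look the region up in the HBase web UI. HBase region names have the form `<table>,<start key>,<region id>.<encoded name>.` and server names the form `<host>,<port>,<start code>`, which matches the message above. The following is a minimal sketch, assuming the log keeps exactly the wording shown above; `parse_nsre` is a hypothetical helper name, not an HBase tool. It reads log text on stdin and prints the encoded region name and the server for each matching line.

```shell
# parse_nsre: hypothetical helper, not part of HBase.
# Reads log text on stdin. For every line containing
#   "Region <region name> is not online on <server name>"
# it prints "<encoded region name> <server name>", where the encoded name is
# the 32-hex-digit suffix between the last two dots of the region name.
# Lines that do not match are ignored.
parse_nsre() {
  sed -n 's/.*Region [^ ]*\.\([0-9a-f]\{32\}\)\. is not online on \([^ ]*\).*/\1 \2/p'
}
```

For instance, `parse_nsre < app.log | sort | uniq -c` (where `app.log` stands for the application's log file) would show which regions fail most often and on which RegionServer.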
The troubleshooting went as follows:
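- Check whether HBase is receiving too many requests: look at the request count of each RegionServer in the HBase web UI, as shown in the figure below.
  As the figure shows, Requests Per Second was not high, so this cause was ruled out.
- Check whether the table iot_flow_cdr_201811 is healthy: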
(1) Check the table for consistency problems
hbase hbck -details iot_flow_cdr_201811
hbck did report inconsistencies:
8 inconsistencies detected
(2) Try to repair the inconsistencies
hbase hbck -repair iot_flow_cdr_201811
Running this command failed with the following error:
18/11/15 11:28:15 WARN util.HBaseFsck: Got AccessDeniedException when preCheckPermission
org.apache.hadoop.hbase.security.AccessDeniedException: Permission denied: action=WRITE path=hdfs://nameservice1/hbase/.hbase-snapshot user=root
    at org.apache.hadoop.hbase.util.FSUtils.checkAccess(FSUtils.java:1797)
    at org.apache.hadoop.hbase.util.HBaseFsck.preCheckPermission(HBaseFsck.java:1932)
    at org.apache.hadoop.hbase.util.HBaseFsck.exec(HBaseFsck.java:4734)
    at org.apache.hadoop.hbase.util.HBaseFsck$HBaseFsckTool.run(HBaseFsck.java:4562)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:4550)
Current user root does not have write perms to hdfs://nameservice1/hbase/.hbase-snapshot. Please rerun hbck as hdfs user hbase
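The message makes the cause clear: Permission denied. The root user has no write permission on hdfs://nameservice1/hbase/.hbase-snapshot, and hbck asks to be rerun as the HDFS user hbase.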
So we reran the command as the hbase user:
sudo -u hbase hbase hbck -repair iot_flow_cdr_201811
This time it succeeded. Once the command had finished, the inconsistencies in the table were repaired.
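Finally, we restarted the application and watched its logs: the program ran normally, the NotServingRegionException no longer appeared, and the server's CPU usage returned to normal.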
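The two hbck steps above can be scripted so that a repair is only proposed when the check actually finds inconsistencies, and so that it never runs as root again. The following is a minimal sketch with two hypothetical helpers (not part of HBase): `hbck_inconsistencies` extracts the count from the `N inconsistencies detected` summary line seen earlier, and `hbck_repair_cmd` prints the repair command, prefixed with `sudo -u hbase` unless the caller is already the `hbase` user. It assumes the HBase superuser is named `hbase`, as the hbck error message indicates, and it deliberately prints the repair command instead of running it so that it can be reviewed first.

```shell
# Hypothetical helpers, not part of HBase; names are illustrative only.

# hbck_inconsistencies: read `hbase hbck` output on stdin and print the
# number from the "N inconsistencies detected" summary line
# (a trailing period is tolerated). Prints nothing if the line is absent.
hbck_inconsistencies() {
  sed -n 's/^[[:space:]]*\([0-9][0-9]*\) inconsistencies detected.*/\1/p'
}

# hbck_repair_cmd: print (not run) the repair command for table $2.
# $1 is the current user name. Unless it is already "hbase" (assumed to be
# the HBase superuser), the command is prefixed with "sudo -u hbase" so that
# hbck does not fail with the AccessDeniedException seen when run as root.
hbck_repair_cmd() {
  if [ "$1" = "hbase" ]; then
    echo "hbase hbck -repair $2"
  else
    echo "sudo -u hbase hbase hbck -repair $2"
  fi
}

# Example (prints the repair command only when inconsistencies were found):
#   n=$(hbase hbck -details iot_flow_cdr_201811 2>&1 | hbck_inconsistencies)
#   if [ -n "$n" ] && [ "$n" -gt 0 ]; then
#     hbck_repair_cmd "$(id -un)" iot_flow_cdr_201811
#   fi
```

An empty result from `hbck_inconsistencies` means the summary line was not found (for example, because hbck itself failed), which is why the example checks for it separately instead of treating it as zero.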