1. 程式人生 > >Phoenix導致HBase叢集region offline故障解決

Phoenix導致HBase叢集region offline故障解決

1. 故障描述

最近我們的測試HBase叢集經常因為機房網路維護而崩潰(服務宕機),在叢集崩潰以後重啟HBase叢集發現有大部分region沒有online的情況,導致查詢表資料的時候會出現如下異常:

org.apache.hadoop.hbase.NotServingRegionException:...

我們HBase叢集環境大概如下所示:

192.168.27.230 HMaster HRegionServer
192.168.27.231 HRegionServer
192.168.27.232 HRegionServer

即3臺機器 1個HMaster 3個RegionSever的形式

我們還使用了Phoenix來建立HBase表,並建立了二級索引表,加上Phoenix自帶的系統表,整個HBase叢集一共有8個表。

檢視regionserver上面的log,發現大量如下異常資訊:

2016-05-26 14:57:46,928 INFO org.apache.hadoop.hbase.client.AsyncProcess: #5, table=C_PICRECORD_IDX_COLLISION, attempt=21/350 failed=2ops, last exception: org.apache.hadoop.hbase.exceptions.RegionOpeningException: org.apache.hadoop.hbase.exceptions.RegionOpeningException: Region C_PICRECORD_IDX_COLLISION,\x05\x00\x00\x00\x00\x00,1463624374612.2fc54757a3fea6f3f32526382947e8e6. is opening on cbds2,60020,1464245619847
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2788) at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:887) at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1858) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService
$2.callBlockingMethod(ClientProtos.java:31451) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2035) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) on cbds2,60020,1464230060200, tracking started null, retrying after=20179ms, replay=2ops 2016-05-26 14:57:47,110 INFO org.apache.hadoop.hbase.client.AsyncProcess: #4, table=C_PICRECORD_IDX, attempt=21/350 failed=2ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region C_PICRECORD_IDX,\x06\x00\x00\x00\x00,1463621907697.3ae160acd7f317598908c7a778f7e8b8. is not online on cbds2,60020,1464245619847 at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2791) at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:887) at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1858) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31451) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2035) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) on cbds2,60020,1464230060200, tracking started null, retrying after=20131ms, replay=2ops 2016-05-26 14:57:47,233 INFO org.apache.hadoop.hbase.client.AsyncProcess: #4, table=C_PICRECORD_IDX, attempt=21/350 failed=2ops, last exception: org.apache.hadoop.hbase.exceptions.RegionOpeningException: org.apache.hadoop.hbase.exceptions.RegionOpeningException: Region C_PICRECORD_IDX,\x07\x00\x00\x00\x00,1463621907697.2ef56bd25a9b77a920a314b9b0308c01. is opening on cbds1,60020,1464245619628 at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2788) at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:887) at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1858) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31451) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2035) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107) at java.lang.Thread.run(Thread.java:745) on cbds1,60020,1464230060318, tracking started null, retrying after=20153ms, replay=2ops

2. 嘗試解決

根據網上的做法使用hbase hbck命令檢查是否存在資料不一致的情況導致region沒能online。

$ sudo -u hbase hbase hbck

從輸出資訊看到狀態確實為不一致(INCONSISTENT),並且在輸出日誌裡面還有說明出現了region空洞(region hole)

Number of empty REGIONINFO_QUALIFIER rows in hbase:meta: 0
16/05/20 17:44:24 INFO util.HBaseFsck: getHTableDescriptors == tableNames => [SYSTEM.CATALOG, SYSTEM.SEQUENCE, C_PICRECORD_IDX, SYSTEM.STATS, SYSTEM.FUNCTION, hbase:namespace, C_PICRECORD_IDX_COLLISION, C_PICRECORD]
16/05/20 17:44:24 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x501820b0 connecting to ZooKeeper ensemble=cbds0:2181,cbds1:2181,cbds2:2181
16/05/20 17:44:24 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cbds0:2181,cbds1:2181,cbds2:2181 sessionTimeout=120000 watcher=hconnection-0x501820b00x0, quorum=cbds0:2181,cbds1:2181,cbds2:2181, baseZNode=/hbase
16/05/20 17:44:24 INFO zookeeper.ClientCnxn: Opening socket connection to server cbds0/192.168.27.230:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/20 17:44:24 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.230:45549, server: cbds0/192.168.27.230:2181
16/05/20 17:44:24 INFO zookeeper.ClientCnxn: Session establishment complete on server cbds0/192.168.27.230:2181, sessionid = 0x254cd859d7c0021, negotiated timeout = 120000
16/05/20 17:44:24 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
16/05/20 17:44:25 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
16/05/20 17:44:25 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x254cd859d7c0021
16/05/20 17:44:25 INFO zookeeper.ZooKeeper: Session: 0x254cd859d7c0021 closed
16/05/20 17:44:25 INFO zookeeper.ClientCnxn: EventThread shut down
Number of Tables: 8
16/05/20 17:44:26 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x45c68eef connecting to ZooKeeper ensemble=cbds0:2181,cbds1:2181,cbds2:2181
16/05/20 17:44:26 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cbds0:2181,cbds1:2181,cbds2:2181 sessionTimeout=120000 watcher=hconnection-0x45c68eef0x0, quorum=cbds0:2181,cbds1:2181,cbds2:2181, baseZNode=/hbase
16/05/20 17:44:26 INFO zookeeper.ClientCnxn: Opening socket connection to server cbds2/192.168.27.232:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/20 17:44:26 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.230:42234, server: cbds2/192.168.27.232:2181
16/05/20 17:44:26 INFO zookeeper.ClientCnxn: Session establishment complete on server cbds2/192.168.27.232:2181, sessionid = 0x354cd859d7e0026, negotiated timeout = 120000
16/05/20 17:44:26 INFO zookeeper.RecoverableZooKeeper: Process identifier=hbase Fsck connecting to ZooKeeper ensemble=cbds0:2181,cbds1:2181,cbds2:2181
16/05/20 17:44:26 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cbds0:2181,cbds1:2181,cbds2:2181 sessionTimeout=120000 watcher=hbase Fsck0x0, quorum=cbds0:2181,cbds1:2181,cbds2:2181, baseZNode=/hbase
16/05/20 17:44:26 INFO zookeeper.ClientCnxn: Opening socket connection to server cbds2/192.168.27.232:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/20 17:44:26 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.230:42235, server: cbds2/192.168.27.232:2181
16/05/20 17:44:26 INFO zookeeper.ClientCnxn: Session establishment complete on server cbds2/192.168.27.232:2181, sessionid = 0x354cd859d7e0027, negotiated timeout = 120000
16/05/20 17:44:26 INFO zookeeper.ZooKeeper: Session: 0x354cd859d7e0027 closed
16/05/20 17:44:26 INFO zookeeper.ClientCnxn: EventThread shut down
16/05/20 17:44:26 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x354cd859d7e0026
16/05/20 17:44:26 INFO zookeeper.ZooKeeper: Session: 0x354cd859d7e0026 closed
16/05/20 17:44:26 INFO zookeeper.ClientCnxn: EventThread shut down
ERROR: Region { meta => SYSTEM.SEQUENCE,~\x00\x00\x00,1463446230816.00205e8836e8b8fa31e135938942b681., hdfs => hdfs://cbds0:8020/hbase/data/default/SYSTEM.SEQUENCE/00205e8836e8b8fa31e135938942b681, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0C,1463484729784.02f5ac13925b451557f6ed3d0beac52e., hdfs => hdfs://cbds0:8020/hbase/data/default/C_PICRECORD/02f5ac13925b451557f6ed3d0beac52e, deployed => , replicaId => 0 } not deployed on any region server.

....

16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: There is a hole in the region chain between \x05\x00\x00\x00\x00\x00 and \x06\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: There is a hole in the region chain between \x07\x00\x00\x00\x00\x00 and \x09\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: There is a hole in the region chain between \x0B\x00\x00\x00\x00\x00 and \x0C\x00\x00\x00\x00\x00.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.

....

16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table SYSTEM.SEQUENCE
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table SYSTEM.FUNCTION
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table C_PICRECORD_IDX
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
16/05/20 17:44:27 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: Found inconsistency in table SYSTEM.STATS
16/05/20 17:44:28 INFO zookeeper.RecoverableZooKeeper: Process identifier=hbase Fsck connecting to ZooKeeper ensemble=cbds0:2181,cbds1:2181,cbds2:2181
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cbds0:2181,cbds1:2181,cbds2:2181 sessionTimeout=120000 watcher=hbase Fsck0x0, quorum=cbds0:2181,cbds1:2181,cbds2:2181, baseZNode=/hbase
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Opening socket connection to server cbds0/192.168.27.230:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.230:45672, server: cbds0/192.168.27.230:2181
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Session establishment complete on server cbds0/192.168.27.230:2181, sessionid = 0x254cd859d7c0022, negotiated timeout = 120000
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Session: 0x254cd859d7c0022 closed
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: EventThread shut down
16/05/20 17:44:28 INFO zookeeper.RecoverableZooKeeper: Process identifier=hbase Fsck connecting to ZooKeeper ensemble=cbds0:2181,cbds1:2181,cbds2:2181
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cbds0:2181,cbds1:2181,cbds2:2181 sessionTimeout=120000 watcher=hbase Fsck0x0, quorum=cbds0:2181,cbds1:2181,cbds2:2181, baseZNode=/hbase
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Opening socket connection to server cbds1/192.168.27.231:2181. Will not attempt to authenticate using SASL (unknown error)
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.230:44512, server: cbds1/192.168.27.231:2181
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: Session establishment complete on server cbds1/192.168.27.231:2181, sessionid = 0x154cd859feb002a, negotiated timeout = 120000
16/05/20 17:44:28 INFO zookeeper.ZooKeeper: Session: 0x154cd859feb002a closed
16/05/20 17:44:28 INFO zookeeper.ClientCnxn: EventThread shut down

Summary:
  hbase:meta is okay.
    Number of regions: 1
    Deployed on:  cbds2,60020,1463557954277
Table C_PICRECORD_IDX_COLLISION is inconsistent.
    Number of regions: 11
    Deployed on:  cbds0,60020,1463557954330 cbds1,60020,1463557953442 cbds2,60020,1463557954277
  SYSTEM.CATALOG is okay.
    Number of regions: 1
    Deployed on:  cbds2,60020,1463557954277
  C_PICRECORD is okay.
    Number of regions: 0
    Deployed on: 
  hbase:namespace is okay.
    Number of regions: 1
    Deployed on:  cbds0,60020,1463557954330
  SYSTEM.SEQUENCE is okay.
    Number of regions: 159
    Deployed on:  cbds0,60020,1463557954330 cbds1,60020,1463557953442 cbds2,60020,1463557954277
  SYSTEM.FUNCTION is okay.
    Number of regions: 1
    Deployed on:  cbds2,60020,1463557954277
Table C_PICRECORD_IDX is inconsistent.
    Number of regions: 11
    Deployed on:  cbds0,60020,1463557954330 cbds1,60020,1463557953442 cbds2,60020,1463557954277
  SYSTEM.STATS is okay.
    Number of regions: 1
    Deployed on:  cbds1,60020,1463557953442
199 inconsistencies detected.
Status: INCONSISTENT

2.1【嘗試一】

嘗試使用hbase hbck -fix等命令來進行修復,結果失敗。

2.2【嘗試二】

通過清空zookeeper上/hbase資訊並重啟hbase叢集以後,沒能解決問題,並且hbase master隔了一段時間就掛掉

報錯資訊如下所示:

Failed to become active master
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:98)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:897)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:739)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:169)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1479)
    at java.lang.Thread.run(Thread.java:745)

....

Unhandled exception. Starting shutdown.
java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:98)
    at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:897)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:739)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:169)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1479)
    at java.lang.Thread.run(Thread.java:745)

3. 解決方案

3.1 【解決方案一】

最終在google上找到以下資料,跟我們出現的情況類似:

照著連結上面的思路進行如下步驟:

1 ) 停掉hbase叢集
2)刪除hbase在hdfs目錄下所有表目錄下的recovered.edits

eg :

hadoop fs -rm -R /hbase/data/default/SYSTEM.CATALOG/*/recovered.edits
hadoop fs -rm -R /hbase/data/default/SYSTEM.FUNCTION/*/recovered.edits
hadoop fs -rm -R /hbase/data/default/SYSTEM.SEQUENCE/*/recovered.edits
hadoop fs -rm -R /hbase/data/default/SYSTEM.STATS/*/recovered.edits
hadoop fs -rm -R /hbase/data/default/SYSTEM.CATALOG/*/recovered.edits
hadoop fs -rm -R /hbase/data/default/C_PICRECORD_IDX_COLLISION/*/recovered.edits
hadoop fs -rm -R /hbase/data/default/C_PICRECORD_IDX/*/recovered.edits
hadoop fs -rm -R /hbase/data/default/C_PICRECORD/*/recovered.edits

3)重啟hbase叢集,所有的region就都online了

注意,這種通過刪除recovered.edits的方式來恢復叢集,會丟失部分資料。

3.2【解決方案二】

在Google上面有人遇到了在使用phoenix本地索引的時候重啟HBase集群后出現了跟我們類似的情況:

根據上面的思路,我們需要在叢集所有RegionServer的hbase-site.xml配置檔案裡面增加如下配置:

<property>   <name>hbase.regionserver.executor.openregion.threads</name> <value>100</value> 
</property>

然後重啟HBase叢集就可以了。

這種解決方案應該不會丟資料,推薦使用這種方式來恢復。