1. 程式人生 > 其它 >redis叢集慢查詢引起的redis-clusterdown

redis叢集慢查詢引起的redis-clusterdown

https://blog.csdn.net/carlosfu/article/details/84752805

 一、現象

 

    1. 報警:我們的redis私有云有幾百個redis例項,對於每個例項的存活,都有定時任務會每隔5分鐘,重試多次ping命令去檢查它的存活性,比如多次都ping不通,這時候就會收到報警簡訊: xxx系統-例項(10.10.xx.xx:xxxx)-由執行中變為心跳停止。這時候我就會去線上看一下出現了哪些問題。

   

    2. 檢測程式碼:比較簡易,但是比較有效。

 

  1.   public boolean isRun(final String ip, final int port) {
  2.   boolean isRun = new IdempotentConfirmer() {
  3.   private int timeOutFactor = 1;
  4.    
  5.   @Override
  6.   public boolean execute() {
  7.   Jedis jedis = new Jedis(ip, port);
  8.   try {
  9.   jedis.getClient().setConnectionTimeout(Protocol.DEFAULT_TIMEOUT * (timeOutFactor++));
  10.   jedis.getClient().setSoTimeout(Protocol.DEFAULT_TIMEOUT * (timeOutFactor++));
  11.   String pong = jedis.ping();
  12.   return pong != null && pong.equalsIgnoreCase("PONG");
  13.   } catch (JedisDataException e) {
  14.   String message = e.getMessage();
  15.   logger.warn(e.getMessage());
  16.   if (StringUtils.isNotBlank(message) && message.startsWith("LOADING")) {
  17.   return true;
  18.   }
  19.   return false;
  20.   } catch (Exception e) {
  21.   logger.warn("{}:{} error message is {} ", ip, port, e.getMessage());
  22.   return false;
  23.   } finally {
  24.   jedis.close();
  25.   }
  26.   }
  27.   }.run();
  28.   return isRun;
  29.   }
    一個簡易的冪等操作器:
  1.   package com.sohu.cache.util;
  2.    
  3.   import org.slf4j.Logger;
  4.   import org.slf4j.LoggerFactory;
  5.    
  6.   /**
  7.   * 冪等操作器
  8.   */
  9.   public abstract class IdempotentConfirmer {
  10.   private final Logger logger = LoggerFactory.getLogger(this.getClass());
  11.    
  12.   private int retry = 3;
  13.    
  14.   protected IdempotentConfirmer(int retry) {
  15.   this.retry = retry;
  16.   }
  17.    
  18.   public IdempotentConfirmer() {
  19.   }
  20.    
  21.   public abstract boolean execute();
  22.    
  23.   public boolean run() {
  24.   while (retry-- > 0) {
  25.   try {
  26.   boolean isOk = execute();
  27.   if (isOk){
  28.   return true;
  29.   }
  30.   } catch (Exception e) {
  31.   logger.error(e.getMessage(), e);
  32.   continue;
  33.   }
  34.   }
  35.   return false;
  36.   }
  37.   }

 

    

二、Redis-Cluster的拓撲結構

 

1. 拓撲圖:

 

 2. 拓撲表      

nodeid ip:port replication status
b6073bedf256d45e1dce97cd9242bb4789d52343 10.10.81.94:7497 master
efc1dfb19e120a2f41e0cec3d03d02f43e168238 10.10.79.150:6389

slave

792a356d3cd1474b8f1374aba1e5ceced1c6868d  10.10.81.95:7497

master

472823a5f04ad10027b3cde2672cc33f5accdf99 10.10.79.157:6390 slave
19544b078897d8999b68a41f2538a20cbf84288d 10.10.83.180:6382 master
74d5cda02571651bba185cbc1b10d5810d048ac3 10.10.81.96:6391 slave
b95759d39b6714544917e4aefa383b4de80f871c 10.10.81.97:7499 master
3379f2da275825c9ce4e476820254ccaf7b997d1 10.10.78.52:6398 slave
5711268a4881b723273e618894ba9276b27a297f 10.10.81.98:7496 master
5b2aa80014f168fd26c246104b5f6adf6426275e 10.10.78.53:6396 slave

 

 

三、日誌分析:

   

   1. [10:35:09] 10.10.81.97:7499(b95759d39b6714544917e4aefa383b4de80f871c)發現10.10.81.94:7497(b6073bedf256d45e1dce97cd9242bb4789d52343)不可達:

4051:M 26 Nov 10:35:09.768 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).

    

 

   2.10.10.81.97:7499(b95759d39b6714544917e4aefa383b4de80f871c)通過gossip協議廣播10.10.81.94:7497(b6073bedf256d45e1dce97cd9242bb4789d52343)失敗的資訊,其他例項收到了這個流言:

29177:S 26 Nov 10:35:10.939 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343

  

   3.投票:(在redisCluster中,只有Master會參與投票,因為從節點只是負責備份用),進入新的epoch

9611:M 26 Nov 10:35:12.068 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18

   

   redis原始碼:

  1.   // 主機已經是一個出錯節點了,自己作為從機可以升級為主機
  2.   void clusterHandleSlaveFailover(void) {
  3.   ......
  4.   // 故障修復超時,重新啟動故障修復
  5.   if (auth_age > auth_retry_time) { // 兩次故障修復間隔不能過短
  6.   // 更新一些時間
  7.   ......
  8.   redisLog(REDIS_WARNING,
  9.   "Start of election delayed for %lld milliseconds "
  10.   "(rank #%d, offset %lld).",
  11.   server.cluster->failover_auth_time - mstime(),
  12.   server.cluster->failover_auth_rank,
  13.   replicationGetSlaveOffset());
  14.   // 告知其他從機
  15.   /* Now that we have a scheduled election, broadcast our offset
  16.   * to all the other slaves so that they'll updated their offsets
  17.   * if our offset is better. */
  18.   clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
  19.   return;
  20.   }
  21.   ......
  22.   // 開頭投票
  23.   /* Ask for votes if needed. */
  24.   if (server.cluster->failover_auth_sent == 0) {
  25.   server.cluster->currentEpoch++;
  26.   server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
  27.   redisLog(REDIS_WARNING,"Starting a failover election for epoch %llu.",
  28.   (unsigned long long) server.cluster->currentEpoch);
  29.   clusterRequestFailoverAuth();
  30.   server.cluster->failover_auth_sent = 1;
  31.   clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
  32.   CLUSTER_TODO_UPDATE_STATE|
  33.   CLUSTER_TODO_FSYNC_CONFIG);
  34.   return; /* Wait for replies. */
  35.   }
  36.   }

     

 

   4. 投票結果: 

   (1) 10.10.79.150:6389是新的master, 成功執行了failover

  1.   3173:S 26 Nov 10:35:11.003 # Start of election delayed for 515 milliseconds (rank #0, offset 21516181352).
  2.   3173:S 26 Nov 10:35:11.606 # Starting a failover election for epoch 18.
  3.   3173:S 26 Nov 10:35:11.660 # Failover election won: I'm the new master.
  4.   3173:S 26 Nov 10:35:11.660 # configEpoch set to 18 after successful failover

   

   (2) 原來的master:10.10.81.94:7497(主) ,收到了判決書:節點失敗(fail), 授權10.10.79.150:6389為它的master, 修改自己的cluster nodes, 沒有了slave, cluster status ok, 開始主從複製...

  1.   3022:M 26 Nov 10:35:15.560 # Cluster state changed: fail
  2.   3022:M 26 Nov 10:35:15.562 # Failover auth denied to efc1dfb19e120a2f41e0cec3d03d02f43e168238: its master is up
  3.   3022:M 26 Nov 10:35:15.595 # Configuration change detected. Reconfiguring myself as a replica of efc1dfb19e120a2f41e0cec3d03d02f43e168238
  4.   3022:S 26 Nov 10:35:15.595 # Connection with slave 10.10.79.150:6389 lost.
  5.   3022:S 26 Nov 10:35:15.616 # Cluster state changed: ok
  6.   .................

   

   (3)新的主從通訊成功,主從關係確定

 

 5. 其他機器開始,認定新的slave

  1.   7925:S 26 Nov 10:36:20.879 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
  2.   7925:S 26 Nov 10:35:33.603 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

 

 6. 完整日誌 

    (1). 10.10.81.94:7497(主)

  1.   3022:M 26 Nov 10:35:15.560 # Cluster state changed: fail
  2.   3022:M 26 Nov 10:35:15.562 # Failover auth denied to efc1dfb19e120a2f41e0cec3d03d02f43e168238: its master is up
  3.   3022:M 26 Nov 10:35:15.595 # Configuration change detected. Reconfiguring myself as a replica of efc1dfb19e120a2f41e0cec3d03d02f43e168238
  4.   3022:S 26 Nov 10:35:15.595 # Connection with slave 10.10.79.150:6389 lost.
  5.   3022:S 26 Nov 10:35:15.616 # Cluster state changed: ok
  6.   3022:S 26 Nov 10:35:16.061 * Connecting to MASTER 10.10.79.150:6389
  7.   3022:S 26 Nov 10:35:16.061 * MASTER <-> SLAVE sync started
  8.   3022:S 26 Nov 10:35:16.062 * Non blocking connect for SYNC fired the event.
  9.   3022:S 26 Nov 10:35:16.062 * Master replied to PING, replication can continue...
  10.   3022:S 26 Nov 10:35:16.062 * Partial resynchronization not possible (no cached master)
  11.   3022:S 26 Nov 10:35:16.062 * Full resync from master: 80ac45a2b236fc44ebbef3d851e535bcbdd6367b:21516181353
  12.   3022:S 26 Nov 10:35:45.273 * MASTER <-> SLAVE sync: receiving 494097946 bytes from master
  13.   3022:S 26 Nov 10:35:51.052 * MASTER <-> SLAVE sync: Flushing old data
  14.   3022:S 26 Nov 10:36:22.165 * MASTER <-> SLAVE sync: Loading DB in memory
  15.   3022:S 26 Nov 10:37:44.355 * MASTER <-> SLAVE sync: Finished with success
  16.   3022:S 26 Nov 10:37:44.440 * Background append only file rewriting started by pid 4036
  17.   3022:S 26 Nov 10:38:34.988 * AOF rewrite child asks to stop sending diffs.
  18.   4036:C 26 Nov 10:38:34.988 * Parent agreed to stop sending diffs. Finalizing AOF...
  19.   4036:C 26 Nov 10:38:34.988 * Concatenating 0.02 MB of AOF diff received from parent.
  20.   4036:C 26 Nov 10:38:34.989 * SYNC append only file rewrite performed
  21.   4036:C 26 Nov 10:38:35.029 * AOF rewrite: 5 MB of memory used by copy-on-write
  22.   3022:S 26 Nov 10:38:35.127 * Background AOF rewrite terminated with success
  23.   3022:S 26 Nov 10:38:35.127 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
  24.   3022:S 26 Nov 10:38:35.128 * Background AOF rewrite finished successfully

      

    (2) 10.10.79.150:6389(從)

  1.   3173:S 26 Nov 10:35:10.934 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
  2.   3173:S 26 Nov 10:35:11.003 # Start of election delayed for 515 milliseconds (rank #0, offset 21516181352).
  3.   3173:S 26 Nov 10:35:11.606 # Starting a failover election for epoch 18.
  4.   3173:S 26 Nov 10:35:11.660 # Failover election won: I'm the new master.
  5.   3173:S 26 Nov 10:35:11.660 # configEpoch set to 18 after successful failover
  6.   3173:M 26 Nov 10:35:11.660 # Connection with master lost.
  7.   3173:M 26 Nov 10:35:11.660 * Caching the disconnected master state.
  8.   3173:M 26 Nov 10:35:11.660 * Discarding previously cached master state.
  9.   3173:M 26 Nov 10:35:15.605 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
  10.   3173:M 26 Nov 10:35:16.071 * Slave 10.10.81.94:7497 asks for synchronization
  11.   3173:M 26 Nov 10:35:16.071 * Full resync requested by slave 10.10.81.94:7497
  12.   3173:M 26 Nov 10:35:16.071 * Starting BGSAVE for SYNC with target: disk
  13.   3173:M 26 Nov 10:35:16.188 * Background saving started by pid 23531
  14.   3173:M 26 Nov 10:35:18.091 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
  15.   3173:M 26 Nov 10:35:31.056 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
  16.   3173:M 26 Nov 10:35:40.040 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
  17.   23531:C 26 Nov 10:35:45.048 * DB saved on disk
  18.   23531:C 26 Nov 10:35:45.113 * RDB: 6 MB of memory used by copy-on-write
  19.   3173:M 26 Nov 10:35:45.281 * Background saving terminated with success
  20.   3173:M 26 Nov 10:35:51.059 * Synchronization with slave 10.10.81.94:7497 succeeded
  21.   3173:M 26 Nov 10:36:11.107 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
  22.   3173:M 26 Nov 10:36:22.277 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

 

   (3) 10.10.81.95:7497(主)

  1.   14795:M 26 Nov 10:35:10.309 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
  2.   14795:M 26 Nov 10:35:11.034 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
  3.   14795:M 26 Nov 10:35:14.979 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
  4.   14795:M 26 Nov 10:36:20.480 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
  5.   14795:M 26 Nov 10:36:31.652 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

   (4) 10.10.79.157:6390(從)

  1.   29177:S 26 Nov 10:35:10.939 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
  2.   29177:S 26 Nov 10:35:15.609 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
  3.   29177:S 26 Nov 10:36:11.111 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
  4.   29177:S 26 Nov 10:36:22.281 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

 

 

   (5) 10.10.83.180:6382(主)    

  1.   9611:M 26 Nov 10:35:11.342 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
  2.   9611:M 26 Nov 10:35:12.068 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
  3.   9611:M 26 Nov 10:35:16.013 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
  4.   9611:M 26 Nov 10:36:11.516 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
  5.   9611:M 26 Nov 10:36:22.686 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

 

    

   (6) 10.10.81.96:6391(從)

  1.   7925:S 26 Nov 10:35:10.260 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
  2.   7925:S 26 Nov 10:35:14.930 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
  3.   7925:S 26 Nov 10:36:20.879 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
  4.   7925:S 26 Nov 10:35:33.603 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

 

 

   (7) 10.10.81.97:7499(主) 

  1.   4051:M 26 Nov 10:35:09.768 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
  2.   4051:M 26 Nov 10:35:10.507 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
  3.   4051:M 26 Nov 10:35:14.452 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
  4.   4051:M 26 Nov 10:36:09.955 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
  5.   4051:M 26 Nov 10:36:21.125 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

 

      

   (8) 10.10.78.52:6398(從)

  

  1.   22972:S 26 Nov 10:35:10.944 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
  2.   22972:S 26 Nov 10:35:15.635 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
  3.   22972:S 26 Nov 10:36:11.116 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
  4.   22972:S 26 Nov 10:36:22.286 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

 

 

   (9) 10.10.81.98:7496(主)

   

  1.   2062:M 26 Nov 10:35:10.908 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
  2.   2062:M 26 Nov 10:35:11.634 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
  3.   2062:M 26 Nov 10:35:15.599 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
  4.   2062:M 26 Nov 10:36:11.080 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
  5.   2062:M 26 Nov 10:36:22.250 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

 

   (10) 10.10.78.53:6396(從)

  1.   19527:S 26 Nov 10:35:10.944 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
  2.   19527:S 26 Nov 10:35:15.615 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
  3.   19527:S 26 Nov 10:36:11.117 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
  4.   19527:S 26 Nov 10:36:22.287 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.
 

 

四、原因查詢

redis節點被認為失敗,可能有如下原因:

1. 機器本身問題:負載過高,網路阻塞,cpu過高等等。

   通過tsar命令,查詢機器當時各個指標,未發現有什麼問題:

  1.   Time ---cpu-- ---mem-- ---tcp-- -----traffic---- --sda--- --sdb--- --sdc--- --sdd--- --sde--- --sdf--- --sdg--- --sdh--- --sdi--- --sdj--- --sdk--- --sdl--- --sdm--- ---load-
  2.   Time util util retran bytin bytout util util util util util util util util util util util util util load1
  3.   26/11/15-10:20 4.07 67.89 0.00 1.7M 1.7M 0.05 0.00 1.75 0.00 8.07 0.00 0.00 0.00 0.00 0.00 0.38 0.25 0.00 1.77
  4.   26/11/15-10:25 4.11 67.89 0.00 1.7M 1.6M 0.01 0.00 1.95 0.00 7.45 0.00 0.00 0.00 0.00 0.00 3.47 0.27 0.00 0.94
  5.   26/11/15-10:30 4.37 67.89 0.00 1.7M 1.7M 0.18 0.00 2.00 0.00 7.79 0.00 0.00 0.00 0.00 0.00 5.31 0.15 0.00 0.94
  6.   26/11/15-10:35 4.86 68.38 0.00 1.8M 1.7M 0.47 0.00 2.40 0.00 7.99 0.00 0.00 0.00 0.00 0.00 3.06 0.15 0.00 1.22
  7.   26/11/15-10:40 7.29 67.66 0.00 2.9M 1.6M 1.55 0.00 2.92 0.00 8.46 0.00 0.00 0.00 0.00 0.00 0.87 0.11 0.00 1.14
  8.   26/11/15-10:45 4.08 67.68 0.00 1.2M 1.6M 0.09 0.00 0.34 0.00 9.12 0.00 0.00 0.00 0.00 0.00 0.25 1.35 0.00 0.46

 

2. cluster-node-timeout: 這個我們設定的是15秒,一直未發現問題,從別人的分享看,這個數字已經足夠了。

 

3. 被判斷失敗的redis本身的問題:阻塞操作(比較常見的有: aof阻塞、慢查詢):從redis的日誌看沒有發生aof阻塞的情況,但是從慢查詢的日誌中,發現瞭如下,瞬間就明白了,許多慢查詢,redis的單執行緒什麼都堵住了。

 

 

五、注意:
     本文的日誌時間做過糾正,實際有些機器沒有做時間同步,造成日誌實際對不上,不利排查,而且在分散式系統中,時間的一致性通常非常重要,所以這個要注意。