myshard issue: data inconsistency

The business team asked: why is this record missing in data center 10 when the other data centers have it?

mysql> select * from tbl_groupinfo where gid=xxxxxxx limit 10;
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| sid        | tm_timestamp | tm_lasttime | gid                 | group_name | default_flag | group_attr | group_owner | group_extension | is_del | app_id | mic_seat | invite_perm | invite_media_perm | pub_id_search | apply_verify | public_id | introduc | topic_id | __version           | __deleted |
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| xxxxxxxxxx |   1495773704 |  1495773704 | xxxxxxxxxxx | 處對象     |            0 |          5 |  3611732366 | vx:wtc2033      |      0 |     18 |        8 |           0 |                 0 |             1 |            0 |         0 |          |        0 | 6126694332813803019 |         0 |
+------------+--------------+-------------+---------------------+-------

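The tentative conclusion was that replication into data center 10 was broken. The plan was to find out which data center the record had been inserted in, then check replication between data center 10 and that one. Logging in via 8827, I fetched the record's version number, __version, and ran it through the conversion function. The result showed the record was inserted in data center 14: date 2017-05-26 05:03:03, data center 14, port 11.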


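This is comparable to the binlog in MySQL, which records the server-id that each SQL statement came from in order to prevent circular replication. myshard goes further: besides recording the server-id in the binlog, every record carries a version number that encodes which data center and which port it was written through, and when.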


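At this point we know that data written in data center 14 is not reaching data center 10, so the next step is to check the sync status on data center 14: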

[[email protected] local]# echo stat | /scripts/nc_myshard 0 14505 |egrep "speed|behind|offset"
shard_local             Read_offset             48494420885     
shard_local             Read_speed              33373           
shard_local             Read_bytes_behind       0                    
sync_r12m0              Read_offset             48494420885     
sync_r12m0              Read_speed              33373           
sync_r12m0              Read_bytes_behind       0               
sync_r13m0              Read_offset             48494420885     
sync_r13m0              Read_speed              33373           
sync_r13m0              Read_bytes_behind       0               
sync_r1m0               Read_offset             48494420885     
sync_r1m0               Read_speed              33373           
sync_r1m0               Read_bytes_behind       0               
sync_r3m0               Read_offset             48494420885     
sync_r3m0               Read_speed              33373           
sync_r3m0               Read_bytes_behind       0               
shard_remote            Read_offset             52080697507     
shard_remote            Read_speed              27290           
shard_remote            Read_bytes_behind       0
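Reading this output by eye works for a handful of channels. As a minimal sketch (not part of myshard), the same check can be scripted by comparing the sync_* channel names in the stat output against the peers that should be pulling from this node. The expected peer list below is an assumption made for illustration, and the sample is an excerpt of the output above.

```python
# Hypothetical list of peers that should be pulling from this node.
EXPECTED_PEERS = ["sync_r1m0", "sync_r3m0", "sync_r10m0", "sync_r12m0", "sync_r13m0"]

# Excerpt of the stat output shown above.
SAMPLE_STAT = """\
shard_local             Read_offset             48494420885
shard_local             Read_bytes_behind       0
sync_r12m0              Read_offset             48494420885
sync_r13m0              Read_offset             48494420885
sync_r1m0               Read_bytes_behind       0
sync_r3m0               Read_speed              33373
shard_remote            Read_offset             52080697507
"""


def connected_peers(stat_output):
    """Return the set of sync_* channel names present in the stat output."""
    peers = set()
    for line in stat_output.splitlines():
        fields = line.split()
        # Each stat line has three columns: channel, metric, value.
        if len(fields) == 3 and fields[0].startswith("sync_"):
            peers.add(fields[0])
    return peers


def missing_peers(stat_output, expected):
    """Return the expected peers that have no channel in the stat output."""
    present = connected_peers(stat_output)
    return [peer for peer in expected if peer not in present]


print(missing_peers(SAMPLE_STAT, EXPECTED_PEERS))
```

In practice the text passed to missing_peers would be the output of the stat command rather than a pasted sample.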

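There is no r10m0 channel pulling data, which confirms that replication is broken. The sync log in data center 10 shows it repeatedly trying to reconnect to data center 14: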

[[email protected] db_sync_HelloSrv_r10m0_d]# zcat db_sync_xxxxxxxx_r10m0_d.log.13.gz|grep xxx.xxx.xxx.144|more                                     
May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:06:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:06:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:07:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:07:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:07:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:08:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:08:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:09:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:09:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:09:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:10:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:10:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:10:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:11:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:11:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:12:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2

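There are many such entries, all retrying the connection to data center 14. The earliest reconnect attempts are in

db_sync_xxxxxxxx_r10m0_d.log.13.gz

and the listing below shows this file was rotated at 00:10 on May 14, so it holds the entries from May 13: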

-rw-r--r--. 1 root adm 174K May 13 00:10 db_sync_xxxxxxxx_r10m0_d.log.14.gz 
-rw-r--r--. 1 root adm 300K May 14 00:10 db_sync_xxxxxxxx_r10m0_d.log.13.gz 
-rw-r--r--. 1 root adm 230K May 15 00:10 db_sync_xxxxxxxx_r10m0_d.log.12.gz 
-rw-r--r--. 1 root adm 234K May 16 00:10 db_sync_xxxxxxxx_r10m0_d.log.11.gz 
-rw-r--r--. 1 root adm 260K May 17 00:10 db_sync_xxxxxxxx_r10m0_d.log.10.gz 
-rw-r--r--. 1 root adm 261K May 18 00:10 db_sync_xxxxxxxx_r10m0_d.log.9.gz  
-rw-r--r--. 1 root adm 260K May 19 00:10 db_sync_xxxxxxxx_r10m0_d.log.8.gz  
-rw-r--r--. 1 root adm 258K May 20 00:10 db_sync_xxxxxxxx_r10m0_d.log.7.gz  
-rw-r--r--. 1 root adm 260K May 21 00:10 db_sync_xxxxxxxx_r10m0_d.log.6.gz  
-rw-r--r--. 1 root adm 268K May 22 00:10 db_sync_xxxxxxxx_r10m0_d.log.5.gz  
-rw-r--r--. 1 root adm 254K May 23 00:10 db_sync_xxxxxxxx_r10m0_d.log.4.gz  
-rw-r--r--. 1 root adm 259K May 24 00:10 db_sync_xxxxxxxx_r10m0_d.log.3.gz  
-rw-r--r--. 1 root adm 262K May 25 00:10 db_sync_xxxxxxxx_r10m0_d.log.2.gz  
-rw-r--r--. 1 root adm 262K May 26 00:10 db_sync_xxxxxxxx_r10m0_d.log.1.gz
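Finding the first reconnect by paging through rotated logs is tedious. The sketch below shows one way to pull the first timestamp and the number of attempts out of log lines in the format shown above. The regular expression is written against those sample lines only, and the function name is made up for this example.

```python
import re

# Matches lines such as:
# May 13 15:05:31 info ...: [tid:77428] [OpLogSynchronizer::run]
#   connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
CONNECT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .*"
    r"connecting to (?P<peer>\S+) via (?P<addr>\S+), retry:(?P<retry>\d+)"
)

SAMPLE_LOG = [
    "May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0",
    "May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1",
    "May 13 15:06:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2",
]


def connect_attempts(lines, peer):
    """Return (timestamp of first attempt, number of attempts) for one peer."""
    first_ts = None
    count = 0
    for line in lines:
        match = CONNECT_RE.search(line)
        if match and match.group("peer") == peer:
            if first_ts is None:
                first_ts = match.group("ts")
            count += 1
    return first_ts, count


print(connect_attempts(SAMPLE_LOG, "sync_r14m0"))
```

For the rotated files, the lines can be read with gzip.open(path, "rt") from the standard library, starting with the oldest file.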

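Repeated reconnects usually have only two possible causes. One is that data center 14 has not whitelisted data center 10. Since replication had been set up successfully before, the whitelist must already have been in place. The other, more likely, cause is a firewall problem, so on data center 14 I ran: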

iptables -n -L | grep <IP of data center 10>
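A data center can reach its peer from more than one source address, so it helps to check every address in one pass instead of grepping for a single IP. The following is a minimal sketch that parses the text printed by iptables -n -L and reports which addresses have no ACCEPT rule. The addresses are documentation placeholders, not the real ones, and the check only looks at the source column: it ignores chains, ports, and protocols.

```python
import ipaddress

# Placeholder output in the column layout printed by `iptables -n -L`.
SAMPLE_RULES = """\
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     all  --  192.0.2.10           0.0.0.0/0
ACCEPT     tcp  --  203.0.113.0/24       0.0.0.0/0
"""


def ips_without_accept_rule(iptables_output, ips):
    """Return the addresses in `ips` not covered by any ACCEPT source."""
    allowed = []
    for line in iptables_output.splitlines():
        fields = line.split()
        # Rule lines look like: target prot opt source destination [options]
        if len(fields) >= 5 and fields[0] == "ACCEPT":
            try:
                allowed.append(ipaddress.ip_network(fields[3], strict=False))
            except ValueError:
                continue  # skip sources that are not plain addresses or CIDRs
    return [
        ip for ip in ips
        if not any(ipaddress.ip_address(ip) in net for net in allowed)
    ]


# Hypothetical addresses of the peer: one per carrier line.
peer_ips = ["192.0.2.10", "198.51.100.10"]
print(ips_without_accept_rule(SAMPLE_RULES, peer_ips))
```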

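The Telecom IP had a rule, but the Unicom IP had no firewall rule. This is a dual-line data center, and I had deployed the environment on May 12. So two days after deployment, poor network quality made the Telecom link unusable and the sync switched to the Unicom link. The Unicom IP had never been authorized, so data center 10 could no longer connect to data center 14. The business was not using this database at the time. Only yesterday, May 25, when the business began deploying processes in data center 14 and noticed the data was not replicating, did they contact the DBA.

I immediately added the firewall rule and restarted the sync process to pull the data again, but data center 10 kept logging errors and reconnecting. Shouldn't adding the firewall rule have been enough?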


On data center 14, however, a different error shows up:

May 26 15:41:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3159] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3161] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3163] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3234] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4411] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4416] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4560] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4656] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4657] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4730] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5476] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5478] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5508] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5511] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5554] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5557] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
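Every one of these lines carries the same client and the same offset, which suggests the client is stuck requesting one position rather than making progress. A short sketch for summarizing such errors is shown below. It assumes the message format in the lines above, and the function name is made up for this example.

```python
import re
from collections import Counter

# Matches: client:sync_r10m0, request log[local] from invalid offset:2587613132
INVALID_OFFSET_RE = re.compile(
    r"client:(?P<client>\S+), request log\[local\] "
    r"from invalid offset:(?P<offset>\d+)"
)

SAMPLE_ERRORS = [
    "May 26 15:41:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3159] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132",
    "May 26 15:41:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3161] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132",
    "May 26 15:41:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3163] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132",
]


def invalid_offset_summary(lines):
    """Count invalid-offset errors per (client, requested offset) pair."""
    counts = Counter()
    for line in lines:
        match = INVALID_OFFSET_RE.search(line)
        if match:
            counts[(match.group("client"), int(match.group("offset")))] += 1
    return counts


for (client, offset), n in invalid_offset_summary(SAMPLE_ERRORS).items():
    print(f"{client} requested offset {offset} {n} times")
```

A single (client, offset) pair with a growing count means the restarts are not changing the position the client asks for.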

