1. 程式人生 > >mysql複製那點事 - Seconds_behind_Master引數調查筆記

mysql複製那點事 - Seconds_behind_Master引數調查筆記

目錄

  • mysql複製那點事 - Seconds_behind_Master引數調查筆記
    • 0. 參考文獻
    • 1. 問題背景
    • 2. 調查結論
    • 3. 調查與分析過程
      • 3.1 輪轉binlog時的執行邏輯
      • 3.2 da執行與本例項serverid一致的binlog時的執行邏輯
      • 3.3 小結
    • 4. 總結

mysql複製那點事 - Seconds_behind_Master引數調查筆記


0. 參考文獻

序號 文獻
1 MySQL 5.7 MTS原始碼分析
2 MySQL · 答疑解惑 · 備庫Seconds_Behind_Master計算
3 MySQL · 答疑釋惑 · server_id為0的Rotate
4 【案例】主從替換之後的複製風暴
5 MySQL 複製原始碼解析
6 MySQL主庫02設定宕機導致的主庫資料丟失解決方法和原因
7 Mysql rpl_slave.cc:handle_slave_io 原始碼的一些個人分析
8 mysql slave 備庫延遲是怎麼得到的
9 MySQL 編譯安裝並且開啟DEBUG模式
10 Bug #72376:Seconds_behind_master distorted because of previous_gtid event
11 MySQL Binlog解析(1)
12 理解MySQL——複製(Replication)

ps:如上的參考資料並非完全在本文中被引用,其他沒有被引用的資料在本文的形成的工作中,提供了思路上和其他方面的參考,因此一併列入參考文獻。感謝如上的文獻作者提供的參考。


1. 問題背景

部門當前的資料庫架構是雙主模式,既線上由2臺互為主從的資料庫搭建而成的叢集。高可用通過vip和headbeat來做保證。通常情況下,vip掛在主(本文稱之為da)上,當da發生了異常比如宕機等問題的時候,vip自動漂移至從(本文稱之為dp)。架構如下圖所示:

可以看出,所有的寫入和讀取操作都在da上進行。da產生的binlog會被dp拉取重放。同時dp也會產生binlog並被da拉取重放(注:這裡不會產生迴圈複製(既da產生的binlog事件在dp執行過了之後,再被da讀取到再執行),因為binlog中每個event都有個serverid標記,標記是哪個server產生的事物。當da讀取到binlog的時候,發現serverid和本機的serverid一致,便會跳過當前的binlog event)。正常情況下,da的Seconds_behind_Master引數,應該保持在0或者很小的數值才對。但是監控系統發現,da的Seconds_behind_Master引數經常發生跳變,上一秒為0下一秒就可能為一個很大值(比如10000)。例如下面的資料,是通過指令碼抓取到的線上Seconds_behind_Master情況:

2019-06-27 13:26:00
          Read_Master_Log_Pos: 472851480
               Relay_Log_File: da01-relay-bin.002323
                Relay_Log_Pos: 371
        Relay_Master_Log_File: dp01-bin.004267
          Exec_Master_Log_Pos: 472851480
        Seconds_Behind_Master: 0
2019-06-27 13:26:01
          Read_Master_Log_Pos: 473594815
               Relay_Log_File: da01-relay-bin.002323
                Relay_Log_Pos: 371
        Relay_Master_Log_File: dp01-bin.004267
          Exec_Master_Log_Pos: 473594649
        Seconds_Behind_Master: 622
2019-06-27 13:26:02
          Read_Master_Log_Pos: 474422859
               Relay_Log_File: da01-relay-bin.002323
                Relay_Log_Pos: 371
        Relay_Master_Log_File: dp01-bin.004267
          Exec_Master_Log_Pos: 474422859
        Seconds_Behind_Master: 0

2. 調查結論

前文講到了本文調查的問題背景,在這裡先給出下整個問題的調查結論。在sql/rpl_slave.cc 中計算Seconds_Behind_Master邏輯如下 (文獻2):

if ((mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()) && (!strcmp(mi->get_master_log_name(), mi->rli->get_group_master_log_name()))) { 
    if (mi->slave_running == MYSQL_SLAVE_RUN_CONNECT) 
    protocol->store(0LL); 
    else 
    protocol->store_null(); 
} else { 
  long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp) - mi->clock_diff_with_master);

  protocol->store((longlong)(mi->rli->last_master_timestamp ? max(0L, time_diff) : 0)); 
}

當Exec_Master_Log_Pos <Read_Master_Log_Pos 的時候進入else裡面。在da的io程序拉取dp的binlog的時候,會發現dp的binlog的serverid是自身產生的。 所以不會去更新last_master_timestamp。如果dp的binlog發生過輪轉,會產生一個ROTATE_EVENT事件更新這個last_master_timestamp。因此當sql程序比io程序慢的時候, 計算出來的time_diff 便是當前時間與上次ROTATE_EVENT事件的差值。復現方法:

  1. dp上flush logs
  2. da上stop slave sql_thread;
  3. da上DML資料。此時觀察show slave status 可以看到Read_Master_Log_Pos 很超前。
  4. da上start slave ;show slave status \G 可以看到Seconds_Behind_Master變大。

3. 調查與分析過程

從前文的結果中可以看出,跟Seconds_Behind_Master(下文簡稱SBM)值有關的mysql 變數就是last_master_timestamp。因此需要追蹤下在收到的binlog的serverid和mysql程序的serverid一致的情況下,last_master_timestamp是如何更新的。根據前文的復現方法,需要去確認如下的2個點:

  1. dp上flush logs的時候,da獲取到dp的binglog做的操作。
  2. 收到的serverid和本機一致的binlog的時候,da的mysql程序的操作。

為了便於觀察mysql的執行邏輯,本文參考文獻9的辦法,編譯了一個debug版本。在mysql執行的過程中,tail -f mysqld.trace檔案觀察mysql命令執行的過程。

3.1 輪轉binlog時的執行邏輯

3.1.1 寫入ROTATE_EVENT

在dp 執行flush logs 或者輪轉binlog的時候,會產生一個ROTATE_EVENT事件(具體的解析可以參考文獻11)。在mysqld.trace檔案中,可以觀察到如下的資訊,標誌著da此時接收到了dp的binlog輪轉事件的ROTATE_EVENT。

handle_slave_io: info: IO thread received event of type Rotate

相應的程式碼在sql/rpl_slave.cc檔案的5819行中:

 5817       THD_STAGE_INFO(thd, stage_queueing_master_event_to_the_relay_log);
 5818       event_buf= (const char*)mysql->net.read_pos + 1;
 5819       DBUG_PRINT("info", ("IO thread received event of type %s",
 5820                  Log_event::get_type_str(
 5821                             (Log_event_type)event_buf[EVENT_TYPE_OFFSET])));
          
          ... ...
              
 5831
 5832       /* XXX: 'synced' should be updated by queue_event to indicate
 5833          whether event has been synced to disk */
 5834       bool synced= 0;
 5835       if (queue_event(mi, event_buf, event_len))
 5836       {
 5837         mi->report(ERROR_LEVEL, ER_SLAVE_RELAY_LOG_WRITE_FAILURE,
 5838                    ER(ER_SLAVE_RELAY_LOG_WRITE_FAILURE),
 5839                    "could not queue event from master");
 5840         goto err;
 5841       }

可以看到在sql程序接收到一個binlog之後,會呼叫queue_event進行寫入relaylog。queue_event處理ROTATE_EVENT關鍵程式碼如下:

 8236  switch (event_type) {
         
      ... ...
           
 8251  case binary_log::ROTATE_EVENT:
 8252  {
 8253     Rotate_log_event rev(buf, checksum_alg != binary_log::BINLOG_CHECKSUM_ALG_OFF ?
 8254                          event_len - BINLOG_CHECKSUM_LEN : event_len,
 8255                          mi->get_mi_description_event());
 8256 
 8257     if (unlikely(process_io_rotate(mi, &rev)))
 8258     {
 8259       mi->report(ERROR_LEVEL, ER_SLAVE_RELAY_LOG_WRITE_FAILURE,
 8260                  ER(ER_SLAVE_RELAY_LOG_WRITE_FAILURE),
 8261                  "could not queue event from master");
 8262       goto err;
 8263     }

在switch中判斷binlog如果是ROTATE_EVENT,則呼叫process_io_rotate進行處理。其中buf是sql進行拉取到的binlog的快取,mi是代表了masterinfo。process_io_rotate主要的工作是更新master_info 的相關資訊,例如下一個binglog的位置,master的binlog位置資訊等。可以在mysqld.trace 檔案中觀察到如下的資訊:

process_io_rotate: info: new (master_log_name, master_log_pos): ('dp-bin.000029', 4)

process_io_rotate函式註釋資訊如下:

 7767 /**
 7768   Used by the slave IO thread when it receives a rotate event from the
 7769   master.
 7770
 7771   Updates the master info with the place in the next binary log where
 7772   we should start reading.  Rotate the relay log to avoid mixed-format
 7773   relay logs.
 7774
 7775   @param mi master_info for the slave
 7776   @param rev The rotate log event read from the master
 7777
 7778   @note The caller must hold mi->data_lock before invoking this function.
 7779
 7780   @retval 0 ok
 7781   @retval 1 error
 <!--7782 */-->
 7783 static int process_io_rotate(Master_info *mi, Rotate_log_event *rev)
     
... ...
     
 7830 }

在完成相關的工作之後,來到queue_event函式的最後,寫入ROTATE_EVENT,在mysqld.trace中可以看到如下的資訊:

queue_event: info: master_log_pos: 4
harvest_bytes_written: info: counter: 769  bytes_written: 44
queue_event: info: error: 0

相關程式碼如下:

 8656   {
 8657     bool is_error= false;
 8658     /* write the event to the relay log */
 8659     if (likely(rli->relay_log.append_buffer(buf, event_len, mi) == 0))
 8660     {
 8661       mi->set_master_log_pos(mi->get_master_log_pos() + inc_pos);
 8662       DBUG_PRINT("info", ("master_log_pos: %lu", (ulong) mi->get_master_log_pos()));
 8663       rli->relay_log.harvest_bytes_written(rli, true/*need_log_space_lock=true*/);
     
          ... ...
              
          }
         } 
  

到此,ROTATE_EVENT事件被完全寫入到了relaylog中。借用一張經典mysql複製原理圖(文獻12)說明這個過程,即完成了圖中從BinaryLog 讀取到RelayLog的寫入過程。

3.1.2 sql 程序讀取和執行ROTATE_EVENT

在前文的描述中,可以觀察到在RelayLog的寫入過程中並沒有去更新last_master_timestamp相關資訊。因此更新last_master_timestamp的工作,只可能在sql程序中去完成。接下來本節將探索下ROTATE_EVENT在sql程序中的執行過程。

mysql複製程式碼中,sql程序的入口函式是handle_slave_sql:

 7158 /**
 7159   Slave SQL thread entry point.
 7160 
 7161   @param arg Pointer to Relay_log_info object that holds information
 7162   for the SQL thread.
 7163 
 7164   @return Always 0.
 7165 */
 7166 extern "C" void *handle_slave_sql(void *arg)

在handle_slave_sql中,關鍵的程式碼是while (!sql_slave_killed(thd,rli))這個迴圈:

 7438   while (!sql_slave_killed(thd,rli))
 7439   {
 7440     THD_STAGE_INFO(thd, stage_reading_event_from_the_relay_log);
 7441     DBUG_ASSERT(rli->info_thd == thd);
 7442     THD_CHECK_SENTRY(thd);
 7443
 7444     if (saved_skip && rli->slave_skip_counter == 0)
 7445     {
     
             ... ...
                 
 7456     }
 7457
 7458     if (exec_relay_log_event(thd,rli))
 7459     {
 7460        ... ...
 7525       }
 7526       goto err;
 7527     }
 7528   }

     

可以發現,每次迴圈的時候,都會去執行exec_relay_log_event這個函式:

 5098   Top-level function for executing the next event in the relay log.
 5099   This is called from the SQL thread.
 5100
 5101   This function reads the event from the relay log, executes it, and
 5102   advances the relay log position.  It also handles errors, etc.
 5103
 5104   This function may fail to apply the event for the following reasons:
 5105
 5106    - The position specfied by the UNTIL condition of the START SLAVE
 5107      command is reached.
 5108
 5109    - It was not possible to read the event from the log.
 5110
 5111    - The slave is killed.
 5112
 5113    - An error occurred when applying the event, and the event has been
 5114      tried slave_trans_retries times.  If the event has been retried
 5115      fewer times, 0 is returned.
 5116
 5117    - init_info or init_relay_log_pos failed. (These are called
 5118      if a failure occurs when applying the event.)
 5119
 5120    - An error occurred when updating the binlog position.
 5121
 5122   @retval 0 The event was applied.
 5123
 5124   @retval 1 The event was not applied.
 5125 */
 5126 static int exec_relay_log_event(THD* thd, Relay_log_info* rli)

在exec_relay_log_event中,執行對應的relaylog之後,會相應的更新last_master_timestamp。在這裡並行複製模式和非並行複製模式下,更新last_master_timestamp的方式是不同的。在這裡先介紹下非並行複製模式下更新last_master_timestamp的步驟。

3.1.2.1 非並行複製模式下更新last_master_timestamp

在exec_relay_log_event中判斷是否是並行複製是通過is_parallel_exec函式實現的。如果是並行複製模式則返回True,否則返回False。在非並行複製的模式下,會進入如下的程式碼執行:

 5175     /*
 5176       Even if we don't execute this event, we keep the master timestamp,
 5177       so that seconds behind master shows correct delta (there are events
 5178       that are not replayed, so we keep falling behind).
 5179
 5180       If it is an artificial event, or a relay log event (IO thread generated
 5181       event) or ev->when is set to 0, or a FD from master, or a heartbeat
 5182       event with server_id '0' then  we don't update the last_master_timestamp.
 5183
 5184       In case of parallel execution last_master_timestamp is only updated when
 5185       a job is taken out of GAQ. Thus when last_master_timestamp is 0 (which
 5186       indicates that GAQ is empty, all slave workers are waiting for events from
 5187       the Coordinator), we need to initialize it with a timestamp from the first
 5188       event to be executed in parallel.
 5189     */
 5190     if ((!rli->is_parallel_exec() || rli->last_master_timestamp == 0) &&
 5191          !(ev->is_artificial_event() || ev->is_relay_log_event() ||
 5192           (ev->common_header->when.tv_sec == 0) ||
 5193           ev->get_type_code() == binary_log::FORMAT_DESCRIPTION_EVENT ||
 5194           ev->server_id == 0))
 5195     {
 5196       rli->last_master_timestamp= ev->common_header->when.tv_sec +
 5197                                   (time_t) ev->exec_time;
 5198       DBUG_ASSERT(rli->last_master_timestamp >= 0);
 5199     }

可以看到,last_master_timestamp的值便是ev->common_header->when.tv_sec和(time_t) ev->exec_time的和。在程式碼中加上點除錯資訊,重新編譯:

      DBUG_PRINT("info", ("before rli->last_master_timestamp = %lu", rli->last_master_timestamp));
      DBUG_PRINT("info", ("before rli->is_parallel_exec() = %d", int(rli->is_parallel_exec())));
    if ((!rli->is_parallel_exec() || rli->last_master_timestamp == 0) &&
         !(ev->is_artificial_event() || ev->is_relay_log_event() ||
          (ev->common_header->when.tv_sec == 0) ||
          ev->get_type_code() == binary_log::FORMAT_DESCRIPTION_EVENT ||
          ev->server_id == 0))
    {
      rli->last_master_timestamp= ev->common_header->when.tv_sec +
                                  (time_t) ev->exec_time;

      DBUG_PRINT("info", ("after rli->last_master_timestamp = %lu", rli->last_master_timestamp));
      DBUG_PRINT("info", ("after rli->is_parallel_exec() = %d", int(rli->is_parallel_exec())));
      DBUG_ASSERT(rli->last_master_timestamp >= 0);
    }

可以在mysqld.trace中觀察到如下的除錯資訊,印證了前文關於非並行複製的結論。同時也可以看出,非並行複製下last_master_timestamp的更新是在binlogevent執行之前。

exec_relay_log_event: info: before rli->last_master_timestamp = 0
exec_relay_log_event: info: before rli->is_parallel_exec() = 0
exec_relay_log_event: info: after rli->last_master_timestamp = 1562201989
exec_relay_log_event: info: after rli->is_parallel_exec() = 0

3.1.2.2 並行複製模式下更新last_master_timestamp

前文介紹了非並行複製模式下的last_master_timestamp的更新邏輯。本節將介紹下在並行複製模式下,last_master_timestamp是如何更新的。

在並行複製模式下last_master_timestamp的更新邏輯更為複雜。同時對於ROTATE_EVENT事件和普通的binlog事件更新模式也有所不同。接下來首先介紹下普通的binlog事件下last_master_timestamp的更新邏輯:

3.1.2.2.1 普通的binlog事件下last_master_timestamp的更新邏輯

在文獻1和文獻2中對於並行複製邏輯的分析如下:

​ 並行複製有一個分發佇列gaq,sql執行緒將binlog事務讀取到gaq,然後再分發給worker執行緒執行。並行複製時,binlog事件是併發穿插執行的,gaq中有一個checkpoint點稱為lwm, lwm之前的binlog都已經執行,而lwm之後的binlog有些執行有些沒有執行。 假設worker執行緒數為2,gap有1,2,3,4,5,6,7,8個事務。worker 1已執行的事務為1 4 6, woker 2執行的事務為2 3 ,那麼lwm為4。

​ 並行複製更新gap checkpiont時,會推進lwm點,同時更新last_master_timestamp為lwm所在事務結束的event的時間。因此,並行複製是在事務執行完成後才更新last_master_timestamp,更新是以事務為單位。同時更新gap checkpiont還受slave_checkpoint_period引數的影響。

​ coordinator會呼叫函式mts_checkpoint_routune定期做“checkpoint”操作,將Relay_log_info.gaq中已經執行結束的事務移出佇列,移除方式為從佇列頭開始檢查,如果done為真則移出,否則停止掃描,並標記Low Water Mark為移出的事務中最大的sequence_number(lock-interval終止點)

可以看出,並行複製下last_master_timestamp的更新邏輯與lwm和mts_checkpoint_routune函式有關。mts_checkpoint_routune函式頭如下:

/**
   Processing rli->gaq to find out the low-water-mark (lwm) coordinates
   which is stored into the cental recovery table.

   @param rli            pointer to Relay-log-info of Coordinator
   @param period         period of processing GAQ, normally derived from
                         @c mts_checkpoint_period
   @param force          if TRUE then hang in a loop till some progress
   @param need_data_lock False if rli->data_lock mutex is aquired by
                         the caller.

   @return FALSE success, TRUE otherwise
*/
bool mts_checkpoint_routine(Relay_log_info *rli, ulonglong period,
                            bool force, bool need_data_lock)

在coordinator中會定期呼叫mts_checkpoint_routine將Relay_log_info.gaq中已經執行結束的事務移出佇列,並且更新Low Water Mark為移出的事務中最大的sequence_number(參見mts_checkpoint_routine中如下):

 do
  {
    if (!is_mts_db_partitioned(rli))
      mysql_mutex_lock(&rli->mts_gaq_LOCK);

    cnt= rli->gaq->move_queue_head(&rli->workers);

    if (!is_mts_db_partitioned(rli))
      mysql_mutex_unlock(&rli->mts_gaq_LOCK);
#ifndef DBUG_OFF
    if (DBUG_EVALUATE_IF("check_slave_debug_group", 1, 0) &&
        cnt != opt_mts_checkpoint_period)
      sql_print_error("This an error cnt != mts_checkpoint_period");
#endif
  } while (!sql_slave_killed(rli->info_thd, rli) &&
           cnt == 0 && force &&
           !DBUG_EVALUATE_IF("check_slave_debug_group", 1, 0) &&
           (my_sleep(rli->mts_coordinator_basic_nap), 1));

在函式的最後,會執行last_master_timestamp的更新邏輯:

  /*
    Update the rli->last_master_timestamp for reporting correct Seconds_behind_master.

    If GAQ is empty, set it to zero.
    Else, update it with the timestamp of the first job of the Slave_job_queue
    which was assigned in the Log_event::get_slave_worker() function.
  */
  ts= rli->gaq->empty()
    ? 0
    : reinterpret_cast<Slave_job_group*>(rli->gaq->head_queue())->ts;
  rli->reset_notified_checkpoint(cnt, ts, need_data_lock, true);

可以看出如果GAP為空(既work的佇列為空)則更新last_master_timestamp為0,否則更新為佇列中第一個是元素的timestamp。

3.1.2.2.2 ROTATE_EVENT事件下last_master_timestamp的更新模式

前文提到在並行複製模式下,普通的binlog事件會被coordinator程序分發到work中去執行。並且在mts_checkpoint_routine中去推進Low Water Mark和更新last_master_timestamp。但是如果執行的binlog是一個ROTATE_EVENT事件,則coordinator程序不會將事件分發到work中,而是在coordinator程序中自己執行並更新last_master_timestamp。

對於ROTATE_EVENT事件的執行邏輯,入口依然是exec_relay_log_event函式。在函式中呼叫了apply_event_and_update_pos函式執行binlog事件。

 5247     /* ptr_ev can change to NULL indicating MTS coorinator passed to a Worker */
 5248     exec_res= apply_event_and_update_pos(ptr_ev, thd, rli);
 5249     /*
 5250       Note: the above call to apply_event_and_update_pos executes
 5251       mysql_mutex_unlock(&rli->data_lock);
 5252     */
 5253
 5254     /* For deferred events, the ptr_ev is set to NULL
 5255         in Deferred_log_events::add() function.
 5256         Hence deferred events wont be deleted here.
 5257         They will be deleted in Deferred_log_events::rewind() funciton.
 5258     */

在apply_event_and_update_pos函式中,會首先呼叫apply_event去判斷當前的binlog事件是否可以被分發到work程序執行。在這裡收到的binlog事件是ROTATE_EVENT,因此apply_event返回的exec_res為0,ev->worker==rli(rli 代表了coordinator程序),因此不會進入下面的if邏輯中:

 4709   if (reason == Log_event::EVENT_SKIP_NOT)
 4710   {
 4711     // Sleeps if needed, and unlocks rli->data_lock.
 4712     if (sql_delay_event(ev, thd, rli))
 4713       DBUG_RETURN(SLAVE_APPLY_EVENT_AND_UPDATE_POS_OK);
 4714
 4715     exec_res= ev->apply_event(rli);
 4717     if (!exec_res && (ev->worker != rli))
 4718     {
                        ... ...
          }

在mysqld.trace 中可以觀察到如下的輸出:

Log_event::shall_skip: info: skip reason=0=NOT
LOG_EVENT:apply_event: info: event_type=Rotate
apply_event_and_update_pos: info: apply_event error = 0

apply_event函式定義和註釋如下:

/**
   Scheduling event to execute in parallel or execute it directly.
   In MTS case the event gets associated with either Coordinator or a
   Worker.  A special case of the association is NULL when the Worker
   can't be decided yet.  In the single threaded sequential mode the
   event maps to SQL thread rli.

   @note in case of MTS failure Coordinator destroys all gathered
         deferred events.

   @return 0 as success, otherwise a failure.
*/
int Log_event::apply_event(Relay_log_info *rli)

因為無法被分發到work執行,因此會進入如下的執行邏輯中:

 4832   DBUG_PRINT("info", ("apply_event error = %d", exec_res));
 4833   if (exec_res == 0)
 4834   {
 4835     /*
 4836       Positions are not updated here when an XID is processed. To make
 4837       a slave crash-safe, positions must be updated while processing a
 4838       XID event and as such do not need to be updated here again.
 4839
 4840       However, if the event needs to be skipped, this means that it
 4841       will not be processed and then positions need to be updated here.
 4842
 4843       See sql/rpl_rli.h for further details.
 4844     */
 4845     int error= 0;
 4846     if (*ptr_ev &&
 4847         (ev->get_type_code() != binary_log::XID_EVENT ||
 4848          skip_event || (rli->is_mts_recovery() && !is_gtid_event(ev) &&
 4849          (ev->ends_group() || !rli->mts_recovery_group_seen_begin) &&
 4850           bitmap_is_set(&rli->recovery_groups, rli->mts_recovery_index))))
 4851     {
 4852 #ifndef DBUG_OFF
 4853       /*
 4854         This only prints information to the debug trace.
 4855
 4856         TODO: Print an informational message to the error log?
 4857       */
            ... ...
   
 4873       error= ev->update_pos(rli);
                    
                    ... ...
           }
         }

可以發現,在此呼叫了update_pos進行更新。因為當前的binlog事件是ROTATE_EVENT,因此會呼叫如下的程式碼:

/*
  Got a rotate log event from the master.

  This is mainly used so that we can later figure out the logname and
  position for the master.

  We can't rotate the slave's BINlog as this will cause infinitive rotations
  in a A -> B -> A setup.
  The NOTES below is a wrong comment which will disappear when 4.1 is merged.

  This must only be called from the Slave SQL thread, since it calls
  flush_relay_log_info().

  @retval
    0   ok
*/
int Rotate_log_event::do_update_pos(Relay_log_info *rli)

在do_update_pos 更新last_master_timestamp 的關鍵邏輯如下:

if (rli->is_parallel_exec())
{
      bool real_event= server_id && !is_artificial_event();
      rli->reset_notified_checkpoint(0,
                           real_event ?
                           common_header->when.tv_sec +
                           (time_t) exec_time : 0,
                           true/*need_data_lock=true*/,
                           real_event? true : false);
}

可以在mysqld.trace 中觀察到如下的輸出:

Rotate_log_event::do_update_pos: info: server_id=248; ::server_id=236
Rotate_log_event::do_update_pos: info: new_log_ident: dp-bin.000063
Rotate_log_event::do_update_pos: info: pos: 4
Rotate_log_event::do_update_pos: info: old group_master_log_name: 'dp-bin.000062'  old group_master_log_pos: 154
Relay_log_info::inc_group_relay_log_pos: info: log_pos: 4  group_master_log_pos: 154
Rotate_log_event::do_update_pos: info: new group_master_log_name: 'dp-bin.000063'  new group_master_log_pos: 4
Rotate_log_event::do_update_pos: info: ------> 1562684581(此處為本文加入的除錯資訊,在reset_notified_checkpoint更新last_master_timestamp的時候打印出。)

至此,對於並行複製和非並行複製下當收到ROTATE_EVENT事件時,更新last_master_timestamp的邏輯分析全部完成。

3.2 da執行與本例項serverid一致的binlog時的執行邏輯

前文分析了當mysql收到一個ROTATE_EVENT事件的時候所執行的邏輯。本節將分析下在mysql收到與自己的serverid一致的binlog事件的時候的執行邏輯。

3.2.1 io thread 處理與本例項serverid一致的binlog

首先來看下io thread 接收到與本身serverid一致的binlog的時候所做的操作。在handle_slave_io函式中,會在while迴圈中不斷的呼叫queue_even函式。

 5740     while (!io_slave_killed(thd,mi))
 5741     {
 5742       ulong event_len;
           ... ...
 5750       event_len= read_event(mysql, mi, &suppress_warnings);
          
           ... ...
             
 5813       /* XXX: 'synced' should be updated by queue_event to indicate
 5814          whether event has been synced to disk */
 5815       bool synced= 0;
 5816       if (queue_event(mi, event_buf, event_len))
 5817       {
 5818         mi->report(ERROR_LEVEL, ER_SLAVE_RELAY_LOG_WRITE_FAILURE,
 5819                    ER(ER_SLAVE_RELAY_LOG_WRITE_FAILURE),
 5820                    "could not queue event from master");
 5821         goto err;
 5822       }   
           ... ...
          }

在queue_even函式中,對於收到與自己serverid一致的binlog的處理邏輯如下:

 8534   /*
 8535      If this event is originating from this server, don't queue it.
 8536      We don't check this for 3.23 events because it's simpler like this; 3.23
 8537      will be filtered anyway by the SQL slave thread which also tests the
 8538      server id (we must also keep this test in the SQL thread, in case somebody
 8539      upgrades a 4.0 slave which has a not-filtered relay log).
 8540
 8541      ANY event coming from ourselves can be ignored: it is obvious for queries;
 8542      for STOP_EVENT/ROTATE_EVENT/START_EVENT: these cannot come from ourselves
 8543      (--log-slave-updates would not log that) unless this slave is also its
 8544      direct master (an unsupported, useless setup!).
 8545   */
   
        ... ...
   
 8560   if ((s_id == ::server_id && !mi->rli->replicate_same_server_id) ||
 8561       /*
 8562         the following conjunction deals with IGNORE_SERVER_IDS, if set
 8563         If the master is on the ignore list, execution of
 8564         format description log events and rotate events is necessary.
 8565       */
 8566       (mi->ignore_server_ids->dynamic_ids.size() > 0 &&
 8567        mi->shall_ignore_server_id(s_id) &&
 8568        /* everything is filtered out from non-master */
 8569        (s_id != mi->master_id ||
 8570         /* for the master meta information is necessary */
 8571         (event_type != binary_log::FORMAT_DESCRIPTION_EVENT &&
 8572          event_type != binary_log::ROTATE_EVENT))))
 8573   {
 8574     /*
 8575       Do not write it to the relay log.
 8576       a) We still want to increment mi->get_master_log_pos(), so that we won't
 8577       re-read this event from the master if the slave IO thread is now
 8578       stopped/restarted (more efficient if the events we are ignoring are big
 8579       LOAD DATA INFILE).
 8580       b) We want to record that we are skipping events, for the information of
 8581       the slave SQL thread, otherwise that thread may let
 8582       rli->group_relay_log_pos stay too small if the last binlog's event is
 8583       ignored.
 8584       But events which were generated by this slave and which do not exist in
 8585       the master's binlog (i.e. Format_desc, Rotate & Stop) should not increment
 8586       mi->get_master_log_pos().
 8587       If the event is originated remotely and is being filtered out by
 8588       IGNORE_SERVER_IDS it increments mi->get_master_log_pos()
 8589       as well as rli->group_relay_log_pos.
 8590     */
 8591     if (!(s_id == ::server_id && !mi->rli->replicate_same_server_id) ||
 8592         (event_type != binary_log::FORMAT_DESCRIPTION_EVENT &&
 8593          event_type != binary_log::ROTATE_EVENT &&
 8594          event_type != binary_log::STOP_EVENT))
 8595     {
 8596       mi->set_master_log_pos(mi->get_master_log_pos() + inc_pos);
 8597       memcpy(rli->ign_master_log_name_end, mi->get_master_log_name(), FN_REFLEN);
 8598       DBUG_ASSERT(rli->ign_master_log_name_end[0]);
 8599       rli->ign_master_log_pos_end= mi->get_master_log_pos();
 8600     }
 8601     rli->relay_log.signal_update(); // the slave SQL thread needs to re-check
 8602     DBUG_PRINT("info", ("master_log_pos: %lu, event originating from %u server, ignored",
 8603                         (ulong) mi->get_master_log_pos(), uint4korr(buf + SERVER_ID_OFFSET)));
 8604   }

結合註釋,可以看出當mysql收到與本例項serverid一致的binlog的時候,不會將當前的binlog事件寫入relaylog中。同時會完成如下的2個事情:(ps :這是一個關鍵的地方,在下文中將提到它的作用)

  1. 將mi->get_master_log_name() 拷貝到rli->ign_master_log_name_end中。
  2. rli->relay_log.signal_update()。此處會新signal_cnt這個變數的值。

在mysqld.trace 中可以觀察到如下的日誌資訊:

queue_event: info: master_log_pos: 219, event originating from 236 server, ignored
queue_event: info: error: 0

3.2.2 sql thread 處理與本例項serverid一致的binlog

接下里看看sql程序對binlog的處理過程。入口的函式還是exec_relay_log_event,在exec_relay_log_event函式中會呼叫next_event函式獲取一個可執行的binlog事件。在這裡next_event對於非並行複製會有一個特殊的處理:

 9175         if (!rli->is_parallel_exec())
 9176           rli->last_master_timestamp= 0;

如果是非並行複製,則當讀取一個binlog的時候,都會把last_master_timestamp設定成0 。因此在非並行複製下,收到與本例項serverid一致的binlog的時候,mysqld.trace中可以觀察到rli->last_master_timestamp的值會一直為0:

next_event: info: seeing an ignored end segment
handle_slave_io: info: IO thread received event of type Query
exec_relay_log_event: info: ================================before rli->last_master_timestamp = 0
exec_relay_log_event: info: ================================before rli->is_parallel_exec() = 0
apply_event_and_update_pos: info: thd->options: ; rli->last_event_start_time: 0
Log_event::shall_skip: info: ev->server_id=0, ::server_id=236, rli->replicate_same_server_id=0, rli->slave_skip_counter=0
Log_event::shall_skip: info: skip reason=0=NOT
LOG_EVENT:apply_event: info: event_type=Rotate

... ...
... ...

exec_relay_log_event: info: ================================before rli->last_master_timestamp = 0
exec_relay_log_event: info: ================================before rli->is_parallel_exec() = 0
apply_event_and_update_pos: info: thd->options: ; rli->last_event_start_time: 0
Log_event::shall_skip: info: ev->server_id=0, ::server_id=236, rli->replicate_same_server_id=0, rli->slave_skip_counter=0
Log_event::shall_skip: info: skip reason=0=NOT
LOG_EVENT:apply_event: info: event_type=Rotate
apply_event_and_update_pos: info: apply_event error = 0
apply_event_and_update_pos: info: OPTION_BEGIN: 0; IN_STMT: 0
Rotate_log_event::do_update_pos: info: server_id=0; ::server_id=236
Rotate_log_event::do_update_pos: info: new_log_ident: dp-bin.000069
Rotate_log_event::do_update_pos: info: pos: 282
MYSQL_BIN_LOG::signal_update: info: signal_cnt : 13
queue_event: info: master_log_pos: 326, event originating from 236 server, ignored
queue_event: info: error: 0

之後的邏輯中,如果發現rli->ign_master_log_name_end[0]不為空(對應了上一個小節中的第一項:將mi->get_master_log_name() 拷貝到rli->ign_master_log_name_end中),則構造一個serverid為0的Rotate event並返回:

 9181         if (rli->ign_master_log_name_end[0])
 9182         {
 9183           /* We generate and return a Rotate, to make our positions advance */
 9184           DBUG_PRINT("info",("seeing an ignored end segment"));
 9185           ev= new Rotate_log_event(rli->ign_master_log_name_end,
 9186                                    0, rli->ign_master_log_pos_end,
 9187                                    Rotate_log_event::DUP_NAME);
 9188           rli->ign_master_log_name_end[0]= 0;
 9189           mysql_mutex_unlock(log_lock);
 9190           if (unlikely(!ev))
 9191           {
 9192             errmsg= "Slave SQL thread failed to create a Rotate event "
 9193               "(out of memory?), SHOW SLAVE STATUS may be inaccurate";
 9194             goto err;
 9195           }
 9196           ev->server_id= 0; // don't be ignored by slave SQL thread
 9197           DBUG_RETURN(ev);
 9198         }

緊接著呼叫 apply_event_and_update_pos函式。在apply_event_and_update_pos函式中,如前所述如果是ROTATE EVENT 則返回0,不會被work程序並行執行,並進入update_pos邏輯中。在update_pos邏輯中關鍵的程式碼如下(sql/log_event.cc 檔案):

 6649     if (rli->is_parallel_exec())
 6650     {
 6651       bool real_event= server_id && !is_artificial_event();
 6652       rli->reset_notified_checkpoint(0,
 6653                            real_event ?
 6654                            common_header->when.tv_sec +
 6655                            (time_t) exec_time : 0,
 6656                            true/*need_data_lock=true*/,
 6657                            real_event? true : false);
 6658     }

可以見得real_event在server_id是0 的時候為false。因此當進入reset_notified_checkpoint函式後,因為update_timestamp條件(傳入的值便是real_event的值)為false便不會更新last_master_timestamp。

 262 /**
 263    This method is called in mts_checkpoint_routine() to mark that each
 264    worker is required to adapt to a new checkpoint data whose coordinates
 265    are passed to it through GAQ index.
 266
 267    Worker notices the new checkpoint value at the group commit to reset
 268    the current bitmap and starts using the clean bitmap indexed from zero
 269    of being reset checkpoint_seqno.
 270
 271     New seconds_behind_master timestamp is installed.
 272
 273    @param shift            number of bits to shift by Worker due to the
 274                            current checkpoint change.
 275    @param new_ts           new seconds_behind_master timestamp value
 276                            unless zero. Zero could be due to FD event
 277                            or fake rotate event.
 278    @param need_data_lock   False if caller has locked @c data_lock
 279    @param update_timestamp if true, this function will update the
 280                            rli->last_master_timestamp.
 281 */
 282 void Relay_log_info::reset_notified_checkpoint(ulong shift, time_t new_ts,
 283                                                bool need_data_lock,
 284                                                bool update_timestamp)
 285 {
 
       ... ...
       
 330   if (update_timestamp)
 331   {
 332     if (need_data_lock)
 333       mysql_mutex_lock(&data_lock);
 334     else
 335       mysql_mutex_assert_owner(&data_lock);
 336     last_master_timestamp= new_ts;
 337     if (need_data_lock)
 338       mysql_mutex_unlock(&data_lock);
 339   }
 340 }

3.3 小結

從上文的分析可以得出如下的2個結論 :

  1. 並行複製下會發現last_master_timestamp的值,一直是ROTATE EVENT發生的時間。當Exec_Master_Log_Pos < Read_Master_Log_Pos的時候,便會發現Seconds_Behind_Master的值突變(突變的值是當前時間和ROTATE EVENT發生的時間差的秒數)
  2. 在非並行複製下,因為 rli->last_master_timestamp會在next_event被置為0。同時之後的程式碼也不會去更新這個引數的值,因此rli->last_master_timestamp一直為0 。show slave status的時候,當Exec_Master_Log_Pos < Read_Master_Log_Pos的時候,Seconds_Behind_Master的值不會突變。

4. 總結

本文分析了並行複製和非並行複製下,Seconds_Behind_Master引數值更新的相關邏輯。限於本文的作者水平有限,文中的錯誤在所難免,懇請大家批評指