
(1) S3 connect reset by peer

Problem Description && Initial Analysis

Problem description

s3cmd -q get s3://zh_test/data_30.tar - | tar -xvf - -C ./

When the command above is used to get a tar archive from S3 and extract it on the fly into a volume on the local ECFS distributed storage, an error occurs and the get is interrupted.

The problem reproduces very frequently.

s3cmd log


#s3cmd -q get s3://zh_test/data_30.tar - | tar -xvf - -C ./

data_30G



!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    An unexpected error has occurred.

  Please try reproducing the error using

  the latest s3cmd code from the git master

  branch found at:

    https://github.com/s3tools/s3cmd

  and have a look at the known issues list:

    https://github.com/s3tools/s3cmd/wiki/Common-known-issues-and-their-solutions

  If the error persists, please report the

  following lines (removing any private

  info as necessary) to:

   
[email protected]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Invoked as: /usr/bin/s3cmd -q get s3://zh_test/data_30.tar -
Problem: error: [Errno 104] Connection reset by peer
S3cmd:  2.0.2
python: 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
environment LANG=en_US.UTF-8

Traceback (most recent call last):
  File "/usr/bin/s3cmd", line 3092, in <module>
    rc = main()
  File "/usr/bin/s3cmd", line 3001, in main
    rc = cmd_func(args)
  File "/usr/bin/s3cmd", line 590, in cmd_object_get
    response = s3.object_get(uri, dst_stream, destination, start_position = start_position, extra_label = seq_label)
  File "/usr/lib/python2.7/site-packages/S3/S3.py", line 716, in object_get
    response = self.recv_file(request, stream, labels, start_position)
  File "/usr/lib/python2.7/site-packages/S3/S3.py", line 1742, in recv_file
    data = http_response.read(this_chunk)
  File "/usr/lib64/python2.7/httplib.py", line 602, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
error: [Errno 104] Connection reset by peer
  • Key point: Problem: error: [Errno 104] Connection reset by peer
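Errno 104 is ECONNRESET: the peer sent a TCP RST while the client was mid-read. The client-side symptom can be reproduced in isolation (this sketch is unrelated to s3cmd itself; SO_LINGER with a zero timeout forces an RST on close):

```python
import errno
import socket
import struct
import threading
import time

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

def rst_close():
    conn, _ = srv.accept()
    # SO_LINGER with a zero timeout makes close() send RST instead of FIN
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()

t = threading.Thread(target=rst_close)
t.start()

cli = socket.socket()
cli.connect(srv.getsockname())
t.join()
time.sleep(0.1)            # let the RST reach the client side

got_econnreset = False
try:
    cli.recv(1024)         # reading after an RST fails with errno 104
except ConnectionResetError as e:
    got_econnreset = (e.errno == errno.ECONNRESET)
print(got_econnreset)  # True
```

This is exactly the exception path seen in the s3cmd traceback: the failure surfaces inside `self._sock.recv()`.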

RGW log


2018-09-03 17:39:15.494376 7fbcdb8c3700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

2018-09-03 17:39:15.494449 7fbcdb8c3700 1 ====== req done req=0x7fbcdb8bd110 op status=-5 http_status=200 ======

2018-09-03 17:39:15.494533 7fbcdb8c3700 1 civetweb: 0x7fbd29634000: 10.10.40.2 - - [03/Sep/2018:17:39:15 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -

2018-09-03 17:41:08.154381 7fbcdb0c2700 1 ====== starting new request req=0x7fbcdb0bc110 =====

2018-09-03 17:41:08.182134 7fbcdb0c2700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

2018-09-03 17:41:08.182225 7fbcdb0c2700 1 ====== req done req=0x7fbcdb0bc110 op status=-5 http_status=200 ======

2018-09-03 17:41:08.182306 7fbcdb0c2700 1 civetweb: 0x7fbd2963e000: 10.10.40.2 - - [03/Sep/2018:17:41:08 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -

2018-09-03 17:41:15.372568 7fbcda0c0700 1 ====== starting new request req=0x7fbcda0ba110 =====

2018-09-03 17:41:15.398971 7fbcda0c0700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

2018-09-03 17:41:15.399043 7fbcda0c0700 1 ====== req done req=0x7fbcda0ba110 op status=-5 http_status=200 ======

2018-09-03 17:41:15.399127 7fbcda0c0700 1 civetweb: 0x7fbd29643000: 10.10.40.2 - - [03/Sep/2018:17:41:15 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -

2018-09-03 17:41:18.954849 7fbcda8c1700 1 ====== starting new request req=0x7fbcda8bb110 =====

2018-09-03 17:41:18.981131 7fbcda8c1700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

2018-09-03 17:41:18.981216 7fbcda8c1700 1 ====== req done req=0x7fbcda8bb110 op status=-5 http_status=200 ======

2018-09-03 17:41:18.981333 7fbcda8c1700 1 civetweb: 0x7fbd29639000: 10.10.40.2 - - [03/Sep/2018:17:41:18 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -

2018-09-03 17:42:43.691370 7fbcd98bf700 1 ====== starting new request req=0x7fbcd98b9110 =====

2018-09-03 17:42:43.734738 7fbcd98bf700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

2018-09-03 17:42:43.734827 7fbcd98bf700 1 ====== req done req=0x7fbcd98b9110 op status=-5 http_status=200 ======

2018-09-03 17:42:43.734936 7fbcd98bf700 1 civetweb: 0x7fbd29648000: 10.10.40.2 - - [03/Sep/2018:17:42:43 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -

2018-09-03 18:59:45.404547 7fbcd90be700 1 ====== starting new request req=0x7fbcd90b8110 =====

2018-09-03 18:59:45.404854 7fbcd90be700 1 ====== req done req=0x7fbcd90b8110 op status=0 http_status=403 ======

2018-09-03 18:59:45.404912 7fbcd90be700 1 civetweb: 0x7fbd2964d000: 10.10.20.192 - - [03/Sep/2018:18:59:45 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 403 0 - -

2018-09-03 19:00:40.854024 7fbcd80bc700 1 ====== starting new request req=0x7fbcd80b6110 =====

  • Key point: 2018-09-03 17:41:15.398971 7fbcda0c0700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

flush_read_list() call stack

  • flush_read_list()

    • d->client_cb->handle_data() —> RGWGetObj_ObjStore_S3::send_response_data()

      • Network transmission: send()
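The -5 in the RGW log is a negated errno value: 5 is EIO, the generic I/O error that flush_read_list() propagates when the frontend's send() to the client fails. A quick lookup using only the standard library:

```python
import errno
import os

# Kernel-style return codes are negative errno values; -5 means -EIO.
print(errno.errorcode[5])  # 'EIO'
print(os.strerror(5))      # 'Input/output error' on Linux
```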

The information above shows the error is network-related, so next we analyze a packet capture:

tcpdump packet capture analysis

  • When getting from the S3 server, the traffic arrives in bursts, with idle windows of roughly 10 s in between

  • The two screenshots above both show the packet exchange during an idle window of the download
  • As they show, the main reason the download pauses is that the receiver's socket recv buffer is full, causing the TCP layer to send TCP ZeroWindow packets
  • During this period the sender periodically sends window probe packets, asking whether the window has recovered
  • Once the receiver's recv buffer frees up, the receiver sends a TCP Window Update packet and data transfer resumes
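Zero-window backpressure is straightforward to reproduce: shrink the sender's socket buffer and never read on the receiver; once both kernel buffers fill, send() can make no more progress. A sketch over a loopback TCP pair (the 4096-byte buffer size is illustrative, not taken from the original environment):

```python
import socket

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.setblocking(False)
sent = 0
stalled = False
try:
    while True:               # the receiver never calls recv() ...
        sent += cli.send(b"x" * 4096)
except BlockingIOError:       # ... so both buffers fill and send() stalls
    stalled = True
print(sent > 0, stalled)  # True True
```

With a blocking socket the same condition simply blocks the sender, which is what RGW's frontend experiences while the client's window is zero.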

At the moment the problem occurs:

  • From the capture above, the TCP connection was ultimately reset by RGW

Problem analysis

  • Why the S3 download traffic is intermittent:
    • s3cmd -q get s3://zh_test/data_30.tar - | tar -xvf - -C ./
    • With this command, the downloaded data is first written entirely into the VFS cache
    • Once the VFS cache reaches a certain threshold, it is flushed to the backing disks
    • While that flush is in progress, tar cannot keep writing, so the socket recv buffer fills up and ZeroWindow packets are triggered
  • Why the TCP connection is reset:
    • Presumably the full receive buffer caused the sender's transmission to fail
    • But why the transfer sometimes resumes and sometimes ends in a reset is still unclear; left for later analysis
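The stall chain described above (flush in progress → tar blocked → upstream producer blocked) is ordinary pipe backpressure, which a plain kernel pipe demonstrates; the 0.5 s "busy" period below stands in for a VFS flush, and all sizes are illustrative:

```python
import os
import threading
import time

r, w = os.pipe()          # kernel pipe buffer is finite (typically 64 KiB)
blocked_for = []

def producer():
    t0 = time.monotonic()
    for _ in range(1024):
        os.write(w, b"x" * 4096)   # blocks once the pipe buffer is full
    blocked_for.append(time.monotonic() - t0)
    os.close(w)

t = threading.Thread(target=producer)
t.start()
time.sleep(0.5)                    # consumer is "busy", like a VFS flush
while os.read(r, 65536):           # now drain the pipe until EOF
    pass
t.join()
print(blocked_for[0] > 0.3)  # True: the writer stalled while we slept
```

In the real pipeline the producer is s3cmd writing to stdout; when it blocks, it stops draining its socket, and the TCP window closes.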

Solution

In our case this runs inside xtrabackup, where we use xbstream instead of tar to pack the backup:

  • xbstream currently writes data with O_DIRECT, bypassing the Linux buffer/cache
  • As a result the S3 traffic is continuous rather than intermittent
  • And the TCP buffer no longer stays full for stretches of time
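O_DIRECT imposes alignment requirements that buffered I/O does not: the user buffer, file offset, and transfer size must all be aligned to the device's logical block size. A minimal sketch (the path and the 4096-byte block size are assumptions; real code should query the actual block size, and not every filesystem supports O_DIRECT, hence the fallback):

```python
import mmap
import os

path = "/tmp/odirect_demo.bin"   # hypothetical demo path
blk = 4096                       # assumed logical block size

# O_DIRECT needs a block-aligned buffer; an anonymous mmap is page-aligned.
buf = mmap.mmap(-1, blk)
buf.write(b"x" * blk)

flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
try:
    # os.O_DIRECT is Linux-specific; fall back to buffered I/O elsewhere
    fd = os.open(path, flags | getattr(os, "O_DIRECT", 0))
except OSError:
    fd = os.open(path, flags)    # filesystem without O_DIRECT support
os.pwrite(fd, buf, 0)            # aligned buffer, offset 0, one full block
os.close(fd)
print(os.path.getsize(path) == blk)  # True
```

Because O_DIRECT writes go straight to the block layer, no page-cache buildup occurs and there is no periodic bulk flush to stall the consumer, which matches the "continuous traffic" observation above.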

PS: For a comparison test of S3 uploads and downloads with tar vs. xbstream, see a separate article.

Open issue

Root cause of the TCP connection reset: still unknown.