(1) S3: connection reset by peer
阿新 · Published: 2018-12-10
Problem Description && Initial Analysis
Problem description
s3cmd -q get s3://zh_test/data_30.tar - | tar -xvf - -C ./
When the command above is used to get a tar archive from S3 and extract it on the fly into a volume on ECFS distributed storage, an error occurs and the get is interrupted.
The problem reproduces at a very high rate.
s3cmd log
```
# s3cmd -q get s3://zh_test/data_30.tar - | tar -xvf - -C ./
data_30G
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
An unexpected error has occurred.
Please try reproducing the error using the latest s3cmd code from the git master
branch found at:
    https://github.com/s3tools/s3cmd
and have a look at the known issues list:
    https://github.com/s3tools/s3cmd/wiki/Common-known-issues-and-their-solutions
If the error persists, please report the following lines (removing any private
info as necessary) to:
    [email protected]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Invoked as: /usr/bin/s3cmd -q get s3://zh_test/data_30.tar -
Problem: error: [Errno 104] Connection reset by peer
S3cmd:  2.0.2
python: 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]
environment LANG=en_US.UTF-8

Traceback (most recent call last):
  File "/usr/bin/s3cmd", line 3092, in <module>
    rc = main()
  File "/usr/bin/s3cmd", line 3001, in main
    rc = cmd_func(args)
  File "/usr/bin/s3cmd", line 590, in cmd_object_get
    response = s3.object_get(uri, dst_stream, destination, start_position = start_position, extra_label = seq_label)
  File "/usr/lib/python2.7/site-packages/S3/S3.py", line 716, in object_get
    response = self.recv_file(request, stream, labels, start_position)
  File "/usr/lib/python2.7/site-packages/S3/S3.py", line 1742, in recv_file
    data = http_response.read(this_chunk)
  File "/usr/lib64/python2.7/httplib.py", line 602, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
error: [Errno 104] Connection reset by peer
```
- Key point:
Problem: error: [Errno 104] Connection reset by peer
rgw log
```
2018-09-03 17:39:15.494376 7fbcdb8c3700  0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
2018-09-03 17:39:15.494449 7fbcdb8c3700  1 ====== req done req=0x7fbcdb8bd110 op status=-5 http_status=200 ======
2018-09-03 17:39:15.494533 7fbcdb8c3700  1 civetweb: 0x7fbd29634000: 10.10.40.2 - - [03/Sep/2018:17:39:15 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -
2018-09-03 17:41:08.154381 7fbcdb0c2700  1 ====== starting new request req=0x7fbcdb0bc110 =====
2018-09-03 17:41:08.182134 7fbcdb0c2700  0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
2018-09-03 17:41:08.182225 7fbcdb0c2700  1 ====== req done req=0x7fbcdb0bc110 op status=-5 http_status=200 ======
2018-09-03 17:41:08.182306 7fbcdb0c2700  1 civetweb: 0x7fbd2963e000: 10.10.40.2 - - [03/Sep/2018:17:41:08 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -
2018-09-03 17:41:15.372568 7fbcda0c0700  1 ====== starting new request req=0x7fbcda0ba110 =====
2018-09-03 17:41:15.398971 7fbcda0c0700  0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
2018-09-03 17:41:15.399043 7fbcda0c0700  1 ====== req done req=0x7fbcda0ba110 op status=-5 http_status=200 ======
2018-09-03 17:41:15.399127 7fbcda0c0700  1 civetweb: 0x7fbd29643000: 10.10.40.2 - - [03/Sep/2018:17:41:15 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -
2018-09-03 17:41:18.954849 7fbcda8c1700  1 ====== starting new request req=0x7fbcda8bb110 =====
2018-09-03 17:41:18.981131 7fbcda8c1700  0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
2018-09-03 17:41:18.981216 7fbcda8c1700  1 ====== req done req=0x7fbcda8bb110 op status=-5 http_status=200 ======
2018-09-03 17:41:18.981333 7fbcda8c1700  1 civetweb: 0x7fbd29639000: 10.10.40.2 - - [03/Sep/2018:17:41:18 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -
2018-09-03 17:42:43.691370 7fbcd98bf700  1 ====== starting new request req=0x7fbcd98b9110 =====
2018-09-03 17:42:43.734738 7fbcd98bf700  0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
2018-09-03 17:42:43.734827 7fbcd98bf700  1 ====== req done req=0x7fbcd98b9110 op status=-5 http_status=200 ======
2018-09-03 17:42:43.734936 7fbcd98bf700  1 civetweb: 0x7fbd29648000: 10.10.40.2 - - [03/Sep/2018:17:42:43 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 200 0 - -
2018-09-03 18:59:45.404547 7fbcd90be700  1 ====== starting new request req=0x7fbcd90b8110 =====
2018-09-03 18:59:45.404854 7fbcd90be700  1 ====== req done req=0x7fbcd90b8110 op status=0 http_status=403 ======
2018-09-03 18:59:45.404912 7fbcd90be700  1 civetweb: 0x7fbd2964d000: 10.10.20.192 - - [03/Sep/2018:18:59:45 +0800] "GET /zh_test/data_30.tar HTTP/1.1" 403 0 - -
2018-09-03 19:00:40.854024 7fbcd80bc700  1 ====== starting new request req=0x7fbcd80b6110 =====
```
- Key point:
2018-09-03 17:41:15.398971 7fbcda0c0700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
flush_read_list() call stack
- flush_read_list()
- d->client_cb->handle_data() -> RGWGetObj_ObjStore_S3::send_response_data()
- network send: send()
From the information above, the error is clearly network-related, so the next step is packet capture analysis:
tcpdump packet capture analysis
- During an S3 get from the server, the download traffic arrives in bursts, with idle windows of roughly 10 s in between
- The two screenshots above both show the packet exchange during such an idle window
- The captures show the main reason the download pauses: the receiver's socket recv buffer fills up, so at the TCP level the receiver sends TCP ZeroWindow segments
- While this lasts, the sender periodically sends window probe segments, asking whether the window has reopened
- Once the receiver's recv buffer has free space again, the receiver sends a TCP Window Update segment and data transfer resumes
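The zero-window stall just described can be reproduced in miniature on loopback (a hypothetical local demo, not part of the original capture): shrink both socket buffers, let the receiver stop reading, and the sender quickly hits the closed window.

```python
import socket

# Sketch of TCP receive-window backpressure (illustrative only).
# The receiving socket deliberately never calls recv(), so its buffer fills,
# TCP advertises a zero window, and the sender's non-blocking send() starts
# failing with BlockingIOError once its own send buffer is full as well.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

receiver = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Keep buffers small so the demo stalls after a few hundred KB at most.
receiver.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
receiver.connect(listener.getsockname())
sender, _ = listener.accept()
sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
sender.setblocking(False)

sent = 0
try:
    while True:
        sent += sender.send(b"x" * 4096)
except BlockingIOError:
    # The window is closed: no more progress is possible until the receiver
    # drains its buffer (tar resuming its writes, in the scenario above).
    pass

print("bytes queued before the window closed:", sent)
sender.close()
receiver.close()
listener.close()
```

With blocking sockets (as in s3cmd and rgw), the same condition shows up as a send() that simply stalls until the peer's Window Update arrives.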
When the problem occurs:
- The capture above shows that, in the end, the TCP connection is reset by rgw
Problem Analysis
- Why the S3 download traffic is intermittent:
s3cmd -q get s3://zh_test/data_30.tar - | tar -xvf - -C ./
- When downloading with the command above, the data is first written entirely into the VFS cache
- Once the VFS cache grows past a threshold, it is flushed to the backend disks
- While that flush runs, tar cannot keep writing, so the socket recv buffer fills up, which in turn triggers the ZeroWindow segments
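The stall chain above hinges on the fixed capacity of the pipe between `s3cmd` and `tar`: once tar blocks on a disk flush, s3cmd can queue only one pipe's worth of data before it too stops reading from its socket. A small sketch of that limit (hypothetical demo, not the original setup):

```python
import os

# Sketch of the pipe backpressure between `s3cmd get ... -` and `tar`:
# a Linux pipe has a fixed capacity (64 KiB by default), so once the reader
# (tar) stalls on a slow flush, the writer (s3cmd) can buffer at most that
# much before its own write blocks.
read_fd, write_fd = os.pipe()
os.set_blocking(write_fd, False)  # so we observe the stall instead of hanging

queued = 0
try:
    while True:
        queued += os.write(write_fd, b"x" * 4096)
except BlockingIOError:
    # Pipe is full; with blocking I/O, s3cmd would now stop reading from its
    # socket, and the socket recv buffer would begin filling up.
    pass

print("pipe capacity observed:", queued)
os.close(read_fd)
os.close(write_fd)
```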
- Why the TCP connection is reset:
- Our guess is that the receiver's buffer filling up causes the sender's send to fail
- But why transmission sometimes resumes and at other times the connection is reset is still unclear; we leave it for follow-up analysis
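One plausible mechanism worth checking in that follow-up (an assumption on our part, not taken from the rgw source): when rgw's send fails and it closes the socket while unread data is still sitting in a receive buffer, TCP emits RST instead of FIN, and the peer's next send fails with ECONNRESET, which is exactly the s3cmd symptom. A minimal sketch:

```python
import socket

# Illustrative sketch: closing a TCP socket that still has unread data in
# its receive buffer makes the kernel send RST rather than FIN, and the
# peer's subsequent send() fails with "Connection reset by peer".
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(listener.getsockname())
server, _ = listener.accept()

client.sendall(b"data the server never reads")
server.close()  # unread data in server's recv buffer -> kernel emits RST

got_reset = False
try:
    for _ in range(100):
        client.sendall(b"x" * 4096)  # early sends may succeed before RST lands
except (ConnectionResetError, BrokenPipeError):
    got_reset = True

print("peer reset observed:", got_reset)
client.close()
listener.close()
```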
Solution
Since we run this as part of xtrabackup, we can use xbstream instead of tar to package the backup:
- xbstream currently writes data with O_DIRECT, bypassing the Linux buffer/cache
- As a result, the S3 traffic becomes continuous rather than intermittent
- and the TCP buffers no longer sit full for long stretches
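For reference, a minimal sketch of the O_DIRECT write style the post attributes to xbstream: the flag bypasses the page cache, but in exchange the buffer, offset, and length must all be aligned (typically to 4096 bytes), which is why the buffer below is allocated with mmap rather than as a plain bytes object. The fallback path is ours, since not every filesystem supports O_DIRECT.

```python
import mmap
import os
import tempfile

# Sketch of an O_DIRECT write (what xbstream does per the analysis above).
path = os.path.join(tempfile.mkdtemp(), "direct.dat")
try:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
except (OSError, AttributeError):
    # Some filesystems (e.g. tmpfs) and OSes lack O_DIRECT; fall back to a
    # buffered write so this sketch still runs everywhere.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)

buf = mmap.mmap(-1, 4096)  # anonymous mapping: page-aligned memory
buf.write(b"A" * 4096)
os.write(fd, buf)          # one aligned 4 KiB block, page cache bypassed
os.close(fd)

print("wrote", os.path.getsize(path), "bytes")
```

Because O_DIRECT data never lands in the VFS cache, there is no deferred bulk flush to stall the writer, which is why the S3 traffic stays smooth.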
PS: For a comparative test of S3 uploads and downloads with tar vs. xbstream, see another article.
Open Questions
The root cause of the TCP connection reset is still unknown.