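Using keepalive and timeout to detect dead connections
Here is how the problem showed up.
Steps: while the client is requesting data from the server, unplug the client's network cable.
Symptom: the client blocks forever, and the server-side socket never goes away.
A web search suggested setting the KEEPALIVE option.
So I set the KEEPALIVE option on both the client and the server.
The code is as follows: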
int keepalive = 1;     // enable keepalive
int keepidle = 10;     // start sending probes after 10 s of idle time (system default: 2 hours)
int keepinterval = 1;  // interval between probes (system default: 75 s)
int keepcount = 5;     // number of probes; if all 5 go unanswered, the peer is considered gone (system default: 9)
setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &keepalive, sizeof(keepalive));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &keepidle, sizeof(keepidle));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &keepinterval, sizeof(keepinterval));
setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &keepcount, sizeof(keepcount));
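For reference, the same settings can be wrapped in a small helper that checks every return value. This is only a sketch: enable_keepalive and keepalive_detect_seconds are names made up for illustration, and TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are assumed to be available as they are on Linux. The second function gives the rough time needed to declare an idle peer dead: the idle time plus one interval per unanswered probe.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable keepalive probing on a TCP socket.
 * Returns 0 on success, -1 if any setsockopt call fails (errno is set). */
int enable_keepalive(int fd, int idle_s, int interval_s, int count)
{
    int on = 1;

    assert(idle_s > 0 && interval_s > 0 && count > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
        return -1;
    return 0;
}

/* Rough estimate, in seconds, of how long an idle connection to a
 * vanished peer survives: the idle period, then `count` probes spaced
 * `interval_s` apart, all of which go unanswered. */
int keepalive_detect_seconds(int idle_s, int interval_s, int count)
{
    return idle_s + interval_s * count;
}
```

With the values used in this post, keepalive_detect_seconds(10, 1, 5) gives 15, so an idle dead connection should be noticed after roughly 15 seconds.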
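With this in place the client was fine and could close the connection on its own, but the server was still stuck waiting. In other words, keepalive had no effect there.
I never actually tracked down the cause at that point. As an aside, Baidu search really is not much use (and Google happens to be blocked, and the company will not pay for a VPN, which is a little sad).
Later, through a Google IP that had not been blocked, I found this option: TCP_USER_TIMEOUT (since Linux 2.6.37).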
This option takes an unsigned int as an argument. When the
value is greater than 0, it specifies the maximum amount of
time in milliseconds that transmitted data may remain
unacknowledged before TCP will forcibly close the
corresponding connection and return ETIMEDOUT to the
application. If the option value is specified as 0, TCP will
use the system default.
Increasing user timeouts allows a TCP connection to survive
extended periods without end-to-end connectivity. Decreasing
user timeouts allows applications to "fail fast", if so
desired. Otherwise, failure may take up to 20 minutes with
the current system defaults in a normal WAN environment.
This option can be set during any state of a TCP connection,
but is only effective during the synchronized states of a
connection (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT,
CLOSING, and LAST-ACK). Moreover, when used with the TCP
keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will
override keepalive to determine when to close a connection due
to keepalive failure.
The option has no effect on when TCP retransmits a packet, nor
when a keepalive probe is sent.
This option, like many others, will be inherited by the socket
returned by accept(2), if it was set on the listening socket.
Further details on the user timeout feature can be found in
RFC 793 and RFC 5482 ("TCP User Timeout Option").
So we added the TCP_USER_TIMEOUT option on the server, and the problem was solved.
unsigned int timeout = 10000; // 10s
setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &timeout, sizeof(timeout));
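A later search turned up the following article, which confirms this.
Using TCP Keep-Alive and TCP_USER_TIMEOUT to determine whether the peer is alive
First problem:
When the peer's network cable is unplugged, or its network adapter is removed or disabled, the peer has no chance to send a TCP RST or FIN to the local operating system to close the connection. The local operating system therefore does not consider the peer dead, so a call to send still returns the number of bytes we asked to send. Since the return value of send cannot tell us whether the peer is alive, we need the TCP keepalive mechanism.
UNIX Network Programming, Volume 1 describes using the SO_KEEPALIVE socket option to enable the keepalive mechanism on a socket.
Once the keepalive option is set on a TCP socket, if no data has been exchanged in either direction on that socket for 2 hours, TCP automatically sends the peer a keepalive probe.
TCP provides this mechanism to help us determine whether the peer is alive. If the peer does not respond properly to the keepalive probes, the next send or recv on the socket fails, and the application can detect the failure.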
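On the application side, detecting that failure comes down to looking at the return value of recv and at errno. The sketch below is one way to sort the outcomes; the enum and the function classify_recv are invented for illustration, and it assumes the caller asked recv for at least one byte and saved errno right after the call.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

enum peer_status {
    PEER_OK,        /* data arrived */
    PEER_CLOSED,    /* orderly shutdown: the peer sent FIN */
    PEER_RETRY,     /* nothing wrong, call recv again later */
    PEER_TIMED_OUT, /* keepalive or user timeout gave up on the peer */
    PEER_RESET,     /* the peer answered with RST */
    PEER_ERROR      /* any other failure */
};

/* n is the value returned by recv, err is errno captured right after
 * the call. err is only examined when n is negative. */
enum peer_status classify_recv(ssize_t n, int err)
{
    assert(n >= 0 || err != 0);

    if (n > 0)
        return PEER_OK;
    if (n == 0)
        return PEER_CLOSED;
    if (err == EAGAIN || err == EWOULDBLOCK || err == EINTR)
        return PEER_RETRY;
    if (err == ETIMEDOUT)
        return PEER_TIMED_OUT;
    if (err == ECONNRESET)
        return PEER_RESET;
    return PEER_ERROR;
}
```

A dead peer detected by keepalive probes would be expected to show up as PEER_TIMED_OUT, at which point the application can close the socket.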
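Second problem:
If the sender has transmitted data that the receiver has not yet acknowledged, the TCP keepalive mechanism is never started; TCP starts its retransmission timer instead. Keepalive is therefore ineffective whenever there is unacknowledged data outstanding.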
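To get a feel for how long retransmission alone can keep a dead connection around, here is a back-of-the-envelope calculation. It is a sketch under stated assumptions, not a model of the real kernel: the function name is made up, and the inputs (15 retries, a 200 ms initial retransmission timeout that doubles each time, capped at 120 s) are what I believe the usual Linux defaults to be. The real timeout depends on the measured round-trip time.

```c
#include <assert.h>

/* Hypothetical estimate of the time, in milliseconds, before TCP gives
 * up on unacknowledged data: the timeout starts at rto_min_ms, doubles
 * after every retransmission, and is capped at rto_max_ms. There are
 * `retries` retransmissions plus one final wait, hence retries + 1
 * intervals in total. */
long retransmit_give_up_ms(int retries, long rto_min_ms, long rto_max_ms)
{
    long rto = rto_min_ms;
    long total = 0;

    assert(retries >= 0 && rto_min_ms > 0 && rto_max_ms >= rto_min_ms);

    for (int i = 0; i <= retries; i++) {
        total += rto;
        rto = (rto * 2 > rto_max_ms) ? rto_max_ms : rto * 2;
    }
    return total;
}
```

Under those assumptions, retransmit_give_up_ms(15, 200, 120000) comes to 924600 ms, a little over 15 minutes. That would explain why a server with data still in flight appears to wait forever, and why a 10 s TCP_USER_TIMEOUT makes such a difference.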