1. 程式人生 > >Windows Networking 4: CloudMonitor 引發的網路問題排查一則

Windows Networking 4: CloudMonitor 引發的網路問題排查一則

事情起因

近期一個比較有意思的問題,有使用者反饋說他們的伺服器一直在連線阿里雲的服務,把埠都佔滿了,導致網路中斷,業務連線建立不起來,telnet本地埠也不通。

按照產品設計的角度來說,這種情況一般不可能發生,但是使用者的netstat的結果來看,的確顯示大部分的連線是跟阿里雲的伺服器建立的。

image

由於伺服器都是非阿里雲實例,無法直接登陸排查,因此,準備了排查方案。

問題排查 + 分析

  1. 類似的從本地telnet本地埠不通的問題一般是動態埠用光。預設Windows Server 2008 之後的系統使用49152 - 65536範圍共計 16384 個埠。一般情況下是用不完的,如果有用光的情況常常伴隨Port Leak(埠洩露)問題。

    netsh int ipv4 show dyn tcp

    Protocol tcp Dynamic Port Range

    Start Port : 49152
    Number of Ports : 16384

    驗證方案是使用命令 netsh int ipv4 set dynamicport tcp start=10000 num=50000 來開放更多的動態埠使用。

    由於 netsh 直接通過 NSI 介面呼叫操作 Windows Kernel 的 TCPIP 方法來修改系統配置,該命令立時生效。

  2. 若需要確認埠是否有洩露,一般是需要抓dump來分析埠洩露的原因。但這類問題我們其實處理過不少,絕大多數的原因是三方驅動的問題。因此,建議先檢查一下作業系統中是否有三方軟體,如果是的話,嘗試解除安裝後再觀察。

    不過,使用者很快反饋是沒有任何三方軟體。

  3. 考慮到系統中多數動態埠在 TIME_WAIT 狀態,可以嘗試通過 TcpTimedWaitDelay 登錄檔來縮短 TIME_WAIT 的釋放時間。

    https://docs.microsoft.com/en-us/biztalk/technical-guides/settings-that-can-be-modified-to-improve-network-performance

    Key: HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParameters
    Value: TcpTimedWaitDelay
    Data Type: REG_DWORD
    Range: 30-300 (decimal)
    Default value: 0x78 (120 decimal)
    Recommended value: 30

    登錄檔的配置需要重啟才會生效。

  4. 使用者一直對他們伺服器訪問的阿里雲伺服器 IP 存有疑慮,不過大多數是 443 埠的訪問,抓包很難證明確切是哪個應用產生的連線。因此,我們需要 Windows 提供的 ETL Trace 方法來獲取應用程式對 TCPIP 等 Network API 的呼叫資訊。

    netsh trace start provider={2F07E2EE-15DB-40F1-90EF-9D7BA282188A} keywords=0xffffffffffffffff level=0xff provider={E53C6823-7BB8-44BB-90DC-3F86090D48A6} keywords=0xffffffffffffffff level=0xff provider={7D44233D-3055-4B9C-BA64-0D47CA40A232} keywords=0xffffffffffffffff level=0xff provider={50B3E73C-9370-461D-BB9F-26F32D68887D} keywords=0xffffffffffffffff level=0xff provider={43D1A55C-76D6-4F7E-995C-64C711E5CAFE} keywords=0xffffffffffffffff level=0xff maxSize=500MB fileMode=circular persistent=no overwrite=yes report=yes correlation=yes traceFile=c:NetworkTrace.etl capture=yes packettruncatebytes=128

    從收集的日誌看,我們發現請求對應的Process ID = 0x0600 = 1536,

     16767 [3]0600.061C::?2018?-?12?-?24 20:15:22.978 [Microsoft-Windows-TCPIP]TCP: endpoint 0xFFFFFA80C4CF0CF0 (Family=IPV6 , PID=1536) created with status = STATUS_SUCCESS. 
     16768 [3]0600.061C::?2018?-?12?-?24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]socket: 1: Process 0xFFFFFA80C3FB0860 (0x600), Endpoint 0xFFFFFA80C259FE50, Family 0, Type 0, Protocol 0, Seq 1013, Status STATUS_SUCCESS 
     16769 [3]0600.061C::?2018?-?12?-?24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x1, Seq 11002, Status STATUS_SUCCESS 
     16770 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address ::, Seq 7010, Status STATUS_SUCCESS 
     16771 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 acquired port number 10010. 
     16772 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) bound. 
     16773 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::]:10010, Seq 7022, Status STATUS_SUCCESS 
     16774 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]connect: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::ffff:120.55.35.39]:443, Seq 5023, Status STATUS_SUCCESS 
     16775 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosedState  to SynSentState , SndNxt = 0. 
     16776 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 attempted to acquire weak reference on port number 10010 inherited from endpoint 0xFFFFFA80C4CF0CF0. Successful = TRUE . 
     16777 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) requested to connect. 
     16779 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 is going to output SYN with ISN = 2438595996, RcvWnd = 8192, RcvWndScale = 8. 
     16780 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect proceeding. 
     16789 [7]0000.0000::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from SynSentState  to EstablishedState , SndNxt = 2438595997. 
     16790 [7]0000.0000::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1. ThSeq = 2246241665. 
     16791 [7]0000.0000::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect completed. PID = 1536. 
     16793 [3]0600.061C::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]connect: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 5024, Status STATUS_SUCCESS 
     16794 [3]0600.061C::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x0, Seq 11002, Status STATUS_SUCCESS 
     16795 [3]0600.061C::?2018?-?12?-?24 20:15:23.009 [Microsoft-Windows-TCPIP]TCP: Option TCP_OFFLOAD_NOT_PREFERRED  is going to be set for connection 0xFFFFFA80C3956B20. 
     16796 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3047, Status STATUS_SUCCESS 
     16797 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3056, Status STATUS_SUCCESS 
     16798 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 send posted posted 279 bytes at 2438595997. 
     16807 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3051, Status STATUS_SUCCESS 
     16815 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3024, Status STATUS_SUCCESS 
     16818 [7]3ECC.2FC8::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2C56830, Length 1452, Seq 9000 
     16819 [7]3ECC.2FC8::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246241665. 
     16820 [7]3ECC.2FC8::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246241665. 
     16823 [3]0600.061C::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4115, Status STATUS_SUCCESS 
     16824 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2D36F50, Length 1452, Seq 9000 
     16825 [3]0600.061C::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4116, Status STATUS_SUCCESS 
     16826 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246243117. 
     16827 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246243117. 
     ...
     17644 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from EstablishedState  to FinWait1State , SndNxt = 2438604647. 
     17650 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2002, Status STATUS_SUCCESS 
     17651 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) closed. 
     17652 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 released port number 10010. WeakReference = FALSE . 
     17653 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2003, Status STATUS_SUCCESS 
     17654 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2000, Status STATUS_SUCCESS 
     17655 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2001, Status STATUS_SUCCESS 
     17664 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from FinWait1State  to ClosingState , SndNxt = 2438604648. 
     17665 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 delivering FIN. RcvNxt = 2246247426. 
     17666 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-Winsock-AFD]disconnect indicated: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 12001 
     17671 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosingState  to TimeWaitState , SndNxt = 2438604648. 
     17673 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) disconnect completed. 
     17675 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) close issued. 
     17676 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) shutdown initiated (The transport connection was aborted by the local system.). PID = 1536. 
     17677 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from TimeWaitState  to ClosedState , SndNxt = 2438604648. 
     17678 [2]0004.0058::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C3956B20 released port number 10010. WeakReference = TRUE . 
    

    PID = 1536 通過 Process Explorer 我們能找到對應的Java.exe,

    image

  5. 由於通過第四步我們找到去往阿里雲伺服器 120.55.35.x 的連線請求是 CloudMonitor 觸發的。

    考慮到沒有配置 TcpTimedWaitDelay 的作業系統預設使用 2MSL (120s) TIME_WAIT timeout 時間,如果一個系統近 15000 的埠都處於 TIME_WAIT 狀態,理論上來說 Cloud Monitor 至少需要每秒產生 125 條以上的連線。而無論從 Cloud Monitor 的日誌還是之前的 ETL Trace ,我們都沒有發現此類行為,Cloud Monitor 程式比較規律的每幾秒鐘建立連線用於提交 "指標"。

    41931919752 INFO 2018-12-24 17:57:17.748 [ricGatherServiceHttp] 提交指標完成,耗時:78ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]

    41931935181 INFO 2018-12-24 17:57:33.177 [ricGatherServiceHttp] 提交指標完成,耗時:47ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]

    41931950625 INFO 2018-12-24 17:57:48.621 [ricGatherServiceHttp] 提交指標完成,耗時:47ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]

  6. 問題比較蹊蹺,排除一切可能後,剩下的唯一可能就是 Windows 系統在 2MSL 後沒有釋放 TIME_WAIT 的埠,而後續的排查也確認了這一點。可惜,我們從 ETL Trace 日誌中無法獲知具體原因,但是使用者無心的一句話卻提醒了我們,

    "這臺伺服器運行了很長時間,一直都沒有出問題"

    具體有多長?我們通過事件日誌 Eventlog 6013 確認已經運行了 43225197s 近 500 多天。這引起我們的注意,

    All the TCP/IP ports that are in a TIME_WAIT status are not closed after 497 days from system startup in Windows Vista, in Windows 7, in Windows Server 2008 and in Windows Server 2008 R2
    https://support.microsoft.com/en-us/help/2553549/all-the-tcp-ip-ports-that-are-in-a-time-wait-status-are-not-closed-aft

至此,整個問題得到了完整的分析和解決。

原因分析

簡單說一下這個 bug 的產生的原因,在 Windows 系統中,使用當前系統執行時間加上 TIMEOUT 時間來決定什麼時候釋放 TIME_WAIT 狀態的 TCP Endpoint。在沒有安裝補丁的系統中,32位暫存器所能存放的最大值 0xFFFFFFFF,暫存器中存放的數值為 "時間 * 0n100",也就是說在系統執行超過 "0XFFFFFFFF/0n100" 秒之後,就可能發生問題。

0xFFFFFFFF/(0n86400 * 0n100)= 497.1026961805...