Notes on Kafka failing to restart properly because partition data was too large

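One of our Kafka servers was under excessive load and the machine was down for a while. After killing the processes that were hogging memory, I restarted the Kafka service, but it never finished starting up and synchronizing data. The log looked like this: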
fset 0 to broker BrokerEndPoint(11,192.168.207.79,9092)] ) (kafka.server.ReplicaFetcherManager)
[2016-04-26 19:16:33,274] INFO [ReplicaFetcherManager on broker 13] Removed fetcher for partitions [ifindnotice_lp_queue,3],[newifindreport_lp_queue,1],[eventoutputqueue,0],[newifindreport_lp_queue,0],[NewEventOutputQueueLuyang,0],[weibo_dlstat_queue,0],[test_dlstat_queue,0],[investqa_lp_queue,2],[forum_yuqing_queue,3],[soniu_dlstat_queue,0] (kafka.server.ReplicaFetcherManager)
[2016-04-26 19:16:53,909] WARN [ReplicaFetcherThread-0-14], Error in fetch [email protected] Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:16:53,910] WARN [ReplicaFetcherThread-0-12], Error in fetch [email protected] Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:16:53,912] WARN [ReplicaFetcherThread-0-11], Error in fetch kafka.server.ReplicaFetcherThrea[email protected] Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:25,917] WARN [ReplicaFetcherThread-0-14], Error in fetch [email protected] Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:25,918] WARN [ReplicaFetcherThread-0-11], Error in fetch [email protected] Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:25,920] WARN [ReplicaFetcherThread-0-12], Error in fetch [email protected] Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:48,844] INFO [Group Metadata Manager on Broker 13]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2016-04-26 19:17:57,924] WARN [ReplicaFetcherThread-0-14], Error in fetch [email protected] Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:57,925] WARN [ReplicaFetcherThread-0-11], Error in fetch [email protected] Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:17:57,925] WARN [ReplicaFetcherThread-0-12], Error in fetch [email protected] Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms (kafka.server.ReplicaFetcherThread)
[2016-04-26 19:18:17,984] INFO [ReplicaFetcherManager on broker 13] Removed fetcher for partitions [__consumer_offsets,30] (kafka.server.ReplicaFetcherManager)
[2016-04-26 19:18:17,985] INFO [Group Metadata Manager on Broker 13]: Loading offsets and group metadata from [__consumer_offsets,30] (kafka.coordinator.GroupMetadataManager)
[2016-04-26 19:18:17,999] INFO [Group Metadata Manager on Broker 13]: Finished loading offsets from [__consumer_offsets,30] in 14 milliseconds. (kafka.coordinator.GroupMetadataManager)

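The log above shows that after Kafka restarted, ReplicaFetcherThread kept logging the error "Possible cause: java.net.SocketTimeoutException: No response received within 30000 ms", which means fetch requests were timing out during data synchronization.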
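To see which source brokers the timeouts concentrate on, the WARN lines can be tallied per fetcher thread. The following is a minimal sketch, not something from the original troubleshooting session: the function name count_fetch_errors and the abbreviated sample lines are made up for illustration, and it relies on the thread name pattern ReplicaFetcherThread-<fetcher id>-<source broker id> visible in the log above.

```python
import re
from collections import Counter

# Matches WARN lines emitted by a replica fetcher thread, for example:
# "[2016-04-26 19:16:53,909] WARN [ReplicaFetcherThread-0-14], Error in fetch ..."
FETCH_WARN = re.compile(r"WARN \[(ReplicaFetcherThread-\d+-\d+)\], Error in fetch")


def count_fetch_errors(log_text):
    """Return a Counter mapping fetcher thread name -> number of fetch errors."""
    return Counter(FETCH_WARN.findall(log_text))


# Abbreviated, illustrative sample (the real lines are longer).
sample = """\
[2016-04-26 19:16:53,909] WARN [ReplicaFetcherThread-0-14], Error in fetch ...
[2016-04-26 19:16:53,910] WARN [ReplicaFetcherThread-0-12], Error in fetch ...
[2016-04-26 19:17:25,917] WARN [ReplicaFetcherThread-0-14], Error in fetch ...
"""

counts = count_fetch_errors(sample)
for thread, n in counts.most_common():
    print(thread, n)
```

If the errors are spread evenly across all source brokers, as they are in the log above, the bottleneck is more likely on the restarting broker's side than on any single peer.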

[2016-04-26 19:38:28,999] INFO Client session timed out, have not heard from server in 4846ms for sessionid 0x253850bfcad1cbe, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2016-04-26 19:38:29,100] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2016-04-26 19:38:29,399] INFO Opening socket connection to server 192.168.201.41/192.168.201.41:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2016-04-26 19:38:29,400] INFO Socket connection established to 192.168.201.41/192.168.201.41:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2016-04-26 19:38:29,402] INFO Session establishment complete on server 192.168.201.41/192.168.201.41:2181, sessionid = 0x253850bfcad1cbe, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2016-04-26 19:38:29,402] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
The log above shows that the connection to ZooKeeper was timing out as well.

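Checking the NIC with dstat showed that the network card was saturated. My initial diagnosis was that the large volume of replication traffic was saturating the NIC, which in turn caused some connections to time out and fail.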
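A rough back-of-the-envelope estimate helps judge how long a saturated link would need to carry the backlog. This is only a sketch under stated assumptions: the function name, the 500 GiB backlog, the 1 Gbit/s link, and the 80% usable fraction are all hypothetical figures, none of which come from the incident itself.

```python
def estimate_sync_seconds(total_bytes, link_bits_per_sec, usable_fraction=0.8):
    """Estimate the seconds needed to copy total_bytes over a link.

    usable_fraction is a hypothetical allowance for protocol overhead
    and competing traffic; it must be in the range (0, 1].
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return total_bytes * 8 / (link_bits_per_sec * usable_fraction)


# Hypothetical figures: 500 GiB to catch up over a 1 Gbit/s NIC.
backlog_bytes = 500 * (1 << 30)
seconds = estimate_sync_seconds(backlog_bytes, 1e9)
print(f"about {seconds / 60:.0f} minutes")
```

While the link is fully occupied for that long, other requests sharing it can easily exceed a 30000 ms response timeout.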


Looking at the Kafka data directory, I found that the data directories were all larger than 1 GB.
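The same check can be scripted. The sketch below assumes the usual Kafka on-disk layout, where each partition lives in its own subdirectory of the log directory; the function names and the 1 GiB threshold are illustrative choices, not part of the original investigation.

```python
import os


def partition_sizes(log_dir):
    """Return {subdirectory name: total bytes} for each directory under log_dir."""
    sizes = {}
    for entry in os.scandir(log_dir):
        if not entry.is_dir():
            continue  # skip checkpoint files and other top-level files
        total = 0
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        sizes[entry.name] = total
    return sizes


def report(log_dir, threshold_bytes=1 << 30):
    """Print directories from largest to smallest, flagging those over the threshold."""
    ordered = sorted(partition_sizes(log_dir).items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        flag = "  (over threshold)" if size > threshold_bytes else ""
        print(f"{size / (1 << 20):10.1f} MiB  {name}{flag}")
```

Calling report() with the broker's data directory (whatever log.dirs points to on that machine) lists the largest partitions first, which shows where most of the replication traffic will come from.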


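For a broker to finish starting up, its replicas have to finish synchronizing with the replicas on the other brokers, but the sheer volume of data made synchronization fail. So I looked for a way to reduce the amount of data that had to be synchronized.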

I changed retention.ms on the topics to 1 hour, and also changed log.retention.bytes and log.segment.bytes in the server configuration to 1 GB and 512 MB respectively.
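These settings take plain numbers (milliseconds and bytes), so the human-readable figures have to be converted first. The sketch below does the conversion and prints the resulting key=value lines. It assumes that "1g" and "512m" mean binary units (GiB and MiB); the helper to_bytes is a hypothetical name introduced here. Note that log.retention.bytes applies per partition, and that retention only removes whole closed segments, which is presumably why the segment size was lowered together with the retention limits.

```python
UNITS = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def to_bytes(size):
    """Convert strings such as '512m' or '1g' to a byte count (binary units assumed)."""
    size = size.strip().lower()
    if size and size[-1] in UNITS:
        return int(size[:-1]) * UNITS[size[-1]]
    return int(size)


# Topic-level override: keep data for one hour.
topic_overrides = {"retention.ms": 60 * 60 * 1000}

# Broker-level settings for server.properties.
broker_settings = {
    "log.retention.bytes": to_bytes("1g"),
    "log.segment.bytes": to_bytes("512m"),
}

for key, value in {**topic_overrides, **broker_settings}.items():
    print(f"{key}={value}")
```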

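Even then the broker may still fail to come online, in which case the only option is to restart the other Kafka servers one by one. Once the partition leader in the ISR switches to this machine, offsets and log data are synchronized using the current leader as the reference, so restarting the other machines effectively skips the synchronization. This can cause data loss.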

After restarting them one by one and waiting a moment, all Kafka servers came online and the ISR returned to normal.
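One way to confirm that the ISR really is back to normal is to compare each partition's ISR with its replica list in the output of the topic describe command. The parser below is a sketch only. It assumes the output consists of tab-separated "Key: value" fields (Topic, Partition, Leader, Replicas, Isr), which may differ between Kafka versions. The function name is made up, and the sample text is invented for illustration, reusing a topic name from the log above.

```python
def under_replicated(describe_output):
    """Return (topic, partition) pairs whose ISR differs from the replica set."""
    result = []
    for line in describe_output.splitlines():
        fields = {}
        # Each line is assumed to be tab-separated "Key: value" fields.
        for part in line.strip().split("\t"):
            key, sep, value = part.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "Partition" not in fields or "Isr" not in fields:
            continue  # skip the per-topic summary line and blank lines
        replicas = set(fields.get("Replicas", "").split(","))
        isr = set(fields["Isr"].split(","))
        if isr != replicas:
            result.append((fields.get("Topic", ""), int(fields["Partition"])))
    return result


# Invented sample: partition 0 is missing broker 13 from its ISR.
sample = (
    "Topic:eventoutputqueue\tPartitionCount:2\tReplicationFactor:2\tConfigs:\n"
    "\tTopic: eventoutputqueue\tPartition: 0\tLeader: 11\tReplicas: 11,13\tIsr: 11\n"
    "\tTopic: eventoutputqueue\tPartition: 1\tLeader: 13\tReplicas: 13,14\tIsr: 13,14\n"
)
lagging = under_replicated(sample)
print(lagging)
```

An empty result means every listed partition has all of its replicas in sync.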