Spark tasks stuck due to network issues
阿新 • Published: 2018-12-22
Hostname mapping error
Background
After a batch of new Spark machines was added to the Yarn cluster, some tasks would hang indefinitely when running Spark jobs, with no hint of any kind on the driver side.
Solution
Logging into the node where the task was stuck and checking the container's stderr log showed that, while fetching block information from other nodes, the executor could not connect to those machines and kept retrying.
We suspected that ops had missed some of the old nodes when updating /etc/hosts. Checking /etc/hosts confirmed that the new nodes' addresses were missing; adding them fixed the problem.
As the cluster keeps growing, using DNS avoids the errors caused by forgetting to update /etc/hosts on individual nodes.
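Before digging into Spark internals, a quick way to verify this hypothesis is to check whether every cluster node's hostname resolves on the affected machine. A minimal sketch using the JDK resolver (the node names below are placeholders, not real hosts):

```scala
import java.net.InetAddress
import scala.util.Try

// Returns true if the local resolver (/etc/hosts or DNS) can map
// the hostname to an IP address.
def resolves(host: String): Boolean =
  Try(InetAddress.getByName(host)).isSuccess

// Hypothetical node names; substitute the hostnames from your cluster.
val nodes = Seq("spark-node-01", "spark-node-02")
nodes.foreach { h =>
  println(s"$h resolves: ${resolves(h)}")
}
```

Running this on each executor host quickly pinpoints which machines have a stale /etc/hosts.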
Root cause
When reading shuffle data, local blocks are read from the local BlockManager, while remote blocks are fetched through BlockTransferService, whose address information includes the hostname. If there is no hostname-to-IP mapping, the fetch fails and RetryingBlockFetcher is invoked to retry; if it keeps failing, an exception is thrown: Exception while beginning fetch …
NettyBlockTransferService
```scala
override def fetchBlocks(
    host: String,
    port: Int,
    execId: String,
    blockIds: Array[String],
    listener: BlockFetchingListener): Unit = {
  logTrace(s"Fetch blocks from $host:$port (executor id $execId)")
  try {
    val blockFetchStarter = new RetryingBlockFetcher.BlockFetchStarter {
      override def createAndStart(blockIds: Array[String], listener: BlockFetchingListener) {
        val client = clientFactory.createClient(host, port)
        new OneForOneBlockFetcher(client, appId, execId, blockIds.toArray, listener).start()
      }
    }

    val maxRetries = transportConf.maxIORetries()
    if (maxRetries > 0) {
      // Note this Fetcher will correctly handle maxRetries == 0; we avoid it just in case there's
      // a bug in this code. We should remove the if statement once we're sure of the stability.
      new RetryingBlockFetcher(transportConf, blockFetchStarter, blockIds, listener).start()
    } else {
      blockFetchStarter.createAndStart(blockIds, listener)
    }
  } catch {
    case e: Exception =>
      logError("Exception while beginning fetchBlocks", e)
      blockIds.foreach(listener.onBlockFetchFailure(_, e))
  }
}
```
RetryingBlockFetcher
```java
private void fetchAllOutstanding() {
  // Start by retrieving our shared state within a synchronized block.
  String[] blockIdsToFetch;
  int numRetries;
  RetryingBlockFetchListener myListener;
  synchronized (this) {
    blockIdsToFetch = outstandingBlocksIds.toArray(new String[outstandingBlocksIds.size()]);
    numRetries = retryCount;
    myListener = currentListener;
  }

  // Now initiate the fetch on all outstanding blocks, possibly initiating a retry if that fails.
  try {
    fetchStarter.createAndStart(blockIdsToFetch, myListener);
  } catch (Exception e) {
    logger.error(String.format("Exception while beginning fetch of %s outstanding blocks %s",
      blockIdsToFetch.length, numRetries > 0 ? "(after " + numRetries + " retries)" : ""), e);

    if (shouldRetry(e)) {
      initiateRetry();
    } else {
      for (String bid : blockIdsToFetch) {
        listener.onBlockFetchFailure(bid, e);
      }
    }
  }
}
```
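The retry count read by `transportConf.maxIORetries()` is controlled by Spark's shuffle I/O settings. If fetch failures are transient rather than caused by a broken hostname mapping, raising them can help; a sketch of the relevant entries in `spark-defaults.conf` (values here are illustrative, the defaults are 3 retries and a 5s wait):

```properties
# Maximum number of retries for a failed remote block fetch.
spark.shuffle.io.maxRetries  6
# Wait between retries; total tolerated outage is roughly maxRetries * retryWait.
spark.shuffle.io.retryWait   10s
```

Note that no amount of retrying fixes a missing /etc/hosts entry; retries only mask short network blips.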
Kafka reads fail and retry endlessly
Cause
A firewall was blocking connections to port 9092.
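A quick way to confirm a firewall problem from the consumer host is to attempt a raw TCP connect to the broker port with a short timeout. A minimal sketch (the broker hostname is a placeholder):

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Returns true if a TCP connection to host:port succeeds within timeoutMs.
// A false result means a firewall drop, a closed port, or an unreachable host.
def portOpen(host: String, port: Int, timeoutMs: Int = 2000): Boolean =
  Try {
    val s = new Socket()
    try s.connect(new InetSocketAddress(host, port), timeoutMs)
    finally s.close()
  }.isSuccess

// Hypothetical broker address; substitute your Kafka broker.
println(s"kafka reachable: ${portOpen("kafka-broker-01", 9092)}")
```

A firewall that silently drops packets shows up as a timeout (the full `timeoutMs` elapses), while a merely closed port fails fast with a connection refused.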
Summary
For problems like these, the executor logs usually contain the answer.