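Out-of-memory in the shuffle phase on YARN (error in shuffle in fetcher)

Axin • Published: 2019-01-17

While migrating jobs to a new Hadoop 2.4 cluster, one of our teams hit an OOM in the shuffle phase when running an HQL query. We had never seen this problem before, and the logs and counters gave no clear answer. A web search showed that someone else had run into the same issue; what follows is a repost of their solution: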

=====================================================================

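While using a Hadoop cluster (CDH4.4, MRv2, i.e. the YARN framework), we found that the job reported the following error when processing a large data set: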

13/12/02 20:02:06 INFO mapreduce.Job: map 100% reduce 2%

13/12/02 20:02:18 INFO mapreduce.Job: Task Id : attempt_1385983958793_0001_r_000000_1, Status : FAILED
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:121)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:379)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)

Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:297)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:287)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:360)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:295)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:154)

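Googling turned up nothing! The job was waiting to run and the boss was pressing for results. With no expert to ask, the only option was to dig in ourselves: analyze the symptoms carefully and turn to the source code.

The first observation: the map percentage kept increasing. Once reduce tasks appeared, an error like the one above was reported every so often and the reduce restarted from 0%, while the map tasks kept advancing. The reduce ran for a while, failed again, and started from 0 again. After the fourth such error the whole application was declared failed.

From this we can roughly conclude:

Every reduce attempt failed, and each failure triggered a restart;
After 4 accumulated reduce failures the whole application exited, so there should be a configuration item such as a maximum retry count.
Map tasks and reduce tasks are isolated and do not interfere with each other. This also follows from how map and reduce tasks work.

Based on this, we first found the configuration item mapreduce.reduce.maxattempts in mapred-site.xml, which sets the maximum number of attempts for a failed reduce task. Its default is 4; we raised it to 400 and tried again.

mapreduce.reduce.maxattempts took effect, but the errors kept coming. The job no longer ended after 4 errors: map progress kept moving forward, and after map reached 100% the reduce still failed in the same rhythm. It was time to find out what the class reporting the error actually does.

The class org.apache.hadoop.mapreduce.task.reduce.Fetcher is in hadoop-mapreduce-client-core-2.0.0-cdh4.4.0.jar. With Maven, adding the following to pom.xml fetches the package and its source: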
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.0.0-cdh4.4.0</version>
</dependency>

The entry point of the problem is this call in run:

// Shuffle
copyFromHost(host);

Tracing into copyMapOutput, the reducer is about to copy the map output from the map node's local disk for the shuffle. The failure point is:

// Get the location for the map output - either in-memory or on-disk
mapOutput = merger.reserve(mapId, decompressedLength, id);

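merger refers to a MergeManagerImpl object. Its reserve function decides how the shuffle is handled: is the output placed in memory (InMemoryMapOutput), or written to disk and processed more slowly (OnDiskMapOutput)?

From our error message it is clear that the task chose InMemoryMapOutput. Before checking why it made that choice, let us see how large the map output actually is: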

shell> cd /data/1/mrlocal/yarn/local/usercache/hdfs/appcache/application_1385983958793_0001/output
shell> du -sh * | grep _r_
7.3G attempt_1385983958793_0001_r_000000_1
6.5G attempt_1385983958793_0001_r_000000_12
5.2G attempt_1385983958793_0001_r_000000_5
5.8G attempt_1385983958793_0001_r_000000_7

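Output this large will obviously cause an OOM if it is placed in memory. Two options are available, so why did the task not choose OnDiskMapOutput?

The following snippet is clearly the key: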
if (!canShuffleToMemory(requestedSize)) {
    LOG.info(mapId + ": Shuffling to disk since " + requestedSize +
             " is greater than maxSingleShuffleLimit (" +
             maxSingleShuffleLimit + ")");
    return new OnDiskMapOutput<K,V>(mapId, reduceId, this, requestedSize,
                                    jobConf, mapOutputFile, fetcher, true);
}

Now look at canShuffleToMemory:

private boolean canShuffleToMemory(long requestedSize) {
    return (requestedSize < maxSingleShuffleLimit);
}
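The branch above can be modelled in a few lines of standalone Java. This is only a sketch of the quoted comparison, not Hadoop's MergeManagerImpl: the class name ShuffleReserveSketch, the Placement enum, and the sizes in main are made up for illustration, and the other paths inside reserve are left out.

```java
// Simplified, standalone model of the quoted reserve() branch.
// NOT Hadoop code: class, enum, and sizes are hypothetical.
final class ShuffleReserveSketch {
    enum Placement { IN_MEMORY, ON_DISK }

    // Same comparison as the quoted canShuffleToMemory:
    // only a request strictly below the limit may stay in memory.
    static boolean canShuffleToMemory(long requestedSize, long maxSingleShuffleLimit) {
        return requestedSize < maxSingleShuffleLimit;
    }

    // Picks the destination for one map output segment.
    static Placement choose(long requestedSize, long maxSingleShuffleLimit) {
        return canShuffleToMemory(requestedSize, maxSingleShuffleLimit)
                ? Placement.IN_MEMORY
                : Placement.ON_DISK;
    }

    public static void main(String[] args) {
        long limit = 64L * 1024 * 1024;   // hypothetical 64 MiB single-shuffle limit
        long small = 10L * 1024 * 1024;   // hypothetical 10 MiB segment
        long large = 200L * 1024 * 1024;  // hypothetical 200 MiB segment

        System.out.println(choose(small, limit)); // IN_MEMORY
        System.out.println(choose(large, limit)); // ON_DISK
        System.out.println(choose(limit, limit)); // ON_DISK (not strictly smaller)
    }
}
```

The point of the model is that the only knob in this branch is maxSingleShuffleLimit: the larger it is, the more segments are buffered on the heap.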

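The true meaning of requestedSize is not obvious from the source code alone (at the call site above it is the decompressedLength argument passed to merger.reserve). The question ultimately comes down to the meaning and origin of maxSingleShuffleLimit. Digging further shows where it comes from: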

this.maxSingleShuffleLimit =
    (long)(memoryLimit * singleShuffleMemoryLimitPercent);

The two variables are set as follows:

// Allow unit tests to fix Runtime memory
this.memoryLimit =
    (long)(jobConf.getLong(MRJobConfig.REDUCE_MEMORY_TOTAL_BYTES,
        Math.min(Runtime.getRuntime().maxMemory(), Integer.MAX_VALUE))
      * maxInMemCopyUse);

final float singleShuffleMemoryLimitPercent =
    jobConf.getFloat(MRJobConfig.SHUFFLE_MEMORY_LIMIT_PERCENT,
        DEFAULT_SHUFFLE_MEMORY_LIMIT_PERCENT);
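To make the memoryLimit expression concrete, here is a minimal standalone sketch of the same arithmetic. It assumes the REDUCE_MEMORY_TOTAL_BYTES key is not set, so the JVM heap size is used as the default, and it assumes a value of 0.70 for maxInMemCopyUse; that number is not given in this post and should be checked against your own cluster's configuration. The class and method names are hypothetical.

```java
// Standalone sketch of the memoryLimit arithmetic quoted above.
// NOT Hadoop code; the 0.70 fraction is an assumption for illustration.
final class MemoryLimitSketch {
    static final float ASSUMED_MAX_IN_MEM_COPY_USE = 0.70f;

    // Same shape as the quoted expression: cap the heap size at
    // Integer.MAX_VALUE, then scale by the in-memory copy fraction.
    static long memoryLimit(long maxHeapBytes, float maxInMemCopyUse) {
        long capped = Math.min(maxHeapBytes, Integer.MAX_VALUE);
        return (long) (capped * maxInMemCopyUse);
    }

    public static void main(String[] args) {
        long heap = Runtime.getRuntime().maxMemory();
        System.out.println("JVM max heap (bytes): " + heap);
        System.out.println("memoryLimit (bytes):  "
                + memoryLimit(heap, ASSUMED_MAX_IN_MEM_COPY_USE));

        // Any heap above 2 GiB is treated like Integer.MAX_VALUE bytes.
        long eightGiB = 8L * 1024 * 1024 * 1024;
        System.out.println("8 GiB heap -> "
                + memoryLimit(eightGiB, ASSUMED_MAX_IN_MEM_COPY_USE));
    }
}
```

One consequence visible in the quoted code is the Integer.MAX_VALUE cap: under this expression, giving the reducer a heap larger than about 2 GiB does not raise memoryLimit any further unless the total-bytes key is set explicitly.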

singleShuffleMemoryLimitPercent takes its value from the mapreduce.reduce.shuffle.memory.limit.percent setting, which the official documentation describes as:

Expert: Maximum percentage of the in-memory limit that a single shuffle can consume

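In other words, it is the fraction of the reduce-side in-memory shuffle limit (memoryLimit above) that a single shuffle may consume; the default is 0.25. "Expert" mode sounds rather intimidating.

So lowering mapreduce.reduce.shuffle.memory.limit.percent should make the task choose OnDiskMapOutput instead of InMemoryMapOutput. We lowered it to 0.06 and tested again: the job ran to completion with no more errors.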
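The effect of the change from 0.25 to 0.06 can be illustrated with a small standalone calculation. The 700 MiB memoryLimit and the 100 MiB segment size below are hypothetical numbers chosen only to show the threshold moving; they are not measurements from the job in this post, and the class is not Hadoop code.

```java
// Standalone sketch: how the single-shuffle percent moves the
// in-memory/on-disk threshold. All sizes here are hypothetical.
final class SingleShuffleLimitSketch {
    // Same formula as the quoted maxSingleShuffleLimit assignment.
    static long maxSingleShuffleLimit(long memoryLimit, float percent) {
        return (long) (memoryLimit * percent);
    }

    // True when a segment of the given size would be shuffled into memory.
    static boolean fitsInMemory(long segmentBytes, long memoryLimit, float percent) {
        return segmentBytes < maxSingleShuffleLimit(memoryLimit, percent);
    }

    public static void main(String[] args) {
        long memoryLimit = 700L * 1024 * 1024; // hypothetical 700 MiB shuffle buffer
        long segment = 100L * 1024 * 1024;     // hypothetical 100 MiB map output segment

        for (float percent : new float[] {0.25f, 0.06f}) {
            long limit = maxSingleShuffleLimit(memoryLimit, percent);
            String where = fitsInMemory(segment, memoryLimit, percent) ? "memory" : "disk";
            System.out.println("percent=" + percent
                    + " limit=" + limit + " bytes -> segment goes to " + where);
        }
    }
}
```

With these assumed numbers the same segment is buffered in memory at 0.25 and written to disk at 0.06, which is the behaviour change the fix relies on.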

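Takeaway: choosing the newest framework means running into the newest problems. When you are stuck, understanding the principles and reading the source code will eventually lead to the answer.

Open questions:

1. While reading the source, many unclear parts were skipped. In particular, how the actual value of memoryLimit (the total memory the reduce side can use for the shuffle) is determined needs further investigation.

2. How to tune mapreduce.reduce.shuffle.memory.limit.percent so that a sensible configuration makes the best possible use of memory. To be continued.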
