
Analysis of a Spark Streaming Job Failure Caused by ErasureCodingPolicy in Hadoop 3.1.0

I. The ErasureCodingPolicy problem that causes the streaming job to exit

1. Why the job fails: when the executed block throws an uncaught exception, the SparkContext is stopped, as shown below.

If you are interested in the exact source path of the error, you can trace it yourself from the error log; it is not traced step by step here, and only the most relevant source is shown.
Class: AsyncEventQueue

  private val dispatchThread = new Thread(s"spark-listener-group-$name") {
    setDaemon(true)
    override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
      dispatch()
    }
  }

  private def dispatch(): Unit = LiveListenerBus.withinListenerThread.withValue(true) {
    try {
      var next: SparkListenerEvent = eventQueue.take()
      while (next != POISON_PILL) {
        val ctx = processingTime.time()
        try {
          super.postToAll(next)
        } finally {
          ctx.stop()
        }
        eventCount.decrementAndGet()
        next = eventQueue.take()
      }
      eventCount.decrementAndGet()
    } catch {
      case ie: InterruptedException =>
        logInfo(s"Stopping listener queue $name.", ie)
    }
  }

Class: org.apache.spark.util.Utils

  /**
   * Execute a block of code that evaluates to Unit, stop SparkContext if there is any uncaught
   * exception
   *
   * NOTE: This method is to be called by the driver-side components to avoid stopping the
   * user-started JVM process completely; in contrast, tryOrExit is to be called in the
   * spark-started JVM process.
   */
  def tryOrStopSparkContext(sc: SparkContext)(block: => Unit) {
    try {
      block
    } catch {
      case e: ControlThrowable => throw e
      case t: Throwable =>
        val currentThreadName = Thread.currentThread().getName
        if (sc != null) {
          logError(s"uncaught error in thread $currentThreadName, stopping SparkContext", t)
          sc.stopInNewThread()
        }
        if (!NonFatal(t)) {
          logError(s"throw uncaught fatal error in thread $currentThreadName", t)
          throw t
        }
    }
  }

2. First there is a warning, as follows:

2018-11-30 16:35:53 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2018-11-30 16:35:54 WARN  ErasureCodeNative:55 - ISA-L support is not available in your platform... using builtin-java codec where applicable

3. After the Spark Streaming job finishes a batch, it starts throwing errors; the same error repeats in a loop, and before long the job dies. The error is:

2018-11-30 16:37:21 ERROR AsyncEventQueue:91 - Listener EventLoggingListener threw an exception
java.lang.IllegalStateException
	at com.google.common.base.Preconditions.checkState(Preconditions.java:133)
	at org.apache.hadoop.hdfs.DFSStripedOutputStream$CellBuffers.addTo(DFSStripedOutputStream.java:238)
	at org.apache.hadoop.hdfs.DFSStripedOutputStream$CellBuffers.access$700(DFSStripedOutputStream.java:203)
	at org.apache.hadoop.hdfs.DFSStripedOutputStream.writeChunk(DFSStripedOutputStream.java:520)
	at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:217)
	at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:164)
	at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:145)
	at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:136)
	at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:111)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
	at java.io.DataOutputStream.write(DataOutputStream.java:107)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
	at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
	at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
	at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
	at java.io.BufferedWriter.flushBuffer(BufferedWriter.java:129)
	at java.io.BufferedWriter.write(BufferedWriter.java:230)
	at java.io.PrintWriter.write(PrintWriter.java:456)
	at java.io.PrintWriter.write(PrintWriter.java:473)
	at java.io.PrintWriter.print(PrintWriter.java:603)
	at java.io.PrintWriter.println(PrintWriter.java:739)
	at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
	at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$1.apply(EventLoggingListener.scala:143)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:143)
	at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:164)
	at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:82)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:89)
	at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:89)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)

II. Problem analysis

The failing source boils down to the following:

1. In onTaskEnd, a PrintWriter is used to write out the event log.

2. The PrintWriter's write is ultimately carried out by DFSStripedOutputStream.writeChunk (the call chain is sketched after this list).

3. Before writing there is a checkState, and that is where the error is thrown.

4. The relevant parameter is the cellSize of the ErasureCodingPolicy.

5. When the write position exceeds cellSize, this exception is thrown.

6. Utils.tryOrStopSparkContext then finds an uncaught exception while executing the block and stops the SparkContext.

7. Judging from this, the main reason the job dies shortly after starting on the new cluster is very likely exactly this.

8. EC analysis: when the log is written out, it is split into chunks and put into the EC cell buffers; when a buffer fills up, a flush empties it. The exception has two important causes: (A) the chunks are large, and (B) the buffer is small. These two together trigger the exception.
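
To make the call chain concrete, here is a minimal, hypothetical sketch of the write path seen in the stack trace: a PrintWriter layered over an HDFS output stream (PrintWriter -> OutputStreamWriter -> FSDataOutputStream -> DFSStripedOutputStream when the target directory is EC-enabled). The path /spark-eventlogs/app-demo.inprogress is an assumed example, and the real EventLoggingListener does more than this:

import java.io.{BufferedWriter, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object EventLogWriteSketch {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    // Under an EC-enabled directory, create() hands back a stream backed by
    // DFSStripedOutputStream, which is where writeChunk/addTo come from.
    val out = fs.create(new Path("/spark-eventlogs/app-demo.inprogress"))
    val writer = new PrintWriter(
      new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8)))
    // EventLoggingListener writes each listener event as one JSON line.
    writer.println("""{"Event":"SparkListenerTaskEnd"}""")
    writer.flush()
    writer.close()
  }
}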

III. Solution

Write the log output to the local filesystem instead of HDFS, or disable EC on the event-log directory; a sketch follows.
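
For example, a minimal sketch that keeps the event log on the driver's local filesystem (the path file:///var/log/spark-events is an assumption; any existing driver-local directory works):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EcSafeEventLog {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ec-safe-streaming")
      .set("spark.eventLog.enabled", "true")
      // Keep the event log off the EC-enabled HDFS directory.
      .set("spark.eventLog.dir", "file:///var/log/spark-events")

    val ssc = new StreamingContext(conf, Seconds(10))
    // ... define the streaming job here before starting it ...
    ssc.start()
    ssc.awaitTermination()
  }
}

Alternatively, keep the log on HDFS but take EC out of the picture for that directory, for example with hdfs ec -unsetPolicy -path /spark-eventlogs or hdfs ec -setPolicy -path /spark-eventlogs -replicate (directory name assumed).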

IV. Code analysis

The failing code:

Class: DFSStripedOutputStream

private int addTo(int i, byte[] b, int off, int len) {
    ByteBuffer buf = this.buffers[i];
    int pos = buf.position() + len;
    // The checkState from the stack trace: the write must not run past the
    // cell boundary, otherwise an IllegalStateException is thrown.
    Preconditions.checkState(pos <= DFSStripedOutputStream.this.cellSize);
    buf.put(b, off, len);
    return pos;
}

Class: ErasureCodingPolicy

public ErasureCodingPolicy(String name, ECSchema schema, int cellSize, byte id) {
    Preconditions.checkNotNull(name);
    Preconditions.checkNotNull(schema);
    // cellSize must be a positive multiple of 1024; the default RS-6-3-1024k
    // policy, for instance, uses a 1 MiB cell.
    Preconditions.checkArgument(cellSize > 0, "cellSize must be positive");
    Preconditions.checkArgument(cellSize % 1024 == 0, "cellSize must be 1024 aligned");
    this.name = name;
    this.schema = schema;
    this.cellSize = cellSize;
    this.id = id;
}
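
To see the failing condition in isolation, here is a tiny standalone sketch (assuming Guava on the classpath and the 1 MiB cellSize of the default RS-6-3-1024k policy) that mimics addTo and throws the same IllegalStateException once a write would cross the cell boundary:

import java.nio.ByteBuffer
import com.google.common.base.Preconditions

object CellOverflowDemo {
  // Assumed value: the 1 MiB cell of the default RS-6-3-1024k policy.
  val cellSize: Int = 1024 * 1024

  // Mirrors the shape of CellBuffers.addTo above.
  def addTo(buf: ByteBuffer, b: Array[Byte], off: Int, len: Int): Int = {
    val pos = buf.position() + len
    Preconditions.checkState(pos <= cellSize)
    buf.put(b, off, len)
    pos
  }

  def main(args: Array[String]): Unit = {
    val buf = ByteBuffer.allocate(cellSize)
    addTo(buf, new Array[Byte](cellSize), 0, cellSize) // exactly fills the cell: passes
    addTo(buf, new Array[Byte](1), 0, 1)               // one byte over the boundary: throws
  }
}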

The root problem caused by EC itself has not been truly resolved yet.