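A Brief Look at Debugging Spark in YARN Mode
A Spark application may crash while running, and the stack trace shown in the Spark console is often not the one from where the problem actually occurred. In that case you need to read the YARN logs to locate the problem.
While debugging a Spark program, I started the Spark driver with the following command: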
/usr/local/spark-1.3.1-bin-hadoop2.6/bin/spark-submit --supervise --class spark_security.Sockpuppet --name "testperf" --executor-memory 4096M --num-executors 8 --driver-memory 8096M --master yarn-client /home/www/spark_Security-1.0-SNAPSHOT.jar
It then reported the following error:
15/07/03 14:35:01 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/03 14:35:01 INFO scheduler.DAGScheduler: Job 10 failed: foreachRDD at Sockpuppet.scala:80, took 2.226514 s
15/07/03 14:35:01 INFO scheduler.DAGScheduler: Stage 20 (map at Sockpuppet.scala:57) failed in 2.192 s
15/07/03 14:35:01 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
15/07/03 14:35:01 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down
15/07/03 14:35:01 ERROR scheduler.JobScheduler: Error running job streaming job 1435905299000 ms.0
org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:699)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:698)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:698)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1346)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1380)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:143)
Exception in thread "main" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:699)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:698)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:698)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1346)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1380)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:143)
15/07/03 14:35:01 INFO cluster.YarnClientSchedulerBackend: Stopped
15/07/03 14:35:01 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorActor: OutputCommitCoordinator stopped!
15/07/03 14:35:01 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/07/03 14:35:01 INFO storage.MemoryStore: MemoryStore cleared
15/07/03 14:35:01 INFO storage.BlockManager: BlockManager stopped
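Yet nothing in my driver program could have gone wrong and thrown this exception. So I dumped the YARN logs to a file with the following command and read through them:
yarn logs -applicationId application_1436175803684_0004 >exception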
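A dump produced by yarn logs can be very long, so it may help to filter it before reading. The following is a minimal sketch, not part of the original workflow: it assumes the dump uses the log4j line layout seen in the excerpts in this post (date, time, level, logger) and Java-style "at ..." stack frames. The names find_problems and scan_file are hypothetical helpers introduced only for illustration.

```python
import re

# Matches log lines such as
# "15/07/06 10:49:54 ERROR executor.CoarseGrainedExecutorBackend: ..."
LOG_LINE = re.compile(
    r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (?P<level>[A-Z]+) (?P<rest>.*)$"
)
# Matches Java stack frames such as
# "at org.apache.hadoop.util.Shell.run(Shell.java:455)"
STACK_FRAME = re.compile(r"^\s*at [\w.$<>]+\(.*\)$")


def find_problems(lines):
    """Return (errors, frames): ERROR/FATAL log lines and stack frames."""
    errors, frames = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LOG_LINE.match(line)
        if match and match.group("level") in ("ERROR", "FATAL"):
            errors.append(line)
        elif STACK_FRAME.match(line):
            frames.append(line.strip())
    return errors, frames


def scan_file(path):
    """Scan a saved log dump; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_problems(handle)
```

Calling scan_file on the file that the yarn logs output was redirected to returns the error lines and the stack frames as two separate lists, so both kinds of evidence can be reviewed side by side.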
The YARN logs contained two errors. The first was the following stack trace:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The second was the following error:
15/07/06 10:49:54 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
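I first searched for a solution based on this error message:
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
I tried a variety of suggested fixes and spent a long time on it, but the problem remained unsolved.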
I then searched based on this stack trace from the YARN logs instead:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
That search led to the solution. The cause was an incorrect hdp version, and the fix is to add the following two lines to the spark-defaults.conf file:
spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
With that, the problem was solved.
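To confirm that both settings are actually present before resubmitting a job, a small check like the one below could be used. It is only a sketch: it assumes the whitespace-separated "key value" layout shown above (spark-defaults.conf accepts other separators that this parser ignores), and parse_spark_defaults and missing_hdp_version are hypothetical helper names, not part of Spark.

```python
# Keys that the fix above adds to spark-defaults.conf.
REQUIRED_KEYS = (
    "spark.driver.extraJavaOptions",
    "spark.yarn.am.extraJavaOptions",
)


def parse_spark_defaults(text):
    """Parse 'key<whitespace>value' lines into a dict, skipping comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


def missing_hdp_version(props):
    """Return the required keys whose value has no -Dhdp.version= option."""
    return [
        key for key in REQUIRED_KEYS
        if "-Dhdp.version=" not in props.get(key, "")
    ]
```

Passing the contents of spark-defaults.conf through parse_spark_defaults and then missing_hdp_version yields an empty list when both lines are in place, and otherwise names the keys that still need the option.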
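To summarize:
1. The stack trace printed in the Spark console may be only a surface symptom; the stack trace that points to the real cause is likely to be in the YARN logs.
2. Stack trace errors found in the YARN logs should be investigated and resolved first.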