[Solved] Spark cluster problems on CentOS 6.2 (continuously updated)
OS: CentOS 6.2
Number of nodes: 1 master, 16 workers
Spark version: 0.8.0
Kernel version: 2.6.32
Below are the problems encountered and how they were resolved:
1. After a job finishes, one of the nodes can no longer be connected to; running jps on it shows a StandaloneExecutorBackend process that cannot be terminated.
Cause: unknown.
Solution: reboot the node and reconnect.
2. The tasktracker on a worker node fails to start.
Cause: the tasktracker process on the worker was not shut down when the cluster was stopped.
Solution: after stopping the cluster, manually locate the tasktracker process on each worker node and kill it, as in the sketch below.
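A minimal cleanup sketch, assuming jps is available on the worker and that the leftover process appears under the usual Hadoop name TaskTracker (the PID 12345 is a placeholder):
# run on the affected worker node
jps | grep TaskTracker     # note the PID of the leftover process
kill 12345                 # replace 12345 with that PID
jps | grep TaskTracker     # if it is still listed, force it:
kill -9 12345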
3. Running start-master.sh fails with: failed to launch org.apache.spark.deploy.master.Master
Solution: run sbt/sbt clean assembly
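For reference, a minimal rebuild-and-retry sequence, assuming $SPARK_HOME points at the 0.8.0 installation and the scripts and logs are in their default locations:
cd $SPARK_HOME
sbt/sbt clean assembly              # rebuild the assembly jar
bin/start-master.sh                 # retry launching the master
tail -n 50 logs/spark-*Master*.out  # the actual launch error, if any, is written to this log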
4. The job finishes on the master and the displayed result is correct, but an error is reported in /work/XX/stderr on a worker node.
cd $SPARK_HOME
./run-example org.apache.spark.examples.SparkPi spark://hw024:7077
Standard output:
……
14/03/20 11:13:02 INFO scheduler.DAGScheduler: Stage 0 (reduce at .scala:39) finished in 1.642 s
14/03/20 11:13:02 INFO cluster.ClusterScheduler: Remove 0.0 from pool
14/03/20 11:13:02 INFO spark.SparkContext: Job finished: reduce at .scala:39, took 1.708775428 s
Pi is roughly 3.13434
But the worker node's /home/zhangqianlong/spark-0.8.0-incubating-bin-hadoop1/work/app-20140320111300-0008/8/stderr contains the following:
Spark Executor Command: "java" "-cp" ":/home/zhangqianlong/spark-0.8.0-incubating-bin-hadoop1/conf:/home/zhangqianlong/spark-0.8.0-incubating-bin-hadoop1/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.StandaloneExecutorBackend" "akka://[email protected]:60929/user/StandaloneScheduler" "8" "hw018" "24"
====================================
14/03/20 11:05:15 INFO slf4j.Slf4jEventHandler: started
14/03/20 11:05:15 INFO executor.StandaloneExecutorBackend: Connecting to driver: akka://[email protected]:60929/user/StandaloneScheduler
14/03/20 11:05:15 INFO executor.StandaloneExecutorBackend: Successfully registered with driver
14/03/20 11:05:15 INFO slf4j.Slf4jEventHandler: started
14/03/20 11:05:15 INFO spark.SparkEnv: Connecting to : akka://[email protected]:60929/user/BlockManagerMaster
14/03/20 11:05:15 INFO storage.MemoryStore: started with capacity 323.9 MB.
14/03/20 11:05:15 INFO storage.DiskStore: Created local directory at /tmp/spark-local-20140320110515-9151
14/03/20 11:05:15 INFO network.ConnectionManager: Bound socket to port 59511 with id = (hw018,59511)
14/03/20 11:05:15 INFO storage.BlockManagerMaster: Trying to register
14/03/20 11:05:15 INFO storage.BlockManagerMaster: Registered
14/03/20 11:05:15 INFO spark.SparkEnv: Connecting to : akka://[email protected]:60929/user/MapOutputTracker
14/03/20 11:05:15 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-81a80beb-fd56-4573-9afe-ca9310d3ea8d
14/03/20 11:05:15 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/03/20 11:05:15 INFO server.AbstractConnector: Started [email protected]:56230
14/03/20 11:05:16 ERROR executor.StandaloneExecutorBackend: Driver terminated or disconnected! Shutting down.
This problem plagued me for a whole week. After discussing it several times with other engineers, the conclusion is that it can be ignored: as long as the job finishes normally and hadoop fs -cat /XX/part-XXX produces the expected output, everything is fine. My guess is that it comes from a timeout/latency configuration issue.
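A minimal sketch of that verification, assuming the real job writes its result to HDFS; /user/zhangqianlong/output is only a hypothetical path for illustration:
# after the job completes, confirm the output on HDFS is the expected one
hadoop fs -ls /user/zhangqianlong/output
hadoop fs -cat /user/zhangqianlong/output/part-00000 | head
# if the result is correct, the "Driver terminated or disconnected" line in the worker's stderr can be ignored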
5. Runtime error: FileNotFoundException: too many open files
Cause: iterative computation opens too many temporary files.
Solution: raise the system open-file limit in /etc/security/limits.conf on every node (note: do not delete the file over an ssh session and then copy a replacement in; that can leave the system unable to log in, so edit it in place).
Restart Spark after the change for the new limit to take effect; a sketch follows below.
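A minimal sketch of the limits.conf change, assuming the cluster runs as the user seen in the paths above and that 65535 is an acceptable limit (both are adjustable):
# append to /etc/security/limits.conf on every node (edit it in place, do not replace the file)
zhangqianlong  soft  nofile  65535
zhangqianlong  hard  nofile  65535
# log in again, verify the new limit, then restart Spark from that fresh session
ulimit -n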
6. A program running on Spark crashes because the input data is too large (it runs successfully on small input).
The error log is as follows:
14/04/15 16:14:33 INFO cluster.ClusterTaskSetManager: Starting task 2.0:92 as TID 594 on executor 24: hw028 (ANY)
14/04/15 16:14:33 INFO cluster.ClusterTaskSetManager: Serialized task 2.0:92 as 2119 bytes in 0 ms
14/04/15 16:14:33 INFO client.Client$ClientActor: Executor updated: app-20140415151451-0000/23 is now FAILED (Command exited with code 137)
14/04/15 16:14:33 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140415151451-0000/23 removed: Command exited with code 137
14/04/15 16:14:33 ERROR client.Client$ClientActor: Master removed our application: FAILED; stopping client
14/04/15 16:14:33 ERROR cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster!
14/04/15 16:14:33 INFO cluster.ClusterScheduler: Remove TaskSet 2.0 from pool
14/04/15 16:14:33 INFO cluster.ClusterScheduler: Ignoring update from TID 590 because its task set is gone
14/04/15 16:14:33 INFO cluster.ClusterScheduler: Ignoring update from TID 593 because its task set is gone
14/04/15 16:14:33 INFO scheduler.DAGScheduler: Failed to run count at PageRank.scala:43
Exception in thread "main" org.apache.spark.SparkException:
Job failed: Error: Disconnected from Spark cluster
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:379)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
Cause: the RDDs are too large and each node holds many of them, so the nodes run out of memory.
Solution: modify the launch command or spark-env.sh and add the parameter -Dspark.akka.frameSize=10000 (the unit is MB); see the sketch below.
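One way to pass the option cluster-wide, assuming the Spark 0.8-style conf/spark-env.sh with SPARK_JAVA_OPTS (the value is simply the one suggested above):
# conf/spark-env.sh on the master and every worker
export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.akka.frameSize=10000"
# restart the cluster so the executor JVMs pick up the new option
bin/stop-all.sh && bin/start-all.sh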
7. A worker node is dropped with "no recent heart beats" because the input data is too large or the network is poor.
Symptom: WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, hw032, 39782, 0) with no recent heart beats: 46910ms exceeds 45000ms
Cause: due to a slow network or a large data volume, the worker does not report back to the master within the timeout (45 s by default), so the master assumes the worker has died.
Solution: modify the launch command or spark-env.sh and add the parameter -Dspark.storage.blockManagerHeartBeatMs=60000 (the unit is ms, i.e. 60 seconds); see the sketch below.
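As in problem 6, this can go through SPARK_JAVA_OPTS in conf/spark-env.sh; combining both options in one export is just a convenience, and the values are the ones suggested above:
# conf/spark-env.sh on every node
export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS \
  -Dspark.akka.frameSize=10000 \
  -Dspark.storage.blockManagerHeartBeatMs=60000"
# restart the cluster for the change to take effect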