
Pitfalls Encountered When Upgrading from Spark 1.4 to Spark 2.2.1

1. Starting with Spark 2.2.x, JDK 1.7 and earlier are no longer supported, so the JDK has to be upgraded to Java 8 or later.

2. Copying the conf files from the 1.4 installation into the 2.2.1 conf directory works fine; starting and stopping the cluster is no different from before.

3. The main pitfall, which cost me a whole day.

Running the job fails with java.io.EOFException: Unexpected end of input stream:

17/12/27 14:50:17 INFO scheduler.DAGScheduler: ShuffleMapStage 13 (map at PidCount.scala:86) failed in 2.215 s due to Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 18, 10.26.238.178, executor 1): java.io.EOFException: Unexpected end of input stream
	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
	at java.io.InputStream.read(InputStream.java:101)
	at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
	at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
	at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:271)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:

After a lot of Googling, the explanation is that something is wrong with the files being read: either (1) a file is corrupted, or (2) with a glob read such as

val textFile = sc.textFile(inputFile + "/*.gz")

an empty (0-byte) .gz file matched by the pattern also triggers this IO error, because an empty file is not a valid gzip stream.
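
To find out which files are the culprit, the input directory can be scanned for zero-byte .gz files with the Hadoop FileSystem API. This is a minimal diagnostic sketch, not from the original post; sc and inputFile are assumed to be the SparkContext and input path used above.

import org.apache.hadoop.fs.{FileSystem, Path}

// List every zero-byte .gz file under the input directory, reusing the
// Hadoop configuration already carried by the SparkContext.
val fs = FileSystem.get(sc.hadoopConfiguration)
val emptyGz = fs.listStatus(new Path(inputFile))
  .filter(s => s.getPath.getName.endsWith(".gz") && s.getLen == 0)
emptyGz.foreach(s => println("empty gz file: " + s.getPath))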

After quite a bit of research I could not find a way to make Spark skip the empty file and keep going, so in the end I created a .gz file that is logically empty but not 0 bytes in size (a valid gzip stream containing no data), and the error went away.
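
For reference, such a placeholder can be produced with java.util.zip. This is a minimal sketch under my own assumptions; the file name is just an example.

import java.io.FileOutputStream
import java.util.zip.GZIPOutputStream

// Write a .gz file that contains no payload but still has a valid gzip
// header and trailer (about 20 bytes), so the Hadoop decompressor no longer
// hits an unexpected EOF when the glob picks it up.
val out = new GZIPOutputStream(new FileOutputStream("placeholder.gz")) // example file name
out.finish() // emit the gzip trailer without writing any data
out.close()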

Others on Google also say that, if you are using Hive:

Starting from Spark 2.1 you can ignore corrupt files by enabling the spark.sql.files.ignoreCorruptFiles option. Add this to your spark-submit or pyspark command:

--conf spark.sql.files.ignoreCorruptFiles=true
Since I am not using Hive, adding this option still did not seem to filter out the empty files in my case.
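
As a side note, spark.sql.files.ignoreCorruptFiles only affects the DataFrame/SQL file readers; the RDD path used by sc.textFile has a separate setting, spark.files.ignoreCorruptFiles (also introduced around Spark 2.1). I have not verified whether it covers the empty-gzip case here, so the sketch below is an assumption to test, not a confirmed fix.

import org.apache.spark.{SparkConf, SparkContext}

// spark.sql.files.ignoreCorruptFiles covers the DataFrame reader;
// spark.files.ignoreCorruptFiles is the RDD-level counterpart.
val conf = new SparkConf()
  .setAppName("PidCount") // the job from the stack trace above
  .set("spark.files.ignoreCorruptFiles", "true")
val sc = new SparkContext(conf)
val textFile = sc.textFile(inputFile + "/*.gz") // same input path as above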

Hope this helps.