Pitfalls encountered upgrading from Spark 1.4 to Spark 2.2.1
阿新 · Published 2019-02-07
1. Starting from Spark 2.2.x, JDK 1.7 is no longer supported, so the JDK has to be upgraded to Java 8 or later.
2. Copying the conf files from the 1.4 installation into the 2.2.1 conf directory works as before; starting and stopping the cluster is no different.
3. The big pitfall, which cost me a whole day:
Running the job fails with java.io.EOFException: Unexpected end of input stream
17/12/27 14:50:17 INFO scheduler.DAGScheduler: ShuffleMapStage 13 (map at PidCount.scala:86) failed in 2.215 s due to Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 18, 10.26.238.178, executor 1): java.io.EOFException: Unexpected end of input stream
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:271)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
After a lot of Googling, the common explanation is that the input files themselves are the problem: 1. a file is corrupted, or 2. a file matched by
val textFile = sc.textFile(inputFile + "/*.gz")
is empty, which also raises this IO error.
I spent half a day trying to find a way to let the job carry on when the input is empty and got nowhere, so in the end I just created an "empty" .gz file whose size is not 0 bytes, and the error went away.
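For reference, my understanding of why this works: a zero-byte .gz is not a valid gzip stream, so the decompressor hits an unexpected end of input, whereas a gzip file that compresses empty content still carries the gzip header and trailer and decompresses cleanly to nothing. A minimal sketch of creating such a placeholder (the placeholder.gz name is just an example; the file would then be copied into the input directory, e.g. with hdfs dfs -put):
import java.io.FileOutputStream
import java.util.zip.GZIPOutputStream

object MakeEmptyGz {
  def main(args: Array[String]): Unit = {
    // Writing zero bytes through GZIPOutputStream still emits the gzip
    // header and trailer, so the result is a small but valid archive
    // that decompresses to empty content instead of a 0-byte file.
    val out = new GZIPOutputStream(new FileOutputStream("placeholder.gz"))
    out.close()
  }
}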
Someone on Google also said that if you are using Hive you can do the following: "Starting from Spark 2.1 you can ignore corrupt files by enabling the spark.sql.files.ignoreCorruptFiles option. Add this to your spark-submit or pyspark command: --conf spark.sql.files.ignoreCorruptFiles=true"
Since I am not using Hive, adding this option still did not seem to filter out the empty files.
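For completeness, a minimal sketch of what that suggestion looks like when set in code instead of on the spark-submit command line (the object name and input path below are made up for illustration). As far as I can tell, spark.sql.files.ignoreCorruptFiles only affects Spark SQL's file-based readers such as spark.read.textFile, so a plain sc.textFile call would not pick it up:
import org.apache.spark.sql.SparkSession

object IgnoreCorruptExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical input path, for illustration only.
    val inputFile = "hdfs:///data/logs"

    val spark = SparkSession.builder()
      .appName("IgnoreCorruptExample")
      // Same effect as --conf spark.sql.files.ignoreCorruptFiles=true
      .config("spark.sql.files.ignoreCorruptFiles", "true")
      .getOrCreate()

    // Read through the DataFrame/Dataset API so the option above applies.
    val lines = spark.read.textFile(inputFile + "/*.gz")
    println(lines.count())

    spark.stop()
  }
}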
Hope this helps.