Spark常見問題解決
這倆天總結了在寫Spark Job的時候遇到的一些問題,寫在這裡,以後遇到了方便檢視。
1.Error:(64, 64) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
問題的原始碼:
...
val originRdd: RDD[(String, String, Int)] = dataFrame.map(row => (row.getString(0), row.getString(1), row.getInt(2))).rdd
...
這裡是對DataFrame的三列值進行一個特定資料型別的獲取動作,但是報錯了,這是因為spark資料集在儲存資料型別的時候需要編碼器,對於常見的資料型別,spark已經有現成的編碼器可用,但是需要從spark.implicits._中引入才可以執行。
解決方案:在執行DataFrame轉換為RDD之前引入sparkSession.implicits._
import sparkSession.implicits._
val originRdd: RDD[(String, String, Int)] = dataFrame.map(row => (row.getString(0), row.getString(1), row.getInt(2))).rdd
2.Error:(21, 19) missing parameter type lists.forEach(no => {})
有可能是在遍歷java.util.List物件的時候出錯,這個時候需要引入一個轉換器:
import scala.collection.JavaConversions._
其他的問題可以參考:
3.Scala測試用例報錯java.lang.Exception: Test class should have exactly one public constructor
java.lang.Exception: Test class should have exactly one public constructor
at org.junit.runners.BlockJUnit4ClassRunner.validateOnlyOneConstructor(BlockJUnit4ClassRunner.java:158)
at org.junit.runners.BlockJUnit4ClassRunner.validateConstructor(BlockJUnit4ClassRunner.java:147)
at org.junit.runners.BlockJUnit4ClassRunner.collectInitializationErrors(BlockJUnit4ClassRunner.java:127)
at org.junit.runners.ParentRunner.validate(ParentRunner.java:416)
at org.junit.runners.ParentRunner.<init>(ParentRunner.java:84)
at org.junit.runners.BlockJUnit4ClassRunner.<init>(BlockJUnit4ClassRunner.java:65)
at org.junit.internal.builders.JUnit4Builder.runnerForClass(JUnit4Builder.java:10)
at org.junit.runners.model.RunnerBuilder.safeRunnerForClass(RunnerBuilder.java:59)
at org.junit.internal.builders.AllDefaultPossibilitiesBuilder.runnerForClass(AllDefaultPossibilitiesBuilder.java:26)
at org.junit.runners.model.RunnerBuilder.safeRunnerForClass(RunnerBuilder.java:59)
at org.junit.internal.requests.ClassRequest.getRunner(ClassRequest.java:33)
at org.junit.internal.requests.FilterRequest.getRunner(FilterRequest.java:36)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:49)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.sJUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
解決辦法:將object 改為class
4.org.apache.spark.SparkException: Task not serializable錯誤
Caused by: org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1.apply(PairRDDFunctions.scala:758)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1.apply(PairRDDFunctions.scala:757)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.mapValues(PairRDDFunctions.scala:757)
at com.test.Test2$.main(Test2.scala:70)
在進行RDD的處理時,報了這個錯,這是因為Spark在進行分散式計算的時候,需要從Driver節點往Worker節點傳輸資料內容,比如一個公用的Object,在傳輸過程中就會使用到序列化(實現Serializable介面),如果傳輸的物件不能序列化,就會導致Worker不能正常接收到該資料,所以就會報錯。
所以出現了這個錯誤,首先看看rdd的action裡面有沒有使用到外部的物件,一定要使用可序列化的物件。
val groupedRdd: RDD[(String, Iterable[(String, String, Long)])] = originRdd.keyBy(row => row._1).groupByKey
groupedRdd.mapValues(values => { //這裡的values是一個Iterable物件,非序列化物件
//action內容
}
比如我這裡是因為使用了一個Iterable物件,不能序列化導致了錯誤。
所以我改為了
groupedRdd.mapValues(iter => iter.toList) // 將Iterable轉換為List,List是可序列化的,具體可看原始碼scala.collection.immutable.List
mapValues(values => { //這裡的values是一個Iterable物件,非序列化物件
//action內容
}