Pitfalls of Integrating Spark 2.4.0 and Scala 2.11 with Kudu 1.8.0
I recently used Spark Streaming and Kudu together for a real-time data warehouse. Since material on this combination is pitifully scarce, it took a fair amount of fiddling to get working. This post records the pitfalls I hit along the way.
First, create a Kudu table through Impala:
create table kudu_appbind_test (
  md5 string,
  userid string,
  datetime_ string,
  time_ string,
  cardno string,
  flag string,
  cardtype string,
  primary key (md5, userid, datetime_)
)
stored as kudu;
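Note the `impala::` prefix that shows up later in the write options: tables created through Impala are registered in Kudu under a name of the form `impala::<database>.<table>`. A minimal sketch of building that option value (the helper name and the `default` database are my own assumptions, not part of the Kudu API):

```scala
// Hypothetical helper (not part of the Kudu API): builds the value for the
// "kudu.table" option for a table that was created through Impala.
// Impala registers Kudu tables under "impala::<database>.<table>".
def impalaKuduTableName(db: String, table: String): String =
  s"impala::$db.$table"

println(impalaKuduTableName("default", "kudu_appbind_test"))
// → impala::default.kudu_appbind_test
```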
Choosing the dependency
The Kudu docs (kudu.apache.org/docs/develo…) say:
- Use the kudu-spark_2.10 artifact if using Spark with Scala 2.10. Note that Spark 1 is no longer supported in Kudu starting from version 1.6.0. So in order to use Spark 1 integrated with Kudu, version 1.5.0 is the latest to go to.
- Use the kudu-spark2_2.11 artifact if using Spark 2 with Scala 2.11.
- kudu-spark versions 1.8.0 and below have slightly different syntax.
- Spark 2.2+ requires Java 8 at runtime even though Kudu Spark 2.x integration is Java 7 compatible. Spark 2.2 is the default dependency version as of Kudu 1.5.0.
I'm on Spark 2.4.0, Scala 2.11, and Kudu 1.8.0, so the right choice is kudu-spark2_2.11-1.8.0.jar. The Maven configuration:
<!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2 -->
<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-spark2_2.11</artifactId>
    <version>1.8.0</version>
</dependency>
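If you happen to build with sbt instead of Maven, the equivalent coordinate would presumably be (a sketch; same group, artifact, and version as the Maven entry above, with `%%` appending the `_2.11` suffix):

```scala
// build.sbt fragment (assumption: an sbt build targeting Scala 2.11)
libraryDependencies += "org.apache.kudu" %% "kudu-spark2" % "1.8.0"
```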
But the following write statement threw an error:
kuduDF.write.format("kudu")
  .mode("append")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .save()
java.lang.ClassNotFoundException: Failed to find data source: kudu. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:649)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
... 49 elided
Caused by: java.lang.ClassNotFoundException: kudu.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:628)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:628)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:628)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:628)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:628)
... 51 more
The error says Spark cannot find a data source named kudu. A quick search turned up suggestions to swap the jar above for version 1.9.0, i.e. kudu-spark2_2.11-1.9.0.jar. That still failed:
# Using kudu-spark2_2.11-1.9.0.jar
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.kudu.spark.kudu.DefaultSource not a subtype
at java.util.ServiceLoader.fail(ServiceLoader.java:239)
at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:624)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
... 49 elided
Digging further into the Kudu source, I found that the org.apache.kudu.spark.kudu package defines:
implicit class KuduDataFrameWriter[T](writer: DataFrameWriter[T]) {
  def kudu = writer.format("org.apache.kudu.spark.kudu").save
}
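This is the standard Scala extension-method pattern: an implicit class wraps DataFrameWriter and pins the fully qualified format string for you. A toy, Spark-free sketch of the same mechanism (FakeWriter and all names here are made up for illustration, they are not Kudu or Spark API):

```scala
// Toy mimic of Kudu's implicit-class trick, with no Spark dependency.
// FakeWriter stands in for DataFrameWriter; everything here is illustrative.
case class FakeWriter(format: String = "") {
  def withFormat(f: String): FakeWriter = copy(format = f)
}

object KuduStyleImplicits {
  // Mirrors: implicit class KuduDataFrameWriter[T](writer: DataFrameWriter[T])
  implicit class KuduWriterOps(w: FakeWriter) {
    // Mirrors: def kudu = writer.format("org.apache.kudu.spark.kudu").save
    def kudu: FakeWriter = w.withFormat("org.apache.kudu.spark.kudu")
  }
}

import KuduStyleImplicits._
// The implicit conversion makes `.kudu` look like a native method:
println(FakeWriter().kudu.format)
// → org.apache.kudu.spark.kudu
```

With the real library, the analogous move is `import org.apache.kudu.spark.kudu._`, which brings the implicit into scope so that `df.write.options(...).kudu` compiles.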
Its format string differs from the docs' format("kudu"). After switching to the fully qualified name, the write finally worked:
kuduDF.write.format("org.apache.kudu.spark.kudu")
  .mode("append")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .save()
Several limitations of the Spark-Kudu integration
- Kudu tables with a name containing upper case or non-ascii characters must be assigned an alternate name when registered as a temporary table.
- Kudu tables with a column name containing upper case or non-ascii characters may not be used with SparkSQL. Columns may be renamed in Kudu to work around this issue.
- <> and OR predicates are not pushed to Kudu, and instead will be evaluated by the Spark task. Only LIKE predicates with a suffix wildcard are pushed to Kudu, meaning that LIKE "FOO%" is pushed down but LIKE "FOO%BAR" isn't.
- Kudu does not support every type supported by Spark SQL. For example, Date and complex types are not supported.
- Kudu tables may only be registered as temporary tables in SparkSQL. Kudu tables may not be queried using HiveContext.
- Also, when writing a DataFrame to a Kudu table, the DataFrame's column names must match the Kudu table's column names exactly.
- When writing to a partitioned Kudu table, the DataFrame's rows must fall into existing partitions; you cannot insert rows for a partition that does not exist.
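Given the last limitation in the list from the docs, the supported pattern is to register the Kudu table as a temporary view and query it through SparkSQL. A sketch of the read side, assuming a running cluster (the master address, view name, and query are placeholders):

```scala
// Sketch: read the Kudu table and register it as a temp view for SparkSQL.
// Requires a live Kudu master and a SparkSession named `spark`; the master
// address, view name, and query below are placeholders, not from the post.
val appbindDF = spark.read
  .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .load()

// Temp view only (no HiveContext, per the limitation above); note the view
// name must avoid upper case and non-ascii characters.
appbindDF.createOrReplaceTempView("kudu_appbind_tmp")
spark.sql("SELECT userid, cardno FROM kudu_appbind_tmp WHERE flag = '1'").show()
```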
This article was first published at: blog.csdn.net/lzw2016/art…
More big data tips at: github.com/josonle/Cod… and github.com/josonle/Big…