
Pitfalls of integrating Spark 2.4.0 / Scala 2.11 with Kudu 1.8.0

I've recently been building a real-time data warehouse with Spark Streaming and Kudu. Reference material on this combination is pitifully scarce, so it took a fair amount of fiddling before everything worked. Here is a record of the pitfalls I hit along the way.

First, create a Kudu table through Impala:

create table kudu_appbind_test(
  md5 string,
  userid string,
  datetime_ string,
  time_ string,
  cardno string,
  flag string,
  cardtype string,
  primary key(md5, userid, datetime_)
)
stored as kudu;

Choosing the dependency

See the Kudu documentation: kudu.apache.org/docs/develo…

The docs call out a few key points:

  • Use the kudu-spark_2.10 artifact if using Spark with Scala 2.10. Note that Spark 1 is no longer supported in Kudu starting from version 1.6.0. So in order to use Spark 1 integrated with Kudu, version 1.5.0 is the latest to go to.
  • Use kudu-spark2_2.11 artifact if using Spark 2 with Scala 2.11.
  • kudu-spark versions 1.8.0 and below have slightly different syntax.
  • Spark 2.2+ requires Java 8 at runtime even though Kudu Spark 2.x integration is Java 7 compatible. Spark 2.2 is the default dependency version as of Kudu 1.5.0.

I'm on Spark 2.4.0, Scala 2.11 and Kudu 1.8.0, so the matching artifact is kudu-spark2_2.11-1.8.0.jar. The Maven configuration:

    <!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2 -->
    <dependency>
      <groupId>org.apache.kudu</groupId>
      <artifactId>kudu-spark2_2.11</artifactId>
      <version>1.8.0</version>
    </dependency>
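If you build with sbt instead of Maven, the equivalent dependency line would be (same coordinates; `%%` appends the `_2.11` suffix from your `scalaVersion`):

```scala
// build.sbt — sbt equivalent of the Maven dependency above
libraryDependencies += "org.apache.kudu" %% "kudu-spark2" % "1.8.0"
```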

But the following write statement fails:

kuduDF.write.format("kudu")
  .mode("append")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .save()
java.lang.ClassNotFoundException: Failed to find data source: kudu. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:649)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: kudu.DefaultSource
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:628)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:628)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:628)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:628)
  at scala.util.Try.orElse(Try.scala:84)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:628)
  ... 51 more

Judging from the error, Spark does not recognize kudu as a registered data source. A quick search turned up suggestions to swap the jar above for version 1.9.0, i.e. kudu-spark2_2.11-1.9.0.jar. Still an error (a ServiceLoader "not a subtype" failure usually points at two conflicting copies of the data source class on the classpath):

# using kudu-spark2_2.11-1.9.0.jar
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.kudu.spark.kudu.DefaultSource not a subtype
  at java.util.ServiceLoader.fail(ServiceLoader.java:239)
  at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:376)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
  at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:624)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
  ... 49 elided

Digging into the Kudu source, I found that org.apache.kudu.spark.kudu defines:

  implicit class KuduDataFrameWriter[T](writer: DataFrameWriter[T]) {
    def kudu = writer.format("org.apache.kudu.spark.kudu").save
  }
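Because of that implicit class, importing the package object also lets you end the writer chain with `.kudu` instead of spelling out the format string. A minimal sketch, assuming `kuduDF` and the cluster from the earlier snippets:

```scala
import org.apache.kudu.spark.kudu._  // brings KuduDataFrameWriter into scope

// the trailing .kudu expands to .format("org.apache.kudu.spark.kudu").save
kuduDF.write
  .mode("append")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .kudu
```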

Its format string differs from the docs' format("kudu"). I switched to the fully qualified name, and it worked — presumably the short "kudu" alias is only registered in newer kudu-spark releases, which would also explain the failure on 1.8.0:

kuduDF.write.format("org.apache.kudu.spark.kudu")
  .mode("append")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .save()
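Reading back works the same way: the fully qualified format name plus the same two options returns a DataFrame. A sketch, assuming an active `spark` session and the same cluster:

```scala
// load the Kudu table as a DataFrame via the qualified data source name
val df = spark.read
  .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", "server:7051")
  .option("kudu.table", "impala::kudu_appbind_test")
  .load()

// register under a lower-case alias for Spark SQL queries
df.createOrReplaceTempView("appbind_test")
```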

A few limitations of the Spark/Kudu integration

kudu.apache.org/docs/develo…

  • Kudu tables with a name containing upper case or non-ascii characters must be assigned an alternate name when registered as a temporary table.
  • Kudu tables with a column name containing upper case or non-ascii characters may not be used with SparkSQL. Columns may be renamed in Kudu to work around this issue.
  • <> and OR predicates are not pushed to Kudu, and instead will be evaluated by the Spark task. Only LIKE predicates with a suffix wildcard are pushed to Kudu, meaning that LIKE "FOO%" is pushed down but LIKE "FOO%BAR" isn't.
  • Kudu does not support every type supported by Spark SQL. For example, Date and complex types are not supported.
  • Kudu tables may only be registered as temporary tables in SparkSQL. Kudu tables may not be queried using HiveContext.

  • Also, when writing a DataFrame to a Kudu table, the DataFrame's column names must match the Kudu table's columns one for one.
  • Where Kudu partitioning is involved, a DataFrame can only write into partitions that already exist; inserting rows that belong to a non-existent partition fails.
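For operations the DataFrame writer doesn't cover well (existence checks, upserts, deletes), kudu-spark also ships a KuduContext. A sketch of guarding the write with it, reusing the table and master address from above — upsertRows sidesteps the primary-key collisions a plain insert would throw on:

```scala
import org.apache.kudu.spark.kudu.KuduContext

// KuduContext takes the master address and the active SparkContext
val kuduContext = new KuduContext("server:7051", spark.sparkContext)

if (kuduContext.tableExists("impala::kudu_appbind_test")) {
  // insert-or-update on the (md5, userid, datetime_) primary key
  kuduContext.upsertRows(kuduDF, "impala::kudu_appbind_test")
}
```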

Originally published at: blog.csdn.net/lzw2016/art…

For more big-data tips, see: github.com/josonle/Cod… and github.com/josonle/Big…