
Problems encountered running Spark SQL

Execution environment:


We run Spark SQL in two ways, on a Spark standalone cluster and on a YARN cluster, operating on data stored in Hive. Hive itself is independent and can also be used directly to process the data.

The Spark SQL program itself is easy to write; Spark's bundled HiveFromSpark example is straightforward to follow.

First, running on the Spark standalone cluster:

Copy Hive's hive-site.xml configuration file into the ${SPARK_HOME}/conf directory.
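For example (assuming Hive's configuration lives under a hypothetical ${HIVE_HOME}):

cp ${HIVE_HOME}/conf/hive-site.xml ${SPARK_HOME}/conf/

Then submit the job with the following script: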

#!/bin/bash

cd $SPARK_HOME
./bin/spark-submit \
  --class com.datateam.spark.sql.HotelHive \
  --master spark://192.168.44.80:8070 \
  --executor-memory 2G \
  --total-executor-cores 10 \
  /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/jobs/spark-jobs-20141023.jar

Running the script hits the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table dw_hotel_price_log
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
……
Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : 
The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. 
Please check your CLASSPATH specification, and the name of the driver.
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:237)
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:110)
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:82)
        ... 127 more
Caused by: org.datanucleus.store.rdbms.datasource.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not
 found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
        at org.datanucleus.store.rdbms.datasource.AbstractDataSourceFactory.loadDriver(AbstractDataSourceFactory.java:58)
        at org.datanucleus.store.rdbms.datasource.BoneCPDataSourceFactory.makePooledDataSource(BoneCPDataSourceFactory.java:61)
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:217)

In other words, the MySQL JDBC connector jar cannot be found on the classpath. The fix:

Add the following option to the submit script:

    --driver-class-path /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/lib/mysql-connector-java-5.1.22-bin.jar \
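
Assembled from the pieces above, the working standalone-cluster submit script looks like this (same paths as before):

#!/bin/bash

cd $SPARK_HOME
./bin/spark-submit \
  --class com.datateam.spark.sql.HotelHive \
  --master spark://192.168.44.80:8070 \
  --executor-memory 2G \
  --total-executor-cores 10 \
  --driver-class-path /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/lib/mysql-connector-java-5.1.22-bin.jar \
  /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/jobs/spark-jobs-20141023.jar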

The standalone cluster caused few problems; most of the trouble showed up on the YARN cluster.

On the standalone cluster, reading Hive data requires hive-site.xml in Spark's conf directory. So where should the Hive configuration file go for Spark SQL to pick it up when running on YARN?

Add this option when submitting the job:

--files /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/conf/hive-site.xml \

Note that the option here is --files, which ships the file into each YARN container's working directory, not --conf.

First, the full submit script:

cd $SPARK_HOME
./bin/spark-submit --class com.qunar.datateam.spark.sql.HotelHive \
  --master yarn-cluster \
  --num-executors 10 \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 2 \
  --files /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/conf/hive-site.xml \
  /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/jobs/spark-jobs-20141023.jar

OK. As on the standalone cluster, we also need to add the MySQL connector jar. How? With --jars:

  --jars /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/lib/mysql-connector-java-5.1.22-bin.jar \

But that produces the following error:
Exception in thread "Driver" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
……

 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table tablename
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:958)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1212)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2372)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2383)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
        ... 68 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1210)
        ... 73 more
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
……
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018)
        at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
        at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
        ... 97 more

The driver now cannot load the DataNucleus classes that back the Hive metastore, so we add

datanucleus-api-jdo-3.2.1.jar, datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar

to --jars as well, but it still fails:
Exception in thread "Driver" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure:
Lost task 6.3 in stage 2.0 (TID 34, l-hbase72.data.cn8): java.io.FileNotFoundException: ./datanucleus-core-3.2.2.jar (Permission denied)
        java.io.FileOutputStream.open(Native Method)
        java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        com.google.common.io.Files$FileByteSink.openStream(Files.java:223)
        com.google.common.io.Files$FileByteSink.openStream(Files.java:211)
        com.google.common.io.ByteSource.copyTo(ByteSource.java:203)
        com.google.common.io.Files.copy(Files.java:436)

After much trial and error, the fix was to pass the jars previously given to --jars via --archives instead, so they are shipped alongside the application jar (note: no spaces in the comma-separated list):

--archives mysql-connector.jar,datanucleus-api-jdo-3.2.1.jar,datanucleus-core-3.2.2.jar,datanucleus-rdbms-3.2.1.jar \

This time it works.
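
For reference, a sketch of the final yarn-cluster submit script with all of the above applied (paths as in the earlier script; the jars listed under --archives are assumed to sit in the script's working directory):

cd $SPARK_HOME
./bin/spark-submit --class com.qunar.datateam.spark.sql.HotelHive \
  --master yarn-cluster \
  --num-executors 10 \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 2 \
  --files /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/conf/hive-site.xml \
  --archives mysql-connector.jar,datanucleus-api-jdo-3.2.1.jar,datanucleus-core-3.2.2.jar,datanucleus-rdbms-3.2.1.jar \
  /home/q/spark/spark-1.1.1-SNAPSHOT-bin-2.2.0/jobs/spark-jobs-20141023.jar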

One more thing to watch out for:

Spark SQL does not accept ";", so you cannot switch databases with a Hive statement like use mydb;. Instead, qualify the table name with the database directly in the query, e.g. SELECT * FROM mydb.mytable rather than use mydb; SELECT * FROM mytable.