Spark (14): Integrating SparkSQL with Hive
By 阿新 · Published 2020-08-10
1. The Built-in Hive
If you use Spark's built-in Hive, no setup is required; it works out of the box.
Hive's metadata is stored in Derby, and the default warehouse location is $SPARK_HOME/spark-warehouse.
In practice, however, the built-in Hive is almost never used.
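As a quick illustration (a minimal sketch; the table name demo is made up), everything works directly in spark-shell, with table data landing under the local spark-warehouse directory:

scala> spark.sql("create table demo(id int, name string)")  // metadata goes to the embedded Derby metastore
scala> spark.sql("show tables").show()                      // the demo table should be listed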
2. Integrating an External Hive
spark-shell
① Copy hive-site.xml from the Hive configuration directory (/opt/module/hive/conf/hive-site.xml) into $SPARK_HOME/conf/
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC URL; hive_metastore is the MySQL database that stores the metadata -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/hive_metastore?useSSL=false</value>
    </property>
    <!-- JDBC driver -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- MySQL login username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!-- MySQL login password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>
    <!-- Hive's default working directory on HDFS, where table data is stored -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <!-- Disable metastore schema version verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Address of the metastore service -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop102:9083</value>
    </property>
    <!-- Port for hiveserver2 connections -->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <!-- Host for hiveserver2 connections -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop102</value>
    </property>
</configuration>
② Copy the MySQL driver from Hive's /lib directory into $SPARK_HOME/jars/
③ Restart spark-shell; you can now work with the Hive data directly.
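A quick smoke test after the restart (this assumes the external metastore already contains the hive_test database used later in this post):

scala> spark.sql("show databases").show()   // should list databases from the MySQL-backed metastore
scala> spark.sql("use hive_test")
scala> spark.sql("show tables").show()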
Developing in IDEA
① pom dependencies
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>3.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.27</version>
</dependency>
<!-- A Jackson error may occur; add this dependency to fix it -->
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-core</artifactId>
    <version>2.10.1</version>
</dependency>
② Add a hive-site.xml to the resources directory, keeping the metastore connection configuration:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Address of the metastore service -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop102:9083</value>
    </property>
</configuration>
③ Test
import org.apache.spark.sql.SparkSession

/**
 * @description: connect to Hive from IDEA through SparkSession
 * @author: HaoWu
 * @create: 2020-08-10
 */
object SparkHiveTest {
  def main(args: Array[String]): Unit = {
    // Impersonate a user that has write permission on the HDFS warehouse directory
    System.setProperty("HADOOP_USER_NAME", "hadoop")
    val sparkSession: SparkSession = SparkSession.builder
      // Point the warehouse at HDFS; otherwise tables are created in a local spark-warehouse directory
      .config("spark.sql.warehouse.dir", "hdfs://hadoop102:8020/user/hive/warehouse")
      .enableHiveSupport() // required to talk to the Hive metastore
      .master("local[*]")
      .appName("sparksql")
      .getOrCreate()
    sparkSession.sql("use hive_test").show()
    sparkSession.sql("create table ideatest(name string, age int)").show()
    sparkSession.stop()
  }
}
Result:
+----+----------+-----+
| uid|subject_id|score|
+----+----------+-----+
|1001| 01| 90|
|1001| 02| 90|
|1001| 03| 90|
|1002| 01| 85|
|1002| 02| 85|
|1002| 03| 70|
|1003| 01| 70|
|1003| 02| 70|
|1003| 03| 85|
+----+----------+-----+
Note: databases created from the development tool are placed in the local warehouse by default; change the warehouse location with the parameter: config("spark.sql.warehouse.dir", "hdfs://hadoop102:8020/user/hive/warehouse")
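To check where a database actually lands, you can inspect its location right after creating it (a minimal sketch; the database name idea_test mirrors the one in the FAQ below). Note that the warehouse directory must be set before the first SparkSession is created; per the SharedState warning in FAQ 1, setting it on an existing session has no effect:

sparkSession.sql("create database if not exists idea_test")
sparkSession.sql("describe database idea_test").show(truncate = false) // location should point at hdfs://hadoop102:8020/user/hive/warehouse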
FAQ
1. Fixed by adding the hive-site.xml configuration file. Symptom:
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/08/10 14:19:36 WARN SharedState: Not allowing to set spark.sql.warehouse.dir or hive.metastore.warehouse.dir in SparkSession's options, it should be set statically for cross-session usages
20/08/10 14:19:41 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/08/10 14:19:41 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
20/08/10 14:19:44 WARN MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
20/08/10 14:19:44 WARN MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
20/08/10 14:19:44 WARN MetaData: Metadata has jdbc-type of null yet this is not valid. Ignored
2. Fixed by adding the Jackson dependency. Symptom:
Exception in thread "main" java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/exc/InputCoercionException
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:48)
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule.$init$(ScalaNumberDeserializersModule.scala:60)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:18)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:36)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
3. Fixed by adding the following line at the very beginning of the code: System.setProperty("HADOOP_USER_NAME", "root"). Symptom:
Exception in thread "main" org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to create database path file:/D:/SoftWare/idea-2019.2.3/wordspace/spark-warehouse/idea_test.db, failed to create database idea_test);