Accessing a PostgreSQL database from Spark via JDBC

1. First, you need a usable PostgreSQL JDBC driver. Check whether one is already on the machine:
[user@host bin]$ locate jdbc|grep postgres
/mnt/hd01/www/html/deltasql/clients/java/dbredactor/lib/postgresql-8.2-507.jdbc4.jar
/usr/lib/ruby/gems/1.8/gems/railties-3.2.13/lib/rails/generators/rails/app/templates/config/databases/jdbcpostgresql.yml
/usr/src/postgis-2.0.0/java/jdbc/src/org/postgresql
/usr/src/postgis-2.0.0/java/jdbc/src/org/postgresql/driverconfig.properties
/usr/src/postgis-2.0.0/java/jdbc/stubs/org/postgresql
/usr/src/postgis-2.0.0/java/jdbc/stubs/org/postgresql/Connection.java
/usr/src/postgis-2.0.0/java/jdbc/stubs/org/postgresql/PGConnection.java
/usr/src/postgis-2.1.0/java/jdbc/src/org/postgresql
/usr/src/postgis-2.1.0/java/jdbc/src/org/postgresql/driverconfig.properties
/usr/src/postgis-2.1.0/java/jdbc/stubs/org/postgresql
/usr/src/postgis-2.1.0/java/jdbc/stubs/org/postgresql/Connection.java
/usr/src/postgis-2.1.0/java/jdbc/stubs/org/postgresql/PGConnection.java
None of these is suitable, so download one from the official site: https://jdbc.postgresql.org/download/postgresql-9.4-1205.jdbc4.jar

2. Put the downloaded jar file under $SPARK_HOME/lib, for example as shown below.
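A minimal sketch of steps 1 and 2 combined (assuming the machine has network access and $SPARK_HOME is set; adjust the file name to whatever version you downloaded):

wget https://jdbc.postgresql.org/download/postgresql-9.4-1205.jdbc4.jar
cp postgresql-9.4-1205.jdbc4.jar $SPARK_HOME/lib/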
3. Start the Spark shell with the driver on the classpath

[user@host bin]$ SPARK_CLASSPATH=$SPARK_HOME/lib/postgresql-9.4-1205.jdbc4.jar $SPARK_HOME/bin/spark-shell
...
Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath

15/11/04 17:53:05 WARN SparkConf: Setting 'spark.executor.extraClassPath' to '/usr/src/data-integration/lib/postgresql-9.3-1102-jdbc4.jar' as a work-around.
15/11/04 17:53:05 WARN SparkConf: Setting 'spark.driver.extraClassPath' to '/usr/src/data-integration/lib/postgresql-9.3-1102-jdbc4.jar' as a work-around.
15/11/04 17:53:06 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/11/04 17:53:07 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/11/04 17:53:09 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/04 17:53:09 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/04 17:53:25 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/11/04 17:53:25 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/11/04 17:53:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/04 17:53:29 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/04 17:53:29 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.
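SPARK_CLASSPATH still works here but is deprecated. Following the warning above, a non-deprecated launch passes the jar explicitly (a sketch; --driver-class-path puts the driver on the driver's classpath and --jars ships it to the executors):

[user@host bin]$ $SPARK_HOME/bin/spark-shell \
  --driver-class-path $SPARK_HOME/lib/postgresql-9.4-1205.jdbc4.jar \
  --jars $SPARK_HOME/lib/postgresql-9.4-1205.jdbc4.jar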

4. Create a DataFrame
scala> val df = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://localhost:5434/cd03?user=cd03&password=cd03", "dbtable" -> "test_trans"))
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [trans_date: string, trans_prd: int, trans_cust: int]
The standard (non-deprecated) form is:
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
      "dbtable" -> "schema.tablename")).load()

5. Inspect the schema
scala> df.printSchema()
root
 |-- trans_date: string (nullable = true)
 |-- trans_prd: integer (nullable = true)
 |-- trans_cust: integer (nullable = true)
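Note that trans_date was read back as a string. If real date semantics are needed, one option is to convert the column; a sketch assuming Spark 1.5+, where functions.to_date was added:

scala> import org.apache.spark.sql.functions.to_date
scala> val df2 = df.withColumn("trans_date", to_date(df("trans_date")))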

6. A simple computation
scala> df.filter(df("trans_cust")>9999999).select("trans_date","trans_prd").show
+----------+---------+
|trans_date|trans_prd|
+----------+---------+
| 2015-5-20|     2007|
| 2015-7-24|     5638|
| 2015-5-19|     8182|
| 2015-2-24|    11391|
| 2015-8-13|    17341|
| 2015-2-22|    10996|
| 2015-1-17|    15284|
|  2015-1-8|    16090|
| 2015-1-25|    13528|
| 2015-1-17|     9498|
| 2015-9-25|     7235|
| 2015-8-19|     4084|
| 2015-4-24|    16637|
| 2015-5-27|    13829|
| 2015-0-13|    13956|
| 2015-3-19|    11974|
| 2015-10-5|     1185|
| 2015-3-28|     9412|
| 2015-6-13|    15203|
| 2015-2-14|    10087|
+----------+---------+
only showing top 20 rows
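
The same query can also be written in SQL by registering the DataFrame as a temporary table (registerTempTable is the Spark 1.x API; later versions renamed it createOrReplaceTempView):

scala> df.registerTempTable("test_trans")
scala> sqlContext.sql("SELECT trans_date, trans_prd FROM test_trans WHERE trans_cust > 9999999").show()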