Accessing a PostgreSQL database from Spark via JDBC
1. First, you need a usable PostgreSQL JDBC driver
$ locate jdbc | grep postgres
/mnt/hd01/www/html/deltasql/clients/java/dbredactor/lib/postgresql-8.2-507.jdbc4.jar
/usr/lib/ruby/gems/1.8/gems/railties-3.2.13/lib/rails/generators/rails/app/templates/config/databases/jdbcpostgresql.yml
/usr/src/postgis-2.0.0/java/jdbc/src/org/postgresql
/usr/src/postgis-2.0.0/java/jdbc/src/org/postgresql/driverconfig.properties
/usr/src/postgis-2.0.0/java/jdbc/stubs/org/postgresql
/usr/src/postgis-2.0.0/java/jdbc/stubs/org/postgresql/Connection.java
/usr/src/postgis-2.0.0/java/jdbc/stubs/org/postgresql/PGConnection.java
/usr/src/postgis-2.1.0/java/jdbc/src/org/postgresql
/usr/src/postgis-2.1.0/java/jdbc/src/org/postgresql/driverconfig.properties
/usr/src/postgis-2.1.0/java/jdbc/stubs/org/postgresql
/usr/src/postgis-2.1.0/java/jdbc/stubs/org/postgresql/Connection.java
/usr/src/postgis-2.1.0/java/jdbc/stubs/org/postgresql/PGConnection.java

None of these is suitable, so download one from the official site: https://jdbc.postgresql.org/download/postgresql-9.4-1205.jdbc4.jar
2. Put the downloaded jar file under $SPARK_HOME/lib
3. Start the spark-shell
$ SPARK_CLASSPATH=$SPARK_HOME/lib/postgresql-9.4-1205.jdbc4.jar $SPARK_HOME/bin/spark-shell
. . .
Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath
15/11/04 17:53:05 WARN SparkConf: Setting 'spark.executor.extraClassPath' to '/usr/src/data-integration/lib/postgresql-9.3-1102-jdbc4.jar' as a work-around.
15/11/04 17:53:05 WARN SparkConf: Setting 'spark.driver.extraClassPath' to '/usr/src/data-integration/lib/postgresql-9.3-1102-jdbc4.jar' as a work-around.
15/11/04 17:53:06 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/11/04 17:53:07 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/11/04 17:53:09 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/04 17:53:09 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/04 17:53:25 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/11/04 17:53:25 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/11/04 17:53:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/04 17:53:29 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/11/04 17:53:29 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.
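Before going further, it is worth confirming that the driver really is visible to the shell. A minimal check (not part of the original session, added for illustration; org.postgresql.Driver is the standard class name shipped in the PostgreSQL JDBC jar):

scala> Class.forName("org.postgresql.Driver")   // throws ClassNotFoundException if the jar was not picked up
res0: Class[_] = class org.postgresql.Driver

If this throws a ClassNotFoundException, the classpath setting above did not take effect and the DataFrame step below will fail when it tries to open the connection.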
4. Create a DataFrame
scala> val df = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://localhost:5434/cd03?user=cd03&password=cd03", "dbtable" -> "test_trans"))
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [trans_date: string, trans_prd: int, trans_cust: int]

The standard (non-deprecated) form since Spark 1.4 is:

val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
      "dbtable" -> "schema.tablename")).load()
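Applied to this session's database, the reader API looks like the sketch below. This is a sketch only: the host, port, database, user, and table are the ones from the example above, and read.jdbc with a java.util.Properties object is simply an alternative spelling of the same datasource call available since Spark 1.4:

// credentials passed as connection properties instead of in the URL
val props = new java.util.Properties()
props.setProperty("user", "cd03")
props.setProperty("password", "cd03")

// builds the same DataFrame as df above
val jdbcDF2 = sqlContext.read.jdbc("jdbc:postgresql://localhost:5434/cd03", "test_trans", props)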
5. Inspect the schema
scala> df.printSchema()
root
 |-- trans_date: string (nullable = true)
 |-- trans_prd: integer (nullable = true)
 |-- trans_cust: integer (nullable = true)
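Note that trans_date comes back as a plain string, not a date. If date semantics are needed, the column can be cast after loading. A minimal sketch, assuming Spark 1.5+ (where org.apache.spark.sql.functions.to_date is available) and that strings like 2015-5-20 parse under the default string-to-date cast rules:

scala> import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.functions.{col, to_date}

scala> val typed = df.withColumn("trans_date", to_date(col("trans_date")))
typed: org.apache.spark.sql.DataFrame = [trans_date: date, trans_prd: int, trans_cust: int]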
6. A simple computation
scala> df.filter(df("trans_cust") > 9999999).select("trans_date", "trans_prd").show
+----------+---------+
|trans_date|trans_prd|
+----------+---------+
| 2015-5-20|     2007|
| 2015-7-24|     5638|
| 2015-5-19|     8182|
| 2015-2-24|    11391|
| 2015-8-13|    17341|
| 2015-2-22|    10996|
| 2015-1-17|    15284|
|  2015-1-8|    16090|
| 2015-1-25|    13528|
| 2015-1-17|     9498|
| 2015-9-25|     7235|
| 2015-8-19|     4084|
| 2015-4-24|    16637|
| 2015-5-27|    13829|
| 2015-0-13|    13956|
| 2015-3-19|    11974|
| 2015-10-5|     1185|
| 2015-3-28|     9412|
| 2015-6-13|    15203|
| 2015-2-14|    10087|
+----------+---------+
only showing top 20 rows
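From here the whole DataFrame API is available, and results can be written back to PostgreSQL over the same connection. A sketch for illustration only (the aggregation and the target table name trans_by_date are assumptions, not part of the original session; props is the java.util.Properties object built in step 4):

import org.apache.spark.sql.functions.count

// transactions per day, for the large-customer subset shown above
val byDate = df.filter(df("trans_cust") > 9999999)
               .groupBy("trans_date")
               .agg(count("trans_prd").as("n_trans"))

// write the result back to a new table over the same JDBC connection
byDate.write.jdbc("jdbc:postgresql://localhost:5434/cd03", "trans_by_date", props)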