Spark Usage: Fixing Append-Mode Writes to a MySQL Table That Wipe Existing Data
阿新 · Published 2018-12-17
Environment: Spark 2.2.0 (CDH), Scala 2.11.8.
The problem: when appending to a MySQL table with SaveMode.Append, Spark first cleared the existing data and only then wrote the new rows. This left us speechless: we asked for Append and got an overwrite.
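For reference, the write path that exhibited this behaviour is the standard DataFrameWriter JDBC call. A minimal sketch of the reproduction, where the host, database, and credentials are placeholders and df is an existing DataFrame:

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "root")         // placeholder credentials
props.setProperty("password", "secret")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// Despite SaveMode.Append, the rows already in the table were cleared first.
df.write.mode(SaveMode.Append).jdbc("jdbc:mysql://host:3306/databaseName", "page_visit_day", props)
```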
The fix: rewrite two classes from the Spark source. Just drop the two classes below into your own project; there is no need to patch Spark's underlying source code.
1. JdbcUtils
The original code reads:

```scala
if (mode == SaveMode.Append && tableExists) {
  truncateTable(conn, table)
  tableExists = true
}
```

Simply delete the truncateTable(conn, table) call.
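After the edit, the Append branch leaves the existing rows alone and only records that the table is already there:

```scala
if (mode == SaveMode.Append && tableExists) {
  // truncateTable(conn, table)  // removed: Append must not clear the table
  tableExists = true
}
```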
The complete class:
```scala
import java.sql.{Connection, DriverManager, PreparedStatement}
import java.util.Properties

import scala.util.Try

import org.apache.spark.internal.Logging
import org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper
import org.apache.spark.sql.jdbc.JdbcDialects
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row, SaveMode}

/**
 * Util functions for JDBC tables.
 */
object JdbcUtils extends Logging {

  def jdbc(mode: SaveMode, url: String, df: DataFrame, table: String,
      connectionProperties: Properties): Unit = {
    val props = new Properties()
    props.putAll(connectionProperties)
    val conn = JdbcUtils.createConnection(url, props)
    try {
      var tableExists = JdbcUtils.tableExists(conn, table)
      if (mode == SaveMode.Ignore && tableExists) {
        return
      }
      if (mode == SaveMode.ErrorIfExists && tableExists) {
        sys.error(s"Table $table already exists.")
      }
      if (mode == SaveMode.Overwrite && tableExists) {
        truncateTable(conn, table)
        tableExists = true
      }
      if (mode == SaveMode.Append && tableExists) {
        // ********** Comment out or delete the line below **********
        // truncateTable(conn, table)
        tableExists = true
      }
      // Create the table if it didn't exist.
      if (!tableExists) {
        val schema = JdbcUtils.schemaString(df, url)
        val sql = s"CREATE TABLE $table ($schema)"
        conn.prepareStatement(sql).executeUpdate()
      }
      JdbcUtils.saveTable(df, url, table, props)
    } finally {
      conn.close()
    }
  }

  /**
   * Establishes a JDBC connection.
   */
  def createConnection(url: String, connectionProperties: Properties): Connection = {
    JDBCRDD.getConnector(connectionProperties.getProperty("driver"), url, connectionProperties)()
  }

  /**
   * Returns true if the table already exists in the JDBC database.
   */
  def tableExists(conn: Connection, table: String): Boolean = {
    // Somewhat hacky, but there isn't a good way to identify whether a table exists for all
    // SQL database systems, considering "table" could also include the database name.
    Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
  }

  /**
   * Drops a table from the JDBC database.
   */
  def dropTable(conn: Connection, table: String): Unit = {
    conn.prepareStatement(s"DROP TABLE $table").executeUpdate()
  }

  /**
   * Truncates a table from the JDBC database.
   */
  def truncateTable(conn: Connection, table: String): Unit = {
    conn.prepareStatement(s"TRUNCATE TABLE $table").executeUpdate()
  }

  /**
   * Returns a PreparedStatement that inserts a row into table via conn.
   */
  def insertStatement(conn: Connection, table: String, rddSchema: StructType): PreparedStatement = {
    val fields = rddSchema.fields
    val fieldsSql = new StringBuilder("(")
    var i = 0
    for (f <- fields) {
      fieldsSql.append(f.name)
      if (i == fields.length - 1) {
        fieldsSql.append(")")
      } else {
        fieldsSql.append(",")
      }
      i += 1
    }
    val sql = new StringBuilder(s"INSERT INTO $table ")
    sql.append(fieldsSql.toString())
    sql.append(" VALUES (")
    var fieldsLeft = rddSchema.fields.length
    while (fieldsLeft > 0) {
      sql.append("?")
      if (fieldsLeft > 1) sql.append(", ") else sql.append(")")
      fieldsLeft = fieldsLeft - 1
    }
    conn.prepareStatement(sql.toString())
  }

  /**
   * Saves a partition of a DataFrame to the JDBC database. This is done in
   * a single database transaction in order to avoid repeatedly inserting
   * data as much as possible.
   *
   * It is still theoretically possible for rows in a DataFrame to be
   * inserted into the database more than once if a stage somehow fails after
   * the commit occurs but before the stage can return successfully.
   *
   * This is not a closure inside saveTable() because apparently cosmetic
   * implementation changes elsewhere might easily render such a closure
   * non-Serializable. Instead, we explicitly close over all variables that
   * are used.
   */
  def savePartition(
      getConnection: () => Connection,
      table: String,
      iterator: Iterator[Row],
      rddSchema: StructType,
      nullTypes: Array[Int]): Iterator[Byte] = {
    val conn = getConnection()
    var committed = false
    try {
      conn.setAutoCommit(false) // Everything in the same db transaction.
      val stmt = insertStatement(conn, table, rddSchema)
      try {
        while (iterator.hasNext) {
          val row = iterator.next()
          val numFields = rddSchema.fields.length
          var i = 0
          while (i < numFields) {
            if (row.isNullAt(i)) {
              stmt.setNull(i + 1, nullTypes(i))
            } else {
              rddSchema.fields(i).dataType match {
                case IntegerType => stmt.setInt(i + 1, row.getInt(i))
                case LongType => stmt.setLong(i + 1, row.getLong(i))
                case DoubleType => stmt.setDouble(i + 1, row.getDouble(i))
                case FloatType => stmt.setFloat(i + 1, row.getFloat(i))
                case ShortType => stmt.setInt(i + 1, row.getShort(i))
                case ByteType => stmt.setInt(i + 1, row.getByte(i))
                case BooleanType => stmt.setBoolean(i + 1, row.getBoolean(i))
                case StringType => stmt.setString(i + 1, row.getString(i))
                case BinaryType => stmt.setBytes(i + 1, row.getAs[Array[Byte]](i))
                case TimestampType => stmt.setTimestamp(i + 1, row.getAs[java.sql.Timestamp](i))
                case DateType => stmt.setDate(i + 1, row.getAs[java.sql.Date](i))
                case t: DecimalType => stmt.setBigDecimal(i + 1, row.getDecimal(i))
                case _ => throw new IllegalArgumentException(
                  s"Can't translate non-null value for field $i")
              }
            }
            i = i + 1
          }
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
      }
      conn.commit()
      committed = true
    } finally {
      if (!committed) {
        // The stage must fail. We got here through an exception path, so
        // let the exception through unless rollback() or close() want to
        // tell the user about another problem.
        conn.rollback()
        conn.close()
      } else {
        // The stage must succeed. We cannot propagate any exception close() might throw.
        try {
          conn.close()
        } catch {
          case e: Exception => logWarning("Transaction succeeded, but closing failed", e)
        }
      }
    }
    Array[Byte]().iterator
  }

  /**
   * Compute the schema string for this RDD.
   */
  def schemaString(df: DataFrame, url: String): String = {
    val sb = new StringBuilder()
    val dialect = JdbcDialects.get(url)
    df.schema.fields.foreach { field =>
      val name = field.name
      val typ: String = dialect.getJDBCType(field.dataType).map(_.databaseTypeDefinition).getOrElse(
        field.dataType match {
          case IntegerType => "INTEGER"
          case LongType => "BIGINT"
          case DoubleType => "DOUBLE PRECISION"
          case FloatType => "REAL"
          case ShortType => "INTEGER"
          case ByteType => "BYTE"
          case BooleanType => "BIT(1)"
          case StringType => "TEXT"
          case BinaryType => "BLOB"
          case TimestampType => "TIMESTAMP"
          case DateType => "DATE"
          case t: DecimalType => s"DECIMAL(${t.precision},${t.scale})"
          case _ => throw new IllegalArgumentException(s"Don't know how to save $field to JDBC")
        })
      val nullable = if (field.nullable) "" else "NOT NULL"
      sb.append(s", $name $typ $nullable")
    }
    if (sb.length < 2) "" else sb.substring(2)
  }

  /**
   * Saves the RDD to the database in a single transaction.
   */
  def saveTable(
      df: DataFrame,
      url: String,
      table: String,
      properties: Properties = new Properties()): Unit = {
    val dialect = JdbcDialects.get(url)
    val nullTypes: Array[Int] = df.schema.fields.map { field =>
      dialect.getJDBCType(field.dataType).map(_.jdbcNullType).getOrElse(
        field.dataType match {
          case IntegerType => java.sql.Types.INTEGER
          case LongType => java.sql.Types.BIGINT
          case DoubleType => java.sql.Types.DOUBLE
          case FloatType => java.sql.Types.REAL
          case ShortType => java.sql.Types.INTEGER
          case ByteType => java.sql.Types.INTEGER
          case BooleanType => java.sql.Types.BIT
          case StringType => java.sql.Types.CLOB
          case BinaryType => java.sql.Types.BLOB
          case TimestampType => java.sql.Types.TIMESTAMP
          case DateType => java.sql.Types.DATE
          case t: DecimalType => java.sql.Types.DECIMAL
          case _ => throw new IllegalArgumentException(
            s"Can't translate null value for field $field")
        })
    }

    val rddSchema = df.schema
    val driver: String = getDriverClassName(url)
    val getConnection: () => Connection = JDBCRDD.getConnector(driver, url, properties)
    df.foreachPartition { iterator =>
      savePartition(getConnection, table, iterator, rddSchema, nullTypes)
    }
  }

  def getDriverClassName(url: String): String = DriverManager.getDriver(url) match {
    case wrapper: DriverWrapper => wrapper.wrapped.getClass.getCanonicalName
    case driver => driver.getClass.getCanonicalName
  }
}
```
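One side effect worth knowing: if the target table does not exist yet, jdbc() creates it from the DataFrame schema via schemaString. A rough, hypothetical illustration for a DataFrame with columns (day: String, visits: Long):

```scala
// Hypothetical: with the default dialect fallback this prints something like
//   day TEXT , visits BIGINT
// so jdbc() would run: CREATE TABLE page_visit_day (day TEXT , visits BIGINT)
println(JdbcUtils.schemaString(df, "jdbc:mysql://host:3306/databaseName"))
```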
2. JDBCRDD. This class also has to be rewritten because the JdbcUtils above uses it; leaving it out causes an error. Strip out the rest and keep only what is needed; the result is:
```scala
import java.sql.{Connection, DriverManager}
import java.util.Properties

import org.apache.spark.internal.Logging
import org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry

private object JDBCRDD extends Logging {

  /**
   * Returns a function that opens a JDBC connection, registering the
   * driver class first so it is available where the function is invoked.
   */
  def getConnector(driver: String, url: String, properties: Properties): () => Connection = {
    () => {
      try {
        if (driver != null) DriverRegistry.register(driver)
      } catch {
        case e: ClassNotFoundException =>
          logWarning(s"Couldn't find class $driver", e)
      }
      DriverManager.getConnection(url, properties)
    }
  }
}
```
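As a quick sanity check, the factory can be exercised on the driver node before handing it to executors. A minimal sketch; the URL and credentials here are placeholders:

```scala
// Hypothetical placeholders: substitute your real URL and properties.
val url = "jdbc:mysql://host:3306/databaseName"
val prop = new java.util.Properties()
prop.setProperty("user", "root")
prop.setProperty("password", "secret")

val connect = JDBCRDD.getConnector("com.mysql.jdbc.Driver", url, prop)
val conn = connect()
try {
  println(conn.getMetaData.getDatabaseProductName) // prints "MySQL" on success
} finally {
  conn.close()
}
```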
Finally, the call, where df is the DataFrame to write and prop holds the JDBC connection properties:

```scala
val url = "jdbc:mysql://*****:3306/databaseName"
JdbcUtils.jdbc(SaveMode.Append, url, df, "page_visit_day", prop)
```
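Putting it all together, a minimal end-to-end sketch; the sample DataFrame and credentials are placeholders, and the host stays masked as in the original:

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("AppendToMysql").getOrCreate()
import spark.implicits._

// Hypothetical sample data; substitute your real DataFrame.
val df = Seq(("2018-12-17", 1024L)).toDF("day", "visits")

val prop = new Properties()
prop.setProperty("user", "root")        // placeholder credentials
prop.setProperty("password", "secret")
prop.setProperty("driver", "com.mysql.jdbc.Driver")

val url = "jdbc:mysql://*****:3306/databaseName"
JdbcUtils.jdbc(SaveMode.Append, url, df, "page_visit_day", prop)

// Re-running the job now appends new rows instead of wiping the table.
```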