The Way of Spark Mastery (Advanced): Reading the Spark Source Code, Section 13: Spark SQL's SQLContext (Part 1)

Posted by 阿新 • Published: 2018-12-25

Author: 周志湖

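1. Creating a SQLContext

SQLContext is the entry point for structured data processing in Spark SQL. It is used to create DataFrames and to execute SQL. It is created as follows:

// sc is an existing SparkContext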
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

The corresponding source code is:

def this(sparkContext: SparkContext) = {
  this(sparkContext, new CacheManager, SQLContext.createListenerAndUI(sparkContext), true)
}

It calls the private primary constructor:

// 1. The CacheManager parameter of the primary constructor caches query results;
//    later queries automatically read the data from the cache.
// 2. SQLListener listens for Spark scheduler events; it extends SparkListener.
// 3. isRootContext indicates whether this is the root SQLContext.
class SQLContext private[sql](
    @transient val sparkContext: SparkContext,
    @transient protected[sql] val cacheManager: CacheManager,
    @transient private[sql] val listener: SQLListener,
    val isRootContext: Boolean)
  extends org.apache.spark.Logging with Serializable {

When spark.sql.allowMultipleContexts is set to true, multiple SQLContexts/HiveContexts can be created. Additional contexts are created with newSession:

def newSession(): SQLContext = {
    new SQLContext(
      sparkContext = sparkContext,
      cacheManager = cacheManager,
      listener = listener,
      isRootContext = false)
  }

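Here isRootContext is set to false; otherwise an exception would be thrown, because only one root SQLContext may exist. The other SQLContexts share the SparkContext, CacheManager and SQLListener with the root SQLContext. If spark.sql.allowMultipleContexts is false, only a single SQLContext is allowed to exist.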
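
The sharing relationship described above can be illustrated with a small model that does not depend on Spark. This is only a sketch: ContextSketch and SharedCacheSketch are hypothetical stand-ins, plans are plain strings, and the per-context table registry is an assumption based on catalog being a member of each SQLContext instance.

```scala
// Hypothetical stand-ins for Spark's classes; none of these names are Spark APIs.
final class SharedCacheSketch {
  private val cached = scala.collection.mutable.Set.empty[String]
  def cache(name: String): Unit = { cached += name; () }
  def isCached(name: String): Boolean = cached.contains(name)
}

final class ContextSketch private (
    val cache: SharedCacheSketch,
    val isRootContext: Boolean) {

  // Per-context state (assumption): every context owns its own table registry.
  private val tempTables = scala.collection.mutable.Map.empty[String, String]

  def registerTempTable(name: String, plan: String): Unit = { tempTables(name) = plan }
  def tableExists(name: String): Boolean = tempTables.contains(name)

  // Mirrors newSession(): reuse the shared parts and mark the result as non-root.
  def newSession(): ContextSketch = new ContextSketch(cache, isRootContext = false)
}

object ContextSketch {
  // The only way to obtain a root context in this model.
  def root(): ContextSketch = new ContextSketch(new SharedCacheSketch, isRootContext = true)
}
```

In this model, a table cached through the root context is visible from a context returned by newSession(), because both hold the same SharedCacheSketch, while a table registered on one context is not visible on the other.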

2. Core member variable: catalog

 protected[sql] lazy val catalog: Catalog = new SimpleCatalog(conf)

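The catalog is used to register tables, unregister tables, check whether a table exists, and so on. For example, when a DataFrame calls the registerTempTable method: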

// Assumes: case class Person(name: String, age: Int); import sqlContext.implicits._
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")

This invokes the registerDataFrameAsTable method of sqlContext:

def registerTempTable(tableName: String): Unit = {
    sqlContext.registerDataFrameAsTable(this, tableName)
  }

sqlContext.registerDataFrameAsTable in turn simply calls the registerTable method of catalog:

private[sql] def registerDataFrameAsTable(df: DataFrame, tableName: String): Unit = {
    catalog.registerTable(TableIdentifier(tableName), df.logicalPlan)
  }

The complete source of SimpleCatalog is as follows:

class SimpleCatalog(val conf: CatalystConf) extends Catalog {
  private[this] val tables = new ConcurrentHashMap[String, LogicalPlan]

  override def registerTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit = {
    tables.put(getTableName(tableIdent), plan)
  }

  override def unregisterTable(tableIdent: TableIdentifier): Unit = {
    tables.remove(getTableName(tableIdent))
  }

  override def unregisterAllTables(): Unit = {
    tables.clear()
  }

  override def tableExists(tableIdent: TableIdentifier): Boolean = {
    tables.containsKey(getTableName(tableIdent))
  }

  override def lookupRelation(
      tableIdent: TableIdentifier,
      alias: Option[String] = None): LogicalPlan = {
    val tableName = getTableName(tableIdent)
    val table = tables.get(tableName)
    if (table == null) {
      throw new NoSuchTableException
    }
    val tableWithQualifiers = Subquery(tableName, table)

    // If an alias was specified by the lookup, wrap the plan in a subquery so that attributes are
    // properly qualified with this alias.
    alias.map(a => Subquery(a, tableWithQualifiers)).getOrElse(tableWithQualifiers)
  }

  override def getTables(databaseName: Option[String]): Seq[(String, Boolean)] = {
    tables.keySet().asScala.map(_ -> true).toSeq
  }

  override def refreshTable(tableIdent: TableIdentifier): Unit = {
    throw new UnsupportedOperationException
  }
}
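
To make the register, look up and unregister cycle concrete, here is a minimal sketch of the same map-backed idea without any Spark dependency. MiniCatalog is a hypothetical name, a plan is represented by a plain String instead of a LogicalPlan, and NoSuchElementException stands in for Spark's NoSuchTableException.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical simplification of SimpleCatalog: a plan is just a String here.
final class MiniCatalog {
  private val tables = new ConcurrentHashMap[String, String]

  def registerTable(name: String, plan: String): Unit = { tables.put(name, plan); () }
  def unregisterTable(name: String): Unit = { tables.remove(name); () }
  def tableExists(name: String): Boolean = tables.containsKey(name)

  // Like lookupRelation: a missing table is an error rather than a null result.
  def lookupRelation(name: String): String = {
    val plan = tables.get(name)
    if (plan == null) throw new NoSuchElementException(s"table not found: $name")
    plan
  }
}
```

Registering "people" makes tableExists("people") true and lookupRelation("people") return the stored plan; after unregisterTable("people"), the lookup throws.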

3. Core member variable: sqlParser

sqlParser is defined in SQLContext as:

protected[sql] val sqlParser = new SparkSQLParser(getSQLDialect().parse(_))

SparkSQLParser is the top-level Spark SQL parser; it parses the SQL syntax supported by Spark SQL. It is declared as follows:

private[sql] class SparkSQLParser(fallback: String => LogicalPlan) extends AbstractSparkSQLParser

The fallback function parses any statement that is not part of the Spark SQL dialect handled by SparkSQLParser itself.
The keywords of that dialect are:

  protected val AS = Keyword("AS")
  protected val CACHE = Keyword("CACHE")
  protected val CLEAR = Keyword("CLEAR")
  protected val DESCRIBE = Keyword("DESCRIBE")
  protected val EXTENDED = Keyword("EXTENDED")
  protected val FUNCTION = Keyword("FUNCTION")
  protected val FUNCTIONS = Keyword("FUNCTIONS")
  protected val IN = Keyword("IN")
  protected val LAZY = Keyword("LAZY")
  protected val SET = Keyword("SET")
  protected val SHOW = Keyword("SHOW")
  protected val TABLE = Keyword("TABLE")
  protected val TABLES = Keyword("TABLES")
  protected val UNCACHE = Keyword("UNCACHE")

4. Core member variable: ddlParser

ddlParser parses DDL (Data Definition Language) statements:

 protected[sql] val ddlParser = new DDLParser(sqlParser.parse(_))

The keywords it supports are:

  protected val CREATE = Keyword("CREATE")
  protected val TEMPORARY = Keyword("TEMPORARY")
  protected val TABLE = Keyword("TABLE")
  protected val IF = Keyword("IF")
  protected val NOT = Keyword("NOT")
  protected val EXISTS = Keyword("EXISTS")
  protected val USING = Keyword("USING")
  protected val OPTIONS = Keyword("OPTIONS")
  protected val DESCRIBE = Keyword("DESCRIBE")
  protected val EXTENDED = Keyword("EXTENDED")
  protected val AS = Keyword("AS")
  protected val COMMENT = Keyword("COMMENT")
  protected val REFRESH = Keyword("REFRESH")

The parser does three main things: it creates, describes and refreshes tables.

protected lazy val ddl: Parser[LogicalPlan] = createTable | describeTable | refreshTable
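
Because ddlParser is constructed with sqlParser.parse as its fallback, and sqlParser with the dialect's parse function, a statement passes down a chain until some parser accepts it. The sketch below models that delegation only. It is an assumption-laden simplification: ParserChainSketch and keywordParser are hypothetical names, a "plan" is a String tag, and each stage looks only at the leading keyword, whereas the real parsers match a full grammar.

```scala
// Hypothetical model: the returned String names the parser that accepted the statement.
object ParserChainSketch {
  type Parse = String => String

  // Builds a parser that handles statements starting with one of `keywords`
  // and hands everything else to `fallback`.
  def keywordParser(label: String, keywords: Set[String], fallback: Parse): Parse = { sql =>
    val first = sql.trim.split("\\s+").headOption.getOrElse("").toUpperCase
    if (keywords.contains(first)) s"$label: ${sql.trim}" else fallback(sql)
  }

  // End of the chain: the general SQL dialect accepts whatever reaches it.
  val dialect: Parse = sql => s"dialect: ${sql.trim}"

  val sqlParser: Parse =
    keywordParser("sqlParser", Set("CACHE", "UNCACHE", "CLEAR", "SET", "SHOW"), dialect)

  val ddlParser: Parse =
    keywordParser("ddlParser", Set("CREATE", "DESCRIBE", "REFRESH"), sqlParser)
}
```

With this model, ParserChainSketch.ddlParser("REFRESH TABLE t") is handled by the first stage, "cache table t" falls through to the second, and "SELECT 1" reaches the dialect stage.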

The createTable parser is defined as follows (see the comments for what each form does):

/**
   * `CREATE [TEMPORARY] TABLE avroTable [IF NOT EXISTS]
   * USING org.apache.spark.sql.avro
   * OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")`
   * or
   * `CREATE [TEMPORARY] TABLE avroTable(intField int, stringField string...) [IF NOT EXISTS]
   * USING org.apache.spark.sql.avro
   * OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")`
   * or
   * `CREATE [TEMPORARY] TABLE avroTable [IF NOT EXISTS]
   * USING org.apache.spark.sql.avro
   * OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")`
   * AS SELECT ...
   */
  protected lazy val createTable: Parser[LogicalPlan] = {
    // TODO: Support database.table.
    (CREATE ~> TEMPORARY.? <~ TABLE) ~ (IF ~> NOT <~ EXISTS).? ~ tableIdentifier ~
      tableCols.? ~ (USING ~> className) ~ (OPTIONS ~> options).? ~ (AS ~> restInput).? ^^ {
      case temp ~ allowExisting ~ tableIdent ~ columns ~ provider ~ opts ~ query =>
        if (temp.isDefined && allowExisting.isDefined) {
          throw new DDLException(
            "a CREATE TEMPORARY TABLE statement does not allow IF NOT EXISTS clause.")
        }

        val options = opts.getOrElse(Map.empty[String, String])
        if (query.isDefined) {
          if (columns.isDefined) {
            throw new DDLException(
              "a CREATE TABLE AS SELECT statement does not allow column definitions.")
          }
          // When IF NOT EXISTS clause appears in the query, the save mode will be ignore.
          val mode = if (allowExisting.isDefined) {
            SaveMode.Ignore
          } else if (temp.isDefined) {
            SaveMode.Overwrite
          } else {
            SaveMode.ErrorIfExists
          }

          val queryPlan = parseQuery(query.get)
          CreateTableUsingAsSelect(tableIdent,
            provider,
            temp.isDefined,
            Array.empty[String],
            mode,
            options,
            queryPlan)
        } else {
          val userSpecifiedSchema = columns.flatMap(fields => Some(StructType(fields)))
          CreateTableUsing(
            tableIdent,
            userSpecifiedSchema,
            provider,
            temp.isDefined,
            options,
            allowExisting.isDefined,
            managedIfNoPath = false)
        }
    }
  }
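
The save-mode decision inside the CREATE TABLE ... AS SELECT branch is easier to check when isolated. The following sketch reproduces only that branching logic; CtasModeSketch and its Mode values are hypothetical stand-ins for org.apache.spark.sql.SaveMode, and the rejected combination is returned as a Left instead of throwing DDLException.

```scala
object CtasModeSketch {
  // Hypothetical stand-in for SaveMode, limited to the three values used in createTable.
  sealed trait Mode
  case object Ignore extends Mode
  case object Overwrite extends Mode
  case object ErrorIfExists extends Mode

  // Returns Left(message) for the combination the parser rejects, otherwise the chosen mode.
  def choose(temporary: Boolean, ifNotExists: Boolean): Either[String, Mode] =
    if (temporary && ifNotExists)
      Left("a CREATE TEMPORARY TABLE statement does not allow IF NOT EXISTS clause.")
    else if (ifNotExists) Right(Ignore)      // IF NOT EXISTS: keep an existing table untouched
    else if (temporary) Right(Overwrite)     // TEMPORARY: replace any previous definition
    else Right(ErrorIfExists)                // plain CREATE TABLE: fail if the table exists
}
```

For example, choose(temporary = false, ifNotExists = true) yields Right(Ignore), and choose(temporary = true, ifNotExists = true) yields a Left carrying the error message.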

The code for describeTable and refreshTable is as follows:

 /*
   * describe [extended] table avroTable
   * This will display all columns of table `avroTable` includes column_name,column_type,comment
   */
  protected lazy val describeTable: Parser[LogicalPlan] =
    (DESCRIBE ~> opt(EXTENDED)) ~ tableIdentifier ^^ {
      case e ~ tableIdent =>
        DescribeCommand(UnresolvedRelation(tableIdent, None), e.isDefined)
    }

  protected lazy val refreshTable: Parser[LogicalPlan] =
    REFRESH ~> TABLE ~> tableIdentifier ^^ {
      case tableIndet =>
        RefreshTable(tableIndet)
    }
