Spark SQL (5) CacheManager

After Spark SQL produces the analyzed plan, it goes through a withCachedData step: the CacheManager keeps a registry of cached logical plans, and any fragment of the current plan that would produce the same result as a cached plan is replaced with the cached version. This happens in QueryExecution.withCachedData:

lazy val withCachedData: LogicalPlan = {
    assertAnalyzed()
    assertSupported()
    sparkSession.sharedState.cacheManager.useCachedData(analyzed)
  }
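This step can be observed directly, since every Dataset exposes its QueryExecution. Below is a minimal sketch, assuming a local SparkSession named spark (hypothetical setup, not from the original source):

import spark.implicits._

val df = spark.range(10).filter($"id" > 3)
df.cache()                                 // registers the plan with the CacheManager
println(df.queryExecution.withCachedData)  // the root is now an InMemoryRelation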

withCachedData in turn delegates to CacheManager.useCachedData:

/** Replaces segments of the given logical plan with cached versions where possible. */
def useCachedData(plan: LogicalPlan): LogicalPlan = {
  val newPlan = plan transformDown {
    // Do not lookup the cache by hint node. Hint node is special, we should ignore it when
    // canonicalizing plans, so that plans which are same except hint can hit the same cache.
    // However, we also want to keep the hint info after cache lookup. Here we skip the hint
    // node, so that the returned caching plan won't replace the hint node and drop the hint info
    // from the original plan.
    case hint: ResolvedHint => hint

    case currentFragment =>
      lookupCachedData(currentFragment)
        .map(_.cachedRepresentation.withOutput(currentFragment.output))
        .getOrElse(currentFragment)
  }

  newPlan transformAllExpressions {
    case s: SubqueryExpression => s.withNewPlan(useCachedData(s.plan))
  }
}
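Because transformDown visits every fragment of the plan, even a larger query that merely contains a cached plan as a subtree benefits from the substitution. A sketch, using the same hypothetical spark session as above:

val cached = spark.range(100).filter($"id" % 2 === 0)
cached.cache()

// The new query contains the cached fragment as a subtree; useCachedData
// swaps that subtree out, which shows up as InMemoryTableScan in the plan.
val derived = cached.select($"id" + 1)
derived.explain()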

The core of this is the CacheManager.lookupCachedData method, together with the cachedData list it searches:

/** Optionally returns cached data for the given [[LogicalPlan]]. */
def lookupCachedData(plan: LogicalPlan): Option[CachedData] = readLock {
  cachedData.asScala.find(cd => plan.sameResult(cd.plan))
}

@transient
private val cachedData = new java.util.LinkedList[CachedData]

/** Holds a cached logical plan and its data */
case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation)

From the above we can see that the CacheManager keeps a linked list of CachedData entries, each pairing a LogicalPlan with its InMemoryRelation (a leaf node), so that at execution time a matching plan fragment can be replaced directly with the cached result.
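The match in lookupCachedData is decided by LogicalPlan.sameResult, which compares canonicalized plans, so cosmetic differences such as expression ids do not prevent a cache hit. A minimal sketch of this behavior (hypothetical spark session again):

// Two independently built but semantically identical plans:
val p1 = spark.range(10).filter($"id" > 3).queryExecution.analyzed
val p2 = spark.range(10).filter($"id" > 3).queryExecution.analyzed

// sameResult compares canonicalized forms, ignoring expression ids,
// so this holds even though p1 and p2 are distinct plan objects.
assert(p1.sameResult(p2))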

This raises a question: when are entries added to this linked list? It happens when the Dataset's persist method is called, which routes to CacheManager.cacheQuery:

/**
   * Caches the data produced by the logical representation of the given [[Dataset]].
   * Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
   * recomputing the in-memory columnar representation of the underlying table is expensive.
   */
  def cacheQuery(
      query: Dataset[_],
      tableName: Option[String] = None,
      storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
    val planToCache = query.logicalPlan
    if (lookupCachedData(planToCache).nonEmpty) {
      logWarning("Asked to cache already cached data.")
    } else {
      val sparkSession = query.sparkSession
      val inMemoryRelation = InMemoryRelation(
        sparkSession.sessionState.conf.useCompression,
        sparkSession.sessionState.conf.columnBatchSize, storageLevel,
        sparkSession.sessionState.executePlan(AnalysisBarrier(planToCache)).executedPlan,
        tableName,
        planToCache.stats)
      cachedData.add(CachedData(planToCache, inMemoryRelation))
    }
  }



def persist(): this.type = {
  sparkSession.sharedState.cacheManager.cacheQuery(this)
  this
}
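A short usage sketch with the public Dataset API (the DataFrame here is hypothetical): persist also has an overload that takes an explicit StorageLevel, which is forwarded to cacheQuery, and unpersist removes the entry from the CacheManager again.

import org.apache.spark.storage.StorageLevel

val df2 = spark.range(100).toDF("id")
df2.persist(StorageLevel.MEMORY_ONLY)  // registers CachedData with MEMORY_ONLY
df2.count()                            // first action materializes the InMemoryRelation
df2.unpersist()                        // asks the CacheManager to drop the entry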

To summarize: the substitution is a top-down traversal (transformDown) of the analyzed plan that compares each fragment against the logical plans stored in cachedData; on a match, the entire subtree is replaced with the cached InMemoryRelation, and since InMemoryRelation is a leaf node the traversal naturally stops there.