Part 10: Spark SQL Source Code Analysis: Querying In-Memory Columnar Storage

阿新 • Published: 2017-09-26


/** Spark SQL source code analysis series */

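The previous article showed that Spark SQL In-Memory Columnar Storage keeps its data in a column-oriented layout.

Given that layout, how is data cached inside the JVM actually queried? This article walks through how In-Memory data is read.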

I. Introduction

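This example uses the hive console to query the src table after it has been cached: select value from src

Once src has been cached in memory, querying it again lets us observe the internal calls through the analyzed execution plan.

After parsing, the plan contains an InMemoryRelation node, and when the physical plan finally executes, the methods of the InMemoryColumnarTableScan node are called.

As shown below: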


scala> val exe = executePlan(sql("select value from src").queryExecution.analyzed)
14/09/26 10:30:26 INFO parse.ParseDriver: Parsing command: select value from src
14/09/26 10:30:26 INFO parse.ParseDriver: Parse Completed
exe: org.apache.spark.sql.hive.test.TestHive.QueryExecution =

== Parsed Logical Plan ==
Project [value#5]
InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)
== Analyzed Logical Plan ==
Project [value#5]
InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)
== Optimized Logical Plan ==
Project [value#5]
InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)
== Physical Plan ==
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)) // entry point for querying the in-memory table
Code Generation: false
== RDD ==

II. InMemoryColumnarTableScan

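InMemoryColumnarTableScan is a leaf node in Catalyst. It holds the attributes to query and an InMemoryRelation, which wraps the cached In-Memory Columnar Storage data structure. Executing the leaf node triggers its execute method, which queries the in-memory data:

1. The query goes through InMemoryRelation and operates on each partition of the in-memory data structure it wraps.
2. It takes the requested attributes; in the example above, the query requests the value attribute of the src table.
3. From the requested expressions, it finds the index of each requested column in the storage structure.
4. It reads each buffer through a ColumnAccessor, extracts the requested data, and returns it wrapped in Row objects.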
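The four steps above can be sketched in plain Scala without Spark. This is a minimal illustration under stated assumptions, not Spark's implementation: `Attr`, `Batch`, and `scan` are hypothetical names, each cached batch is modeled as one `Vector[Any]` per column rather than a serialized ByteBuffer, and an iterator of batches stands in for one partition.

```scala
object ColumnarScanSketch {
  // Hypothetical stand-in for a Catalyst attribute: a name plus a unique expression id.
  final case class Attr(name: String, exprId: Long)

  // One cached batch: one column vector per attribute of the relation (column-major layout).
  type Batch = Vector[Vector[Any]]

  // Returns the requested columns of every batch as rows, in batch order.
  def scan(
      relationOutput: Seq[Attr],
      requested: Seq[Attr],
      batches: Iterator[Batch]): Iterator[Seq[Any]] = {
    // Step 3: resolve each requested attribute to its column ordinal by exprId.
    // Like the original code, fall back to the first column when nothing is requested.
    val ordinals: Seq[Int] =
      if (requested.isEmpty) Seq(0)
      else requested.map(a => relationOutput.indexWhere(_.exprId == a.exprId))

    batches.flatMap { batch =>
      // Step 4: pick only the requested columns, then assemble rows by position.
      val columns = ordinals.map(batch(_))
      val rowCount = columns.head.length
      Iterator.range(0, rowCount).map(i => columns.map(_(i)))
    }
  }

  def main(args: Array[String]): Unit = {
    val key = Attr("key", 4L)
    val value = Attr("value", 5L)
    val batch: Batch = Vector(Vector(238, 86), Vector("val_238", "val_86"))
    // Only the value column is touched; the key column is never read.
    scan(Seq(key, value), Seq(value), Iterator(batch)).foreach(println)
  }
}
```

The point of the layout is visible in `scan`: columns that were not requested are skipped by ordinal, so they are never decoded.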



private[sql] case class InMemoryColumnarTableScan(
    attributes: Seq[Attribute],
    relation: InMemoryRelation)
  extends LeafNode {

  override def output: Seq[Attribute] = attributes

  override def execute() = {
    relation.cachedColumnBuffers.mapPartitions { iterator =>
      // Find the ordinals of the requested columns. If none are requested, use the first.
      val requestedColumns = if (attributes.isEmpty) {
        Seq(0)
      } else {
        // Use each expression's exprId to find the index of the matching column's ByteBuffer.
        attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
      }
      iterator
        // Fetch the ByteBuffer of each requested column by index and wrap it in a ColumnAccessor.
        .map(batch => requestedColumns.map(batch(_)).map(ColumnAccessor(_)))
        .flatMap { columnAccessors =>
          val nextRow = new GenericMutableRow(columnAccessors.length) // length of the Row
          new Iterator[Row] {
            override def next() = {
              var i = 0
              while (i < nextRow.length) {
                // Read the next value from the ByteBuffer and store it in slot i of the row.
                columnAccessors(i).extractTo(nextRow, i)
                i += 1
              }
              nextRow
            }
            override def hasNext = columnAccessors.head.hasNext
          }
        }
    }
  }
}

The requested columns can be inspected as follows:


scala> exe.optimizedPlan
res93: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [value#5]
InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)
scala> val relation = exe.optimizedPlan(1)
relation: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None)
scala> val request_relation = exe.executedPlan
request_relation: org.apache.spark.sql.execution.SparkPlan =
InMemoryColumnarTableScan [value#5], (InMemoryRelation [key#4,value#5], false, 1000, (HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None))
scala> request_relation.output // the requested columns; only the value column is requested
res95: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)
scala> relation.output // all columns stored in the relation by default
res96: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(key#4, value#5)
scala> val attributes = request_relation.output
attributes: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = ArrayBuffer(value#5)

The whole flow is compact; the key step is step 3, which uses the ExprId to find the index of each requested column: attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))


// Find the matching column index by exprId
scala> val attr_index = attributes.map(a => relation.output.indexWhere(_.exprId == a.exprId))
attr_index: Seq[Int] = ArrayBuffer(1) // the requested column value is at index 1, so the query reads from the ByteBuffer at index 1
scala> relation.output.foreach(e=>println(e.exprId))
ExprId(4) // the ids correspond to [key#4,value#5]
ExprId(5)
scala> request_relation.output.foreach(e=>println(e.exprId))
ExprId(5)
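The lookup in step 3 matches on exprId only, never on the column name. The sketch below isolates that behavior; `Attr` and `ordinals` are hypothetical names introduced for illustration, and the one library fact relied on is that Scala's `indexWhere` returns -1 when no element matches.

```scala
object ExprIdLookupSketch {
  // Hypothetical stand-in for a Catalyst attribute.
  final case class Attr(name: String, exprId: Long)

  // Ordinal of each requested attribute in the relation's output, matched by exprId only.
  // indexWhere yields -1 for an attribute whose id is not in the relation's output.
  def ordinals(relationOutput: Seq[Attr], requested: Seq[Attr]): Seq[Int] =
    requested.map(a => relationOutput.indexWhere(_.exprId == a.exprId))

  def main(args: Array[String]): Unit = {
    val output = Seq(Attr("key", 4L), Attr("value", 5L))
    println(ordinals(output, Seq(Attr("value", 5L)))) // value#5 sits at index 1
    println(ordinals(output, Seq(Attr("v", 5L))))     // different name, same id: still resolves
    println(ordinals(output, Seq(Attr("value", 9L)))) // same name, different id: not found
  }
}
```

A -1 ordinal would fail as soon as it is used to index a batch, so the scan shown earlier relies on every requested attribute belonging to the relation's output.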

III. ColumnAccessor

There is a ColumnAccessor for each data type. The class diagram is as follows:

[Figure: ColumnAccessor class diagram]
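To make the idea of one accessor per type concrete, here is a minimal sketch of typed accessors that decode values from a `java.nio.ByteBuffer`. Everything in it is an assumption made for illustration: the `Accessor` trait, the `IntAccessor` and `StringAccessor` classes, the encode helpers, and the length-prefixed UTF-8 string encoding are hypothetical, and the null handling and compression that a real column format needs are left out. Only the method names `hasNext`, `extractSingle`, and `extractTo` echo the code in this article.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object ColumnAccessorSketch {
  // Hypothetical, simplified accessor: decodes one value at a time from a column's ByteBuffer.
  sealed trait Accessor[T] {
    protected def buffer: ByteBuffer
    def hasNext: Boolean = buffer.hasRemaining
    def extractSingle(): T
    // Writes the next decoded value into slot `ordinal` of a caller-supplied row.
    def extractTo(row: Array[Any], ordinal: Int): Unit = {
      row(ordinal) = extractSingle()
    }
  }

  final class IntAccessor(protected val buffer: ByteBuffer) extends Accessor[Int] {
    def extractSingle(): Int = buffer.getInt()
  }

  // Assumed encoding for this sketch: a 4-byte length followed by UTF-8 bytes.
  final class StringAccessor(protected val buffer: ByteBuffer) extends Accessor[String] {
    def extractSingle(): String = {
      val bytes = new Array[Byte](buffer.getInt())
      buffer.get(bytes)
      new String(bytes, StandardCharsets.UTF_8)
    }
  }

  // Helpers that build column buffers in the assumed encodings.
  def encodeInts(values: Seq[Int]): ByteBuffer = {
    val buf = ByteBuffer.allocate(values.length * 4)
    values.foreach(v => buf.putInt(v))
    buf.flip()
    buf
  }

  def encodeStrings(values: Seq[String]): ByteBuffer = {
    val encoded = values.map(_.getBytes(StandardCharsets.UTF_8))
    val buf = ByteBuffer.allocate(encoded.map(_.length + 4).sum)
    encoded.foreach { b => buf.putInt(b.length); buf.put(b) }
    buf.flip()
    buf
  }

  def main(args: Array[String]): Unit = {
    val keys = new IntAccessor(encodeInts(Seq(238, 86)))
    val values = new StringAccessor(encodeStrings(Seq("val_238", "val_86")))
    val row = new Array[Any](2)
    while (keys.hasNext) {
      keys.extractTo(row, 0)
      values.extractTo(row, 1)
      println(row.mkString(", "))
    }
  }
}
```

Each accessor keeps its own read position in its buffer, which is why the scan can advance all requested columns in lockstep and test only the first accessor for `hasNext`.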

Finally, a new iterator is returned:


new Iterator[Row] {
  override def next() = {
    var i = 0
    while (i < nextRow.length) { // number of requested columns
      // Calls columnType.setField(row, ordinal, extractSingle(buffer)) to decode the buffer.
      columnAccessors(i).extractTo(nextRow, i)
      i += 1
    }
    nextRow // return the decoded row
  }
  override def hasNext = columnAccessors.head.hasNext
}
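One detail of this iterator deserves a note: `nextRow` is allocated once and the same instance is returned from every `next()` call, so each call overwrites the values of the previous row. The sketch below reproduces that pattern with a plain `Array[Any]` in place of GenericMutableRow; `reusingRows` is a hypothetical helper written for this illustration and is not part of Spark.

```scala
object RowReuseSketch {
  // Builds rows from per-column iterators while reusing ONE mutable row,
  // mirroring how the iterator above returns the same nextRow from every next() call.
  // Like the original, it assumes at least one column is present.
  def reusingRows(columns: Seq[Iterator[Any]]): Iterator[Array[Any]] = {
    val nextRow = new Array[Any](columns.length)
    new Iterator[Array[Any]] {
      override def hasNext: Boolean = columns.head.hasNext
      override def next(): Array[Any] = {
        var i = 0
        while (i < nextRow.length) {
          nextRow(i) = columns(i).next()
          i += 1
        }
        nextRow
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // A def, so each use below gets fresh iterators.
    def cols = Seq(Iterator[Any]("val_238", "val_86"))

    // Buffering rows without copying stores the same array several times.
    val aliased = reusingRows(cols).toList
    println(aliased.map(_.mkString(",")))

    // Copying each row before buffering keeps the individual values.
    val copied = reusingRows(cols).map(_.clone()).toList
    println(copied.map(_.mkString(",")))
  }
}
```

Reusing the row avoids one allocation per record, at the cost that a consumer that holds on to rows has to copy them first.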

IV. Summary

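Querying Spark SQL In-Memory Columnar Storage is relatively simple; the query approach follows directly from the storage data structure.

On the storage side, each column is placed in its own ByteBuffer, which yields an array of ByteBuffers.

On the query side, the exprId of each requested column is used to find its index in that array, a ColumnAccessor then decodes the fields in the buffer, and the result is wrapped in Row objects and returned.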

-- EOF --

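Original article. When reposting, please credit:

Reposted from: OopsOutOfMemory盛利的Blog, author: OopsOutOfMemory

Article link: http://blog.csdn.net/oopsoom/article/details/39577419

Note: This article is licensed under Attribution-NonCommercial-NoDerivs 2.5 China Mainland (CC BY-NC-ND 2.5 CN). Reposting, sharing, and comments are welcome, but please keep the author attribution and the article link. For commercial use or licensing questions, please contact me.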


Source: http://blog.csdn.net/oopsoom/article/details/39577419
