A First Look at Spark SQL
Spark SQL was already introduced back at Spark Summit 2013, though the talks focused mostly on the Catalyst query-optimization framework. After a year of development, at this year's Spark Summit 2014, Databricks announced it was abandoning Shark development in favor of Spark SQL, on the grounds that Shark inherited too much from Hive and its optimization had hit a bottleneck, as shown in the figure:
Today I checked out the latest Spark code and gave it a quick test.
1. Building Spark SQL
-bash-3.2$ git config --global http.sslVerify false
-bash-3.2$ git clone https://github.com/apache/spark.git
Cloning into 'spark'...
remote: Reusing existing pack: 107821, done.
remote: Counting objects: 103, done.
remote: Compressing objects: 100% (72/72), done.
remote: Total 107924 (delta 20), reused 64 (delta 16)
Receiving objects: 100% (107924/107924), 69.06 MiB | 3.39 MiB/s, done.
Resolving deltas: 100% (50174/50174), done.
You still need to build first with sbt/sbt assembly (for how to build a matching version, see ...).
Running sbt/sbt hive/console will also trigger a build. The latest Spark SQL ships with a console where you can run interactive queries directly; it also provides several examples.
2. Running Spark SQL
The project ships a test harness for us. By checking the log and running find . -name TestHive*, I located it at
/app/hadoop/shengli/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala. If you are interested, open it, build it, and step through it yourself.
First, enter the console:
sbt/sbt hive/console
[info] Starting scala interpreter...
[info]
import org.apache.spark.sql.catalyst.analysis._
import org.apache.spark.sql.catalyst.dsl._
import org.apache.spark.sql.catalyst.errors._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules._
import org.apache.spark.sql.catalyst.types._
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.execution
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.test.TestHive._
import org.apache.spark.sql.parquet.ParquetTestData
Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_20).
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Let's check which methods the current runtime provides (via tab completion at the prompt):
scala>
<init> DslAttribute DslExpression DslString DslSymbol
ParquetTestData SqlCmd analyzer autoConvertJoinSize binaryToLiteral
booleanToLiteral byteToLiteral cacheTable cacheTables catalog
classOf clear clone configure contains
createParquetFile createSchemaRDD createTable decimalToLiteral describedTable
doubleToLiteral emptyResult eq equals executePlan
executeSql execution finalize floatToLiteral get
getAll getClass getHiveFile getOption hashCode
hiveDevHome hiveFilesTemp hiveHome hivePlanner hiveQTestUtilTables
hiveconf hiveql hql inRepoTests inferSchema
intToLiteral isCached joinBroadcastTables jsonFile jsonRDD
loadTestTable logger logicalPlanToSparkQuery longToLiteral metastorePath
ne notify notifyAll numShufflePartitions optimizer
originalUdfs outputBuffer parquetFile parseSql parser
planner prepareForExecution registerRDDAsTable registerTestTable reset
runHive runSqlHive sessionState set shortToLiteral
sparkContext sql stringToLiteral symbolToUnresolvedAttribute synchronized
table testTables timestampToLiteral toDebugString toString
uncacheTable wait warehousePath
We can see that the test harness exposes a testTables member; since these members are all lazy, nothing is loaded at startup.
Let's check which tables the test harness will load:
scala> testTables
14/07/02 18:45:59 INFO spark.SecurityManager: Changing view acls to: hadoop
14/07/02 18:45:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop)
14/07/02 18:46:00 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/07/02 18:46:00 INFO Remoting: Starting remoting
14/07/02 18:46:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@web02.dw:42984]
14/07/02 18:46:00 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@web02.dw:42984]
14/07/02 18:46:00 INFO spark.SparkEnv: Registering MapOutputTracker
14/07/02 18:46:00 INFO spark.SparkEnv: Registering BlockManagerMaster
14/07/02 18:46:00 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20140702184600-9e16
14/07/02 18:46:00 INFO network.ConnectionManager: Bound socket to port 48348 with id = ConnectionManagerId(web02.dw,48348)
14/07/02 18:46:00 INFO storage.MemoryStore: MemoryStore started with capacity 1097.0 MB
14/07/02 18:46:00 INFO storage.BlockManagerMaster: Trying to register BlockManager
14/07/02 18:46:00 INFO storage.BlockManagerInfo: Registering block manager web02.dw:48348 with 1097.0 MB RAM
14/07/02 18:46:00 INFO storage.BlockManagerMaster: Registered BlockManager
14/07/02 18:46:00 INFO spark.HttpServer: Starting HTTP Server
14/07/02 18:46:01 INFO server.Server: jetty-8.1.14.v20131031
14/07/02 18:46:01 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:36260
14/07/02 18:46:01 INFO broadcast.HttpBroadcast: Broadcast server started at http://10.1.8.207:36260
14/07/02 18:46:01 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-ca40f66c-edc3-484f-b317-d3f512aab244
14/07/02 18:46:01 INFO spark.HttpServer: Starting HTTP Server
14/07/02 18:46:01 INFO server.Server: jetty-8.1.14.v20131031
14/07/02 18:46:01 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:57821
14/07/02 18:46:01 INFO server.Server: jetty-8.1.14.v20131031
14/07/02 18:46:02 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
14/07/02 18:46:02 INFO ui.SparkUI: Started SparkUI at http://web02.dw:4040
metastore path is /tmp/sparkHiveMetastore8060064816530828092
warehousePath path is /tmp/sparkHiveWarehouse5366068035857129261
hiveHome path is Some(/home/hadoop/Java/lib/hive-0.6.0)
hiveDevHome path is None
res0: scala.collection.mutable.HashMap[String,org.apache.spark.sql.hive.test.TestHive.TestTable] = Map(sales -> TestTable(sales,WrappedArray(<function0>, <function0>)), src -> TestTable(src,WrappedArray(<function0>, <function0>)), src1 -> TestTable(src1,WrappedArray(<function0>, <function0>)), serdeins -> TestTable(serdeins,WrappedArray(<function0>, <function0>)), src_thrift -> TestTable(src_thrift,WrappedArray(<function0>)), srcpart -> TestTable(srcpart,WrappedArray(<function0>)), episodes -> TestTable(episodes,WrappedArray(<function0>, <function0>)), srcpart1 -> TestTable(srcpart1,WrappedArray(<function0>)))
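As noted above, none of the test tables is materialized until first access, because these members are Scala lazy vals. A minimal standalone sketch of the same deferred-initialization mechanism (the names here are illustrative, not from the Spark source):

```scala
object LazyDemo {
  var loaded = List.empty[String] // records what has been initialized so far

  // The body runs only on first access, just like TestHive's lazy test tables.
  lazy val salesTable: String = { loaded ::= "sales"; "sales" }

  def main(args: Array[String]): Unit = {
    println(loaded.isEmpty) // true: nothing initialized yet
    println(salesTable)     // first access triggers initialization
    println(loaded)         // List(sales)
  }
}
```

However many times salesTable is read afterwards, the initializer runs exactly once, which is why the console can expose the whole table catalog cheaply and only pay the load cost for tables a query actually touches.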
Testing a SELECT statement
1. First, declare a SQL query.
2. The test harness uses the Hive metastore, creating a Derby database.
3. It creates all of the tables listed above and loads the data into them.
4. It parses the statement select * from sales.
5. It generates a SchemaRDD and produces the query plan.
6. When an action is run on the querySales RDD, the SQL actually gets executed.
Below is the detailed execution output (the log shows the rough execution steps):
scala> val querySales = sql("select * from sales")
14/07/02 18:51:19 INFO test.TestHive$: Loading test table sales
14/07/02 18:51:19 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS sales (key STRING, value INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*)")
14/07/02 18:51:19 INFO parse.ParseDriver: Parse Completed
14/07/02 18:51:19 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/07/02 18:51:19 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/07/02 18:51:19 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/07/02 18:51:19 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/07/02 18:51:19 INFO ql.Driver: <PERFLOG method=Driver.run>
14/07/02 18:51:19 INFO ql.Driver: <PERFLOG method=TimeToSubmit>
14/07/02 18:51:19 INFO ql.Driver: <PERFLOG method=compile>
14/07/02 18:51:19 INFO ql.Driver: <PERFLOG method=parse>
14/07/02 18:51:19 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS sales (key STRING, value INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*)")
14/07/02 18:51:19 INFO parse.ParseDriver: Parse Completed
14/07/02 18:51:19 INFO ql.Driver: </PERFLOG method=parse start=1404298279883 end=1404298279885 duration=2>
14/07/02 18:51:19 INFO ql.Driver: <PERFLOG method=semanticAnalyze>
14/07/02 18:51:19 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
14/07/02 18:51:19 INFO parse.SemanticAnalyzer: Creating table sales position=27
14/07/02 18:51:20 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
14/07/02 18:51:20 INFO metastore.ObjectStore: ObjectStore, initialize called
14/07/02 18:51:20 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
14/07/02 18:51:21 WARN bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
14/07/02 18:51:25 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
14/07/02 18:51:25 INFO metastore.ObjectStore: Initialized ObjectStore
14/07/02 18:51:26 WARN bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
14/07/02 18:51:26 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.12.0
14/07/02 18:51:27 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=sales
14/07/02 18:51:27 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=sales
14/07/02 18:51:27 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
14/07/02 18:51:27 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
14/07/02 18:51:28 INFO ql.Driver: Semantic Analysis Completed
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=semanticAnalyze start=1404298279885 end=1404298288331 duration=8446>
14/07/02 18:51:28 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=compile start=1404298279840 end=1404298288340 duration=8500>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=Driver.execute>
14/07/02 18:51:28 INFO ql.Driver: Starting command: CREATE TABLE IF NOT EXISTS sales (key STRING, value INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*)")
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=TimeToSubmit start=1404298279840 end=1404298288351 duration=8511>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=runTasks>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=task.DDL.Stage-0>
14/07/02 18:51:28 INFO metastore.HiveMetaStore: 0: create_table: Table(tableName:sales, dbName:default, owner:hadoop, createTime:1404298288, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, comment:null), FieldSchema(name:value, type:int, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.RegexSerDe, parameters:{serialization.format=1, input.regex=([^ ]*) ([^ ]*)}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:null, groupPrivileges:null, rolePrivileges:null))
14/07/02 18:51:28 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=create_table: Table(tableName:sales, dbName:default, owner:hadoop, createTime:1404298288, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, comment:null), FieldSchema(name:value, type:int, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.RegexSerDe, parameters:{serialization.format=1, input.regex=([^ ]*) ([^ ]*)}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:null, groupPrivileges:null, rolePrivileges:null))
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=task.DDL.Stage-0 start=1404298288351 end=1404298288589 duration=238>
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=runTasks start=1404298288351 end=1404298288589 duration=238>
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=Driver.execute start=1404298288340 end=1404298288589 duration=249>
14/07/02 18:51:28 INFO ql.Driver: OK
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=releaseLocks>
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=releaseLocks start=1404298288590 end=1404298288590 duration=0>
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=Driver.run start=1404298279839 end=1404298288590 duration=8751>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=releaseLocks>
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=releaseLocks start=1404298288590 end=1404298288590 duration=0>
14/07/02 18:51:28 INFO parse.ParseDriver: Parsing command: LOAD DATA LOCAL INPATH 'sql/hive/src/test/resources/data/files/sales.txt' INTO TABLE sales
14/07/02 18:51:28 INFO parse.ParseDriver: Parse Completed
14/07/02 18:51:28 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/07/02 18:51:28 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/07/02 18:51:28 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/07/02 18:51:28 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=Driver.run>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=TimeToSubmit>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=compile>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=parse>
14/07/02 18:51:28 INFO parse.ParseDriver: Parsing command: LOAD DATA LOCAL INPATH 'sql/hive/src/test/resources/data/files/sales.txt' INTO TABLE sales
14/07/02 18:51:28 INFO parse.ParseDriver: Parse Completed
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=parse start=1404298288629 end=1404298288629 duration=0>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=semanticAnalyze>
14/07/02 18:51:28 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=sales
14/07/02 18:51:28 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=sales
14/07/02 18:51:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 18:51:28 INFO ql.Driver: Semantic Analysis Completed
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=semanticAnalyze start=1404298288630 end=1404298288942 duration=312>
14/07/02 18:51:28 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=compile start=1404298288628 end=1404298288943 duration=315>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=Driver.execute>
14/07/02 18:51:28 INFO ql.Driver: Starting command: LOAD DATA LOCAL INPATH 'sql/hive/src/test/resources/data/files/sales.txt' INTO TABLE sales
14/07/02 18:51:28 INFO ql.Driver: </PERFLOG method=TimeToSubmit start=1404298288628 end=1404298288943 duration=315>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=runTasks>
14/07/02 18:51:28 INFO ql.Driver: <PERFLOG method=task.COPY.Stage-0>
14/07/02 18:51:28 INFO exec.Task: Copying data from file:/app/hadoop/spark/sql/hive/src/test/resources/data/files/sales.txt to file:/tmp/hive-hadoop/hive_2014-07-02_18-51-28_629_2309366591646930035-1/-ext-10000
14/07/02 18:51:28 INFO exec.Task: Copying file: file:/app/hadoop/spark/sql/hive/src/test/resources/data/files/sales.txt
14/07/02 18:51:29 INFO ql.Driver: </PERFLOG method=task.COPY.Stage-0 start=1404298288943 end=1404298289037 duration=94>
14/07/02 18:51:29 INFO ql.Driver: <PERFLOG method=task.MOVE.Stage-1>
14/07/02 18:51:29 INFO exec.Task: Loading data to table default.sales from file:/tmp/hive-hadoop/hive_2014-07-02_18-51-28_629_2309366591646930035-1/-ext-10000
14/07/02 18:51:29 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=sales
14/07/02 18:51:29 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=sales
14/07/02 18:51:29 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=sales
14/07/02 18:51:29 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=sales
14/07/02 18:51:29 INFO metastore.HiveMetaStore: 0: alter_table: db=default tbl=sales newtbl=sales
14/07/02 18:51:29 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=alter_table: db=default tbl=sales newtbl=sales
14/07/02 18:51:29 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=sales
14/07/02 18:51:29 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=sales
14/07/02 18:51:29 INFO ql.Driver: </PERFLOG method=task.MOVE.Stage-1 start=1404298289037 end=1404298289196 duration=159>
14/07/02 18:51:29 INFO ql.Driver: <PERFLOG method=
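The sales table in the log above is declared with RegexSerDe and the pattern ([^ ]*) ([^ ]*): each line of sales.txt is split at the first space into a STRING key and an INT value. The same extraction can be sketched in plain Scala, independent of Hive (the sample lines below are illustrative):

```scala
object RegexSerDeDemo {
  // The same pattern the CREATE TABLE statement hands to RegexSerDe.
  val rowPattern = "([^ ]*) ([^ ]*)".r

  // Parse one text line into (key, value), mirroring the sales schema
  // (key STRING, value INT); lines that don't match the pattern yield None.
  def parseLine(line: String): Option[(String, Int)] = line match {
    case rowPattern(k, v) => Some((k, v.toInt))
    case _                => None
  }

  def main(args: Array[String]): Unit = {
    println(parseLine("Joe 2"))     // Some((Joe,2))
    println(parseLine("malformed")) // None: no space, so the regex rejects it
  }
}
```

This is the per-row deserialization step; once rows are in this shape, running an action such as querySales.collect() on the SchemaRDD is what finally triggers the query's execution, as described in the six steps earlier.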