How to view the abstract syntax tree generated by Spark SQL
Preface
Section 4.3 of the book 《Spark SQL核心剖析》 (Spark SQL Internals) notes that, in the Catalyst stack, the nodes of the generated abstract syntax tree all have names ending in Context. The tree is produced when ANTLR4 and the generated SqlBaseParser parse the SQL text; in the source this is the syntax-parsing phase, and every node of the resulting AST is a subclass of ParserRuleContext.
The question
ANTLR4 parses SQL into an abstract syntax tree. What does that tree finally look like, and how can we view it?
Source code analysis
Test example
spark.sql("select id, count(name) from student group by id").show()
Source entry point
The sql method of SparkSession is as follows:
def sql(sqlText: String): DataFrame = {
  // 1. Generate the LogicalPlan; sqlParser is a SparkSqlParser
  val logicalPlan: LogicalPlan = sessionState.sqlParser.parsePlan(sqlText)
  // 2. Build a DataFrame from the LogicalPlan
  val frame: DataFrame = Dataset.ofRows(self, logicalPlan)
  frame
}
Locating SparkSqlParser
The entry point involves the key class SessionState, whose initialization code is as follows:
lazy val sessionState: SessionState = {
  parentSessionState
    .map(_.clone(this))
    .getOrElse {
      // Build an org.apache.spark.sql.internal.SessionStateBuilder
      val state = SparkSession.instantiateSessionState(
        SparkSession.sessionStateClassName(sparkContext.conf),
        self)
      initialSessionOptions.foreach { case (k, v) => state.conf.setConfString(k, v) }
      state
    }
}
The org.apache.spark.sql.SparkSession$#sessionStateClassName method is as follows:
private def sessionStateClassName(conf: SparkConf): String = {
  // spark.sql.catalogImplementation is either "hive" or "in-memory"; the default is in-memory
  conf.get(CATALOG_IMPLEMENTATION) match {
    // hive implementation: org.apache.spark.sql.hive.HiveSessionStateBuilder
    case "hive" => HIVE_SESSION_STATE_BUILDER_CLASS_NAME
    // in-memory implementation: org.apache.spark.sql.internal.SessionStateBuilder
    case "in-memory" => classOf[SessionStateBuilder].getCanonicalName
  }
}
The builder pattern is used here: org.apache.spark.sql.internal.SessionStateBuilder is what builds the SessionState. SparkSession.instantiateSessionState shows the details:
/**
 * Helper method to create an instance of `SessionState` based on `className` from conf.
 * The result is either `SessionState` or a Hive based `SessionState`.
 */
private def instantiateSessionState(
    className: String,
    sparkSession: SparkSession): SessionState = {
  try {
    // className is e.g. org.apache.spark.sql.internal.SessionStateBuilder
    // invoke `new [Hive]SessionStateBuilder(SparkSession, Option[SessionState])`
    val clazz = Utils.classForName(className)
    val ctor = clazz.getConstructors.head
    ctor.newInstance(sparkSession, None).asInstanceOf[BaseSessionStateBuilder].build()
  } catch {
    case NonFatal(e) =>
      throw new IllegalArgumentException(s"Error while instantiating '$className':", e)
  }
}
BaseSessionStateBuilder has two main implementations: org.apache.spark.sql.hive.HiveSessionStateBuilder (hive mode) and org.apache.spark.sql.internal.SessionStateBuilder (in-memory mode, the default).
The org.apache.spark.sql.internal.BaseSessionStateBuilder#build method is as follows:
/**
 * Build the [[SessionState]].
 */
def build(): SessionState = {
  new SessionState(
    session.sharedState,
    conf,
    experimentalMethods,
    functionRegistry,
    udfRegistration,
    () => catalog,
    sqlParser,
    () => analyzer,
    () => optimizer,
    planner,
    streamingQueryManager,
    listenerManager,
    () => resourceLoader,
    createQueryExecution,
    createClone)
}
SessionState takes many parameters; the key ones are:
conf: the SQL configuration of this SparkSession
functionRegistry: a FunctionRegistry object responsible for function registration; internally it maintains a map of the registered functions
udfRegistration: a UDFRegistration object used to register UDFs; it delegates to the FunctionRegistry
catalogBuilder: () => SessionCatalog: returns the SessionCatalog, which manages the SparkSession's catalog
sqlParser: ParserInterface: in practice a SparkSqlParser instance; internally it calls AstBuilder to parse SQL into an abstract syntax tree
analyzerBuilder: () => Analyzer: org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer builds a customized org.apache.spark.sql.catalyst.analysis.Analyzer
optimizerBuilder: () => Optimizer: org.apache.spark.sql.internal.BaseSessionStateBuilder.optimizer builds a customized org.apache.spark.sql.execution.SparkOptimizer
planner: SparkPlanner: org.apache.spark.sql.internal.BaseSessionStateBuilder.planner builds a customized org.apache.spark.sql.execution.SparkPlanner
resourceLoaderBuilder: () => SessionResourceLoader: returns the resource loader, used mainly to load jars or other resources for functions
createQueryExecution: LogicalPlan => QueryExecution: builds a QueryExecution from a LogicalPlan
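Notice that several of these components are passed as () => T thunks (catalogBuilder, analyzerBuilder, optimizerBuilder, resourceLoaderBuilder) rather than as plain values. A minimal, self-contained sketch of why a builder might do this; none of the names below are Spark's, they only mirror the shape: construction is deferred until the component is first needed, then cached.

```scala
object ThunkDemo {
  class Catalog

  // Mirrors SessionState's catalogBuilder: () => SessionCatalog parameter:
  // the thunk only runs on first access, and lazy val caches the result.
  class State(catalogBuilder: () => Catalog) {
    lazy val catalog: Catalog = catalogBuilder()
  }

  def demo(): Boolean = {
    var builds = 0
    val state = new State(() => { builds += 1; new Catalog })
    val nothingBuiltYet = builds == 0 // constructing State ran no thunk
    val c1 = state.catalog            // first access runs the thunk
    val c2 = state.catalog            // second access hits the cache
    nothingBuiltYet && (c1 eq c2) && builds == 1
  }
}
```

This lets an expensive component such as the catalog be wired into SessionState without being initialized until a query actually needs it.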
The parsePlan method
SparkSqlParser does not implement this method itself; the implementation lives in its parent class AbstractSqlParser:
/** Creates LogicalPlan for a given SQL string. */
// Generate a LogicalPlan from the SQL text
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
  val singleStatementContext: SqlBaseParser.SingleStatementContext = parser.singleStatement()
  astBuilder.visitSingleStatement(singleStatementContext) match {
    case plan: LogicalPlan => plan
    case _ =>
      val position = Origin(None, None)
      throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
  }
}
The second parameter list of parse is a callback function; it is invoked inside the parse method itself.
The source of org.apache.spark.sql.execution.SparkSqlParser#parse is as follows:
// Variable substitutor
private val substitutor = new VariableSubstitution(conf)

protected override def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
  super.parse(substitutor.substitute(command))(toResult)
}
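The parse(command)(toResult) signature is Scala's curried "loan" pattern: parse owns the setup of the parser and then lends it to the caller's callback, which decides what to produce. A tiny self-contained sketch of that control flow (the Parser class and the trimming step are invented for illustration, not Spark's code):

```scala
object LoanPattern {
  // Stand-in for SqlBaseParser; not the real ANTLR-generated class.
  final class Parser(val command: String)

  // Mirrors parse[T](command)(toResult: SqlBaseParser => T): T:
  // set up the resource, then let the callback decide what to produce from it.
  def parse[T](command: String)(toResult: Parser => T): T = {
    val parser = new Parser(command.trim) // setup (lexer, listeners, ... in the real code)
    toResult(parser)
  }
}
```

A caller gets back whatever the callback returns, which is exactly how parsePlan turns a configured parser into a LogicalPlan while parseExpression and friends reuse the same setup with different callbacks.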
substitutor is a variable substitutor that replaces variable references in the SQL text. Next, look at the parse method of the parent class AbstractSqlParser:
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
  logDebug(s"Parsing command: $command")
  // Lexical analysis
  val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))
  lexer.removeErrorListeners()
  lexer.addErrorListener(ParseErrorListener)
  lexer.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

  // Syntactic analysis
  val tokenStream = new CommonTokenStream(lexer)
  val parser = new SqlBaseParser(tokenStream)
  parser.addParseListener(PostProcessor)
  parser.removeErrorListeners()
  parser.addErrorListener(ParseErrorListener)
  parser.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

  try {
    try {
      // first, try parsing with potentially faster SLL mode
      parser.getInterpreter.setPredictionMode(PredictionMode.SLL)
      // invoke the callback, which uses AstBuilder to build the unresolved LogicalPlan
      toResult(parser)
    } catch {
      case e: ParseCancellationException =>
        // if we fail, parse with LL mode
        tokenStream.seek(0) // rewind input stream
        parser.reset()
        // Try Again.
        parser.getInterpreter.setPredictionMode(PredictionMode.LL)
        toResult(parser)
    }
  } catch {
    case e: ParseException if e.command.isDefined =>
      throw e
    case e: ParseException =>
      throw e.withCommand(command)
    case e: AnalysisException =>
      val position = Origin(e.line, e.startPosition)
      throw new ParseException(Option(command), e.message, position, position)
  }
}
This method calls the ANTLR4 API to turn the SQL text into an AST and then invokes toResult(parser); that toResult is exactly the callback passed in by parsePlan.
So by the time astBuilder.visitSingleStatement is invoked, the AST has already been built.
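The nested try in parse implements ANTLR's standard two-stage strategy: SLL prediction is faster but can spuriously reject some inputs, in which case the token stream is rewound and the input reparsed with full LL prediction. A self-contained sketch of just that control flow (the fake parser and its failure condition are invented for illustration):

```scala
object TwoStageParse {
  // Stands in for ANTLR's ParseCancellationException.
  final class ParseCancellation extends RuntimeException

  // Fake parser: SLL mode gives up on inputs marked "hard"; LL mode always succeeds.
  private def parseWith(mode: String, input: String): String =
    if (mode == "SLL" && input.contains("hard")) throw new ParseCancellation
    else s"$mode:$input"

  // Mirrors AbstractSqlParser.parse: try the fast SLL mode first,
  // and on cancellation rewind and retry with full LL prediction.
  def parse(input: String): String =
    try parseWith("SLL", input)
    catch {
      case _: ParseCancellation =>
        // In the real code: tokenStream.seek(0); parser.reset(); switch to LL.
        parseWith("LL", input)
    }
}
```

For most statements only the fast path runs, so the fallback costs nothing in the common case.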
Printing the generated AST
Modifying the source
Now look at the astBuilder.visitSingleStatement method:
override def visitSingleStatement(ctx: SingleStatementContext): LogicalPlan = withOrigin(ctx) {
  val statement: StatementContext = ctx.statement
  // Added for this article: print the freshly parsed AST
  printRuleContextInTreeStyle(statement, 1)
  // Call accept (the visitor) to build the logical operator tree from the AST
  visit(statement).asInstanceOf[LogicalPlan]
}
Before the visitor pattern walks the AST nodes to produce the unresolved LogicalPlan, I added a method that prints the freshly parsed syntax tree. printRuleContextInTreeStyle looks like this:
/**
 * Print the abstract syntax tree in tree style
 */
private def printRuleContextInTreeStyle(ctx: ParserRuleContext, level: Int): Unit = {
  val prefix: String = "|"
  val curLevelStr: String = "-" * level
  val childLevelStr: String = "-" * (level + 1)
  println(s"${prefix}${curLevelStr} ${ctx.getClass.getCanonicalName}")
  val children: util.List[ParseTree] = ctx.children
  if (children == null || children.size() == 0) {
    return
  }
  children.iterator().foreach {
    case context: ParserRuleContext => printRuleContextInTreeStyle(context, level + 1)
    // Terminal nodes: print the parent context's class name one level deeper
    case _ => println(s"${prefix}${childLevelStr} ${ctx.getClass.getCanonicalName}")
  }
}
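printRuleContextInTreeStyle only runs inside a patched Spark build, since it needs ANTLR's ParserRuleContext on the classpath. To check the output format in isolation, here is a self-contained sketch of the same traversal over a toy Node type (Node and render are invented for illustration; they are not Spark or ANTLR classes):

```scala
object TreePrinter {
  final case class Node(name: String, children: List[Node] = Nil)

  // Same output shape as printRuleContextInTreeStyle:
  // a "|" prefix followed by one "-" per nesting level, then the node name.
  def render(node: Node, level: Int = 1): List[String] = {
    val line = s"|${"-" * level} ${node.name}"
    line :: node.children.flatMap(child => render(child, level + 1))
  }
}
```

Rendering a two-level toy tree produces lines like "|- QueryContext" followed by "|-- NamedExpressionContext", the same layout as the dumps that follow.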
Sample printouts for three SQL statements
SQL example 1 (with a where clause)
select name from student where age > 18
The generated AST is as follows:
|- org.apache.spark.sql.catalyst.parser.SqlBaseParser.StatementDefaultContext
|-- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryContext
|--- org.apache.spark.sql.catalyst.parser.SqlBaseParser.SingleInsertQueryContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryTermDefaultContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryPrimaryDefaultContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionSeqContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.RelationContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableNameContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableIdentifierContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableAliasContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonOperatorContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonOperatorContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ConstantDefaultContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NumericLiteralContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.IntegerLiteralContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IntegerLiteralContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext
SQL example 2 (with ordering)
select name from student where age > 18 order by id desc
The generated AST is as follows:
|- org.apache.spark.sql.catalyst.parser.SqlBaseParser.StatementDefaultContext
|-- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryContext
|--- org.apache.spark.sql.catalyst.parser.SqlBaseParser.SingleInsertQueryContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryTermDefaultContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryPrimaryDefaultContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionSeqContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.RelationContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableNameContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableIdentifierContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableAliasContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonOperatorContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ComparisonOperatorContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ConstantDefaultContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NumericLiteralContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.IntegerLiteralContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IntegerLiteralContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.SortItemContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.SortItemContext
SQL example 3 (with grouping)
select id, count(name) from student group by id
The generated AST is as follows:
|- org.apache.spark.sql.catalyst.parser.SqlBaseParser.StatementDefaultContext
|-- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryContext
|--- org.apache.spark.sql.catalyst.parser.SqlBaseParser.SingleInsertQueryContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryTermDefaultContext
|----- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryPrimaryDefaultContext
|------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QuerySpecificationContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionSeqContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionSeqContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.NamedExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.FunctionCallContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QualifiedNameContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FunctionCallContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|--------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|---------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|----------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FunctionCallContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.FromClauseContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.RelationContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableNameContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableIdentifierContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.TableAliasContext
|------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.AggregationContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.AggregationContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.AggregationContext
|-------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ExpressionContext
|--------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.PredicatedContext
|---------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ValueExpressionDefaultContext
|----------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.ColumnReferenceContext
|------------ org.apache.spark.sql.catalyst.parser.SqlBaseParser.IdentifierContext
|------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|-------------- org.apache.spark.sql.catalyst.parser.SqlBaseParser.UnquotedIdentifierContext
|---- org.apache.spark.sql.catalyst.parser.SqlBaseParser.QueryOrganizationContext
Summary
This article started from a small test program, traced how Spark SQL invokes ANTLR4 to parse a SQL statement into an AST, and modified the source to print that tree. Even though ANTLR's conversion of SQL into an AST may still look like a black box, the input for Spark SQL's subsequent stages has now been obtained: the parsed syntax tree that AstBuilder will visit to produce the unresolved LogicalPlan.