Spark SQL(6) OptimizedPlan

阿新 • • 發佈：2020-07-26

Spark SQL(6) OptimizedPlan

在這一步spark sql主要應用一些規則，優化生成的Resolved Plan，這一步涉及到的有Optimizer。

之前介紹在sparksession例項化的是會例項化sessionState，進而確定QueryExecution、Analyzer，Optimizer也是在這一步確定的：

  protected def optimizer: Optimizer = {
    new SparkOptimizer(catalog, experimentalMethods) {
      override def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] =
        super.extendedOperatorOptimizationRules ++ customOperatorOptimizationRules
    }
  }

　Optimizer也是RuleExecutor的子類，而SparkOptimizer是Optimizer子類，在analyzed步驟知道，其實主要規則就是RuleExecutor子類定義的batchs的規則

sparkOptimizer：

 override def batches: Seq[Batch] = (preOptimizationBatches ++ super.batches :+
    Batch("Optimize Metadata Only Query", Once, OptimizeMetadataOnlyQuery(catalog)) :+
    Batch("Extract Python UDF from Aggregate", Once, ExtractPythonUDFFromAggregate) :+
    Batch("Prune File Source Table Partitions", Once, PruneFileSourcePartitions) :+
    Batch("Push down operators to data source scan", Once, PushDownOperatorsToDataSource)) ++
    postHocOptimizationBatches :+
    Batch("User Provided Optimizers", fixedPoint, experimentalMethods.extraOptimizations: _*)

Optimizer：

def batches: Seq[Batch] = {
    val operatorOptimizationRuleSet =
      Seq(
        // Operator push down
        PushProjectionThroughUnion,
        ReorderJoin,
        EliminateOuterJoin,
        PushPredicateThroughJoin,
        PushDownPredicate,
        LimitPushDown,
        ColumnPruning,
        InferFiltersFromConstraints,
        // Operator combine
        CollapseRepartition,
        CollapseProject,
        CollapseWindow,
        CombineFilters,
        CombineLimits,
        CombineUnions,
        // Constant folding and strength reduction
        NullPropagation,
        ConstantPropagation,
        FoldablePropagation,
        OptimizeIn,
        ConstantFolding,
        ReorderAssociativeOperator,
        LikeSimplification,
        BooleanSimplification,
        SimplifyConditionals,
        RemoveDispensableExpressions,
        SimplifyBinaryComparison,
        PruneFilters,
        EliminateSorts,
        SimplifyCasts,
        SimplifyCaseConversionExpressions,
        RewriteCorrelatedScalarSubquery,
        EliminateSerialization,
        RemoveRedundantAliases,
        RemoveRedundantProject,
        SimplifyCreateStructOps,
        SimplifyCreateArrayOps,
        SimplifyCreateMapOps,
        CombineConcats) ++
        extendedOperatorOptimizationRules

    val operatorOptimizationBatch: Seq[Batch] = {
      val rulesWithoutInferFiltersFromConstraints =
        operatorOptimizationRuleSet.filterNot(_ == InferFiltersFromConstraints)
      Batch("Operator Optimization before Inferring Filters", fixedPoint,
        rulesWithoutInferFiltersFromConstraints: _*) ::
      Batch("Infer Filters", Once,
        InferFiltersFromConstraints) ::
      Batch("Operator Optimization after Inferring Filters", fixedPoint,
        rulesWithoutInferFiltersFromConstraints: _*) :: Nil
    }

    (Batch("Eliminate Distinct", Once, EliminateDistinct) ::
    // Technically some of the rules in Finish Analysis are not optimizer rules and belong more
    // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
    // However, because we also use the analyzer to canonicalized queries (for view definition),
    // we do not eliminate subqueries or compute current time in the analyzer.
    Batch("Finish Analysis", Once,
      EliminateSubqueryAliases,
      EliminateView,
      ReplaceExpressions,
      ComputeCurrentTime,
      GetCurrentDatabase(sessionCatalog),
      RewriteDistinctAggregates,
      ReplaceDeduplicateWithAggregate) ::
    //////////////////////////////////////////////////////////////////////////////////////////
    // Optimizer rules start here
    //////////////////////////////////////////////////////////////////////////////////////////
    // - Do the first call of CombineUnions before starting the major Optimizer rules,
    //   since it can reduce the number of iteration and the other rules could add/move
    //   extra operators between two adjacent Union operators.
    // - Call CombineUnions again in Batch("Operator Optimizations"),
    //   since the other rules might make two separate Unions operators adjacent.
    Batch("Union", Once,
      CombineUnions) ::
    Batch("Pullup Correlated Expressions", Once,
      PullupCorrelatedPredicates) ::
    Batch("Subquery", Once,
      OptimizeSubqueries) ::
    Batch("Replace Operators", fixedPoint,
      ReplaceIntersectWithSemiJoin,
      ReplaceExceptWithFilter,
      ReplaceExceptWithAntiJoin,
      ReplaceDistinctWithAggregate) ::
    Batch("Aggregate", fixedPoint,
      RemoveLiteralFromGroupExpressions,
      RemoveRepetitionFromGroupExpressions) :: Nil ++
    operatorOptimizationBatch) :+
    Batch("Join Reorder", Once,
      CostBasedJoinReorder) :+
    Batch("Decimal Optimizations", fixedPoint,
      DecimalAggregates) :+
    Batch("Object Expressions Optimization", fixedPoint,
      EliminateMapObjects,
      CombineTypedFilters) :+
    Batch("LocalRelation", fixedPoint,
      ConvertToLocalRelation,
      PropagateEmptyRelation) :+
    // The following batch should be executed after batch "Join Reorder" and "LocalRelation".
    Batch("Check Cartesian Products", Once,
      CheckCartesianProducts) :+
    Batch("RewriteSubquery", Once,
      RewritePredicateSubquery,
      ColumnPruning,
      CollapseProject,
      RemoveRedundantProject)
  }

　　如上這便是在優化這步的所有的規則和策略例如消除子查詢別名，表示式替換、運算元下推、常量摺疊等優化規則，經過這一步之後，就進入物理計劃階段了。

舉個栗子：

在這裡面標出來的子查詢別名是在解析的時候加上的，但是在優化之後的logicalPlan中已經去掉，對應上述規則的（消除子查詢別名），還有裡面的運算元下推，在analyzed的運算元樹中，Join Inner中對age欄位做了條件限制，但是在優化後的LogicPlan中已經下推到離Project最近的位置；之後還有InferFiltersFromConstraints規則，對age、name增加了不為null的限定條件。

Spark SQL(6) OptimizedPlan

Spark SQL(6) OptimizedPlan

Spark SQL(6) OptimizedPlan

【夢溪筆談】6.spark-sql相關程式碼

自行編譯spark適配CDH 6.3.2的spark-sql

位元組跳動在Spark SQL上的核心優化實踐 | 位元組跳動技術沙龍

Spark 系列（八）—— Spark SQL 之 DataFrame 和 Dataset

Spark 系列（九）—— Spark SQL 之 Structured API

Spark 系列（十）—— Spark SQL 外部資料來源

Spark 系列（十一）—— Spark SQL 聚合函式 Aggregations

Spark-SQL讀不到Hive資料庫的新坑指北

Spark SQL常見4種資料來源詳解

Spark SQL操作JSON欄位的小技巧

Spark入門（六）Spark SQL shell啟動方式(元資料儲存在mysql)

Spark SQL 入門建立DataFrame報錯：org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://local

DataFrame DataSet Spark SQL學習

【趙強老師】什麼是Spark SQL？

Spark SQL : DataFrame repartition、coalesce 對比

spark sql

Spark SQL Parser到Unresolved LogicPlan

Spark SQL(4)-Unresolved Plan到Analyzed Plan

Spark SQL(5) CacheManage

Spark SQL(6) OptimizedPlan

Spark SQL(6) OptimizedPlan

相關推薦