Spark SQL Notes (19): Spark SQL Summary (2), DataFrame vs. SQL
阿新 • Published: 2018-12-22
1 DataFrame
DataFrame = RDD + Schema
- In the Scala API, DataFrame is just a type alias for Dataset[Row]
- Advantages of DataFrame over RDD: Catalyst optimization and schemas
- DataFrame can handle: text, JSON, Parquet, …
- Queries written in SQL and queries written with the DataFrame API functions are both optimized by Catalyst
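To make the last point concrete, here is a minimal PySpark sketch that expresses one aggregation twice, once with DataFrame API functions and once as SQL, and prints both plans. The sample rows, the view name `people`, and the helper names are made up for illustration; `spark` is assumed to be an existing SparkSession, as in the `pyspark` shell.

```python
# Hypothetical sample data: (name, dept, age).
ROWS = [("Ann", "eng", 34), ("Bob", "eng", 45), ("Cid", "ops", 29), ("Dee", "ops", 41)]
COLUMNS = ["name", "dept", "age"]

# The same question as SQL: how many people over 30 per department?
QUERY = "SELECT dept, COUNT(*) AS n FROM people WHERE age > 30 GROUP BY dept"


def count_with_api(people):
    """Answer QUERY with DataFrame API functions."""
    return (people.where("age > 30")
                  .groupBy("dept")
                  .count()
                  .withColumnRenamed("count", "n"))


def count_with_sql(spark, people):
    """Answer QUERY with SQL against a temporary view."""
    people.createOrReplaceTempView("people")
    return spark.sql(QUERY)


def same_result(spark):
    """Run both variants, print their plans, and compare the rows."""
    people = spark.createDataFrame(ROWS, COLUMNS)
    by_api = count_with_api(people)
    by_sql = count_with_sql(spark, people)
    by_api.explain()  # physical plan produced by Catalyst for the API version
    by_sql.explain()  # physical plan produced by Catalyst for the SQL version
    return sorted(by_api.collect()) == sorted(by_sql.collect())
```

In a `pyspark` shell, `same_result(spark)` should return `True`, and the two printed plans are expected to be equivalent, though not necessarily character-for-character identical.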
2 Schema
https://spark.apache.org/docs/2.1.3/sql-programming-guide.html#interoperating-with-rdds
- inferred (derived from the data by reflection)
- explicit (specified programmatically)
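The two options can be sketched in PySpark as follows. This is an illustration under assumptions, not code from the linked guide: the rows, column names, and helper names are invented, and `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical sample data: (name, age); the second age is missing.
ROWS = [("Ann", 34), ("Bob", None)]
COLUMN_NAMES = ["name", "age"]


def explicit_schema():
    """Build the schema by hand instead of letting Spark infer it."""
    # Imported lazily so this file also loads where PySpark is not installed.
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType
    return StructType([
        StructField("name", StringType(), nullable=False),
        StructField("age", IntegerType(), nullable=True),
    ])


def build_both(spark, schema):
    """Create the same DataFrame with an inferred and with an explicit schema."""
    # Only column names are given here, so Spark infers the types from the data.
    inferred = spark.createDataFrame(ROWS, COLUMN_NAMES)
    # Here the caller fixes names, types, and nullability up front.
    explicit = spark.createDataFrame(ROWS, schema)
    return inferred, explicit
```

With a live session, `build_both(spark, explicit_schema())` returns both variants; calling `printSchema()` on each should show the difference, for example Python integers inferred as `long` versus the `integer` type declared explicitly.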
3 Loading & Saving Results
https://spark.apache.org/docs/2.1.3/sql-programming-guide.html#save-modes
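A small PySpark sketch of saving with an explicit save mode and loading the result back. The four mode strings are the ones listed in the linked Spark 2.1.3 guide; the helper names and the choice of Parquet as the format are assumptions made for this example.

```python
# Save modes from the Spark 2.1.3 guide and what each does when data already exists.
SAVE_MODES = {
    "error": "fail if data already exists at the target (the default)",
    "append": "add the new rows to the existing data",
    "overwrite": "replace the existing data",
    "ignore": "leave the existing data untouched and skip the write",
}


def save_parquet(df, path, mode="error"):
    """Write a DataFrame as Parquet using one of the documented save modes."""
    if mode not in SAVE_MODES:
        raise ValueError("unknown save mode: %r" % (mode,))
    df.write.mode(mode).parquet(path)


def load_parquet(spark, path):
    """Read the saved Parquet data back into a DataFrame."""
    return spark.read.parquet(path)
```

For example, `save_parquet(df, "/tmp/people.parquet", "overwrite")` followed by `load_parquet(spark, "/tmp/people.parquet")` round-trips a DataFrame; the path is a placeholder.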
4 SQL Function Coverage
SQL coverage:
- SQL:2003 support
- Runs all 99 TPC-DS benchmark queries
- Subquery support
- Vectorization
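As an illustration of the subquery support, the sketch below runs a scalar subquery and an `IN` subquery through `spark.sql`. The tables, rows, and names are invented for the example, and `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical sample tables.
PEOPLE = [("Ann", "eng", 34), ("Bob", "eng", 45), ("Cid", "ops", 29)]
DEPARTMENTS = [("eng", "EU"), ("ops", "US")]

# Scalar subquery: people older than the average age.
OLDER_THAN_AVERAGE = (
    "SELECT name FROM people "
    "WHERE age > (SELECT AVG(age) FROM people)"
)

# IN subquery: people working in a department located in the EU.
IN_EU_DEPARTMENTS = (
    "SELECT name FROM people "
    "WHERE dept IN (SELECT dept FROM departments WHERE region = 'EU')"
)


def run_subqueries(spark):
    """Register the sample tables as temporary views and run both queries."""
    spark.createDataFrame(PEOPLE, ["name", "dept", "age"]) \
         .createOrReplaceTempView("people")
    spark.createDataFrame(DEPARTMENTS, ["dept", "region"]) \
         .createOrReplaceTempView("departments")
    return [spark.sql(q).collect() for q in (OLDER_THAN_AVERAGE, IN_EU_DEPARTMENTS)]
```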
5 External Data Sources
https://spark-packages.org/
- RDBMS: requires the JDBC driver JARs
- Parquet, Phoenix, CSV, Avro, …
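For the RDBMS case, a minimal PySpark sketch of reading one table over JDBC. The helper names are invented, and any URL, table, credentials, or driver class passed in are placeholders; the driver JAR must already be on Spark's classpath (for example via the `--jars` and `--driver-class-path` options when launching Spark).

```python
def jdbc_options(url, table, user, password, driver):
    """Collect the options Spark's built-in JDBC source expects."""
    return {
        "url": url,            # e.g. a jdbc:postgresql://... URL (placeholder)
        "dbtable": table,      # table to read
        "user": user,
        "password": password,
        "driver": driver,      # JDBC driver class name, e.g. org.postgresql.Driver
    }


def read_jdbc_table(spark, options):
    """Load the table described by `options` as a DataFrame."""
    return spark.read.format("jdbc").options(**options).load()
```

A call might look like `read_jdbc_table(spark, jdbc_options("jdbc:postgresql://localhost:5432/shop", "orders", "reader", "secret", "org.postgresql.Driver"))`, where every argument is a made-up value.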