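Building a New Cloud Data Warehouse with Databend and Tencent Cloud COS

This article demonstrates how to build a modern data warehouse, and its compute layer, with Databend on top of Tencent Cloud COS. If you are looking for a low-cost, high-performance, elastic data warehouse, Databend offers a cloud-native solution built on object storage. Databend currently supports writing data through stream load, copy into from stage, and insert, and it can be deployed in standalone or cluster mode. For further support, add WeChat: 82565387. The article is long; consider bookmarking it and reading it on a PC.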

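About Databend

Databend is an open-source modern data warehouse written in Rust and designed entirely around object storage. It scales elastically in very little time and aims to deliver an on-demand, pay-as-you-go Data Cloud experience. Its main features:

•Vectorized Execution and a Pull&Push-Based Processor Model

•True separation of storage and compute: high performance, low cost, pay only for what you use

•Full database support, compatible with the MySQL and ClickHouse protocols and with SQL over HTTP

•Solid transaction support, with features such as Data Time Travel and Database Zero Clone

•Multi-tenant reads, writes, and sharing on a single copy of the data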

GitHub repo: https://github.com/datafuselabs/databend

Docs:   https://databend.rs

For the Databend architecture diagram, see:

https://databend.rs/doc/

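Tencent Cloud COS

Cloud Object Storage (COS) is a distributed storage service from Tencent Cloud with no directory hierarchy and no restrictions on data format. It holds massive amounts of data and is accessed over HTTP/HTTPS. A COS bucket has no capacity limit and needs no partition management, which makes it suitable for CDN content distribution, Cloud Infinite (CI) data processing, data lakes for big data computation and analytics, and similar scenarios. Website: https://cloud.tencent.com/product/cos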

Test Environment

Beijing region: CVM SA2.8XLARGE64 & COS (ap-beijing)

Operating system: Ubuntu 20

Databend: binary release v0.6.99-nightly

Download: https://repo.databend.rs/databend/v0.6.99-nightly/databend-v0.6.99-nightly-x86_64-unknown-linux-gnu.tar.gz

Installation and deployment for this test follow: https://databend.rs/doc/deploy/cos

For cluster deployment, see: https://databend.rs/doc/deploy/cluster_minio

Test Data

wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip
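The `{1987..2021}` and `{1..12}` patterns in the command above are expanded by the shell (bash), not by wget, into 35 × 12 = 420 URLs. In a shell without brace expansion the same list can be generated with plain loops. The sketch below only prints the URLs; the function name `ontime_urls` is made up for this example, while the base URL and file-name pattern are copied from the command above.

```shell
#!/bin/sh
# Base URL and file-name prefix, copied from the wget command above.
BASE_URL="https://transtats.bts.gov/PREZIP"
PREFIX="On_Time_Reporting_Carrier_On_Time_Performance_1987_present"

# ontime_urls is a hypothetical helper name used only in this sketch.
ontime_urls() {
    # $1 = first year, $2 = last year (both inclusive)
    year=$1
    while [ "$year" -le "$2" ]; do
        month=1
        while [ "$month" -le 12 ]; do
            echo "${BASE_URL}/${PREFIX}_${year}_${month}.zip"
            month=$((month + 1))
        done
        year=$((year + 1))
    done
}

# Print the first URL and the number of URLs; nothing is downloaded here.
ontime_urls 1987 2021 | head -n 1
ontime_urls 1987 2021 | wc -l
```

To download, the list can be piped into wget, which reads URLs from standard input with `-i -`: `ontime_urls 1987 2021 | wget --no-check-certificate --continue -i -`.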

Table schema (cat create_ontime.sql):

CREATE TABLE ontime
(
    Year                            UInt16 NOT NULL,
    Quarter                         UInt8 NOT NULL,
    Month                           UInt8 NOT NULL,
    DayofMonth                      UInt8 NOT NULL,
    DayOfWeek                       UInt8 NOT NULL,
    FlightDate                      Date NOT NULL,
    Reporting_Airline               String NOT NULL,
    DOT_ID_Reporting_Airline        Int32 NOT NULL,
    IATA_CODE_Reporting_Airline     String NOT NULL,
    Tail_Number                     String NOT NULL,
    Flight_Number_Reporting_Airline String NOT NULL,
    OriginAirportID                 Int32 NOT NULL,
    OriginAirportSeqID              Int32 NOT NULL,
    OriginCityMarketID              Int32 NOT NULL,
    Origin                          String NOT NULL,
    OriginCityName                  String NOT NULL,
    OriginState                     String NOT NULL,
    OriginStateFips                 String NOT NULL,
    OriginStateName                 String NOT NULL,
    OriginWac                       Int32 NOT NULL,
    DestAirportID                   Int32 NOT NULL,
    DestAirportSeqID                Int32 NOT NULL,
    DestCityMarketID                Int32 NOT NULL,
    Dest                            String NOT NULL,
    DestCityName                    String NOT NULL,
    DestState                       String NOT NULL,
    DestStateFips                   String NOT NULL,
    DestStateName                   String NOT NULL,
    DestWac                         Int32 NOT NULL,
    CRSDepTime                      Int32 NOT NULL,
    DepTime                         Int32 NOT NULL,
    DepDelay                        Int32 NOT NULL,
    DepDelayMinutes                 Int32 NOT NULL,
    DepDel15                        Int32 NOT NULL,
    DepartureDelayGroups            String NOT NULL,
    DepTimeBlk                      String NOT NULL,
    TaxiOut                         Int32 NOT NULL,
    WheelsOff                       Int32 NOT NULL,
    WheelsOn                        Int32 NOT NULL,
    TaxiIn                          Int32 NOT NULL,
    CRSArrTime                      Int32 NOT NULL,
    ArrTime                         Int32 NOT NULL,
    ArrDelay                        Int32 NOT NULL,
    ArrDelayMinutes                 Int32 NOT NULL,
    ArrDel15                        Int32 NOT NULL,
    ArrivalDelayGroups              Int32 NOT NULL,
    ArrTimeBlk                      String NOT NULL,
    Cancelled                       UInt8 NOT NULL,
    CancellationCode                String NOT NULL,
    Diverted                        UInt8 NOT NULL,
    CRSElapsedTime                  Int32 NOT NULL,
    ActualElapsedTime               Int32 NOT NULL,
    AirTime                         Int32 NOT NULL,
    Flights                         Int32 NOT NULL,
    Distance                        Int32 NOT NULL,
    DistanceGroup                   UInt8 NOT NULL,
    CarrierDelay                    Int32 NOT NULL,
    WeatherDelay                    Int32 NOT NULL,
    NASDelay                        Int32 NOT NULL,
    SecurityDelay                   Int32 NOT NULL,
    LateAircraftDelay               Int32 NOT NULL,
    FirstDepTime                    String NOT NULL,
    TotalAddGTime                   String NOT NULL,
    LongestAddGTime                 String NOT NULL,
    DivAirportLandings              String NOT NULL,
    DivReachedDest                  String NOT NULL,
    DivActualElapsedTime            String NOT NULL,
    DivArrDelay                     String NOT NULL,
    DivDistance                     String NOT NULL,
    Div1Airport                     String NOT NULL,
    Div1AirportID                   Int32 NOT NULL,
    Div1AirportSeqID                Int32 NOT NULL,
    Div1WheelsOn                    String NOT NULL,
    Div1TotalGTime                  String NOT NULL,
    Div1LongestGTime                String NOT NULL,
    Div1WheelsOff                   String NOT NULL,
    Div1TailNum                     String NOT NULL,
    Div2Airport                     String NOT NULL,
    Div2AirportID                   Int32 NOT NULL,
    Div2AirportSeqID                Int32 NOT NULL,
    Div2WheelsOn                    String NOT NULL,
    Div2TotalGTime                  String NOT NULL,
    Div2LongestGTime                String NOT NULL,
    Div2WheelsOff                   String NOT NULL,
    Div2TailNum                     String NOT NULL,
    Div3Airport                     String NOT NULL,
    Div3AirportID                   Int32 NOT NULL,
    Div3AirportSeqID                Int32 NOT NULL,
    Div3WheelsOn                    String NOT NULL,
    Div3TotalGTime                  String NOT NULL,
    Div3LongestGTime                String NOT NULL,
    Div3WheelsOff                   String NOT NULL,
    Div3TailNum                     String NOT NULL,
    Div4Airport                     String NOT NULL,
    Div4AirportID                   Int32 NOT NULL,
    Div4AirportSeqID                Int32 NOT NULL,
    Div4WheelsOn                    String NOT NULL,
    Div4TotalGTime                  String NOT NULL,
    Div4LongestGTime                String NOT NULL,
    Div4WheelsOff                   String NOT NULL,
    Div4TailNum                     String NOT NULL,
    Div5Airport                     String NOT NULL,
    Div5AirportID                   Int32 NOT NULL,
    Div5AirportSeqID                Int32 NOT NULL,
    Div5WheelsOn                    String NOT NULL,
    Div5TotalGTime                  String NOT NULL,
    Div5LongestGTime                String NOT NULL,
    Div5WheelsOff                   String NOT NULL,
    Div5TailNum                     String NOT NULL
);

Create the table:

cat create_ontime.sql | mysql -h127.0.0.1 -P3307 -uroot

Loading the Data

cat load_ontime.sh

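#!/bin/bash
# Usage: ./load_ontime.sh zip_dir
if [ -z "$1" ]; then
    echo "unzip ontime, input your ontime zip dir: ./load_ontime.sh zip_dir"
    exit 1
fi

# Extract only the CSV files, four archives at a time.
ls "$1"/*.zip |xargs -I{} -P 4 bash -c "echo {}; unzip -q {} '*.csv' -d ./dataset"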

if [ $? -eq  0 ];
then
    echo "unzip success"
else
    echo "unzip was wrong!!!"
    exit 1
fi

cat create_ontime.sql |mysql -h127.0.0.1 -P3307 -uroot
if [ $? -eq  0 ];
then
    echo "Ontime table create success"
else
    echo "Ontime table create was wrong!!!"
    exit 1
fi


time ls ./dataset/*.csv|xargs -P 8 -I{} curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:1" -F "upload=@{}" -XPUT http://localhost:8081/v1/streaming_load


Usage

./load_ontime.sh <directory containing the ZIP files>
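The last line of the script uploads eight files in parallel and does not check whether any upload failed. A sequential variant that stops at the first failure is easier to debug when a file is rejected. The following is only a sketch: the endpoint and headers are the ones used in load_ontime.sh, while the function names `load_csv` and `load_all` and the `DRY_RUN` switch are assumptions made for this example.

```shell
#!/bin/sh
# Endpoint taken from load_ontime.sh above.
LOAD_URL="http://localhost:8081/v1/streaming_load"

# load_csv and load_all are hypothetical helper names for this sketch.
load_csv() {
    # $1 = path to one CSV file. With DRY_RUN=1 only print the request.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "PUT $LOAD_URL upload=@$1"
        return 0
    fi
    # --fail makes curl exit non-zero when the server answers with an HTTP error.
    curl --fail --silent --show-error \
        -H "insert_sql:insert into ontime format CSV" \
        -H "skip_header:1" \
        -F "upload=@$1" \
        -XPUT "$LOAD_URL"
}

load_all() {
    # $1 = directory with *.csv files; stop at the first failed upload.
    for f in "$1"/*.csv; do
        [ -e "$f" ] || { echo "no CSV files in $1" >&2; return 1; }
        load_csv "$f" || { echo "load failed: $f" >&2; return 1; }
    done
}

# Dry run on a made-up file name: prints the request instead of sending it.
( DRY_RUN=1; load_csv ./dataset/example.csv )
```

Calling `load_all ./dataset` would then replace the `xargs ... curl` line, trading the eight-way parallelism for a clear failure point.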

Ontime Test Queries

Q1: Total number of flights per day of the week, 2000 to 2008

(0.494 sec., 143.75 million rows/sec., 431.25 MB/sec)

mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;\
+-----------+---------+\
| DayOfWeek | c |\
+-----------+---------+\
| 5 | 8732422 |\
| 1 | 8730614 |\
| 4 | 8710843 |\
| 3 | 8685626 |\
| 2 | 8639632 |\
| 7 | 8274367 |\
| 6 | 7514194 |\
+-----------+---------+\
7 rows in set (0.50 sec)\
Read 71000000 rows, 213 MB in 0.494 sec., 143.75 million rows/sec., 431.25 MB/sec.
mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: DayOfWeek:UInt8, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| Filter: ((Year >= 2000) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8], statistics: [read_rows: 71000000, read_bytes: 213000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4], filters: [((Year >= 2000) AND (Year <= 2008))]] |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.01 sec)
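The throughput figures printed after each query can be recomputed from the rows read, the bytes read, and the elapsed time. The sketch below does this for Q1, using read_rows and read_bytes from the EXPLAIN output and the 0.494 sec. from the result footer; the function name `throughput` is made up for this example. The result lands close to the reported 143.75 million rows/sec. and 431.25 MB/sec.; the small gap presumably comes from the elapsed time being printed with only three decimals.

```shell
#!/bin/sh
# throughput is a hypothetical helper name used only in this sketch.
throughput() {
    # $1 = rows read, $2 = bytes read, $3 = elapsed seconds
    awk -v rows="$1" -v bytes="$2" -v secs="$3" 'BEGIN {
        printf "%.2f million rows/sec, %.2f MB/sec\n",
               rows / secs / 1000000, bytes / secs / 1000000
    }'
}

# Q1: 71000000 rows and 213000000 bytes read in 0.494 seconds.
throughput 71000000 213000000 0.494
```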

Q2: Flights delayed by more than 10 minutes per day of the week, 2000 to 2008

(0.543 sec., 130.71 million rows/sec., 914.95 MB/sec.)

mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;\
+-----------+---------+\
| DayOfWeek | c |\
+-----------+---------+\
| 5 | 2175733 |\
| 4 | 2012848 |\
| 1 | 1898879 |\
| 7 | 1880896 |\
| 3 | 1757508 |\
| 2 | 1665303 |\
| 6 | 1510894 |\
+-----------+---------+\
7 rows in set (0.54 sec)\
Read 71000000 rows, 497 MB in 0.543 sec., 130.71 million rows/sec., 914.95 MB/sec.
mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: DayOfWeek:UInt8, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 497000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
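The `projections: [0, 4, 31]` entry in the plan lists the columns that are actually read, as zero-based positions in the table definition: 0 is Year, 4 is DayOfWeek, and 31 is DepDelay. A small helper can look a position up in create_ontime.sql. This is a sketch that assumes one column per line, each containing `NOT NULL`, as in the schema above; the function name `column_at` is made up for this example, and the demo runs on a five-column excerpt of the schema.

```shell
#!/bin/sh
# column_at is a hypothetical helper name used only in this sketch.
column_at() {
    # $1 = schema file (for example create_ontime.sql), $2 = zero-based index
    awk -v want="$2" '
        BEGIN { n = 0 }
        /NOT NULL/ {                       # one column definition per line
            if (n == want + 0) { print $1; exit }
            n++
        }' "$1"
}

# Demo on an excerpt containing the first five columns of the schema.
demo=$(mktemp)
cat > "$demo" <<'EOF'
CREATE TABLE ontime
(
    Year                            UInt16 NOT NULL,
    Quarter                         UInt8 NOT NULL,
    Month                           UInt8 NOT NULL,
    DayofMonth                      UInt8 NOT NULL,
    DayOfWeek                       UInt8 NOT NULL
);
EOF
column_at "$demo" 0
column_at "$demo" 4
rm -f "$demo"
```

Run against the full file, `column_at create_ontime.sql 31` should print DepDelay.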

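Q3: Number of delays per origin airport, 2000 to 2008, top 10

(0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.)

mysql> SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;\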
+--------+--------+\
| Origin | c |\
+--------+--------+\
| ORD | 860911 |\
| ATL | 831822 |\
| DFW | 614403 |\
| LAX | 402671 |\
| PHX | 400475 |\
| LAS | 362026 |\
| DEN | 352893 |\
| EWR | 302267 |\
| DTW | 296832 |\
| IAH | 290729 |\
+--------+--------+\
10 rows in set (0.69 sec)\
Read 71000000 rows, 1.21 GB in 0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.
mysql> explain SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Limit: 10 |\
| Projection: Origin:String, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[Origin]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[Origin]], aggr=[[count()]] |\
| Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, Origin:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1271665856, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 14, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
7 rows in set (0.00 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q4: Number of delays per carrier in 2007

(0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;\
+---------+---------+\
| Carrier | count() |\
+---------+---------+\
| WN | 296451 |\
| AA | 179769 |\
| MQ | 152293 |\
| OO | 147019 |\
| US | 140199 |\
| UA | 135061 |\
| XE | 108571 |\
| EV | 104055 |\
| NW | 102206 |\
| DL | 98427 |\
| CO | 81039 |\
| YV | 79553 |\
| FL | 64583 |\
| OH | 60532 |\
| AS | 54326 |\
| B6 | 53716 |\
| 9E | 48578 |\
| F9 | 24100 |\
| AQ | 6764 |\
| HA | 4059 |\
+---------+---------+\
20 rows in set (0.19 sec)\
Read 15000000 rows, 240 MB in 0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, count():UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]] |\
| Filter: ((DepDelay > 10) and (Year = 2007)) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((DepDelay > 10) AND (Year = 2007))]] |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.00 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
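The `partitions_scanned` and `partitions_total` statistics show how much of the table a Year filter lets the scan skip: 71 of 207 partitions are read for 2000 to 2008, and only 15 of 207 for the single year 2007. A quick way to turn these into a percentage of partitions skipped is sketched below; the function name `pruned_pct` is made up for this example.

```shell
#!/bin/sh
# pruned_pct is a hypothetical helper name used only in this sketch.
pruned_pct() {
    # $1 = partitions_scanned, $2 = partitions_total
    awk -v s="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (t - s) * 100 / t }'
}

pruned_pct 71 207   # Q1 to Q3: Year between 2000 and 2008
pruned_pct 15 207   # Q4: Year = 2007
```

About two thirds of the partitions are skipped for the nine-year range and more than nine tenths for the single year, which fits the lower row counts read by Q4 (15000000 versus 71000000).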

Q5: Delay rate per carrier in 2007, per mille

(0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;\
+---------+--------------------+\
| Carrier | c3 |\
+---------+--------------------+\
| EV | 363.53123668047823 |\
| AS | 339.1453631738303 |\
| US | 288.8039271022377 |\
| AA | 283.6112877194699 |\
| MQ | 281.7663100792978 |\
| B6 | 280.5745625489684 |\
| UA | 275.63356884257615 |\
| YV | 270.25567158804466 |\
| OH | 256.4567516268981 |\
| WN | 253.62165713752844 |\
| CO | 250.77750030171651 |\
| XE | 249.71881878589517 |\
| NW | 246.56113247419944 |\
| F9 | 246.52209492635023 |\
| OO | 245.90051515354253 |\
| FL | 245.4143692596491 |\
| DL | 206.82764258051773 |\
| 9E | 187.66780889391967 |\
| AQ | 145.9016393442623 |\
| HA | 72.25634178905207 |\
+---------+--------------------+\
20 rows in set (0.27 sec)\
Read 15000000 rows, 240 MB in 0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64 |\
| Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 |\
| Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy) |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy) |\
| Filter: (Year = 2007) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [(Year = 2007)]] |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
8 rows in set (0.00 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
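The expression `avg(cast(DepDelay>10 as Int8))*1000` averages a 0/1 flag, so it equals the number of delayed flights divided by the number of all flights, times 1000. This can be cross-checked against Q4. For HA, Q4 reports 4059 delayed flights in 2007 and Q5 reports about 72.256 per mille. The total of 56175 flights used below is not printed anywhere in the output; it is inferred from those two values and should be read as an assumption. The function name `per_mille` is also made up for this example.

```shell
#!/bin/sh
# per_mille is a hypothetical helper name used only in this sketch.
per_mille() {
    # $1 = delayed flights, $2 = all flights
    awk -v d="$1" -v n="$2" 'BEGIN { printf "%.3f\n", d * 1000 / n }'
}

# HA in 2007: 4059 delayed flights (Q4) out of an inferred 56175 flights.
per_mille 4059 56175
```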

Q6: Delay rate per carrier, 2000 to 2008, per mille

(0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;\
+---------+--------------------+\
| Carrier | c3 |\
+---------+--------------------+\
| AS | 293.05649076611434 |\
| EV | 282.0709981074399 |\
| YV | 270.3897636688929 |\
| B6 | 257.40594891667007 |\
| FL | 249.28742951361826 |\
| XE | 246.59005902424192 |\
| MQ | 245.3695989400477 |\
| WN | 233.38127235928863 |\
| DH | 227.11013827345042 |\
| F9 | 226.08455653226812 |\
| UA | 224.42824657703645 |\
| OH | 215.52882835147614 |\
| AA | 211.97122176454556 |\
| US | 206.60330294168244 |\
| HP | 205.31690167066455 |\
| OO | 202.4243177198239 |\
| NW | 191.7393936377831 |\
| TW | 188.6912623180138 |\
| DL | 187.84162871590732 |\
| CO | 187.71301306878976 |\
| 9E | 181.6396991511518 |\
| RU | 181.46244295416398 |\
| TZ | 176.8928125899626 |\
| AQ | 145.65911608293766 |\
| HA | 79.38672451825789 |\
+---------+--------------------+\
25 rows in set (0.94 sec)\
Read 71000000 rows, 1.14 GB in 0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64 |\
| Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 |\
| Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy) |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy) |\
| Filter: ((Year >= 2000) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
8 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.003 sec., 0 rows/sec., 0 B/sec.

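Q7: Average departure delay per carrier, 2000 to 2008 (avg(DepDelay) multiplied by 1000)

(0.727 sec., 97.6 million rows/sec., 1.56 GB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;\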
+---------+--------------------+\
| Carrier | c3 |\
+---------+--------------------+\
| B6 | 16789.739456036365 |\
| NW | 11717.623092632819 |\
| F9 | 11232.889558936127 |\
| XE | 17092.548853057146 |\
| YV | 17971.53933699898 |\
| US | 11868.7097884053 |\
| RU | 12556.249210602802 |\
| AS | 14735.545887755581 |\
| HA | 6851.555976883671 |\
| OH | 12655.103820799075 |\
| UA | 14594.243159716054 |\
| TZ | 12618.760195758565 |\
| EV | 16374.703330010156 |\
| HP | 11625.682112859839 |\
| DH | 15311.949983190174 |\
| DL | 10943.456441165357 |\
| 9E | 13091.087573576122 |\
| FL | 15192.451732538268 |\
| MQ | 14125.201554023559 |\
| AQ | 7323.278123603293 |\
| OO | 11600.594852741107 |\
| AA | 13508.78515494305 |\
| TW | 10842.722114986364 |\
| WN | 10484.932610056378 |\
| CO | 12671.595978518368 |\
+---------+--------------------+\
25 rows in set (0.74 sec)\
Read 71000000 rows, 1.14 GB in 0.727 sec., 97.6 million rows/sec., 1.56 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;\
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(DepDelay) * 1000) as c3:Float64 |\
| Expression: IATA_CODE_Reporting_Airline:String, (avg(DepDelay) * 1000):Float64 (Before Projection) |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]] |\
| Filter: ((Year >= 2000) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |\
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q8: Average departure delay per year

(1.030 sec., 195.93 million rows/sec., 1.18 GB/sec.)

mysql> SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;\
+------+--------------------+\
| Year | avg(DepDelay) |\
+------+--------------------+\
| 1987 | 12.380385692195556 |\
| 1988 | 7.345867511864449 |\
| 1989 | 8.81845473300008 |\
| 1990 | 7.966702606180775 |\
| 1991 | 6.940411174086677 |\
| 1992 | 6.687364706154975 |\
| 1993 | 7.207721091071671 |\
| 1994 | 7.758752042452116 |\
| 1995 | 9.328649903752932 |\
| 1996 | 11.14468468976826 |\
| 1997 | 9.919225483813925 |\
| 1998 | 10.884314711941435 |\
| 1999 | 11.567390524113748 |\
| 2000 | 13.456897681824556 |\
| 2001 | 10.895474364001354 |\
| 2002 | 9.97856700710386 |\
| 2003 | 9.778465263372038 |\
| 2004 | 11.936799840656898 |\
| 2005 | 12.60167890747495 |\
| 2006 | 14.237297887039372 |\
| 2007 | 15.431738868356579 |\
| 2008 | 14.654588068064287 |\
| 2009 | 13.168984006133062 |\
| 2010 | 13.202976628175891 |\
| 2011 | 13.496191548097778 |\
| 2012 | 13.155971481255131 |\
| 2013 | 14.901210490900201 |\
| 2014 | 15.513697266113969 |\
| 2015 | 14.638336410280733 |\
| 2016 | 14.643883269504837 |\
| 2017 | 15.70225324299191 |\
| 2018 | 16.16188254545747 |\
| 2019 | 16.983263489524507 |\
| 2020 | 10.624498278073712 |\
| 2021 | 15.289615417399649 |\
+------+--------------------+\
35 rows in set (1.04 sec)\
Read 201816232 rows, 1.21 GB in 1.030 sec., 195.93 million rows/sec., 1.18 GB/sec.
mysql> explain SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;\
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: Year:UInt16, avg(DepDelay):Float64 |\
| AggregatorFinal: groupBy=[[Year]], aggr=[[avg(DepDelay)]] |\
| AggregatorPartial: groupBy=[[Year]], aggr=[[avg(DepDelay)]] |\
| ReadDataSource: scan schema: [Year:UInt16, DepDelay:Int32], statistics: [read_rows: 201816232, read_bytes: 1210897392, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 31]] |\
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
4 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q9: Number of flights per year

(0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.)

mysql> SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;\
+------+---------+\
| Year | c1 |\
+------+---------+\
| 1987 | 440403 |\
| 1988 | 5202096 |\
| 1989 | 5041200 |\
| 1990 | 5270893 |\
| 1991 | 5076925 |\
| 1992 | 5092157 |\
| 1993 | 5070501 |\
| 1994 | 5180048 |\
| 1995 | 5327435 |\
| 1996 | 5351983 |\
| 1997 | 5411843 |\
| 1998 | 5384721 |\
| 1999 | 5527884 |\
| 2000 | 5683047 |\
| 2001 | 5967780 |\
| 2002 | 5271359 |\
| 2003 | 6488540 |\
| 2004 | 7129270 |\
| 2005 | 7140596 |\
| 2006 | 7141922 |\
| 2007 | 7455458 |\
| 2008 | 7009726 |\
| 2009 | 6450285 |\
| 2010 | 6450117 |\
| 2011 | 6085281 |\
| 2012 | 6096762 |\
| 2013 | 6369482 |\
| 2014 | 5819811 |\
| 2015 | 5819079 |\
| 2016 | 5617658 |\
| 2017 | 5674621 |\
| 2018 | 7213446 |\
| 2019 | 7422037 |\
| 2020 | 4688354 |\
| 2021 | 5443512 |\
+------+---------+\
35 rows in set (0.52 sec)\
Read 201816232 rows, 403.63 MB in 0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.
mysql> explain SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: Year:UInt16, count() as c1:UInt64 |\
| AggregatorFinal: groupBy=[[Year]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[Year]], aggr=[[count()]] |\
| ReadDataSource: scan schema: [Year:UInt16], statistics: [read_rows: 201816232, read_bytes: 403632464, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0]] |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
4 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
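The yearly counts from Q9 give a simple consistency check for Q1: the counts for 2000 to 2008 should add up to the same total as the seven day-of-week counts in Q1, which sum to 59287698. The sketch below adds up the Q9 rows for a year range; the function name `sum_years` is made up for this example, and the input rows are copied from the Q9 result.

```shell
#!/bin/sh
# sum_years is a hypothetical helper name used only in this sketch.
sum_years() {
    # stdin: "year count" pairs; $1 = first year, $2 = last year (inclusive)
    awk -v lo="$1" -v hi="$2" '
        $1 >= lo + 0 && $1 <= hi + 0 { total += $2 }
        END { printf "%d\n", total }'
}

# Yearly flight counts copied from the Q9 result.
sum_years 2000 2008 <<'EOF'
2000 5683047
2001 5967780
2002 5271359
2003 6488540
2004 7129270
2005 7140596
2006 7141922
2007 7455458
2008 7009726
EOF
```

The total matches the Q1 sum, so both queries agree on the number of flights in that period. It is lower than the 71000000 rows read, since the scan reads whole partitions before the Year filter is applied.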

Q10: Average number of flights per month with a departure delay of 15 minutes or more

(0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.)

mysql> SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
+-------------------+
| avg(cnt)          |
+-------------------+
| 81474.99019607843 |
+-------------------+
1 row in set (0.90 sec)
Read 201816232 rows, 1.41 GB in 0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.

mysql> explain SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: avg(cnt):Float64 |
| AggregatorFinal: groupBy=[[]], aggr=[[avg(cnt)]] |
| AggregatorPartial: groupBy=[[]], aggr=[[avg(cnt)]] |
| Projection: Year:UInt16, Month:UInt8, count() as cnt:UInt64 |
| AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]] |
| Filter: (DepDel15 = 1) |
| ReadDataSource: scan schema: [Year:UInt16, Month:UInt8, DepDel15:Int32], statistics: [read_rows: 201816232, read_bytes: 1412713624, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2, 33], filters: [(DepDel15 = 1)]] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
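
The figures in parentheses under each query title come from the `Read ...` line that the server prints after the result. As a rough cross-check, the rows/sec value can be recomputed from the row count and the elapsed time on that line. The sketch below is illustrative only: the regular expression and the helper name `rows_per_sec` are assumptions made for this example, not part of Databend or the MySQL client. Because the printed elapsed time is rounded to milliseconds, the recomputed value only approximately matches the reported one.

```python
import re

# Matches the row count and elapsed seconds in a line such as:
# "Read 201816232 rows, 1.41 GB in 0.891 sec., 226.44 million rows/sec., 1.59 GB/sec."
READ_LINE = re.compile(r"Read (\d+) rows, .*? in ([\d.]+) sec\.")


def rows_per_sec(line: str) -> float:
    """Recompute throughput (rows per second) from a 'Read ...' summary line."""
    match = READ_LINE.search(line)
    if match is None:
        raise ValueError(f"not a 'Read ...' summary line: {line!r}")
    rows = int(match.group(1))
    seconds = float(match.group(2))
    return rows / seconds


if __name__ == "__main__":
    q10 = ("Read 201816232 rows, 1.41 GB in 0.891 sec., "
           "226.44 million rows/sec., 1.59 GB/sec.")
    print(f"{rows_per_sec(q10) / 1e6:.2f} million rows/sec")
```

For Q10 this gives roughly 226.5 million rows/sec, close to the reported 226.44 million; the small gap is consistent with the elapsed time having been rounded before printing.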

Q11: Average number of flights per month

(0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.)

mysql> SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;
+-------------------+
| avg(c1)           |
+-------------------+
| 494647.6274509804 |
+-------------------+
1 row in set (0.57 sec)
Read 201816232 rows, 605.45 MB in 0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.

mysql> explain SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: avg(c1):Float64 |
| AggregatorFinal: groupBy=[[]], aggr=[[avg(c1)]] |
| AggregatorPartial: groupBy=[[]], aggr=[[avg(c1)]] |
| Projection: Year:UInt16, Month:UInt8, count() as c1:UInt64 |
| AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]] |
| ReadDataSource: scan schema: [Year:UInt16, Month:UInt8], statistics: [read_rows: 201816232, read_bytes: 605448696, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
7 rows in set (0.02 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
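
Q10 and Q11 group by the same (Year, Month) key, so their results can be related with a little arithmetic. The sketch below uses only numbers printed in the query output of this article; the constant and function names are illustrative. It assumes that every month in the table has at least one flight with DepDel15 = 1, so that both subqueries produce the same number of groups. If that assumption does not hold, the delayed share computed here is only an approximation.

```python
# Values copied from the query output shown in this article.
TOTAL_ROWS = 201_816_232                    # "Read 201816232 rows"
AVG_FLIGHTS_PER_MONTH = 494647.6274509804   # Q11: avg(c1)
AVG_DELAYED_PER_MONTH = 81474.99019607843   # Q10: avg(cnt)


def month_count(total_rows: int, avg_per_month: float) -> float:
    """Number of (Year, Month) groups implied by the total and the average."""
    return total_rows / avg_per_month


def delayed_share(avg_delayed: float, avg_all: float) -> float:
    """Share of flights with DepDel15 = 1, assuming equal group counts."""
    return avg_delayed / avg_all


if __name__ == "__main__":
    months = month_count(TOTAL_ROWS, AVG_FLIGHTS_PER_MONTH)
    share = delayed_share(AVG_DELAYED_PER_MONTH, AVG_FLIGHTS_PER_MONTH)
    print(f"months in table: {months:.1f}")
    print(f"delayed share:   {share:.2%}")
```

The implied group count is 408 months, that is 34 years of monthly data, and under the stated assumption roughly 16.5% of all flights departed at least 15 minutes late.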

Q12: Top 10 city pairs by number of direct flights

(2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.)

mysql> SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
+-------------------+-------------------+--------+
| OriginCityName    | DestCityName      | c      |
+-------------------+-------------------+--------+
| San Francisco, CA | Los Angeles, CA   | 514878 |
| Los Angeles, CA   | San Francisco, CA | 512147 |
| New York, NY      | Chicago, IL       | 456042 |
| Chicago, IL       | New York, NY      | 448756 |
| Chicago, IL       | Minneapolis, MN   | 437913 |
| Minneapolis, MN   | Chicago, IL       | 433688 |
| Los Angeles, CA   | Las Vegas, NV     | 428942 |
| Las Vegas, NV     | Los Angeles, CA   | 422825 |
| New York, NY      | Boston, MA        | 419405 |
| Boston, MA        | New York, NY      | 416324 |
+-------------------+-------------------+--------+
10 rows in set (2.94 sec)
Read 201816232 rows, 8.54 GB in 2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.

mysql> explain SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10 |
| Projection: OriginCityName:String, DestCityName:String, count() as c:UInt64 |
| Sort: count():UInt64 |
| AggregatorFinal: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]] |
| ReadDataSource: scan schema: [OriginCityName:String, DestCityName:String], statistics: [read_rows: 201816232, read_bytes: 9829664815, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15, 24]] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q13: Top 10 origin cities by number of flights

(1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.)

mysql> SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
+-----------------------+----------+
| OriginCityName        | c        |
+-----------------------+----------+
| Chicago, IL           | 12545243 |
| Atlanta, GA           | 10900284 |
| Dallas/Fort Worth, TX |  9011081 |
| Houston, TX           |  6844476 |
| Los Angeles, CA       |  6695628 |
| New York, NY          |  6309911 |
| Denver, CO            |  6283055 |
| Phoenix, AZ           |  5658884 |
| Washington, DC        |  4998047 |
| San Francisco, CA     |  4673365 |
+-----------------------+----------+
10 rows in set (1.23 sec)
Read 201816232 rows, 4.27 GB in 1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.

mysql> explain SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10 |
| Projection: OriginCityName:String, count() as c:UInt64 |
| Sort: count():UInt64 |
| AggregatorFinal: groupBy=[[OriginCityName]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[OriginCityName]], aggr=[[count()]] |
| ReadDataSource: scan schema: [OriginCityName:String], statistics: [read_rows: 201816232, read_bytes: 4914707403, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q14: Total number of rows in the ontime table

(0.002 sec., 443.51 rows/sec., 443.51 B/sec.)

mysql> SELECT count(*) FROM ontime;
+-----------+
| count()   |
+-----------+
| 201816232 |
+-----------+
1 row in set (0.01 sec)
Read 1 rows, 1 B in 0.002 sec., 443.51 rows/sec., 443.51 B/sec.

mysql> explain SELECT count(*) FROM ontime;
+-----------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| Projection: count():UInt64 |
| Projection: 201816232 as count():UInt64 |
| Expression: 201816232:UInt64 (Exact Statistics) |
| ReadDataSource: scan schema: [dummy:UInt8], statistics: [read_rows: 1, read_bytes: 1, partitions_scanned: 1, partitions_total: 1] |
+-----------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.


More performance tests

Databend On Amazon S3 Performance

https://databend.rs/doc/performance/ec2-s3-performance

Databend On Alibaba Cloud ECS OSS Performance

https://databend.rs/doc/performance/ecs-oss-performance
Databend On Wasabi Performance

https://databend.rs/doc/performance/ec2-wasabi-performance

If you need support, add WeChat 82565387 for more help.

WeChat: Databend. Docs: https://databend.rs