Building a New-Style Cloud Data Warehouse with Databend and Tencent Cloud COS
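This article demonstrates how to use Databend on top of Tencent Cloud COS to build a new-style data warehouse and its compute layer. If you are looking for a low-cost, high-performance, elastic data warehouse, Databend offers a cloud-native data warehouse solution built on object storage. Databend currently supports writing data via stream load, copy into from stage, and insert, and can be deployed in standalone or cluster mode. For more support, add WeChat: 82565387. This is a long article; consider bookmarking it and reading it on a PC.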
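Introducing Databend
Databend is an open-source, new-style data warehouse written in Rust and designed entirely around object storage. It offers fast elastic scaling and aims to deliver an on-demand, pay-as-you-go Data Cloud product experience. Its features include:
•Vectorized Execution and a Pull&Push-Based Processor Model
•True separation of storage and compute: high performance, low cost, on-demand and pay-as-you-go usage
•Full database support, compatible with the MySQL and ClickHouse protocols and SQL over HTTP
•Complete transaction support, with features such as Data Time Travel and Database Zero Clone
•Multi-tenant reads, writes, and sharing on a single copy of the data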
GitHub repo: https://github.com/datafuselabs/databend
Docs: https://databend.rs
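For the Databend architecture diagram, see:
Tencent Cloud COS
Cloud Object Storage (COS) is a distributed storage service from Tencent Cloud with no directory hierarchy and no data format restrictions. It can hold massive amounts of data and is accessed over HTTP/HTTPS. COS buckets have no capacity limit and need no partition management, which suits scenarios such as CDN data distribution, Cloud Infinite data processing, and data lakes for big data computation and analysis. Website: https://cloud.tencent.com/product/cos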
Test environment
Beijing region: CVM SA2.8XLARGE64 & COS (ap-beijing)
Operating system: ubuntu-20
Databend: binary release v0.6.99-nightly
Installation and deployment for this test follow: https://databend.rs/doc/deploy/cos
For cluster deployment, see: https://databend.rs/doc/deploy/cluster_minio
Test data
wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip
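The wget command relies on shell brace expansion, so a single line turns into one request per year and month. As a rough sketch of what that expansion produces, the Python snippet below builds the same list of URLs. The function name ontime_urls is hypothetical; the base URL and file name pattern are copied from the command. Not every generated URL necessarily exists on the server, since a year may not have data for all twelve months.

```python
BASE = "https://transtats.bts.gov/PREZIP/"
PREFIX = "On_Time_Reporting_Carrier_On_Time_Performance_1987_present"

def ontime_urls(first_year=1987, last_year=2021):
    """Mirror the shell expansion {1987..2021}_{1..12} (hypothetical helper)."""
    return [
        f"{BASE}{PREFIX}_{year}_{month}.zip"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]

urls = ontime_urls()
print(len(urls))   # number of files the command will try to fetch
print(urls[0])
print(urls[-1])
```

Writing the list to a file and passing it to wget with -i is one way to resume or retry individual downloads.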
Table schema, see: cat create_ontime.sql
CREATE TABLE ontime
(
Year UInt16 NOT NULL,
Quarter UInt8 NOT NULL,
Month UInt8 NOT NULL,
DayofMonth UInt8 NOT NULL,
DayOfWeek UInt8 NOT NULL,
FlightDate Date NOT NULL,
Reporting_Airline String NOT NULL,
DOT_ID_Reporting_Airline Int32 NOT NULL,
IATA_CODE_Reporting_Airline String NOT NULL,
Tail_Number String NOT NULL,
Flight_Number_Reporting_Airline String NOT NULL,
OriginAirportID Int32 NOT NULL,
OriginAirportSeqID Int32 NOT NULL,
OriginCityMarketID Int32 NOT NULL,
Origin String NOT NULL,
OriginCityName String NOT NULL,
OriginState String NOT NULL,
OriginStateFips String NOT NULL,
OriginStateName String NOT NULL,
OriginWac Int32 NOT NULL,
DestAirportID Int32 NOT NULL,
DestAirportSeqID Int32 NOT NULL,
DestCityMarketID Int32 NOT NULL,
Dest String NOT NULL,
DestCityName String NOT NULL,
DestState String NOT NULL,
DestStateFips String NOT NULL,
DestStateName String NOT NULL,
DestWac Int32 NOT NULL,
CRSDepTime Int32 NOT NULL,
DepTime Int32 NOT NULL,
DepDelay Int32 NOT NULL,
DepDelayMinutes Int32 NOT NULL,
DepDel15 Int32 NOT NULL,
DepartureDelayGroups String NOT NULL,
DepTimeBlk String NOT NULL,
TaxiOut Int32 NOT NULL,
WheelsOff Int32 NOT NULL,
WheelsOn Int32 NOT NULL,
TaxiIn Int32 NOT NULL,
CRSArrTime Int32 NOT NULL,
ArrTime Int32 NOT NULL,
ArrDelay Int32 NOT NULL,
ArrDelayMinutes Int32 NOT NULL,
ArrDel15 Int32 NOT NULL,
ArrivalDelayGroups Int32 NOT NULL,
ArrTimeBlk String NOT NULL,
Cancelled UInt8 NOT NULL,
CancellationCode String NOT NULL,
Diverted UInt8 NOT NULL,
CRSElapsedTime Int32 NOT NULL,
ActualElapsedTime Int32 NOT NULL,
AirTime Int32 NOT NULL,
Flights Int32 NOT NULL,
Distance Int32 NOT NULL,
DistanceGroup UInt8 NOT NULL,
CarrierDelay Int32 NOT NULL,
WeatherDelay Int32 NOT NULL,
NASDelay Int32 NOT NULL,
SecurityDelay Int32 NOT NULL,
LateAircraftDelay Int32 NOT NULL,
FirstDepTime String NOT NULL,
TotalAddGTime String NOT NULL,
LongestAddGTime String NOT NULL,
DivAirportLandings String NOT NULL,
DivReachedDest String NOT NULL,
DivActualElapsedTime String NOT NULL,
DivArrDelay String NOT NULL,
DivDistance String NOT NULL,
Div1Airport String NOT NULL,
Div1AirportID Int32 NOT NULL,
Div1AirportSeqID Int32 NOT NULL,
Div1WheelsOn String NOT NULL,
Div1TotalGTime String NOT NULL,
Div1LongestGTime String NOT NULL,
Div1WheelsOff String NOT NULL,
Div1TailNum String NOT NULL,
Div2Airport String NOT NULL,
Div2AirportID Int32 NOT NULL,
Div2AirportSeqID Int32 NOT NULL,
Div2WheelsOn String NOT NULL,
Div2TotalGTime String NOT NULL,
Div2LongestGTime String NOT NULL,
Div2WheelsOff String NOT NULL,
Div2TailNum String NOT NULL,
Div3Airport String NOT NULL,
Div3AirportID Int32 NOT NULL,
Div3AirportSeqID Int32 NOT NULL,
Div3WheelsOn String NOT NULL,
Div3TotalGTime String NOT NULL,
Div3LongestGTime String NOT NULL,
Div3WheelsOff String NOT NULL,
Div3TailNum String NOT NULL,
Div4Airport String NOT NULL,
Div4AirportID Int32 NOT NULL,
Div4AirportSeqID Int32 NOT NULL,
Div4WheelsOn String NOT NULL,
Div4TotalGTime String NOT NULL,
Div4LongestGTime String NOT NULL,
Div4WheelsOff String NOT NULL,
Div4TailNum String NOT NULL,
Div5Airport String NOT NULL,
Div5AirportID Int32 NOT NULL,
Div5AirportSeqID Int32 NOT NULL,
Div5WheelsOn String NOT NULL,
Div5TotalGTime String NOT NULL,
Div5LongestGTime String NOT NULL,
Div5WheelsOff String NOT NULL,
Div5TailNum String NOT NULL
);
Create the table:
cat create_ontime.sql | mysql -h127.0.0.1 -P3307 -uroot
Loading the data
cat load_ontime.sh
echo "unzip ontime ,input your ontime zip dir: ./load_ontime.sh zip_dir"
ls $1/*.zip |xargs -I{} -P 4 bash -c "echo {}; unzip -q {} '*.csv' -d ./dataset"
if [ $? -eq 0 ];
then
echo "unzip success"
else
echo "unzip was wrong!!!"
exit 1
fi
cat create_ontime.sql |mysql -h127.0.0.1 -P3307 -uroot
if [ $? -eq 0 ];
then
echo "Ontime table create success"
else
echo "Ontime table create was wrong!!!"
exit 1
fi
time ls ./dataset/*.csv|xargs -P 8 -I{} curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:1" -F "upload=@{}" -XPUT http://localhost:8081/v1/streaming_load
Usage
./load_ontime.sh <directory containing the zip files>
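The last line of load_ontime.sh sends one streaming load request per CSV file. As a minimal sketch of that request for a single file (handy when one upload fails and needs a retry), the helper below rebuilds the same curl invocation in Python. The function name streaming_load_cmd and the sample path are hypothetical; the headers, form field, and endpoint are copied from the script.

```python
import shlex

# Endpoint and headers copied from load_ontime.sh; adjust to your deployment.
ENDPOINT = "http://localhost:8081/v1/streaming_load"

def streaming_load_cmd(csv_path, table="ontime", endpoint=ENDPOINT):
    """Build the curl argv that uploads one CSV file (hypothetical helper)."""
    return [
        "curl",
        "-H", f"insert_sql:insert into {table} format CSV",
        "-H", "skip_header:1",   # the first line of each CSV is a header row
        "-F", f"upload=@{csv_path}",
        "-XPUT", endpoint,
    ]

cmd = streaming_load_cmd("./dataset/example.csv")
print(shlex.join(cmd))  # dry run: prints the command without executing it
```

To actually send the request, the list can be handed to subprocess.run(cmd, check=True).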
Ontime test queries
Q1: Total number of flights per day of the week, 2000 to 2008
(0.494 sec., 143.75 million rows/sec., 431.25 MB/sec)
mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-----------+---------+
| DayOfWeek | c |
+-----------+---------+
| 5 | 8732422 |
| 1 | 8730614 |
| 4 | 8710843 |
| 3 | 8685626 |
| 2 | 8639632 |
| 7 | 8274367 |
| 6 | 7514194 |
+-----------+---------+
7 rows in set (0.50 sec)
Read 71000000 rows, 213 MB in 0.494 sec., 143.75 million rows/sec., 431.25 MB/sec.
mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: DayOfWeek:UInt8, count() as c:UInt64 |
| Sort: count():UInt64 |
| AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]] |
| Filter: ((Year >= 2000) and (Year <= 2008)) |
| ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8], statistics: [read_rows: 71000000, read_bytes: 213000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4], filters: [((Year >= 2000) AND (Year <= 2008))]] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
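The statistics in the Q1 plan can be cross-checked with simple arithmetic. The sketch below recomputes the throughput from the rows, bytes, and elapsed time reported above, and works out how many partitions the pushed-down Year filter allowed the scan to skip. The helper names are hypothetical and megabytes are taken as decimal (10^6 bytes); all input numbers are copied from the Q1 output.

```python
def throughput(rows, bytes_read, seconds):
    """Return (million rows/sec, MB/sec) using decimal megabytes."""
    return rows / seconds / 1e6, bytes_read / seconds / 1e6

def pruned_fraction(partitions_scanned, partitions_total):
    """Fraction of partitions that were skipped entirely."""
    return 1 - partitions_scanned / partitions_total

# Values copied from the Q1 result and plan.
rows, bytes_read, seconds = 71_000_000, 213_000_000, 0.494
mrows_per_s, mb_per_s = throughput(rows, bytes_read, seconds)

print(f"{mrows_per_s:.2f} million rows/sec, {mb_per_s:.2f} MB/sec")
print(f"bytes per row: {bytes_read / rows:.0f}")  # Year (UInt16) + DayOfWeek (UInt8)
print(f"partitions pruned: {pruned_fraction(71, 207):.1%}")
```

The recomputed rates land within a fraction of a percent of the reported 143.75 million rows/sec and 431.25 MB/sec; the small gap is presumably due to the elapsed time being rounded to milliseconds. Only 71 of 207 partitions are read, and only the two projected columns (3 bytes per row) are fetched.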
Q2: Flights delayed by more than 10 minutes per day of the week, 2000 to 2008
(0.543 sec., 130.71 million rows/sec., 914.95 MB/sec.)
mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-----------+---------+
| DayOfWeek | c |
+-----------+---------+
| 5 | 2175733 |
| 4 | 2012848 |
| 1 | 1898879 |
| 7 | 1880896 |
| 3 | 1757508 |
| 2 | 1665303 |
| 6 | 1510894 |
+-----------+---------+
7 rows in set (0.54 sec)
Read 71000000 rows, 497 MB in 0.543 sec., 130.71 million rows/sec., 914.95 MB/sec.
mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: DayOfWeek:UInt8, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]] |\
| Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 497000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q3: Number of delays by origin airport, 2000-2008, top 10
(0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.)
mysql> SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
+--------+--------+
| Origin | c |
+--------+--------+
| ORD | 860911 |
| ATL | 831822 |
| DFW | 614403 |
| LAX | 402671 |
| PHX | 400475 |
| LAS | 362026 |
| DEN | 352893 |
| EWR | 302267 |
| DTW | 296832 |
| IAH | 290729 |
+--------+--------+
10 rows in set (0.69 sec)
Read 71000000 rows, 1.21 GB in 0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.
mysql> explain SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Limit: 10 |\
| Projection: Origin:String, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[Origin]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[Origin]], aggr=[[count()]] |\
| Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, Origin:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1271665856, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 14, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
7 rows in set (0.00 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q4: Number of delays per airline in 2007
(0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.)
mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;
+---------+---------+
| Carrier | count() |
+---------+---------+
| WN | 296451 |
| AA | 179769 |
| MQ | 152293 |
| OO | 147019 |
| US | 140199 |
| UA | 135061 |
| XE | 108571 |
| EV | 104055 |
| NW | 102206 |
| DL | 98427 |
| CO | 81039 |
| YV | 79553 |
| FL | 64583 |
| OH | 60532 |
| AS | 54326 |
| B6 | 53716 |
| 9E | 48578 |
| F9 | 24100 |
| AQ | 6764 |
| HA | 4059 |
+---------+---------+
20 rows in set (0.19 sec)
Read 15000000 rows, 240 MB in 0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, count():UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]] |\
| Filter: ((DepDelay > 10) and (Year = 2007)) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((DepDelay > 10) AND (Year = 2007))]] |\
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.00 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q5: Delay rate per airline in 2007, per thousand flights
(0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.)
mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;
+---------+--------------------+
| Carrier | c3 |
+---------+--------------------+
| EV | 363.53123668047823 |
| AS | 339.1453631738303 |
| US | 288.8039271022377 |
| AA | 283.6112877194699 |
| MQ | 281.7663100792978 |
| B6 | 280.5745625489684 |
| UA | 275.63356884257615 |
| YV | 270.25567158804466 |
| OH | 256.4567516268981 |
| WN | 253.62165713752844 |
| CO | 250.77750030171651 |
| XE | 249.71881878589517 |
| NW | 246.56113247419944 |
| F9 | 246.52209492635023 |
| OO | 245.90051515354253 |
| FL | 245.4143692596491 |
| DL | 206.82764258051773 |
| 9E | 187.66780889391967 |
| AQ | 145.9016393442623 |
| HA | 72.25634178905207 |
+---------+--------------------+
20 rows in set (0.27 sec)
Read 15000000 rows, 240 MB in 0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64 |
| Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 |
| Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy) |
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |
| Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy) |
| Filter: (Year = 2007) |
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [(Year = 2007)]] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
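The Q5 expression avg(cast(DepDelay>10 as Int8))*1000 works because the comparison yields 0 or 1, so its average is the fraction of delayed flights. The sketch below illustrates this on a tiny made-up table, using Python's built-in sqlite3 module as a stand-in for Databend. The rows are invented for illustration, the column is simplified to Carrier, and SQLite's INTEGER replaces Int8; it says nothing about how Databend executes the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ontime (Carrier TEXT, DepDelay INTEGER)")

# Made-up sample rows: AA has 2 of 4 flights delayed > 10 min, WN has 1 of 4.
sample = [("AA", 5), ("AA", 20), ("AA", 30), ("AA", 0),
          ("WN", 11), ("WN", 3), ("WN", 1), ("WN", 2)]
conn.executemany("INSERT INTO ontime VALUES (?, ?)", sample)

# Same shape as Q5: average of a 0/1 indicator, scaled to per-thousand.
per_mille = conn.execute(
    "SELECT Carrier, avg(CAST(DepDelay > 10 AS INTEGER)) * 1000 AS c3 "
    "FROM ontime GROUP BY Carrier ORDER BY c3 DESC"
).fetchall()

# Equivalent formulation: delayed flights divided by all flights.
explicit = conn.execute(
    "SELECT Carrier, 1000.0 * sum(DepDelay > 10) / count(*) AS c3 "
    "FROM ontime GROUP BY Carrier ORDER BY c3 DESC"
).fetchall()

print(per_mille)
print(per_mille == explicit)
```

Both formulations return the same per-carrier values, which is why the plan above needs only a single avg aggregate over the cast expression.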
Q6: Delay rate per airline, 2000-2008, per thousand flights
(0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.)
mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;
+---------+--------------------+
| Carrier | c3 |
+---------+--------------------+
| AS | 293.05649076611434 |
| EV | 282.0709981074399 |
| YV | 270.3897636688929 |
| B6 | 257.40594891667007 |
| FL | 249.28742951361826 |
| XE | 246.59005902424192 |
| MQ | 245.3695989400477 |
| WN | 233.38127235928863 |
| DH | 227.11013827345042 |
| F9 | 226.08455653226812 |
| UA | 224.42824657703645 |
| OH | 215.52882835147614 |
| AA | 211.97122176454556 |
| US | 206.60330294168244 |
| HP | 205.31690167066455 |
| OO | 202.4243177198239 |
| NW | 191.7393936377831 |
| TW | 188.6912623180138 |
| DL | 187.84162871590732 |
| CO | 187.71301306878976 |
| 9E | 181.6396991511518 |
| RU | 181.46244295416398 |
| TZ | 176.8928125899626 |
| AQ | 145.65911608293766 |
| HA | 79.38672451825789 |
+---------+--------------------+
25 rows in set (0.94 sec)
Read 71000000 rows, 1.14 GB in 0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64 |\
| Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 |\
| Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy) |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]] |\
| Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy) |\
| Filter: ((Year >= 2000) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |\
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
8 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.003 sec., 0 rows/sec., 0 B/sec.
Q7: Average departure delay per airline, 2000-2008 (scaled by 1000)
(0.727 sec., 97.6 million rows/sec., 1.56 GB/sec.)
mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
+---------+--------------------+
| Carrier | c3 |
+---------+--------------------+
| B6 | 16789.739456036365 |
| NW | 11717.623092632819 |
| F9 | 11232.889558936127 |
| XE | 17092.548853057146 |
| YV | 17971.53933699898 |
| US | 11868.7097884053 |
| RU | 12556.249210602802 |
| AS | 14735.545887755581 |
| HA | 6851.555976883671 |
| OH | 12655.103820799075 |
| UA | 14594.243159716054 |
| TZ | 12618.760195758565 |
| EV | 16374.703330010156 |
| HP | 11625.682112859839 |
| DH | 15311.949983190174 |
| DL | 10943.456441165357 |
| 9E | 13091.087573576122 |
| FL | 15192.451732538268 |
| MQ | 14125.201554023559 |
| AQ | 7323.278123603293 |
| OO | 11600.594852741107 |
| AA | 13508.78515494305 |
| TW | 10842.722114986364 |
| WN | 10484.932610056378 |
| CO | 12671.595978518368 |
+---------+--------------------+
25 rows in set (0.74 sec)
Read 71000000 rows, 1.14 GB in 0.727 sec., 97.6 million rows/sec., 1.56 GB/sec.
mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;\
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(DepDelay) * 1000) as c3:Float64 |\
| Expression: IATA_CODE_Reporting_Airline:String, (avg(DepDelay) * 1000):Float64 (Before Projection) |\
| AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]] |\
| AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]] |\
| Filter: ((Year >= 2000) and (Year <= 2008)) |\
| ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |\
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q8: Average departure delay per year
(1.030 sec., 195.93 million rows/sec., 1.18 GB/sec.)
mysql> SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
+------+--------------------+
| Year | avg(DepDelay) |
+------+--------------------+
| 1987 | 12.380385692195556 |
| 1988 | 7.345867511864449 |
| 1989 | 8.81845473300008 |
| 1990 | 7.966702606180775 |
| 1991 | 6.940411174086677 |
| 1992 | 6.687364706154975 |
| 1993 | 7.207721091071671 |
| 1994 | 7.758752042452116 |
| 1995 | 9.328649903752932 |
| 1996 | 11.14468468976826 |
| 1997 | 9.919225483813925 |
| 1998 | 10.884314711941435 |
| 1999 | 11.567390524113748 |
| 2000 | 13.456897681824556 |
| 2001 | 10.895474364001354 |
| 2002 | 9.97856700710386 |
| 2003 | 9.778465263372038 |
| 2004 | 11.936799840656898 |
| 2005 | 12.60167890747495 |
| 2006 | 14.237297887039372 |
| 2007 | 15.431738868356579 |
| 2008 | 14.654588068064287 |
| 2009 | 13.168984006133062 |
| 2010 | 13.202976628175891 |
| 2011 | 13.496191548097778 |
| 2012 | 13.155971481255131 |
| 2013 | 14.901210490900201 |
| 2014 | 15.513697266113969 |
| 2015 | 14.638336410280733 |
| 2016 | 14.643883269504837 |
| 2017 | 15.70225324299191 |
| 2018 | 16.16188254545747 |
| 2019 | 16.983263489524507 |
| 2020 | 10.624498278073712 |
| 2021 | 15.289615417399649 |
+------+--------------------+
35 rows in set (1.04 sec)
Read 201816232 rows, 1.21 GB in 1.030 sec., 195.93 million rows/sec., 1.18 GB/sec.
mysql> explain SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: Year:UInt16, avg(DepDelay):Float64 |
| AggregatorFinal: groupBy=[[Year]], aggr=[[avg(DepDelay)]] |
| AggregatorPartial: groupBy=[[Year]], aggr=[[avg(DepDelay)]] |
| ReadDataSource: scan schema: [Year:UInt16, DepDelay:Int32], statistics: [read_rows: 201816232, read_bytes: 1210897392, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 31]] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q9: Number of flights per year
(0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.)
mysql> SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;
+------+---------+
| Year | c1 |
+------+---------+
| 1987 | 440403 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092157 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 5967780 |
| 2002 | 5271359 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 2014 | 5819811 |
| 2015 | 5819079 |
| 2016 | 5617658 |
| 2017 | 5674621 |
| 2018 | 7213446 |
| 2019 | 7422037 |
| 2020 | 4688354 |
| 2021 | 5443512 |
+------+---------+
35 rows in set (0.52 sec)
Read 201816232 rows, 403.63 MB in 0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.
mysql> explain SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: Year:UInt16, count() as c1:UInt64 |
| AggregatorFinal: groupBy=[[Year]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[Year]], aggr=[[count()]] |
| ReadDataSource: scan schema: [Year:UInt16], statistics: [read_rows: 201816232, read_bytes: 403632464, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0]] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q10: Average monthly number of flights delayed by 15 minutes or more
(0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.)
mysql> SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
+-------------------+
| avg(cnt) |
+-------------------+
| 81474.99019607843 |
+-------------------+
1 row in set (0.90 sec)
Read 201816232 rows, 1.41 GB in 0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.
mysql> explain SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: avg(cnt):Float64 |
| AggregatorFinal: groupBy=[[]], aggr=[[avg(cnt)]] |
| AggregatorPartial: groupBy=[[]], aggr=[[avg(cnt)]] |
| Projection: Year:UInt16, Month:UInt8, count() as cnt:UInt64 |
| AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]] |
| AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]] |
| Filter: (DepDel15 = 1) |
| ReadDataSource: scan schema: [Year:UInt16, Month:UInt8, DepDel15:Int32], statistics: [read_rows: 201816232, read_bytes: 1412713624, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2, 33], filters: [(DepDel15 = 1)]] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q11: Average number of flights per month
(0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.)
mysql> SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;
+-------------------+
| avg(c1) |
+-------------------+
| 494647.6274509804 |
+-------------------+
1 row in set (0.57 sec)
Read 201816232 rows, 605.45 MB in 0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.
mysql> explain SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: avg(c1):Float64 |\
| AggregatorFinal: groupBy=[[]], aggr=[[avg(c1)]] |\
| AggregatorPartial: groupBy=[[]], aggr=[[avg(c1)]] |\
| Projection: Year:UInt16, Month:UInt8, count() as c1:UInt64 |\
| AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]] |\
| ReadDataSource: scan schema: [Year:UInt16, Month:UInt8], statistics: [read_rows: 201816232, read_bytes: 605448696, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2]] |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
7 rows in set (0.02 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
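The throughput figures on each summary line follow directly from the scan statistics and the elapsed time. The sketch below recomputes them for Q11 from the read_rows and read_bytes values in the plan above. The function name `throughput` is hypothetical, and decimal units (1 GB = 10^9 bytes) are an assumption that appears to match the reported numbers.

```python
def throughput(read_rows, read_bytes, elapsed_sec):
    """Return (rows per second, bytes per second) for one query run."""
    return read_rows / elapsed_sec, read_bytes / elapsed_sec

# read_rows and read_bytes come from the Q11 plan, elapsed time from the Q11 run.
rows_per_sec, bytes_per_sec = throughput(201_816_232, 605_448_696, 0.561)

# Assumes decimal units: 1 million = 1e6, 1 GB = 1e9 bytes.
print(f"{rows_per_sec / 1e6:.2f} million rows/sec, {bytes_per_sec / 1e9:.2f} GB/sec")
```

The recomputed row rate lands close to, but not exactly on, the reported 359.58 million rows/sec, most likely because the elapsed time shown in the output is rounded to milliseconds.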
Q12 Top 10 city pairs with the most direct flights
(2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.)
mysql> SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;\
+-------------------+-------------------+--------+\
| OriginCityName | DestCityName | c |\
+-------------------+-------------------+--------+\
| San Francisco, CA | Los Angeles, CA | 514878 |\
| Los Angeles, CA | San Francisco, CA | 512147 |\
| New York, NY | Chicago, IL | 456042 |\
| Chicago, IL | New York, NY | 448756 |\
| Chicago, IL | Minneapolis, MN | 437913 |\
| Minneapolis, MN | Chicago, IL | 433688 |\
| Los Angeles, CA | Las Vegas, NV | 428942 |\
| Las Vegas, NV | Los Angeles, CA | 422825 |\
| New York, NY | Boston, MA | 419405 |\
| Boston, MA | New York, NY | 416324 |\
+-------------------+-------------------+--------+\
10 rows in set (2.94 sec)\
Read 201816232 rows, 8.54 GB in 2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.
mysql> explain SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;\
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Limit: 10 |\
| Projection: OriginCityName:String, DestCityName:String, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]] |\
| ReadDataSource: scan schema: [OriginCityName:String, DestCityName:String], statistics: [read_rows: 201816232, read_bytes: 9829664815, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15, 24]] |\
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.00 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
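The plan above contains an AggregatorPartial step followed by an AggregatorFinal step. A common way to read this pair is as two-phase aggregation: each partition is counted independently, and the partial counts are then merged before sorting and applying the limit. The sketch below illustrates that general idea in Python. It is a toy model under that assumption, not Databend's implementation; the function names and sample rows are made up.

```python
from collections import Counter

def aggregate_partial(rows):
    """Count each (origin, destination) key within a single partition."""
    return Counter(rows)

def aggregate_final(partials):
    """Merge per-partition counts into one global count per key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Made-up sample rows split across two partitions (not benchmark data).
partition_a = [
    ("San Francisco, CA", "Los Angeles, CA"),
    ("New York, NY", "Boston, MA"),
    ("San Francisco, CA", "Los Angeles, CA"),
]
partition_b = [
    ("New York, NY", "Boston, MA"),
    ("San Francisco, CA", "Los Angeles, CA"),
]

partials = [aggregate_partial(p) for p in (partition_a, partition_b)]
merged = aggregate_final(partials)

# ORDER BY c DESC LIMIT 1 corresponds to taking the single most common key.
top = merged.most_common(1)
print(top)
```

Because the partial step only needs its own partition, it can run in parallel across partitions or nodes, and only the much smaller per-key counts have to be merged in the final step.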
Q13 Top 10 origin cities with the most flights
(1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.)
mysql> SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;\
+-----------------------+----------+\
| OriginCityName | c |\
+-----------------------+----------+\
| Chicago, IL | 12545243 |\
| Atlanta, GA | 10900284 |\
| Dallas/Fort Worth, TX | 9011081 |\
| Houston, TX | 6844476 |\
| Los Angeles, CA | 6695628 |\
| New York, NY | 6309911 |\
| Denver, CO | 6283055 |\
| Phoenix, AZ | 5658884 |\
| Washington, DC | 4998047 |\
| San Francisco, CA | 4673365 |\
+-----------------------+----------+\
10 rows in set (1.23 sec)\
Read 201816232 rows, 4.27 GB in 1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.
mysql> explain SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
| Limit: 10 |\
| Projection: OriginCityName:String, count() as c:UInt64 |\
| Sort: count():UInt64 |\
| AggregatorFinal: groupBy=[[OriginCityName]], aggr=[[count()]] |\
| AggregatorPartial: groupBy=[[OriginCityName]], aggr=[[count()]] |\
| ReadDataSource: scan schema: [OriginCityName:String], statistics: [read_rows: 201816232, read_bytes: 4914707403, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15]] |\
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\
6 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
Q14 Total number of rows in the ontime table
(0.002 sec., 443.51 rows/sec., 443.51 B/sec.)
mysql> SELECT count(*) FROM ontime;\
+-----------+\
| count() |\
+-----------+\
| 201816232 |\
+-----------+\
1 row in set (0.01 sec)\
Read 1 rows, 1 B in 0.002 sec., 443.51 rows/sec., 443.51 B/sec.
mysql> explain SELECT count(*) FROM ontime;\
+-----------------------------------------------------------------------------------------------------------------------------------------+\
| explain |\
+-----------------------------------------------------------------------------------------------------------------------------------------+\
| Projection: count():UInt64 |\
| Projection: 201816232 as count():UInt64 |\
| Expression: 201816232:UInt64 (Exact Statistics) |\
| ReadDataSource: scan schema: [dummy:UInt8], statistics: [read_rows: 1, read_bytes: 1, partitions_scanned: 1, partitions_total: 1] |\
+-----------------------------------------------------------------------------------------------------------------------------------------+\
4 rows in set (0.01 sec)\
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
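The Q14 plan shows the count as a constant marked "Exact Statistics" and a scan of a single dummy row, which suggests the answer comes from table metadata rather than from reading the data. The sketch below models that idea in Python: if every partition records its row count, count(*) is just the sum of those counts. The `PartitionMeta` class, its fields, and the partition sizes are all hypothetical and are not Databend's actual metadata format.

```python
from dataclasses import dataclass

@dataclass
class PartitionMeta:
    # Hypothetical per-partition metadata; field names are illustrative only.
    path: str
    row_count: int

def count_star(partitions):
    """Answer count(*) from metadata alone, without reading any data blocks."""
    return sum(p.row_count for p in partitions)

# Made-up partition sizes, not taken from the benchmark.
partitions = [
    PartitionMeta("part-0", 1_000_000),
    PartitionMeta("part-1", 1_000_000),
    PartitionMeta("part-2", 250_000),
]

total = count_star(partitions)
print(total)
```

This would explain why Q14 returns in about 0.002 seconds regardless of table size: the cost depends on the amount of metadata, not on the 201816232 rows stored in COS.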
More performance tests
Databend On Amazon S3 Performance
https://databend.rs/doc/performance/ec2-s3-performance
Databend On Alibaba Cloud ECS OSS Performance
https://databend.rs/doc/performance/ecs-oss-performance
Databend On Wasabi Performance
https://databend.rs/doc/performance/ec2-wasabi-performance
If you need support, add us on WeChat: 82565387.
WeChat: Databend
Docs: https://databend.rs