
Analyzing the TPC-H Dataset in CSV Format with Data Lake Analytics + OSS


  • Introduction to Data Lake Analytics (DLA)
    For background on the data lake concept, see:
    https://en.wikipedia.org/wiki/Data_lake
  • AWS and Azure also publish their own explanations of data lakes:
    https://amazonaws-china.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
    https://azure.microsoft.com/en-us/solutions/data-lake/

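    Alibaba Cloud now offers its own data lake analytics product as well: https://www.aliyun.com/product/datalakeanalytics

    You can apply for access (the product is currently in public beta on an invitation basis, and applications are reviewed as quickly as possible) and then follow this tutorial to analyze TPC-H data in CSV format.

    Product documentation: https://help.aliyun.com/product/70174.html

    1. Activate Data Lake Analytics and OSS
      If both services are already activated, skip this step. Otherwise, see https://help.aliyun.com/document_detail/70386.html
      to apply for activation.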

    2. Download the TPC-H test dataset
      A 100 MB TPC-H dataset can be downloaded here:
      https://public-datasets-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/tpch_100m_data.zip

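    3. Upload the data files to OSS
      Log in to the OSS console on the Alibaba Cloud website: https://oss.console.aliyun.com/overview
      Decide which OSS bucket to use. After creating or selecting it, click "File Management". Because there are 8 data files, create one directory per data file, 8 directories in total.

    Open each directory and upload the matching data file. For example, upload customer.tbl to the customer directory.

    Then upload the other 7 data files to their corresponding directories in the same way.

    At this point, all 8 data files are in your OSS bucket: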

    oss://xxx/tpch_100m/customer/customer.tbl
    oss://xxx/tpch_100m/lineitem/lineitem.tbl
    oss://xxx/tpch_100m/nation/nation.tbl
    oss://xxx/tpch_100m/orders/orders.tbl
    oss://xxx/tpch_100m/part/part.tbl
    oss://xxx/tpch_100m/partsupp/partsupp.tbl
    oss://xxx/tpch_100m/region/region.tbl
    oss://xxx/tpch_100m/supplier/supplier.tbl
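
The directory layout above follows one pattern per table, which can be sketched in a few lines of Python. This is only an illustration for checking your own upload plan: the function name `tpch_object_paths` is hypothetical, and `xxx` stands for your bucket name exactly as in the listing.

```python
# Table names taken from the eight paths listed above.
TPCH_TABLES = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
]


def tpch_object_paths(bucket: str, prefix: str = "tpch_100m") -> dict:
    """Map each local file name to the OSS path it is uploaded to.

    One directory per table, holding a single <table>.tbl file.
    """
    return {
        f"{table}.tbl": f"oss://{bucket}/{prefix}/{table}/{table}.tbl"
        for table in TPCH_TABLES
    }


if __name__ == "__main__":
    for local_name, oss_path in tpch_object_paths("xxx").items():
        print(f"{local_name} -> {oss_path}")
```

Keeping each file in its own directory matters later, because every table's LOCATION points at a directory rather than at a single file.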

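    4. Log in to the Data Lake Analytics console
      https://openanalytics.console.aliyun.com/
      Click "Log in to Database", enter the username and password assigned when the service was activated, and log in to the Data Lake Analytics console.

    5. Create the schema and tables
      Enter the CREATE SCHEMA statement and click "Synchronous Execution".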

    CREATE SCHEMA tpch_100m with DBPROPERTIES(
    LOCATION = 'oss://test-bucket-julian-1/tpch_100m/',
    catalog='oss'
    );
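    (Note: within a single Alibaba Cloud region, Data Lake Analytics schema names are currently globally unique, so choose a name that reflects your business. If the name is already taken, creation fails with an error; in that case, pick a different schema name.)

    After the schema is created, select it from the "Database" drop-down list. Then enter a CREATE TABLE statement in the SQL text box and click "Synchronous Execution".
    For the CREATE TABLE syntax, see: https://help.aliyun.com/document_detail/72006.html

    The CREATE TABLE statements for the 8 TPC-H tables are listed below. Paste each one into the text box and execute it, adjusting the data file location in the LOCATION clause to match your own OSS bucket directory. (Note: the console does not yet support executing multiple SQL statements at once, so run them one at a time.)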

    CREATE EXTERNAL TABLE nation (
    N_NATIONKEY INT,
    N_NAME STRING,
    N_ID STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';

    CREATE EXTERNAL TABLE lineitem (
    L_ORDERKEY INT,
    L_PARTKEY INT,
    L_SUPPKEY INT,
    L_LINENUMBER INT,
    L_QUANTITY DOUBLE,
    L_EXTENDEDPRICE DOUBLE,
    L_DISCOUNT DOUBLE,
    L_TAX DOUBLE,
    L_RETURNFLAG STRING,
    L_LINESTATUS STRING,
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT STRING,
    L_SHIPMODE STRING,
    L_COMMENT STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/lineitem';

    CREATE EXTERNAL TABLE orders (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/orders';

    CREATE EXTERNAL TABLE supplier (
    S_SUPPKEY INT,
    S_NAME STRING,
    S_ADDRESS STRING,
    S_NATIONKEY INT,
    S_PHONE STRING,
    S_ACCTBAL DOUBLE,
    S_COMMENT STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/supplier';

    CREATE EXTERNAL TABLE partsupp (
    PS_PARTKEY INT,
    PS_SUPPKEY INT,
    PS_AVAILQTY INT,
    PS_SUPPLYCOST DOUBLE,
    PS_COMMENT STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/partsupp';

    CREATE EXTERNAL TABLE customer (
    C_CUSTKEY INT,
    C_NAME STRING,
    C_ADDRESS STRING,
    C_NATIONKEY INT,
    C_PHONE STRING,
    C_ACCTBAL DOUBLE,
    C_MKTSEGMENT STRING,
    C_COMMENT STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/customer';

    CREATE EXTERNAL TABLE part (
    P_PARTKEY INT,
    P_NAME STRING,
    P_MFGR STRING,
    P_BRAND STRING,
    P_TYPE STRING,
    P_SIZE INT,
    P_CONTAINER STRING,
    P_RETAILPRICE DOUBLE,
    P_COMMENT STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/part';

    CREATE EXTERNAL TABLE region (
    R_REGIONKEY INT,
    R_NAME STRING,
    R_COMMENT STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'oss://test-bucket-julian-1/tpch_100m/region';
    After all the tables are created, refresh the page. The 8 tables appear under the schema in the left navigation pane.
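
Since all 8 statements share the same shape and differ only in the column list and the LOCATION path, the bucket substitution can be sketched with a small Python helper. This is a convenience sketch, not part of the product: `create_table_sql` and the bucket name `my-bucket` are hypothetical, and the column list is copied from the region table above.

```python
def create_table_sql(table, columns, bucket, prefix="tpch_100m"):
    """Render a CREATE EXTERNAL TABLE statement in the same shape as the ones above.

    columns is a list of (name, sql_type) pairs; bucket is your own OSS bucket.
    """
    column_lines = ",\n".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION 'oss://{bucket}/{prefix}/{table}';"
    )


# Column list copied from the region table definition.
REGION_COLUMNS = [
    ("R_REGIONKEY", "INT"),
    ("R_NAME", "STRING"),
    ("R_COMMENT", "STRING"),
]

if __name__ == "__main__":
    # "my-bucket" is a placeholder; use the bucket that holds your data files.
    print(create_table_sql("region", REGION_COLUMNS, "my-bucket"))
```

The printed statement can then be pasted into the console and executed on its own, one table at a time.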


    6. Run the TPC-H queries
      TPC-H defines 22 queries in total, listed below:
      Q1:

    SELECT l_returnflag,
    l_linestatus,
    Sum(l_quantity) AS sum_qty,
    Sum(l_extendedprice) AS sum_base_price,
    Sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    Avg(l_quantity) AS avg_qty,
    Avg(l_extendedprice) AS avg_price,
    Avg(l_discount) AS avg_disc,
    Count(*) AS count_order
    FROM lineitem
    WHERE l_shipdate <= date '1998-12-01' - INTERVAL '93' day
    GROUP BY l_returnflag,
    l_linestatus
    ORDER BY l_returnflag,
    l_linestatus
    LIMIT 1;
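
To make the arithmetic in Q1 concrete, here is a small Python sketch of the date cutoff and the two derived price terms. The sample row values (100.0, 0.05, 0.08) are invented for illustration and do not come from the dataset, and `q1_row_terms` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Equivalent of: date '1998-12-01' - INTERVAL '93' day
cutoff = date(1998, 12, 1) - timedelta(days=93)


def q1_row_terms(extendedprice, discount, tax):
    """Per-row terms that Q1 sums as sum_disc_price and sum_charge."""
    disc_price = extendedprice * (1 - discount)  # price after discount
    charge = disc_price * (1 + tax)              # discounted price plus tax
    return disc_price, charge


if __name__ == "__main__":
    print(cutoff)  # 1998-08-30
    # Made-up row: price 100.0, 5% discount, 8% tax.
    print(q1_row_terms(100.0, 0.05, 0.08))
```

Only line items shipped on or before the cutoff date contribute to the sums.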
    Q2:

    SELECT s_acctbal,
    s_name,
    n_name,
    p_partkey,
    p_mfgr,
    s_address,
    s_phone,
    s_comment
    FROM part,
    supplier,
    partsupp,
    nation,
    region
    WHERE p_partkey = ps_partkey
    AND s_suppkey = ps_suppkey
    AND p_size = 35
    AND p_type LIKE '%NICKEL'
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND r_name = 'MIDDLE EAST'
    Q3:

    SELECT l_orderkey,
    Sum(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate,
    o_shippriority
    FROM customer,
    orders,
    lineitem
    WHERE c_mktsegment = 'AUTOMOBILE'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < date '1995-03-31'
    AND l_shipdate > date '1995-03-31'
    GROUP BY l_orderkey,
    o_orderdate,
    o_shippriority
    ORDER BY revenue DESC,
    o_orderdate
    LIMIT 10;
    Q4:

    SELECT o_orderpriority,
    Count(*) AS order_count
    FROM orders,
    lineitem
    WHERE o_orderdate >= date '1997-10-01'
    AND o_orderdate < date '1997-10-01' + INTERVAL '3' month
    AND l_orderkey = o_orderkey
    AND l_commitdate < l_receiptdate
    GROUP BY o_orderpriority
    ORDER BY o_orderpriority
    LIMIT 1;
    Q5:

    SELECT n_name,
    Sum(l_extendedprice * (1 - l_discount)) AS revenue
    FROM customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
    WHERE c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND l_suppkey = s_suppkey
    AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND r_name = 'ASIA'
    AND o_orderdate >= date '1995-01-01'
    AND o_orderdate < date '1995-01-01' + INTERVAL '1' year
    GROUP BY n_name
    ORDER BY revenue DESC
    LIMIT 1;
    Q6:

    SELECT sum(l_extendedprice * l_discount) AS revenue
    FROM lineitem
    WHERE l_shipdate >= date '1995-01-01'
    AND l_shipdate < date '1995-01-01' + interval '1' year
    AND l_discount between 0.04 - 0.01 AND 0.04 + 0.01
    AND l_quantity < 24
    LIMIT 1;
    Q7:

    SELECT supp_nation,
    cust_nation,
    l_year,
    Sum(volume) AS revenue
    FROM (
    SELECT n1.n_name AS supp_nation,
    n2.n_name AS cust_nation,
    Extract(year FROM l_shipdate) AS l_year,
    l_extendedprice * (1 - l_discount) AS volume
    FROM supplier,
    lineitem,
    orders,
    customer,
    nation n1,
    nation n2
    WHERE s_suppkey = l_suppkey
    AND o_orderkey = l_orderkey
    AND c_custkey = o_custkey
    AND s_nationkey = n1.n_nationkey
    AND c_nationkey = n2.n_nationkey
    AND ( (
    n1.n_name = 'GERMANY'
    AND n2.n_name = 'INDIA')
    OR (
    n1.n_name = 'INDIA'
    AND n2.n_name = 'GERMANY') )
    AND l_shipdate BETWEEN date '1995-01-01' AND date '1996-12-31' ) AS shipping
    GROUP BY supp_nation,
    cust_nation,
    l_year
    ORDER BY supp_nation,
    cust_nation,
    l_year
    LIMIT 1;
    Q8:

    SELECT o_year,
    Sum(
    CASE
    WHEN nation = 'INDIA' THEN volume
    ELSE 0
    end) / Sum(volume) AS mkt_share
    FROM (
    SELECT Extract(year FROM o_orderdate) AS o_year,
    l_extendedprice * (1 - l_discount) AS volume,
    n2.n_name AS nation
    FROM part,
    supplier,
    lineitem,
    orders,
    customer,
    nation n1,
    nation n2,
    region
    WHERE p_partkey = l_partkey
    AND s_suppkey = l_suppkey
    AND l_orderkey = o_orderkey
    AND o_custkey = c_custkey
    AND c_nationkey = n1.n_nationkey
    AND n1.n_regionkey = r_regionkey
    AND r_name = 'ASIA'
    AND s_nationkey = n2.n_nationkey
    AND o_orderdate BETWEEN date '1995-01-01' AND date '1996-12-31'
    AND p_type = 'STANDARD ANODIZED STEEL' ) AS all_nations
    GROUP BY o_year
    ORDER BY o_year
    LIMIT 1;
    Q9:

    SELECT nation,
    o_year,
    Sum(amount) AS sum_profit
    FROM (
    SELECT n_name AS nation,
    Extract(year FROM o_orderdate) AS o_year,
    l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
    FROM part,
    supplier,
    lineitem,
    partsupp,
    orders,
    nation
    WHERE s_suppkey = l_suppkey
    AND ps_suppkey = l_suppkey
    AND ps_partkey = l_partkey
    AND p_partkey = l_partkey
    AND o_orderkey = l_orderkey
    AND s_nationkey = n_nationkey
    AND p_name LIKE '%aquamarine%' ) AS profit
    GROUP BY nation,
    o_year
    ORDER BY nation,
    o_year DESC
    LIMIT 1;
    Q10:

    SELECT c_custkey,
    c_name,
    Sum(l_extendedprice * (1 - l_discount)) AS revenue,
    c_acctbal,
    n_name,
    c_address,
    c_phone,
    c_comment
    FROM customer,
    orders,
    lineitem,
    nation
    WHERE c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate >= date '1994-08-01'
    AND o_orderdate < date '1994-08-01' + INTERVAL '3' month
    AND l_returnflag = 'R'
    AND c_nationkey = n_nationkey
    GROUP BY c_custkey,
    c_name,
    c_acctbal,
    c_phone,
    n_name,
    c_address,
    c_comment
    ORDER BY revenue DESC
    LIMIT 20;
    Q11:

    SELECT ps_partkey,
    Sum(ps_supplycost * ps_availqty) AS value
    FROM partsupp,
    supplier,
    nation
    WHERE ps_suppkey = s_suppkey
    AND s_nationkey = n_nationkey
    AND n_name = 'PERU'
    GROUP BY ps_partkey
    HAVING Sum(ps_supplycost * ps_availqty) >
    (
    SELECT Sum(ps_supplycost * ps_availqty) * 0.0001000000 as sum_value
    FROM partsupp,
    supplier,
    nation
    WHERE ps_suppkey = s_suppkey
    AND s_nationkey = n_nationkey
    AND n_name = 'PERU'
    )
    ORDER BY value DESC
    LIMIT 1;
    Q12:

    SELECT l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1
    else 0
    end) AS high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1
    else 0
    end) AS low_line_count
    FROM orders,
    lineitem
    WHERE o_orderkey = l_orderkey
    AND l_shipmode in ('MAIL', 'TRUCK')
    AND l_commitdate < l_receiptdate
    AND l_shipdate < l_commitdate
    AND l_receiptdate >= date '1996-01-01'
    AND l_receiptdate < date '1996-01-01' + interval '1' year
    GROUP BY l_shipmode
    ORDER BY l_shipmode
    LIMIT 1;
    Q13:

    SELECT c_count, count(*) AS custdist
    FROM (
    SELECT c_custkey, count(o_orderkey) AS c_count
    FROM customer,
    orders
    WHERE c_custkey = o_custkey
    AND o_comment NOT LIKE '%pending%accounts%'
    GROUP BY c_custkey ) AS c_orders
    GROUP BY c_count
    ORDER BY custdist DESC, c_count DESC
    LIMIT 1;
    Q14:

    SELECT 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount)
    else 0
    end) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenue
    FROM lineitem,
    part
    WHERE l_partkey = p_partkey
    AND l_shipdate >= date '1996-01-01'
    AND l_shipdate < date '1996-01-01' + interval '1' month
    LIMIT 1;
    Q15:

    WITH revenue0 AS
    (
    SELECT l_suppkey AS supplier_no, sum(l_extendedprice * (1 - l_discount)) AS total_revenue
    FROM lineitem
    WHERE l_shipdate >= date '1993-01-01'
    AND l_shipdate < date '1993-01-01' + interval '3' month
    GROUP BY l_suppkey
    )
    SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
    FROM supplier, revenue0
    WHERE s_suppkey = supplier_no
    AND total_revenue IN (
    SELECT max(total_revenue)
    FROM revenue0 )
    ORDER BY s_suppkey;
    Q16:

    SELECT p_brand, p_type, p_size, count(distinct ps_suppkey) AS supplier_cnt
    FROM partsupp,
    part
    WHERE p_partkey = ps_partkey
    AND p_brand <> 'Brand#23'
    AND p_type NOT LIKE 'PROMO BURNISHED%'
    AND p_size IN (1, 13, 10, 28, 21, 35, 31, 11)
    AND ps_suppkey NOT IN (
    SELECT s_suppkey
    FROM supplier
    WHERE s_comment LIKE '%Customer%Complaints%' )
    GROUP BY p_brand, p_type, p_size
    ORDER BY supplier_cnt DESC, p_brand, p_type, p_size
    LIMIT 1;
    Q17:

    SELECT
    sum(l_extendedprice) / 7.0 AS avg_yearly
    FROM
    lineitem,
    part
    WHERE p_partkey = l_partkey
    AND p_brand = 'Brand#44'
    AND p_container = 'WRAP PKG'
    AND l_quantity < (
    SELECT
    0.2 * avg(l_quantity)
    FROM
    lineitem, part
    WHERE
    l_partkey = p_partkey
    );
    Q18:

    SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
    FROM customer,
    orders,
    lineitem
    WHERE o_orderkey IN (
    SELECT l_orderkey
    FROM lineitem
    GROUP BY l_orderkey
    HAVING sum(l_quantity) > 315 )
    AND c_custkey = o_custkey
    AND o_orderkey = l_orderkey
    GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
    ORDER BY o_totalprice DESC, o_orderdate
    LIMIT 100;
    Q19:

    SELECT sum(l_extendedprice* (1 - l_discount)) AS revenue
    FROM lineitem,
    part
    WHERE ( p_partkey = l_partkey and p_brand = 'Brand#12'
    and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
    and l_quantity >= 6 and l_quantity <= 6 + 10
    and p_size between 1 and 5
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON' )
    or ( p_partkey = l_partkey and p_brand = 'Brand#13'
    and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
    and l_quantity >= 10 and l_quantity <= 10 + 10
    and p_size between 1 and 10
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON' )
    or ( p_partkey = l_partkey and p_brand = 'Brand#24'
    and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
    and l_quantity >= 21 and l_quantity <= 21 + 10
    and p_size between 1 and 15
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON' )
    LIMIT 1;
    Q20:

    with temp_table as
    (
    select 0.5 * sum(l_quantity) as col1
    from lineitem,
    partsupp
    where l_partkey = ps_partkey and l_suppkey = ps_suppkey
    and l_shipdate >= date '1993-01-01'
    and l_shipdate < date '1993-01-01' + interval '1' year
    )
    select s_name, s_address
    from supplier,
    nation
    where s_suppkey in (
    select ps_suppkey
    from partsupp,
    temp_table
    where ps_partkey in (
    select p_partkey
    from part
    where p_name like 'dark%' )
    and ps_availqty > temp_table.col1 )
    and s_nationkey = n_nationkey and n_name = 'JORDAN'
    order by s_name
    limit 1;
    Q21:

    select
    s_name,
    count(*) as numwait
    from
    supplier,
    lineitem l1,
    orders,
    nation
    where
    s_suppkey = l1.l_suppkey
    and o_orderkey = l1.l_orderkey
    and o_orderstatus = 'F'
    and l1.l_receiptdate > l1.l_commitdate
    and exists (
    select
    *
    from
    lineitem l2
    where
    l2.l_orderkey = l1.l_orderkey
    and l2.l_suppkey <> l1.l_suppkey
    )
    and not exists (
    select
    *
    from
    lineitem l3
    where
    l3.l_orderkey = l1.l_orderkey
    and l3.l_suppkey <> l1.l_suppkey
    and l3.l_receiptdate > l3.l_commitdate
    )
    and s_nationkey = n_nationkey
    and n_name = 'SAUDI ARABIA'
    group by
    s_name
    order by
    numwait desc,
    s_name
    limit 100;
    Q22:

    with temp_table_1 as
    (
    select avg(c_acctbal) as avg_value
    from customer
    where c_acctbal > 0.00 and substring(c_phone from 1 for 2)
    in ('33', '29', '37', '35', '25', '27', '43')
    ),
    temp_table_2 as
    (
    select count(*) as count1
    from orders, customer
    where o_custkey = c_custkey
    )
    select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal
    from (
    select substring(c_phone from 1 for 2) as cntrycode, c_acctbal
    from customer, temp_table_1, temp_table_2
    where substring(c_phone from 1 for 2) in ('33', '29', '37', '35', '25', '27', '43')
    and c_acctbal > temp_table_1.avg_value
    and temp_table_2.count1 = 0) as custsale
    group by cntrycode
    order by cntrycode
    limit 1;

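    7. Run queries asynchronously
      Data Lake Analytics supports both a "Synchronous Execution" mode and an "Asynchronous Execution" mode. In synchronous mode, the console waits for the query result to return. In asynchronous mode, the console immediately returns the ID of the query task.

    Click "Execution Status" to see the state of the asynchronous query task. The main states are "RUNNING", "SUCCESS", and "FAILURE".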

    Click "Refresh". When STATUS changes to "SUCCESS", the query has succeeded, and you can also see the elapsed time ("ELAPSE_TIME") and the number of bytes scanned ("SCANNED_DATA_BYTES").
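
The refresh-until-done workflow is an ordinary polling loop. The Python sketch below shows the pattern only: `fetch_status` is a hypothetical callable standing in for however you obtain the task status, and nothing here calls a real Data Lake Analytics API. The three status strings are the ones shown in the console.

```python
import time
from typing import Callable

# States after which the task will not change any more.
TERMINAL_STATES = {"SUCCESS", "FAILURE"}


def wait_for_query(fetch_status: Callable[[], str],
                   interval_seconds: float = 2.0,
                   max_polls: int = 30) -> str:
    """Poll until the task leaves RUNNING, or give up after max_polls attempts."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval_seconds)
    raise TimeoutError("query task still RUNNING after the polling limit")


if __name__ == "__main__":
    # Simulated status sequence; a real caller would look up the task by its ID.
    fake_statuses = iter(["RUNNING", "RUNNING", "SUCCESS"])
    print(wait_for_query(lambda: next(fake_statuses), interval_seconds=0.0))
```

Bounding the number of polls keeps a stuck task from blocking the caller indefinitely.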


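    8. View query history
      Click "Execution History" to see detailed history for the queries you have run, including:
      1) the query statement;
      2) the query's elapsed time and the time it was executed;
      3) the number of rows returned;
      4) the query status;
      5) the number of bytes scanned;
      6) the target OSS file that the result set was written back to (Data Lake Analytics saves query result sets in the user's bucket).

    Query result files are uploaded automatically to an OSS bucket in the user's region. They include the result data file and a metadata file describing the result set.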

    {QueryLocation}/{query_name|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv
    {QueryLocation}/{query_name|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv.metadata
    where QueryLocation is:

    aliyun-oa-query-results-<your_account_id>-<oss_region>
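
As a rough illustration of how the path template expands, the Python sketch below builds the directory prefix for one query. It rests on assumptions that the text does not confirm: that month and day are zero-padded to two digits, and that "Unsaved" is used when the query has no saved name. The function names and sample values are hypothetical, and the file name itself (shown as xxx above) is left out.

```python
from datetime import date


def query_location(account_id: str, oss_region: str) -> str:
    """Bucket name pattern: aliyun-oa-query-results-<your_account_id>-<oss_region>."""
    return f"aliyun-oa-query-results-{account_id}-{oss_region}"


def result_prefix(account_id, oss_region, run_date, query_id, query_name=None):
    """Directory prefix under which the .csv and .csv.metadata files are written.

    Assumption: {mm} and {dd} are zero-padded; unnamed queries use "Unsaved".
    """
    name = query_name or "Unsaved"
    return (
        f"{query_location(account_id, oss_region)}/{name}/"
        f"{run_date:%Y}/{run_date:%m}/{run_date:%d}/{query_id}/"
    )


if __name__ == "__main__":
    # All argument values here are made-up placeholders.
    print(result_prefix("123", "oss-cn-hangzhou", date(2018, 8, 5), "q1"))
```

Listing the objects under this prefix in the OSS console should then show the result file and its metadata file side by side.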

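    9. Next steps
      This tutorial has walked step by step through using the Data Lake Analytics cloud product to analyze CSV data files stored in OSS. Besides CSV, Data Lake Analytics can also analyze files in Parquet, ORC, JSON, RCFile, AVRO, and other formats. Parquet and ORC in particular offer substantial performance and cost advantages over CSV: for the same dataset they take less storage space and can be queried faster, which also means lower analysis cost.
      More tutorials and articles will follow, showing step by step how to use Data Lake Analytics for data analysis and exploration on a data lake, so that you can start analyzing data in the cloud at low cost, as soon as it is stored.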
