1. 程式人生 > >Impala查詢功能測試

Impala查詢功能測試

來源:https://yq.aliyun.com/articles/25641

關於 Impala 使用方法的一些測試,包括載入資料、檢視資料庫、聚合關聯查詢、子查詢等等。

1. 準備測試資料

以下測試以 impala 使用者來執行:

$ su - impala
-bash-4.1$ whoami
impala
$ hdfs dfs -ls /user
Found 5 items
drwxr-xr-x   - hdfs   hadoop          0 2014-09-22 18:36 /user/hdfs
drwxrwxrwt   - mapred hadoop          0 2014-07-23 21:37 /user/history
drwxr-xr-x   - hive   hadoop          0 2014-08-04 16:57 /user/hive
drwxr-xr-x   - impala hadoop          0 2014-10-24 10:13 /user/impala
drwxr-xr-x   - root   hadoop          0 2014-09-22 10:22 /user/root

準備一些測試資料,tab1.csv 檔案內容如下:

1,true,123.123,2012-10-24 08:55:00 
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33

tab1.csv 檔案內容如下:

1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
50,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325

將這兩個表上傳到 hdfs:

$ hdfs dfs -mkdir -p sample_data/tab1 sample_data/tab2

$ hdfs dfs -put tab1.csv /user/impala/sample_data/tab1
$ hdfs dfs -ls /user/impala/sample_data/tab1
Found 1 items
-rw-r--r--   3 impala hadoop        193 2014-10-24 10:13 /user/impala/sample_data/tab1/tab1.csv

$ hdfs dfs -put tab2.csv /user/impala/sample_data/tab2
$ 
hdfs dfs -ls /user/impala/sample_data/tab2 Found 1 items -rw-r--r-- 3 impala hadoop 158 2014-10-24 10:13 /user/impala/sample_data/tab2/tab2.csv

在 impala 中建表,建表語句如下:

DROP TABLE IF EXISTS tab1;
CREATE EXTERNAL TABLE tab1 (
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE,
   col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/impala/sample_data/tab1';

DROP TABLE IF EXISTS tab2;
CREATE EXTERNAL TABLE tab2 (
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/impala/sample_data/tab2';

DROP TABLE IF EXISTS tab3;
CREATE TABLE tab3 (
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE,
   month INT,
   day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

其中 tab1 和 tab2 都是外部表,tab3 是內部表。

將上面 sql 儲存在 init.sql 語句,然後執行下面命令進行建立表:

$ impala-shell -i localhost -f init.sql

也可以進入到 impala-shell 命令列模式,直接執行 sql 語句。

2. 查看錶結構

檢視所有資料庫:

[192.168.56.121:21000] > show databases;
Query: show databases
+------------------+
| name             |
+------------------+
| _impala_builtins |
| default          |          
| testdb           |
+------------------+
Returned 3 row(s) in 0.05s

檢視預設資料庫下的所有表:

[192.168.56.121:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| tab1 |
| tab2 |
| tab3 |
+------+
Returned 3 row(s) in 0.01s

檢視 tab1 表結構:

[192.168.56.121:21000] > describe tab1;
Query: describe tab1
+-------+-----------+---------+
| name  | type      | comment |
+-------+-----------+---------+
| id    | int       |         |
| col_1 | boolean   |         |
| col_2 | double    |         |
| col_3 | timestamp |         |
+-------+-----------+---------+
Returned 4 row(s) in 0.07s

3. impala-shell 命令

使用 impala-shell 進入命令列互動模式:

$ impala-shell -i localhost

傳入一個檔案:

$ impala-shell -i localhost -f init.sql

執行指定的 sql:

$ impala-shell -i localhost -q 'select count(*) from tab1;'

4. 匯入資料並查詢

匯入資料:

  • 準備資料
  • 建立表
  • 加資料匯入到建立的表

查詢資料:

[192.168.56.121:21000] > SELECT * FROM tab1;
Query: select * FROM tab1
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
Returned 5 row(s) in 0.24s
[192.168.56.121:21000] > SELECT * FROM tab2;
Query: select * FROM tab2
+----+-------+---------------+
| id | col_1 | col_2         |
+----+-------+---------------+
| 1  | true  | 12789.123     |
| 2  | false | 1243.5        |
| 3  | false | 24453.325     |
| 4  | false | 2423.3254     |
| 50 | true  | 243.325       |
| 60 | false | 243565423.325 |
| 70 | true  | 243.325       |
| 80 | false | 243423.325    |
| 90 | true  | 243.325       |
+----+-------+---------------+
Returned 9 row(s) in 0.44s
[192.168.56.121:21000] > SELECT * FROM tab2 LIMIT 5;
Query: select * FROM tab2 LIMIT 5
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 2  | false | 1243.5    |
| 3  | false | 24453.325 |
| 4  | false | 2423.3254 |
| 50 | true  | 243.325   |
+----+-------+-----------+
Returned 5 row(s) in 0.44s

帶 OFFSET 語句查詢

帶 OFFSET 語句查詢,需要和 order by 一起使用,起始編號從 0 開始往後偏移,offset 為 0 時,其結果和去掉 offset 的 limit 結果一致。

測試如下:

[192.168.56.121:21000] > SELECT * FROM tab2 order by id LIMIT 3 offset 0;
Query: select * FROM tab2 order by id LIMIT 3 offset 0
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 2  | false | 1243.5    |
| 3  | false | 24453.325 |
+----+-------+-----------+
Returned 3 row(s) in 0.45s
[192.168.56.121:21000] > SELECT * FROM tab2 order by id LIMIT 3 offset 2;
Query: select * FROM tab2 order by id LIMIT 3 offset 2
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 3  | false | 24453.325 |
| 4  | false | 2423.3254 |
| 50 | true  | 243.325   |
+----+-------+-----------+
Returned 3 row(s) in 0.45s

5. join 連線查詢

5.1 左外連線:

[192.168.56.121:21000] > SELECT tab1.id,tab1.col_1,tab2.col_2 FROM tab1 LEFT OUTER JOIN tab2 USING (id);
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 2  | false | 1243.5    |
| 3  | false | 24453.325 |
| 4  | false | 2423.3254 |
| 5  | true  | NULL      |
+----+-------+-----------+
Returned 5 row(s) in 1.12s

以上 SQL 語句等同於下面語句,用法同樣適用於多個欄位:

SELECT tab1.id,tab1.col_1,tab2.col_2 FROM tab1 LEFT OUTER JOIN tab2 where tab1.id=tab2.id;

由上可以看到左邊表 tab1 的記錄都查詢出來了,右邊表 tab2 只查詢出跟 tab1 關聯的記錄。

5.2 內連線

[192.168.56.121:21000] > SELECT tab1.id,tab1.col_1,tab2.col_2 FROM tab1 INNER JOIN tab2 USING (id);
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 2  | false | 1243.5    |
| 3  | false | 24453.325 |
| 4  | false | 2423.3254 |
+----+-------+-----------+
Returned 4 row(s) in 0.53s

以上語句可以修改為:

-- 下面語句都是內連線
SELECT tab1.id,tab1.col_1,tab2.col_2 FROM tab1 JOIN tab2 USING (id);
SELECT tab1.id,tab1.col_1,tab2.col_2 FROM tab1 , tab2 where tab1.id=tab2.id ;

查詢結果為:

[192.168.56.121:21000] > SELECT tab1.id,tab1.col_1,tab2.col_2 FROM tab1 , tab2 where tab1.id=tab2.id ;
Query: select tab1.id,tab1.col_1,tab2.col_2 FROM tab1 , tab2 where tab1.id=tab2.id
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 2  | false | 1243.5    |
| 3  | false | 24453.325 |
| 4  | false | 2423.3254 |
+----+-------+-----------+
Returned 4 row(s) in 0.38s

如果去掉 where 語句,會提示錯誤:

[192.168.56.121:21000] > select tab1.id,tab1.col_1,tab2.col_2 FROM tab1 , tab2;
Query: select tab1.id,tab1.col_1,tab2.col_2 FROM tab1 , tab2
ERROR: NotImplementedException: Join with 'default.tab2' requires at least one conjunctive equality predicate. To perform a Cartesian product between two tables, use a CROSS JOIN.

5.3 自連線

impala 允許自連線,例如:

-- Combine fields from both parent and child rows.
SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;

5.4 交叉連線

為了避免產生大量的結果集,impala 不允許下面形式的笛卡爾連線:

SELECT ... FROM t1 JOIN t2;
SELECT ... FROM t1, t2;

如果,你的確想使用笛卡爾連線,建議使用 cross join:

[192.168.56.121:21000] > select tab1.id,tab1.col_1,tab2.col_2 FROM tab1 CROSS JOIN tab2 where tab1.id<3;
Query: select tab1.id,tab1.col_1,tab2.col_2 FROM tab1 CROSS JOIN tab2 where tab1.id<3
+----+-------+---------------+
| id | col_1 | col_2         |
+----+-------+---------------+
| 1  | true  | 12789.123     |
| 1  | true  | 1243.5        |
| 1  | true  | 24453.325     |
| 1  | true  | 2423.3254     |
| 1  | true  | 243.325       |
| 1  | true  | 243565423.325 |
| 1  | true  | 243.325       |
| 1  | true  | 243423.325    |
| 1  | true  | 243.325       |
| 2  | false | 12789.123     |
| 2  | false | 1243.5        |
| 2  | false | 24453.325     |
| 2  | false | 2423.3254     |
| 2  | false | 243.325       |
| 2  | false | 243565423.325 |
| 2  | false | 243.325       |
| 2  | false | 243423.325    |
| 2  | false | 243.325       |
+----+-------+---------------+
Returned 18 row(s) in 0.41s

5.5 等值連線和非等值連線

預設地,impala的兩表連線需要一個等值的比較,或者使用 ON、USING、WHERE 語句。在Impala 1.2.2 之後,非等值連線也支援。同樣需要避免因為產生大量的結果集而造成記憶體溢位。一旦你想使用非等值連線,建議使用 cross 連線並增加額外的 where 語句。

[192.168.56.121:21000] > select tab1.id,tab1.col_1,tab1.col_2,tab2.col_2 FROM tab1 CROSS JOIN tab2 where tab1.col_2 >tab2.col_2 ;
+----+-------+------------+-----------+
| id | col_1 | col_2      | col_2     |
+----+-------+------------+-----------+
| 2  | false | 1243.5     | 243.325   |
| 2  | false | 1243.5     | 243.325   |
| 2  | false | 1243.5     | 243.325   |
| 3  | false | 24453.325  | 12789.123 |
| 3  | false | 24453.325  | 1243.5    |
| 3  | false | 24453.325  | 2423.3254 |
| 3  | false | 24453.325  | 243.325   |
| 3  | false | 24453.325  | 243.325   |
| 3  | false | 24453.325  | 243.325   |
| 4  | false | 243423.325 | 12789.123 |
| 4  | false | 243423.325 | 1243.5    |
| 4  | false | 243423.325 | 24453.325 |
| 4  | false | 243423.325 | 2423.3254 |
| 4  | false | 243423.325 | 243.325   |
| 4  | false | 243423.325 | 243.325   |
| 4  | false | 243423.325 | 243.325   |
+----+-------+------------+-----------+
Returned 16 row(s) in 0.41s

查詢出來的結果會有一些重複的記錄,這個時候可以通過 distinct 去重。

5.6 半連線

左半連線是為了實現 in 語句,左邊的記錄會查詢出來,而不管右邊表有多少匹配的記錄。Impala 2.0版本之後,支援右半連線。

[192.168.56.121:21000] > SELECT tab1.id,tab1.col_1,tab2.col_2 FROM tab1 LEFT SEMI JOIN tab2 USING (id);
Query: select tab1.id,tab1.col_1,tab2.col_2 FROM tab1 LEFT SEMI JOIN tab2 USING (id)
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 2  | false | 1243.5    |
| 3  | false | 24453.325 |
| 4  | false | 2423.3254 |
+----+-------+-----------+
Returned 4 row(s) in 0.41s

5.7 自然連線(不支援)

Impala 不支援 NATURAL JOIN 操作,以避免產生不一致或者大量的結果。自然連線不適應 ON 和 USING 語句,而是自動的關聯所有列相同值的記錄。這種連線是不建議的,特別是當表結構發生變化的時候,如新增或者刪除列的時候,會產生不一樣的結果集。

-- 'NATURAL' is interpreted as an alias for 't1' and Impala attempts an inner join,
-- resulting in an error because inner joins require explicit comparisons between columns.
SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2;
ERROR: NotImplementedException: Join with 't2' requires at least one conjunctive equality predicate.
  To perform a Cartesian product between two tables, use a CROSS JOIN.

-- If you expect the tables to have identically named columns with matching values,
-- list the corresponding column names in a USING clause.
SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, address);

5.8 反連線(Impala 2.0 / CDH 5.2 以上版本)

Impala 2.0 / CDH 5.2 以上版本中支援反連線,包括左反連線和右反連線。左反連線的意思是返回左邊表不在右邊表中的記錄。

找出 tab2 的 id 不在 tab1 中的記錄:

[192.168.56.121:21000] > SELECT tab2.id FROM tab2 LEFT ANTI JOIN tab1 USING (id);
+----+
| id |
+----+
| 50 |
| 60 |
| 70 |
| 80 |
| 90 |
+----+
Returned 5 row(s) in 0.41s

6. 聚合查詢

聚合關聯查詢:

[192.168.56.121:21000] > select tab1.col_1, MAX(tab2.col_2), MIN(tab2.col_2) FROM tab2 JOIN tab1 USING (id) GROUP BY col_1 ORDER BY 1 LIMIT 5 ;
+-------+-----------------+-----------------+
| col_1 | max(tab2.col_2) | min(tab2.col_2) |
+-------+-----------------+-----------------+
| false | 24453.325       | 1243.5          |
| true  | 12789.123       | 12789.123       |
+-------+-----------------+-----------------+

聚合關聯子查詢:

[192.168.56.121:21000] > select tab2.* FROM tab2, (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2 FROM tab2, tab1 WHERE tab1.id = tab2.id GROUP BY col_1) subquery1 WHERE subquery1.max_col2 = tab2.col_2 ;
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 3  | false | 24453.325 |
+----+-------+-----------+
Returned 2 row(s) in 0.54s

Impala 2版本中,支援where 條件子查詢,包括 IN 、EXISTS 和比較符的子查詢:

select tab2.* from tab2 where tab2.id IN (select max(id) from tab1)
select tab2.* from tab2 where tab2.id EXISTS (select max(id) from tab1)
select tab2.* from tab2 where tab2.id