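Greenplum 5.9.0: Basic Usage
1. Environment Setup
The GP cluster environment is shown in the figure below. The master is made fault tolerant with a standby master. Two hosts act as segment nodes, each running two segments; no mirror segments are configured.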
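2. Usage Examples
2.1 Log in to the database
Log in to the Greenplum database. The database that exists by default is postgres; the example below connects to testDB with the -d option.
$ psql -d testDB
psql (8.2.15)
Type "help" for help.
testDB=#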
2.2 Create a database
testDB=# create database mydb;
CREATE DATABASE
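2.3 Create tables
A CREATE TABLE statement in Greenplum differs little from the one in an ordinary database. The main differences are:
- A table in Greenplum should be given a distribution key (DISTRIBUTED BY).
- If a table needs to be partitioned on a column, PARTITION BY turns it into a partitioned table.
- LIKE creates a table with the same structure as an existing table.
Greenplum has two data distribution policies:
- Hash distribution. Each row is routed to a specific segment by the hash of its distribution key. The syntax is DISTRIBUTED BY (...). If no distribution key is specified, the first column is used by default (or the primary key, if the table has one).
- Random distribution, also called even distribution. Rows are spread evenly without regard to their values, so joins and similar operations perform worse. The syntax is DISTRIBUTED RANDOMLY.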
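To make the two policies concrete, here is a minimal Python sketch of how rows might be routed to segments. It is only an illustration: `zlib.crc32` stands in for Greenplum's internal hash function, the segment count of 4 is taken from the cluster described above, random distribution is modelled as round-robin, and all function names are hypothetical.

```python
import itertools
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # assumption: 2 hosts x 2 segments, as in this cluster


def hash_segment(key, num_segments=NUM_SEGMENTS):
    """Pick a segment from the distribution key (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute_by_hash(rows, key_index=0):
    """DISTRIBUTED BY (col): equal keys always land on the same segment."""
    segments = defaultdict(list)
    for row in rows:
        segments[hash_segment(row[key_index])].append(row)
    return dict(segments)


def distribute_randomly(rows, num_segments=NUM_SEGMENTS):
    """DISTRIBUTED RANDOMLY, modelled here as round-robin placement."""
    segments = defaultdict(list)
    for seg, row in zip(itertools.cycle(range(num_segments)), rows):
        segments[seg].append(row)
    return dict(segments)


# Sample rows (id, name); the duplicate id 1 shows co-location under hashing.
rows = [(1, "tom"), (2, "jack"), (3, "Bob"), (1, "tom2")]
by_hash = distribute_by_hash(rows)
by_random = distribute_randomly(rows)
print(by_hash)
print(by_random)
```

With hash distribution, the segment holding a given key value can be computed from the key itself. With round-robin placement it cannot, which is why a join on a randomly distributed table usually needs an extra redistribution step.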
The two statements below have the same result: both use id as the distribution key.
testDB=# create table test1(id int,name varchar(128));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
testDB=# create table test2(id int,name varchar(128)) distributed by (id);
CREATE TABLE
Describe the table:
testDB=# \d test1
            Table "public.test1"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 id     | integer                |
 name   | character varying(128) |
Distributed by: (id)
The following statement creates a randomly distributed table:
testDB=# create table test3(id int,name varchar(128)) distributed randomly;
CREATE TABLE
Create a table with LIKE:
testDB=# create table test3_like(like test3);
NOTICE:  Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE TABLE
2.4 List databases
testDB=# \l
                  List of databases
   Name    |  Owner  | Encoding |  Access privileges
-----------+---------+----------+---------------------
 mydb      | gpadmin | UTF8     |
 postgres  | gpadmin | UTF8     |
 template0 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 template1 | gpadmin | UTF8     | =c/gpadmin
                                : gpadmin=CTc/gpadmin
 testDB    | gpadmin | UTF8     |
(5 rows)
2.5 List tables
testDB=# \dt
                List of relations
 Schema |    Name    | Type  |  Owner  | Storage
--------+------------+-------+---------+---------
 public | test01     | table | gpadmin | heap
 public | test1      | table | gpadmin | heap
 public | test2      | table | gpadmin | heap
 public | test3      | table | gpadmin | heap
 public | test3_like | table | gpadmin | heap
(5 rows)
2.6 Insert data
testDB=# insert into test1 values(1,'tom'),(2,'jack'),(3,'Bob');
INSERT 0 3
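2.7 Query data
Greenplum data is spread across all segments. When a table is queried, the master returns rows in the order in which it receives them, and the order in which each segment's rows reach the master is not deterministic. The row order of a select without ORDER BY is therefore not guaranteed.
testDB=# select * from test1;
 id | name
----+------
  3 | Bob
  1 | tom
  2 | jack
(3 rows)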
2.8 create table as and select into
create table as and select into do the same thing: both create a new table from the result of a select, which is handy for ad hoc analysis. If no distribution key is specified, Greenplum chooses one from the result set of the select, picking columns on which the data does not have to be redistributed again.
testDB=# create table test4 as select * from test1;
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 3
The distribution key can also be specified explicitly:
testDB=# create table test5 as select * from test1 distributed by(name);
SELECT 3
Check the table structure:
testDB=# \d test5
            Table "public.test5"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 id     | integer                |
 name   | character varying(128) |
Distributed by: (name)
select into is simpler than create table as, but it cannot specify a distribution key and always uses the default one.
testDB=# select * into test6 from test1;
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 3
2.9 Explain
Explain shows the execution plan of a query. It is used constantly when tuning SQL.
testDB=# explain select * from orders join customer on orders.cid=customer.id;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Gather Motion 4:1  (slice2; segments: 4)  (cost=3.07..6.21 rows=1 width=21)
   ->  Hash Join  (cost=3.07..6.21 rows=1 width=21)
         Hash Cond: orders.cid = customer.id
         ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.00..3.09 rows=1 width=13)
               Hash Key: orders.cid
               ->  Seq Scan on orders  (cost=0.00..3.03 rows=1 width=13)
         ->  Hash  (cost=3.03..3.03 rows=1 width=8)
               ->  Seq Scan on customer  (cost=0.00..3.03 rows=1 width=8)
(8 rows)
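The plan reads bottom-up: orders is re-hashed on cid (Redistribute Motion) so that each order lands on the segment holding the matching customer row, each segment then joins its local rows (Hash Join), and the master collects the results (Gather Motion). Below is a rough Python sketch of that data flow. It assumes customer is already distributed by id, which the plan suggests because customer is scanned without a motion. The sample rows, the crc32 stand-in hash, and the function names are made up for illustration and are not Greenplum internals.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # matches "segments: 4" in the plan above


def segment_for(key):
    # Stand-in hash; Greenplum uses its own internal hash function.
    return zlib.crc32(str(key).encode("utf-8")) % NUM_SEGMENTS


def redistribute(rows, key_index):
    """Redistribute Motion: re-hash every row on the given key column."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments


def local_hash_join(orders, customers):
    """Hash Join on one segment: build on customer.id, probe with orders.cid."""
    by_id = defaultdict(list)
    for cust in customers:
        by_id[cust[0]].append(cust)
    return [order + cust for order in orders for cust in by_id.get(order[1], [])]


def distributed_join(orders, customers):
    orders_by_seg = redistribute(orders, key_index=1)        # Hash Key: orders.cid
    customers_by_seg = redistribute(customers, key_index=0)  # models existing placement by id
    gathered = []  # Gather Motion: the master concatenates per-segment results
    for seg in range(NUM_SEGMENTS):
        gathered.extend(local_hash_join(orders_by_seg[seg], customers_by_seg[seg]))
    return gathered


# Hypothetical sample rows: orders(id, cid, amount), customer(id, name)
orders = [(100, 1, 9.5), (101, 2, 20.0), (102, 1, 3.0), (103, 9, 1.0)]
customers = [(1, "tom"), (2, "jack"), (3, "Bob")]
result = distributed_join(orders, customers)
print(sorted(result))
```

Because both inputs are hashed on the join key with the same function, matching rows always meet on the same segment, so no segment needs to see another segment's data during the join itself.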
2.10 Analytic functions: window functions
testDB=# select depname,empno,salary,rank() over (partition by depname order by salary desc),row_number() over (partition by depname order by salary desc) from empsalary;
  depname  | empno | salary | rank | row_number
-----------+-------+--------+------+------------
 develop   |     9 |  10000 |    1 |          1
 develop   |     7 |   9000 |    2 |          2
 develop   |    10 |   9000 |    2 |          3
 develop   |     5 |   7000 |    4 |          4
 personnel |     6 |   5600 |    1 |          1
 personnel |     3 |   3900 |    2 |          2
 sales     |     8 |   8000 |    1 |          1
 sales     |     4 |   5500 |    2 |          2
 sales     |     1 |   5000 |    3 |          3
 sales     |     2 |   4500 |    4 |          4
(10 rows)
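The difference between rank() and row_number() shows up on ties: employees 7 and 10 both earn 9000, so they share rank 2 and the next rank is 4, while row_number() keeps counting. A small Python sketch of that logic, using the rows from the output above. The function name is hypothetical, and ties are broken by input order here, whereas SQL leaves the order of tied rows unspecified.

```python
from itertools import groupby
from operator import itemgetter

# (depname, empno, salary), taken from the query output above
empsalary = [
    ("develop", 9, 10000), ("develop", 7, 9000), ("develop", 10, 9000),
    ("develop", 5, 7000), ("personnel", 6, 5600), ("personnel", 3, 3900),
    ("sales", 8, 8000), ("sales", 4, 5500), ("sales", 1, 5000), ("sales", 2, 4500),
]


def rank_and_row_number(rows):
    """Emulate rank() and row_number() OVER (PARTITION BY depname ORDER BY salary DESC)."""
    out = []
    # sorted() is stable, so rows with equal salary keep their input order.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    for _, partition in groupby(ordered, key=itemgetter(0)):
        rank = 0
        prev_salary = None
        for row_number, (dep, empno, salary) in enumerate(partition, start=1):
            if salary != prev_salary:
                rank = row_number  # new salary value: rank jumps to the row position
                prev_salary = salary
            out.append((dep, empno, salary, rank, row_number))
    return out


for line in rank_and_row_number(empsalary):
    print(line)
```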
2.11 Partitioned tables
Partition by date, with one partition per day from 2018-07-25 through 2018-07-30:
CREATE TABLE test_partition_range
(
    id NUMERIC,
    name CHARACTER VARYING(32),
    dw_end_date DATE
)
DISTRIBUTED BY (id)
PARTITION BY RANGE (dw_end_date)
(
    PARTITION p2018725 START ('2018-7-25'::DATE) END ('2018-7-26'::DATE),
    PARTITION p2018726 START ('2018-7-26'::DATE) END ('2018-7-27'::DATE),
    PARTITION p2018727 START ('2018-7-27'::DATE) END ('2018-7-28'::DATE),
    PARTITION p2018728 START ('2018-7-28'::DATE) END ('2018-7-29'::DATE),
    PARTITION p2018729 START ('2018-7-29'::DATE) END ('2018-7-30'::DATE),
    PARTITION p2018730 START ('2018-7-30'::DATE) END ('2018-7-31'::DATE)
);
Check the table:
testDB=# \d+ test_partition_range;
                   Table "public.test_partition_range"
   Column    |         Type          | Modifiers | Storage  | Description
-------------+-----------------------+-----------+----------+-------------
 id          | numeric               |           | main     |
 name        | character varying(32) |           | extended |
 dw_end_date | date                  |           | plain    |
Child tables: test_partition_range_1_prt_p2018725,
              test_partition_range_1_prt_p2018726,
              test_partition_range_1_prt_p2018727,
              test_partition_range_1_prt_p2018728,
              test_partition_range_1_prt_p2018729,
              test_partition_range_1_prt_p2018730
Has OIDs: no
Distributed by: (id)
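Use EVERY to generate the daily partitions from a single clause. Because END is exclusive, START '2018-7-25' END '2018-7-30' yields five partitions (July 25 through July 29); to cover July 30 as well, END would have to be '2018-7-31'.
CREATE TABLE test_partition_every
(
    id NUMERIC,
    name CHARACTER VARYING(32),
    dw_end_date DATE
)
DISTRIBUTED BY (id)
PARTITION BY RANGE (dw_end_date)
(
    PARTITION p201807 START ('2018-7-25'::DATE) END ('2018-7-30'::DATE) EVERY ('1 days'::INTERVAL)
);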
Check the table:
testDB=# \d+ test_partition_every;
                   Table "public.test_partition_every"
   Column    |         Type          | Modifiers | Storage  | Description
-------------+-----------------------+-----------+----------+-------------
 id          | numeric               |           | main     |
 name        | character varying(32) |           | extended |
 dw_end_date | date                  |           | plain    |
Child tables: test_partition_every_1_prt_p201807_1,
              test_partition_every_1_prt_p201807_2,
              test_partition_every_1_prt_p201807_3,
              test_partition_every_1_prt_p201807_4,
              test_partition_every_1_prt_p201807_5
Has OIDs: no
Distributed by: (id)
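The five child tables follow from how START, END and EVERY are interpreted: START is inclusive, END is exclusive, and EVERY steps through the range. A short Python sketch of that expansion. The function name is hypothetical, and the `_1_prt_p201807_N` suffix simply mirrors the child-table names shown above.

```python
from datetime import date, timedelta


def expand_every(start, end, step=timedelta(days=1)):
    """Split [start, end) into consecutive [lo, hi) ranges, like START ... END ... EVERY."""
    ranges = []
    lo = start
    while lo < end:
        # Assumption of this sketch: a final partial step is cut off at `end`.
        hi = min(lo + step, end)
        ranges.append((lo, hi))
        lo = hi
    return ranges


parts = expand_every(date(2018, 7, 25), date(2018, 7, 30))
for n, (lo, hi) in enumerate(parts, start=1):
    print(f"test_partition_every_1_prt_p201807_{n}: [{lo}, {hi})")
```

Calling `expand_every(date(2018, 7, 25), date(2018, 7, 31))` returns six ranges, the same daily coverage as the table whose partitions are listed one by one.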
Create a list-partitioned table:
CREATE TABLE test_partition_list
(
    id NUMERIC,
    city CHARACTER VARYING(32)
)
DISTRIBUTED BY (id)
PARTITION BY LIST (city)
(
    PARTITION shanghai VALUES ('shanghai'),
    PARTITION beijing VALUES ('beijing'),
    PARTITION guangzhou VALUES ('guangzhou'),
    DEFAULT PARTITION other_city
);
Check the table:
testDB=# \d+ test_partition_list;
                 Table "public.test_partition_list"
 Column |         Type          | Modifiers | Storage  | Description
--------+-----------------------+-----------+----------+-------------
 id     | numeric               |           | main     |
 city   | character varying(32) |           | extended |
Child tables: test_partition_list_1_prt_beijing,
              test_partition_list_1_prt_guangzhou,
              test_partition_list_1_prt_other_city,
              test_partition_list_1_prt_shanghai
Has OIDs: no
Distributed by: (id)
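Routing a row in a list-partitioned table amounts to a lookup on the partition column, with unmatched values falling through to the default partition. A minimal Python sketch of that idea, with the mapping copied from the table definition above and a hypothetical function name:

```python
# Partition name -> values it accepts, copied from the CREATE TABLE statement above
LIST_PARTITIONS = {
    "shanghai": {"shanghai"},
    "beijing": {"beijing"},
    "guangzhou": {"guangzhou"},
}
DEFAULT_PARTITION = "other_city"


def child_table_for(city, parent="test_partition_list"):
    """Return the child table a row with this city value would be stored in."""
    for name, values in LIST_PARTITIONS.items():
        if city in values:
            return f"{parent}_1_prt_{name}"
    # No partition lists this value, so the row goes to the default partition.
    return f"{parent}_1_prt_{DEFAULT_PARTITION}"


print(child_table_for("beijing"))
print(child_table_for("shenzhen"))
```

The comparison is case-sensitive, so a value such as 'Shanghai' would not match the shanghai partition and would end up in other_city.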
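2.12 External tables
Greenplum has a clear advantage in data loading: it supports parallel loading. gpfdist is the tool for parallel loading, and inside the database it is exposed through external tables. The gpfdist architecture is shown in the figure below.
The steps to start gpfdist and create an external table are as follows:
- First start the gpfdist service on the file server, specifying the file directory and the port
$ gpfdist -d /home/gpadmin -p 8888 -l /home/gpadmin/log &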
Check that it started successfully:
$ cat gpfdist.log
nohup: ignoring input
Serving HTTP on port 8888, directory /home/gpadmin
Check the port:
$ lsof -i:8888
COMMAND   PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
gpfdist 51700 gpadmin    6u  IPv6 124806      0t0  TCP *:ddi-tcp-1 (LISTEN)
- Place the data to be loaded in /home/gpadmin or one of its subdirectories, then create the external table.
CREATE EXTERNAL TABLE test_ext
(
    id INTEGER,
    name VARCHAR(128)
)
LOCATION ('gpfdist://192.168.10.136:8888/test_ext.txt')
FORMAT 'text' (DELIMITER AS '|' NULL AS '' ESCAPE 'off')
ENCODING 'GB18030'
LOG ERRORS INTO test_ext_err SEGMENT REJECT LIMIT 10 ROWS;
Check the table:
testDB=# \d test_ext
      External table "public.test_ext"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 id     | integer                |
 name   | character varying(128) |
Type: readable
Encoding: GB18030
Format type: text
Format options: delimiter '|' null '' escape 'off'
External location: gpfdist://192.168.10.136:8888/test_ext.txt
Segment reject limit: 10 rows
Error table: test_ext_err
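For the external table to return rows, test_ext.txt has to match the format options above: one row per line, fields separated by '|', an empty field for NULL, and GB18030 encoding. Below is a minimal Python sketch that writes such a file. The sample rows and the helper name are made up, and the file is written to a temporary directory rather than the /home/gpadmin directory served by gpfdist.

```python
import os
import tempfile


def write_ext_file(path, rows, delimiter="|", encoding="GB18030"):
    """Write rows as delimited text; None becomes an empty field (NULL AS '')."""
    with open(path, "w", encoding=encoding, newline="\n") as f:
        for row in rows:
            fields = ["" if value is None else str(value) for value in row]
            f.write(delimiter.join(fields) + "\n")


# Hypothetical sample data for columns (id INTEGER, name VARCHAR(128))
sample_rows = [(1, "tom"), (2, "张三"), (3, None)]

out_dir = tempfile.mkdtemp()  # stand-in for the directory passed to gpfdist -d
out_path = os.path.join(out_dir, "test_ext.txt")
write_ext_file(out_path, sample_rows)

# Read the file back with the same encoding to check what was written.
with open(out_path, encoding="GB18030") as f:
    print(f.read())
```

Because the table is declared with ESCAPE 'off', a literal '|' inside a field cannot be escaped and would be read as an extra column; such a row would most likely be rejected and count toward the segment reject limit.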