Greenplum 5.9.0: Basic Usage

阿新 • Published: 2018-12-16

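1. Environment Preparation

The Greenplum (GP) cluster environment is shown in the figure below. The master is made fault tolerant by configuring a standby master. Two hosts serve as segment nodes, each configured with two segments; no mirror segments are configured.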

2. Usage Examples

2.1 Logging in to the database

Log in to the Greenplum database. The default database is postgres; the example below connects to testDB.

$ psql -d testDB

psql (8.2.15)

Type "help" for help.

testDB=#

2.2 Creating a database

testDB=# create database mydb;

CREATE DATABASE

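2.3 Creating tables

CREATE TABLE in Greenplum differs little from the same statement in an ordinary database. The main differences are:

When creating a table in Greenplum, you specify the table's distribution key (DISTRIBUTED BY).
If the table should be partitioned on a column, use PARTITION BY to create it as a partitioned table.
You can use LIKE to create a table with the same structure as an existing one.

Greenplum has two data distribution strategies:

Hash distribution. Rows are routed to a specific segment by the hash of the key. The syntax is DISTRIBUTED BY (..). If no distribution key is specified, the first column is used by default (or the primary key, if the table has one).
Random distribution. Also called even distribution. It performs poorly for joins and similar operations. The syntax is DISTRIBUTED RANDOMLY.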
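To make the two strategies concrete, here is a minimal Python sketch of how rows might be routed to segments. It is an illustration only: Greenplum uses its own internal hash function, and `zlib.crc32`, the helper names, and the four-segment count are assumptions chosen to mirror the cluster described earlier (two hosts with two segments each).

```python
import random
import zlib

NUM_SEGMENTS = 4  # assumption: 2 segment hosts x 2 segments, as in this cluster


def hash_segment(key, num_segments=NUM_SEGMENTS):
    """Route a row by hashing its distribution key (stand-in for DISTRIBUTED BY)."""
    # crc32 is only a placeholder; Greenplum's real hash function is different.
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def random_segment(rng, num_segments=NUM_SEGMENTS):
    """Route a row to an arbitrary segment (stand-in for DISTRIBUTED RANDOMLY)."""
    return rng.randrange(num_segments)


rows = [(1, "tom"), (2, "jack"), (3, "Bob")]
rng = random.Random(0)

for row_id, name in rows:
    print(row_id, name, "hash ->", hash_segment(row_id),
          "random ->", random_segment(rng))
```

With hash routing, equal keys always land on the same segment, which is what allows a join on the distribution key to run locally on each segment. Random routing gives no such guarantee, so joins usually need the data to be moved first.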

The following two statements produce the same result: both use id as the distribution key.

testDB=# create table test1(id int,name varchar(128));

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.

HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

CREATE TABLE

testDB=# create table test2(id int,name varchar(128)) distributed by (id);

CREATE TABLE

Describe the table:

testDB=# \d test1

Table "public.test1"

Column | Type | Modifiers

--------+------------------------+-----------

id | integer |

name | character varying(128) |

Distributed by: (id)

The following statement creates a table that uses random distribution:

testDB=# create table test3(id int,name varchar(128)) distributed randomly;

CREATE TABLE

Create a table with LIKE:

testDB=# create table test3_like(like test3);

NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table

CREATE TABLE

2.4 Listing databases

testDB=# \l

List of databases

Name | Owner | Encoding | Access privileges

-----------+---------+----------+---------------------

mydb | gpadmin | UTF8 |

postgres | gpadmin | UTF8 |

template0 | gpadmin | UTF8 | =c/gpadmin

: gpadmin=CTc/gpadmin

template1 | gpadmin | UTF8 | =c/gpadmin

: gpadmin=CTc/gpadmin

testDB | gpadmin | UTF8 |

(5 rows)

2.5 Listing tables

testDB=# \dt

List of relations

Schema | Name | Type | Owner | Storage

--------+------------+-------+---------+---------

public | test01 | table | gpadmin | heap

public | test1 | table | gpadmin | heap

public | test2 | table | gpadmin | heap

public | test3 | table | gpadmin | heap

public | test3_like | table | gpadmin | heap

(5 rows)

2.6 Inserting data

testDB=# insert into test1 values(1,'tom'),(2,'jack'),(3,'Bob');

INSERT 0 3

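2.7 Querying data

Greenplum's data is distributed across all segments. When you query a table, the master presents rows in the order in which it receives them, and the order in which each segment's rows reach the master is not deterministic. Without an ORDER BY clause, the row order of a SELECT result is therefore not guaranteed.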

testDB=# select * from test1;

id | name

----+------

3 | Bob

1 | tom

2 | jack

(3 rows)

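2.8 CREATE TABLE AS and SELECT INTO

CREATE TABLE AS and SELECT INTO do the same thing: they create a new table from the result of a SELECT, which is very convenient for ad hoc analysis. If no distribution key is specified when the table is created, Greenplum chooses one based on the result set of the SELECT, picking a column that does not require the data to be redistributed again.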

testDB=# create table test4 as select * from test1;

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'id' as the Greenplum Database data distribution key for this table.

HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

SELECT 3

The distribution key can also be specified explicitly:

testDB=# create table test5 as select * from test1 distributed by(name);

SELECT 3

View the table structure:

testDB=# \d test5

Table "public.test5"

Column | Type | Modifiers

--------+------------------------+-----------

id | integer |

name | character varying(128) |

Distributed by: (name)

SELECT INTO is simpler than CREATE TABLE AS, but it cannot specify a distribution key; it can only use the default one.

testDB=# select * into test6 from test1;

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'id' as the Greenplum Database data distribution key for this table.

HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

SELECT 3

2.9 EXPLAIN

EXPLAIN shows the execution plan of a query. It is used constantly when tuning SQL.

testDB=# explain select * from orders join customer on orders.cid=customer.id;

QUERY PLAN

-----------------------------------------------------------------------------------------------

 Gather Motion 4:1 (slice2; segments: 4) (cost=3.07..6.21 rows=1 width=21)
   -> Hash Join (cost=3.07..6.21 rows=1 width=21)
        Hash Cond: orders.cid = customer.id
        -> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.00..3.09 rows=1 width=13)
             Hash Key: orders.cid
             -> Seq Scan on orders (cost=0.00..3.03 rows=1 width=13)
        -> Hash (cost=3.03..3.03 rows=1 width=8)
             -> Seq Scan on customer (cost=0.00..3.03 rows=1 width=8)

(8 rows)
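Read bottom-up, the plan redistributes orders by cid so that matching rows meet on the same segment, joins them there against customer, and gathers the results on the master. The Python sketch below imitates those three steps. It assumes that customer is already distributed by id (which would explain why only orders is redistributed); the sample rows, the helper names, and the `zlib.crc32` hash are illustrative stand-ins, not Greenplum internals.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # matches "segments: 4" in the plan


def segment_for(key):
    # Placeholder hash; Greenplum's real hash function is different.
    return zlib.crc32(str(key).encode("utf-8")) % NUM_SEGMENTS


def redistribute(rows, key_index):
    """Redistribute Motion: regroup rows by segment using the hash of one column."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments


def local_hash_join(orders_rows, customer_rows):
    """Hash Join on one segment: build on customer.id, probe with orders.cid."""
    by_id = defaultdict(list)
    for cust in customer_rows:
        by_id[cust[0]].append(cust)
    return [order + cust
            for order in orders_rows
            for cust in by_id.get(order[1], [])]


# Hypothetical sample data: orders(order_id, cid), customer(id, name)
orders = [(100, 1), (101, 2), (102, 1), (103, 3)]
customer = [(1, "tom"), (2, "jack"), (3, "Bob")]

customer_by_seg = redistribute(customer, 0)  # where DISTRIBUTED BY (id) would put it
orders_by_seg = redistribute(orders, 1)      # Redistribute Motion, Hash Key: orders.cid

# Gather Motion: the master collects each segment's join output.
result = []
for seg in range(NUM_SEGMENTS):
    result.extend(local_hash_join(orders_by_seg.get(seg, []),
                                  customer_by_seg.get(seg, [])))

print(sorted(result))
```

Because both inputs are placed by the same hash of the join key, every order finds its customer on its own segment, and no segment needs to see another segment's rows during the join.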

2.10 Analytic functions: window functions

testDB=# select depname,empno,salary,rank() over (partition by depname order by salary desc),row_number() over (partition by depname order by salary desc) from empsalary;

depname | empno | salary | rank | row_number

-----------+-------+--------+------+------------

develop | 9 | 10000 | 1 | 1

develop | 7 | 9000 | 2 | 2

develop | 10 | 9000 | 2 | 3

develop | 5 | 7000 | 4 | 4

personnel | 6 | 5600 | 1 | 1

personnel | 3 | 3900 | 2 | 2

sales | 8 | 8000 | 1 | 1

sales | 4 | 5500 | 2 | 2

sales | 1 | 5000 | 3 | 3

sales | 2 | 4500 | 4 | 4

(10 rows)
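The develop rows show the difference between the two functions: the two employees earning 9000 share rank 2 and the next rank is 4, while row_number keeps counting 1, 2, 3, 4. The Python sketch below reproduces that logic for one partition. It is a hand-written illustration of the semantics, not how Greenplum evaluates window functions, and the helper name is made up for the example.

```python
# (depname, empno, salary) rows for the develop department, taken from the output above
rows = [
    ("develop", 9, 10000),
    ("develop", 7, 9000),
    ("develop", 10, 9000),
    ("develop", 5, 7000),
]


def rank_and_row_number(partition):
    """Return (empno, salary, rank, row_number), ordered by salary descending."""
    # sorted() is stable, so tied salaries keep their input order.
    ordered = sorted(partition, key=lambda r: r[2], reverse=True)
    out = []
    rank = 0
    prev_salary = None
    for row_number, (_, empno, salary) in enumerate(ordered, start=1):
        if salary != prev_salary:
            rank = row_number  # after a tie, rank jumps to the current position
            prev_salary = salary
        out.append((empno, salary, rank, row_number))
    return out


result = rank_and_row_number(rows)
for line in result:
    print(line)
```

Which of two tied rows receives the lower row_number is arbitrary in SQL unless the ORDER BY includes a tie-breaking column; the sketch simply keeps input order.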

2.11 Partitioned tables

Partition by date, creating one partition per day from 2018-07-25 through 2018-07-30:

CREATE TABLE test_partition_range (

id NUMERIC,

name CHARACTER VARYING(32),

dw_end_date DATE

) DISTRIBUTED BY (id) PARTITION BY range (dw_end_date) (

PARTITION p2018725 start('2018-7-25'::DATE) END ('2018-7-26'::DATE),

PARTITION p2018726 start('2018-7-26'::DATE) END ('2018-7-27'::DATE),

PARTITION p2018727 start('2018-7-27'::DATE) END ('2018-7-28'::DATE),

PARTITION p2018728 start('2018-7-28'::DATE) END ('2018-7-29'::DATE),

PARTITION p2018729 start('2018-7-29'::DATE) END ('2018-7-30'::DATE),

PARTITION p2018730 start('2018-7-30'::DATE) END ('2018-7-31'::DATE)

);

View the table information:

testDB=# \d+ test_partition_range;

Table "public.test_partition_range"

Column | Type | Modifiers | Storage | Description

-------------+-----------------------+-----------+----------+-------------

id | numeric | | main |

name | character varying(32) | | extended |

dw_end_date | date | | plain |

Child tables: test_partition_range_1_prt_p2018725,

test_partition_range_1_prt_p2018726,

test_partition_range_1_prt_p2018727,

test_partition_range_1_prt_p2018728,

test_partition_range_1_prt_p2018729,

test_partition_range_1_prt_p2018730

Has OIDs: no

Distributed by: (id)

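Use EVERY to create the daily partitions in a single clause. Note that END is exclusive, so END ('2018-7-30') yields five partitions covering 2018-07-25 through 2018-07-29; to also cover 2018-07-30, as in the previous example, END would have to be '2018-7-31'.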

CREATE TABLE test_partition_every (

id NUMERIC,

name CHARACTER VARYING(32),

dw_end_date DATE

) DISTRIBUTED BY (id) PARTITION BY range (dw_end_date) (PARTITION p201807 start('2018-7-25'::DATE) END ('2018-7-30'::DATE) every('1 days'::interval));

View the table information:

testDB=# \d+ test_partition_every;

Table "public.test_partition_every"

Column | Type | Modifiers | Storage | Description

-------------+-----------------------+-----------+----------+-------------

id | numeric | | main |

name | character varying(32) | | extended |

dw_end_date | date | | plain |

Child tables: test_partition_every_1_prt_p201807_1,

test_partition_every_1_prt_p201807_2,

test_partition_every_1_prt_p201807_3,

test_partition_every_1_prt_p201807_4,

test_partition_every_1_prt_p201807_5

Has OIDs: no

Distributed by: (id)
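The five child tables follow from how START, END, and EVERY carve up the range: START is inclusive, END is exclusive, and EVERY is the step. A small Python sketch of that arithmetic is below. The function name is hypothetical, and clamping the last range to END is an assumption of the sketch that does not come into play here because the range divides evenly.

```python
from datetime import date, timedelta


def every_ranges(start, end, step):
    """Split [start, end) into consecutive [lo, hi) ranges of length `step`."""
    ranges = []
    lo = start
    while lo < end:
        hi = min(lo + step, end)  # assumption: a short final range is clamped to END
        ranges.append((lo, hi))
        lo = hi
    return ranges


parts = every_ranges(date(2018, 7, 25), date(2018, 7, 30), timedelta(days=1))
for n, (lo, hi) in enumerate(parts, start=1):
    print(f"test_partition_every_1_prt_p201807_{n}: [{lo}, {hi})")
```

The last range printed is [2018-07-29, 2018-07-30), so a row dated 2018-07-30 has no partition to go to in this table.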

Create a list-partitioned table:

CREATE TABLE test_partition_list (

id NUMERIC,

city CHARACTER VARYING(32)

) DISTRIBUTED BY (id) PARTITION BY list (city) (

PARTITION shanghai VALUES ('shanghai'),

PARTITION beijing VALUES ('beijing'),

PARTITION guangzhou VALUES ('guangzhou'),

DEFAULT PARTITION other_city

) ;

View the table information:

testDB=# \d+ test_partition_list;

Table "public.test_partition_list"

Column | Type | Modifiers | Storage | Description

--------+-----------------------+-----------+----------+-------------

id | numeric | | main |

city | character varying(32) | | extended |

Child tables: test_partition_list_1_prt_beijing,

test_partition_list_1_prt_guangzhou,

test_partition_list_1_prt_other_city,

test_partition_list_1_prt_shanghai

Has OIDs: no

Distributed by: (id)
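Conceptually, list partitioning is a lookup from the partition column's value to a child table, with the DEFAULT PARTITION catching everything else. The Python sketch below spells that out using the child table names from the output above; the `route` helper and the value 'shenzhen' are made up for illustration.

```python
# Child table names as listed by \d+ above
PARTITIONS = {
    "shanghai": "test_partition_list_1_prt_shanghai",
    "beijing": "test_partition_list_1_prt_beijing",
    "guangzhou": "test_partition_list_1_prt_guangzhou",
}
DEFAULT_PARTITION = "test_partition_list_1_prt_other_city"


def route(city):
    """Return the child table that a row with this city value would be stored in."""
    return PARTITIONS.get(city, DEFAULT_PARTITION)


for city in ["beijing", "shenzhen"]:
    print(city, "->", route(city))
```

Without the default partition, a row whose city matches none of the listed values would have nowhere to go and the insert would fail.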

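2.12 External tables

Greenplum has a clear advantage in data loading: it supports parallel loading. gpfdist is the parallel loading tool, and its counterpart inside the database is the external table. The figure below shows the gpfdist architecture.

The steps for starting gpfdist and creating an external table are as follows.

First, start the gpfdist service on the file server, specifying the directory to serve and the port:

$ gpfdist -d /home/gpadmin -p 8888 -l /home/gpadmin/log &

Check that it started successfully:

$ cat gpfdist.log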

nohup: ignoring input

Serving HTTP on port 8888, directory /home/gpadmin

Check the port:

$ lsof -i:8888

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

gpfdist 51700 gpadmin 6u IPv6 124806 0t0 TCP *:ddi-tcp-1 (LISTEN)

Place the data to be loaded in /home/gpadmin or one of its subdirectories, then create the external table:

CREATE EXTERNAL TABLE test_ext (

id INTEGER,

name VARCHAR(128)

) location ('gpfdist://192.168.10.136:8888/test_ext.txt') format 'text' (delimiter AS '|' NULL AS '' ESCAPE 'off') encoding 'GB18030' Log errors

INTO test_ext_err segment reject limit 10 rows;

View the table information:

testDB=# \d test_ext

External table "public.test_ext"

Column | Type | Modifiers

--------+------------------------+-----------

id | integer |

name | character varying(128) |

Type: readable

Encoding: GB18030

Format type: text

Format options: delimiter '|' null '' escape 'off'

External location: gpfdist://192.168.10.136:8888/test_ext.txt

Segment reject limit: 10 rows

Error table: test_ext_err
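The format options and the reject limit together decide how each line of test_ext.txt is treated: fields are split on '|', an empty field is read as NULL, and malformed rows are logged rather than failing the load until too many have accumulated. The Python sketch below imitates that behaviour for a handful of made-up lines. It is only a rough model: Greenplum counts rejected rows per segment, whereas this sketch keeps a single counter, and treating the limit as reached at exactly `reject_limit` errors is an assumption. The exception and function names are hypothetical.

```python
class RejectLimitReached(Exception):
    """Raised when too many malformed rows are seen (hypothetical name)."""


def load_rows(lines, delimiter="|", reject_limit=10):
    """Parse `id|name` lines, collecting bad rows until the reject limit is hit."""
    good, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        try:
            if len(fields) != 2:
                raise ValueError("expected 2 columns, got %d" % len(fields))
            # NULL AS '': an empty field is read as NULL (None here)
            row_id = int(fields[0]) if fields[0] else None  # id INTEGER
            name = fields[1] or None                        # name VARCHAR(128)
        except ValueError as exc:
            errors.append((lineno, line, str(exc)))  # stand-in for test_ext_err
            if len(errors) >= reject_limit:
                raise RejectLimitReached("%d rows rejected" % len(errors))
            continue
        good.append((row_id, name))
    return good, errors


# Made-up sample content for test_ext.txt
sample = ["1|tom", "2|jack", "x|broken", "3|"]
good, errors = load_rows(sample)
print(good)
print(errors)
```

Here the third line is rejected because 'x' is not an integer, and the load still succeeds because one error is well under the limit of 10.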
