
[Repost] Loading Data into HAWQ

Loading data into the database is required before you can start using it, but how? There are several approaches that all meet this basic requirement, each tackling the problem in a different way, so you can pick the loading method that best matches your use case.

Table Setup
This table will be used for the tests in HAWQ. I created it in a single-node VM running Hortonworks HDP with HAWQ 2.0 installed, and I'm using the default Resource Manager too.

CREATE TABLE test_data
(id int,
 fname text,
 lname text)
 DISTRIBUTED RANDOMLY;

Singleton
Let's start with what is probably the worst way. Sometimes it is the right choice because you have very little data to load, but in most cases you should avoid singleton inserts. This approach inserts just a single tuple in a single transaction.

head si_test_data.sql
insert into test_data (id, fname, lname) values (1, 'jon_00001', 'roberts_00001');
insert into test_data (id, fname, lname) values (2, 'jon_00002', 'roberts_00002');
insert into test_data (id, fname, lname) values (3, 'jon_00003', 'roberts_00003');
insert into test_data (id, fname, lname) values (4, 'jon_00004', 'roberts_00004');
insert into test_data (id, fname, lname) values (5, 'jon_00005', 'roberts_00005');
insert into test_data (id, fname, lname) values (6, 'jon_00006', 'roberts_00006');
insert into test_data (id, fname, lname) values (7, 'jon_00007', 'roberts_00007');
insert into test_data (id, fname, lname) values (8, 'jon_00008', 'roberts_00008');
insert into test_data (id, fname, lname) values (9, 'jon_00009', 'roberts_00009');
insert into test_data (id, fname, lname) values (10, 'jon_00010', 'roberts_00010');

This repeats for 10,000 tuples.
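The full script isn't reproduced here, but a quick way to generate an equivalent file (a hypothetical helper, not part of the original post) is a small shell loop:

for i in $(seq 1 10000); do
  printf "insert into test_data (id, fname, lname) values (%d, 'jon_%05d', 'roberts_%05d');\n" "$i" "$i" "$i"
done > si_test_data.sql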

time psql -f si_test_data.sql > /dev/null
real	5m49.527s

As you can see, this is pretty slow and not recommended for inserting large amounts of data. Nearly 6 minutes to load 10,000 tuples is crawling.

COPY
If you are familiar with PostgreSQL then you will feel right at home with this technique. This time, the data is in a file named test_data.txt and it is not wrapped with an insert statement.

head test_data.txt
1|jon_00001|roberts_00001
2|jon_00002|roberts_00002
3|jon_00003|roberts_00003
4|jon_00004|roberts_00004
5|jon_00005|roberts_00005
6|jon_00006|roberts_00006
7|jon_00007|roberts_00007
8|jon_00008|roberts_00008
9|jon_00009|roberts_00009
10|jon_00010|roberts_00010
COPY test_data FROM '/home/gpadmin/test_data.txt' WITH DELIMITER '|';
COPY 10000
Time: 128.580 ms

This method is significantly faster, but it loads the data through the master. That means it doesn't scale well, since the master becomes the bottleneck, but it does let you load data from any host on your network as long as that host can reach the master.
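If the file lives on a client machine rather than on the master, psql's \copy command reads the file on the client and streams it through the master the same way. A sketch, reusing this post's hostname and assuming a gpadmin database (neither is spelled out in the original):

psql -h hdb -d gpadmin -c "\copy test_data from 'test_data.txt' with delimiter '|'"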

gpfdist
gpfdist is a web server that serves POSIX files for the segments to fetch. Segment processes get the data directly from gpfdist and bypass the master when doing so. This enables you to scale by adding more gpfdist processes and/or more segments.

gpfdist -p 8888 &
[1] 128836
[gpadmin@hdb ~]$ Serving HTTP on port 8888, directory /home/gpadmin

Now you’ll need to create a new external table to read the data from gpfdist.

CREATE EXTERNAL TABLE gpfdist_test_data
(id int,
 fname text,
 lname text)
LOCATION ('gpfdist://hdb:8888/test_data.txt')
FORMAT 'TEXT' (DELIMITER '|');

And to load the data.

INSERT INTO test_data SELECT * FROM gpfdist_test_data;
INSERT 0 10000
Time: 98.362 ms

gpfdist is blazing fast and scales easily. You can add more than one gpfdist location to the external table, use wildcards, use different formats, and much more, as shown in the sketch below. The downside is that the file must be on a host that all segments can reach, and you also have to run a separate gpfdist process on that host.
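As a rough sketch of that flexibility (the second host, etl1, is hypothetical), an external table can point at several gpfdist processes and use wildcards:

CREATE EXTERNAL TABLE gpfdist_multi_test_data
(id int,
 fname text,
 lname text)
LOCATION ('gpfdist://hdb:8888/test_data*.txt',
          'gpfdist://etl1:8888/test_data*.txt')
FORMAT 'TEXT' (DELIMITER '|');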

gpload
gpload is a utility that automates the loading process by wrapping gpfdist. Review the documentation for more on this utility. Technically it is the same as gpfdist plus external tables; it just automates the commands for you.
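As a hedged sketch (the connection settings and file path below simply reuse this post's environment; consult the gpload documentation for the full control-file syntax), a minimal YAML control file might look like this:

VERSION: 1.0.0.1
DATABASE: gpadmin
USER: gpadmin
HOST: hdb
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - hdb
         PORT: 8888
         FILE:
           - /home/gpadmin/test_data.txt
    - FORMAT: text
    - DELIMITER: '|'
   OUTPUT:
    - TABLE: test_data
    - MODE: insert

gpload -f test_data.yml

gpload starts its own gpfdist process for the listed file, creates a temporary external table, and runs the INSERT for you.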

Programmable Extension Framework (PXF)
PXF allows you to read and write data in HDFS using external tables. As with gpfdist, the work is done by each segment, so it scales and executes in parallel.

For this example, I’ve loaded the test data into HDFS.
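One way to stage the file (assuming the same test_data.txt from the COPY example) is with the standard HDFS client:

hdfs dfs -mkdir -p /test_data
hdfs dfs -put /home/gpadmin/test_data.txt /test_data/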

hdfs dfs -cat /test_data/* | head
1|jon_00001|roberts_00001
2|jon_00002|roberts_00002
3|jon_00003|roberts_00003
4|jon_00004|roberts_00004
5|jon_00005|roberts_00005
6|jon_00006|roberts_00006
7|jon_00007|roberts_00007
8|jon_00008|roberts_00008
9|jon_00009|roberts_00009
10|jon_00010|roberts_00010

The external table definition.

CREATE EXTERNAL TABLE et_test_data
(id int,
 fname text,
 lname text)
LOCATION ('pxf://hdb:51200/test_data?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER '|');

And now to load it.

INSERT INTO test_data SELECT * FROM et_test_data;
INSERT 0 10000
Time: 227.599 ms

PXF is probably the best way to load data when using the “Data Lake” design. You load your raw data into HDFS and then consume it with a variety of tools in the Hadoop ecosystem. PXF can also read and write other formats.
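As a sketch of the write direction (the table name and target HDFS directory below are made up, and the profile mirrors the readable example above), a writable external table pushes rows back out to HDFS:

CREATE WRITABLE EXTERNAL TABLE wr_test_data
(id int,
 fname text,
 lname text)
LOCATION ('pxf://hdb:51200/test_data_out?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO wr_test_data SELECT * FROM test_data;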

Outsourcer and gplink
Last but not least are software programs I created. Outsourcer automates table creation and data loading directly into Greenplum or HAWQ using gpfdist. It sources data from SQL Server and Oracle, as these are the two most common OLTP databases.

gplink is another tool that reads external data, but it can connect to any valid JDBC source. It doesn't automate as many of the steps as Outsourcer does, but it is a convenient way to get data from a JDBC source.

You might be thinking that Sqoop does this, but not exactly. gplink and Outsourcer load data into HAWQ and Greenplum tables; they are optimized for these databases and fix the data for you automatically. Both remove null and newline characters and escape the escape and delimiter characters. With Sqoop, you have to read the data from HDFS using PXF and then fix whatever errors may be in the files.

Both tools are linked above.

Summary
This post gives a brief description of the various ways to load data into HAWQ. Pick the technique that fits your use case. As you can see, HAWQ is very flexible and can handle a variety of ways to load data.
