Simple Example Use Cases (translated from the Hive GettingStarted guide)
1. MovieLens User Ratings
First, create a table with tab-delimited text file format:
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Then, download the data files from MovieLens 100k:
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
or:
curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip
Note: If the link to the GroupLens datasets does not work, please report it on HIVE-5341 or send a message to [email protected].
Unzip the data files:
unzip ml-100k.zip
And load u.data into the table that was just created:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Count the number of rows in table u_data:
SELECT COUNT(*) FROM u_data;
Note that for older versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Now we can do some complex data analysis on the table u_data:
Create weekday_mapper.py:

import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
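As a sanity check before registering the script with Hive, the per-line transform can be exercised locally in plain Python. This is a minimal sketch: the sample row merely follows u.data's tab-delimited format, and since fromtimestamp uses the local timezone, the exact weekday is not hard-coded.

```python
import datetime

def map_line(line):
    """Replicate weekday_mapper.py's transform for a single input line."""
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

# A sample row in u.data's format: userid, movieid, rating, unixtime
print(map_line("196\t242\t3\t881250949"))
```

The first three fields pass through unchanged; only unixtime is replaced by an ISO weekday number (1 = Monday .. 7 = Sunday).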
Use the mapper script:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
add FILE weekday_mapper.py;
INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

Explanation: here the Python script cleans the data in u_data. TRANSFORM (userid, movieid, rating, unixtime) names the input columns, USING 'python weekday_mapper.py' pipes them through the script, and AS (userid, movieid, rating, weekday) names the output columns.
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
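Outside Hive, the effect of that GROUP BY can be sketched in plain Python with collections.Counter; the rows below are made-up stand-ins for u_data_new, not real MovieLens data.

```python
import collections

# Hypothetical (userid, movieid, rating, weekday) rows standing in for u_data_new
rows = [
    ("196", "242", "3", 5),
    ("186", "302", "3", 2),
    ("22",  "377", "1", 5),
]

# Equivalent of: SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday
counts = collections.Counter(weekday for _, _, _, weekday in rows)
for weekday, n in sorted(counts.items()):
    print(weekday, n)
# prints:
# 2 1
# 5 2
```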
Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).
2. Apache Weblog Data
The format of Apache weblog is customizable, while most webmasters use the default.
For default Apache weblog, we can create a table with the following command.
More about RegexSerDe can be found in HIVE-662 and HIVE-1719.
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
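To check that the pattern matches a default combined-format log line, it can be tried in Python. Two assumptions in this sketch: the Java string escapes in the SerDe property (\\[) collapse to single backslashes in a Python raw string, and since RegexSerDe matches the pattern against the entire line, fullmatch is used here to reproduce that. The log line itself is fabricated for illustration.

```python
import re

# The SerDe's regex with Java string escaping removed
pattern = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?'
)

# A fabricated log line in the default combined format
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" '
        '"Mozilla/4.08 [en] (Win98; I ;Nav)"')

m = pattern.fullmatch(line)
print(m.group(1))  # host
print(m.group(4))  # time
print(m.group(6))  # status
```

Each capture group corresponds, in order, to one column of the apachelog table: host, identity, user, time, request, status, size, referer, agent.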