Basic Hive Usage (Processing Data)
阿新 • Published: 2018-12-11
Start the Hive cluster that was set up in the previous post:
sh hive-start.sh
Download the data file into any directory, as long as you remember where it is:
wget https://raw.githubusercontent.com/ffzs/dataset/master/Questionnaire.csv
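Initialize the metastore schema (only needed the first time), start HiveServer2 in the background, then connect with beeline:
schematool -dbType mysql -initSchema
nohup hiveserver2 1>/home/hadoop/hiveserver.log 2>/home/hadoop/hiveserver.err &
beeline -u jdbc:hive2://hadoop1:10000 -n root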
Create a database:
create database my;
use my;
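About the data: there are nine columns, from left to right: gender, country, age, job, preferred language for data science work, education level, major, time spent working in data science, and parents' education level.
Create the table: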
create table qn(gender string,
country string,
age int,
job string,
language string,
education string,
major string,
tenure string,
parentseducation string)
row format delimited fields terminated by ','
stored as textfile ;
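Load the local data. A local file needs the local keyword, otherwise Hive looks for the path on HDFS (adjust the path to wherever the file was downloaded):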
load data local inpath '/home/data/Questionnaire.csv' overwrite into table qn;
Take a look at the data:
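The original output for this step is not reproduced here. A minimal sketch for inspecting the freshly loaded table, assuming the qn table defined above, might look like this:

```sql
-- Show the column names and types of the table
describe qn;

-- Peek at the first few rows
select * from qn limit 5;

-- Count how many rows were loaded
select count(*) from qn;
```

If the CSV begins with a header row (not verified here), that row will show up as ordinary data. In that case Hive's `skip.header.line.count` table property can be set to 1 to skip it.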
Distribution of respondents by country:
create table country as
select qn.country as country, count(qn.country) as count
from qn where qn.country!='Other' and qn.country!= ''
group by qn.country
order by count desc;
See which countries make up the top ten:
select * from country limit 10;
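A table has no guaranteed row order, so the query above relies on how the rows were written by the create table ... as step. That usually works here, but a more defensive sketch sorts explicitly at read time:

```sql
-- Sort at query time instead of relying on the stored order of the table
select country, count
from country
order by count desc
limit 10;
```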
The United States and India clearly have the most respondents.
Median age of respondents by country:
create table age as
select qn.country as country,percentile(qn.age, 0.5) as median
from qn where qn.country!='Other' and qn.country!=''
group by qn.country
order by median desc;
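Hive's percentile() computes an exact percentile and only accepts integer columns, which is why it works on the int column age. As a sketch of the alternative, percentile_approx() accepts doubles and returns an approximation. The column aliases below are illustrative, not from the original:

```sql
-- percentile() is exact and expects an integer column;
-- percentile_approx() accepts doubles and returns an approximation
select country,
       percentile(age, 0.5) as median_exact,
       percentile_approx(cast(age as double), 0.5) as median_approx
from qn
where country != 'Other' and country != ''
group by country;
```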
Top ten:
select * from age limit 10;
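It looks like data scientists in New Zealand and some European countries are somewhat older.
Median age of respondents in countries with more than 400 respondents: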
select c.country as country ,c.count as count, a.median as age
from country as c, age as a
where c.country= a.country and c.count > 400
order by count desc;
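This shows that data scientists in developing countries such as China and India skew younger.
Median age of respondents in the ten countries with the most respondents: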
select c.country as country ,c.count as count, a.median as age
from country c left join age a on c.country= a.country
order by count desc
limit 10;
Distribution of respondents by job:
create table job as
select qn.job as job, count(qn.job) as count
from qn where qn.job!='Other' and qn.job!=''
group by qn.job
order by count desc;
Top ten:
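The query for this step is not shown in the source. Following the pattern used for the country table, it would presumably be something like:

```sql
-- Assumed query: first ten rows of the job table built above
select * from job limit 10;
```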
Data scientists are the largest group.
Distribution of programming languages:
create table language as
select qn.language as language, count(qn.language) as count
from qn where qn.language!='Other' and qn.language!=''
group by qn.language
order by count desc;
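The source does not show the query used to read this table back. A sketch following the same pattern as the earlier steps:

```sql
-- Assumed query: first ten rows of the language table built above
select * from language limit 10;
```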
Python and R rank at the top thanks to their ease of use.
Countries where the median age of Python users is above 30, by number of respondents:
select qn.country as country, count(qn.country) as count
from qn
where language='Python'
group by qn.country
having percentile(qn.age, 0.5) > 30
order by count desc
limit 10;
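In the query above, where filters individual rows before grouping, while having filters whole groups using an aggregate. As a sketch, the same query can also return the median, so the having condition can be checked by eye (median_age is an illustrative alias):

```sql
-- Same filter as above, with the per-country median added to the output
select country,
       count(country) as count,
       percentile(age, 0.5) as median_age
from qn
where language = 'Python'       -- row-level filter, applied before grouping
group by country
having percentile(age, 0.5) > 30  -- group-level filter on the aggregate
order by count desc
limit 10;
```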
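The most common education level among respondents in each country
First get the number of respondents for each education level in each country: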
create table result1 as
select country, education ,count(gender) as count, GROUPING__ID
from qn
group by country,education
grouping sets (( country, education))
order by count desc;
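Because only one grouping set is listed, GROUPING__ID takes the same value on every row, and the aggregation is equivalent to a plain group by. A simplified sketch of the same counts, without the extra column:

```sql
-- With a single grouping set, this yields the same counts as the query above
select country, education, count(gender) as count
from qn
group by country, education
order by count desc;
```

grouping sets becomes useful when several sets are listed, for example to add per-country subtotals next to the per-education counts.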
Partition by country and rank the education levels within each country by count, so that the largest group is numbered 1:
create table result2 as
select country, education ,count,
row_number() over (partition by country order by count desc) as num
from result1
where country!='Other' and country!=''
order by num;
Select the rows where num = 1:
create table result3 as
select country, education, count
from result2
where num=1
order by count desc;
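The three intermediate tables make each step easy to inspect. As a sketch, the same logic can also be written as one statement with nested subqueries. The aliases cnt, per_level, and ranked are illustrative and not part of the original:

```sql
-- One-statement version of the result1 -> result2 -> result3 pipeline
select country, education, cnt
from (
  -- rank education levels within each country, largest count first
  select country, education, cnt,
         row_number() over (partition by country order by cnt desc) as num
  from (
    -- respondents per (country, education) pair
    select country, education, count(gender) as cnt
    from qn
    where country != 'Other' and country != ''
    group by country, education
  ) per_level
) ranked
where num = 1
order by cnt desc;
```

Note that row_number() picks one row arbitrarily when two education levels tie within a country. rank() would keep both.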
Result:
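The output is not reproduced in the source. A sketch of how the result table could be inspected, following the earlier pattern:

```sql
-- Assumed query: first ten rows of result3, which was written sorted by count
select * from result3 limit 10;
```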
A quick look shows that the most common level is always either a bachelor's or a master's degree.
Save the result to HDFS:
insert overwrite directory "/questionnaire/result3/" select * from result3;
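By default the exported files use Hive's Ctrl-A (\001) character as the field delimiter, which is awkward to read with other tools. A sketch that writes comma-separated text instead, assuming the Hive version on this cluster supports row format for a non-local directory:

```sql
-- Write comma-separated text instead of the default Ctrl-A (\001) delimiter
insert overwrite directory '/questionnaire/result3/'
row format delimited fields terminated by ','
select * from result3;
```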