Base64: Principles and Implementation

阿新 • • Published: 2020-12-22

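Hive

Introduction

Hive is a data warehouse tool built on Hadoop. It maps structured data onto tables and provides a set of tools for extracting, transforming, and loading data (ETL). Hive defines a simple SQL-like query language called HiveQL (HQL). At run time Hive translates HQL into MapReduce jobs, so it is essentially an offline big data analysis tool. Hadoop usually incurs considerable overhead when submitting and scheduling jobs, which means high latency, so Hive cannot deliver low-latency queries over large data sets. Hive is best suited to batch jobs on large data sets, such as web log analysis.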

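Characteristics of a data warehouse:

It is assembled from multiple heterogeneous data sources;
It mostly holds historical data that is read and analyzed, so transaction support is weak;
A database exists to capture data, whereas a data warehouse exists to analyze it;
The data is time-variant, that is, it carries a time element;

A database is an OLTP (online transaction processing) system, generally used for day-to-day business operations; a data warehouse is an OLAP (online analytical processing) system, generally used for analyzing data.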

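Data warehouse versus database:

Data warehouse	Database
Stores historical data for offline analysis	Serves real-time data to online applications
Write once, read many; no row-level updates	Full CRUD support
Transactions are not a priority	Full transaction support
Introduces redundancy to speed up queries	Avoids redundancy where possible
Diverse data sources	A single data source
OLAP	OLTP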

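Installation: requires a JDK and a Hadoop environment; unpack the archive and it is ready to use;

Running: start HDFS and MapReduce before running Hive, then launch it with: ./bin/hive;

Note: by default Hive keeps the metadata it generates in a Derby database. Derby is a file-based database, and its files are created in whichever directory Hive was started from, so starting Hive from a different directory means the existing metadata is no longer found. Derby also accepts only one client connection at a time. Both limitations are inconvenient, so MySQL is normally used to store Hive's metadata instead.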

Basic statements

Commonly used Hive commands:

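show databases: list the databases;
create database test01: create a database (which is really just a directory);
use test01: switch to the test01 database;
create table stu(id int, name string) row format delimited fields terminated by ' ': create the table stu with the fields id and name, using a space as the field delimiter (the table is also just a directory);
desc stu: show the table structure;
insert into stu values(1, "張三"): insert a row (rarely used);
select * from stu (where id=1): query, optionally with a condition;
load data local inpath '/data/1.txt' into table stu: load a local file into the table (stored on HDFS);
create table stu2 like stu: create a table with the same structure as stu;
insert overwrite table stu2 select * from stu: overwrite stu2 with the query result;
from stu insert overwrite table stu2 select * insert overwrite table stu3 select *: overwrite both stu2 and stu3 with the query result from stu;
insert overwrite local directory '/data/1.txt' row format delimited fields terminated by ' ' select * from stu: write the query result to the local file system (drop the local keyword to write to HDFS instead);
alter table stu2 rename to stu3: rename stu2 to stu3;
drop table stu3: drop the table stu3;
HQL joins: inner join (inner join, or simply join), left join (left join), right join (right join), and full join (full join);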
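
To make the effect of row format delimited fields terminated by ' ' more concrete, here is a minimal Python sketch of how a delimited text line could be mapped onto the stu columns at read time. It only illustrates the idea and is not Hive's actual reader; the names STU_SCHEMA and parse_line and the sample lines are made up for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical schema for the stu table: (column name, converter).
STU_SCHEMA: List[Tuple[str, Callable[[str], object]]] = [("id", int), ("name", str)]


def parse_line(line: str, schema=STU_SCHEMA, delimiter: str = " ") -> Dict[str, object]:
    """Split one text line on the field delimiter and cast each field.

    A field that is missing or cannot be cast becomes None, in the same
    spirit as Hive returning NULL for a malformed field.
    """
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for i, (name, cast) in enumerate(schema):
        try:
            row[name] = cast(fields[i])
        except (IndexError, ValueError):
            row[name] = None
    return row


# Sample lines as they might appear in /data/1.txt (made-up data).
for text in ["1 Zhang", "2 Li", "x Wang"]:
    print(parse_line(text))
```

The file itself is never rewritten: the schema is applied each time the data is read, which is why loading a file is little more than moving it into the table directory.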

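Table concepts

Managed (internal) tables

The tables created by the basic statements above are managed tables, the default kind: Hive creates the table first, and data is then inserted into it. This kind of table is not used often;

External tables

An external table is for data that already sits in HDFS in a fixed format: you create the external table in Hive to process that data. An external table manages every file under the specified directory, provided the files share one format. Dropping an external table does not delete the underlying files. Creation command: create external table stu1(id int, name string, score int) row format delimited fields terminated by ' ' location '/hdfs/dir'.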

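Partitioned tables

Partitioning splits the data so that each partition has its own directory, which avoids full-table scans and speeds up queries. Both managed and external tables can be partitioned. When loading data into a partitioned table, append **partition(type='xx')** to the HQL statement to name the target partition. In HDFS, each partition corresponds to one directory whose name is the partition column and its value in the form column=value (type=xx). The related commands are as follows.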

Create a managed partitioned table: create table stu(id int, name string) partitioned by (type string) row format delimited fields terminated by ' ';
Create an external partitioned table: create external table stu(id int, name string) partitioned by (type string) row format delimited fields terminated by ' ' location '/hdfs/dir';
Add a partition: alter table stu add partition(type='xx1') location '/hdfs/type=xx1'; or msck repair table stu;
List partitions: show partitions stu;
Drop a partition: alter table stu drop partition(type='xx1');
Rename a partition: alter table stu partition(type='xx1') rename to partition(type='xx2');
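
The one-directory-per-partition layout is what makes partition pruning cheap. The Python sketch below illustrates that idea under simple assumptions: partition_path and prune are hypothetical helper names, and the directory names are examples. It does not touch HDFS or Hive.

```python
from typing import Dict, List


def partition_path(table_dir: str, spec: Dict[str, str]) -> str:
    """Build the directory of one partition: <table_dir>/<column>=<value>."""
    parts = [f"{col}={val}" for col, val in spec.items()]
    return "/".join([table_dir.rstrip("/")] + parts)


def prune(partition_dirs: List[str], col: str, value: str) -> List[str]:
    """Keep only the directories that belong to the partition col=value."""
    wanted = f"{col}={value}"
    return [d for d in partition_dirs if wanted in d.split("/")]


# Three partitions of a table stored under /hdfs/dir (example values).
dirs = [partition_path("/hdfs/dir", {"type": t}) for t in ("xx1", "xx2", "xx3")]

# A query with "where type='xx2'" only needs to read one directory.
print(prune(dirs, "type", "xx2"))
```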

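Bucketed tables

Bucketed table concepts:

Bucketing distributes data at a finer granularity than partitioning;
A table can be both partitioned and bucketed;
The main use of bucketing is data sampling;
Rows are assigned to buckets by hashing the bucketing column;
Bucketing is off by default and must be switched on manually;
It is often said that a bucketed table can only be loaded from another table (in testing, this turned out not to be true);
In a join, matching buckets can be joined with each other, which greatly reduces the amount of data involved in the join and improves efficiency.
Physically, each bucket is one file in the table directory, and the number of buckets (output files) a job produces equals its number of reduce tasks.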

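Using bucketed tables:

Enable bucketing: set hive.enforce.bucketing=true;
Create a bucketed table: create table student(name string, age int) clustered by (name) into 3 buckets row format delimited fields terminated by ' ';
Sample the data: select * from student tablesample(bucket 1 out of 6 on name); (1 means sampling starts from the first bucket; 6 sets both the stride and the fraction, that is, (number of buckets)/6 buckets' worth of data is sampled, taking one bucket out of every 6; this value must be a factor or a multiple of the bucket count);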
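
The relationship between hashing, bucket assignment, and tablesample can be sketched in a few lines of Python. This is a simplified model, not Hive's implementation: stable_hash is a stand-in hash function chosen for determinism, and the rule "keep rows where hash % y == x - 1" is the usual description of bucket sampling, assumed here for illustration. The student rows are made-up data.

```python
def stable_hash(value: str) -> int:
    """Deterministic 31-based string hash (a stand-in, not Hive's exact function)."""
    h = 0
    for byte in value.encode("utf-8"):
        h = (31 * h + byte) & 0x7FFFFFFF  # keep the value non-negative
    return h


def bucket_of(value: str, num_buckets: int) -> int:
    """Return the 0-based bucket index for a value of the clustering column."""
    return stable_hash(value) % num_buckets


def tablesample(rows, key, x: int, y: int):
    """Model of tablesample(bucket x out of y on key): keep rows with hash % y == x - 1."""
    return [r for r in rows if stable_hash(r[key]) % y == x - 1]


rows = [{"name": f"student{i}", "age": 18 + i % 5} for i in range(30)]

# With 3 buckets and y = 6, the sample is a subset of bucket 0:
# hash % 6 == 0 implies hash % 3 == 0, so roughly half of one bucket is read.
sample = tablesample(rows, "name", x=1, y=6)
print(len(sample), "of", len(rows), "rows sampled")
```

This also shows why y is expected to be a factor or a multiple of the bucket count: only then does each sample line up with whole buckets or an even slice of one bucket.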

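Transactional tables

Hive has supported transactions and row-level updates since version 0.14, but not by default: extra properties must be configured, and the table must be bucketed and stored as ORC. The steps are as follows:

Add the following configuration to hive-site.xml: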

<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
</property>
<property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
</property>
<property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
</property>
<property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
</property>

Create a transactional table: create table t1(id int, name string) clustered by (id) into 3 buckets stored as orc tblproperties('transactional'='true');

Transactional tables are rarely used in practice.

Data types

array

-- size(info) returns the length of an array column
create external table t1(info array<int>) row format delimited fields terminated by ' ' collection items terminated by ',' location '/hdfs/1.txt'

map

-- the column delimiter can only be \t
create external table t1(info map<string,int>) row format delimited fields terminated by '\t' map keys terminated by ',' location '/hdfs/1.txt'
-- an example query on a map column
select distinct(info['k']) from t1 where info['k'] is not null;

struct

Similar to a JavaBean

-- there is no column delimiter in this case
create external table t1(info struct<id:int,name:string,age:int>) row format delimited collection items terminated by ' ' location '/hdfs/1.txt'
-- an example query on a struct column
select info.age from t1 where info.name='xx'
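
How the delimiters in the three table definitions carve a text field into an array, a map, or a struct can be sketched in Python. This is an illustration only, not Hive's SerDe: the function names and sample values are hypothetical, and the map sketch assumes that entries are separated by Hive's default collection delimiter (\x02), since the map table above does not set one.

```python
def parse_array(text: str, item_sep: str = ","):
    """'1,2,3' -> [1, 2, 3], as for array<int> with items terminated by ','."""
    return [int(x) for x in text.split(item_sep)] if text else []


def parse_map(text: str, key_sep: str = ",", item_sep: str = "\x02"):
    """'k,1' -> {'k': 1}, as for map<string,int> with keys terminated by ','."""
    result = {}
    for entry in text.split(item_sep):
        if not entry:
            continue
        key, value = entry.split(key_sep, 1)
        result[key] = int(value)
    return result


# Field names and converters of struct<id:int,name:string,age:int>.
INFO_FIELDS = [("id", int), ("name", str), ("age", int)]


def parse_struct(text: str, fields=INFO_FIELDS, item_sep: str = " "):
    """'1 Tom 20' -> {'id': 1, 'name': 'Tom', 'age': 20}."""
    values = text.split(item_sep)
    return {name: cast(v) for (name, cast), v in zip(fields, values)}


info = parse_array("1,2,3")
print(len(info))                      # counterpart of size(info)
print(parse_map("k,1").get("k"))      # counterpart of info['k']
print(parse_struct("1 Tom 20")["age"])  # counterpart of info.age
```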

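Common functions

length(str): returns the length of the string;
reverse(str): reverses the string;
concat(str1, str2, ...), concat_ws(ws, str, str, ...): concatenate strings; concat_ws joins them with the separator ws;
substr(str, start, len): extracts a substring of len characters starting at position start (1-based); start may be negative to count from the end;
upper(str), lower(str): convert to upper or lower case;
trim(str), ltrim(str), rtrim(str): strip spaces from both ends, from the left, or from the right;
regexp_replace(str, pattern, str): regular-expression replace;
regexp_extract(str, pattern, int): regular-expression extraction; the int selects the capture group;
repeat(str, int): repeat the string n times;
split(str, str): split into an array; the arguments are the string and the delimiter (a regular expression);
explode(array): expand an array into multiple rows;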
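
The indexing rules of substr and the way split and explode work together are easy to get wrong, so here is a small Python model of them. The function names hive_substr, hive_split, and explode are hypothetical stand-ins written for this example; the behaviour for a start position beyond the string length is an assumption, not something taken from the text above.

```python
import re
from typing import List, Optional


def hive_substr(s: str, start: int, length: Optional[int] = None) -> str:
    """Model of substr(str, start[, len]): 1-based start, negative counts from the end."""
    if abs(start) > len(s):
        return ""  # assumption: an out-of-range start yields an empty string
    if start > 0:
        begin = start - 1
    elif start < 0:
        begin = len(s) + start
    else:
        begin = 0  # position 0 is treated like position 1
    end = len(s) if length is None else begin + max(length, 0)
    return s[begin:end]


def hive_split(s: str, pattern: str) -> List[str]:
    """Model of split(str, pattern): the delimiter is a regular expression."""
    return re.split(pattern, s)


def explode(arrays: List[List[str]]) -> List[str]:
    """Model of explode(array): one output row per array element."""
    return [item for arr in arrays for item in arr]


print(hive_substr("abcde", 2, 3))
print(hive_substr("abcde", -2))
print(explode([hive_split(line, ",") for line in ["a,b", "c"]]))
```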

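UDF

A UDF is a user-defined function. To use one:

Add the jars from Hive's lib directory to your project;
Create a class that extends UDF and implement the **evaluate()** method;
Package it as a jar and copy the jar to Hive's lib directory;
Run in Hive: add jar /data/hive/xxx.jar;
Run in Hive: create temporary function myfuc as 'com.xxx';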

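Tuning

Data skew in group by:

set hive.map.aggr=true: equivalent to adding a Combiner;
set hive.groupby.skewindata=true: this setting makes Hive generate a query plan with two MapReduce jobs. The first job appends a random string to each key, so rows with the same key may land on different reducers and are partially aggregated there; only the second job sends identical keys to the same reducer and produces the final grouping;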
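
The two-job plan can be modelled in plain Python to see why it relieves a hot key. This is a sketch of the idea under simplifying assumptions, not Hive's execution plan: two_stage_count is a hypothetical name, and picking a random reducer in stage 1 stands in for appending a random string to the key.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers: int = 4, seed: int = 0):
    """Count keys in two stages and report the load of each stage-1 reducer."""
    rng = random.Random(seed)

    # Stage 1: scatter rows to random reducers; each one pre-aggregates its share.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1

    # Stage 2: route the partial counts by key so equal keys meet in one reducer.
    finals = [Counter() for _ in range(num_reducers)]
    for partial in partials:
        for key, count in partial.items():
            finals[hash(key) % num_reducers][key] += count

    totals = Counter()
    for final in finals:
        totals.update(final)
    stage1_loads = [sum(p.values()) for p in partials]
    return totals, stage1_loads


# One hot key dominates the input (made-up data).
keys = ["hot"] * 1000 + ["a", "b", "c"]
totals, stage1_loads = two_stage_count(keys)
print(totals["hot"], stage1_loads)
```

In a single-job plan all 1000 "hot" rows would reach one reducer; here stage 1 spreads them evenly, and stage 2 only has to merge a handful of partial counts per key.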

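Data skew in join:

Joining a large table with a small table: use a map join instead of an ordinary join. The small table is cached on the map side and the join is completed there;
1. Before 0.11, request it explicitly with the hint /*+mapjoin(t1)*/ after select;
2. From 0.11 on, two parameters control it: ① hive.auto.convert.join=true (on by default); ② hive.mapjoin.smalltable.filesize=25000000 (the default, about 25 MB).
3. From 0.11 on, the hint from item 1 can still be used after setting two parameters: ① hive.auto.convert.join=false; ② hive.ignore.mapjoin.hint=false;
Joining two large tables: handle the keys that cause the skew separately, then union the result with the rest;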
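
The map-join idea of caching the small table and joining while the large table is streamed corresponds to a classic hash join. The Python sketch below illustrates it under that assumption; map_join, orders, and products are hypothetical names and data, not part of Hive.

```python
def map_join(big_rows, small_rows, big_key, small_key):
    """Join by loading the small table into memory and streaming the big one."""
    # Build phase: index the small table by its join key.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)

    # Probe phase: each big-table row is joined locally, with no shuffle by key.
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield {**match, **row}


orders = [{"oid": 1, "cid": 10}, {"oid": 2, "cid": 20}, {"oid": 3, "cid": 99}]
products = [{"id": 10, "price": 5}, {"id": 20, "price": 7}]

for joined in map_join(orders, products, "cid", "id"):
    print(joined)
```

Because no row is routed by its join key, a skewed key in the large table cannot overload a single reducer, which is the point of the technique.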

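Optimizing join queries: in Hive, when a join query has filter conditions, it is better to move them into subqueries so that each table is filtered before it is joined (the reason given is that a count in Hive runs on a single reducer, so the data should be filtered before it is counted). For example:

-- Before:
select * from t1 a join t2 b on a.cid=b.id where a.dt='2019-10' and b.price>100;
-- After:
select * from (select * from t1 where dt='2019-10') a join (select * from t2 where price>100) b on a.cid=b.id;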
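
The effect of filtering before joining can be checked with a small Python model: both forms return the same rows, but the second feeds far fewer rows into the join. The helper hash_join and the generated tables t1 and t2 are made up for this illustration and say nothing about how a particular Hive version plans the query.

```python
def hash_join(left, right, left_key, right_key):
    """Plain in-memory inner join returning (left_row, right_row) pairs."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [(l, r) for l in left for r in index.get(l[left_key], [])]


# Made-up tables: 100 rows in t1, 5 rows in t2 with prices 50..250.
t1 = [{"cid": i % 5, "dt": "2019-10" if i % 2 == 0 else "2019-11"} for i in range(100)]
t2 = [{"id": i, "price": 50 * (i + 1)} for i in range(5)]

# Before: join everything, then filter the joined rows.
joined_all = hash_join(t1, t2, "cid", "id")
before = [(a, b) for a, b in joined_all if a["dt"] == "2019-10" and b["price"] > 100]

# After: filter each table first, then join the smaller inputs.
t1_filtered = [a for a in t1 if a["dt"] == "2019-10"]
t2_filtered = [b for b in t2 if b["price"] > 100]
after = hash_join(t1_filtered, t2_filtered, "cid", "id")

print(len(joined_all), "rows joined before filtering")
print(len(t1_filtered) + len(t2_filtered), "input rows joined after filtering")
```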

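Parameter tuning

Adjust the split size: set mapred.max.split.size=134217728 (default 128 MB);
JVM reuse: set mapred.job.reuse.jvm.num.tasks=1 (default 1);
Disable speculative execution: set mapred.map.tasks.speculative.execution=false and set mapred.reduce.tasks.speculative.execution=false;
Enable strict mode: set hive.mapred.mode=strict. With strict mode on, the following cases raise an error:
- querying a partitioned table without a filter on the partition column;
- using order by without limit;
- producing a Cartesian product;
Switch the execution engine: set hive.execution.engine=spark; (the default is mr)
Set the number of reducers: set mapreduce.job.reduces=2;
hive.exec.reducers.bytes.per.reducer=1000000000; (by default Hive allocates one reducer per 1 GB of input)
Avoid count(distinct column); use a subquery instead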
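
How the two reducer settings interact can be expressed as a short calculation. The Python sketch below is a rough model, assuming the number of reducers is the input size divided by hive.exec.reducers.bytes.per.reducer, rounded up and capped by a maximum, unless mapreduce.job.reduces forces a value. The function name estimate_reducers and the cap of 999 are assumptions made for this example.

```python
import math
from typing import Optional


def estimate_reducers(
    input_bytes: int,
    bytes_per_reducer: int = 1_000_000_000,  # hive.exec.reducers.bytes.per.reducer
    max_reducers: int = 999,                 # assumed cap, for illustration only
    forced: Optional[int] = None,            # mapreduce.job.reduces, if set
) -> int:
    """Estimate how many reducers a job would get under this simple model."""
    if forced is not None and forced > 0:
        return forced
    estimate = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, estimate))


print(estimate_reducers(2_500_000_000))           # 2.5 GB of input
print(estimate_reducers(2_500_000_000, forced=2))  # explicit reducer count wins
```

Lowering bytes_per_reducer raises the parallelism of the reduce stage, while an explicit reducer count bypasses the estimate entirely.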

sqoop

sqoop transfers data between HDFS and relational databases in both directions, and can also import data from a relational database into HBase. Installing sqoop requires the JDK and Hadoop environment variables (plus HBase's when importing into HBase), and the relevant driver jars (for example the MySQL driver) must be copied into sqoop's lib directory. Finally, set the relevant environment variables in ./conf/sqoop-env.sh;

Basic commands:

List MySQL databases: ./sqoop list-databases --connect jdbc:mysql://hdp01:3306/ --username root --password 1234;
List tables: ./sqoop list-tables --connect jdbc:mysql://hdp01:3306/db1 --username root --password 1234;
mysql->hdfs: ./sqoop import --connect jdbc:mysql://hdp01:3306/db1 --username root --password 1234 --table t1 --target-dir '/data/t1' --fields-terminated-by '|' -m 1; (a SQL query can also be specified with the --query parameter, for example: --query 'select * from db1.t1 where $CONDITIONS and id > 10')
hdfs->mysql:
- first create the table in MySQL, making sure the column types match;
- ./sqoop export --connect jdbc:mysql://hdp01:3306/db1 --username root --password root --export-dir '/data/t1/' --table t1 -m 1 --fields-terminated-by '|';
mysql->hbase: ./sqoop import --connect jdbc:mysql://hdp01:3306/db1 --username root --password 1234 --table t1 --hbase-table t1 --column-family info --hbase-row-key sid --hbase-create-table;
Show help: ./sqoop help, or for a single tool: ./sqoop help import

Handling JSON

Copy the jar hive-hcatalog-core-1.2.0.jar into the lib directory;
Run in the Hive client: add jar /…/lib/xxx.jar;
Create a table matching the JSON structure: create table if not exists t1(id int,name String) row format serde 'org.apache.hive.hcatalog.data.JsonSerDe' stored as textfile;
Load the data: load data local inpath '/data/xxx.json' into table t1;
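
The input format this setup expects, one JSON object per line with keys matching the column names, can be illustrated with a short Python sketch. It is only a model of the mapping, not the JsonSerDe itself; read_json_rows and the sample lines are hypothetical.

```python
import json
from typing import Iterable, List, Tuple


def read_json_rows(lines: Iterable[str], columns=("id", "name")) -> List[Tuple]:
    """Map each JSON line to a row, using None for keys that are absent."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append(tuple(record.get(col) for col in columns))
    return rows


# Sample content of a file such as /data/xxx.json (made-up data).
sample_lines = ['{"id": 1, "name": "Tom"}', '{"id": 2}', ""]
print(read_json_rows(sample_lines))
```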
