Evaluating a Hive + HBase Data Processing Solution, Part 2 (Tuning)

阿新 · Published: 2019-02-09

Continuing from the previous post, this one tunes HBase parameters, mainly those that affect query efficiency.

count

select count(1) from hbase_table;

Partial-column table copy

insert overwrite table hive_table select a,b,c,d from hbase_table;

Full-column table copy

insert into table test_table partition(part='aa') select * from hbase_table;

Hive-to-Hive table copy

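-- A fully dynamic partition insert is rejected when the partition mode is strict.
set hive.exec.dynamic.partition.mode=nonstrict;
create table test_table2 like test_table;
insert into table test_table2 partition(part) select * from test_table;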

Parameters changed for tuning

<property>
    <name>hbase.regionserver.handler.count</name>
    <value>100</value>
    <description>Count of RPC Listener instances spun up on RegionServers.
    Same property is used by the Master for count of master handlers.
    Default is 10.
    </description>
</property>
<property>
    <name>hfile.block.cache.size</name>
    <value>0.4</value>
    <description>
    Percentage of maximum heap (-Xmx setting) to allocate to block cache
    used by HFile/StoreFile. Default of 0.25 means allocate 25%.
    Set to 0 to disable but it's not recommended.
    </description>
</property>
<property>
    <name>hbase.client.scanner.caching</name>
    <value>1000</value>
    <description>Number of rows that will be fetched when calling next
    on a scanner if it is not served from (local, client) memory. Higher
    caching values will enable faster scanners but will eat up more memory
    and some calls of next may take longer and longer times when the cache is empty.
    Do not set this value such that the time between invocations is greater
    than the scanner timeout; i.e. hbase.regionserver.lease.period
    </description>
</property>
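
To give a feel for why hbase.client.scanner.caching matters for full-table scans such as count, here is a rough sketch in Python. It only estimates a lower bound on the number of scanner next() round trips for one of the table sizes tested in this post. The function name is made up for illustration, the caching values 1 and 100 are arbitrary comparison points (the post does not say what the value was before tuning), and real scans are split per region, so actual call counts will be somewhat higher.

```python
def scanner_round_trips(total_rows: int, caching: int) -> int:
    """Lower bound on next() round trips needed to stream total_rows rows
    when each round trip returns at most `caching` rows."""
    if caching <= 0:
        raise ValueError("caching must be positive")
    # Integer ceiling division: a partially filled last batch still costs one call.
    return (total_rows + caching - 1) // caching


rows = 1_616_374  # size of the smaller test table in this post

# 1000 is the tuned value from the config above; 1 and 100 are
# illustrative comparison points only, not values taken from the post.
for caching in (1, 100, 1000):
    print(f"caching={caching:>4}: at least {scanner_round_trips(rows, caching)} round trips")
```

Fewer round trips come at the price of more memory per scanner, which is the trade-off the property description warns about.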

Table copy

Tuning	Columns	Case	Rows	CPU cost (minutes, seconds)	Elapsed time (seconds)
Before	Partial	hbase->hive	1616374	10, 18	359.162
After	Partial	hbase->hive	1616374	3, 24	281.975
After	Partial	hbase->hive	1616374	2, 38	232.391
After	Partial	hbase->hive	2608626	4, 39	263.206
After	All	hbase->hive	2608626	7, 53	820.914
After	Partial	hbase->hive	12230528	13, 22	765.262
After	All	hbase->hive	12230528	22, 59	1305.236
After	All	hive->hive	12230528	10, 41	580.522
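
The elapsed times are easier to compare as throughput. The following Python sketch simply re-derives rows per second and the all-columns versus partial-columns ratio from the numbers in the table above. The tuple layout and function names are my own choices for illustration, not anything from the test setup.

```python
# (tuning, columns, case, rows, elapsed_seconds), copied from the table above
RUNS = [
    ("before", "partial", "hbase->hive", 1_616_374, 359.162),
    ("after", "partial", "hbase->hive", 1_616_374, 281.975),
    ("after", "partial", "hbase->hive", 1_616_374, 232.391),
    ("after", "partial", "hbase->hive", 2_608_626, 263.206),
    ("after", "all", "hbase->hive", 2_608_626, 820.914),
    ("after", "partial", "hbase->hive", 12_230_528, 765.262),
    ("after", "all", "hbase->hive", 12_230_528, 1305.236),
    ("after", "all", "hive->hive", 12_230_528, 580.522),
]


def rows_per_second(rows: int, seconds: float) -> float:
    """Average copy throughput for one run."""
    return rows / seconds


def slowdown(slow_seconds: float, fast_seconds: float) -> float:
    """How many times longer the slower run took."""
    return slow_seconds / fast_seconds


for tuning, columns, case, rows, seconds in RUNS:
    print(f"{tuning:6} {columns:7} {case:11} {rows:>10} "
          f"{rows_per_second(rows, seconds):8.0f} rows/s")

# All columns vs. partial columns, after tuning, same row count.
print(f"2608626 rows:  all/partial = {slowdown(820.914, 263.206):.2f}x")
print(f"12230528 rows: all/partial = {slowdown(1305.236, 765.262):.2f}x")
```

These are single runs (two for the smallest tuned case), so the ratios should be read as rough indications rather than stable benchmarks.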

count

Tuning	Table type	Rows	CPU cost (minutes, seconds)	Elapsed time (seconds)
Before	hbase	1616374	10, 45	728.647
Before	hive	1616374	0, 25	64.815
After	hbase	1616374	4, 9	609.28
After	hbase	12230528	13, 10	907.44
After	hive	12230528	3, 18	422.138
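
As a cross-check on the count numbers, this Python sketch converts the CPU cost column to seconds and computes how much slower the HBase-backed table is than the Hive table at each size, plus the gain from tuning. The variable and function names are illustrative only; the figures are copied from the table above.

```python
def cpu_seconds(minutes: int, seconds: int) -> int:
    """Convert the table's (minutes, seconds) CPU cost to seconds."""
    return minutes * 60 + seconds


# (cpu_seconds, elapsed_seconds) per measurement, from the count table
hbase_before_small = (cpu_seconds(10, 45), 728.647)  # 1,616,374 rows, before tuning
hbase_after_small = (cpu_seconds(4, 9), 609.28)      # 1,616,374 rows, after tuning
hive_small = (cpu_seconds(0, 25), 64.815)            # 1,616,374 rows, Hive table
hbase_after_large = (cpu_seconds(13, 10), 907.44)    # 12,230,528 rows, after tuning
hive_large = (cpu_seconds(3, 18), 422.138)           # 12,230,528 rows, Hive table


def ratios(slow, fast):
    """Return (cpu_ratio, elapsed_ratio) of the slower measurement over the faster."""
    return slow[0] / fast[0], slow[1] / fast[1]


for label, slow, fast in [
    ("tuning gain, 1616374 rows", hbase_before_small, hbase_after_small),
    ("hbase vs hive, 1616374 rows", hbase_after_small, hive_small),
    ("hbase vs hive, 12230528 rows", hbase_after_large, hive_large),
]:
    cpu_ratio, elapsed_ratio = ratios(slow, fast)
    print(f"{label}: cpu {cpu_ratio:.2f}x, elapsed {elapsed_ratio:.2f}x")
```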

Summary

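Aggregate queries on Hive-over-HBase tables are far slower than on native Hive tables, by a factor of 3 or more in most measurements. In the count tests after tuning, the gap is about 9x in elapsed time at 1,616,374 rows, and about 4x in CPU time (about 2x in elapsed time) at 12,230,528 rows.
Tuning the HBase parameters improves query efficiency severalfold in CPU time (roughly 2.5x to 4x in these tests; elapsed time improves less, about 1.2x to 1.5x), but a gap to native Hive tables remains.
Copying a subset of columns from a Hive-over-HBase table into a Hive table is faster than copying all columns.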
...
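In summary, HBase-backed Hive storage is not well suited to aggregate computation. Converting HBase-backed Hive tables into native Hive storage also becomes worryingly slow as the data volume grows (as the tests above show, even a Hive-to-Hive copy takes a long time at large volumes). Whether to run aggregations directly on the Hive-over-HBase tables, or to copy them into Hive tables first and aggregate there, is a trade-off. Whether there is a better way to integrate Hive and HBase needs further study.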

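Analysis: as the data keeps growing, running an hbase->hive copy every day is not a realistic option, yet we ultimately do need to turn the HBase tables into Hive tables. Separating "hot" from "cold" data and copying only a subset of columns may be ways to optimize this.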

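Note: the testing is not yet comprehensive and needs to be extended, especially for aggregations with complex SQL on Hive-over-HBase tables.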
