
A summary of running a demo with Hive on Hadoop

Following the walkthrough at http://blog.csdn.net/linghe301/article/details/9196713, this post runs a GIS demo program on top of Hadoop, Hive, and MySQL.

Download the demo:

  • Sample tools that demonstrate full stack implementations of all the resources provided to solve GIS problems using Hadoop
  • Templates for building custom tools that solve specific problems
Keep only the useful parts of the data and delete the rest; that completes the preparation for the demo.

The Hadoop and MySQL environments were already set up beforehand.

Start MySQL, create a hive database, and create the user hive_user with password hive_pass. Then import scripts/metastore/upgrade/mysql/hive-schema-2.1.0.mysql.sql from the Hive distribution into the hive database.
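A minimal sketch of those MySQL steps, assuming the Hive release is unpacked at /home/hadoop/apache-hive-2.1.0 and the mysql client runs as root (the '%' host grant is also an assumption; tighten it for your setup):

  -- create the metastore database and the Hive user
  CREATE DATABASE hive;
  CREATE USER 'hive_user'@'%' IDENTIFIED BY 'hive_pass';
  GRANT ALL PRIVILEGES ON hive.* TO 'hive_user'@'%';
  FLUSH PRIVILEGES;
  -- import the metastore schema shipped with Hive
  USE hive;
  SOURCE /home/hadoop/apache-hive-2.1.0/scripts/metastore/upgrade/mysql/hive-schema-2.1.0.mysql.sql;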

With the prerequisites in place, set up the Hive environment: download and unpack the release.

You can export the Hive path into your environment, but this is optional, not required.

Create hive-site.xml under conf:

<configuration>
    <!-- MySQL connection info -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://mysql-hostname:3306/hive?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive_user</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive_pass</value>
    </property>
</configuration>
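For reference, instead of importing the schema by hand as above, Hive 2.x also ships a schematool that can initialize the MySQL metastore from exactly this hive-site.xml:

  $ ./bin/schematool -dbType mysql -initSchema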

At this point I had not figured out whether the metastore really ends up in MySQL; the DBS query further below confirms that it does.
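A quick way to verify, using the mysql client:

  mysql> USE hive;
  mysql> SHOW TABLES;        -- metastore tables such as DBS, TBLS, COLUMNS_V2
  mysql> SELECT DB_LOCATION_URI FROM DBS;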

The configuration above is only rough, and some settings are surely still missing. From further reading I learned the following (roughly: version 2 enables a new CLI, the Beeline client, with HiveServer2 on the server side):

Beeline – a new command line shell

HiveServer2 supports a new command shell Beeline that works with HiveServer2. It's a JDBC client that is based on the SQLLine CLI (http://sqlline.sourceforge.net/). There's detailed documentation of SQLLine which is applicable to Beeline as well.

The Beeline shell works in both embedded mode as well as remote mode. In the embedded mode, it runs an embedded Hive (similar to Hive CLI) whereas remote mode is for connecting to a separate HiveServer2 process over Thrift. Starting in Hive 0.14, when Beeline is used with HiveServer2, it also prints the log messages from HiveServer2 for queries it executes to STDERR.

In remote mode HiveServer2 only accepts valid Thrift calls – even in HTTP mode, the message body contains Thrift payloads.


Beeline is used together with HiveServer2 and supports both embedded and remote mode.

Start HiveServer2:

$ ./bin/hiveserver2

Start Beeline:

$ ./bin/beeline 

beeline> !connect jdbc:hive2://localhost:10000
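The connection can also be made in one step; -u (JDBC URL) and -n (user name) are standard Beeline options, and the user hadoop is an assumption from my setup:

$ ./bin/beeline -u jdbc:hive2://localhost:10000 -n hadoop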

https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients

http://blog.cloudera.com/blog/2014/02/migrating-from-hive-cli-to-beeline-a-primer/


----------------------------------------------------------------------------------------------------------------------------------------------------

At this point Hive is more or less configured. There are still many things and concepts I do not know, such as Hive QL statements; see:

http://www.cnblogs.com/HondaHsu/p/4346354.html

When running hive brings up the command prompt, everything is ready to go.

Unpack the demo and put the data into /gis/data (a path in HDFS, so the files must be uploaded with hdfs dfs -put, as sketched below). The data contains two folders, one in JSON format and one in CSV format.
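A sketch of the upload; the local path of the unpacked demo data is an assumption, adjust it to wherever you extracted the demo:

  $ hdfs dfs -mkdir -p /gis/data
  $ hdfs dfs -put /home/hadoop/gis-for-hadoop/data/counties-data /gis/data/
  $ hdfs dfs -put /home/hadoop/gis-for-hadoop/data/earthquake-data /gis/data/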

The resulting listing under /gis/data:

Permission  Owner   Group       Size  Last Modified        Replication  Block Size  Name
drwxr-xr-x  hadoop  supergroup  0 B   2016/7/3 5:43:00 PM  0            0 B         counties-data
drwxr-xr-x  hadoop  supergroup  0 B   2016/7/3 5:43:02 PM  0            0 B         earthquake-data

Modify the following part of the demo:

add jar
  /home/hadoop/gis-for-hadoop/lib/esri-geometry-api.jar
  /home/hadoop/gis-for-hadoop/lib/spatial-sdk-hadoop.jar;

Hive's add jar adds the two packages to the session; /home/hadoop/gis-for-hadoop/lib/... is a local (Ubuntu) filesystem path. Where exactly the jars end up, I do not know yet. During execution, however, the job went looking for these two files in HDFS, so both jars also had to be uploaded to HDFS. Whether that is caused by the demo code or by my misuse of the add jar command I will only know after studying the code and the Hive commands.
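What made the job find them, in my case, was mirroring the local path in HDFS; a sketch (the exact target path is an assumption based on the behaviour described above):

  $ hdfs dfs -mkdir -p /home/hadoop/gis-for-hadoop/lib
  $ hdfs dfs -put /home/hadoop/gis-for-hadoop/lib/esri-geometry-api.jar /home/hadoop/gis-for-hadoop/lib/
  $ hdfs dfs -put /home/hadoop/gis-for-hadoop/lib/spatial-sdk-hadoop.jar /home/hadoop/gis-for-hadoop/lib/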

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';

# Register the HDFS files as tables. A problem came up here: at first the LOCATION used the master hostname instead of localhost, which failed with "connection refused". After consulting:

# http://wiki.apache.org/hadoop/ConnectionRefused

# http://blog.csdn.net/z363115269/article/details/39048589

# http://www.iteblog.com/archives/802

# I learned to check the DB_LOCATION_URI column of the DBS table in the hive metastore database: select DB_LOCATION_URI from DBS; it contains localhost, which is why localhost is used below. The exact reason I do not know yet; something to study!
CREATE EXTERNAL TABLE IF NOT EXISTS earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE,
    magtype string, mbstations string, gap string, distance string, rms string, source string, eventid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://localhost:9000/gis/data/earthquake-data';

# Load the other data file into a table:
CREATE EXTERNAL TABLE IF NOT EXISTS counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)                                         
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'              
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://localhost:9000/gis/data/counties-data';
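A quick sanity check that both external tables see their data (plain HiveQL, nothing demo-specific; the row counts depend on your data):

  SELECT COUNT(*) FROM earthquakes;
  SELECT name, state FROM counties LIMIT 5;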

# This is the key point, the essence of Hive: tables manage the underlying data files, and Hive SQL breaks the work down into MapReduce jobs. HDFS provides file management and access; Hive translates the Hive SQL statement into an MR pipeline, which lowers the entry barrier.

# SQL-like statements replace hand-written programs! The gain: accessing static files and running MR becomes simple.

SELECT counties.name, count(*) cnt FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt desc;
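To see how Hive decomposes this statement into MapReduce stages before running it, prefix it with EXPLAIN (standard HiveQL; the plan output varies by version):

  EXPLAIN
  SELECT counties.name, count(*) cnt FROM counties
  JOIN earthquakes
  WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
  GROUP BY counties.name
  ORDER BY cnt desc;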

Entering hive and running the query gives the following result:

hive> SELECT counties.name, count(*) cnt FROM counties
    > JOIN earthquakes
    > WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
    > GROUP BY counties.name
    > ORDER BY cnt desc;
Warning: Map Join MAPJOIN[20][bigTable=?] in task 'Stage-2:MAPRED' is a cross product
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160703034243_ce75c555-bf04-48f9-9b5a-a6466a70a9e1
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-07-03 03:42:58 Starting to launch local task to process map join; maximum memory = 518979584
2016-07-03 03:43:02 Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/hadoop/3e288758-d62b-4417-93e8-f379f4969ea2/hive_2016-07-03_03-42-43_138_2061802838610441568-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile20--.hashtable
2016-07-03 03:43:02 Uploaded 1 File to: file:/tmp/hadoop/3e288758-d62b-4417-93e8-f379f4969ea2/hive_2016-07-03_03-42-43_138_2061802838610441568-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile20--.hashtable (181836 bytes)
2016-07-03 03:43:02 End of local task; Time Taken: 4.391 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-07-03 03:43:10,532 Stage-2 map = 0%,  reduce = 0%
2016-07-03 03:43:22,064 Stage-2 map = 100%,  reduce = 0%
2016-07-03 03:43:23,077 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local718067817_0003
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-07-03 03:43:26,449 Stage-3 map = 100%,  reduce = 100%
Ended Job = job_local1888560612_0004
MapReduce Jobs Launched: 
Stage-Stage-2:  HDFS Read: 13516872 HDFS Write: 0 SUCCESS
Stage-Stage-3:  HDFS Read: 15548312 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Kern             36
San Bernardino   35
Imperial         28
Inyo             20
Los Angeles      18
Riverside        14
Monterey         14
Santa Clara      12
Fresno           11
San Benito       11
San Diego        7
Santa Cruz       5
San Luis Obispo  3
Ventura          3
Orange           2
San Mateo        1
Time taken: 43.319 seconds, Fetched: 16 row(s)
hive>