
Deploying Hadoop 2.7.7 + Hive 2.3.4 on DiDi Cloud

1. The cluster architecture used in this example is as follows:

Here we use the internal IPs of the DiDi Cloud instances. If Hadoop needs to be reached from outside, a public IP (EIP) must be bound. For details on using DiDi Cloud EIPs, see:
https://help.didiyun.com/hc/kb/section/1035272/

  • The master node stores the distributed file system metadata (for example the inode table) as well as the resource scheduler and its records. It runs two daemons:
    NameNode: manages the distributed file system and records where each data block is stored in the cluster.
    ResourceManager: schedules resources on the data nodes (node1 and node2 in this example); each data node runs a NodeManager that carries out the actual work.

  • Node1 and node2 store the actual data and provide compute resources. Each runs two daemons:
    DataNode: manages the physical storage of the actual data.
    NodeManager: manages the execution of compute tasks on its own node.

2. System Configuration

The DiDi Cloud instances used in this example have the following specification:
2-core CPU, 4 GB RAM, 40 GB HDD storage, 3 Mbps bandwidth, CentOS 7.4

  • For security reasons, DiDi Cloud instances do not allow direct root login by default; log in as dc2-user first, then switch to root with sudo su. In this example all commands are run as dc2-user by default, and dc2-user is also the default Hadoop user.

  • Write the IPs and hostnames of the three nodes into the /etc/hosts file on all three nodes, and comment out the first three lines.

sudo vi /etc/hosts
#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
#127.0.0.1 10-254-149-24
10.254.149.24   master
10.254.88.218   node1
10.254.84.165   node2
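
An optional check (a quick sketch, assuming the hostnames above) to confirm that each name resolves to the intended internal IP on every node:

for h in master node1 node2; do getent hosts "$h"; done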
  • The master node needs key-based SSH connections to node1 and node2; generate a key pair for dc2-user on the master node.
ssh-keygen -b 4096
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:zRhhVpEfSIZydqV75775sZB0GBjZ/f7nnZ4mgfYrWa8 [email protected]
The key's randomart image is:
+---[RSA 4096]----+
|        ++=*+ .  |
|      .o+o+o+. . |
|       +...o o  .|
|         = .. o .|
|        S + oo.o |
|           +.=o .|
|          . +o+..|
|           o +.+O|
|            .EXO=|
+----[SHA256]-----+

Run the following commands to copy the generated public key to all three nodes:

ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@master
ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@node1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@node2

Next, from the master, run ssh dc2-user@node1 and ssh dc2-user@node2 to verify that the connections succeed without a password.
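
A non-interactive way to run the same check from the master (a sketch; BatchMode makes ssh fail instead of prompting if key-based login is not working):

for h in master node1 node2; do ssh -o BatchMode=yes dc2-user@$h hostname; done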

  • Configure the Java environment

Download the JDK on all three nodes.

mkdir /home/dc2-user/java
cd /home/dc2-user/java
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz
tar -zxf jdk-8u191-linux-x64.tar.gz

Configure the Java environment variables on all three nodes.

sudo vi /etc/profile.d/jdk-1.8.sh
export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Make the environment variables take effect.

source /etc/profile

Check the Java version.

java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

This output shows that the Java environment has been configured successfully.

3. Install Hadoop

Download Hadoop 2.7.7 on the master node and extract it.

cd /home/dc2-user
wget http://mirrors.shu.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar zxf hadoop-2.7.7.tar.gz

The 6 files to configure under /home/dc2-user/hadoop-2.7.7/etc/hadoop are hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and slaves (in Hadoop 2.7.x the worker list lives in the slaves file).

  • In hadoop-env.sh, add the following:

export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191
export HDFS_NAMENODE_USER="dc2-user"
export HDFS_DATANODE_USER="dc2-user"
export HDFS_SECONDARYNAMENODE_USER="dc2-user"
export YARN_RESOURCEMANAGER_USER="dc2-user"
export YARN_NODEMANAGER_USER="dc2-user"
  • core-site.xml
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://master:9000</value>
        </property>
    </configuration>
  • hdfs-site.xml
<configuration>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/home/dc2-user/data/nameNode</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/home/dc2-user/data/dataNode</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
       </property>
       <property>
           <name>dfs.http.address</name>
           <value>master:50070</value>
       </property>
</configuration>
  • yarn-site.xml
<configuration>
    <property>
            <name>yarn.acl.enable</name>
            <value>0</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname</name>
            <value>master</value>
    </property>
    <property>
          <name>yarn.resourcemanager.webapp.address</name>
          <value>master:8088</value>
    </property>
    <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
    </property>
     <property>
	<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>


</configuration>
  • mapred-site.xml (the JobHistory addresses set below require the separately started JobHistory server; see the sketch after this list)
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1536</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1024M</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>3072</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx2560M</value>
    </property>
  
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>

    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
  • Edit slaves (list the worker hostnames, one per line):
node1
node2
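
The JobHistory addresses configured in mapred-site.xml above (master:10020 and master:19888) only take effect if the JobHistory server is running; it is not started by start-dfs.sh or start-yarn.sh. A sketch for starting it on the master once the cluster is up in step 4 (optional; the WordCount example below works without it):

/home/dc2-user/hadoop-2.7.7/sbin/mr-jobhistory-daemon.sh start historyserver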

4. Start Hadoop

  • Copy the configured Hadoop directory to node1 and node2:
scp -r /home/dc2-user/hadoop-2.7.7 dc2-user@node1:/home/dc2-user/
scp -r /home/dc2-user/hadoop-2.7.7 dc2-user@node2:/home/dc2-user/
  • Configure the Hadoop environment variables (on all three nodes)
sudo vi /etc/profile.d/hadoop-2.7.7.sh
export HADOOP_HOME="/home/dc2-user/hadoop-2.7.7"
export PATH="$HADOOP_HOME/bin:$PATH"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

Make the environment variables take effect.

source /etc/profile

On all three nodes, run hadoop version and check that it produces output, to verify that the environment variables are in effect:

hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
From source with checksum 792e15d20b12c74bd6f19a1fb886490
This command was run using /home/dc2-user/hadoop-2.7.7/share/hadoop/common/hadoop-common-2.7.7.jar
  • Format HDFS (on the master only)
/home/dc2-user/hadoop-2.7.7/bin/hdfs namenode -format testCluster
  • Start the services
/home/dc2-user/hadoop-2.7.7/sbin/start-dfs.sh
/home/dc2-user/hadoop-2.7.7/sbin/start-yarn.sh
  • Check whether the services have started on all three nodes

master

jps
1654 Jps
31882 NameNode
32410 ResourceManager
32127 SecondaryNameNode

node1

jps
19827 NodeManager
19717 DataNode
20888 Jps

node2

jps
30707 Jps
27675 NodeManager
27551 DataNode

If you see results like the above, the services have started normally. The ResourceManager web UI can now be reached via the master's public IP; note that port 8088 must be opened in the security group. For DiDi Cloud security group usage, see: https://help.didiyun.com/hc/kb/article/1091031/

Note: opening port 8088 to the public internet may be exploited by attackers to plant malware, so it is recommended to restrict the allowed source IPs in the security group, or not to open the port at all.
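
As a convenience, the daemon check on all three nodes can be run from the master in a single loop; a minimal sketch, assuming the dc2-user account, the hostnames from /etc/hosts, and the JDK path configured earlier:

for h in master node1 node2; do
  echo "== $h =="
  ssh dc2-user@$h '/home/dc2-user/java/jdk1.8.0_191/bin/jps'
done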

5. Verify with an Example

Finally, use the WordCount program bundled with Hadoop to verify the MapReduce functionality; the following is done on the master node.
First create two files, test1 and test2, in the current directory with the following contents:

vi test1
hello world
bye world
vi test2
hello hadoop
bye hadoop

Next, create a directory in HDFS and upload the two files into it.

hadoop fs -mkdir /input
hadoop fs -put test* /input
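
An optional check (a sketch) to confirm that the files landed in HDFS before running the job:

hadoop fs -ls /input
hadoop fs -cat /input/test1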

When the cluster starts it first enters safe mode, so leave safe mode before running the job.

hdfs dfsadmin -safemode leave

Run the WordCount program to count the occurrences of each word in the two files.

yarn jar /home/dc2-user/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output

WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
2018-11-09 20:27:12,233 INFO client.RMProxy: Connecting to ResourceManager at master/10.254.149.24:8032
2018-11-09 20:27:12,953 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1541766351311_0001
2018-11-09 20:27:14,483 INFO input.FileInputFormat: Total input files to process : 2
2018-11-09 20:27:16,967 INFO mapreduce.JobSubmitter: number of splits:2
2018-11-09 20:27:17,014 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enab
2018-11-09 20:27:17,465 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541766351311_0001
2018-11-09 20:27:17,466 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-11-09 20:27:17,702 INFO conf.Configuration: resource-types.xml not found
2018-11-09 20:27:17,703 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-11-09 20:27:18,256 INFO impl.YarnClientImpl: Submitted application application_1541766351311_0001
2018-11-09 20:27:18,296 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1541766351311_0001/
2018-11-09 20:27:18,297 INFO mapreduce.Job: Running job: job_1541766351311_0001
2018-11-09 20:28:24,929 INFO mapreduce.Job: Job job_1541766351311_0001 running in uber mode : false
2018-11-09 20:28:24,931 INFO mapreduce.Job:  map 0% reduce 0%
2018-11-09 20:28:58,590 INFO mapreduce.Job:  map 50% reduce 0%
2018-11-09 20:29:19,437 INFO mapreduce.Job:  map 100% reduce 0%
2018-11-09 20:29:33,038 INFO mapreduce.Job:  map 100% reduce 100%
2018-11-09 20:29:36,315 INFO mapreduce.Job: Job job_1541766351311_0001 completed successfully
2018-11-09 20:29:36,619 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=75
		FILE: Number of bytes written=644561
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=237
		HDFS: Number of bytes written=31
		HDFS: Number of read operations=11
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Killed map tasks=1
		Launched map tasks=3
		Launched reduce tasks=1
		Data-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=164368
		Total time spent by all reduces in occupied slots (ms)=95475
		Total time spent by all map tasks (ms)=82184
		Total time spent by all reduce tasks (ms)=31825
		Total vcore-milliseconds taken by all map tasks=82184
		Total vcore-milliseconds taken by all reduce tasks=31825
		Total megabyte-milliseconds taken by all map tasks=168312832
		Total megabyte-milliseconds taken by all reduce tasks=97766400
	Map-Reduce Framework
		Map input records=5
		Map output records=8
		Map output bytes=78
		Map output materialized bytes=81
		Input split bytes=190
		Combine input records=8
		Combine output records=6
		Reduce input groups=4
		Reduce shuffle bytes=81
		Reduce input records=6
		Reduce output records=4
		Spilled Records=12
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=2230
		CPU time spent (ms)=2280
		Physical memory (bytes) snapshot=756064256
		Virtual memory (bytes) snapshot=10772656128
		Total committed heap usage (bytes)=541589504
		Peak Map Physical memory (bytes)=281268224
		Peak Map Virtual memory (bytes)=3033423872
		Peak Reduce Physical memory (bytes)=199213056
		Peak Reduce Virtual memory (bytes)=4708827136
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=47
	File Output Format Counters 
		Bytes Written=31

Output like the above means the computation has finished; the result is stored in the /output directory in HDFS.

hadoop fs -ls /output
Found 2 items
-rw-r--r--   1 root supergroup          0 2018-11-09 20:29 /output/_SUCCESS
-rw-r--r--   1 root supergroup         31 2018-11-09 20:29 /output/part-r-00000

Open part-r-00000 to view the result:

hadoop fs -cat /output/part-r-00000
bye	2
hadoop	2
hello	2
world	2
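
If you want to rerun the job, note that MapReduce refuses to write to an output directory that already exists; remove it first (a sketch):

hadoop fs -rm -r /output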

6. Install and Configure Hive 2.3.4

Hive is a data warehouse built on top of Hadoop. It maps structured data files to tables and provides SQL-like queries; under the hood, Hive translates SQL statements into MapReduce jobs.

  • Download Hive 2.3.4 to /home/dc2-user on the master and extract it
wget http://mirror.bit.edu.cn/apache/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz
tar zxvf apache-hive-2.3.4-bin.tar.gz

  • Set the Hive environment variables

Edit the /etc/profile file and add the following.

sudo vi /etc/profile
export HIVE_HOME=/home/dc2-user/apache-hive-2.3.4-bin
export PATH=$PATH:$HIVE_HOME/bin

Make the environment variables take effect:

source /etc/profile
  • Configure Hive

Copy the template configuration files:

cd apache-hive-2.3.4-bin/conf/
cp hive-env.sh.template hive-env.sh 
cp hive-default.xml.template hive-site.xml 
cp hive-log4j2.properties.template hive-log4j2.properties 
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties

Edit hive-env.sh:

export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191  ## Java path
export HADOOP_HOME=/home/dc2-user/hadoop-2.7.7   ## Hadoop installation path
export HIVE_HOME=/home/dc2-user/apache-hive-2.3.4-bin ## Hive installation path
export HIVE_CONF_DIR=$HIVE_HOME/conf    ## Hive configuration directory

Edit hive-site.xml:
Change the value of the corresponding properties.

vi hive-site.xml
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
    <description>HDFS root scratch dir for Hive jobs which gets 
    created with write all (733) permission. For each connecting user, 
    an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, 
    with ${hive.scratch.dir.permission}.
    </description>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/${user.name}</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/tmp/hive/resources</value>
    <description>Temporary local directory for added resources in the remote 
    file system.</description>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/tmp/${user.name}</value>
    <description>Location of Hive run time structured log file</description>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/tmp/${user.name}/operation_logs</value>
    <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
  </property>

Configure the Hive Metastore
The Hive Metastore holds the metadata of Hive tables and partitions; this example uses MariaDB to store it.
Place mysql-connector-java-5.1.40-bin.jar in $HIVE_HOME/lib (a sketch for obtaining it follows the configuration below) and configure the MySQL connection information in hive-site.xml.

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
</property>
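
The connector jar mentioned above can be fetched from Maven Central; a sketch (the URL and the jar name without the -bin suffix are assumptions, any copy of mysql-connector-java 5.1.x works):

cd /home/dc2-user
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.40/mysql-connector-java-5.1.40.jar
cp mysql-connector-java-5.1.40.jar /home/dc2-user/apache-hive-2.3.4-bin/lib/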

Create HDFS directories for Hive

start-dfs.sh   # can be skipped if HDFS was already started during the Hadoop setup
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /usr/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /usr/hive/warehouse

Install MySQL; this example uses MariaDB.

sudo yum install -y mariadb-server
sudo systemctl start mariadb
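
Optionally (a sketch), make MariaDB start automatically on boot and harden it; mysql_secure_installation is interactive, and if you set a root password there, use mysql -uroot -p in the next step instead:

sudo systemctl enable mariadb
sudo mysql_secure_installation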

Log in to MySQL (there is no password initially), create the hive user, and set its password.

mysql -uroot
MariaDB [(none)]> create user 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> grant all privileges on *.* to 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
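
A quick check (a sketch) that the new account can connect with its password:

mysql -uhive -phive -e 'show databases;'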
  • Run Hive

Hive requires HDFS to be running; start it with start-dfs.sh if necessary, and skip this step if HDFS was already started during the Hadoop setup.
Starting with Hive 2.1, the schematool command must be run to initialize the metastore before starting Hive:

schematool -dbType mysql -initSchema

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dc2-user/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/dc2-user/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:	 jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver :	 com.mysql.jdbc.Driver
Metastore connection User:	 hive
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed
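
An optional check (a sketch): after initialization the metastore database should contain Hive's schema tables:

mysql -uhive -phive -e 'use hive; show tables;' | head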

Start Hive by entering the command hive:

hive

which: no hbase in (/home/dc2-user/java/jdk1.8.0_191/bin:/home/dc2-user/hadoop-2.7.7/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/bin:/home/dc2-user/apache-hive-2.3.4-bin/bin:/home/dc2-user/.local/bin:/home/dc2-user/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dc2-user/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/dc2-user/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/home/dc2-user/apache-hive-2.3.4-bin/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> 
  • Test Hive

Create a table in Hive:

hive> create table test_hive(id int, name string)
    > row format delimited fields terminated by '\t' # fields are separated by a tab
    > stored as textfile;  # storage format of the loaded data; the default is TEXTFILE. For plain-text data use STORED AS TEXTFILE: the file can be copied straight into HDFS and Hive can read it directly
OK
Time taken: 10.857 seconds
hive> show tables;
OK
test_hive
Time taken: 0.396 seconds, Fetched: 1 row(s)

The table has been created successfully. Enter quit; to exit Hive, then create some data as a text file:

vi test_db.txt
101	aa
102	bb
103	cc

Enter Hive and import the data:

hive> load data local inpath '/home/dc2-user/test_db.txt' into table test_hive;
Loading data to table default.test_hive
OK
Time taken: 6.679 seconds

hive> select * from test_hive;
101	aa
102	bb
103	cc
Time taken: 2.814 seconds, Fetched: 3 row(s)

The data was inserted successfully and can be queried normally.
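
As a follow-up sketch, an aggregate query makes Hive launch a MapReduce job on the cluster (the plain SELECT above is answered without one), which exercises the full Hive-on-Hadoop path:

hive -e "select count(*) from test_hive;"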