The Hadoop Distributed File System
http://hadoop.apache.org/docs/r1.1.2/single_node_setup.html
Lab environment:
Master:    desk11     192.168.122.11
Datanode:  server90   192.168.122.190
           server233  192.168.122.233
           server73   192.168.122.173  (used later for adding a node online)
Set up local name resolution (add these entries to /etc/hosts on every node):
192.168.122.11   desk11.example.com     desk11
192.168.122.190  server90.example.com   server90
192.168.122.233  server233.example.com  server233
192.168.122.173  server73.example.com   server73
1. Environment setup
Hadoop is a Java program, so it needs a Java virtual machine (JDK) to run on.
Prepare the JDK: jdk-6u26-linux-x64.bin
sh jdk-6u26-linux-x64.bin
mv jdk1.6.0_26/ /usr/local/jdk
vim /etc/profile
Add:
export JAVA_HOME=/usr/local/jdk
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
If the system already has the default Java (OpenJDK) installed, switch the alternatives over:
# alternatives --install /usr/bin/java java /usr/local/jdk/bin/java 2
# alternatives --set java /usr/local/jdk/bin/java
# java -version
java version "1.6.0_32"
Java(TM) SE Runtime Environment (build 1.6.0_32-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.7-b02, mixed mode)
Check where the java command resolves to:
which java
/usr/local/jdk/bin/java means the location is correct.
The Java environment is ready.
1. Pseudo-distributed mode (Master and Datanode on the same host)
Choose the Master host: desk11
yum -y install openssh rsync
useradd -u 600 hadoop    # everything in Hadoop runs as the hadoop user
echo hadoop | passwd --stdin hadoop
chown hadoop.hadoop /home/hadoop -R
All of the following is done as the hadoop user:
su - hadoop
Set up passwordless SSH (SSH equivalence):
ssh-keygen    # press Enter at every prompt
ssh-copy-id localhost
The result should look like this:
[hadoop@desk11 ~]$ ssh localhost
Last login: Sat Aug 3 13:59:58 2013 from localhost
Configure Hadoop:
cd hadoop-1.0.4/conf
vim core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://desk11:9000</value>
</property>
</configuration>
vim hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>    # a single datanode, so one replica
</property>
</configuration>
vim mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>desk11:9001</value>
</property>
</configuration>
[hadoop@desk11 conf]$ vim hadoop-env.sh
export JAVA_HOME=/usr/local/jdk
If JAVA_HOME is not set here, starting Hadoop fails with an "Error: JAVA_HOME is not set." message.
Format the namenode:
bin/hadoop namenode -format
A normal run ends with a message saying the storage directory has been successfully formatted.
Start Hadoop:
bin/start-all.sh
Check the running daemons with jps:
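In pseudo-distributed mode every daemon runs on this one host, so jps should list the NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker together. A rough sketch of what to expect (the process IDs will of course differ):
[hadoop@desk11 hadoop-1.0.4]$ jps
XXXX NameNode
XXXX SecondaryNameNode
XXXX DataNode
XXXX JobTracker
XXXX TaskTracker
XXXX Jps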
Once it is up, the cluster can also be checked in a browser: by default the NameNode web UI listens on port 50070 (http://desk11:50070) and the JobTracker on port 50030 (http://desk11:50030).
Test:
bin/hadoop fs -put conf westos    # upload the local conf directory into HDFS as the westos directory
bin/hadoop fs -ls    # list the contents of the filesystem
Copy files from the distributed filesystem back to the local machine:
bin/hadoop fs -get westos test    # download the westos directory from HDFS into a local test directory
A test directory now exists in the current directory, containing exactly the configuration files that were uploaded from conf.
bin/hadoop fs -mkdir wangzi
bin/hadoop fs -ls
bin/hadoop fs -rmr wangzi    # remove a directory from HDFS
bin/hadoop fs -ls
bin/hadoop jar hadoop-examples-1.0.4.jar grep westos output 'dfs[a-z.]+'
# run a sample MapReduce job: search westos for strings starting with dfs and store the matches in output
Job progress:
13/08/03 13:38:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/03 13:38:27 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/03 13:38:27 INFO mapred.FileInputFormat: Total input paths to process : 16
13/08/03 13:38:27 INFO mapred.JobClient: Running job: job_201308031321_0001
13/08/03 13:38:28 INFO mapred.JobClient: map 0% reduce 0%
13/08/03 13:39:28 INFO mapred.JobClient: map 6% reduce 0%
13/08/03 13:39:33 INFO mapred.JobClient: map 12% reduce 0%
Web monitoring:
The JobTracker page shows that one job has been submitted and is running.
Hadoop actually submits the work as two jobs (the grep search and the sort of its results).
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop fs -ls
View the contents of output:
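A minimal way to print the result from the command line (output is simply the output directory produced by the grep job above):
bin/hadoop fs -cat output/*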
Stop Hadoop:
[hadoop@desk11 hadoop-1.0.4]$ bin/stop-all.sh
[hadoop@desk11 hadoop-1.0.4]$ jps
28027 Jps
2. Fully distributed filesystem
1) JDK environment
On both data nodes, server90 and server233:
sh jdk-6u26-linux-x64.bin
mv jdk1.6.0_26/ /usr/local/jdk
vim /etc/profile
Add:
export JAVA_HOME=/usr/local/jdk
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
Check where the java command resolves to:
which java
/usr/local/jdk/bin/java means the location is correct.
The Java environment is ready.
2) sersync setup (configured as root)
The configuration on the Master must be identical to that on the other nodes, and SSH equivalence has to be established between the nodes, i.e. an SSH connection between any two nodes must not prompt for a password.
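A minimal sketch of establishing that equivalence from desk11 as the hadoop user, assuming the hadoop account already exists on the data nodes (repeat from each node if equivalence is needed in every direction):
ssh-keygen                      # accept the defaults
ssh-copy-id hadoop@server90
ssh-copy-id hadoop@server233
ssh hadoop@server90 hostname    # should print the hostname without asking for a password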
To make deployment easier, sersync is used to keep the nodes in sync.
Required package: sersync2.5_64bit_binary_stable_final.tar.gz
On every node:
yum -y install rsync xinetd
On the Master:
tar zxf sersync2.5.4_64bit_binary_stable_final.tar.gz
[root@desk11 ~]# ls
GNU-Linux-x86  hadoop
[root@desk11 ~]# cd GNU-Linux-x86/
[root@desk11 GNU-Linux-x86]# ls
confxml.xml  sersync2
[root@desk11 GNU-Linux-x86]# vim confxml.xml
<sersync>
<localpath watch="/home/hadoop">    # directory on this server to synchronise
<remote ip="192.168.122.190" name="rsync"/>    # name must match the module name configured on the target server
<remote ip="192.168.122.233" name="rsync"/>
<!--<remote ip="192.168.8.40" name="tongbu"/>-->
</localpath>
On the target servers, i.e. on both data nodes:
useradd -u 600 hadoop    # everything in Hadoop runs as the hadoop user
echo hadoop | passwd --stdin hadoop
[root@server90 ~]# vim /etc/rsyncd.conf
uid = hadoop    # everything synchronised over is owned by the hadoop user and group
gid = hadoop
max connections = 36000
use chroot = no
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsyncd.lock
[rsync]    # must match name="rsync" in the Master's confxml.xml
path = /home/hadoop    # local directory to synchronise into
comment = test files
ignore errors = yes
read only = no
hosts allow = 192.168.122.11/24
hosts deny = *
[root@server90 ~]# /etc/init.d/xinetd restart
Stopping xinetd: [ OK ]
Starting xinetd: [ OK ]
rsync --daemon
The steps on the two data nodes are exactly the same.
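A quick way to confirm from the Master that both rsync daemons are reachable is to list the modules they export; the [rsync] module defined above should show up (a sketch, run on desk11):
rsync rsync://192.168.122.190/
rsync rsync://192.168.122.233/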
On the Master node:
[root@desk11 GNU-Linux-x86]# /etc/init.d/xinetd restart
Stopping xinetd: [FAILED]
Starting xinetd: [ OK ]
[root@desk11 GNU-Linux-x86]# ./sersync2 -r -d    # -r: do a full initial sync, -d: run as a daemon
sersync then runs in the background and watches the directory on the Master for changes; whenever data changes it is pushed to the other two nodes.
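A simple way to check that the synchronisation works before relying on it for the Hadoop files (a sketch; the file name is arbitrary):
# on the Master
touch /home/hadoop/sersync_test
# a few seconds later, on either data node
ls -l /home/hadoop/sersync_test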
3) Configuring the distributed filesystem
All of this is done as the hadoop user.
On the Master node:
[hadoop@desk11 conf]$ vim masters    # specifies the Master; note that in Hadoop 1.x this file actually holds the host that runs the secondary namenode
desk11    # make sure the name resolves
[hadoop@desk11 conf]$ vim slaves
server90
server233
[hadoop@desk11 conf]$ vim hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>    # two datanodes, so two replicas
</property>
</configuration>
Start it up:
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop namenode -format
bin/start-all.sh
Note that the DataNode process no longer runs on the Master node; it is started on the two data nodes instead.
On the data nodes server90 and server233:
[hadoop@server90 ~]$ jps
22978 DataNode
23145 Jps
23071 TaskTracker
[hadoop@server233 ~]$ jps
23225 Jps
23150 TaskTracker
23055 DataNode
Test:
The web UI now shows that the number of live nodes has become two.
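The same information is available from the command line; a minimal check, run from the hadoop-1.0.4 directory on the Master:
bin/hadoop dfsadmin -report | grep "Datanodes available"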
Run the sample job again:
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop jar hadoop-examples-1.0.4.jar grep westos output 'dfs[a-z.]+'
13/08/05 01:41:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/05 01:41:10 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/05 01:41:10 INFO mapred.FileInputFormat: Total input paths to process : 16
13/08/05 01:41:11 INFO mapred.JobClient: Running job: job_201308050135_0001
13/08/05 01:41:12 INFO mapred.JobClient: map 0% reduce 0%
13/08/05 01:41:39 INFO mapred.JobClient: map 12% reduce 0%
13/08/05 01:41:43 INFO mapred.JobClient: map 25% reduce 0%
13/08/05 01:42:05 INFO mapred.JobClient: map 31% reduce 0%
13/08/05 01:42:08 INFO mapred.JobClient: map 37% reduce 0%
13/08/05 01:42:17 INFO mapred.JobClient: map 43% reduce 0%
13/08/05 01:42:20 INFO mapred.JobClient: map 50% reduce 0%
Monitor the cluster state:
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -report
Configured Capacity: 6209044480 (5.78 GB)
Present Capacity: 3567787548 (3.32 GB)
DFS Remaining: 3567493120 (3.32 GB)
DFS Used: 294428 (287.53 KB)
DFS Used%: 0.01%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)
Name: 192.168.122.233:50010
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 147214 (143.76 KB)
Non DFS Used: 1320521970 (1.23 GB)
DFS Remaining: 1783853056 (1.66 GB)
DFS Used%: 0%
DFS Remaining%: 57.46%
Last contact: Mon Aug 05 01:45:38 CST 2013
Name: 192.168.122.190:50010
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 147214 (143.76 KB)
Non DFS Used: 1320734962 (1.23 GB)
DFS Remaining: 1783640064 (1.66 GB)
DFS Used%: 0%
DFS Remaining%: 57.45%
Last contact: Mon Aug 05 01:45:36 CST 2013
Both datanodes took part in the computation, so the load is spread across them.
4) Adding a node to Hadoop online
1. Install the JDK on the new node and create the same hadoop user, keeping the uid identical.
2. Add the new node's IP or hostname to conf/slaves.
3. Establish SSH equivalence between server73 and each of the existing nodes.
4. Synchronise all of the Hadoop data from the Master to the new node, keeping the path identical (a sketch of one way to do this follows after step 6).
5. Start the services on the new node:
[hadoop@server73 hadoop-1.0.4]$ bin/hadoop-daemon.sh start datanode
[hadoop@server73 hadoop-1.0.4]$ bin/hadoop-daemon.sh start tasktracker
[hadoop@server73 hadoop-1.0.4]$ jps
1926 DataNode
2024 TaskTracker
2092 Jps
The node count has gone up by one.
6. Rebalance the data:
On the Master node:
[hadoop@desk11 hadoop-1.0.4]$ bin/start-balancer.sh
starting balancer, logging to /home/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-balancer-desk11.example.com.out
1) If the data is not rebalanced, the cluster places all new data on the new datanode, which lowers MapReduce efficiency.
2) A balancing threshold can be set; the default is 10%. The lower the value, the more evenly balanced the nodes, but the longer it takes.
[hadoop@desk11 hadoop-1.0.4]$ bin/start-balancer.sh -threshold 5
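For step 4 above, one way to copy the Master's installation to the new node while keeping the path identical is a one-off rsync over SSH (a sketch run as the hadoop user on desk11, relying on the SSH equivalence from step 3; sersync can equally be extended to cover server73):
rsync -a /home/hadoop/hadoop-1.0.4 hadoop@server73:/home/hadoop/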
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop jar hadoop-examples-1.0.4.jar grep westos test 'dfs[a-z.]+'
13/08/05 03:03:49 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/05 03:03:49 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/05 03:03:49 INFO mapred.FileInputFormat: Total input paths to process : 16
13/08/05 03:03:49 INFO mapred.JobClient: Running job: job_201308050135_0003
13/08/05 03:03:50 INFO mapred.JobClient: map 0% reduce 0%
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -report
Configured Capacity: 9313566720 (8.67 GB)
Present Capacity: 5379882933 (5.01 GB)
DFS Remaining: 5378859008 (5.01 GB)
DFS Used: 1023925 (999.93 KB)
DFS Used%: 0.02%
Under replicated blocks: 2
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)
Name: 192.168.122.233:50010
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 424147 (414.21 KB)
Non DFS Used: 1321731885 (1.23 GB)
DFS Remaining: 1782366208 (1.66 GB)
DFS Used%: 0.01%
DFS Remaining%: 57.41%
Last contact: Mon Aug 05 03:04:25 CST 2013
Name: 192.168.122.73:50010    # the newly added node
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 195467 (190.89 KB)
Non DFS Used: 1290097781 (1.2 GB)
DFS Remaining: 1814228992 (1.69 GB)
DFS Used%: 0.01%
DFS Remaining%: 58.44%
Last contact: Mon Aug 05 03:04:24 CST 2013
Name: 192.168.122.190:50010
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 404311 (394.83 KB)
Non DFS Used: 1321854121 (1.23 GB)
DFS Remaining: 1782263808 (1.66 GB)
DFS Used%: 0.01%
DFS Remaining%: 57.41%
Last contact: Mon Aug 05 03:04:23 CST 2013
5) Removing a datanode from Hadoop online
[hadoop@desk11 conf]$ vim hdfs-site.xml    # dfs.hosts.exclude is an HDFS property read by the namenode, so it belongs in hdfs-site.xml
Add:
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/hadoop-1.0.4/conf/datanode-exclude</value>
</property>
Create the file /home/hadoop/hadoop-1.0.4/conf/datanode-exclude and list the hosts to remove, one per line:
[hadoop@desk11 hadoop-1.0.4]$ echo "server73" > \
/home/hadoop/hadoop-1.0.4/conf/datanode-exclude
Refresh the nodes online on the Master:
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -refreshNodes
This migrates the node's data away in the background.
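Progress can be followed with the same report command used earlier; while the data is being migrated the node's Decommission Status changes from Normal to "Decommission in progress" and finally to "Decommissioned":
bin/hadoop dfsadmin -report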
The web UI then shows that one of the datanodes has gone down (been decommissioned).
6) Removing a tasktracker from Hadoop online
On the Master, edit conf/mapred-site.xml:
<property>
<name>mapred.hosts.exclude</name>
<value>/home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude</value>
</property>
Create the file /home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude:
touch /home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude
vim /home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude
server73 (or 192.168.122.173)
Refresh the nodes:
[hadoop@desk11 bin]$ ./hadoop mradmin -refreshNodes
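The effect can be checked by asking the JobTracker for its active tasktrackers; after the refresh, server73 should no longer be listed (a sketch, run from the hadoop-1.0.4 directory on the Master):
bin/hadoop job -list-active-trackers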
Reposted from: https://blog.51cto.com/wangziyin/1303196