
NameNode Crash Data Recovery Test

2012-09-26

By 周海漢 (Zhou Haihan), http://abloz.com, 2012.9.9

Preface: This is a data recovery test using the secondary namenode. Because DataNodes keep 2-3 replicas of each block, the cluster can recover automatically and retrieve all data even if one machine is damaged. In Hadoop 1.0.3 and the earlier 0.20-line releases, however, the NameNode is a single point of failure: if the NameNode is damaged, the data of the entire filesystem is effectively lost. That makes the secondary namenode especially important. This article walks through a hands-on NameNode-failure recovery exercise, covering configuration, deployment, crashing the NameNode, corrupting the NameNode data, and restoring the NameNode metadata.

The Hadoop version is 1.0.3, and three machines take part in the test.
Machine roles:
Hadoop48: NameNode
Hadoop47: Secondary NameNode, DataNode
Hadoop46: DataNode

1. Edit core-site.xml and add the checkpoint-related settings.
fs.checkpoint.dir is the directory where the recovery (checkpoint) files are stored.
fs.checkpoint.period is the checkpoint interval; the default is 3600 seconds (1 hour). For this test it is set to 20 seconds.
fs.checkpoint.size triggers a checkpoint when the edits log exceeds this many bytes, even if the checkpoint period has not yet elapsed.

[zhouhh@Hadoop48 conf]$ vi core-site.xml

<property>
  <name>hadoop.mydata.dir</name>
  <value>/data/zhouhh/myhadoop</value>
  <description>A base for other directories.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>${hadoop.mydata.dir}/dfs/namesecondary</value>
  <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
</property>
<property>
  <name>fs.checkpoint.edits.dir</name>
  <value>${fs.checkpoint.dir}</value>
  <description>Determines where on the local filesystem the DFS secondary name node should store the temporary edits to merge. If this is a comma-delimited list of directories then the edits are replicated in all of the directories for redundancy. Default value is the same as fs.checkpoint.dir.</description>
</property>
<property>
  <name>fs.checkpoint.period</name>
  <value>20</value>
  <description>The number of seconds between two periodic checkpoints. Default is 3600 seconds.</description>
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>
  <description>The size of the current edit log (in bytes) that triggers a periodic checkpoint even if the fs.checkpoint.period hasn't expired.</description>
</property>
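With fs.checkpoint.period set to 20 seconds, the secondary namenode should write a fresh checkpoint roughly every 20 seconds once the cluster is up. A minimal way to verify this, assuming the paths configured above, is to watch the checkpoint directory on Hadoop47 and confirm that the fsimage timestamp keeps advancing:

$ # on Hadoop47 (secondary namenode); path follows fs.checkpoint.dir above
$ ls -l /data/zhouhh/myhadoop/dfs/namesecondary/current/
$ # or re-list it once per checkpoint period
$ watch -n 20 'ls -l /data/zhouhh/myhadoop/dfs/namesecondary/current/fsimage'

The secondary namenode's log file under logs/ also records each checkpoint it completes.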

2. Run the secondary namenode on a different machine.
Set the masters file, which specifies the machine on which the secondary namenode is started.
[zhouhh@Hadoop48 conf]$ cat masters
Hadoop47

Edit dfs.secondary.http.address so that the secondary namenode's HTTP web UI hostname or IP points at Hadoop47, a machine different from the NameNode (Hadoop48), rather than the default 0.0.0.0.

[zhouhh@Hadoop48 conf]$ vi hdfs-site.xml

<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.mydata.dir}/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. Default value: ${hadoop.tmp.dir}/dfs/name</description>
</property>
<property>
  <name>dfs.secondary.http.address</name>
  <value>Hadoop47:55090</value>
  <description>The secondary namenode http server address and port. If the port is 0 then the server will start on a free port.</description>
</property>
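Once the cluster is started (step 4 below), a quick sanity check, assuming the address configured above, is that the secondary namenode's embedded HTTP server is actually listening on Hadoop47:55090:

$ # from any node; any HTTP status code (rather than a connection error) means the server is up
$ curl -s -o /dev/null -w '%{http_code}\n' http://Hadoop47:55090/

What the root path serves varies by version, so the status code alone is the useful signal here.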

3. For the test, if the directory specified for the name node has not been initialized, format it first:
[zhouhh@Hadoop48 logs]$ hadoop namenode -format

4. Sync the conf directory to Hadoop47/46 (omitted), then start Hadoop:
[zhouhh@Hadoop48 conf]$ start-all.sh

[zhouhh@Hadoop48 conf]$ jps
9633 Bootstrap
10746 JobTracker
10572 NameNode
10840 Jps

[zhouhh@Hadoop47 ~]$ jps
23157 DataNode
23362 TaskTracker
23460 Jps
23250 SecondaryNameNode

Error reported in the NameNode log:
2012-09-25 19:27:54,816 ERROR security.UserGroupInformation - PriviledgedActionException as:zhouhh cause:org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /data/zhouhh/myhadoop/mapred/system. Name node is in safe mode.
There is no need to panic: the NameNode automatically leaves safe mode during startup and then comes up normally. If you do not want to wait, you can run:

bin/hadoop dfsadmin -safemode leave
to force it out of safe mode. At startup the NameNode loads the filesystem state from the fsimage and edits log files, then waits for each DataNode to report its block state, so that it does not start replicating blocks prematurely even when enough replicas already exist. During this phase the NameNode is in safe mode, which is essentially a read-only mode for the HDFS cluster: no modifications to the filesystem or to blocks are allowed. Normally the NameNode exits safe mode automatically at the end of startup. If needed, you can also put HDFS into safe mode explicitly with the 'bin/hadoop dfsadmin -safemode' command. The NameNode front page shows whether safe mode is currently on.
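For scripted use, the same dfsadmin tool can also report or block on safe mode instead of forcing an exit; these are standard Hadoop 1.x subcommands:

$ hadoop dfsadmin -safemode get    # prints whether safe mode is ON or OFF
$ hadoop dfsadmin -safemode wait   # blocks until the NameNode leaves safe mode by itself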

5. Create and upload a test file.
[zhouhh@Hadoop48 hadoop-1.0.3]$ hadoop fs -put README.txt /user/zhouhh/README.txt
[zhouhh@Hadoop48 hadoop-1.0.3]$ hadoop fs -ls .
Found 1 items
-rw-r--r-- 2 zhouhh supergroup 1381 2012-09-26 14:03 /user/zhouhh/README.txt
[zhouhh@Hadoop48 hadoop-1.0.3]$ cat test中文.txt
這是測試檔案 test001 by zhouhh http://abloz.com 2012.9.26

6. Put the file into HDFS.
[zhouhh@Hadoop48 hadoop-1.0.3]$ hadoop fs -put test中文.txt .

[zhouhh@Hadoop48 hadoop-1.0.3]$ hadoop fs -ls .
Found 2 items
-rw-r--r-- 2 zhouhh supergroup 1381 2012-09-26 14:03 /user/zhouhh/README.txt
-rw-r--r-- 2 zhouhh supergroup 65 2012-09-26 14:10 /user/zhouhh/test中文.txt
[zhouhh@Hadoop48 ~]$ hadoop fs -cat test中文.txt
這是測試檔案 test001 by zhouhh http://abloz.com 2012.9.26

7. Kill the NameNode to simulate a crash.
[zhouhh@Hadoop48 ~]$ jps
9633 Bootstrap
23006 Jps
19691 NameNode
19867 JobTracker
[zhouhh@Hadoop48 ~]$ kill -9 19691
[zhouhh@Hadoop48 ~]$ jps
9633 Bootstrap
23019 Jps
19867 JobTracker

[zhouhh@Hadoop47 hadoop-1.0.3]$ jps
1716 DataNode
3825 Jps
1935 TaskTracker
1824 SecondaryNameNode

8. Empty the contents of dfs.name.dir to simulate a damaged disk.
[zhouhh@Hadoop48 ~]$ cd /data/zhouhh/myhadoop/dfs/name/
[zhouhh@Hadoop48 name]$ ls
current image in_use.lock previous.checkpoint
[zhouhh@Hadoop48 name]$ cd ..
For the test, the directory is simply renamed:
[zhouhh@Hadoop48 dfs]$ mv name name1
With the name directory gone, the namenode would fail to start.

9. Data recovery: copy the data from the secondary namenode.

Inspect the secondary namenode's files, pack them up, and copy the archive to the NameNode's fs.checkpoint.dir:
[zhouhh@Hadoop47 hadoop-1.0.3]$ cd /data/zhouhh/myhadoop/dfs/
[zhouhh@Hadoop47 dfs]$ ls
data namesecondary
[zhouhh@Hadoop47 dfs]$ cd namesecondary/
[zhouhh@Hadoop47 namesecondary]$ ls
current image in_use.lock
[zhouhh@Hadoop47 namesecondary]$ cd ..
[zhouhh@Hadoop47 dfs]$ tar czf sec.tar.gz namesecondary/
[zhouhh@Hadoop47 dfs]$ scp sec.tar.gz Hadoop48:/data/zhouhh/myhadoop/dfs/
sec.tar.gz

[zhouhh@Hadoop48 dfs]$ ls
name1 sec.tar.gz
[zhouhh@Hadoop48 dfs]$ tar zxvf sec.tar.gz
namesecondary/
namesecondary/current/
namesecondary/current/VERSION
namesecondary/current/fsimage
namesecondary/current/edits
namesecondary/current/fstime
namesecondary/image/
namesecondary/image/fsimage
namesecondary/in_use.lock
[zhouhh@Hadoop48 dfs]$ ls
name1 namesecondary sec.tar.gz

If the name directory configured in dfs.name.dir does not exist, create it. (In this test it was renamed away; alternatively you could cd into the existing name directory and run rm -f *.)
[zhouhh@Hadoop48 dfs]$ mkdir name

[zhouhh@Hadoop48 dfs]$ hadoop namenode -importCheckpoint
The name directory is now populated; press Ctrl+C to stop the foreground namenode process.
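Before restarting the cluster, it is worth confirming that the import really did populate dfs.name.dir; with the paths used in this test, something like:

$ ls -l /data/zhouhh/myhadoop/dfs/name/current/

should now list VERSION, fsimage, edits and fstime, i.e. the metadata rebuilt from the secondary's checkpoint.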

10. Recovery succeeded; check that the data is intact.
[zhouhh@Hadoop48 dfs]$ start-all.sh
[zhouhh@Hadoop48 dfs]$ jps
23940 Jps
9633 Bootstrap
19867 JobTracker
23791 NameNode
[zhouhh@Hadoop48 dfs]$ hadoop fs -ls .
Found 2 items
-rw-r--r-- 2 zhouhh supergroup 1381 2012-09-26 14:03 /user/zhouhh/README.txt
-rw-r--r-- 2 zhouhh supergroup 65 2012-09-26 14:10 /user/zhouhh/test中文.txt
[zhouhh@Hadoop48 dfs]$ hadoop fs -cat test中文.txt
這是測試檔案 test001 by zhouhh http://abloz.com 2012.9.26

[zhouhh@Hadoop48 dfs]$ hadoop fsck /user/zhouhh
FSCK started by zhouhh from /192.168.10.48 for path /user/zhouhh at Wed Sep 26 14:42:31 CST 2012
..Status: HEALTHY
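For a more detailed check that every block of the recovered files still has its expected replicas, fsck can also list files, blocks and locations (standard fsck options, same path as above):

$ hadoop fsck /user/zhouhh -files -blocks -locations

Since the files were stored with a replication factor of 2 (see the fs -ls output above), each block should report two live replicas spread across the datanodes Hadoop46 and Hadoop47.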

Recovery succeeded.
