Hadoop Cluster Crash Recovery Notes
I. Cause of the Crash
This was a Hadoop test cluster, so the replication factor had been set to dfs.replication=1, which means the data on any failed DataNode is simply lost. Unfortunately, one machine did fail under heavy load and its data was corrupted. The whole Hadoop cluster then had to be restarted, and after the restart the NameNode refused to come up, reporting the following error:
FSNamesystem initialization failed saveLeases found path /tmp/xxx/aaa.txt but no matching entry in namespace.
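For context, the replication factor is the dfs.replication property in hdfs-site.xml. A minimal sketch of a safer setting for a multi-node cluster (3 is the stock default, not what this test cluster used):

<!-- hdfs-site.xml: copies kept per block; a value of 1 means any DataNode failure loses data -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>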
II. Repairing the NameNode
The Hadoop cluster had crashed and the NameNode would not start. The recovery went as follows:
1. Delete the NameNode's metadata directory (dfs.name.dir)
rm -fr /data/hadoop-tmp/hadoop-hadoop/dfs/name
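If there is any chance the old image is still worth inspecting, a more cautious variant of this step (same paths as above, just a suggestion) is to move the directory aside and recreate it empty, since the import step later expects an empty dfs.name.dir:

# keep the possibly corrupt metadata instead of destroying it
mv /data/hadoop-tmp/hadoop-hadoop/dfs/name /data/hadoop-tmp/hadoop-hadoop/dfs/name.bak
# recreate an empty name directory for the checkpoint import
mkdir -p /data/hadoop-tmp/hadoop-hadoop/dfs/name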
2. Start the SecondaryNameNode
Run start-all.sh to start the SecondaryNameNode; it does not matter that the NameNode still fails to come up.
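If running start-all.sh against a half-broken cluster feels too heavy-handed, the daemon can also be brought up on its own (assuming the stock 0.20/1.x control scripts):

bin/hadoop-daemon.sh start secondarynamenode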
3. Restore from the SecondaryNameNode
Run the command: hadoop namenode -importCheckpoint
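-importCheckpoint rebuilds the namespace image from the SecondaryNameNode's last checkpoint, which lives in fs.checkpoint.dir. A rough sketch of the sequence, assuming the default checkpoint location under this cluster's hadoop.tmp.dir (adjust the path to your own configuration):

# the checkpoint written by the SecondaryNameNode (location set by fs.checkpoint.dir)
ls /data/hadoop-tmp/hadoop-hadoop/dfs/namesecondary/current/
# load it and save a fresh image into the empty dfs.name.dir
hadoop namenode -importCheckpoint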
During the recovery it turned out that some data files were already damaged (because dfs.replication=1), so the NameNode could not leave safe mode and kept printing:
The ratio of reported blocks 0.8866 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
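The 0.9990 figure comes from dfs.safemode.threshold.pct (default 0.999f); with blocks permanently lost the reported ratio can never reach it, so the NameNode stays in safe mode until someone intervenes. For reference, this is where the threshold is configured (shown at its default, not a suggestion to lower it):

<!-- hdfs-site.xml: fraction of blocks that must report before safe mode ends automatically -->
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999f</value>
</property>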
4. Force the NameNode out of safe mode
hadoop dfsadmin -safemode leave
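The same dfsadmin tool can confirm the state before and after forcing the exit:

hadoop dfsadmin -safemode get    # prints "Safe mode is ON" or "Safe mode is OFF"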
The NameNode finally started, but the HDFS web UI showed this warning:
WARNING : There are about 257 missing blocks. Please check the log or run fsck.
5. Check which HDFS files are corrupted
The corrupted files can be listed with:
./hadoop fsck /
Output:
/user/hive/warehouse/pay_consume_orgi/dt=2011-06-28/consume_2011-06-28.sql: MISSING 1 blocks of total size 1250990 B..
/user/hive/warehouse/pay_consume_orgi/dt=2011-06-29/consume_2011-06-29.sql: CORRUPT block blk_977550919055291594
/user/hive/warehouse/pay_consume_orgi/dt=2011-06-29/consume_2011-06-29.sql: MISSING 1 blocks of total size 1307147 B..................
Status: CORRUPT
 Total size:    235982871209 B
 Total dirs:    1213
 Total files:   1422
 Total blocks (validated):  4550 (avg. block size 51864367 B)
  ********************************
  CORRUPT FILES:    277
  MISSING BLOCKS:   509
  MISSING SIZE:     21857003415 B
  CORRUPT BLOCKS:   509
  ********************************
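Before deleting anything, fsck can be scoped to a single path to get block-level detail on the affected files; -files, -blocks and -locations are standard fsck options (the warehouse path is taken from the output above):

./hadoop fsck /user/hive/warehouse/pay_consume_orgi -files -blocks -locations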
With no redundant replicas to fall back on, the only option was to delete the corrupted files:
./hadoop fsck / -delete
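fsck also offers -move, which relocates affected files into /lost+found on HDFS instead of deleting them, in case partially readable data is still worth keeping:

./hadoop fsck / -move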
III. Summary
Always run the SecondaryNameNode and the NameNode on two different machines. This improves the NameNode's fault tolerance and makes it possible to recover the metadata from the SecondaryNameNode when the cluster crashes.
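For a 0.20/1.x style deployment, a minimal sketch of putting the SecondaryNameNode on its own machine (hostnames here are placeholders): list that host in conf/masters so start-dfs.sh launches the daemon there, and point dfs.http.address in its hdfs-site.xml at the NameNode so it can fetch the image and edit log over HTTP.

# conf/masters on the node where start-dfs.sh runs lists SecondaryNameNode hosts
echo "snn-host.example.com" > conf/masters

<!-- hdfs-site.xml on the SecondaryNameNode host: where to fetch fsimage/edits from -->
<property>
  <name>dfs.http.address</name>
  <value>namenode-host.example.com:50070</value>
</property>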
Reposted from: https://www.cnblogs.com/JohnLiang/archive/2011/11/10/2244572.html