Hadoop hdfs副本儲存和糾刪碼(Erasure Coding)儲存優缺點
The advantages and disadvantages of hadoop hdfs replicating storage and erasure coding storage.
Hadoop 3.0.0-alpha1 及以上版本提供了糾刪碼(Erasure Coding)儲存資料的支援,使用者可以根據不同的場景和需求選擇副本儲存或EC儲存方案,兩種儲存方案各有優缺點和適用場景。
1 副本儲存
以檔案大小:2.5G 寫入HDFS為例;
$ ls –ltr
-rw-r--r-- 1 sywu sywu 2.5G Mar 4 14:15 data000
hdfs資料塊大小:512M,預設副本數:3;2.5G檔案被切分為5個數據塊(B1,B2,B3,B4,B5)儲存,單節點儲存完成,資料塊複製到其它節點,最終總的資料塊數:15個,總消耗儲存空間:7.5G。
1.1 優點
- 副本保證資料可用性;當某個節點資料塊丟失或破損,datanode發現後會自動從其它正常的節點中恢復資料塊;
- 副本提高了作業執行並行度,作業可以同時在副本節點執行。 ## 1.2 缺點
- 每個副本使用100%的儲存開銷,副本數越多,儲存開銷越大;
- 副本同步佔用大量的網路和IO資源。
2 糾刪碼(Erasure Coding)儲存
HDFS使用糾刪碼(Erasure Coding,以下簡稱EC)解決副本複製和副本儲存所帶來的空間和資源消耗問題,以EC代替副本,提供和副本儲存相同級別的容錯能力,並且儲存開銷不超過單副本儲存的50%。
2.1 糾刪碼(Erasure Coding)組成結構
- EC由資料(Data)和奇偶校驗碼(Parity)兩部分組成,資料儲存時通過EC演算法生成;生成的過程稱為編碼(encoding),恢復丟失資料塊的過程稱為解碼(decoding)。
- 與HDFS檔案基本構成單位:塊(Block) 不同,EC的構成單位:塊組(Block group)、塊(Block)、單元(cell),每個塊組存放與其它塊組一樣數量的資料塊和奇偶校驗碼塊;單元(cell)是EC內部最小的儲存結構,多個單元組成條(Striping),儲存在塊(Block)裡。
- EC寫入方案有:連續佈局(Contiguous Layout)和條形佈局(Striping Layout);連續佈局從 Hadoop 3.0.0-alpha1 版本開始提供支援,資料依次寫入塊中的單元(cell),一個塊寫滿之後再寫入下一個塊;條形佈局寫入方案目前還在開發階段
- 資料和奇偶校驗碼塊數量由EC策略(Erasure coding policies)決定,比如上圖的策略:RS-3-2-1024k,表示每個塊組由3個數據塊和2個校驗塊構成,每個單元(cell)大小為1024k,最大可丟失塊2個,丟失超過2個則無法恢復。
2.2 Erasure Coding演算法
目前支援的EC演算法有XOR和Reed-Solomon兩種;
2.2.1 XOR 演算法
exclusive OR,基於異或運算的演算法,從任意數量的單元(cell)中生成1個奇偶校驗塊。如果任何資料塊丟失,則通過對其餘位和奇偶校驗位進行來恢復。
由於只有1個奇偶校驗塊,因此它只能容忍1個數據塊故障,對於像HDFS這樣需要處理多個故障的系統來說,不適用。
2.2.2 Reed-Solomon 演算法
Reed-Solomon(簡稱:RS),該演算法有兩個引數k和m,記為 RS(k,m),RS演算法將k個數據單元(cell)與生成器矩陣(Generator Matrix)相乘得到具有k個數據單元和m個奇偶校驗碼的擴充套件碼字(extended codewords),最多可容忍m個數據塊丟失;如果資料塊丟失,則只要將 k+m 個單元格中的k個可用,通過將生成器矩陣的逆與擴充套件碼字(extended codewords)相乘來恢復資料塊。由於可以容忍多個數據塊丟失,更適用於生產環境。
2.3 使用EC RS演算法儲存資料
Hadoop 3.0.0-alpha1 版本及以上版本預設已經啟用EC,首先設定目錄策略使用RS演算法,以6個data塊和3個奇偶校驗碼構成一個block group;
hdfs ec -setPolicy -path /data/ec -policy RS-6-3-1024k
將2.5G資料寫入EC目錄下,資料最終儲存在一個block group下,共9個數據塊,總消耗儲存空間:2.5G。
檢查狀態;
/data/ec/split000 2684354560 bytes, erasure-coded: policy=RS-6-3-1024k, 1 block(s): OK
0. BP-1016852637-192.168.1.192-1597819912196:blk_-9223372036854775792_4907534 len=2684354560 Live_repl=9 [blk_-9223372036854775792:DatanodeInfoWithStorage[192.168.1.192:1019,DS-3b10bc66-c5c9-47f8-b4ea-99441fc5df04,DISK], blk_-9223372036854775791:DatanodeInfoWithStorage[192.168.1.196:1019,DS-c618bb83-03f2-4007-a3a9-cbd11eb2a15a,DISK], blk_-9223372036854775790:DatanodeInfoWithStorage[192.168.1.193:1019,DS-6554db41-d285-4f31-9fde-2017cc211b0c,DISK], blk_-9223372036854775789:DatanodeInfoWithStorage[192.168.1.188:1019,DS-09a7d2b0-6018-43af-8a63-6be8ba9217e6,DISK], blk_-9223372036854775788:DatanodeInfoWithStorage[192.168.1.199:1019,DS-b24cf7ce-1235-478f-884d-19cde4a03c9e,DISK], blk_-9223372036854775787:DatanodeInfoWithStorage[192.168.1.187:1019,DS-43b686d2-e83c-49a3-b8e5-64c4e5edcb53,DISK], blk_-9223372036854775786:DatanodeInfoWithStorage[192.168.1.195:1019,DS-eb42e686-1ca9-44c3-9891-868c67d9d1fa,DISK], blk_-9223372036854775785:DatanodeInfoWithStorage[192.168.1.194:1019,DS-15635fea-314c-4703-a7ab-6f81db1e52cd,DISK], blk_-9223372036854775784:DatanodeInfoWithStorage[192.168.1.198:1019,DS-73bc78d2-e218-40a9-ae9c-d24706a0bc31,DISK]]
Status: HEALTHY
Number of data-nodes: 10
Number of racks: 1
Total dirs: 0
Total symlinks: 0
Replicated Blocks:
Total size: 0 B
Total files: 0
Total blocks (validated): 0
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0
Erasure Coded Block Groups:
Total size: 2684354560 B
Total files: 1
Total block groups (validated): 1 (avg. block group size 2684354560 B)
Minimally erasure-coded block groups: 1 (100.0 %)
Over-erasure-coded block groups: 0 (0.0 %)
Under-erasure-coded block groups: 0 (0.0 %)
Unsatisfactory placement block groups: 0 (0.0 %)
Average block group size: 9.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0 (0.0 %)
嘗試刪除(blk-9223372036854775792、blk-9223372036854775791、blk_-9223372036854775790)三個資料塊(最多允許3個數據塊丟失)。
rm /data/current/BP-1016852637-192.168.1.192-1597819912196/current/finalized/subdir0/subdir0/blk_-9223372036854775792
rm /data/current/BP-1016852637-192.168.1.192-1597819912196/current/finalized/subdir0/subdir0/blk_-9223372036854775791
rm /data/current/BP-1016852637-192.168.1.192-1597819912196/current/finalized/subdir0/subdir0/blk_-9223372036854775790
現在嘗試讀取檔案;
hdfs dfs -get /data/ec/split000 .
datanode首先拋未找到資料塊異常;
21/03/05 16:55:39 WARN impl.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error, status=ERROR, status message opReadBlock BP-1016852637-192.168.1.192-1597819912196:blk_-9223372036854775792_4907534 received exception java.io.FileNotFoundException: BlockId -9223372036854775792 is not valid., for OP_READ_BLOCK, self=/192.168.1.192:42168, remote=/192.168.1.192:1019, for file /data/ec/split000, for pool BP-1016852637-192.168.1.192-1597819912196 block -9223372036854775792_4907534
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:110)
at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.checkSuccess(BlockReaderRemote.java:447)
at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:415)
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:860)
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:756)
at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:390)
at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:657)
at org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:256)
at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:293)
at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:323)
at org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:318)
at org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:391)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:828)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:94)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:497)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:419)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:354)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:289)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:274)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:303)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:269)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:285)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:269)
at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:240)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
at org.apache.hadoop.fs.shell.Command.run(Command.java:176)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
然後自動通過RS演算法恢復資料;
21/03/05 16:55:39 hdfs.DFSClient: refreshLocatedBlock for striped blocks, offset=0. Obtained block LocatedStripedBlock{BP-1016852637-192.168.1.192-1597819912196:blk_-9223372036854775792_4907534; getBlockSize()=2684354560; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.1.192:1019,DS-3b10bc66-c5c9-47f8-b4ea-99441fc5df04,DISK], DatanodeInfoWithStorage[192.168.1.196:1019,DS-c618bb83-03f2-4007-a3a9-cbd11eb2a15a,DISK], DatanodeInfoWithStorage[192.168.1.193:1019,DS-6554db41-d285-4f31-9fde-2017cc211b0c,DISK], DatanodeInfoWithStorage[192.168.1.188:1019,DS-09a7d2b0-6018-43af-8a63-6be8ba9217e6,DISK], DatanodeInfoWithStorage[192.168.1.199:1019,DS-b24cf7ce-1235-478f-884d-19cde4a03c9e,DISK], DatanodeInfoWithStorage[192.168.1.187:1019,DS-43b686d2-e83c-49a3-b8e5-64c4e5edcb53,DISK], DatanodeInfoWithStorage[192.168.1.195:1019,DS-eb42e686-1ca9-44c3-9891-868c67d9d1fa,DISK], DatanodeInfoWithStorage[192.168.1.194:1019,DS-15635fea-314c-4703-a7ab-6f81db1e52cd,DISK], DatanodeInfoWithStorage[192.168.1.198:1019,DS-73bc78d2-e218-40a9-ae9c-d24706a0bc31,DISK]]; indices=[0, 1, 2, 3, 4, 5, 6, 7, 8]}, idx=0
21/03/05 16:55:39 WARN hdfs.DFSClient: [DatanodeInfoWithStorage[192.168.1.192:1019,DS-3b10bc66-c5c9-47f8-b4ea-99441fc5df04,DISK]] are unavailable and all striping blocks on them are lost. IgnoredNodes = null
21/03/05 16:55:39 hdfs.DFSClient: refreshLocatedBlock for striped blocks, offset=0. Obtained block LocatedStripedBlock{BP-1016852637-192.168.1.192-1597819912196:blk_-9223372036854775792_4907534; getBlockSize()=2684354560; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.1.192:1019,DS-3b10bc66-c5c9-47f8-b4ea-99441fc5df04,DISK], DatanodeInfoWithStorage[192.168.1.196:1019,DS-c618bb83-03f2-4007-a3a9-cbd11eb2a15a,DISK], DatanodeInfoWithStorage[192.168.1.193:1019,DS-6554db41-d285-4f31-9fde-2017cc211b0c,DISK], DatanodeInfoWithStorage[192.168.1.188:1019,DS-09a7d2b0-6018-43af-8a63-6be8ba9217e6,DISK], DatanodeInfoWithStorage[192.168.1.199:1019,DS-b24cf7ce-1235-478f-884d-19cde4a03c9e,DISK], DatanodeInfoWithStorage[192.168.1.187:1019,DS-43b686d2-e83c-49a3-b8e5-64c4e5edcb53,DISK], DatanodeInfoWithStorage[192.168.1.195:1019,DS-eb42e686-1ca9-44c3-9891-868c67d9d1fa,DISK], DatanodeInfoWithStorage[192.168.1.194:1019,DS-15635fea-314c-4703-a7ab-6f81db1e52cd,DISK], DatanodeInfoWithStorage[192.168.1.198:1019,DS-73bc78d2-e218-40a9-ae9c-d24706a0bc31,DISK]]; indices=[0, 1, 2, 3, 4, 5, 6, 7, 8]}, idx=1
21/03/05 16:55:39 hdfs.DFSClient: refreshLocatedBlock for striped blocks, offset=0. Obtained block LocatedStripedBlock{BP-1016852637-192.168.1.192-1597819912196:blk_-9223372036854775792_4907534; getBlockSize()=2684354560; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.1.192:1019,DS-3b10bc66-c5c9-47f8-b4ea-99441fc5df04,DISK], DatanodeInfoWithStorage[192.168.1.196:1019,DS-c618bb83-03f2-4007-a3a9-cbd11eb2a15a,DISK], DatanodeInfoWithStorage[192.168.1.193:1019,DS-6554db41-d285-4f31-9fde-2017cc211b0c,DISK], DatanodeInfoWithStorage[192.168.1.188:1019,DS-09a7d2b0-6018-43af-8a63-6be8ba9217e6,DISK], DatanodeInfoWithStorage[192.168.1.199:1019,DS-b24cf7ce-1235-478f-884d-19cde4a03c9e,DISK], DatanodeInfoWithStorage[192.168.1.187:1019,DS-43b686d2-e83c-49a3-b8e5-64c4e5edcb53,DISK], DatanodeInfoWithStorage[192.168.1.195:1019,DS-eb42e686-1ca9-44c3-9891-868c67d9d1fa,DISK], DatanodeInfoWithStorage[192.168.1.194:1019,DS-15635fea-314c-4703-a7ab-6f81db1e52cd,DISK], DatanodeInfoWithStorage[192.168.1.198:1019,DS-73bc78d2-e218-40a9-ae9c-d24706a0bc31,DISK]]; indices=[0, 1, 2, 3, 4, 5, 6, 7, 8]}, idx=6
最終3個丟失的資料塊被恢復,檔案正常讀取。
2.4 優點
- 相比副本儲存方式大大降低了儲存資源和IO資源的使用;
- 通過XOR和RS演算法保證資料安全,有效解決允許範圍內資料塊破碎和丟失導致的異常;
2.5 缺點
- 恢復資料時需要去讀其它資料塊和奇偶校驗碼塊資料,需要消耗IO和網路資源;
- EC演算法編碼和解密計算需要消耗CPU資源;
- 儲存大資料,並且資料塊較集中的節點執行作業負載會較高;
- EC檔案不支援hflush, hsync, concat, setReplication, truncate, append 等操作。
3 副本儲存和EC儲存讀寫對比
寫檔案;
檔案大小 | 三副本儲存(秒) | EC儲存(秒) |
---|---|---|
2.5G | 6.492 | 9.093 |
5G | 29.778 | 17.245 |
12G | 65.156 | 30.989 |
讀檔案;
檔案大小 | 三副本儲存(秒) | EC儲存(秒) |
---|---|---|
2.5G | 4.915 | 4.755 |
5G | 7.737 | 7.119 |
12G | 14.493 | 13.775 |
4 總結
綜合副本儲存和EC儲存優缺點,EC儲存更適合用於儲存備份資料和使用頻率較少的非熱點資料,副本儲存更適用於儲存需要追加寫入和經常分析的熱點資料。
參考文獻
- https://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-common/release/3.0.0-alpha1/RELEASENOTES.3.0.0-alpha1.html - Apache Hadoop 3.0.0-alpha1 Release Notes
- https://issues.apache.org/jira/browse/HDFS-7285 - Erasure Coding Support inside HDFS
- https://issues.apache.org/jira/browse/HDFS-8030 - HDFS Erasure Coding Phase II -- EC with contiguous layout
- https://www.sciencedirect.com/topics/engineering/reed-solomon-code - Reed-Solomon Code