
MFS distributed file system

12. Official introduction to MooseFS 1.6.x (translated by Cuatre, a member of the QQ group)


View on new features of next release v1.6 of Moose File System
We are about to release a new version of MooseFS which includes a large number of new features and bug fixes. The new features are so significant that we decided to release it as version 1.6. The newest beta files are in the GIT repository.
The key new features/changes of MooseFS 1.6 include:
General: 
  • Removed duplicate source files.
  • Strip whitespace at the end of configuration file lines.
Chunkserver:
  • Rewritten in a multi-threaded model.
  • Added periodical chunk testing functionality (HDD_TEST_FREQ option; a config sketch follows this list).
  • New -v option (prints version and exits).
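As an illustration of the new chunk testing option, here is a minimal sketch of the corresponding chunkserver configuration line (assuming the option lives in mfschunkserver.cfg and that the value is the test interval in seconds; check the shipped example configuration for the authoritative syntax and default):

    # mfschunkserver.cfg (sketch): how often stored chunks are verified in the background
    HDD_TEST_FREQ = 10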

Master: 
  • Added "noowner" objects flag (causes objects to belong to the current user).
  • `mfsdirinfo` data is now maintained online, so it does not have to be recalculated on every request.
  • Filesystem access authorization system (NFS-like mfsexports.cfg file, REJECT_OLD_CLIENTS option) with ro/rw, maproot, mapall and password functionality (a sample entry follows this list).
  • New -v option (prints version and exits).
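For illustration, a minimal sketch of what mfsexports.cfg entries might look like; the addresses, paths and password below are made-up examples, and the exact option names should be verified against the mfsexports.cfg manual:

    # mfsexports.cfg (sketch): <client address> <exported directory> <options>
    192.168.1.0/24   /        rw,alldirs,maproot=0,password=secret
    *                /public  ro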

Mount: 
  • Options parsing rewritten in a mount-like way, making it possible to use standard FUSE mount utilities (see the mfsmount manual for the new syntax; a usage sketch follows this list). Note: the old syntax is no longer accepted and a mountpoint is now mandatory (there is no default).
  • Updated for FUSE 2.6+.
  • Added password, file data cache, attribute cache and entry cache options. By default the attribute cache and directory entry cache are enabled, while the file data cache and file entry cache are disabled.
  • opendir() no longer reads directory contents; this is now done on the first readdir(), which fixes "rm -r" on recent Linux/glibc/coreutils combinations.
  • Fixed mtime setting just before close() (by flushing the file on mtime change); fixes mtime preservation with "cp -p".
  • Added statistics accessible through the MFSROOT/.stats pseudo-file.
  • Changed the master access method for mfstools (the direct .master pseudo-file was replaced by .masterinfo redirection); this fixes a possible mfstools race condition and allows mfstools to be used on a read-only filesystem.
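A usage sketch of the new mount-like syntax (the master host name, mount point and -o option names here are assumptions; consult the mfsmount manual for the options actually supported by your build):

    # The mount point is now mandatory; -H selects the master host
    mfsmount /mnt/mfs -H mfsmaster -o mfspassword=secret
    # Read the per-mount statistics exposed by the new pseudo-file
    cat /mnt/mfs/.stats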

Tools: 
  • Units cleanup in value display (exact values, IEC-60027/binary prefixes, SI/decimal prefixes); new options: -n, -h, -H and the MFSHRFORMAT environment variable (refer to the mfstools manual for details; usage sketches follow this list).
  • mfsrgetgoal, mfsrsetgoal, mfsrgettrashtime and mfsrsettrashtime have been deprecated in favour of a new "-r" option for the mfsgetgoal, mfssetgoal, mfsgettrashtime and mfssettrashtime tools (note that the old and new command names differ only slightly).
  • The mfssnapshot utility has been replaced by mfsappendchunks (a direct descendant of the old utility) and mfsmakesnapshot (which creates "real" recursive snapshots and behaves similarly to "cp -r").
  • New mfsfilerepair utility, which allows partial recovery of a file with some missing or broken chunks.
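A few hedged usage sketches of the reworked tools (the paths are made up and option spellings should be checked against the mfstools manual):

    # Recursive get/set via the new -r switch (replaces the old mfsr* tools)
    mfsgetgoal -r /mnt/mfs/projects
    mfssetgoal -r 2 /mnt/mfs/projects
    # Human-readable units via the new -h / -H switches
    mfsdirinfo -H /mnt/mfs/projects
    # Try to salvage a file that has missing or broken chunks
    mfsfilerepair /mnt/mfs/projects/damaged.bin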
CGI scripts: 
  • First public version of CGI scripts allowing an MFS installation to be monitored from a WWW browser.


13. Official MooseFS FAQ
What average write/read speeds can we expect?
The raw reading/writing speed obviously depends mainly on the performance of the hard disk drives used and on the network capacity and topology, and it varies from installation to installation. The better the performance of the hard drives used and the better the throughput of the network, the higher the performance of the whole system.

On our in-house commodity servers (which additionally perform lots of extra calculations) and a simple gigabit Ethernet network, on a petabyte-class installation
on Linux (Debian) with goal=2, we have write speeds of about 20-30 MiB/s and reads of 30-50 MiB/s. For smaller blocks the write speed decreases, but reading is not much affected.


A similar FreeBSD-based network has somewhat better writes and worse reads, giving slightly better overall performance.

Does the goal setting influence writing/reading speeds?

Generally speaking, it doesn't. The goal setting can influence the reading speed only under certain conditions. For example, reading the same file at the same time by more than one client would be faster when the file has goal set to 2 rather than goal=1.


But the real-world situation in which several computers read the same file at the same moment is very rare; therefore, the goal setting has rather little influence on reading speeds.

Similarly, the writing speed is not much affected by the goal setting.


How well are concurrent read operations supported?

All read processes are parallel - there is no problem with concurrent reading of the same data by several clients at the same moment.

How much CPU and RAM is used?

In our environment (ca. 500 TiB, 25 million files, 2 million folders distributed over 26 million chunks on 70 machines) chunkserver CPU usage (under constant file transfer) is about 15-20%, and chunkserver RAM usage is usually about 100 MiB (independent of the amount of data).
The master server consumes about 30% of a CPU (ca. 1500 operations per second) and 8 GiB of RAM. CPU load depends on the number of operations, and RAM usage on the number of files and folders.

Is it possible to add/remove chunkservers and disks on the fly?

You can add / remove chunkservers on the fly. But mind that it is not wise to disconnect a chunkserver if there exists a chunk with only one copy (marked in orange in the CGI monitor). 
You can also disconnect (change) an individual hard drive. The scenario for this operation would be:


  1. Mark the disk(s) for removal (see the sketch after this list)
  2. Restart the chunkserver process
  3. Wait for the replication (there should be no "undergoal" or "missing" chunks marked in yellow, orange or red in the CGI monitor)
  4. Stop the chunkserver process
  5. Delete the entry(ies) of the disconnected disk(s) in 'mfshdd.cfg'
  6. Stop the chunkserver machine
  7. Remove the hard drive(s)
  8. Start the machine
  9. Start the chunkserver process
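Step 1 ("mark the disk(s) for removal") is typically done in the chunkserver's mfshdd.cfg; the sketch below assumes the convention of prefixing a path with an asterisk to mark it for removal (verify against the mfshdd.cfg manual for your version):

    # mfshdd.cfg on the chunkserver (sketch)
    /mnt/hdd1
    /mnt/hdd2
    */mnt/hdd3    # '*' marks this disk for removal; its chunks get replicated elsewhere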

If you have hot-swappable disk(s), after step 5 you should follow these steps instead:
  • Unmount disk(s)
  • Remove hard drive(s)
  • Start the chunkserver process

If you follow the above steps, the work of client computers will not be interrupted and the whole operation will go unnoticed by MooseFS users.

My experience with clustered filesystems is that metadata operations are quite slow. How did you resolve this problem?

We have noticed the problem of slow metadata operations and decided to cache the file system structure in RAM on the metadata server. This is why the metadata server has increased memory requirements.


When doing df -h on the filesystem, the results differ from what I would expect given the actual sizes of the written files.

Every chunkserver sends its own disk usage increased by 256 MB for each used partition/HDD, and the master sends the sum of these to the client as the total disk usage. If you have 3 chunkservers with 7 HDDs each, your reported disk usage will be increased by 3*7*256 MB (about 5 GB). Of course this is not important in real life, when you have for example 150 TB of HDD space.

There is one other thing: if you use the disks exclusively for MooseFS on the chunkservers, df will show the correct disk usage, but if you have other data on your MooseFS disks, df will count those files too.

If you want to see the usage of your MooseFS files, use the 'mfsdirinfo' command, as in the example below.
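For example (the mount point /mnt/mfs is an assumption):

    df -h /mnt/mfs          # raw cluster space, including the 256 MB-per-disk overhead
    mfsdirinfo /mnt/mfs     # space actually used by the MooseFS file tree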


Do chunkservers and metadata server do their own checksumming?

Yes, checksumming is done by the system itself. We thought it would be CPU-consuming, but it is not really. The overhead is about 4 B per 64 KiB block, which is 4 KiB per 64 MiB chunk (per goal): a chunk holds 64 MiB / 64 KiB = 1024 blocks, and 1024 × 4 B = 4 KiB.

What sort of sizing is required for the master server?
The most important factor is the RAM of the mfsmaster machine, as the full file system structure is cached in RAM for speed. Besides RAM, the mfsmaster machine needs some space on HDD for the main metadata file together with the incremental logs.

The size of the metadata file depends on the number of files (not on their sizes). The size of the incremental logs depends on the number of operations per hour, but the length (in hours) of the incremental log is configurable.

One million files takes approximately 300 MiB of RAM. An installation of 25 million files requires about 8 GiB of RAM and 25 GiB of space on HDD.


When I delete files or directories the MooseFS size doesn’t change. Why?

MooseFS does not erase files immediately, so that you can revert a delete operation.

You can configure for how long files are kept in the trash and empty the trash manually (to release the space). There are more details here:
http://moosefs.com/pages/userguides.html#2 in the section "Operations specific for MooseFS".

In short - the time of storing a deleted file can be verified by the 
mfsgettrashtime command and changed with mfssettrashtime.
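For example (the path and values are made up; trash times are given in seconds):

    mfsgettrashtime /mnt/mfs/some/file
    mfssettrashtime 604800 /mnt/mfs/some/file     # keep deleted copies for 7 days
    mfssettrashtime -r 86400 /mnt/mfs/some/dir    # apply 1 day recursively to a directory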


When I added a third server as an extra chunkserver it looked like it started replicating data to the 3rd server even though the file goal was still set to 2. 

Yes. The disk usage balancer places chunks independently, so one file can be redistributed across all of your chunkservers.

Is MooseFS 64bit compatible?

Yes!

Can I modify the chunk size?

File data is divided into fragments (chunks) of at most 64 MiB each. The value of 64 MiB is hard-coded into the system, so you cannot modify its size. We based the chunk size on real-world data and it is a very good compromise between the number of chunks and the speed of rebalancing/updating the filesystem. Of course, if a file is smaller than 64 MiB it occupies less space.

Please note that the systems we take care of handle files well exceeding 100 GB in size, with no noticeable chunk size penalty.

How do I know if a file has been successfully written in MooseFS?

First off, let's briefly discuss the way the writing process is done in file systems and what programming consequences this bears. Basically, files are written through a buffer (write cache) in all contemporary file systems. As a result, execution of the "write" command itself only transfers the data to a buffer (cache), with no actual writing taking place. Hence, a confirmed execution of the "write" command does not mean that the data has been correctly written on a disc. It is only with the correct performance of the "fsync" (or "close") command that all data kept in buffers (cache) gets physically written. If an error occurs while such buffer-kept data is being written, it could return an incorrect status for the "fsync" (or even "close", not only "write") command.
The problem is that a vast majority of programmers do not test the "close" command status (which is generally a mistake, though a very common one). Consequently, a program writing data on a disc may "assume" that the data has been written correctly, while it has actually failed. 
As far as MooseFS is concerned – first, its write buffers are larger than in classic file systems (an issue of efficiency); second, write errors may be more frequent than in case of a classic hard drive (the network nature of MooseFS provokes some additional error-inducing situations). As a consequence, the amount of data processed during execution of the "close" command is often significant and if an error occurs while the data is being written, this will be returned in no other way than as an error in execution of the "close" command only. 
Hence, before executing "close", it is recommended (especially when using MooseFS) to perform "fsync" after writing in a file and then check the status of "fsync" and – just in case – the status of "close" as well. 
NOTE! When "stdio" is used, the "fflush" function only executes the "write" command, so correct execution of "fflush" is not enough grounds to be sure that all data has been written successfully – you should also check the status of "fclose".
One frequent situation in which the above problem may occur is redirecting a standard output of a program to a file in "shell". Bash (and many other programs) does not check the status of "close" execution and so the syntax of the "application > outcome.txt" type may wrap up successfully in "shell", while in fact there has been an error in writing the "outcome.txt" file. You are strongly advised to avoid using the above syntax. If necessary, you can create a simple program reading the standard input and writing everything to a chosen file (but with an appropriate check with the "fsync" command) and then use "application | mysaver outcome.txt", where "mysaver" is the name of your writing program instead of "application > outcome.txt".
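As a rough shell-level approximation of the "mysaver" idea described above (this swaps the suggested custom program for GNU dd's conv=fsync, which physically syncs the output file and reports failures through its exit status; GNU coreutils is assumed):

    # Instead of:  application > outcome.txt
    application | dd of=outcome.txt conv=fsync || echo "writing outcome.txt failed" >&2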
Please note that the problem discussed above is in no way exceptional and does not stem directly from the characteristics of MooseFS itself. It may affect any system of files – only that network type systems are more prone to such difficulties. Technically speaking, the above recommendations should be followed at all times (also in case of classic file systems).