Building a Lustre Parallel File System

阿新 • Published: 2019-01-14

Author: esxu

2015/08/19

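Project Background

The goal of this project is to build a storage system that delivers high performance, supports highly concurrent reads and writes, and provides file sharing. Lustre is widely used in HPC, and it was chosen as the software layer after a number of other file systems had been evaluated. Because Lustre itself has no data protection mechanism, it must be built on stable disk arrays, relying on hardware reliability to keep the data safe.

At the hardware level, data safety is ensured in two ways:

RAID groups
Node redundancy

Node redundancy covers both the MDS and the OSS nodes: the two MDS nodes run in active/standby mode, while the two OSS nodes are both active. This keeps data access essentially uninterrupted.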

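Software Installation

This deployment uses the es-hpc-2.1.2-Cent-r41-x86_64-DVD.iso image packaged by DDN, which is installed directly as the operating system ISO. Once the OS is installed, all related software has been installed along with it. If the system disk already holds data, the installer asks whether to wipe it. At the boot menu, select the appropriate install entry, press e to enter edit mode, and add the skip-sda-check parameter to bypass the safety check and install directly.

After the OS is installed, complete the following preparations:

Make sure the TCP and IB networks are working
Add hostname-to-IP address mappings
Disable the firewall and SELinux
Make sure the clocks are synchronized

Once these are confirmed, file system creation can begin.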
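The hostname mapping item in the checklist above can be spot-checked with a small script. This is only a minimal sketch: the function name missing_hosts is made up for illustration, and the host names and 192.0.2.x addresses in the sample file are hypothetical placeholders, not values from this deployment.

```shell
#!/bin/sh
# missing_hosts FILE NAME...
# Print every NAME that has no entry in the hosts-format FILE.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Strip comments, then look for the name in any column after the IP.
        if ! awk -v n="$name" '
                { sub(/#.*/, "") }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit found ? 0 : 1 }' "$hosts_file"; then
            echo "$name"
        fi
    done
}

# Demo on a throwaway file; on a real node pass /etc/hosts instead.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# hypothetical entries
192.0.2.34  mds00
192.0.2.35  mds01   # standby
192.0.2.36  oss00
EOF
missing=$(missing_hosts "$tmp" mds00 mds01 oss00 oss01)
rm -f "$tmp"
echo "missing: ${missing:-none}"
```

Running the same function on every server and client before creating the file system catches a forgotten mapping early, when it is still cheap to fix.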

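MDS Node Deployment

The MDS node is Lustre's metadata node. It is usually deployed on two servers, one active and the other standby. Installing and configuring an MDS node takes the following steps:

Format the disks
Configure the lustre.conf file
Load the Lustre kernel module
Mount the disks

Formatting the disks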

MDT

mkfs.lustre --mdt --index=0 --fsname=lustre [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] \
  --mkfsoptions="-m 1 -J size=4096" --reformat --verbose /dev/mapper/mdt

MGT

mkfs.lustre --mgs --fsname=lustre [email protected] [email protected] --reformat --verbose  /dev/mapper/mgt

The commands are long, but the meaning of each parameter is largely clear from its name, so they are not explained in detail here.

lustre.conf

lustre.conf is the Lustre configuration file. It contains a single entry, which configures the networks used by the file system:

[[email protected] new]# cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib(bond0),tcp(eth2)

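Two networks are configured here, separated by a comma; mind the syntax. The first is the IB network and the second is the TCP network. The name in parentheses is the network device, which must correspond to an interface that is currently up. The IB network bonds two ports, hence bond0; the Ethernet interface is eth2.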
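Since every device named in the networks option has to match an interface that is up, it can help to split the option into pairs and look each interface up under /sys/class/net. The sketch below is an illustration only: the function names are made up, and it assumes one interface per network, as in the line above.

```shell
#!/bin/sh
# lnet_networks 'o2ib(bond0),tcp(eth2)'
# Print one "<lnet-type> <interface>" pair per line.
# Assumes a single interface inside each pair of parentheses.
lnet_networks() {
    printf '%s\n' "$1" | tr ',' '\n' |
        sed -n 's/^ *\([a-z0-9]*\)(\([^)]*\)) *$/\1 \2/p'
}

# check_lnet_interfaces SPEC
# Report the kernel link state of every interface named in SPEC.
check_lnet_interfaces() {
    lnet_networks "$1" | while read -r net dev; do
        state_file=/sys/class/net/$dev/operstate
        if [ -r "$state_file" ]; then
            echo "$net: $dev is $(cat "$state_file")"
        else
            echo "$net: $dev not found"
        fi
    done
}

pairs=$(lnet_networks 'o2ib(bond0),tcp(eth2)')
echo "$pairs"
check_lnet_interfaces 'o2ib(bond0),tcp(eth2)'
```

On a machine without bond0 or eth2 the second call simply reports "not found", which is the situation that would keep the lustre module from loading on a real server.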

Loading the Lustre kernel module

By default, the Lustre module is not loaded at boot after the OS is installed. To check whether it is loaded, run:

lsmod | grep lustre

To load the module manually:

modprobe lustre

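This is the most critical step in building the file system: once the Lustre kernel module loads successfully, the remaining steps rarely cause major problems. The module depends tightly on both the kernel version and the IB driver version, and a mismatch in either one prevents it from loading. If loading fails, first check that lustre.conf is correctly formatted; if it is, look in the messages log for details.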
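The check-then-load sequence described above can be wrapped in a small helper. This is a sketch under assumptions: module_loaded is a hypothetical name, the sample lsmod output uses made-up sizes, and the real-server flow is left as a comment because it needs root and an installed Lustre.

```shell
#!/bin/sh
# module_loaded NAME
# Read `lsmod` output on stdin; succeed if NAME is a loaded module.
module_loaded() {
    awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit found ? 0 : 1 }'
}

# Demo with captured output (module sizes are made up).
sample='Module                  Size  Used by
lustre                 100000  0
lnet                   100000  1 lustre'

if printf '%s\n' "$sample" | module_loaded lustre; then
    status=loaded
else
    status=missing
fi
echo "lustre: $status"

# On a real server the flow would be roughly:
#   lsmod | module_loaded lustre || modprobe lustre \
#       || { cat /etc/modprobe.d/lustre.conf; tail -n 50 /var/log/messages; }
```

Matching on the first column avoids the false positive that a plain grep gives when another module merely lists lustre in its "Used by" column.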

Mounting the MDT and MGT

With the steps above complete, the MDT and MGT can be mounted.

MGT

mount -t lustre /dev/mapper/mgt /lustre/mgt/

MDT

mount -t lustre /dev/mapper/mdt /lustre/mdt/

Mounting takes a while, roughly 1-2 minutes, so be patient.

This completes the deployment of the Lustre MDS node.

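OSS Node Deployment

An OSS node is a Lustre data storage node. The number of OSS nodes is driven by the following factors:

The size and aggregate bandwidth of the disk array
The network bandwidth of a single OSS node

For high availability between nodes, the number of OSS nodes in a cluster should in principle be even, with nodes paired as mutual backups. The number of OSTs should also be even, with the two OSS nodes in a pair mounting the same number of OSTs. This environment has only 2 OSS nodes and 14 OSTs. Each OST is an 8+2 RAID 6 group, 4 disks serve as hot spares, and the array holds 144 6 TB 7.2k SAS disks in total.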
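The disk counts in the paragraph above are consistent with each other, which a few lines of shell arithmetic confirm. The capacity figure is raw data capacity only (data disks times disk size) and ignores file system overhead; the variable names are illustrative.

```shell
#!/bin/sh
# Figures taken from the paragraph above.
ost_count=14      # OSTs across both OSS nodes
data_disks=8      # data disks per RAID 6 group
parity_disks=2    # parity disks per RAID 6 group
hot_spares=4
disk_tb=6         # capacity per disk, in TB

total_disks=$(( ost_count * (data_disks + parity_disks) + hot_spares ))
usable_tb=$(( ost_count * data_disks * disk_tb ))
ost_per_oss=$(( ost_count / 2 ))

echo "disks in the array : $total_disks"
echo "data capacity (TB) : $usable_tb"
echo "OSTs per OSS node  : $ost_per_oss"
```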

Deploying an OSS node is similar to deploying an MDS node. The steps are as follows:

Formatting the disks

Each OSS node has 7 OSTs:

mkfs.lustre --ost --index=0 --fsname=lustre [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] --mkfsoptions="-m 1 -i 131072" --reformat --verbose /dev/mapper/360001ff0a101a0000000000089c50000

mkfs.lustre --ost --index=1 --fsname=lustre [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] --mkfsoptions="-m 1 -i 131072" --reformat --verbose /dev/mapper/360001ff0a101a0000000000189c60001

mkfs.lustre --ost --index=2 --fsname=lustre [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] --mkfsoptions="-m 1 -i 131072" --reformat --verbose /dev/mapper/360001ff0a101a0000000000289c70002

mkfs.lustre --ost --index=3 --fsname=lustre [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] --mkfsoptions="-m 1 -i 131072" --reformat --verbose /dev/mapper/360001ff0a101a0000000000389c80003

mkfs.lustre --ost --index=4 --fsname=lustre [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] --mkfsoptions="-m 1 -i 131072" --reformat --verbose /dev/mapper/360001ff0a101a0000000000489c90004

mkfs.lustre --ost --index=5 --fsname=lustre [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] --mkfsoptions="-m 1 -i 131072" --reformat --verbose /dev/mapper/360001ff0a101a0000000000589ca0005

mkfs.lustre --ost --index=6 --fsname=lustre [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] --mkfsoptions="-m 1 -i 131072" --reformat --verbose /dev/mapper/360001ff0a101a0000000000689cb0006

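After formatting the disks, configure lustre.conf and load the Lustre module exactly as on the MDS; those steps are not repeated here. Next, mount the disks.

Mounting the OSTs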

mount -t lustre /dev/mapper/360001ff0a101a0000000000089c50000 /lustre/ost00/
mount -t lustre /dev/mapper/360001ff0a101a0000000000189c60001 /lustre/ost01/
mount -t lustre /dev/mapper/360001ff0a101a0000000000289c70002 /lustre/ost02/
mount -t lustre /dev/mapper/360001ff0a101a0000000000389c80003 /lustre/ost03/
mount -t lustre /dev/mapper/360001ff0a101a0000000000489c90004 /lustre/ost04/
mount -t lustre /dev/mapper/360001ff0a101a0000000000589ca0005 /lustre/ost05/
mount -t lustre /dev/mapper/360001ff0a101a0000000000689cb0006 /lustre/ost06/

This completes the deployment of one OSS node; the other node is set up in the same way.
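The seven mount commands follow a fixed pattern, so they can be generated from a device list instead of being typed by hand. The sketch below prints the commands rather than running them. The function name ost_mount_cmds is made up, and numbering the mount points in two-digit hex is an assumption based on the resource names ost0a to ost0d that appear in the cluster status later in this article.

```shell
#!/bin/sh
# ost_mount_cmds FIRST_INDEX DEVICE...
# Print one mount command per device, numbering mount points in two-digit hex.
ost_mount_cmds() {
    idx=$1
    shift
    for dev in "$@"; do
        printf 'mount -t lustre /dev/mapper/%s /lustre/ost%02x/\n' "$dev" "$idx"
        idx=$(( idx + 1 ))
    done
}

# First two devices of the first OSS node, as listed above.
cmds=$(ost_mount_cmds 0 \
    360001ff0a101a0000000000089c50000 \
    360001ff0a101a0000000000189c60001)
echo "$cmds"

# To execute rather than print: ost_mount_cmds 0 <devices...> | sh -x
```

For the second OSS node the same function would be called with a first index of 7 and that node's own device names.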

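With the above done, configuration of the Lustre file system is essentially complete. A client only needs the Lustre client packages installed and the lustre module loaded to mount and use it. Client configuration is covered in detail later.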

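HA Implementation

The steps above only create a basic Lustre file system; making it highly available requires further configuration. This project uses corosync + pacemaker to manage resources and the heartbeat between nodes. Two resource groups are created, one for the MDS and one for the OSS. They could be merged into one, but to simplify management and later expansion the MDS and OSS are kept separate, each with its own key.

Node HA

The corosync.conf configuration is as follows: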

[[email protected] ~]# cat /etc/corosync/corosync.conf
compatibility: none

totem {
    version: 2
    secauth: off
    threads: 0

    # Tuning for highly congested networks
    # token:
    # This timeout specifies in milliseconds until a token loss is
    # declared after not receiving a token. This is the time spent
    # detecting a failure of a processor in the current configuration.
    # Reforming a new configuration takes about 50 milliseconds in
    # addition to this timeout.
    # The default is 1000 milliseconds.
    token: 10000
    # retransmits_before_loss:
    # This  value  identifies  how  many  token  retransmits should be
    # attempted before forming a new configuration.  If this value  is
    # set,  retransmit  and hold will be automatically calculated from
    # retransmits_before_loss and token.
    # The default is 4 retransmissions.
    retransmits_before_loss: 25
    # consensus:
    # This  timeout  specifies  in  milliseconds  how long to wait for
    # consensus  to  be  achieved  before  starting  a  new  round  of
    # membership configuration.
    # The default is 200 milliseconds.
    consensus: 12000
    # join:
    # This timeout specifies in milliseconds how long to wait for join
    # messages in the membership protocol.
    # The default is 100 milliseconds.
    join: 1000
    # merge:
    # This  timeout  specifies in milliseconds how long to wait before
    # checking for a partition when  no  multicast  traffic  is  being
    # sent.   If  multicast traffic is being sent, the merge detection
    # happens automatically as a function of the protocol.
    # The default is 200 milliseconds.
    merge: 400
    # downcheck:
    # This timeout specifies in milliseconds how long to  wait  before
    # checking  that  a network interface is back up after it has been
    # downed.
    # The default is 1000 millseconds.
    downcheck: 2000

    rrp_mode: passive

    interface {
            member {
                    memberaddr: 192.168.242.34
            }
            member {
                    memberaddr: 192.168.242.35
            }
            ringnumber: 0
            #bindnetaddr: 226.94.1.1
            bindnetaddr: 192.168.242.0
            mcastport: 5401
            ttl: 1
    }
    transport: udpu
}


logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
            subsys: AMF
            debug: off
    }
}

amf {
    mode: disabled
}

Only the MDS configuration file is listed here; the OSS one is similar.
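When the totem timers are tuned by hand, as they are above, it is worth checking that they still fit together. As far as I know, the corosync.conf(5) man page requires consensus to be at least 1.2 times token; treat that rule as an assumption to verify against the man page of the installed version. The sketch below reads both values from an excerpt of the configuration and tests the rule with integer arithmetic; totem_value is a hypothetical helper name.

```shell
#!/bin/sh
# totem_value KEY < corosync.conf
# Print the value of the first uncommented "KEY: value" line.
totem_value() {
    awk -v k="$1:" '$1 == k { print $2; exit }'
}

# Excerpt of the totem section shown above.
conf='totem {
    # token:
    # The default is 1000 milliseconds.
    token: 10000
    retransmits_before_loss: 25
    consensus: 12000
}'

token=$(printf '%s\n' "$conf" | totem_value token)
consensus=$(printf '%s\n' "$conf" | totem_value consensus)

# Integer form of "consensus >= 1.2 * token".
if [ $(( consensus * 10 )) -ge $(( token * 12 )) ]; then
    verdict=ok
else
    verdict=too-small
fi
echo "token=$token consensus=$consensus -> $verdict"
```

With token at 10000 and consensus at 12000 the configuration sits exactly on that boundary, so raising token later would mean raising consensus as well.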

Generating the key file
Command:

corosync-keygen

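Running this command creates the authkey file under /etc/corosync/. Copy it, together with corosync.conf, to the other MDS node.

Once the configuration files are in place, restart the corosync service and run crm_mon to view the cluster status: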

Last updated: Sun Sep  6 02:37:16 2015
Last change: Tue Aug 25 23:12:18 2015 via crmd on oss00
Stack: classic openais (with plugin)
Current DC: oss01 - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured, 2 expected votes
14 Resources configured


Online: [ oss00 oss01 ]

res_Filesystem_ost07    (ocf::heartbeat:Filesystem):    Started oss01
res_Filesystem_ost08    (ocf::heartbeat:Filesystem):    Started oss01
res_Filesystem_ost09    (ocf::heartbeat:Filesystem):    Started oss01
res_Filesystem_ost0a    (ocf::heartbeat:Filesystem):    Started oss01
res_Filesystem_ost0b    (ocf::heartbeat:Filesystem):    Started oss01
res_Filesystem_ost0c    (ocf::heartbeat:Filesystem):    Started oss01
res_Filesystem_ost0d    (ocf::heartbeat:Filesystem):    Started oss01
res_Filesystem_ost00    (ocf::heartbeat:Filesystem):    Started oss00
res_Filesystem_ost01    (ocf::heartbeat:Filesystem):    Started oss00
res_Filesystem_ost02    (ocf::heartbeat:Filesystem):    Started oss00
res_Filesystem_ost03    (ocf::heartbeat:Filesystem):    Started oss00
res_Filesystem_ost04    (ocf::heartbeat:Filesystem):    Started oss00
res_Filesystem_ost05    (ocf::heartbeat:Filesystem):    Started oss00
res_Filesystem_ost06    (ocf::heartbeat:Filesystem):    Started oss00

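Two nodes, oss00 and oss01, are online, and the lines below them show how the resources are distributed across the two nodes. Resources were added, assigned, and prioritized with the graphical tool LCMC; the same can be done from the command line, but it is more tedious. Once the resources are laid out in the GUI, the configuration can be inspected from the command line as follows: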

[[email protected] ~]# crm
crm(live)# configure
crm(live)configure# show

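After the first two commands, show displays the configuration. Resources added through LCMC can then be tuned from the command line: run edit to modify the configuration, which opens an editor that works like vim.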
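A quick way to confirm that the OSTs are balanced across the two OSS nodes is to count the "Started" lines per node in the crm_mon output. The following is a minimal sketch: resources_per_node is a made-up helper, and the sample is a shortened copy of the status output shown earlier in this section.

```shell
#!/bin/sh
# resources_per_node < crm_mon-output
# Count "Started <node>" resource lines per node.
resources_per_node() {
    awk 'NF >= 2 && $(NF-1) == "Started" { n[$NF]++ }
         END { for (h in n) print h, n[h] }' | sort
}

sample='Online: [ oss00 oss01 ]

res_Filesystem_ost07    (ocf::heartbeat:Filesystem):    Started oss01
res_Filesystem_ost08    (ocf::heartbeat:Filesystem):    Started oss01
res_Filesystem_ost00    (ocf::heartbeat:Filesystem):    Started oss00'

counts=$(printf '%s\n' "$sample" | resources_per_node)
echo "$counts"

# On a live cluster: crm_mon -1 | resources_per_node
```

In normal operation each OSS node should report 7 resources; after a failover, all 14 show up on the surviving node.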

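Disk HA

Between the disks and the nodes, the multipath software provides redundancy: every OST or MDT volume on the disk array is mapped to both of the corresponding OSS or MDS nodes.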

[[email protected] ~]# multipath -ll
mdt (3600c0ff0001e1c81cb32d75501000000) dm-2 DotHill,DH3824
size=540G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 11:0:0:0 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 12:0:0:0 sdd 8:48 active ready running
mgt (3600c0ff0001e1c812233d75501000000) dm-3 DotHill,DH3824
size=19G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 11:0:0:1 sdc 8:32 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 12:0:0:1 sde 8:64 active ready running

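Listed here are the storage devices on the MDS node. There are two, mdt and mgt. The MGT can also be combined with the MDT; since disk space was plentiful here, a separate volume was created for the MGT. The same two devices are visible on the standby node: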

[[email protected] ~]# multipath -ll
mdt (3600c0ff0001e1c81cb32d75501000000) dm-2 DotHill,DH3824
size=540G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 11:0:0:0 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 12:0:0:0 sdd 8:48 active ready running
mgt (3600c0ff0001e1c812233d75501000000) dm-3 DotHill,DH3824
size=19G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 11:0:0:1 sdc 8:32 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 12:0:0:1 sde 8:64 active ready running
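Both listings show two paths per device, and that is easy to verify automatically. The sketch below counts the paths in "active ready" state for each multipath map. The function name paths_per_map is hypothetical, and the parsing assumes the multipath -ll layout shown above, with the map name, the WWID in parentheses, and the dm device on the header line.

```shell
#!/bin/sh
# paths_per_map < multipath-ll-output
# Print "<map> <number of paths in 'active ready' state>" for each map.
paths_per_map() {
    awk '
        $2 ~ /^\(.*\)$/ && $3 ~ /^dm-/ { map = $1; order[++n] = map; count[map] = 0 }
        /active ready/ && map != ""   { count[map]++ }
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}

# Demo on a copy of the output above; on a node use: multipath -ll | paths_per_map
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
mdt (3600c0ff0001e1c81cb32d75501000000) dm-2 DotHill,DH3824
size=540G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 11:0:0:0 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 12:0:0:0 sdd 8:48 active ready running
mgt (3600c0ff0001e1c812233d75501000000) dm-3 DotHill,DH3824
size=19G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 11:0:0:1 sdc 8:32 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 12:0:0:1 sde 8:64 active ready running
EOF
result=$(paths_per_map < "$tmp")
rm -f "$tmp"
echo "$result"
```

A map that reports fewer than 2 paths has lost its redundancy and should be investigated before relying on failover.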

Network HA

Each I/O node has two IB ports, which are bonded to improve network reliability. The steps are as follows:

Configure the IP address

[[email protected] ~]# cat /etc/sysconfig/network-scripts/ifcfg-ib0
DEVICE=ib0
TYPE=InfiniBand
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
MASTER=bond0
SLAVE=yes
PRIMARY=yes


[[email protected] ~]# cat /etc/sysconfig/network-scripts/ifcfg-ib1
DEVICE=ib1
TYPE=InfiniBand
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
MASTER=bond0
SLAVE=yes
PRIMARY=no


[[email protected] ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
#TYPE=InfiniBand
IPADDR=11.11.11.34
NETMASK=255.255.255.0
USERCTL=no
BOOTPROTO=static
ONBOOT=yes

Create a bond.conf file with the following content:

[[email protected] ~]# cat /etc/modprobe.d/bond.conf
alias bond0 bonding
options bond0 mode=1 miimon=100

Load the bonding module so that this configuration takes effect:

modprobe bonding

After restarting the network interfaces, check the bond status:

[[email protected] ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: None
Currently Active Slave: ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: ib0
MII Status: up
Speed: 56000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:04:02:20:fe:80
Slave queue ID: 0

Slave Interface: ib1
MII Status: up
Speed: 56000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:04:03:00:fe:80
Slave queue ID: 0

The bonding mode chosen here is 1, that is, active-backup.
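In active-backup mode only one port carries traffic, so a failed standby port can go unnoticed until a failover is needed. A minimal sketch of a status summary follows; bond_summary is a made-up helper name and the sample is a shortened copy of the /proc/net/bonding/bond0 output above.

```shell
#!/bin/sh
# bond_summary < /proc/net/bonding/bondN
# Print the active slave, then the MII status of each slave.
bond_summary() {
    awk -F': ' '
        $1 == "Currently Active Slave"    { print "active " $2 }
        $1 == "Slave Interface"           { slave = $2 }
        $1 == "MII Status" && slave != "" { print slave " " $2 }'
}

sample='Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: None
Currently Active Slave: ib0
MII Status: up

Slave Interface: ib0
MII Status: up
Link Failure Count: 0

Slave Interface: ib1
MII Status: up
Link Failure Count: 0'

summary=$(printf '%s\n' "$sample" | bond_summary)
echo "$summary"

# On a node: bond_summary < /proc/net/bonding/bond0
```

The bond-level "MII Status" line is skipped on purpose, because it appears before any "Slave Interface" line and describes the bond as a whole.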

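Client Configuration

Installing the client starts with building a lustre-client package that matches the production environment; it can be built from the Lustre source package.

First copy the Lustre source to the local machine, unpack it, and change into the source directory. Because this is a client build, it differs from the I/O server build: many of the more complex tools are not needed, so a few extra options are passed to configure. The command is: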

./configure --with-o2ib=/usr/src/ofa_kernel/default/ --with-linux=/usr/src/kernels/2.6.32-504.el6.x86_64/ --disable-server

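--with-o2ib tells Lustre to support the IB network. If the environment has no IB network and only TCP networks (Gigabit or 10 Gigabit Ethernet), this option can be omitted.
--with-linux points to the kernel source of the current operating system. --disable-server builds only the Lustre client.
After configure finishes, run: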

# make -j8

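Let the build run until it finishes without errors, then run

# make rpms

to generate the Lustre client RPM packages.
When the command finishes, the client RPMs are in /root/rpmbuild/RPMS/x86_64. Install all of them.

The following error may appear during installation: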

# rpm -ivh lustre-client-*
error: Failed dependencies:
/usr/bin/expect is needed by lustre-client-source-2.5.29.ddnpf5-2.6.32_504.el6.x86_64_g2139632.x86_64
lustre-iokit is needed by lustre-client-tests-2.5.29.ddnpf5-2.6.32_504.el6.x86_64_g2139632.x86_64

The message reports unmet package dependencies: installing the Lustre packages requires the expect package. Install expect manually:

# yum install expect sg3_utils

sg3_utils is installed here as well; otherwise a later step complains that it is missing too.

Install the Lustre packages again:

# rpm -ivh lustre-*

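LNet Setup

After installing the client software, configure lustre.conf. This works the same way as on the server side and is not repeated here; just remember to change the network device names. With LNet configured, load the lustre module: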

modprobe lustre

After it loads successfully, check the LNet information:

# lctl list_nids
[email protected]

Check communication with the MGS:

# lctl ping [email protected]
[email protected] [email protected]

The output above confirms that LNet on this node can communicate with the MGS.

Mounting on the Client

Once the Lustre client is installed and configured, the file system can be mounted. The mount command is:

mount -t lustre [email protected]:[email protected]:/lustre /lustre/

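The command looks long and hard to remember, but it becomes easy to follow once each part is explained.

-t lustre specifies that the file system type is Lustre.
The first [email protected] is the LNet address of mgs1 and the second [email protected] is the LNet address of mgs2. The two MGS nodes back each other up, and their LNet addresses are separated by a colon.

:/lustre is the file system name that was given when the file system was created.

/lustre is the mount point on the client.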
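The structure described above (MGS NIDs joined by colons, then :/fsname, then the mount point) can be captured in a small command builder. This is a sketch only: lustre_mount_cmd is a hypothetical helper, and the 192.0.2.x NIDs are placeholder addresses, not the ones used in this deployment.

```shell
#!/bin/sh
# lustre_mount_cmd FSNAME MOUNTPOINT MGS_NID...
# Join the MGS NIDs with ':' and print the client mount command.
lustre_mount_cmd() {
    fsname=$1
    mnt=$2
    shift 2
    nids=
    for nid in "$@"; do
        nids=${nids:+$nids:}$nid
    done
    printf 'mount -t lustre %s:/%s %s\n' "$nids" "$fsname" "$mnt"
}

# 192.0.2.x are placeholder addresses.
cmd=$(lustre_mount_cmd lustre /lustre/ 192.0.2.34@o2ib 192.0.2.35@o2ib)
echo "$cmd"
```

Printing the command instead of running it makes it easy to review, or to paste into a script that runs on every client.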

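Summary

These are all the steps in the Lustre deployment; performance test results will be added later. Because every starting environment is different, various problems may come up during deployment, and most can be resolved by searching for the error message.

Overall, deploying Lustre is fairly complex. Here the server-side installation was baked into the OS image, which saved a lot of work. Because Lustre is a kernel-space file system, it is demanding about the OS version and the versions of the software it depends on, and installation is error-prone. Lustre also has no data redundancy of its own, so this deployment relies on a combination of software and hardware to keep the file system running stably. Even so, when a failure does occur, all data may be inaccessible until it is resolved and the impact is wide, so operations staff need a high level of skill.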
