
ceph-disk & ceph-osd startup flow (by quqi99)

Copyright notice: feel free to repost, but please credit the original source and author with a hyperlink and include this notice (author: Zhang Hua, published 2018-07-25)

Problem

After rebooting their physical machines, a customer found that some OSDs failed to come up, yet running the ceph-disk command by hand (ceph-disk -v activate --mark-init systemd --mount /var/lib/ceph/osd/ceph-1) succeeded. The useful lines below were extracted from a mass of unrelated errors; evidently the activation did not finish within the 120-second timeout:

May 22 06:05:19 cephosd06 systemd[1]: Starting Ceph disk activation: /dev/sdh1...
May 22 06:05:30 cephosd06 sh[3926]: main_trigger: main_activate: path = /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: get_dm_uuid: get_dm_uuid /dev/sdh1 uuid path is /sys/dev/block/8:113/dm/uuid
May 22 06:05:30 cephosd06 sh[3926]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: mount: Mounting /dev/sdh1 on /var/lib/ceph/tmp/mnt.0xG2_W with options noatime,inode64
May 22 06:05:30 cephosd06 sh[3926]: command_check_call: Running command: /bin/mount -t xfs -o noatime,inode64 -- /dev/sdh1 /var/lib/ceph/tmp/mnt.0xG2_W
May 22 06:05:30 cephosd06 sh[3926]: command_check_call: Running command: /bin/mount -o noatime,inode64 -- /dev/sdh1 /var/lib/ceph/osd/ceph-45
May 22 06:05:30 cephosd06 ceph-osd[8052]: starting osd.45 at :/0 osd_data /var/lib/ceph/osd/ceph-45 /var/lib/ceph/osd/ceph-45/journal
May 22 06:05:31 cephosd06 sh[6944]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdh1', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f574b3fc7d0>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
May 22 06:05:31 cephosd06 sh[6944]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: main_trigger: trigger /dev/sdh1 parttype 4fbd7e29-9d25-41b8-afd0-062c0ceff05d uuid ff8f7341-1c1e-4912-b680-41fd6999fcc8
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /usr/sbin/ceph-disk --verbose activate /dev/sdh1
May 22 06:07:20 cephosd06 systemd[1]: ceph-disk@dev-sdh1.service: Main process exited, code=exited, status=124/n/a
May 22 06:07:20 cephosd06 systemd[1]: Failed to start Ceph disk activation: /dev/sdh1.
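Exit status 124 is what timeout(1) returns when it kills the wrapped command at the deadline, so status=124 above confirms that 'ceph-disk activate' overran the 120-second limit. A quick way to list every activation that failed this way (a sketch; the unit-name glob needs a reasonably recent systemd):

systemctl --failed | grep ceph-disk
journalctl -b -u 'ceph-disk@*' | grep -E 'status=124|Failed to start'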

Cause

Two factors in the systemd units below caused or aggravated this problem (see [1]):

vi systemd/[email protected]
ExecStart=/bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

vi systemd/[email protected]
ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i
Restart=on-failure
StartLimitInterval=30min
StartLimitBurst=30
RestartSec=20s
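Given the packaged unit above, one mitigation is to raise the 120-second ceiling with a systemd drop-in rather than editing the unit in place. A minimal sketch, assuming the stock jewel ExecStart shown above (the 300s value is an arbitrary example, not a recommendation from the bug):

mkdir -p /etc/systemd/system/ceph-disk@.service.d
cat << 'EOF' > /etc/systemd/system/ceph-disk@.service.d/override.conf
[Service]
# An empty ExecStart= clears the packaged command before re-declaring it.
ExecStart=
ExecStart=/bin/sh -c 'timeout 300 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
EOF
systemctl daemon-reload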

How ceph-disk triggers ceph-osd startup

1, First, ceph-disk creates the journal partition with typecode 45b0969e-9b03-4f30-b4c6-b4b80ceff106 (a verification sketch follows the commands):

ceph-osd --cluster=ceph --show-config-value=osd_journal_size
uuid=$(uuidgen)
num=2
sgdisk --new=${num}:0:+128M --change-name=${num}:"ceph journal" --partition-guid=${num}:${uuid} --typecode=${num}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600
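Before relying on udev, you can confirm the typecode actually landed on the partition (a verification sketch; sgdisk prints the GUID upper-case):

sgdisk --info=2 -- /dev/loop0    # expect "Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106"
blkid -o udev -p /dev/loop0p2 | grep ID_PART_ENTRY_TYPE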

2, After sgdisk creates a partition, partx/partprobe is called to make the kernel re-read the partition table, so partprobe causes udev events to be sent to the udev daemon (you can watch them with the monitor sketch below).
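To see these events with your own eyes, run a monitor in a second terminal while partprobe executes (a sketch using standard udevadm options):

udevadm monitor --kernel --udev --property --subsystem-match=block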

3, On receiving the udev event generated by partprobe, the udev daemon runs '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name' according to the following udev rules (a way to inspect what the rules match is sketched after the two excerpts):

./udev/95-ceph-osd.rules
# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER="ceph", GROUP="ceph", MODE="660"

./src/ceph-disk-udev
45b0969e-9b03-4f30-b4c6-b4b80ceff106)
    # JOURNAL_UUID
    # activate ceph-tagged journal partitions.
    /usr/sbin/ceph-disk -v activate-journal /dev/${NAME}
    ;;
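To check what the rules above would match for a given partition, query its udev properties and dry-run the rule processing (a sketch; loop0p2 is the journal partition created in step 1):

udevadm info --query=property --name=/dev/loop0p2 | grep -E 'ID_PART_ENTRY_(TYPE|UUID)'
udevadm test --action=add /sys/class/block/loop0p2 2>&1 | grep ceph-disk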

4, As a result, the symlink /dev/disk/by-partuuid/9195fa44-68ba-49f3-99f7-80d9bcb50430 is created for the journal partition.

5, Then the journal partition's uuid is written into /var/lib/ceph/osd/ceph-1/journal_uuid, and /var/lib/ceph/osd/ceph-1/journal is created as a symlink pointing at the by-partuuid device:

# ll /var/lib/ceph/osd/ceph-1/journal
lrwxrwxrwx 1 ceph ceph 58 Jun 1 02:46 /var/lib/ceph/osd/ceph-1/journal -> /dev/disk/by-partuuid/9195fa44-68ba-49f3-99f7-80d9bcb50430
# cat /var/lib/ceph/osd/ceph-1/journal_uuid
9195fa44-68ba-49f3-99f7-80d9bcb50430

Pseudo-code description of the ceph-disk execution flow

1, Prepare test disk
dd if=/dev/zero of=test.img bs=1M count=8096 oflag=direct
#sudo losetup -d /dev/loop0
sudo losetup --show -f test.img
sudo ceph-disk -v prepare --zap-disk --cluster ceph --fs-type xfs -- /dev/loop0

2, Clear the partition
parted --machine -- /dev/loop0 print
sgdisk --zap-all -- /dev/loop0
sgdisk --clear --mbrtogpt -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600

3, Create journal partition
ceph-osd --cluster=ceph --show-config-value=osd_journal_size
uuid=$(uuidgen)
num=2
sgdisk --new=${num}:0:+128M --change-name=${num}:"ceph journal" --partition-guid=${num}:${uuid} --typecode=${num}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600

4, Create data partition
uuid=$(uuidgen)
sgdisk --largest-new=1 --change-name=1:"ceph data" --partition-guid=1:${uuid} --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600

5, Format data partition
parted --machine -- /dev/loop0 print
mkfs -t xfs -f -i size=2048 -- /dev/loop0p1

6, All four mkfs/mount option lookups below should return empty by default; ceph-disk then falls back to its built-in defaults (see the sketch after the lookups)
ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
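When these lookups return nothing, ceph-disk falls back to built-in defaults, which for xfs are noatime,inode64 (this is why those options appear in every mount command in this walkthrough). A minimal, hypothetical shell sketch of that lookup-with-fallback logic:

# Hypothetical re-implementation of ceph-disk's option lookup:
mount_options=$(ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs 2>/dev/null || true)
mount_options=${mount_options:-noatime,inode64}   # jewel's built-in xfs default
echo "would mount with: -o ${mount_options}"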

7, Mount tmp directory - https://github.com/ceph/ceph/blob/jewel/src/ceph-disk/ceph_disk/main.py#L3169
mkdir /var/lib/ceph/tmp/mnt.uCrLyH
mount -t xfs -o noatime,inode64 -- /dev/loop0p1 /var/lib/ceph/tmp/mnt.uCrLyH
restorecon /var/lib/ceph/tmp/mnt.uCrLyH
cat /proc/mounts

8, Activate - https://github.com/ceph/ceph/blob/jewel/src/ceph-disk/ceph_disk/main.py#L3192
#Get the cluster fsid and write it to the tmp file ceph_fsid (this is what the activate function does)
fsid=$(ceph-osd --cluster=ceph --show-config-value=fsid)
cat << EOF > /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid
$fsid
EOF
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid

#Get osd_uuid and write it to the tmp file
osd_uuid=$(uuidgen)
cat << EOF > /var/lib/ceph/tmp/mnt.uCrLyH/fsid
$osd_uuid
EOF
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/fsid
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/fsid

#Write magic to the tmp file
cat << EOF > /var/lib/ceph/tmp/mnt.uCrLyH/magic
ceph osd volume v026
EOF
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/magic
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/magic

#Get journal_uuid and write it to the tmp file
journal_uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/loop0p2/ {print $9}')  # the partuuid of the journal partition
cat << EOF > /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid
$journal_uuid
EOF
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid

#Create journal link pointing at the journal partition found above
ln -s /dev/disk/by-partuuid/${journal_uuid} /var/lib/ceph/tmp/mnt.uCrLyH/journal

#Restore file security context and ownership for the tmp directory
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH

#Umount tmp directory
umount -- /var/lib/ceph/tmp/mnt.uCrLyH
rm -rf /var/lib/ceph/tmp/mnt.uCrLyH

#Change the typecode of the data partition to 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, which marks it READY
sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600
udevadm trigger --action=add --sysname-match loop0
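The synthetic 'add' event from udevadm trigger now matches the ceph-data rules in 95-ceph-osd.rules (typecode 4fbd7e29-...), which is what kicks off activation on a real system. A quick check that the partition is indeed flagged READY:

blkid -o udev -p /dev/loop0p1 | grep ID_PART_ENTRY_TYPE   # expect ID_PART_ENTRY_TYPE=4fbd7e29-9d25-41b8-afd0-062c0ceff05d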

9, Start OSD daemon - https://github.com/ceph/ceph/blob/jewel/src/ceph-disk/ceph_disk/main.py#L3471
#ceph-disk -v activate --mark-init systemd --mount /dev/loop0p1
blkid -p -s TYPE -o value -- /dev/loop0p1
ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
mkdir /var/lib/ceph/tmp/mnt.GoeBOu
mount -t xfs -o noatime,inode64 -- /dev/loop0p1 /var/lib/ceph/tmp/mnt.GoeBOu
restorecon /var/lib/ceph/tmp/mnt.GoeBOu
umount -- /var/lib/ceph/tmp/mnt.GoeBOu
rm -rf /var/lib/ceph/tmp/mnt.GoeBOu
systemctl disable ceph-osd@3
systemctl enable --runtime ceph-osd@3
systemctl start ceph-osd@3
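Finally, a sanity check that the OSD actually came up (osd id 3 follows the systemctl lines above; substitute your own id):

systemctl status ceph-osd@3 --no-pager
ceph --cluster=ceph osd tree | grep 'osd\.3'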

Reference

[1] https://bugs.launchpad.net/charm-ceph-osd/+bug/1783113