Linux IO Monitoring and Deep Analysis

阿新 • Published: 2018-11-21

https://jaminzhang.github.io/os/Linux-IO-Monitoring-and-Deep-Analysis/

Introduction

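I had a phone interview yesterday, and the interviewer asked how to analyze system IO. My first reaction was to use iotop to look at the read/write rate of each process on the system, and then iostat to look at the CPU's %iowait share (%iowait: the percentage of time the CPU spends waiting for input/output to complete; a high %iowait suggests an I/O bottleneck on the disk).
But my answer was not very complete. I did write a post on Linux iostat usage quite a while ago, but I had not analyzed IO state on a system for a long time, so I had forgotten several of the tools and parameters. (Which shows that getting familiar with a piece of knowledge or a skill takes constant use and repeated study; practice really does make perfect. But I digress, back to IO monitoring and analysis.) The interviewer then hinted that I should also look at the %util parameter (which indicates how busy the disk is). As soon as he said it, I did remember it; it is one of the parameters that is commonly checked.
So below I look up the relevant material and study it once more. Using it regularly in real work is still what makes it stick.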

1 System-level IO monitoring

iostat


[root@xxxx_wan360_game ~]# iostat -xdm 1
Linux 2.6.32-358.el6.x86_64 (xxxx_wan360_game) 	12/06/2016 	_x86_64_	(8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
xvdep1            0.00     0.00    0.01    1.31     0.00     0.02    31.35     0.00    1.63   0.06   0.01

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
xvdep1            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
xvdep1            0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
xvdep1            0.00     0.00    0.00    3.00     0.00     0.01     8.00     0.00    0.00   0.00   0.00


# iostat options

-x     Display extended statistics.  
This option works with post 2.5 kernels since it needs /proc/diskstats file or a mounted sysfs to get the statistics. 
This option may also work with older kernels (e.g. 2.4) only if extended statistics are available in /proc/partitions
(the kernel needs to be patched for that).

-d     Display the device utilization report.

-m     Display statistics in megabytes per second instead of blocks or kilobytes per second.  
Data displayed are valid only with kernels 2.4 and later.

rrqm/s
       The number of read requests merged per second that were queued to the device.

wrqm/s
       The number of write requests merged per second that were queued to the device.

r/s
       The number of read requests that were issued to the device per second.

w/s
       The number of write requests that were issued to the device per second.

rMB/s
       The number of megabytes read from the device per second.

wMB/s
       The number of megabytes written to the device per second.


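avgrq-sz
	The average size (in sectors) of the requests that were issued to the device.
	A sector is a physical property of the disk: it is the smallest unit a disk device can address, and its size can be checked with fdisk -l.
	The "block" people usually talk about, by contrast, is a file system abstraction, not a property of the disk itself.
	Another explanation:
	The size of the IO requests submitted to the driver layer, generally no smaller than 4K and no larger than max(readahead_kb, max_sectors_kb).
	It can be used to judge the current IO pattern: in general, and especially when the disk is busy, larger values indicate sequential IO and smaller values indicate random IO.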
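
Since avgrq-sz is reported in sectors, it helps to convert it to a size before comparing it with the 4K figure above. A minimal sketch, assuming the 512-byte sector unit that iostat uses for its counters; the function name is my own:

```python
SECTOR_BYTES = 512  # assumption: iostat counts sectors in 512-byte units


def avgrq_sz_to_kib(avgrq_sz_sectors: float) -> float:
    """Convert an avgrq-sz value (sectors per request) to KiB per request."""
    return avgrq_sz_sectors * SECTOR_BYTES / 1024


# The two non-zero avgrq-sz values from the iostat output above.
for sectors in (31.35, 8.00):
    print(f"{sectors} sectors -> {avgrq_sz_to_kib(sectors):.2f} KiB")
```

Read this way, the 8.00 in the last sample of the output above corresponds to 4 KiB requests, which is on the small (random-looking) side.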
	 
avgqu-sz
	The average queue length of the requests that were issued to the device.

await
	The average time (in milliseconds) for I/O requests issued to the device to be served. 
	This includes the time spent by the requests in queue and the time spent servicing them.
	Another explanation:
	The average time (in milliseconds) taken to handle each I/O request. This can be read as the I/O response time.
	As a rule of thumb, I/O response time should be below 5 ms; anything above 10 ms is on the high side.
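
To make "average time per request" concrete, here is a rough sketch of how such a value can be derived from two snapshots of the per-device counters in /proc/diskstats (requests completed and milliseconds spent, for reads and for writes). The dictionary keys and the sample numbers are made up for illustration; this is not iostat's actual code:

```python
def await_ms(prev: dict, curr: dict) -> float:
    """Average time per completed request (ms) between two counter snapshots."""
    completed = (curr["reads"] - prev["reads"]) + (curr["writes"] - prev["writes"])
    spent_ms = (curr["read_ms"] - prev["read_ms"]) + (curr["write_ms"] - prev["write_ms"])
    # No completed requests in the interval means there is nothing to average.
    return spent_ms / completed if completed else 0.0


# Hypothetical snapshots taken one sampling interval apart.
before = {"reads": 100, "writes": 200, "read_ms": 500, "write_ms": 1000}
after = {"reads": 110, "writes": 230, "read_ms": 530, "write_ms": 1130}
print(f"await = {await_ms(before, after):.2f} ms")
```

Because the time counters include queueing, a value computed this way grows when requests pile up, even if the device itself is no slower.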

svctm
	The average service time (in milliseconds) for I/O requests that were issued to the device. 
	Warning! Do not trust this field any more. This field will be removed in a future sysstat version.
	Another explanation:
	The service time of a single IO request. For a single disk doing fully random reads it is roughly 7 ms, i.e. seek time + rotational latency.
	
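%util
	Percentage of elapsed time during which I/O requests were issued to the device 
	(bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.
	Another explanation:
	It represents how busy the disk is: 100% means the disk is busy, 0% means it is idle. Note, however, that a busy disk does not necessarily mean high disk (bandwidth) utilization.
	It is the time spent handling I/O within the sampling period divided by the length of the sampling period.
	For example, with a 1 second interval, if the device spends 0.8 seconds handling I/O and is idle for 0.2 seconds, then %util = 0.8/1 = 80%,
	so this value indicates how busy the device is. In general, 100% means the device is running close to full load
	(although with multiple disks, even a %util of 100% does not necessarily mean a bottleneck, because the disks can serve requests concurrently).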
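
The 0.8/1 = 80% example can be reproduced from the "time spent doing I/Os" counter in /proc/diskstats. The sketch below assumes the usual layout of a diskstats line (major, minor, device name, then the counters, with the busy-time counter as the 13th field); the two sample lines are invented, not taken from a real machine:

```python
def io_ticks_ms(diskstats_line: str) -> int:
    """Return the 'time spent doing I/Os' counter (ms) from one /proc/diskstats line."""
    fields = diskstats_line.split()
    # Index 0-2 are major, minor and device name; index 12 is the busy-time counter.
    return int(fields[12])


def util_percent(prev_line: str, curr_line: str, interval_ms: float) -> float:
    """Share of the sampling interval during which the device was handling I/O."""
    busy_ms = io_ticks_ms(curr_line) - io_ticks_ms(prev_line)
    return min(100.0, 100.0 * busy_ms / interval_ms)


# Hypothetical samples taken 1 second apart: the device was busy for 800 ms.
first = "202 65 xvdep1 1000 0 8000 500 2000 0 16000 1500 0 1200 2000"
second = "202 65 xvdep1 1010 0 8080 530 2030 0 16240 1630 0 2000 2160"
print(f"%util = {util_percent(first, second, 1000):.1f}")
```

Note that the counter only records whether the device had work in flight, not how much of it, which is why a device that serves requests concurrently can show 100% without being saturated.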

%iowait
	Show the percentage of time that the CPU or CPUs were idle during 
	which the system had an outstanding disk I/O request.
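
For reference, %iowait can be approximated from the aggregate "cpu" line of /proc/stat, where the fifth value is the iowait time. This is only a sketch under that assumption: it sums the first eight values (user through steal) as the total, and the sample lines are hypothetical:

```python
def cpu_times(stat_cpu_line: str) -> list:
    """Parse user, nice, system, idle, iowait, irq, softirq, steal from a 'cpu' line."""
    parts = stat_cpu_line.split()
    return [int(value) for value in parts[1:9]]


def iowait_percent(prev_line: str, curr_line: str) -> float:
    """Share of CPU time spent idle with disk I/O outstanding between two samples."""
    prev, curr = cpu_times(prev_line), cpu_times(curr_line)
    total = sum(curr) - sum(prev)
    if total <= 0:
        return 0.0
    return 100.0 * (curr[4] - prev[4]) / total  # index 4 is iowait


# Hypothetical samples of the first line of /proc/stat.
earlier = "cpu  100 0 50 800 50 0 0 0 0 0"
later = "cpu  150 0 70 1500 280 0 0 0 0 0"
print(f"%iowait = {iowait_percent(earlier, later):.1f}")
```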

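Summary:
iostat reports the IO actually submitted to the device after the generic block layer has merged requests (rrqm/s, wrqm/s), so it reflects the overall IO state of the system,
but it has the following 2 drawbacks:

1. It is far from the application layer and does not map onto the write and read calls in code (readahead, the page cache, the IO scheduler and other factors make such a mapping hard anyway).
2. It is system-level and cannot be narrowed down to a process: it can tell you the disk is busy right now, but not who is keeping it busy or what they are doing.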

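Summary from another source:

If %iowait is too high, the disk may have an I/O bottleneck.
If %util is close to 100%, too many I/O requests are being generated, the I/O system is running at full load, and the disk may be a bottleneck.
If svctm is close to await, I/O requests spend almost no time waiting;
if await is much larger than svctm, the I/O queue is too long and I/O responses are too slow, so some optimization is needed.
If avgqu-sz is large, that also means a lot of IO is waiting.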
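
These rules of thumb can be written down as a small checker. Everything numeric here is my own assumption, since the source only says "close to 100%", "much larger" and "large": the cut-offs (90%, a factor of 2, a queue length of 1) are placeholders to tune, and svctm carries the man page's warning about not being trustworthy.

```python
def diagnose(util: float, await_ms: float, svctm_ms: float, avgqu_sz: float,
             util_limit: float = 90.0,    # assumed meaning of "close to 100%"
             wait_ratio: float = 2.0,     # assumed meaning of "much larger"
             queue_limit: float = 1.0) -> list:  # assumed meaning of "large"
    """Apply the rules of thumb above to one line of iostat -x output."""
    findings = []
    if util >= util_limit:
        findings.append("device close to full load")
    if svctm_ms > 0 and await_ms > wait_ratio * svctm_ms:
        findings.append("requests spend most of their time queued")
    if avgqu_sz > queue_limit:
        findings.append("many requests waiting in the queue")
    return findings


# Hypothetical readings for a struggling device and for a healthy one.
print(diagnose(util=99.5, await_ms=40.0, svctm_ms=5.0, avgqu_sz=8.0))
print(diagnose(util=10.0, await_ms=5.0, svctm_ms=4.5, avgqu_sz=0.1))
```

Treat the findings as hints for where to look next rather than as a verdict.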

2 Process-level IO monitoring

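iotop and pidstat

iotop, as the name suggests, is top for IO.
pidstat, as the name suggests, reports per-process (pid) statistics, which naturally include a process's IO.

Both commands report IO per process, so they can answer the following two questions:

Which processes on the system are currently using IO, and what share does each take?
Are those processes reading or writing, and how much?

pidstat has many options; use whichever you need.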


[root@xxxx_wan360_game ~]# pidstat -d 1		# show IO only
Linux 2.6.32-358.el6.x86_64 (xxxx_wan360_game) 	12/06/2016 	_x86_64_	(8 CPU)

05:28:57 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
05:28:58 PM        50      0.00      4.00      0.00  sync_supers
05:28:58 PM       627      0.00      8.00      0.00  flush-202:65
05:28:58 PM      3852      0.00      8.00      0.00  cente_s0001
05:28:58 PM      3860      0.00      4.00      0.00  game_s0001
05:28:58 PM      3864      0.00      4.00      0.00  game_s0001
05:28:58 PM      3868      0.00      4.00      0.00  game_s0001
05:28:58 PM      3876      0.00      4.00      0.00  gate_s0001
05:28:58 PM      3880      0.00      4.00      0.00  gate_s0001

05:28:58 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command

05:28:59 PM       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
05:29:00 PM     23922      0.00     20.00      0.00  filebeat


# pidstat -u -r -d -t 1        
# -u CPU utilization
# -r page faults and memory usage
# -d IO statistics
# -t report per thread
# 1  sample once per second

[root@xxxx_wan360_game ~]# pidstat -u -r -d -t 1
Linux 2.6.32-358.el6.x86_64 (xxxx_wan360_game) 	12/06/2016 	_x86_64_	(8 CPU)

05:32:11 PM      TGID       TID    %usr %system  %guest    %CPU   CPU  Command
05:32:12 PM      3856         -    3.74    0.93    0.00    4.67     5  game_s0001
05:32:12 PM         -      3856    4.67    0.93    0.00    5.61     5  |__game_s0001
05:32:12 PM         -      3922    0.93    0.00    0.00    0.93     2  |__game_s0001
05:32:12 PM      3880         -    0.00    0.93    0.00    0.93     3  gate_s0001
05:32:12 PM         -      3908    0.00    0.93    0.00    0.93     3  |__gate_s0001
05:32:12 PM      6832         -    1.87    4.67    0.00    6.54     4  pidstat
05:32:12 PM         -      6832    1.87    4.67    0.00    6.54     4  |__pidstat

05:32:11 PM      TGID       TID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
05:32:12 PM      6803         -      5.61      0.00    4124    796   0.00  iostat
05:32:12 PM         -      6803      5.61      0.00    4124    796   0.00  |__iostat
05:32:12 PM      6832         -   1321.50      0.00  101432   1280   0.01  pidstat
05:32:12 PM         -      6832   1321.50      0.00  101432   1280   0.01  |__pidstat
05:32:12 PM      8391         -      0.93      0.00   17992   1176   0.01  zabbix_agentd
05:32:12 PM         -      8391      0.93      0.00   17992   1176   0.01  |__zabbix_agentd
05:32:12 PM      8392         -      2.80      0.00   20064   1320   0.01  zabbix_agentd
05:32:12 PM         -      8392      2.80      0.00   20064   1320   0.01  |__zabbix_agentd

05:32:11 PM      TGID       TID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
05:32:12 PM      1894         -      0.00      3.74      0.00  mysqld
05:32:12 PM         -      1923      0.00      3.74      0.00  |__mysqld
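
The three IO columns correspond to the read_bytes, write_bytes and cancelled_write_bytes counters that the kernel exposes per process in /proc/<pid>/io. As a sketch of where such numbers come from (not pidstat's actual implementation), the following turns two snapshots of that file into per-second rates; the sample text is invented, and reading another user's /proc/<pid>/io normally requires root:

```python
def parse_proc_io(text: str) -> dict:
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            counters[key.strip()] = int(value)
    return counters


def io_rates_kb(prev: dict, curr: dict, interval_s: float) -> dict:
    """Per-second rates in kB, named after the pidstat -d columns."""
    def rate(name: str) -> float:
        return (curr[name] - prev[name]) / 1024 / interval_s

    return {
        "kB_rd/s": rate("read_bytes"),
        "kB_wr/s": rate("write_bytes"),
        "kB_ccwr/s": rate("cancelled_write_bytes"),
    }


# Hypothetical snapshots of one process, taken 1 second apart.
snap1 = parse_proc_io("read_bytes: 0\nwrite_bytes: 8192\ncancelled_write_bytes: 0\n")
snap2 = parse_proc_io("read_bytes: 0\nwrite_bytes: 12288\ncancelled_write_bytes: 0\n")
print(io_rates_kb(snap1, snap2, 1.0))
```

On a live system the snapshots would come from something like open(f"/proc/{pid}/io").read(), taken at the start and end of each interval.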

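Summary:

Process-level IO monitoring:

It can answer the 2 questions that system-level IO monitoring cannot.
It is relatively close to the application layer (for example, it can report how much a process reads and writes).

However, it still cannot be tied to the application's read and write calls, and its granularity is coarse: it cannot tell you which files the process is reading or writing, how long that takes, or how large the requests are.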

3 Application-level IO monitoring

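ioprofile

The ioprofile command is essentially lsof + strace. ioprofile can answer the following three questions:

Which files did the process read and write at the application level (read, write) during a given period?
How many reads and writes were there? (number of read and write calls)
How much data was read and written? (bytes passed to read and write)

Note: ioprofile only supports multi-threaded programs, not single-threaded ones. For application-level IO analysis of a single-threaded program, strace is enough.

Summary: ioprofile is essentially strace, so it can see the call trace of read and write and supports IO analysis at the application layer (it cannot help with mmap-based IO).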
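
To show what "counting read and write from a trace" looks like in practice, here is a rough sketch that summarizes strace output per file. It is not ioprofile's code. It assumes the trace was captured with strace's -y option, so that file descriptors are printed with their paths (for example 3</etc/hosts>), and the sample trace is made up:

```python
import re
from collections import defaultdict

# Matches e.g.: read(3</etc/hosts>, "...", 4096) = 20
CALL_RE = re.compile(r"\b(read|write)\((\d+)(?:<([^>]*)>)?,.*\)\s+=\s+(-?\d+)")


def summarize(trace_text: str) -> dict:
    """Count calls and bytes per (syscall, file) from strace output."""
    stats = defaultdict(lambda: {"calls": 0, "bytes": 0})
    for line in trace_text.splitlines():
        match = CALL_RE.search(line)
        if not match:
            continue
        call, fd, path, result = match.group(1), match.group(2), match.group(3), int(match.group(4))
        if result < 0:  # failed call, no data transferred
            continue
        entry = stats[(call, path or f"fd {fd}")]
        entry["calls"] += 1
        entry["bytes"] += result
    return dict(stats)


# Hypothetical excerpt of `strace -y -e trace=read,write` output.
SAMPLE_TRACE = r'''
read(3</etc/hosts>, "127.0.0.1 localhost\n", 4096) = 20
write(1</dev/pts/0>, "127.0.0.1 localhost\n", 20) = 20
read(3</etc/hosts>, "", 4096) = 0
read(4</tmp/x>, 0x7ffd1234, 10) = -1 EAGAIN (Resource temporarily unavailable)
'''

for (call, target), entry in summarize(SAMPLE_TRACE).items():
    print(f"{call:5s} {target:12s} calls={entry['calls']} bytes={entry['bytes']}")
```

As with ioprofile itself, IO done through mmap never shows up as read or write calls, so a summary like this will miss it.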

4 File-level IO monitoring

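File-level IO monitoring can complement application-level and process-level IO analysis.
File-level IO analysis focuses on a single file and answers which processes are currently reading or writing it.

lsof or ls /proc/pid/fd
inodewatch.stp

lsof tells you which processes currently have a file open.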


[root@xxxx_wan360_game ~]# lsof ./		# the current directory is currently held open by the bash and lsof processes
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
lsof     8932 root  cwd    DIR 202,65     4096 33652737 .
lsof     8938 root  cwd    DIR 202,65     4096 33652737 .
bash    16678 root  cwd    DIR 202,65     4096 33652737 .

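lsof can only answer static questions, and "open" does not necessarily mean "being read".
For commands like cat and echo, the open and the read are over in an instant and lsof can hardly catch them; inodewatch.stp can be used to fill that gap.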
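
The ls /proc/pid/fd approach can be scripted: each entry under /proc/<pid>/fd is a symlink to whatever that descriptor has open. The sketch below walks those links looking for one file. The function name and the proc_root parameter are my own, and it shares lsof's limitation of being a one-off snapshot that only sees processes you have permission to inspect:

```python
import os


def find_openers(target: str, proc_root: str = "/proc") -> dict:
    """Map pid -> list of fd numbers that currently point at `target`."""
    target = os.path.realpath(target)
    openers = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # only numeric entries are processes
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # process exited, or permission denied
            continue
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:  # descriptor closed while we were looking
                continue
            if link == target:
                openers.setdefault(int(entry), []).append(int(fd))
    return openers
```

For example, find_openers("/var/log/messages") run as root would list the processes holding that log open, while a short-lived cat would most likely be gone before the scan reaches it.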

Ref

IO Monitoring and Analysis on Linux
Analyzing IO Performance with iostat
Performance Tuning: Analyzing IO Bottlenecks
