系統技術非業餘研究 » Dirty-page writeback caused by mixing buffered I/O and direct I/O

阿新 • Published: 2019-01-13

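Today 曲山 asked online:

In my tests, if I cp a file and then read it with direct I/O, the read takes a very long time.
My guess is that dio cannot use the page cache, and since the file sits entirely in the cache after the cp, it has to be forcibly flushed to disk before it can be read?
The file I cp'd is large, over 256 MB.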

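Data files are opened in buffered I/O mode by default, which means their data is first buffered in the page cache. Writes therefore produce a large number of dirty pages, and as long as the kernel is not under memory pressure, that data simply stays in memory. Direct I/O bypasses the page cache and issues device I/O directly, so before the I/O is issued the kernel has to make sure the data has already landed on the medium. For a large file, that flush takes a long time.

From the writeback behavior of the page cache we know that writeback is not triggered as long as the amount of dirty pages stays below 10% of total memory. Our machine has 4 GB of memory, so two 100 MB files add up to only 200 MB, which does not trigger writeback, and we can observe the phenomenon without trouble.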
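The arithmetic behind that claim can be checked with a short sketch. It follows the simplified wording of the text (10% of total memory); as far as I know the kernel applies its ratio to available memory rather than total memory, so treat the figure as an approximation. The function name is made up for illustration.

```python
def background_writeback_threshold(total_bytes, ratio_percent=10):
    """Bytes of dirty data allowed before writeback starts (simplified model)."""
    return total_bytes * ratio_percent // 100

MiB = 1024 * 1024
total_memory = 4 * 1024 * MiB   # the 4 GB machine from the text
dirty_data = 2 * 100 * MiB      # two 100 MB files

threshold = background_writeback_threshold(total_memory)
print(threshold // MiB, "MiB threshold")
print("writeback triggered:", dirty_data > threshold)
```

With these numbers the threshold is roughly 409 MiB, about twice the 200 MiB of dirty data, so the dirty pages are still in memory when the direct read arrives.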

With that analysis in hand, let's reproduce the problem. These are my steps:

$ uname -a
Linux rds064075.sqa.cm4 2.6.32-131.21.1.tb477.el6.x86_64 #1 SMP Thu Feb 23 14:24:55 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
$ sudo sysctl vm.drop_caches=3
vm.drop_caches = 3
$ free -m && cat /proc/meminfo |grep -i dirty \
  && time dd if=/dev/urandom of=test.dat count=6144 bs=16384 \
  && free -m && cat /proc/meminfo |grep -i dirty \
  && time dd if=test.dat of=/dev/null count=6144 bs=16384 \
  && free -m && cat /proc/meminfo |grep -i dirty \
  && time dd if=test.dat of=/dev/null count=6144 bs=16384 iflag=direct \
  && free -m && cat /proc/meminfo |grep -i dirty
             total       used       free     shared    buffers     cached
Mem:         48262      22800      25461          0          3         42
-/+ buffers/cache:      22755      25507
Swap:         2047       2047          0
Dirty:               344 kB
6144+0 records in
6144+0 records out
100663296 bytes (101 MB) copied, 15.2308 s, 6.6 MB/s

real	0m15.249s
user	0m0.001s
sys	0m15.228s
             total       used       free     shared    buffers     cached
Mem:         48262      22912      25350          0          3        139
-/+ buffers/cache:      22768      25493
Swap:         2047       2047          0
Dirty:             98556 kB
6144+0 records in
6144+0 records out
100663296 bytes (101 MB) copied, 0.028041 s, 3.6 GB/s

real	0m0.029s
user	0m0.000s
sys	0m0.029s
             total       used       free     shared    buffers     cached
Mem:         48262      22912      25350          0          3        139
-/+ buffers/cache:      22768      25493
Swap:         2047       2047          0
Dirty:             98556 kB
6144+0 records in
6144+0 records out
100663296 bytes (101 MB) copied, 0.466601 s, 216 MB/s

real	0m0.468s
user	0m0.002s
sys	0m0.101s
             total       used       free     shared    buffers     cached
Mem:         48262      22906      25356          0          3        140
-/+ buffers/cache:      22762      25500
Swap:         2047       2047          0
Dirty:               896 kB

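From the experiment above we can see that the file is about 101 MB and that its dirty pages take up about 98 MB of memory (Dirty rises from 344 kB to 98556 kB). After the direct read, the dirty pages belonging to the file have been flushed and Dirty drops to 896 kB, yet the data itself is still in the page cache (cached is 42 MB before the write and 140 MB at the end). This matches our expectation.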
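To make the bookkeeping explicit, here is a small sketch that pulls the Dirty value out of /proc/meminfo-style text and compares it with the file size. The two sample strings are copied from the output above; the helper name `dirty_kb` is hypothetical.

```python
import re

def dirty_kb(meminfo_text):
    """Return the value of the 'Dirty:' line in kB, or None if it is absent."""
    match = re.search(r"^Dirty:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None

# Values copied from the experiment output above.
after_write = "Dirty:             98556 kB\n"
after_direct_read = "Dirty:               896 kB\n"

flushed_kb = dirty_kb(after_write) - dirty_kb(after_direct_read)
file_kb = 6144 * 16384 // 1024  # dd count * bs, converted to kB
print(f"file size: {file_kb} kB, dirty pages flushed: {flushed_kb} kB")
```

The flushed amount is close to the file size, which is consistent with the direct read having forced the file's dirty pages out to disk.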

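Next, let's analyze this behavior from the source code. A file read through the VFS starts at generic_file_aio_read, regardless of the concrete filesystem underneath.
With help from 文卿 and 三百 we found the source location with no effort at all. The lazy way is: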

$ stap -L 'kernel.function("generic_file_aio_read")' 
kernel.function("generic_file_aio_read@mm/filemap.c:1331") $iocb:struct kiocb* $iov:struct iovec const* $nr_segs:long unsigned int $pos:loff_t $count:size_t

With emacs at the ready, let's look at how the read path is implemented:
mm/filemap.c:1331

/**
 * generic_file_aio_read - generic filesystem read routine
 * @iocb:       kernel I/O control block
 * @iov:        io vector request
 * @nr_segs:    number of segments in the iovec
 * @pos:        current file position
 *
 * This is the "read()" routine for all filesystems
 * that can use the page cache directly.
 */
ssize_t
generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
                unsigned long nr_segs, loff_t pos)
{
        /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
        if (filp->f_flags & O_DIRECT) {
                loff_t size;
                struct address_space *mapping;
                struct inode *inode;

                mapping = filp->f_mapping;
                inode = mapping->host;
                if (!count)
                        goto out; /* skip atime */
                size = i_size_read(inode);
                if (pos < size) {
                        retval = filemap_write_and_wait_range(mapping, pos,
                                        pos + iov_length(iov, nr_segs) - 1);
                        if (!retval) {
                                retval = mapping->a_ops->direct_IO(READ, iocb,
                                                        iov, pos, nr_segs);
                        }
                        if (retval > 0) {
                                *ppos = pos + retval;
                                count -= retval;
                        }

                        /*
                         * Btrfs can have a short DIO read if we encounter
                         * compressed extents, so if there was an error, or if
                         * we've already read everything we wanted to, or if
                         * there was a short read because we hit EOF, go ahead
                         * and return.  Otherwise fallthrough to buffered io for
                         * the rest of the read.
                         */
                        if (retval < 0 || !count || *ppos >= size) {
                                file_accessed(filp);
                                goto out;
                        }
                }
        }

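The source states it plainly: for a file opened in direct I/O mode, the dirty data in the requested range is first written back through filemap_write_and_wait_range, and only then does the actual read I/O begin.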
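The ordering in that code path (flush the range, then read from the device) can be imitated with a toy user-space model. This is only a sketch of the control flow, not kernel code: `ToyFile` and all of its methods are hypothetical names, and it assumes every write stays inside the file size.

```python
PAGE_SIZE = 4096

class ToyFile:
    """Toy model of one file: bytes on 'disk' plus a page cache with dirty pages."""

    def __init__(self, size):
        self.disk = bytearray(size)
        self.cache = {}      # page index -> bytearray(PAGE_SIZE)
        self.dirty = set()   # indices of cached pages not yet written to disk

    def buffered_write(self, pos, data):
        """Write into the page cache only and mark the touched pages dirty."""
        for offset, byte in enumerate(data):
            index, within = divmod(pos + offset, PAGE_SIZE)
            if index not in self.cache:
                start = index * PAGE_SIZE
                page = bytearray(PAGE_SIZE)
                chunk = self.disk[start:start + PAGE_SIZE]
                page[:len(chunk)] = chunk
                self.cache[index] = page
            self.cache[index][within] = byte
            self.dirty.add(index)

    def write_and_wait_range(self, start, end):
        """Flush dirty pages overlapping [start, end]; return how many were written."""
        flushed = 0
        for index in sorted(self.dirty):
            if start // PAGE_SIZE <= index <= end // PAGE_SIZE:
                base = index * PAGE_SIZE
                n = min(PAGE_SIZE, len(self.disk) - base)
                self.disk[base:base + n] = self.cache[index][:n]
                self.dirty.discard(index)  # page stays cached, now clean
                flushed += 1
        return flushed

    def direct_read(self, pos, length):
        """Flush the requested range first, then read straight from 'disk'."""
        flushed = self.write_and_wait_range(pos, pos + length - 1)
        return bytes(self.disk[pos:pos + length]), flushed

f = ToyFile(4 * PAGE_SIZE)
f.buffered_write(0, b"x" * (3 * PAGE_SIZE))
data, flushed = f.direct_read(0, 2 * PAGE_SIZE)
print(flushed, "pages flushed before the direct read;", len(f.dirty), "still dirty")
```

In the model, as in the experiment, the flushed pages remain in the cache as clean pages, and the cost of the direct read grows with the number of dirty pages in the requested range.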
As a last step, let's use stap to confirm the earlier experiment:

$ cat dwb.stp
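global i;
# Print a kernel backtrace each time dd enters filemap_write_and_wait_range.
probe kernel.function("filemap_write_and_wait_range") {
    if (execname() != "dd") next;   # ignore every process except dd
    print_backtrace();
    println("===");
    if (i++ > 2) exit();            # stop after four backtraces
}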

$ sudo stap dwb.stp 
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff8110f278 : generic_file_aio_read+0x498/0x870 [kernel]
 0xffffffff8117323a : do_sync_read+0xfa/0x140 [kernel]
 0xffffffff81173c65 : vfs_read+0xb5/0x1a0 [kernel]
 0xffffffff81173da1 : sys_read+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
===
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff811acbc8 : __blockdev_direct_IO+0x228/0xc40 [kernel]
 0xffffffffa008a24a
===
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff8110f278 : generic_file_aio_read+0x498/0x870 [kernel]
 0xffffffff8117323a : do_sync_read+0xfa/0x140 [kernel]
 0xffffffff81173c65 : vfs_read+0xb5/0x1a0 [kernel]
 0xffffffff81173da1 : sys_read+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
===
 0xffffffff8110e200 : filemap_write_and_wait_range+0x0/0x90 [kernel]
 0xffffffff811acbc8 : __blockdev_direct_IO+0x228/0xc40 [kernel]
 0xffffffffa008a24a
===

The call stacks of filemap_write_and_wait_range make everything plain!

Summary: filesystems are complicated, so it is best not to mix buffered I/O and direct I/O!
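If both modes really do have to touch the same file, one possible mitigation (my own suggestion, not something the analysis above tested) is to pay the flush cost at write time by calling fsync after the buffered write, so that a later direct read finds little or nothing left to write back. A minimal sketch, with a hypothetical helper name and scratch path:

```python
import os
import tempfile

def write_and_sync(path, data):
    """Buffered write followed by fsync, so the data is on disk when we return."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to write the dirty pages out
    return len(data)

path = os.path.join(tempfile.mkdtemp(), "test.dat")  # hypothetical scratch file
written = write_and_sync(path, b"\0" * (1024 * 1024))
print(written, os.path.getsize(path))
```

This does not make the flush any cheaper; it only moves it to a point where the delay is expected.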
Have fun!
