
Thoughts on the first classic RCU implementation, based on kernel 2.5.43

I have recently been studying the RCU mechanism and wanted to understand it starting from its historical origin (tracing things back to the source often brings unexpected rewards; at the very least you can appreciate how the experts thought by watching the code evolve).

As it turns out, plenty of people online have had the same idea. Special thanks to this article:

http://www.wowotech.net/kernel_synchronization/Linux-2-5-43-RCU.html

The article above explains everything in detail and covers the topic thoroughly; interested readers can simply read it directly.

What follows are just some thoughts of my own.

Assume a 4-core system. RCU processing then touches the following places across the system (the four cores run concurrently, so the row alignment below is only for readability, and call_rcu is not necessarily invoked on every core):

cpu0              cpu1              cpu2              cpu3
tick interrupt    tick interrupt    tick interrupt    tick interrupt
  -->> rcu_pending
  -->> rcu_check_callbacks
tasklet           tasklet           tasklet           tasklet
  -->> rcu_process_callbacks

call_rcu(x)       call_rcu(x)       call_rcu(x)       call_rcu(x)
  -->> list_add_tail(&head->list, &RCU_nxtlist(cpu))

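For reference, call_rcu in 2.5.43 does essentially nothing more than record the callback and append it to the calling CPU's nxtlist with interrupts disabled. The sketch below is close to, but not guaranteed to be byte-for-byte identical with, the 2.5.43 source:

void call_rcu(struct rcu_head *head, void (*func)(void *arg), void *arg)
{
	int cpu;
	unsigned long flags;

	head->func = func;
	head->arg = arg;
	/* Queue onto this CPU's nxtlist; interrupts off so we do not race
	 * with the per-CPU tasklet splicing the list away. */
	local_irq_save(flags);
	cpu = smp_processor_id();
	list_add_tail(&head->list, &RCU_nxtlist(cpu));
	local_irq_restore(flags);
}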
Understanding the data structures

struct rcu_ctrlblk {
	spinlock_t	mutex;		/* Guard this struct                  */
	long		curbatch;	/* Current batch number.	      */
	long		maxbatch;	/* Max requested batch number.        */
	unsigned long	rcu_cpu_mask; 	/* CPUs that need to switch in order  */
					/* for current batch to proceed.      */
};
struct rcu_data {
	long		qsctr;		 /* User-mode/idle loop etc. */
        long            last_qsctr;	 /* value of qsctr at beginning */
                                         /* of rcu grace period */
        long  	       	batch;           /* Batch # for current RCU batch */
        struct list_head  nxtlist;
        struct list_head  curlist;
} ____cacheline_aligned_in_smp;
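The RCU_qsctr(cpu), RCU_batch(cpu), RCU_nxtlist(cpu) and friends that appear throughout the code are, as far as I can tell, just accessor macros over a per-CPU array of rcu_data, along these lines (a sketch; see the 2.5.43 rcupdate headers for the exact definitions):

/* Sketch: one rcu_data per possible CPU, plus accessor macros. */
static struct rcu_data rcu_data[NR_CPUS] __cacheline_aligned;

#define RCU_qsctr(cpu)		(rcu_data[(cpu)].qsctr)
#define RCU_last_qsctr(cpu)	(rcu_data[(cpu)].last_qsctr)
#define RCU_batch(cpu)		(rcu_data[(cpu)].batch)
#define RCU_nxtlist(cpu)	(rcu_data[(cpu)].nxtlist)
#define RCU_curlist(cpu)	(rcu_data[(cpu)].curlist)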

To me, rcu_ctrlblk is a bit like the conductor of an orchestra, directing the individual players (the per-CPU rcu_data); each player has to watch the conductor's beat (the tick interrupt) and keep moving forward.

Each bit of rcu_cpu_mask corresponds to one rcu_data (one CPU).

curbatch is the batch currently being processed; it advances by 1 each time a grace period elapses. It is mainly used together with each rcu_data's batch to decide when callbacks may run.

My own take: until the grace period expires, a rcu_data's batch is always exactly curbatch + 1.

maxbatch, likewise, is just curbatch + 1.
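The interplay between rcu_cpu_mask and curbatch is easiest to see in the quiescent-state check run from each CPU's tasklet: the first time a CPU notices its bit is set it snapshots its qsctr; once qsctr has advanced (a context switch, user mode or idle happened), it clears its bit, and the CPU that clears the last bit ends the grace period and advances curbatch. Below is a simplified sketch of that bookkeeping, not the verbatim kernel function; RCU_QSCTR_INVALID stands for the "no snapshot taken yet" sentinel, and rcu_start_batch is the function quoted further down:

static void rcu_check_quiescent_state_sketch(int cpu)
{
	/* Nothing to do if this CPU already reported for the current batch. */
	if (!test_bit(cpu, &rcu_ctrlblk.rcu_cpu_mask))
		return;

	/* First visit in this grace period: snapshot qsctr and wait. */
	if (RCU_last_qsctr(cpu) == RCU_QSCTR_INVALID) {
		RCU_last_qsctr(cpu) = RCU_qsctr(cpu);
		return;
	}
	/* qsctr has not moved since the snapshot: no quiescent state yet. */
	if (RCU_qsctr(cpu) == RCU_last_qsctr(cpu))
		return;

	spin_lock(&rcu_ctrlblk.mutex);
	clear_bit(cpu, &rcu_ctrlblk.rcu_cpu_mask);
	RCU_last_qsctr(cpu) = RCU_QSCTR_INVALID;
	if (rcu_ctrlblk.rcu_cpu_mask == 0) {
		/* Last CPU checked in: this grace period is over. */
		rcu_ctrlblk.curbatch++;
		/* Kick off the next grace period if one was requested. */
		rcu_start_batch(rcu_ctrlblk.maxbatch);
	}
	spin_unlock(&rcu_ctrlblk.mutex);
}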

batch and maxbatch can be equal. In the code below, when this CPU starts a new grace period it calls rcu_start_batch to update maxbatch.

/*
		 * start the next batch of callbacks
		 */
		spin_lock(&rcu_ctrlblk.mutex);
		RCU_batch(cpu) = rcu_ctrlblk.curbatch + 1;
		rcu_start_batch(RCU_batch(cpu));
		spin_unlock(&rcu_ctrlblk.mutex);

static void rcu_start_batch(long newbatch)
{
	if (rcu_batch_before(rcu_ctrlblk.maxbatch, newbatch)) {
		rcu_ctrlblk.maxbatch = newbatch;
	}
	if (rcu_batch_before(rcu_ctrlblk.maxbatch, rcu_ctrlblk.curbatch) ||
	    (rcu_ctrlblk.rcu_cpu_mask != 0)) {
		return;
	}
	rcu_ctrlblk.rcu_cpu_mask = cpu_online_map;
}
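rcu_batch_before (and its counterpart rcu_batch_after) compare batch numbers via signed subtraction rather than a plain <, so the ordering still comes out right after the counters wrap around. Something like the following (a sketch; the 2.5.43 helpers may be spelled slightly differently):

/* "a comes before b" / "a comes after b", tolerant of counter wrap. */
static inline int rcu_batch_before(long a, long b)
{
	return (a - b) < 0;
}

static inline int rcu_batch_after(long a, long b)
{
	return (a - b) > 0;
}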

rcu_data

The batch value on each core can differ. How large can the difference get? Is it equal to the number of cores?

New callbacks are only ever appended to the nxtlist list; the callbacks on curlist are executed once the grace period expires.

It would be even better to verify this with a test.
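To make the nxtlist/curlist handoff concrete, the work done by each CPU's RCU tasklet can be summarized as below. This is a simplified sketch of what rcu_process_callbacks does, reusing the accessors and the quiescent-state sketch from above; it is not the verbatim 2.5.43 source, and in particular the exact batch comparison may be spelled differently there:

static void rcu_process_callbacks_sketch(int cpu)
{
	LIST_HEAD(done);
	struct rcu_head *head;

	/* 1. If curbatch has caught up with this CPU's batch, the grace
	 *    period for curlist has elapsed: detach it for execution. */
	if (!list_empty(&RCU_curlist(cpu)) &&
	    !rcu_batch_before(rcu_ctrlblk.curbatch, RCU_batch(cpu))) {
		list_splice(&RCU_curlist(cpu), &done);
		INIT_LIST_HEAD(&RCU_curlist(cpu));
	}

	/* 2. Promote newly queued callbacks (nxtlist -> curlist) and ask
	 *    for a new grace period with batch = curbatch + 1. */
	local_irq_disable();
	if (!list_empty(&RCU_nxtlist(cpu)) && list_empty(&RCU_curlist(cpu))) {
		list_splice(&RCU_nxtlist(cpu), &RCU_curlist(cpu));
		INIT_LIST_HEAD(&RCU_nxtlist(cpu));
		local_irq_enable();

		spin_lock(&rcu_ctrlblk.mutex);
		RCU_batch(cpu) = rcu_ctrlblk.curbatch + 1;
		rcu_start_batch(RCU_batch(cpu));
		spin_unlock(&rcu_ctrlblk.mutex);
	} else {
		local_irq_enable();
	}

	/* 3. Report a quiescent state if one happened, then run the
	 *    callbacks whose grace period has expired. */
	rcu_check_quiescent_state_sketch(cpu);
	while (!list_empty(&done)) {
		head = list_entry(done.next, struct rcu_head, list);
		list_del(&head->list);
		head->func(head->arg);
	}
}

Because each CPU promotes its own nxtlist at its own pace, two CPUs can easily hold different batch values at the same moment, which is exactly the question raised above.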

Bug: the condition hardirq_count() <= 1 should be hardirq_count() <= (1 << HARDIRQ_SHIFT); this was fixed in 2.5.45. Because rcu_check_callbacks runs inside the timer interrupt handler, the condition hardirq_count() <= 1 can never be true.

void rcu_check_callbacks(int cpu, int user)
{
	if (user || 
	    (idle_cpu(cpu) && !in_softirq() && hardirq_count() <= 1))
		RCU_qsctr(cpu)++;
	tasklet_schedule(&RCU_tasklet(cpu));
}
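Why the original check can never be true: in the 2.5 series the interrupt nesting counters were folded into thread_info->preempt_count, with the hardirq count stored in a bit-field well above bit 0, so inside any interrupt handler hardirq_count() is at least 1 << HARDIRQ_SHIFT. A sketch of the layout (the exact masks are from memory of that era's asm/hardirq.h and may differ slightly):

/*
 * Rough preempt_count layout in the 2.5 series:
 *   bits  0-7   PREEMPT  - preempt_disable() nesting
 *   bits  8-15  SOFTIRQ  - softirq nesting
 *   bits 16-27  HARDIRQ  - hardirq nesting
 */
#define HARDIRQ_SHIFT	16
#define HARDIRQ_MASK	(0x0fffUL << HARDIRQ_SHIFT)

#define hardirq_count()	(preempt_count() & HARDIRQ_MASK)

/*
 * rcu_check_callbacks() runs from the timer interrupt, so the hardirq
 * field is already 1 and hardirq_count() == (1 << HARDIRQ_SHIFT).  Hence
 *   hardirq_count() <= 1                     never holds there, while
 *   hardirq_count() <= (1 << HARDIRQ_SHIFT)  holds exactly when the timer
 *                                            interrupt is the only hardirq
 *                                            on the stack.
 */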
ChangeLog-2.5.45:
<[email protected]>
	[PATCH] RCU idle detection fix
	
	Patch from Dipankar Sarma <[email protected]>
	
	There is a check in RCU for idle CPUs which signifies quiescent state
	(and hence no reference to RCU protected data) which was broken when
	interrupt counters were changed to use thread_info->preempt_count.
	
	Martin's 32 CPU machine with many idle CPUs was not completing any RCU
	grace period because RCU was forever waiting for idle CPUs to context
	switch.  Had the idle check worked, this would not have happened.  With
	no RCU happening, the dentries were getting "freed" (dentry stats
	showing that) but not getting returned to slab.  This would not show up
	in systems that are generally busy as context switches then would
	happen in all CPUs and the per-CPU quiescent state counter would get
	incremented during context switch.
patch-2.5.45:  
 void rcu_check_callbacks(int cpu, int user)
 {
 	if (user || 
-	    (idle_cpu(cpu) && !in_softirq() && hardirq_count() <= 1))
+	    (idle_cpu(cpu) && !in_softirq() && 
+				hardirq_count() <= (1 << HARDIRQ_SHIFT)))
 		RCU_qsctr(cpu)++;
 	tasklet_schedule(&RCU_tasklet(cpu));
 }
 

A few more related links:

Linux 2.6.11: the classic RCU implementation:

http://www.wowotech.net/kernel_synchronization/linux2-6-11-RCU.html

The RCU author's classic web page:

http://www2.rdrop.com/users/paulmck/RCU/

RCU-related articles published by the author:

https://lwn.net/Kernel/Index/#Read-copy-update