Reflections on the first classic RCU implementation, based on kernel 2.5.43
I have recently been studying the RCU mechanism and wanted to understand it in depth starting from its historical origin (tracing things back to the source brings unexpected rewards; at the very least, following the code's evolution lets you glimpse how the masters thought).
As it turns out, many people online have had the same idea. Special thanks to this article:
http://www.wowotech.net/kernel_synchronization/Linux-2-5-43-RCU.html
That article explains things in great detail and covers the topic comprehensively; interested readers can go read it directly.
What follows are just some thoughts of my own.
Assume a 4-core system. RCU processing then takes place at the following points (the four columns run concurrently; the per-row alignment is purely cosmetic, and call_rcu is not necessarily invoked on every core):
cpu0           | cpu1           | cpu2           | cpu3
tick interrupt | tick interrupt | tick interrupt | tick interrupt
  -->>rcu_pending
  -->>rcu_check_callbacks
tasklet        | tasklet        | tasklet        | tasklet
  -->>rcu_process_callbacks
call_rcu(x)    | call_rcu(x)    | call_rcu(x)    | call_rcu(x)
  -->>list_add_tail(&head->list, &RCU_nxtlist(cpu))
Understanding the data structures
struct rcu_ctrlblk {
	spinlock_t	mutex;		/* Guard this struct */
	long		curbatch;	/* Current batch number. */
	long		maxbatch;	/* Max requested batch number. */
	unsigned long	rcu_cpu_mask;	/* CPUs that need to switch in order */
					/* for current batch to proceed. */
};

struct rcu_data {
	long		qsctr;		/* User-mode/idle loop etc. */
	long		last_qsctr;	/* value of qsctr at beginning */
					/* of rcu grace period */
	long		batch;		/* Batch # for current RCU batch */
	struct list_head nxtlist;
	struct list_head curlist;
} ____cacheline_aligned_in_smp;
To my mind, rcu_ctrlblk is rather like the conductor of an orchestra, directing the individual players (the per-CPU rcu_data structures), each of whom must keep watching the conductor's beat (the tick interrupt) and press forward.
Each bit of rcu_cpu_mask represents one rcu_data (i.e., one CPU).
curbatch is the batch currently being processed; it advances by 1 every time a Grace Period completes, and it works together with each rcu_data's batch to drive callback processing.
In my view, before a Grace Period expires, a rcu_data's batch is always curbatch + 1.
maxbatch, likewise, is just curbatch + 1.
batch and maxbatch may well be equal. In the code below, when the local CPU starts a new Grace Period it calls rcu_start_batch to update maxbatch.
/*
* start the next batch of callbacks
*/
spin_lock(&rcu_ctrlblk.mutex);
RCU_batch(cpu) = rcu_ctrlblk.curbatch + 1;
rcu_start_batch(RCU_batch(cpu));
spin_unlock(&rcu_ctrlblk.mutex);
static void rcu_start_batch(long newbatch)
{
	if (rcu_batch_before(rcu_ctrlblk.maxbatch, newbatch)) {
		rcu_ctrlblk.maxbatch = newbatch;
	}
	if (rcu_batch_before(rcu_ctrlblk.maxbatch, rcu_ctrlblk.curbatch) ||
	    (rcu_ctrlblk.rcu_cpu_mask != 0)) {
		return;
	}
	rcu_ctrlblk.rcu_cpu_mask = cpu_online_map;
}
rcu_data
The batch value on each core may well differ; what is the maximum possible difference? Is it equal to the number of cores?
New callbacks are only ever inserted into the nxtlist list; the curlist list is executed when the Grace Period expires.
It would be even better to verify this with a test.
Bug: the condition hardirq_count() <= 1 should be hardirq_count() <= (1 << HARDIRQ_SHIFT); this was fixed in version 2.5.45. Since rcu_check_callbacks runs inside the timer's interrupt handler, the condition hardirq_count() <= 1 can never be true.
void rcu_check_callbacks(int cpu, int user)
{
	if (user ||
	    (idle_cpu(cpu) && !in_softirq() && hardirq_count() <= 1))
		RCU_qsctr(cpu)++;
	tasklet_schedule(&RCU_tasklet(cpu));
}
ChangeLog-2.5.45:
<[email protected]>
[PATCH] RCU idle detection fix
Patch from Dipankar Sarma <[email protected]>
There is a check in RCU for idle CPUs which signifies quiescent state
(and hence no reference to RCU protected data) which was broken when
interrupt counters were changed to use thread_info->preempt_count.
Martin's 32 CPU machine with many idle CPUs was not completing any RCU
grace period because RCU was forever waiting for idle CPUs to context
switch. Had the idle check worked, this would not have happened. With
no RCU happening, the dentries were getting "freed" (dentry stats
showing that) but not getting returned to slab. This would not show up
in systems that are generally busy as context switches then would
happen in all CPUs and the per-CPU quiescent state counter would get
incremented during context switch.
patch-2.5.45:
 void rcu_check_callbacks(int cpu, int user)
 {
 	if (user ||
-	    (idle_cpu(cpu) && !in_softirq() && hardirq_count() <= 1))
+	    (idle_cpu(cpu) && !in_softirq() &&
+	     hardirq_count() <= (1 << HARDIRQ_SHIFT)))
 		RCU_qsctr(cpu)++;
 	tasklet_schedule(&RCU_tasklet(cpu));
 }
A few more related references:
Linux 2.6.11: the classic RCU implementation
http://www.wowotech.net/kernel_synchronization/linux2-6-11-RCU.html
The RCU author's classic page:
http://www2.rdrop.com/users/paulmck/RCU/
RCU articles published by the author: