Linux Source Code Walkthrough (15): Red-Black Trees in the Kernel: The CFS Scheduler

阿新 • Published: 2022-01-16

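1. Process scheduling is one of the most central functions of a modern operating system. The scheduler in Linux 0.11 was crude: it walked the task_struct array and ran the process with the largest remaining time slice (counter). That strategy clearly cannot serve today's increasingly complex workloads, so further scheduling policies were added over time. The five best-known ones today are cfs, idle, deadline, realtime and stop, and all five coexist in the kernel; more may be added later. How can the existing and future policies be managed uniformly? Starting with 2.6.23 the kernel introduced sched_class, shown below: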

struct sched_class {
    const struct sched_class *next;
     
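    /*
     * 1. Every member below is a "method", expressed as a function pointer.
     * 2. There are several scheduling policies (cfs, rt, idle, deadline, ...)
     *    and each one is implemented differently; this struct is the common
     *    interface that every policy fills in (much like a driver's
     *    struct file_operations).
     */
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); /* add a task to the run queue; for cfs, insert a node into the red-black tree */
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); /* remove a task from the run queue; for cfs, delete a node from the red-black tree */
    void (*yield_task) (struct rq *rq); /* the current task gives up the CPU */
    bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt); /* give up the CPU in favour of task p */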

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

    /*
     * It is the responsibility of the pick_next_task() method that will
     * return the next task to call put_prev_task() on the @prev task or
     * something equivalent.
     *
     * May return RETRY_TASK when it finds a higher prio class has runnable
     * tasks.
     */
    struct task_struct * (*pick_next_task) (struct rq *rq,
                        struct task_struct *prev,
                        struct pin_cookie cookie);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
    int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
    void (*migrate_task_rq)(struct task_struct *p);

    void (*task_woken) (struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                 const struct cpumask *newmask);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);
#endif

    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork) (struct task_struct *p);
    void (*task_dead) (struct task_struct *p);

    /*
     * The switched_from() call is allowed to drop rq->lock, therefore we
     * cannot assume the switched_from/switched_to pair is serliazed by
     * rq->lock. They are however serialized by p->pi_lock.
     */
    void (*switched_from) (struct rq *this_rq, struct task_struct *task);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                 int oldprio);

    unsigned int (*get_rr_interval) (struct rq *rq,
                     struct task_struct *task);

    void (*update_curr) (struct rq *rq);

#define TASK_SET_GROUP  0
#define TASK_MOVE_GROUP    1

#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*task_change_group) (struct task_struct *p, int type);
#endif
};

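This struct abstracts every operation involved in scheduling into a function pointer. Each scheduling algorithm supplies its own implementation, and the core kernel simply calls through the pointers to use whichever policy applies. It is the same idea as file_operations in device drivers: each vendor's driver implements the functions, but the interface names stay the same.

[Figure: relationships among the scheduling classes/instances and their source files]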
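To make the function-pointer idea concrete, here is a minimal user-space sketch of the same pattern. It is not kernel code: toy_rq, toy_class, the three pick_* functions and pick_from_classes are all hypothetical names made up for illustration, and a "task" is reduced to an integer id.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's struct rq. */
struct toy_rq {
    int nr_rt;    /* number of runnable "real-time" tasks */
    int nr_fair;  /* number of runnable "fair" tasks */
};

/* Hypothetical stand-in for struct sched_class: a link plus one method. */
struct toy_class {
    const struct toy_class *next;             /* next lower-priority class */
    const char *name;
    int (*pick_next_task)(struct toy_rq *rq); /* task id, or -1 if nothing runnable */
};

static int pick_rt(struct toy_rq *rq)   { return rq->nr_rt   > 0 ? 100 : -1; }
static int pick_fair(struct toy_rq *rq) { return rq->nr_fair > 0 ? 200 : -1; }
static int pick_idle(struct toy_rq *rq) { (void)rq; return 0; } /* idle can always run */

/* The classes are chained from highest to lowest priority via ->next. */
static const struct toy_class idle_class = { NULL,        "idle", pick_idle };
static const struct toy_class fair_class = { &idle_class, "fair", pick_fair };
static const struct toy_class rt_class   = { &fair_class, "rt",   pick_rt };

/* The caller only sees the interface; it never needs to know which policy answered. */
static int pick_from_classes(struct toy_rq *rq)
{
    const struct toy_class *c;

    for (c = &rt_class; c != NULL; c = c->next) {
        int id = c->pick_next_task(rq);
        if (id >= 0)
            return id;
    }
    return -1; /* not reached here, because the idle class always answers */
}
```

With only fair tasks runnable, pick_from_classes() returns the fair class's answer; as soon as nr_rt becomes non-zero, the rt class answers first. Adding a new policy means adding one more toy_class instance to the chain, without touching the caller.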

2. Before getting to CFS, a quick summary of the earlier Linux schedulers and their background:

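(1) Time-slice-based scheduling, also known as the O(n) scheduler: every scheduling decision walks all task_structs to find the one with the largest time slice. With many processes the task list becomes long and the walk alone is expensive; the time complexity is O(n), where n is the number of task_structs. It has other obvious weaknesses as well: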

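Poor scalability on SMP systems, because accessing the run queue requires taking a lock
Real-time processes cannot be scheduled immediately
CPUs may sit idle
Processes bounce between CPUs, which hurts performance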

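(2) The scheduler above is slow mainly because every decision walks all task_structs looking for the largest time slice, which makes it O(n); tasks were also not organized by priority. How can both points be improved? This led to the O(1) algorithm. In short, every task is placed on a queue according to its priority, and the highest-priority non-empty queue is scheduled first. The prio_array structure was introduced to support this: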

#define MAX_USER_RT_PRIO    100
#define MAX_RT_PRIO         MAX_USER_RT_PRIO
#define MAX_PRIO            (MAX_RT_PRIO + 40) /* 140 priority levels (0 ~ 139; a smaller value means a higher priority) */

#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long))

struct prio_array {
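    int nr_active; /* total number of tasks across all priority queues */
    unsigned long bitmap[BITMAP_SIZE]; /* one bit per priority queue, set when that queue is non-empty, so a non-empty queue can be found quickly */
    struct list_head queue[MAX_PRIO]; /* array of priority queues, one list per priority; e.g. element 0 holds the tasks with priority 0 */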
};

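[Figure: the bitmap and its priority queues; in the example, queues 2 and 6 are non-empty]

The scheduler first scans the bitmap to find a non-empty queue. The bitmap has a fixed size, so the scan takes bounded time no matter how many tasks exist, hence O(1). Because a lower value means a higher priority, the scan starts at bit 0 and stops at the first set bit, which saves more time. Overall this is far more efficient than the O(n) scheduler. To summarize: the essence of the O(1) scheduler is to sort the mass of tasks into per-priority queues and run from the highest-priority queue first, instead of scanning everything indiscriminately as the O(n) scheduler does. It is a classic space-for-time trade-off.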
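The bitmap lookup can be sketched in user space as follows. This is only an illustration under simplifying assumptions: the toy_* names are hypothetical, each queue is reduced to a counter instead of a list_head, and the bit search is a plain portable loop, whereas the kernel relies on an optimized find-first-bit helper.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_MAX_PRIO      140
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_BITMAP_LONGS  ((TOY_MAX_PRIO + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

/* Hypothetical stand-in for prio_array; counters replace the per-priority lists. */
struct toy_prio_array {
    unsigned long bitmap[TOY_BITMAP_LONGS]; /* bit p set <=> queue p is non-empty */
    int nr_queued[TOY_MAX_PRIO];
};

static void toy_enqueue(struct toy_prio_array *a, int prio)
{
    a->nr_queued[prio]++;
    a->bitmap[prio / TOY_BITS_PER_LONG] |= 1UL << (prio % TOY_BITS_PER_LONG);
}

static void toy_dequeue(struct toy_prio_array *a, int prio)
{
    /* Clear the bit only when the last task leaves this queue. */
    if (--a->nr_queued[prio] == 0)
        a->bitmap[prio / TOY_BITS_PER_LONG] &= ~(1UL << (prio % TOY_BITS_PER_LONG));
}

/*
 * Return the lowest set bit, i.e. the highest-priority non-empty queue,
 * or -1 if every queue is empty. The work is bounded by the bitmap size,
 * not by the number of tasks.
 */
static int toy_first_prio(const struct toy_prio_array *a)
{
    size_t w;

    for (w = 0; w < TOY_BITMAP_LONGS; w++) {
        unsigned long word = a->bitmap[w];
        unsigned int bit;

        if (word == 0)
            continue;
        for (bit = 0; (word & 1UL) == 0; bit++)
            word >>= 1;
        return (int)(w * TOY_BITS_PER_LONG + bit);
    }
    return -1;
}
```

Matching the figure's example, after enqueuing tasks at priorities 2 and 6 the lookup returns 2; once queue 2 drains it returns 6.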

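Compared with the O(n) scheduler, O(1) is a big improvement, but it is not flawless either (otherwise no later scheduler would have been needed). For example:

An interactive task that wants to run again has to wait until all tasks on the current queue have run. For instance, a process blocks waiting for user input, but it is not woken the moment the input arrives; it resumes only after the other tasks on the same queue have had their turn. Interaction can become sluggish and the user perceives stutter.
There is no guarantee that, within a given interval, the CPU time each task receives is proportional to its priority. This follows from the previous problem: say process A has a higher priority and is given 20 ms, while B is given 10 ms. If A blocks, the CPU switches to B, and A continues only after B has finished. A's higher priority then exists in name only.

CFS was created to solve these problems.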

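3. (1) Whatever the scheduling method, the first step is locating the process's task_struct. Applications vary widely, so the operating system is constantly creating, running and destroying processes; process states change all the time, and so do weights and priorities. With so many things changing, how can these processes be managed quickly and efficiently? Earlier schedulers relied mostly on linked lists, placing each process on a queue according to its state, priority and so on. Lists have a serious weakness: they can only be traversed sequentially, so finding a position or keeping the list ordered costs O(n). A red-black tree instead arranges the same elements as a self-balancing binary search tree ordered by key, rather than laying them out flat, so insertion, deletion and lookup all take O(log n). That raises the next question: since a red-black tree must be ordered by some key, which value is suitable? The Linux developers chose vruntime, computed as follows:

vruntime = vruntime + actual run time * 1024 / load weight of this process

Note that vruntime accumulates. The actual run time is the CPU time the process consumed while running. How is the weight obtained? Through a lookup table indexed by the nice value. The nice value acts like a priority and ranges from -20 to 19 in steps of 1; the table below has one entry per nice value, and the row labels (-20, -15, ..., 15) mark the nice value of the first entry in each row. Once the weight is found, it is plugged into the formula to compute vruntime.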

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
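As a worked example of the formula, the sketch below plugs the table into a small user-space function. The toy_* names are hypothetical, and the plain 64-bit division is a simplification for readability; the kernel's calc_delta_fair() avoids the division by using precomputed inverse weights.

```c
#include <stdint.h>

#define TOY_NICE_0_WEIGHT 1024

/* Copy of the weight table quoted above, indexed by nice + 20. */
static const int toy_prio_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,
    29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,
     3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,
      335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,
       36,    29,    23,    18,    15,
};

/* Map a nice value in [-20, 19] to its load weight. */
static int toy_weight_of(int nice)
{
    return toy_prio_to_weight[nice + 20];
}

/*
 * vruntime increment for delta_exec_ns nanoseconds of real run time:
 * delta * 1024 / weight. A heavier (lower nice) task accrues vruntime
 * more slowly, so it stays further left in the tree and runs more often.
 */
static uint64_t toy_calc_delta_fair(uint64_t delta_exec_ns, int nice)
{
    return delta_exec_ns * TOY_NICE_0_WEIGHT / (uint64_t)toy_weight_of(nice);
}
```

For a nice 0 task the weight is 1024, so vruntime advances at exactly the real rate. For a nice 19 task (weight 15) the same run time advances vruntime roughly 68 times faster, which is why such a task gets the CPU far less often.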

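A smaller vruntime means the process has used less CPU time, or that it has a larger weight, so it should run sooner. The red-black tree is therefore ordered by the vruntime of all runnable processes; the leftmost node has the smallest vruntime and is the one to run next. As processes run, or as weights are adjusted, vruntime keeps changing and the tree must be updated dynamically. Because a red-black tree rebalances in O(log n), such updates are much cheaper than keeping a linked list sorted, and this is the fundamental reason CFS uses it. One characteristic of CFS is now apparent: there is no fixed time slice; tasks are ordered by virtual run time, which is derived from actual run time, and the scheduler picks accordingly.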

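(2) With the principle covered, it is time to see how the Linux kernel implements it. As in other subsystems, CFS is built on a handful of structures; the core ones are listed below.

The first is of course task_struct. It carries a descriptor for each scheduler, along with a sched_class pointer that identifies which scheduling policy the process uses. The sched_entity member is the key to building the red-black tree.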

struct task_struct {
    .......
    int prio, static_prio, normal_prio;
    unsigned int rt_priority;
    const struct sched_class *sched_class; /* the scheduling class instance this task uses */
    struct sched_entity se; /* cfs scheduling entity; contains the rb_node */
    struct sched_rt_entity rt; /* real-time scheduling entity */
#ifdef CONFIG_CGROUP_SCHED
    struct task_group *sched_task_group;
#endif
    struct sched_dl_entity dl; /* deadline scheduling entity */
    .......
};

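The key structure that makes up the red-black tree is sched_entity. Its run_node field is the rb_node that links the entity into the run queue's tree of runnable entities (the entity currently running on the CPU is tracked separately through cfs_rq->curr):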

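struct sched_entity { /* cfs scheduling entity */
    struct load_weight    load;        /* for load-balancing */
    struct rb_node        run_node;   /* links this entity into the cfs_rq red-black tree */
    struct list_head    group_node;
    unsigned int        on_rq;
    /*
     * In principle any of the fields below could serve as the tree key.
     * 1. vruntime is the one chosen: entities are ordered by vruntime,
     *    smaller to the left, larger to the right.
     * 2. Entities with equal vruntime need no special handling; on insert
     *    an equal key simply goes to the right.
     */
    u64            exec_start;
    u64            sum_exec_runtime;
    u64            vruntime; /* the key the red-black tree is ordered by */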
    u64            prev_sum_exec_runtime;

    u64            nr_migrations;

#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    int            depth;
    struct sched_entity    *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq        *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq        *my_q;
#endif

#ifdef CONFIG_SMP
    /*
     * Per entity load average tracking.
     *
     * Put into separate cache line so it does not
     * collide with read-mostly values above.
     */
    struct sched_avg    avg ____cacheline_aligned_in_smp;
#endif
};

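Next is the structure that describes the CFS run queue itself. It holds the root of the red-black tree, the leftmost node (the one with the smallest vruntime), and a pointer to the scheduling entity currently running: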

/* CFS-related fields in a runqueue */
struct cfs_rq {
......
    struct rb_root tasks_timeline; /* root of the red-black tree */
    struct rb_node *rb_leftmost; /* cached leftmost node, i.e. the one with the smallest vruntime */

    /*
     * 'curr' points to currently running entity on this cfs_rq.
     * It is set to NULL otherwise (i.e when none are currently running).
     */
    struct sched_entity *curr, *next, *last, *skip;
......
};

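With so many structures it is hard to keep their relationships straight; a diagram makes it clearer:

[Figure: relationships among the structures above]

Once the structures are in place, the tree can be built through the rbtree API.

(3) Since the tree is ordered by vruntime, that value has to be updated continually. The update happens in update_curr() (kernel/sched/fair.c), shown below with comments added at the key points: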

/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec;

    if (unlikely(!curr))
        return;
    /* time the entity has run since exec_start was last updated */
    delta_exec = now - curr->exec_start;
    if (unlikely((s64)delta_exec <= 0))
        return;

    curr->exec_start = now; /* restart the measurement from now */

    schedstat_set(curr->statistics.exec_max,
              max(delta_exec, curr->statistics.exec_max));

    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq->exec_clock, delta_exec);

    curr->vruntime += calc_delta_fair(delta_exec, curr); /* advance vruntime by the weight-scaled run time */
    update_min_vruntime(cfs_rq);

    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);

        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }

    account_cfs_rq_runtime(cfs_rq, delta_exec);
}

Adding a node to the red-black tree:

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node; /* start at the root of the red-black tree */
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     */
    while (*link) {
        parent = *link;
        /* rb_entry() is container_of(): recover the sched_entity that embeds this rb_node */
        entry = rb_entry(parent, struct sched_entity, run_node);
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         */
        if (entity_before(se, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    if (leftmost)
        cfs_rq->rb_leftmost = &se->run_node;
    /* link the new node into the tree at the position found above */
    rb_link_node(&se->run_node, parent, link);
    /* recolor and rotate as needed to restore the red-black properties */
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}
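The insertion logic above can be tried out in user space with the sketch below. It keeps the three ideas that matter here (an embedded node recovered via the container_of technique, a wraparound-safe key comparison, and a cached leftmost pointer) but deliberately leaves out rebalancing, so it is a plain binary search tree rather than a red-black tree. All toy_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_node {
    struct toy_node *left, *right;
};

struct toy_entity {
    uint64_t vruntime;
    struct toy_node run_node;   /* embedded link, like sched_entity.run_node */
};

struct toy_cfs_rq {
    struct toy_node *root;
    struct toy_node *leftmost;  /* cached node with the smallest vruntime */
};

/* Recover the containing entity from its embedded node (the container_of idea). */
#define toy_entry(ptr) \
    ((struct toy_entity *)((char *)(ptr) - offsetof(struct toy_entity, run_node)))

/* Comparing the signed difference stays correct even if the 64-bit counter wraps. */
static int toy_entity_before(const struct toy_entity *a, const struct toy_entity *b)
{
    return (int64_t)(a->vruntime - b->vruntime) < 0;
}

/* Ordered insert without rebalancing; equal keys go to the right. */
static void toy_enqueue_entity(struct toy_cfs_rq *rq, struct toy_entity *se)
{
    struct toy_node **link = &rq->root;
    int leftmost = 1;

    while (*link) {
        struct toy_entity *entry = toy_entry(*link);

        if (toy_entity_before(se, entry)) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            leftmost = 0;       /* went right at least once: not the minimum */
        }
    }
    se->run_node.left = NULL;
    se->run_node.right = NULL;
    *link = &se->run_node;
    if (leftmost)
        rq->leftmost = &se->run_node;
}

/* O(1) access to the entity that should run next. */
static struct toy_entity *toy_pick_first(const struct toy_cfs_rq *rq)
{
    return rq->leftmost ? toy_entry(rq->leftmost) : NULL;
}
```

Inserting entities with vruntime 300, 100 and 200 leaves the one with 100 as the cached leftmost, so picking the next entity never has to walk the tree. The real kernel adds rb_insert_color() after linking so that the tree height, and therefore the insert cost, stays O(log n).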

The opposite operation, removing a node:

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    if (cfs_rq->rb_leftmost == &se->run_node) {
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node;
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}

(4) Once the tree is built, its most important job is finding the next process to schedule:

/*
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    struct sched_entity *left = __pick_first_entity(cfs_rq); /* leftmost node in the tree */
    struct sched_entity *se;

    /*
     * If curr is set we have to see if its left of the leftmost entity
     * still in the tree, provided there was anything in the tree at all.
     * That is: if left is NULL, or curr has the smaller vruntime, use curr.
     */
    if (!left || (curr && entity_before(curr, left)))
        left = curr;

    se = left; /* ideally we run the leftmost entity */

    /*
     * Avoid running the skip buddy, if running something else can
     * be done without getting too unfair.
     */
    if (cfs_rq->skip == se) {
        struct sched_entity *second;

        if (se == curr) {
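            second = __pick_first_entity(cfs_rq); /* leftmost node, i.e. the smallest vruntime in the tree */
        } else {
            second = __pick_next_entity(se); /* in-order successor of se: the next larger vruntime */
            if (!second || (curr && entity_before(curr, second)))
                second = curr;
        }
        /* use second only if running it would not be too unfair to left */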
        if (second && wakeup_preempt_entity(second, left) < 1)
            se = second;
    }

    /*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;

    /*
     * Someone really wants this to run. If it's not unfair, run it.
     */
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;

    clear_buddies(cfs_rq, se);

    return se;
}
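The repeated test "wakeup_preempt_entity(x, left) < 1" is easier to read once the three-way result is spelled out. The sketch below reflects my reading of that helper in kernels of this era and should be treated as an approximation: the toy_* names are hypothetical, and the fixed 1 ms granularity is an assumption, since the kernel derives the granularity from a tunable and from the entity's weight.

```c
#include <stdint.h>

struct toy_se {
    uint64_t vruntime;
};

/* Hypothetical fixed granularity in nanoseconds (an assumption for this sketch). */
#define TOY_WAKEUP_GRAN_NS 1000000

/*
 * Compare curr against se by vruntime:
 *  -1: se is not behind curr at all (se->vruntime >= curr->vruntime)
 *   0: se is behind curr, but by no more than the granularity
 *   1: se is behind curr by more than the granularity
 */
static int toy_wakeup_preempt_entity(const struct toy_se *curr, const struct toy_se *se)
{
    int64_t vdiff = (int64_t)(curr->vruntime - se->vruntime);

    if (vdiff <= 0)
        return -1;
    if (vdiff > TOY_WAKEUP_GRAN_NS)
        return 1;
    return 0;
}

/*
 * Mirrors the "< 1" checks in pick_next_entity(): a buddy (skip
 * replacement, last, or next) may run instead of the leftmost entity
 * as long as the leftmost entity is not owed more than one
 * granularity of virtual time relative to that buddy.
 */
static int toy_buddy_acceptable(const struct toy_se *buddy, const struct toy_se *left)
{
    return toy_wakeup_preempt_entity(buddy, left) < 1;
}
```

In other words, the buddy hints are honoured only while they stay within a bounded distance of the fairest choice; once a buddy has run too far ahead of the leftmost entity, the leftmost entity wins.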

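A short aside: there are several scheduling classes, and each implements its own pick_next_task method; pick_next_entity is the CFS-side helper that the fair class uses to choose an entity.

Summary:

1. A linked list is a rather naive structure for data that must stay ordered while changing frequently; where that is the requirement, prefer a red-black tree, whose insert, delete and lookup are far more efficient.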

References:

1. https://jishuin.proginn.com/p/763bfbd5f8d6 Linux process scheduling: key points

2. https://mp.weixin.qq.com/s?__biz=MzA3NzYzODg1OA==&mid=2648464309&idx=1&sn=9fc763d9233fbba6d40b69b1ef54aa8b&chksm=87660610b0118f060a4da0c64417e57e8cb35f4732043106fd1b1d9a3ad6134145e0e47f5a9b&scene=21#wechat_redirect The O(1) scheduling algorithm

3. https://blog.csdn.net/longwang155069/article/details/104457109 The Linux O(1) scheduler

4. http://www.wowotech.net/process_management/scheduler-history.html The O(1), O(n) and CFS schedulers

5. https://zhuanlan.zhihu.com/p/372441187 The CFS scheduling algorithm
