一個關於內聯優化和呼叫約定的Bug

阿新 • • 發佈：2019-10-23

很久沒有更新部落格了（部落格園怎麼還不更新後臺），前幾天在寫一個Linux 0.11的實驗 [1] 時遇到了一個奇葩的Bug，就在這簡單記錄一下除錯過程吧。

現象

這個實驗要求在Linux 0.11中實現簡單的訊號量 [2]，但在改動核心程式碼後執行測試程式總是報錯，例如：

/* pc_test.c */

#define   __LIBRARY__
#include <stdio.h>
#include <stdlib.h>
#include <semaphore.h>
#include <unistd.h>

_syscall2(long, sem_open, const char *, name, unsigned int, value);
_syscall1(int, sem_unlink, const char *, name);

int main(void)
{
    sem_t *mutex;
    if ((mutex = (sem_t *) sem_open("mutex", 1)) == (sem_t *)-1)
    {
        perror("opening mutex semaphore creating");
        return EXIT_FAILURE;
    }
 
    sem_unlink("mutex");
    
    return EXIT_SUCCESS;
}

提示為段錯誤：

定位

在核心實現訊號量的核心程式碼 sem.c 中插樁除錯，最終把發生段錯誤的位置定在尋找已存在訊號量的 find_sem 函式中：

/* 
以下注釋部分是semaphore.h中我定義的連結串列結構體

#define MAXSEMNAME 128
struct sem_t
{
    char m_name[MAXSEMNAME+1];
    unsigned long m_value;

    struct sem_t * m_prev;
    struct sem_t * m_next;

    struct task_struct * m_wait;
};

typedef struct sem_t sem_t;

#define SEM_FAILED ((sem_t *)-1)
*/

// Data structure optimization is possible here
sem_t _semHead={.m_name = "_semHead", .m_value = 0, .m_prev = NULL,\
                 .m_next = NULL, .m_wait = NULL};

sem_t *find_sem(const char* name)
{
    sem_t *tmpSemP = &_semHead;
    while (tmpSemP->m_next != NULL)
    {
        if (strcmp((tmpSemP->m_name), name) == 0)
        {
            return tmpSemP;
        }
        tmpSemP = tmpSemP->m_next;
    }
    return tmpSemP;
}

由於該函式中存在 P->member 這樣的解引用操作，很大概率就是P的值出了問題，所以就在P對應的操作附近加上 printk ，判斷是否進一步定位Bug：

sem_t *find_sem(const char* name)
{
    printk("Now we are in find_sem\n"); // DEBUG
    sem_t *tmpSemP = &_semHead;
    while (tmpSemP->m_next != NULL)
    {
        printk("find_sem: tmpSemp before strcmp: %p\n", tmpSemP);
        if (strcmp((tmpSemP->m_name), name) == 0)
        {
            printk("find_sem: tmpSemp after strcmp: %p\n", tmpSemP); // DEBUG
            printk("find_sem: return...\n\n"); // DEBUG
            return tmpSemP;
        }
        printk("find_sem: tmpSemp after strcmp: %p\n\n", tmpSemP); // DEBUG
        tmpSemP = tmpSemP->m_next;
    }
    printk("find_sem: return...\n\n"); // DEBUG
    return tmpSemP;
}

重新編譯核心，再次執行上面的 pc_test.c ，奇怪的事情發生了：

可以看到，第一次進入 find_sem 並沒有發生段錯誤，這是因為第一次呼叫 sem_open 的時候核心中還沒有訊號量，所以 tmpSemP->m_next != NULL 不成立，但是第二次和第三次進入 find_sem ，temSemP 的值卻在 strcmp(tmpSemP->m_name, name) 前後發生了改變。我們知道，C中的函式引數是“按值傳遞”的，如果編譯器真的把strcmp 按照C函式的規則編譯，那麼傳遞 m_name 的值， tmpSemP 的值是不可能改變的。所以現在的結論是， string.h 中定義的 strcmp 很可能出了問題。

復現

為了更好的分析和除錯，我將 string.h , semaphore.h 和 sem.c 中的 find_sem 關鍵程式碼拿出來，精簡後在使用者態進行Bug復現：

/* test.c */

#include <stdio.h>

// string.h
inline int strcmp(const char * cs,const char * ct)
{
register int __res ;
__asm__("cld\n"
    "1:\tlodsb\n\t"
    "scasb\n\t"
    "jne 2f\n\t"
    "testb %%al,%%al\n\t"
    "jne 1b\n\t"
    "xorl %%eax,%%eax\n\t"
    "jmp 3f\n"
    "2:\tmovl $1,%%eax\n\t"
    "jl 3f\n\t"
    "negl %%eax\n"
    "3:"
    :"=a" (__res):"D" (cs),"S" (ct));
return __res;
}

//semaphore.h
typedef struct sem_t
{
    char m_name[128];
    struct sem_t *m_next;
} sem_t;

//sem.c
int main(void)
{
    sem_t _semRear={.m_name = "_semRear", .m_next = (sem_t *)0};
    sem_t _semHead={.m_name = "_semHead", .m_next = &_semRear};
    sem_t *tmpSemP = &_semHead;
    char name[] = "test";

    while (tmpSemP->m_next != (sem_t *)0)
    {
        printf("1. tempSemP: %p\n", tmpSemP);
        if(!strcmp((tmpSemP->m_name), name))
            return 0;
        printf("2. tempSemP: %p\n", tmpSemP);

        tmpSemP = tmpSemP->m_next;
    }
    return 0;
}

Bug復現：

分析

我們首先分析一下 strcmp 的實現：

extern inline int strcmp(const char * cs,const char * ct)
{
register int __res ;            // 暫存器變數
__asm__("cld\n"                 // 清理方向位
    "1:\tlodsb\n\t"             // 將ds:[esi]存入al，esi++
    "scasb\n\t"                 // 比較al與es:[edi]，edi++
    "jne 2f\n\t"                // 若不等，向下跳轉到2標誌
    "testb %%al,%%al\n\t"       // 測試al暫存器
    "jne 1b\n\t"                // 若al不為0，則向上跳轉到1標誌
    "xorl %%eax,%%eax\n\t"      // 若al為零，則清空eax（返回值）
    "jmp 3f\n"                  // 向下跳轉到3標誌返回
    "2:\tmovl $1,%%eax\n\t"     // eax置為1
    "jl 3f\n\t"                 // 若上面的比較al更小，則這裡返回正值（1）
    "negl %%eax\n\t"            // 否則eax = -1 返回負值
    "3:"
    :"=a" (__res):"D" (cs),"S" (ct)); // 規定edi暫存器接收cs引數的值，esi接收ct引數的值，最終將eax的值輸出到__res暫存器變數中
return __res;                   // 返回__res
}

如上，為了效能優化， strcmp 使用了內聯優化（函式和彙編），是程式碼還是編譯器的鍋呢？拖入IDA，靜態分析一下：

編譯器忠實的保留了內聯彙編的語句。通過 __printf_chk 的引數，我們知道進入控制流進入 strcmp 之前和之後編譯器都把 tempSemP 放在暫存器 edi 中，並且由於訊號量結構體的第一個成員就是 m_name :

//semaphore.h
typedef struct sem_t
{
    char m_name[128];
    struct sem_t *m_next;
} sem_t;

而 m_name 又是一個數組名，所以 tmpSemP->m_name 和 tmpSemP 就值而言是相同的。由於內聯彙編規定使用 edi 作為第一個引數的輸入暫存器，所以編譯器為了優化，首先就將 tempSemP 放在暫存器 edi ，這樣後面進入 strcmp 的時候就不需要再次改變 edi 了。

但是，內聯彙編的程式碼中明明有 scasb [3] ，其會在比較操作後更改 edi 的值，難道編譯器不知道嗎？通過查閱GCC文件關於內聯彙編的說明 [4]：

asm asm-qualifiers ( AssemblerTemplate 
              : OutputOperands 
              [ : InputOperands
              [ : Clobbers ] ])
6.47.2.6 Clobbers and Scratch Registers

While the compiler is aware of changes to entries listed in the output operands, the inline asm code may modify more than just the outputs. For example, calculations may require additional registers, or the processor may overwrite a register as a side effect of a particular assembler instruction. In order to inform the compiler of these changes, list them in the clobber list. Clobber list items are either register names or the special clobbers (listed below). Each clobber list item is a string constant enclosed in double quotes and separated by commas.

Clobber descriptions may not in any way overlap with an input or output operand….

文件說明了對於彙編語句中被修改但是不在 InputOperands中的暫存器，應該在 Clobbers 中寫出，不然編譯器不知道哪些暫存器（Bug這裡是 edi ）被修改，也就可能在優化的過程中出錯了。

回到 strcmp 的程式碼，最後一行是:"=a" (__res):"D" (cs),"S" (ct)); ，而scasb 與 lodsb [5] 修改的又是 edi , esi 。根據上面文件的說明， clobbers 不能與輸入輸出位置的運算元重複，所以如果這裡在 clobbers 的位置放上 edi , esi 就會報錯：

（這個程式設計師）為了編譯通過，在 clobbers 的位置便沒有放上 edi , esi ，大部分情況下都沒有問題，但是如果編譯器在優化的過程中依賴於 strcmp 不改變 edi , esi ，就可能出現Bug。

試驗

現在我們從理論上發現了Bug的成因，下面我們做個試驗驗證一下。由於該Bug是因為tmpSemP->m_name 和 tmpSemP 就值而言是相同，才導致 tmpSemP 變數中間儲存和 tmpSemP->m_name 傳參使用了相同的暫存器 edi ,我們可以改變結構體成員的排列，避免這種特定的優化方式，應該就會在測試程式中避免bug，例如：

typedef struct sem_t
{
    struct sem_t * m_next;
    char m_name[128];
} sem_t;

再次執行，報錯消失：

再次在IDA中觀察：

可見，這裡在呼叫第一個 __printf_chk 的時候 tempSemP 是放在 ecx 而非 edi 中，而第二個 __printf_chk 是使用之前放在 edx 中的 tempSemP 而非 edi ，確實避免了這種優化。

但是，一個新的問題出現了，根據x86呼叫約定（Calling Convention）， ecx 和 edx 是 Caller-saved (volatile) registers [6] ，即呼叫者不能依賴被呼叫函式保證它們的值不變，那 GCC 為什麼就使用這兩個暫存器作為 strcmp 呼叫前後 tempSemP 的值呢？

其實，在 GCC 文件中對於 inline function 提到了這麼一句 [7]：

This combination of inline and extern has almost the effect of a macro. The way to use it is to put a function definition in a header file with these keywords, and put another copy of the definition (lacking inline and extern) in a library file. The definition in the header file will cause most calls to the function to be inlined. If any uses of the function remain, they will refer to the single copy in the library.

也就是說，在使用 inline 和 extern 修飾的函式時，GCC將其幾乎（almost）和巨集一樣處理，可能也就不在更具呼叫約定優化了。

解決

解決思路有兩種。

一是告知編譯器哪些暫存器不能依賴（volatile），或者直接使用非彙編的寫法，讓編譯器去安排。例如我們可以建立一個 string_fix.h ，在C上實現實現一個 strCmp ：

#ifndef _STRING_FIX_H_
#define _STRING_FIX_H_

/*
 * This header file is for fixing bugs caused by inline assembly
 * in string.h.
 */

int strCmp(const char* s1, const char* s2)
{
    while(*s1 && (*s1 == *s2))
    {
        s1++;
        s2++;
    }
    return *(const unsigned char*)s1 - *(const unsigned char*)s2;
}

#endif

二是手動在原來的內聯彙編中儲存被修改的暫存器，例如：

extern inline int strcmp(const char * cs,const char * ct)
{
register int __res ;
__asm__("push %%edi\n\tpush %%esi\n\t"
    "cld\n\t"
    "1:\tlodsb\n\t"
    "scasb\n\t"
    "jne 2f\n\t"
    "testb %%al,%%al\n\t"
    "jne 1b\n\t"
    "xorl %%eax,%%eax\n\t"
    "jmp 3f\n"
    "2:\tmovl $1,%%eax\n\t"
    "jl 3f\n\t"
    "negl %%eax\n\t"
    "3:\n\t"
    "pop %%esi\n\tpop %%edi\n"
    :"=a" (__res):"D" (cs),"S" (ct));
return __res;
}

測試及後續不再展示。

後記

這真的是Linus Torvalds [8] 寫的程式碼嗎？我試著在網上找到了一份看似權威的程式碼 [9]，結果其中的 strcmp 如下：


extern inline int strcmp(const char * cs,const char * ct)
{
register int __res __asm__("ax");
__asm__("cld\n"
    "1:\tlodsb\n\t"
    "scasb\n\t"
    "jne 2f\n\t"
    "testb %%al,%%al\n\t"
    "jne 1b\n\t"
    "xorl %%eax,%%eax\n\t"
    "jmp 3f\n"
    "2:\tmovl $1,%%eax\n\t"
    "jl 3f\n\t"
    "negl %%eax\n"
    "3:"
    :"=a" (__res):"D" (cs),"S" (ct):"si","di");
return __res;
}

Linus Torvalds明確了 Clobbers 為 si 和 di ，或許那個時候的GCC沒有 Clobbers 不能和 InOutputOperands 重疊這個限制吧。

比較大的可能性是現在的人在研究的過程中為了方便編譯，將 Clobbers 直接做了刪除，例如下面幾篇文章都提到了這種方法：

Ubuntu15.10邂逅linux0.11

linux環境下編譯linux0.11核心

linux0.12 編譯過程

同時，在這篇文章中指出 [10] ，Linux 0.1x 中這種因 Clobbers 無法通過現代編譯器檔案還有：

include/linux/sched.h： set_base，set_limit

include/string.h ：strcpy， strncpy，strcat，strncat，strcmp，strncmp，strchr， strrchr，strspn，strcspn，strpbrk，strstr，memcpy，memmove，memcmp，memchr，

mm/memory.c：copy_page，get_free_page

fs/buffer.c：COPY_BLK

fs/namei.c：match

fs/bitmap.c：clear_block，find_first_zero

kernel/blk_drv/floppy.c：copy_buffer

kernel/blk_drv/hd.c：port_read，port_write

kernel/chr_drv/console.c：scrup，scrdown，csi_J，csi_K，con_write

參考

[1] HIT-OSLAB-MANUAL

[2] Semaphore (programming)

[3] SCASB

[4] 6.47 How to Use Inline Assembly Language in C Code

[5] lodsb

[6] Register_preservation

[7] 5.34 An Inline Function is As Fast As a Macro

[8] Linus Torvalds

[9] Linux 0.11 source

[10] 64位Debian Sid下編譯Linux 0.11內

一個關於內聯優化和呼叫約定的Bug

很久沒有更新部落格了（部落格園怎麼還不更新後臺），前幾天在寫一個Linux 0.11的實驗 [1] 時遇到了一個奇葩的Bug，就在這簡單記錄一下除錯過程吧。現象這個實驗要求在Linux 0.11中實現簡單的訊號量 [2]，但在改動核心程式碼後執行測試程式總是報錯，例如： /* pc_test.c */

將一個基於磁盤的表遷移到SQL Server中的一個內存優化的表

varchar 並發 ken 希望重要資源管理器 limited com arc 本文是微軟的譯文，對應的原文是：https://www.red-gate.com/simple-talk/sql/database-administration/migrating-dis

內聯元素和塊元素的區別以及轉換

spa AC 塊級元素內聯 body 相互轉換自動換行 round -c 1.內聯元素和塊元素內聯元素：p、 div、 h1~h6 塊級元素：span、a 區別：內聯元素占空間全部寬度，自動換行；塊元素必須首先設

內聯框架和選項組名

圖片熱區：有點選事件！矩形去掉邊框填充改為透明後能實現熱區一樣的的效果，甚至比熱區還好用。 **********qq介面的註冊******* 1、滑鼠點選按鈕時，切換背景圖片：【屬性】--【選中】--【img】--【匯入】點擊後要切換的圖片取名img_btn1--【滑鼠單擊時】--【設定選中狀態 im

HTML——01：內聯元素和塊元素

塊元素：<h1>、<p>、<ul>，顯示時會獨立成行。內聯元素：<b>、<a>、<img>，顯示時不會獨立成行。 <div>元素：塊元素，組合其它HTML的容器。它可以來修飾在容器自身內的元素。通常可以使用

HTML中內聯元素和塊級元素的區別

最主要的區別是：內聯不佔行，如input，span，a，你會發現寫多個input的時候，它們都排在同一行塊狀元素，佔行，如div，p。你會發現寫多個div的時候，它們都各自佔一行。有時候你用到什麼元素的時候，可能自己觀察下。也可以去看下w3school。如果你想內聯元素佔行，

html中內聯元素和塊級元素的區別（超級詳細）

1.下表列出了內聯元素和塊級元素的主要區別 html中內聯元素和塊級元素的區別塊級元素行內元素獨佔一行,預設情況下，其寬度自動填滿其父元素寬度相鄰的行內元素會排列在同一行裡，直到一行排不下，才會換行，其寬度隨元素的內容而變化可以設定wid

內聯元素和塊元素的區別及轉換

1.內聯元素和塊元素內聯元素：p、 div、 h1~h6 塊元素：span、a 區別：內聯元素佔空間全部寬度，自動換行；塊元素必須首先設定其相應的寬度，

定義一個包含私有成員變數和函式的類，再定義一個內部類，在內部類函式中訪問外部成員變數，並呼叫外部函式。在外部類函式中建立內部類物件，呼叫內部類函式

public class Test5 { //定義包含私有成員變數和函式 private int a = 201320883; private voi

V8引擎優化機制之隱藏類和內聯快取

我們知道Javascript作為一種動態語言，效能方面與c#,Java之類的靜態語言相比存在著一定的差距。而隨著Web技術的發展，對Javascript的執行效率提出越來越高的要求。為了追求更好的效能，V8引擎借鑑了大量的靜態語言編譯技術來優化引擎的執行效率。比如V8引擎放棄生成中間位元組碼，而是直接從AST

css塊級元素和內聯元素

cnblogs 塊級元素 play spl 元素 png blog lock bsp 左邊的是塊級元素block，右邊內聯inline。修改方法 display: inline; css塊級元素和內聯元素

Android內存優化（一）DVM和ART原理初探

java虛擬機劃分 cimage beef 靜態由於 jar blank 查找要學習Android的內存優化，首先要了解Java虛擬機，此前我用了多篇文章來介紹Java虛擬機的知識，就是為了這個系列做鋪墊。在Android開發中我們接觸的是與Java虛擬機類似的Dal

js的內聯和外部調用

cti 報錯使用解釋全局出現異常原因現在調用如上代碼（實現輪播圖的部分代碼），若以內聯形式寫在html裏，就不會出現異常。但是如果放在外部js文件中，並在html的head中調用，就會出現異常（不報錯，但是輪播邏輯出現異常）。解決：不使用全局變量$in

10.16輸入一個字符串，內有數字和非數字字符，如： a123x456 17960? 302tab5876 將其中連續的數字作為一個整數，依次存放到一數組num中。例如123放在num[0]中，456放在num[1]中……統計共有多少個整數，並輸出這些數。

tab lnp zip sm2 cuc ycm rds qt5 tft 10.16輸入一個字符串，內有數字和非數字字符，如： a123x456 17960? 302tab5876 將其中連續的數字作為一個整數，依次存放到一數組num中。例

一個關於內聯優化和呼叫約定的Bug

現象

定位

復現

分析

試驗

解決

後記

參考

一個關於內聯優化和呼叫約定的Bug

將一個基於磁盤的表遷移到SQL Server中的一個內存優化的表

內聯元素和塊元素的區別以及轉換

內聯框架和選項組名

HTML——01：內聯元素和塊元素

HTML中內聯元素和塊級元素的區別

html中內聯元素和塊級元素的區別（超級詳細）

內聯元素和塊元素的區別及轉換

定義一個包含私有成員變數和函式的類，再定義一個內部類，在內部類函式中訪問外部成員變數，並呼叫外部函式。在外部類函式中建立內部類物件，呼叫內部類函式

V8引擎優化機制之隱藏類和內聯快取

css塊級元素和內聯元素

Android內存優化（一）DVM和ART原理初探

js的內聯和外部調用

10.16輸入一個字符串，內有數字和非數字字符，如： a123x456 17960? 302tab5876 將其中連續的數字作為一個整數，依次存放到一數組num中。例如123放在num[0]中，456放在num[1]中……統計共有多少個整數，並輸出這些數。

內聯元素，取消間隙和默認效果

html5-塊元素和內聯元素

sql 異常<除非另外還指定了 TOP、OFFSET 或 FOR XML，否則，ORDER BY 子句在視圖、內聯函數、派生表、子查詢和公用表表達式中無效。>

內聯函數和宏

0006-css引入內聯和內部和外聯樣式表-前端學習筆記

【20180712】Nginx 參數優化和內核參數優化

一個關於內聯優化和呼叫約定的Bug

現象

定位

復現

分析

試驗

解決

後記

參考

相關推薦