
Tracking down a __libc_message abort


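The signal is 6, raised by abort. In general, this error tends to appear when the pointer you got from malloc is A but the pointer you pass to free is not A. If the pointer you free happens to be one that someone else obtained from malloc, free still appears to succeed. A second cause is that the size glibc has recorded for the address being freed is corrupted, which produces the same error. This article is about the second case.

The abort backtrace is as follows: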

#0 0x00007f338dd60b55 in raise () from /lib64/libc.so.6
#1 0x00007f338dd620c5 in abort () from /lib64/libc.so.6
#2 0x00007f338dd9ee0f in __libc_message () from /lib64/libc.so.6
#3 0x00007f338dda4628 in malloc_printerr () from /lib64/libc.so.6
#4 0x000000000046abfe in OSMemory::Delete (inMemory=0x7f333e7fcf20) at OSMemory.cpp:278
#5 0x000000000046ac2f in operator delete (mem=0x7f333e7fcf20) at OSMemory.cpp:202
#6 0x000000000040e8a7 in __gnu_cxx::new_allocator<std::_List_node<CZMBuff*> >::deallocate (this=0x7f32a4a155a0, __p=0x7f333e7fcf20) at /usr/include/c++/4.3/ext/new_allocator.h:98
#7 0x000000000040e8cf in std::_List_base<CZMBuff*, std::allocator<CZMBuff*> >::_M_put_node (this=0x7f32a4a155a0, __p=0x7f333e7fcf20) at /usr/include/c++/4.3/bits/stl_list.h:318
#8 0x000000000040e9ef in std::_List_base<CZMBuff*, std::allocator<CZMBuff*> >::_M_clear (this=0x7f32a4a155a0) at /usr/include/c++/4.3/bits/list.tcc:79
#9 0x000000000049d579 in std::list<CZMBuff*, std::allocator<CZMBuff*> >::clear (this=0x7f32a4a155a0) at /usr/include/c++/4.3/bits/stl_list.h:1066

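Since this backtrace sits in the object's destruction path, the error must come from free. By design of the object's memory pool, at malloc time we use a user-space header structure to record the object's length. The structure is: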

typedef struct
{
size_t ID;
size_t size;
}mem_hdr;

Both fields are 8 bytes long, and the actual data follows them. So when I call my_malloc with 24 bytes, 40 bytes (24 + 16) are ultimately requested from glibc's malloc.
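
The header scheme described above can be sketched roughly as follows. This is only an illustration under assumptions: the article names my_malloc and mem_hdr but does not show the implementation, so my_free, my_recorded_size, and the choice to store the total length (header included) in size are inferred from the 0x28 values in the dumps, not taken from the real code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Header layout from the article: two 8-byte fields on a 64-bit build.
typedef struct
{
    size_t ID;
    size_t size;
} mem_hdr;

// Hypothetical wrapper: request user_size + sizeof(mem_hdr) bytes from
// malloc and hand back the address just past the header.
void* my_malloc(size_t user_size)
{
    size_t total = user_size + sizeof(mem_hdr);
    mem_hdr* hdr = static_cast<mem_hdr*>(std::malloc(total));
    if (hdr == nullptr)
        return nullptr;
    hdr->ID = 0;
    hdr->size = total;  // assumption: total length, e.g. 24 + 16 = 40 = 0x28
    return hdr + 1;     // user data starts right after the header
}

// Hypothetical counterpart: step back over the header before calling free,
// which is the "subtract 16" the article mentions.
void my_free(void* user_ptr)
{
    if (user_ptr == nullptr)
        return;
    std::free(static_cast<mem_hdr*>(user_ptr) - 1);
}

// Reads the length recorded in the header that precedes user_ptr.
size_t my_recorded_size(const void* user_ptr)
{
    return (static_cast<const mem_hdr*>(user_ptr) - 1)->size;
}
```

With this layout, a 24-byte request records 40 in the header, and the pointer handed to glibc's free is always 16 bytes below the pointer the caller sees.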

The memory around the failing free looks like this:

0x7f333e7fcf20 is the inMemory value from the backtrace above. Before this value is actually handed to glibc, 16 is subtracted from it, giving 0x7f333e7fcf10.

x /40xg 0x7f333e7fcf20 -64

0x7f333e7fcee0: 0x0000000000000000 0x0000000000000028
0x7f333e7fcef0: 0xffffffffffffffff 0xffffffffffffffff--------------these two rows are clearly abnormal; they should hold pointers
0x7f333e7fcf00: 0xffffffffffffffff 0x00000000ffffffff--------------
0x7f333e7fcf10: 0x0000000000000000 0x0000000000000028
0x7f333e7fcf20: 0x00007f32a57976e0 0x00007f333f7c08e0
0x7f333e7fcf30: 0x00007f32c25b2618 0x0000000000000035--------------in binary this is 110101; the low three bits are flags (here 101, i.e. PREV_INUSE 0x1 and NON_MAIN_ARENA 0x4), and the remaining 110000 is 48, the chunk length
0x7f333e7fcf40: 0x0000000000000000 0x0000000000000028
0x7f333e7fcf50: 0x00007f330047a640 0x00007f333dbebfd0
0x7f333e7fcf60: 0x00007f32b04b81b8

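Each 0x0000000000000000 0x0000000000000028 pair is the ID and size of the application's mem_hdr structure; 40 in hexadecimal is 0x28. The 24 bytes (three pointers) after 0x28 should be user data. In this case they are _List_node_base* _M_next; _List_node_base* _M_prev; _Tp _M_data; (the data field), that is, the STL list node structure.

A normal example looks like this: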

0x7f333e7fcf10: 0x0000000000000000 0x0000000000000028
0x7f333e7fcf20: 0x00007f32a57976e0 0x00007f333f7c08e0
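0x7f333e7fcf30: 0x00007f32c25b2618 0x0000000000000035--------------the key point: in the bad block this 0x0000000000000035 value was trampled into 0x00000000ffffffff. Had only 24 bytes been trampled instead of 32, glibc would not have reported an error.
0x7f333e7fcf40: 0x0000000000000000 0x0000000000000028--------------the next structure starts here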

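Compare the two blocks in the dump: the lower one is a normal allocation and the upper one is abnormal. It is obvious that the 32 bytes starting at 0x7f333e7fcef0 in the upper block are bad (the same pattern appears at 0x1497650 in another core file, shown further down).

Let us review the structure malloc uses to manage each allocation: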

struct malloc_chunk {
  INTERNAL_SIZE_T      prev_size;  /* Size of previous chunk (if free).  */
  INTERNAL_SIZE_T      size;       /* Size in bytes, including overhead. */
  struct malloc_chunk* fd;         /* double links -- used only if free. */
  struct malloc_chunk* bk;
  /* Only used for large blocks: pointer to next larger size.  */
  struct malloc_chunk* fd_nextsize; /* double links -- used only if free. */
  struct malloc_chunk* bk_nextsize;
};

prev_size: If the previous chunk is free, this field contains the size of the previous chunk. If the previous chunk is allocated, this field holds part of the previous chunk’s user data.
size: This field contains the size of this allocated chunk. The last 3 bits of this field contain flag information.

    • PREV_INUSE (P) – This bit is set when previous chunk is allocated.
    • IS_MMAPPED (M) – This bit is set when chunk is mmap’d.
    • NON_MAIN_ARENA (N) – This bit is set when this chunk belongs to a thread arena.
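
To make the flag layout concrete, here is a small sketch that splits a raw size field into its length and its three flag bits. The flag values match the definitions in glibc's malloc.c; the ChunkSizeField struct and decode_size_field function are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Flag values as defined in glibc's malloc.c.
const std::uint64_t PREV_INUSE     = 0x1;
const std::uint64_t IS_MMAPPED     = 0x2;
const std::uint64_t NON_MAIN_ARENA = 0x4;
const std::uint64_t SIZE_BITS      = PREV_INUSE | IS_MMAPPED | NON_MAIN_ARENA;

// Hypothetical helper type holding the decoded parts of a size field.
struct ChunkSizeField
{
    std::uint64_t size;   // chunk length in bytes, flags masked off
    bool prev_inuse;      // P bit
    bool is_mmapped;      // M bit
    bool non_main_arena;  // N bit
};

// Splits the raw 8-byte size field into length and flags.
ChunkSizeField decode_size_field(std::uint64_t raw)
{
    ChunkSizeField f;
    f.size           = raw & ~SIZE_BITS;
    f.prev_inuse     = (raw & PREV_INUSE) != 0;
    f.is_mmapped     = (raw & IS_MMAPPED) != 0;
    f.non_main_arena = (raw & NON_MAIN_ARENA) != 0;
    return f;
}
```

Applied to the 0x35 seen in the dump, this yields a length of 48 with PREV_INUSE and NON_MAIN_ARENA set, which is consistent with a 40-byte request made from a thread arena.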

Bins: Bins are the freelist data structures. They are used to hold free chunks. Based on chunk sizes, different bins are available:

  • Fast bin
  • Unsorted bin
  • Small bin
  • Large bin

Mapped onto memory, the layout looks like this:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  <--actual start of the chunk
|  prev_size: size of the previous chunk    | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  size: flags in low bits, size in high  |M|P|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  <--pointer returned by a successful malloc
|  In use: holds user data.                   .--------------our user data lives here; in this case, 32 bytes of our data were trampled.
.  Free: holds the remaining malloc_chunk     .
.  members.                                   .
.                                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+  <--start of the next chunk
|             prev_size ...                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

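As you can see, the pointer malloc returns is not the start of the memory block: two size_t fields precede it. For a block that is in use, the size field is the one that matters most. It stores the size of the whole chunk, and because chunks are aligned, its low bits are never needed for the size and are used as flags instead.

  A heap write overflow usually corrupts the size field of the following chunk.
  A corrupted size field has two possible outcomes. Either free detects the error, prints some diagnostics, and aborts; or free fails to detect it, and the program usually just crashes.
  Judging from the backtrace at the top, this bug falls into the first case.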
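
The effect can be reproduced on a simulated heap without touching real malloc metadata. The sketch below lays out eight 8-byte words the same way as the dump (application header, 24 bytes of user data, the next chunk's size field, the next application header) and overwrites them with the pattern seen in the core files. The names build_heap, trample, and size_field_looks_valid are hypothetical, and the validity check is only a simplified stand-in for the sanity checks free performs, assuming the 16-byte alignment and 32-byte minimum chunk size used on x86-64; it is not the actual glibc code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified stand-in for free()'s size sanity check (not glibc code):
// mask off the three flag bits, then require a minimum size and alignment.
bool size_field_looks_valid(std::uint64_t raw)
{
    const std::uint64_t size = raw & ~static_cast<std::uint64_t>(0x7);
    return size >= 32 && (size & 0xf) == 0;
}

// Fills an 8-word simulated heap:
//   words[0..1]  mem_hdr of the first block (ID, size)
//   words[2..4]  24 bytes of user data (three pointers)
//   words[5]     size field of the NEXT chunk
//   words[6..7]  mem_hdr of the next block
void build_heap(std::uint64_t* words)
{
    words[0] = 0x0;
    words[1] = 0x28;
    words[2] = 0x00007f32a57976e0ULL;
    words[3] = 0x00007f333f7c08e0ULL;
    words[4] = 0x00007f32c25b2618ULL;
    words[5] = 0x35;
    words[6] = 0x0;
    words[7] = 0x28;
}

// Overwrites word_count words (at most 4) starting at the user data with
// the pattern observed in the core files.
void trample(std::uint64_t* words, std::size_t word_count)
{
    static const std::uint64_t pattern[4] = {
        0xffffffffffffffffULL, 0xffffffffffffffffULL,
        0xffffffffffffffffULL, 0x00000000ffffffffULL
    };
    if (word_count > 4)
        word_count = 4;
    std::memcpy(&words[2], pattern, word_count * sizeof(std::uint64_t));
}
```

Calling trample(heap, 3) destroys only the 24 bytes of user data and leaves words[5] at 0x35, which still passes the check. Calling trample(heap, 4) turns words[5] into 0x00000000ffffffff, whose masked size is not 16-byte aligned, so the check fails. That matches the observation that a 24-byte overwrite would have gone unnoticed by glibc while the 32-byte one triggers the abort.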

Across multiple core files the pattern is the same: every time exactly 32 bytes are trampled, and the data is always identical:

0x1497650: 0xffffffffffffffff 0xffffffffffffffff
0x1497660: 0xffffffffffffffff 0x00000000ffffffff

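In terms of actual code there are two possibilities: either something is assigned -1, or a memcpy copies source data that is 0xffffffffffffffff.

Switching to the relevant frame and inspecting the registers with info register, the CZMBuff obtained looks fine. The free happens while a node is being removed from the STL doubly linked circular list.

After the node is removed, free is called to release the list's node structure, and that is where things go wrong.

The STL list node structures look like this: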

// Definition of the list node base
struct _List_node_base {
  _List_node_base* _M_next;
  _List_node_base* _M_prev;
};

// Definition of the list node
template <class _Tp>
struct _List_node : public _List_node_base {
  _Tp _M_data;  // data field
};
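
As a cross-check of the sizes involved, the sketch below mirrors the two node types and measures a node whose element type is CZMBuff*. The types are renamed (ListNodeBase, ListNode) to avoid reserved identifiers, data_offset is a hypothetical helper, and the numbers assume a 64-bit build with 8-byte pointers.

```cpp
#include <cassert>
#include <cstddef>

struct CZMBuff;  // opaque here; the list only stores pointers to it

// Simplified mirrors of the libstdc++ node types quoted above.
struct ListNodeBase
{
    ListNodeBase* next;
    ListNodeBase* prev;
};

template <class T>
struct ListNode : public ListNodeBase
{
    T data;  // data field
};

// Byte offset of the data field inside a node, measured at run time.
std::size_t data_offset()
{
    ListNode<CZMBuff*> node = ListNode<CZMBuff*>();
    return static_cast<std::size_t>(
        reinterpret_cast<char*>(&node.data) - reinterpret_cast<char*>(&node));
}
```

On such a build sizeof(ListNode<CZMBuff*>) is 24 and the data field sits at offset 16, right after the two link pointers. Adding the 16-byte mem_hdr gives the 40 (0x28) recorded in the dumps.
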
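Our data field holds a CZMBuff*, so reaching a CZMBuff from a node takes two levels of indirection. Printing the list contents directly with p is awkward, so a helper script is needed:

Create a script file with the following contents (it can also be downloaded online):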
define plist
    if $argc == 0
        help plist
    else
        set $head = &$arg0._M_impl._M_node
        set $current = $arg0._M_impl._M_node._M_next
        set $size = 0
        while $current != $head
            if $argc == 2
                printf "elem[%u]: ", $size
                p *($arg1*)($current + 1)
            end
            if $argc == 3
                if $size == $arg2
                    printf "elem[%u]: ", $size
                    p *($arg1*)($current + 1)
                end
            end
            set $current = $current._M_next
            set $size++
        end
        printf "List size = %u \n", $size
        if $argc == 1
            printf "List "
            whatis $arg0
            printf "Use plist <variable_name> <element_type> to see the elements in the list.\n"
        end
    end
end

document plist
    Prints std::list<T> information.
    Syntax: plist <list> <T> <idx>: Prints list size, if T defined all elements or just element at idx
    Examples:
    plist l - prints list size and definition
    plist l int - prints all elements and list size
    plist l int 2 - prints the third element in the list (if exists) and list size
end

define plist_member
    if $argc == 0
        help plist_member
    else
        set $head = &$arg0._M_impl._M_node
        set $current = $arg0._M_impl._M_node._M_next
        set $size = 0
        while $current != $head
            if $argc == 3
                printf "elem[%u]: ", $size
                p (*($arg1*)($current + 1)).$arg2
            end
            if $argc == 4
                if $size == $arg3
                    printf "elem[%u]: ", $size
                    p (*($arg1*)($current + 1)).$arg2
                end
            end
            set $current = $current._M_next
            set $size++
        end
        printf "List size = %u \n", $size
        if $argc == 1
            printf "List "
            whatis $arg0
            printf "Use plist_member <variable_name> <element_type> <member> to see the elements in the list.\n"
        end
    end
end

document plist_member
    Prints one member of each element in a std::list<T>.
    Syntax: plist_member <list> <T> <member> <idx>: Prints list size, if T and member defined the member of all elements or just of the element at idx
    Examples:
    plist_member l int member - prints all elements and list size
    plist_member l int member 2 - prints the third element in the list (if exists) and list size
end

Then use plist and plist_member to read the member values:

plist this->m_listBuff
List size = 16595

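The reference count is the counter member:

counter = 1: 204 elements

counter = 0: 16596 elements

The two add up to 16800, but the list holds only 16595 elements. Where did the missing element go? The only way it could have failed to be in the list is that, in the list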
