memory-ordering-at-compile-time

阿新 • Published: 2018-11-08

A Brief Look at Memory Reordering

Memory ordering

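Between the time we write C/C++ code and the time it runs on a CPU, the code's memory interactions may be reordered according to certain rules. Memory reordering is performed both by the compiler (at compile time) and by the processor (at run time), in both cases to make the code run faster.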

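The cardinal rule of memory ordering, followed by compiler developers and processor vendors alike, is:

The behavior of a single-threaded program must not change.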

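Because of this rule, memory reordering goes largely unnoticed when writing single-threaded code. It often goes unnoticed in multithreaded programs too, because mutexes, semaphores and similar primitives prevent memory reordering around their calls. Only when lock-free techniques are used, with memory shared among threads without any mutual exclusion, do the effects of memory reordering become visible.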

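The following first compares weak and strong memory models, and then shows in two parts how memory reordering actually happens at compile time and at run time, and how to prevent it.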

Weak vs. strong memory models

Jeff Preshing, in Weak vs. Strong Memory Models, gives a good summary of the spectrum from weak to strong:

Very weak	Weak with data dependency ordering	Strong	Sequentially consistent
DEC Alpha	ARM	X86/64	dual 386
C/C++11 low-level atomics	PowerPC	SPARC TSO	Java volatile/C/C++11 atomics

Weak memory models

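In the weakest memory model, all four kinds of memory reordering can occur (LoadLoad, StoreStore, LoadStore and StoreLoad). Any load or store may be reordered with any other load or store, as long as doing so never changes the behavior of a single, isolated thread. In practice, the reordering comes either from the compiler reordering instructions or from the processor itself executing them out of order.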

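A processor with a weak hardware memory model is usually called weakly-ordered, or is said to have weak ordering or a relaxed memory model. The DEC Alpha is the most representative weakly-ordered processor.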

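The low-level atomic operations of C/C++ also present a weak memory model, even when the code targets a strongly-ordered processor such as x86/64. The section Memory ordering at compile time below demonstrates this weak model and shows how to enforce memory ordering to guard against compiler reordering.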

Weak with data dependency ordering

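The memory models of the ARM and PowerPC processor families are as weak as the Alpha's, except that they preserve data dependency ordering. This means that for two dependent loads (load A, then load B<-A, where the second load depends on the value returned by the first), load B<-A is guaranteed to happen after load A. (A data dependency barrier is a partial ordering on interdependent loads only; it is not required to have any effect on stores, independent loads or overlapping loads.)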
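The dependent-load pattern can be sketched in C++ roughly as follows. This is a minimal illustration, not code from the original post: Node, g_node, g_head, Publish, TryRead and CheckSingleThreaded are hypothetical names. The reader uses memory_order_acquire. C++11 also defines memory_order_consume for exactly this data-dependency case, but compilers generally treat it as acquire, so acquire is used here.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical payload type, used only for this illustration.
struct Node {
    int payload;
};

static Node g_node = {0};                    // storage for the published object
static std::atomic<Node*> g_head(nullptr);   // "A": the pointer that gets published

// Writer: fill in the object, then publish its address.
void Publish(int value) {
    g_node.payload = value;
    // release: the payload write cannot be reordered after this store
    g_head.store(&g_node, std::memory_order_release);
}

// Reader: load A (the pointer), then load B<-A (the field reached through it).
// Returns false if nothing has been published yet.
bool TryRead(int& out) {
    Node* p = g_head.load(std::memory_order_acquire);  // load A
    if (p == nullptr) {
        return false;
    }
    out = p->payload;                                   // load B<-A, depends on p
    return true;
}

// Single-threaded walk-through of the protocol.
void CheckSingleThreaded() {
    int out = -1;
    Publish(42);
    assert(TryRead(out) && out == 42);
}
```

In a real program Publish and TryRead would run on different threads. The point of the sketch is only that the second load cannot be issued before the first one has produced the pointer.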

Strong memory models

There is disagreement about where the line between weak and strong memory models lies. The definition Preshing arrives at is:

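A strong hardware memory model is one in which every machine instruction comes implicitly with acquire and release semantics. As a result, when one CPU core performs a sequence of writes, every other CPU core sees those values change in the same order in which they were written.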

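In other words, three of the four orderings (LoadLoad, StoreStore and LoadStore) are guaranteed, and only StoreLoad ordering is not. By this definition, the x86/64 family of processors is, for the most part, strongly ordered. The section Memory ordering at processor time below shows an experiment in which StoreLoad reordering is observed on x86/64.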

Sequential consistency

In a sequentially consistent memory model, there is no memory reordering at all.

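Nowadays it is hard to find a modern multicore device that guarantees sequential consistency at the hardware level. Only early 386-era machines (the dual 386 in the table above) qualify, because the 386 was not powerful enough to reorder any memory operations at run time.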

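When programming in higher-level languages, sequential consistency becomes an important software memory model. In Java 5 and later, shared variables can be declared volatile. In C++11, atomic operations can use the default ordering constraint, memory_order_seq_cst. When these are used, the compiler restricts its own reordering and emits CPU-specific instructions that act as the appropriate kind of memory barrier.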

Memory ordering at compile time

Consider the following code:

test.c

int A, B;
void test() {
  A = B + 1;
  B = 0;
}

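Compiling it to assembly without optimization, we can see that the store to B comes after the store to A, in the same order as in the source program.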

$ gcc -S -masm=intel test.c

	mov	eax, DWORD PTR B
	add	eax, 1
	mov	DWORD PTR A, eax
	mov	DWORD PTR B, 0

Now enable optimization with -O2:

$ gcc -S -O2  -masm=intel test.c

	mov	eax, DWORD PTR B
	mov	DWORD PTR B, 0
	add	eax, 1
	mov	DWORD PTR A, eax

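This time the compiler moved the store to B ahead of the store to A. Why is it allowed to do that? Because the cardinal rule of memory ordering is not broken: the change does not affect a single-threaded program, which has no way of telling the difference.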

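When writing lock-free code, however, this kind of compiler reordering causes problems. Consider the following example, in which a shared flag indicates whether some other shared data has been updated: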

int value;
int updated = 0;
void UpdateValue(int x) {
    value = x;
    updated = 1;
}

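Suppose the compiler moves the store to updated ahead of the store to value. Even on a single-core system this is a problem: if the thread is preempted between the two stores, another thread that checks updated will conclude that value has been updated when in fact it has not.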

Explicit compiler barriers

One approach is to use a special directive known as a compiler barrier to prevent the optimizer from reordering. The asm volatile statement below is how it is done in GCC.

test_barrier.c

int A, B;
void test() {
  A = B + 1;
  asm volatile("" ::: "memory");
  B = 0;
}

With this change, and with optimization enabled, the store to B stays in the required order.

$ gcc -S -O2  -masm=intel test_barrier.c

	mov	eax, DWORD PTR B
	add	eax, 1
	mov	DWORD PTR A, eax
	mov	DWORD PTR B, 0
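For code that must also build with compilers other than GCC, C++11 provides std::atomic_signal_fence, which constrains the compiler without emitting a CPU fence instruction. The sketch below is an assumed equivalent of test_barrier.c rather than something taken from the original post. TestWithSignalFence and CheckResult are hypothetical names. Strictly speaking, the standard only specifies this fence with respect to a signal handler running on the same thread, but GCC and Clang implement it as a compiler barrier in practice.

```cpp
#include <atomic>
#include <cassert>

int A, B;

// Same shape as test() in test_barrier.c, with the GCC-specific asm statement
// replaced by the standard compiler-only fence.
void TestWithSignalFence() {
    A = B + 1;
    // Constrains the compiler only; no CPU fence instruction is emitted.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    B = 0;
}

// The barrier affects only the order of the two stores, never the final values.
void CheckResult() {
    B = 5;
    TestWithSignalFence();
    assert(A == 6 && B == 0);
}
```

As with the asm version, this does nothing to stop the processor from reordering at run time.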

Implicit compiler barriers

In the C++11 atomic library, every atomic operation that is not relaxed also acts as a compiler barrier.

#include <atomic>

int value;
std::atomic<int> updated(0);
void UpdateValue(int x) {
    value = x;
    // reordering is prevented here
    updated.store(1, std::memory_order_release);
}
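The release store only helps if the reading side pairs with it. A minimal sketch of a matching reader follows, assuming the same value/updated pair as in the listing above. TryGetValue and CheckSingleThreaded are hypothetical names that do not appear in the original post.

```cpp
#include <atomic>
#include <cassert>

int value;
std::atomic<int> updated(0);

// Writer, as in the listing above.
void UpdateValue(int x) {
    value = x;
    updated.store(1, std::memory_order_release);
}

// Reader: the acquire load keeps the read of value from being moved above
// the check of the flag, by the compiler or by the CPU.
bool TryGetValue(int& out) {
    if (updated.load(std::memory_order_acquire) == 0) {
        return false;   // not published yet
    }
    out = value;
    return true;
}

// Single-threaded walk-through: once the flag is set, the value is visible.
void CheckSingleThreaded() {
    int out = 0;
    UpdateValue(7);
    assert(TryGetValue(out) && out == 7);
}
```

If the reader instead loaded the flag with memory_order_relaxed, nothing would prevent the read of value from being hoisted above the flag check.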

Every function that contains a compiler barrier acts as a compiler barrier itself, even when it is inlined.

int a;
int b;
void DoSomething() {
    a = 1;
    UpdateValue(1);
    b = a + 1;
}

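By extension, most function calls act as compiler barriers, whether or not the function contains a memory barrier. The exceptions are inline functions, functions declared with the pure attribute, and cases where link-time code generation is used. At compile time the compiler does not know whether UpdateValue depends on a, or whether it modifies a and thereby affects b, so it will not reorder the memory operations around the call.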

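As we can see, many hidden rules forbid instruction reordering at compile time, and the same rules also keep the compiler from optimizing the code further. That is the reasoning behind Why the "volatile" type class should not be used: in some situations, leaving volatile out lets the compiler optimize more.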

Out-of-thin-air stores

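Just as there are implicit compiler barriers, GCC has also been known to generate out-of-thin-air stores. Here is an example, taken from the page linked in the original post: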

extern int v;

    void
    f(int set_v)
    {
      if (set_v)
        v = 1;
    }

On i686, GCC 3.3.4 through 4.3.0 with -O1 produces:

        pushl   %ebp
        movl    %esp, %ebp
        cmpl    $0, 8(%ebp)
        movl    $1, %eax
        cmove   v, %eax        ; load (maybe)
        movl    %eax, v        ; store (always)
        popl    %ebp
        ret

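In a single-threaded program this is harmless. In a multithreaded program, however, a call to f(0) is supposed to do nothing, yet the generated code reads v and then writes it back. If the thread is interrupted between the load and the store, it overwrites a value that another thread stored in the meantime, which is a data race. The C++11 standard explicitly forbids this behavior; see §1.10.22 of a recent C++11 draft: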

Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard.
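Expressed back in C++, the assembly above behaves roughly like the second function below. This is a hand-written reconstruction for illustration only. f_as_compiled and CheckSingleThreadedEquivalence are hypothetical names, and v is defined here rather than declared extern so that the snippet is self-contained.

```cpp
#include <cassert>

int v;

// What the programmer wrote: v is written only when set_v is nonzero.
void f(int set_v) {
    if (set_v)
        v = 1;
}

// What the cmove sequence effectively does: the store always happens.
void f_as_compiled(int set_v) {
    int tmp = 1;          // movl  $1, %eax
    if (set_v == 0)
        tmp = v;          // cmove v, %eax   (keep the old value of v)
    v = tmp;              // movl  %eax, v   (unconditional store)
}

// In a single thread the two versions are indistinguishable.
void CheckSingleThreadedEquivalence() {
    v = 5; f(0);             assert(v == 5);
    v = 5; f_as_compiled(0); assert(v == 5);
    v = 5; f(1);             assert(v == 1);
    v = 5; f_as_compiled(1); assert(v == 1);
}
```

The difference only shows with a second thread: if another thread stores to v between the load and the store in f_as_compiled(0), that store is silently overwritten with the old value.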

Memory ordering at processor time

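Let us look at a simple example of CPU reordering, one that can be observed even on x86/64 with its strong memory model. Two integers X and Y are initially 0, and two further variables r1 and r2 receive the values read back. Two threads run in parallel, each executing two machine instructions: thread 1 stores 1 to X and then loads Y into r1, while thread 2 stores 1 to Y and then loads X into r2.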

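Each thread stores 1 to one shared variable and then reads the other thread's variable into a variable or register. Whichever thread writes its 1 to memory first, the other thread should read that value back, so we would expect to end up with r1=1, r2=1, or both. But even though x86/64 has a strong memory model, it is still allowed to reorder these machine instructions. In particular, each thread is allowed to delay its store until after its load. As a result, r1 and r2 can both end up equal to 0, a counterintuitive outcome that arises when both loads effectively execute before either store becomes visible.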
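The outcome r1 = r2 = 0 can be reproduced deterministically with a toy model of a per-core store buffer. The sketch below is a simplified illustration written for this purpose, not a description of how any real x86 core is implemented. Core, Store, Load, Flush, RunExample and CheckBothOutcomes are hypothetical names.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Shared memory: index 0 is X, index 1 is Y.
enum { X = 0, Y = 1 };
static int g_memory[2] = {0, 0};

// Toy core: stores wait in a private FIFO buffer until Flush() is called.
struct Core {
    std::vector<std::pair<int, int> > buffer;  // pending (address, value) pairs

    void Store(int addr, int value) {
        buffer.push_back(std::make_pair(addr, value));
    }

    int Load(int addr) const {
        // A core sees its own pending stores first, newest entry winning.
        for (auto it = buffer.rbegin(); it != buffer.rend(); ++it) {
            if (it->first == addr) return it->second;
        }
        return g_memory[addr];
    }

    void Flush() {
        for (const auto& entry : buffer) g_memory[entry.first] = entry.second;
        buffer.clear();
    }
};

// Runs the two-thread example once and returns (r1, r2). When
// drain_before_load is true, each buffer is drained before the loads,
// which is the effect a full memory barrier has in this model.
std::pair<int, int> RunExample(bool drain_before_load) {
    g_memory[X] = 0;
    g_memory[Y] = 0;
    Core core1, core2;
    core1.Store(X, 1);                 // thread 1: X = 1 (buffered)
    core2.Store(Y, 1);                 // thread 2: Y = 1 (buffered)
    if (drain_before_load) {
        core1.Flush();
        core2.Flush();
    }
    int r1 = core1.Load(Y);            // thread 1: r1 = Y
    int r2 = core2.Load(X);            // thread 2: r2 = X
    core1.Flush();                     // any remaining stores become visible
    core2.Flush();
    return std::make_pair(r1, r2);
}

void CheckBothOutcomes() {
    assert(RunExample(false) == std::make_pair(0, 0));  // stores delayed
    assert(RunExample(true) == std::make_pair(1, 1));   // stores drained first
}
```

Each thread still executes its own instructions in program order. Only the moment at which its store becomes visible to the other core is delayed, and that is enough to make both loads return 0.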

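Let us write a sample program to see that the CPU really does reorder the instructions. The source code can be downloaded from GitHub. The code for the two reading and writing threads is as follows: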

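#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

// MersenneTwister is the random-number class described below; its definition is not shown here.
sem_t begin_sem1;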
sem_t begin_sem2;
sem_t end_sem;

int X, Y;
int r1, r2;

void *ThreadFunc1(void *param) {
  MersenneTwister random(1);
  for (;;) {
    sem_wait(&begin_sem1);
    // random delay
    while (random.Integer() % 8 != 0) {
    }
    X = 1;
    asm volatile("" ::: "memory");  // prevent compiler ordering
    r1 = Y;
    sem_post(&end_sem);
  }
  return NULL;
}

void *ThreadFunc2(void *param) {
  MersenneTwister random(2);
  for (;;) {
    sem_wait(&begin_sem2);
    // random delay
    while (random.Integer() % 8 != 0) {
    }
    Y = 1;
    asm volatile("" ::: "memory");  // prevent compiler ordering
    r2 = X;
    sem_post(&end_sem);
  }
  return NULL;
}

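A random delay is inserted before the store in order to stagger the threads' start times, so that the instructions of the two threads overlap. The delay uses the thread-safe MersenneTwister class. As described in the previous section, the asm volatile("" ::: "memory"); statement only prevents compiler reordering. Since the goal here is to observe reordering by the CPU, the influence of compiler reordering has to be ruled out.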

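The main thread, shown below, uses POSIX semaphores to synchronize with the two child threads. The child threads first wait until the main thread has initialized X=0 and Y=0. The main thread then waits until both child threads have finished, and checks the values of r1 and r2. The semaphores thus rule out memory reordering caused by a lack of synchronization between the threads. The main thread's code is as follows: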

int main(int argc, char *argv[]) {
  sem_init(&begin_sem1, 0, 0);
  sem_init(&begin_sem2, 0, 0);
  sem_init(&end_sem, 0, 0);

  pthread_t thread[2];
  pthread_create(&thread[0], NULL, ThreadFunc1, NULL);
  pthread_create(&thread[1], NULL, ThreadFunc2, NULL);

  int detected = 0;
  for (int i = 1; ; ++i) {
    X = 0;
    Y = 0;
    sem_post(&begin_sem1);
    sem_post(&begin_sem2);
    sem_wait(&end_sem);
    sem_wait(&end_sem);
    if (r1 == 0 && r2 == 0) {
      detected++;
      printf("%d reorders detected after %d iterations\n", detected, i);
    }
  }
  return 0;
}

Running the program under Ubuntu on an Intel i5-2435M (x64) gives:

1 reorders detected after 2181 iterations
2 reorders detected after 4575 iterations
3 reorders detected after 7689 iterations
4 reorders detected after 22215 iterations
5 reorders detected after 60023 iterations
6 reorders detected after 60499 iterations
7 reorders detected after 61639 iterations
8 reorders detected after 62243 iterations
9 reorders detected after 67998 iterations
10 reorders detected after 68098 iterations
11 reorders detected after 71179 iterations
12 reorders detected after 71668 iterations
13 reorders detected after 72417 iterations
14 reorders detected after 73970 iterations
15 reorders detected after 78227 iterations
16 reorders detected after 81897 iterations
17 reorders detected after 82722 iterations
18 reorders detected after 85377 iterations
...

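A CPU reordering is detected only once every few thousand iterations (18 times in about 85,000 iterations here), which shows how hard multithreading bugs can be to find. So how can these reorderings be eliminated? There are at least two ways: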

Run both child threads on the same CPU core. (There is no portable way to do this; the version shown below is for Linux.)
Use a CPU memory barrier to prevent the reordering.

Lock to one processor

To run both child threads on the same CPU core, set their affinity as follows:

  cpu_set_t cpus;
  CPU_ZERO(&cpus);
  CPU_SET(0, &cpus);
  pthread_setaffinity_np(thread[0], sizeof(cpu_set_t), &cpus);
  pthread_setaffinity_np(thread[1], sizeof(cpu_set_t), &cpus);

Place a memory barrier

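To keep a store from being reordered after a subsequent load, a StoreLoad barrier is needed. Here we use mfence, a full memory barrier that prevents every kind of memory reordering. The code is as follows (ThreadFunc2 needs the same change):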

void *ThreadFunc1(void *param) {
  MersenneTwister random(1);
  for (;;) {
    sem_wait(&begin_sem1);
    // random delay
    while (random.Integer() % 8 != 0) {
    }
    X = 1;
    asm volatile("mfence" ::: "memory");  // prevent CPU ordering
    r1 = Y;
    sem_post(&end_sem);
  }
  return NULL;
}
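mfence is specific to x86/64. A portable way to ask for the same StoreLoad guarantee in C++11 is std::atomic_thread_fence(std::memory_order_seq_cst) placed between the store and the load. The sketch below is an assumed rewrite of one iteration of the two thread bodies, not the code from the original repository: X and Y become std::atomic<int> accessed with relaxed ordering, so that the fence is the only thing ordering them, and Thread1Body, Thread2Body and CheckSequentialRun are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> X(0), Y(0);
int r1, r2;

// One iteration of thread 1.
void Thread1Body() {
    X.store(1, std::memory_order_relaxed);
    // Full fence: keeps the store above from being reordered after the load below.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = Y.load(std::memory_order_relaxed);
}

// One iteration of thread 2, the mirror image of thread 1.
void Thread2Body() {
    Y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = X.load(std::memory_order_relaxed);
}

// Sequential run on a single thread: thread 1 finishes before thread 2 starts,
// so thread 1 reads the initial Y and thread 2 reads the updated X.
void CheckSequentialRun() {
    X.store(0);
    Y.store(0);
    Thread1Body();
    Thread2Body();
    assert(r1 == 0 && r2 == 1);
}
```

When the two bodies run on separate threads and both contain the fence, the C++ memory model rules out the outcome in which r1 and r2 are both 0. On x86/64, GCC typically compiles this fence to mfence or to a locked instruction.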

Summary

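There are two kinds of memory reordering: compiler reordering and CPU reordering.
Compiler reordering can be prevented with explicit or implicit compiler barriers.
CPU reordering can be prevented by pinning the threads to one core or by placing a memory barrier such as mfence.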

Posted by DreamRunner Jun 28th, 2014 Multithreading
