
《CUDA By Example》【Chapter 09】Atomics

9.1 Overview

  • Understand the compute capabilities of different NVIDIA GPUs
  • Understand what atomic operations are and why they are needed
  • Learn how to perform computations with atomic operations in CUDA C kernels

9.2 Compute Capability

Just as CPUs with different architectures offer different features and instruction sets (such as MMX, SSE, and SSE2), the same is true of the GPUs that CUDA supports. NVIDIA refers to the collection of features a GPU supports as its compute capability.

9.2.1 NVIDIA GPU Compute Capabilities

Compute capabilities include 1.0, 1.1, 1.2, 1.3, and 2.0, and a higher compute capability is a superset of a lower one.
This chapter is concerned with the hardware's ability to perform atomic operations on memory. Starting with compute capability 1.2, both shared memory atomics and global memory atomics are supported.
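
For reference, the compute capability of an installed GPU can be queried at runtime with cudaGetDeviceProperties(). The following host-side sketch is not part of the book's example code; it simply prints the major.minor version of device 0:

#include <stdio.h>

int main( void ) {
    cudaDeviceProp  prop;
    cudaGetDeviceProperties( &prop, 0 );    // query device 0
    // shared memory and global memory atomics both require
    // compute capability 1.2 or higher (global memory atomics
    // alone were introduced in 1.1)
    printf( "Compute capability: %d.%d\n", prop.major, prop.minor );
    return 0;
}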

9.2.2 Compiling for a Minimum Compute Capability

Tell the compiler that the code requires a given compute capability (for example, 1.2 or 1.1) or higher:

nvcc -arch=sm_12
nvcc -arch=sm_11
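
For example, assuming the histogram program is saved in a file named hist_gpu.cu (the file name here is only an illustration), the full command line might look like this:

nvcc -arch=sm_12 hist_gpu.cu -o hist_gpu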

9.3 Introduction to Atomic Operations
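
An atomic operation performs a read-modify-write sequence on a memory location that cannot be interrupted by other threads. A plain increment such as x++ is really three separate steps (read, add, write), so when many threads update the same location concurrently, some updates can be lost. A minimal sketch of the idea, using an illustrative kernel that is not taken from the book's source:

__global__ void count_kernel( unsigned int *counter ) {
    // counter[0]++ would be three steps (read, add, write) and
    // concurrent threads could overwrite each other's updates;
    // atomicAdd performs the read-modify-write as one
    // uninterruptible operation
    atomicAdd( &counter[0], 1 );
}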

9.4 Computing Histograms

9.4.1 Computing a Histogram on the CPU

hist_cpu.cu


#include "../common/book.h"

#define SIZE    (100*1024*1024)

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    clock_t start, stop;
    start = clock();

    unsigned int histo[256];
    for (int i=0; i<256; i++)
        histo[i] = 0;

    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]++;

    stop = clock();
    float   elapsedTime = (float)(stop - start) /
                          (float)CLOCKS_PER_SEC * 1000.0f;
    printf( "Time to generate: %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum: %ld\n", histoCount );

    free( buffer );
    return 0;
}

9.4.2 Computing Histograms on the GPU

Using only global memory atomic operations can actually hurt performance. The kernel performs very little computation, so the slowdown is almost certainly caused by the atomic operations on global memory: when thousands of threads try to access a handful of memory locations, heavy contention results, and to guarantee that each increment is atomic, the hardware serializes all operations on the same memory location.

#include "../common/book.h"

#define SIZE    (100*1024*1024)


__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {

    // clear out the accumulation buffer called temp
    // since we are launched with 256 threads, it is easy
    // to clear that memory with one write per thread
    __shared__  unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

    // calculate the starting index and the offset to the next
    // block that each thread will be processing
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1 );
        i += stride;
    }
    // sync the data from the above writes to shared memory
    // then add the shared memory values to the values from
    // the other thread blocks using global memory
    // atomic adds
    // same as before, since we have 256 threads, updating the
    // global histogram is just one write per thread!
    __syncthreads();
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    // starting the timer here so that we include the cost of
    // all of the operations on the GPU.  if the data were
    // already on the GPU and we just timed the kernel
    // the timing would drop from 74 ms to 15 ms.  Very fast.
    cudaEvent_t     start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    // allocate memory on the GPU for the file's data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
    HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE,
                              cudaMemcpyHostToDevice ) );

    HANDLE_ERROR( cudaMalloc( (void**)&dev_histo,
                              256 * sizeof( int ) ) );
    HANDLE_ERROR( cudaMemset( dev_histo, 0,
                              256 * sizeof( int ) ) );

    // kernel launch - 2x the number of mps gave best timing
    cudaDeviceProp  prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer,
                                    SIZE, dev_histo );

    unsigned int    histo[256];
    HANDLE_ERROR( cudaMemcpy( histo, dev_histo,
                              256 * sizeof( int ),
                              cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float   elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum:  %ld\n", histoCount );

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            printf( "Failure at %d!\n", i );
    }

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}

Using shared memory atomics together with global memory atomics. The performance problem in the code above comes from the atomic operations, and interestingly the fix is to perform even more atomic operations. The version below introduces shared memory: each block first accumulates a partial histogram in shared memory, then merges it into the global histogram. This performs much better than using global memory atomics alone.

#include "../common/book.h"

#define SIZE    (100*1024*1024)


__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    // calculate the starting index and the offset to the next
    // block that each thread will be processing
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &histo[buffer[i]], 1 );
        i += stride;
    }
}

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    // starting the timer here so that we include the cost of
    // all of the operations on the GPU.
    cudaEvent_t     start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    // allocate memory on the GPU for the file's data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
    HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE,
                              cudaMemcpyHostToDevice ) );

    HANDLE_ERROR( cudaMalloc( (void**)&dev_histo,
                              256 * sizeof( int ) ) );
    HANDLE_ERROR( cudaMemset( dev_histo, 0,
                              256 * sizeof( int ) ) );

    // kernel launch - 2x the number of mps gave best timing
    cudaDeviceProp  prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );

    unsigned int    histo[256];
    HANDLE_ERROR( cudaMemcpy( histo, dev_histo,
                              256 * sizeof( int ),
                              cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float   elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum:  %ld\n", histoCount );

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            printf( "Failure at %d!  Off by %d\n", i, histo[i] );
    }

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}

9.5 Summary

Sometimes relying on atomic operations causes performance problems that can only be solved by partially restructuring the algorithm. For the histogram, a two-phase algorithm was used to reduce the degree of contention on global memory accesses. In general, strategies that reduce memory contention in this way tend to pay off.