OpenMP: From Getting Started to Giving Up (1)
A getting-started tutorial on using OpenMP.
0x01 Introduction
OpenMP is one of the most widely used parallel programming models today; it makes it relatively easy for programmers to write parallel software. Before using OpenMP, we should first know the following:
- How to write C/C++ programs. OpenMP supports C/C++ as well as Fortran, but C/C++ is what we will use here.
- How to link a program against a library.
OpenMP sits as a layer between the user's application and the underlying system: the compiler translates the directives, and a runtime library maps the resulting threads onto the OS and hardware.
0x02 Core Syntax
- Most OpenMP constructs are compiler directives of the form `#pragma omp construct [clause [clause] ...]`, for example `#pragma omp parallel num_threads(4)`.
- Function prototypes and the types OpenMP uses are declared in the header `<omp.h>`, which ships with compilers that support OpenMP, so there is normally nothing separate to download and install.
0x03 Compilation
```bash
> gcc -fopenmp test.c
> export OMP_NUM_THREADS=n   # n is the number of processors in your machine
> ./a.out
```
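The `-fopenmp` flag is gcc's; clang accepts the same flag, and MSVC uses `/openmp` instead. Without the flag, the compiler ignores the `#pragma omp` directives (and calls to the `omp_*` runtime routines will no longer link).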
0x04 Example Program
```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    int nthreads, tid;

    /* Fork a team of threads, giving each its own copies of the variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only the master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }
    /* All threads join the master thread and disband */
    return 0;
}
```
Output:
```
Hello World from thread = 0
Hello World from thread = 1
Hello World from thread = 13
Hello World from thread = 5
Hello World from thread = 14
Hello World from thread = 15
Hello World from thread = 3
Number of threads = 16
Hello World from thread = 8
Hello World from thread = 9
Hello World from thread = 11
Hello World from thread = 6
Hello World from thread = 2
Hello World from thread = 4
Hello World from thread = 10
Hello World from thread = 12
Hello World from thread = 7
```
0x05 Shared-Memory Programming
In the shared-memory model a single process owns multiple threads, and the threads communicate by reading and writing shared memory. The OS decides, according to its scheduling policy, when each thread runs, while synchronization mechanisms coordinate the order of execution so that the program still produces correct results.
0x06 Creating Threads
```c
double A[1000];
omp_set_num_threads(4);      // request 4 threads
#pragma omp parallel         // each thread executes the block below
{
    int ID = omp_get_thread_num();  // each thread gets its own thread ID
    pooh(ID, A);
}
```
When this runs, a team of four threads is forked and each thread executes the same block, calling pooh(ID, A) with its own ID on the shared array A; the threads join again at the end of the block.
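Since the fragment leaves pooh() undefined, here is a minimal self-contained sketch one could actually compile; the body of pooh() below is purely illustrative:

```c
#include <omp.h>
#include <stdio.h>

/* Stand-in for the pooh() routine the fragment assumes. */
void pooh(int ID, double *A) {
    A[ID] = ID;   /* each thread writes its own slot of the shared array */
    printf("thread %d done\n", ID);
}

int main() {
    double A[1000];
    omp_set_num_threads(4);   /* request 4 threads */
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);
    }
    return 0;                 /* threads have joined by this point */
}
```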
Now let's work through a concrete example.
Consider the integral \(\int_0^1\frac{4}{1+x^2}\,dx\); from the antiderivative we know its exact value is \(\pi\). But for many integrals no closed form is available, and a computer can only approximate the value numerically. Writing \(F(x)=\frac{4}{1+x^2}\), \(\Delta x=\frac{1}{N}\), and midpoints \(x_i=(i+0.5)\,\Delta x\), the Riemann sum gives
\[\sum_{i=0}^{N-1}F(x_i)\,\Delta x\approx\pi \]
The smaller we make \(\Delta x\), the closer the result gets to the exact value, at the cost of a longer run time. But since each term of the sum is independent of the others, the work can be parallelized to shorten the execution time.
First, let's look at the serial version of the program:
```c
#include <stdio.h>
#include <time.h>

static long num_steps = 100000000;
double step;

int main() {
    clock_t start = clock();
    int i; double x, pi, sum = 0.0;
    step = 1.0 / (double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    clock_t end = clock();
    printf("time=%fms\n", (double)(end - start) / 1000);  /* ms when CLOCKS_PER_SEC == 1000000 */
    printf("pi=%f\n", pi);
    return 0;
}
```
Running it gives:
```
time=652.356ms
pi=3.141593
```
Next, the parallelized version:
```c
#include <omp.h>
#include <time.h>
#include <stdio.h>

static long num_steps = 1000000000;
double step;
#define NUM_THREADS 16

int main() {
    clock_t start, end;
    start = clock();                     // or time(&start);
    int i, nthreads;
    double pi, sum[NUM_THREADS];
    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;  // only one thread updates the global
        for (i = id, sum[id] = 0.0; i < num_steps; i += nthrds) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++) pi += sum[i] * step;
    end = clock();
    printf("time=%f\n", (double)(end - start) / 1000);  /* ms when CLOCKS_PER_SEC == 1000000 */
    printf("pi=%f\n", pi);
    return 0;
}
```
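One caveat about the timing: clock() measures processor time, which on POSIX systems accumulates across all threads, so it can overstate the elapsed time of a parallel run; omp_get_wtime() from `<omp.h>` returns wall-clock time and is the usual choice for timing OpenMP code.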
Output:
```
NUM_THREADS=2
time=293.483000
pi=3.141593
NUM_THREADS=4
time=159.554000
pi=3.141593
```
Note: the program above suffers from false sharing. The elements of sum[] are adjacent in memory and land on the same cache line, so writes from different threads keep invalidating each other's caches; this is why the time does not drop linearly as NUM_THREADS grows. For more on false sharing, see https://zhuanlan.zhihu.com/p/55917869
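A common mitigation, sketched below, is to pad each thread's accumulator so that it occupies its own cache line; PAD = 8 assumes a 64-byte cache line holding eight doubles, which is an assumption about the target machine:

```c
#include <omp.h>
#include <stdio.h>

static long num_steps = 1000000000;
#define NUM_THREADS 16
#define PAD 8   /* assumes a 64-byte cache line = 8 doubles */

int main() {
    int i, nthreads;
    double pi = 0.0, sum[NUM_THREADS][PAD];
    double step = 1.0 / (double) num_steps;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id][0] = 0.0; i < num_steps; i += nthrds) {
            x = (i + 0.5) * step;
            sum[id][0] += 4.0 / (1.0 + x * x);  /* each thread stays on its own cache line */
        }
    }
    for (i = 0; i < nthreads; i++) pi += sum[i][0] * step;
    printf("pi=%f\n", pi);
    return 0;
}
```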
0x07 Synchronization
Synchronization coordinates the execution of one or more threads; the two most common mechanisms are the barrier and mutual exclusion.
Barrier
```c
#pragma omp parallel
{
    int id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier          // wait here until every thread has written A[id]
    B[id] = big_calc2(id, A);    // now safe to read all of A
}
```
Execution proceeds past the barrier only once every thread has reached it.
Critical
```c
float res;
#pragma omp parallel
{
    float B;
    int i, id, nthrds;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    for (i = id; i < niters; i += nthrds) {
        B = big_job(i);
        #pragma omp critical   // threads wait their turn; only one at a time calls consume()
        res += consume(B);
    }
}
```
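Note that only the statement guarded by `critical` executes under mutual exclusion; the expensive `big_job(i)` call stays outside it, so threads serialize only while updating `res`.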
Atomic
```c
#pragma omp parallel
{
    double tmp, B;
    B = DOIT();
    tmp = big_ugly(B);
    #pragma omp atomic   // the update to X becomes a single atomic operation
    X += tmp;
}
```
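`atomic` protects just the single memory update that follows it (`X += tmp` here); where the hardware offers an atomic instruction, it is cheaper than a full `critical` section.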
Loop
```c
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < n; i++) {   // i is private by default
        /* ... loop body: iterations are split among the threads ... */
    }
}
```
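The `for` construct splits the iterations of the loop that follows among the threads of the team; it is frequently fused with `parallel` into the combined `#pragma omp parallel for`, which the reduction example below uses.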
Reduction
`reduction (op : list)`
Inside a parallel or a work-sharing construct:
- A local copy of each list variable is made and initialized according to the "op" (e.g., 0 for "+").
- Updates occur on the local copy.
- Local copies are reduced into a single value and combined with the original global value.
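As a sketch of the clause in action, the pi program from 0x06 can be rewritten with `reduction(+:sum)`, which eliminates both the `sum[]` array and the false-sharing problem:

```c
#include <omp.h>
#include <stdio.h>

static long num_steps = 100000000;

int main() {
    int i;
    double x, pi, sum = 0.0;
    double step = 1.0 / (double) num_steps;

    /* Each thread gets a private copy of sum initialized to 0 (the identity
       for "+"); the partial sums are combined into the shared sum at the end. */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    printf("pi=%f\n", pi);
    return 0;
}
```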