
OpenMP: From Getting Started to Giving Up (1)

An OpenMP Tutorial (Introduction)

0x01 Introduction

OpenMP is one of the most widely used parallel programming models today. It lets programmers write parallel software with comparatively little effort. Before using OpenMP, you should first know the following:

  • How to write C/C++ programs. OpenMP supports C/C++ as well as Fortran, but we will generally use C/C++.
  • How to link a program against a library.

Where OpenMP sits among the layers of a computer system is shown in the figure below:

0x02 Core Syntax

  • Most of OpenMP consists of compiler directives of the form #pragma omp construct [clause [clause] ...], for example
    #pragma omp parallel num_threads(4)
  • Function prototypes and the types used are declared in the <omp.h> header. GCC ships it as part of its OpenMP support; some toolchains (e.g., Apple Clang) require installing an OpenMP runtime separately.

0x03 Compilation

> gcc -fopenmp test.c
> export OMP_NUM_THREADS=n	# n is the number of processors in your machine
> ./a.out
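
Whether OpenMP was actually enabled at compile time can be checked through the standard _OPENMP preprocessor macro, which OpenMP-aware compilers define to the release date of the supported specification. A minimal sketch:

#include <stdio.h>

int main()
{
#ifdef _OPENMP
    /* _OPENMP expands to the spec's release date in yyyymm form */
    printf("OpenMP enabled, _OPENMP = %d\n", _OPENMP);
#else
    printf("Compiled without -fopenmp: OpenMP pragmas are ignored\n");
#endif
    return 0;
}

Without -fopenmp, GCC simply ignores the #pragma omp lines, so a program can compile cleanly yet run serially; this check makes that failure mode visible.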

0x04 Example Program

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int nthreads, tid;

    /* Fork a team of threads, giving each its own copies of the variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only the master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join the master thread and disband */
    return 0;
}

Output:


Hello World from thread = 0
Hello World from thread = 1
Hello World from thread = 13
Hello World from thread = 5
Hello World from thread = 14
Hello World from thread = 15
Hello World from thread = 3
Number of threads = 16
Hello World from thread = 8
Hello World from thread = 9
Hello World from thread = 11
Hello World from thread = 6
Hello World from thread = 2
Hello World from thread = 4
Hello World from thread = 10
Hello World from thread = 12
Hello World from thread = 7

0x05 Shared-Memory Programming

In this model, a single process owns multiple threads that share one address space: the threads communicate by reading from and writing to shared memory. The OS decides when each thread gets to run, and synchronization mechanisms coordinate the order of operations so that the program still executes correctly.
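
To see why synchronization matters, consider a minimal sketch (my own illustration, not from the original) in which four threads increment a shared counter with no coordination. Because counter++ is a read-modify-write, updates from different threads can overwrite each other, and the final value is usually smaller than expected:

#include <omp.h>
#include <stdio.h>

int main()
{
    long counter = 0;

    #pragma omp parallel num_threads(4)
    {
        for (int i = 0; i < 100000; i++)
            counter++;   /* unsynchronized read-modify-write: a data race */
    }

    /* Expected 400000, but lost updates typically make it smaller */
    printf("counter = %ld\n", counter);
    return 0;
}

The constructs in section 0x07 (barrier, critical, atomic, reduction) exist precisely to repair races like this one.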

0x06 Creating Threads

double A[1000];
omp_set_num_threads(4);   // request 4 threads
#pragma omp parallel      // each thread executes the block below
{
    int ID = omp_get_thread_num();  // each thread needs its own thread ID
    pooh(ID, A);
}

When this runs, the runtime forks a team of four threads; each executes the same block with its own ID while sharing the array A, and the team joins again at the end of the parallel region. A compilable version of this sketch follows.

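Here is one way to flesh the fragment out (the body of pooh is my own assumption for illustration; the original only names it):

#include <omp.h>
#include <stdio.h>

/* Hypothetical stub: the original fragment leaves pooh undefined */
void pooh(int ID, double *A)
{
    A[ID] = ID;  /* each thread writes only its own slot, so no race */
    printf("pooh(%d) ran on thread %d\n", ID, omp_get_thread_num());
}

int main()
{
    double A[1000];

    omp_set_num_threads(4);   /* request 4 threads */
    #pragma omp parallel      /* fork: each thread runs the block */
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);
    }                         /* join: threads synchronize here */
    return 0;
}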

Let us now work through a concrete case:

For the integral \(\int_0^1\frac{4}{1+x^2}dx\), the antiderivative tells us the exact value is \(\pi\). For many complicated integrals, however, no closed form is available, and the computer can only approximate the value numerically:

\[\sum_{i=0}^{N} F(x_i)\,\Delta x \approx \pi \]

The smaller we make \(\Delta x\), the closer the result gets to the exact value, but the longer the program takes. Since the terms for the individual \(x_i\) are independent of one another, the computation can be parallelized to cut the running time.

First, the serial version of the program:

#include <stdio.h>

static long num_steps = 100000000;
double step;

int main()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0 / (double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    printf("pi=%f\n", pi);
    return 0;
}

Running it gives:

time=652.356ms
pi=3.141593

Now the parallelized version:

#include <omp.h>
#include <time.h>
#include <stdio.h>

static long num_steps = 1000000000;
double step;
#define NUM_THREADS 16

int main()
{
    clock_t start, end;
    int i, nthreads;
    double pi, sum[NUM_THREADS];

    start = clock();  /* CPU time; omp_get_wtime() would give wall-clock time */
    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;

        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;  /* only one thread writes the shared variable */
        for (i = id, sum[id] = 0.0; i < num_steps; i += nthrds) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++)
        pi += sum[i] * step;
    end = clock();
    printf("time=%f\n", (double)(end - start) * 1000.0 / CLOCKS_PER_SEC);  /* ms */
    printf("pi=%f\n", pi);
    return 0;
}

The output:

NUM_THREADS=2
time=293.483000
pi=3.141593

NUM_THREADS=4
time=159.554000
pi=3.141593

Note: the program above suffers from false sharing, which is why the time does not drop linearly as NUM_THREADS increases: the per-thread elements of sum[] sit on the same cache line, so every update by one thread invalidates that line in the other cores' caches. For more on false sharing, see https://zhuanlan.zhihu.com/p/55917869. A padded variant that avoids the problem is sketched below.
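
One common remedy is to pad each thread's accumulator out to a full cache line so that no two threads ever write the same line. The sketch below assumes 64-byte cache lines (8 doubles); the padding size is an assumption, not something the original states:

#include <omp.h>

#define NUM_THREADS 16
#define PAD 8   /* assumed: 64-byte cache line = 8 doubles */

static long num_steps = 1000000000;

double pi_padded(void)
{
    double sum[NUM_THREADS][PAD];  /* one cache line per thread */
    double step = 1.0 / (double) num_steps;
    double pi = 0.0;
    int nthreads = 1;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;

        if (id == 0) nthreads = nthrds;
        sum[id][0] = 0.0;              /* only element 0 of each row is used */
        for (long i = id; i < num_steps; i += nthrds) {
            x = (i + 0.5) * step;
            sum[id][0] += 4.0 / (1.0 + x * x);
        }
    }
    for (int i = 0; i < nthreads; i++)
        pi += sum[i][0] * step;
    return pi;
}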

0x07 Synchronization

Synchronization coordinates the actions of one or more threads; the two most common forms are the barrier and mutual exclusion.

Barrier

#pragma omp parallel
{
    int id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier   /* wait until every thread reaches this point */
    B[id] = big_calc2(id, A);
}

Execution continues past the barrier only once all threads have reached it.

Critical

float res;
#pragma omp parallel
{
    float B;
    int i, id, nthrds;

    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    for (i = id; i < niters; i += nthrds) {
        B = big_job(i);
        #pragma omp critical  /* threads wait their turn; only one at a time calls consume() */
        res += consume(B);
    }
}

Atomic

#pragma omp parallel
{
    double tmp, B;
    B = DOIT();
    tmp = big_ugly(B);
    #pragma omp atomic    /* the update of X is performed as one atomic operation */
    X += tmp;
}

Loop

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < n; i++) {  /* i is private by default */
        /* ... loop body ... */
    }
}
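
As a fuller, compilable sketch of the work-sharing loop (the vector addition here is my own illustration, not from the original), the parallel and for directives can also be combined into one:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main()
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {  /* serial setup */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Fork a team and split the iterations among its threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)    /* i is private by default */
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}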


Reduction

reduction(op : list)

Inside a parallel or a work-sharing construct:

  • A local copy of each variable in list is made and initialized according to the op (e.g., 0 for "+").
  • Updates occur on the local copy.
  • The local copies are reduced into a single value and combined with the original global value.
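
Putting this to work, the pi program from 0x06 can be rewritten with a reduction clause, which removes both the hand-rolled sum[] array and its false-sharing pitfall (a sketch consistent with the earlier example, not code from the original):

#include <omp.h>
#include <stdio.h>

static long num_steps = 100000000;

int main()
{
    double step = 1.0 / (double) num_steps;
    double sum = 0.0;

    /* Each thread gets a private sum initialized to 0 (the identity of +);
       the private copies are combined into the shared sum at the end. */
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi = %f\n", step * sum);
    return 0;
}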