
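Getting Started with Multithreading: Matrix Multiplication

  • Problem: a series of optimizations for matrix multiplication. We use c[2000][2000] = a[2000][2000] * b[2000][2000] as the running example.

  • Environment: Intel 9600K CPU, 6 cores / 6 threads, 3.7 GHz

Multithreading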

To speed this program up, we take advantage of multiple cores. Each row of the result matrix is computed by its own thread, so the rows are multiplied in parallel and the total processing time drops.

Let us create a worker class that implements the Runnable interface.

RowMultiplyWorker.java:

public class RowMultiplyWorker implements Runnable /* extends Thread */ {
    public RowMultiplyWorker(int[][] result, int[][] matrix1, int[][] matrix2, int row) {
        this.result = result;
        this.matrix1 = matrix1;
        this.matrix2 = matrix2;
        this.row = row;
    }

    @Override
    public void run() {
        for (int i = 0; i < matrix2[0].length; i++) {
            result[row][i] = 0;
            for (int j = 0; j < matrix1[row].length; j++) {
                result[row][i] += matrix1[row][j] * matrix2[j][i];
            }
        }
    }

    // The input matrices are only read, never reassigned.
    private final int[][] matrix1;
    private final int[][] matrix2;
    private final int[][] result;
    private final int row;
}
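No locking is needed in the worker because each thread writes only to its own row of result and merely reads the two input matrices. The sketch below illustrates this on a tiny matrix, with one thread per row. RowWorkerDemo, rowTask and multiplySmall are hypothetical names introduced for illustration; the inner loops mirror RowMultiplyWorker.run().

```java
// Hypothetical demo class; the row computation mirrors RowMultiplyWorker.run().
final class RowWorkerDemo {

    // Builds a task that computes result[row] = m1[row] * m2.
    static Runnable rowTask(int[][] result, int[][] m1, int[][] m2, int row) {
        return () -> {
            for (int i = 0; i < m2[0].length; i++) {
                int sum = 0;
                for (int j = 0; j < m1[row].length; j++) {
                    sum += m1[row][j] * m2[j][i];
                }
                result[row][i] = sum; // only this task touches result[row]
            }
        };
    }

    // Starts one thread per row and joins them all; suitable for tiny matrices only.
    static int[][] multiplySmall(int[][] m1, int[][] m2) {
        int[][] result = new int[m1.length][m2[0].length];
        Thread[] threads = new Thread[m1.length];
        for (int r = 0; r < m1.length; r++) {
            threads[r] = new Thread(rowTask(result, m1, m2, r));
            threads[r].start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // after join() returns, the worker's writes are visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return result;
    }
}
```

Calling RowWorkerDemo.multiplySmall on {{1, 2}, {3, 4}} and {{5, 6}, {7, 8}} yields {{19, 22}, {43, 50}}, the same as the sequential product.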

Next, create a class that starts 10 threads at a time. Creating 2000 threads at once for a 2000 x 2000 matrix can make the application hang, so we start the threads in groups of 10, wait for each group to finish, and then start the next group until every row has been multiplied.

ParallelThreadCreator.java:

import java.util.List;
import java.util.ArrayList;

public class ParallelThreadCreator {

    // Start 10 threads, wait for them to finish, then repeat for the next 10 rows.
    public static void multiply(int[][] matrix1, int[][] matrix2, int[][] result) {
        List<Thread> threads = new ArrayList<>();
        int rows1 = matrix1.length;
        for (int i = 0; i < rows1; i++) {
            // Option 1: wrap the task in a Thread (RowMultiplyWorker implements Runnable)
            RowMultiplyWorker task = new RowMultiplyWorker(result, matrix1, matrix2, i);
            Thread thread = new Thread(task);
            thread.start();
            threads.add(thread);

//            // Option 2: create the thread as a subclass (RowMultiplyWorker extends Thread)
//            Thread thread = new RowMultiplyWorker(result, matrix1, matrix2, i);
//            thread.start();
//            threads.add(thread);

            if (threads.size() % 10 == 0) {
                waitForThreads(threads);
            }
        }
        // Join any threads left over when the row count is not a multiple of 10.
        waitForThreads(threads);
    }

    private static void waitForThreads(List<Thread> threads) {
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        threads.clear();
    }
}

Let us create the main class to measure how long this approach takes.

import java.util.Random;

public class Main {
    public static void main(String[] args) {
        // Initialize arrays a and b
        init();

        // No optimization
        int t1 = get1();
        System.out.println("Single-threaded, sequential: " + t1 + " ms, about " + t1 / 1000 + " s");
        System.out.println("==================================================");

        // Cache-friendly: change the loop nesting order
        int t2 = get2();
        System.out.println("Single-threaded, sequential (cache-friendly: loop order changed): " + t2 + " ms, about " + t2 / 1000 + " s");
        // Check that array d computed by get2() is correct
        if (check(c, d)) System.out.println("Correct");
        else System.out.println("Wrong");
        System.out.println("==================================================");

        // Multithreaded
        long start = System.currentTimeMillis();
        ParallelThreadCreator.multiply(a, b, e);
        long end = System.currentTimeMillis();
        int t3 = (int) (end - start);
        System.out.println("Multithreaded, parallel: " + t3 + " ms, about " + t3 / 1000 + " s");
        // Check that array e computed by the threads is correct
        if (check(c, e)) System.out.println("Correct");
        else System.out.println("Wrong");
    }

    // Returns the time (ms) of one matrix multiplication with no optimization (i-j-k loop order)
    private static int get1() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
        long end = System.currentTimeMillis();
        return (int) (end - start);
    }

    // Returns the time (ms) of one matrix multiplication with the cache-friendly i-k-j loop order
    private static int get2() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    d[i][j] += a[i][k] * b[k][j];
        long end = System.currentTimeMillis();
        return (int) (end - start);
    }

    private static void init() {
        // Fill matrices a and b with random numbers
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                // Each element is a random number in [1, 100]
                a[i][j] = rd.nextInt(seed) + 1;
                b[i][j] = rd.nextInt(seed) + 1;
            }
    }

    private static boolean check(int[][] matrix1, int[][] matrix2) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (matrix1[i][j] != matrix2[i][j])
                    return false;

        return true;
    }

    private static final int N = 2000, seed = 100;
    private static int[][] a = new int[N][N];
    private static int[][] b = new int[N][N];
    private static int[][] c = new int[N][N];   // result with no optimization
    private static int[][] d = new int[N][N];   // result of the cache-friendly version (loop order changed)
    private static int[][] e = new int[N][N];   // result of the multithreaded version
    private static Random rd = new Random();
}

In this article, we have seen how to build a concurrent application for matrix multiplication. For a 2000 x 2000 matrix, the plain sequential program took 65 seconds, whereas the multithreaded version took only 17 seconds. Timings vary by machine; the runs on the environment listed above are reported in the results below.

The parallel version can be improved further with Runtime.getRuntime().availableProcessors(). It returns the number of processors available to the JVM, and that count can be used in place of the hard-coded 10 threads.
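One possible way to apply this is a fixed-size thread pool from java.util.concurrent, sized by availableProcessors(), so that the rows are queued as tasks and no manual batching is needed. The following is a minimal sketch under that assumption, not the article's original code; PooledMultiplier is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class: one task per row, executed by a pool with one thread per available processor.
final class PooledMultiplier {

    static int[][] multiply(int[][] m1, int[][] m2) {
        int rows = m1.length, inner = m2.length, cols = m2[0].length;
        int[][] result = new int[rows][cols];

        // Size the pool from the hardware instead of hard-coding 10.
        int poolSize = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int r = 0; r < rows; r++) {
                final int row = r;
                pending.add(pool.submit(() -> {
                    for (int i = 0; i < cols; i++) {
                        int sum = 0;
                        for (int j = 0; j < inner; j++) {
                            sum += m1[row][j] * m2[j][i];
                        }
                        result[row][i] = sum;
                    }
                }));
            }
            // Wait for every row; get() also surfaces exceptions thrown inside a task.
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for rows", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a row task failed", e.getCause());
        } finally {
            pool.shutdown(); // let the worker threads exit once the queue is drained
        }
        return result;
    }
}
```

With a pool, the threads are created once and reused for all rows, rather than a new thread being started for each row.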

So it pays to use multiple threads efficiently in large-scale applications.

Experimental results

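Because the N rows are split into groups and each group is handed to a batch of threads, the group size should divide N evenly. Otherwise the last, partial group must also be joined after the loop; if it is not, the final rows may still be in flight when the result is timed and checked.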
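One way to avoid the divisibility requirement altogether is to give each worker a contiguous range of rows computed with integer arithmetic, so that the range sizes differ by at most one row. The helper below is a sketch of that idea; RowRanges and range are hypothetical names, not part of the original code.

```java
// Hypothetical helper: splits n rows into `parts` contiguous ranges of nearly equal size.
final class RowRanges {

    // Returns {start, end} (end exclusive) for worker number `part`, where 0 <= part < parts.
    static int[] range(int n, int parts, int part) {
        // Use long arithmetic so n * part cannot overflow for large matrices.
        int start = (int) ((long) n * part / parts);
        int end = (int) ((long) n * (part + 1) / parts);
        return new int[] {start, end};
    }
}
```

For N = 2000 and 6 workers, this gives ranges of 333 or 334 rows (the first is rows 0 to 332, the last is rows 1666 to 1999), and together they cover every row exactly once.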

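  • With matrix dimension N = 2000 and the group size (thread count) set to 20, the results are:

    Single-threaded, sequential: 48698 ms, about 48 s
    ==================================================
    Single-threaded, sequential (cache-friendly: loop order changed): 5859 ms, about 5 s
    Correct
    ==================================================
    Multithreaded, parallel: 8210 ms, about 8 s
    Correct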
    
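  • With matrix dimension N = 2000 and the group size (thread count) set to 10, the results are:

    Single-threaded, sequential: 48904 ms, about 48 s
    ==================================================
    Single-threaded, sequential (cache-friendly: loop order changed): 5864 ms, about 5 s
    Correct
    ==================================================
    Multithreaded, parallel: 8887 ms, about 8 s
    Correct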
    
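  • With matrix dimension N = 2000 and the group size (thread count) set to 8, the results are:

    Single-threaded, sequential: 48263 ms, about 48 s
    ==================================================
    Single-threaded, sequential (cache-friendly: loop order changed): 5871 ms, about 5 s
    Correct
    ==================================================
    Multithreaded, parallel: 8934 ms, about 8 s
    Correct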
    
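  • With matrix dimension N = 2000 and the group size (thread count) set to 4, the results are:

    Single-threaded, sequential: 49045 ms, about 49 s
    ==================================================
    Single-threaded, sequential (cache-friendly: loop order changed): 6003 ms, about 6 s
    Correct
    ==================================================
    Multithreaded, parallel: 11993 ms, about 11 s
    Correct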
    
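  • With matrix dimension N = 2000 and the group size (thread count) set to 2, the results are:

    Single-threaded, sequential: 50119 ms, about 50 s
    ==================================================
    Single-threaded, sequential (cache-friendly: loop order changed): 5861 ms, about 5 s
    Correct
    ==================================================
    Multithreaded, parallel: 24061 ms, about 24 s
    Correct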
    
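  • With matrix dimension N = 2000 and the group size (thread count) set to 1, the results are:

    Single-threaded, sequential: 51972 ms, about 51 s
    ==================================================
    Single-threaded, sequential (cache-friendly: loop order changed): 6111 ms, about 6 s
    Correct
    ==================================================
    Multithreaded, parallel: 49762 ms, about 49 s
    Correct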
    

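These runs show that neither too many threads (more than the hardware provides) nor too few (leaving hardware threads idle) is a good choice. With a group size of 1 the parallel version is no faster than the sequential one (about 49 s), and the time roughly halves each time the group size doubles, up to 4. Beyond the 6 hardware threads the gains level off (8934 ms with 8 threads, 8210 ms with 20), so the thread count should be roughly matched to the number of threads the hardware provides. At the same time, multithreading should not be pursued blindly at the expense of optimizing the program itself, as the "cache-friendly: loop order changed" figures show: at about 6 s, that single-threaded version beats every multithreaded run. This is partly because the hardware offers a limited number of threads, only 6 in this case.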
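The two optimizations are not mutually exclusive. In Java a two-dimensional array is an array of row arrays, so with the i-j-k order the innermost loop reads b[k][j] down a column and touches a different row array on every step, whereas with the i-k-j order it walks b[k] and the output row sequentially. The sketch below combines the i-k-j order of get2() with one thread per available processor, each handling a contiguous block of rows. CacheFriendlyParallel is a hypothetical class, and no timing is claimed for it since it was not part of the measurements above.

```java
// Hypothetical class: i-k-j loop order (as in get2()) run on one thread per available processor.
final class CacheFriendlyParallel {

    // Assumes non-empty, rectangular matrices with a[0].length == b.length.
    static int[][] multiply(int[][] a, int[][] b) {
        int n = a.length, inner = b.length, cols = b[0].length;
        int[][] out = new int[n][cols];
        int workers = Math.min(n, Runtime.getRuntime().availableProcessors());
        Thread[] threads = new Thread[workers];

        for (int w = 0; w < workers; w++) {
            // Contiguous block of rows [from, to) for this worker.
            final int from = (int) ((long) n * w / workers);
            final int to = (int) ((long) n * (w + 1) / workers);
            threads[w] = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    int[] outRow = out[i];
                    for (int k = 0; k < inner; k++) {
                        int aik = a[i][k];
                        int[] bRow = b[k];
                        // Innermost loop walks bRow and outRow sequentially.
                        for (int j = 0; j < cols; j++) {
                            outRow[j] += aik * bRow[j];
                        }
                    }
                }
            });
            threads[w].start();
        }

        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return out;
    }
}
```

Each worker still writes only to its own rows of the output, so no synchronization beyond the final join() is required.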
