手把手教你如何用Julia做GPU程式設計（附程式碼）

640?wx_fmt=gif

640?wx_fmt=jpeg

新智元報道

來源：nextjournal

編輯：肖琴、三石

【新智元導讀】本文旨在快速介紹GPU的工作原理，詳細介紹當前的Julia GPU生態系統，並讓讀者瞭解簡單的GPU程式設計是多麼的容易。

GPU是如何工作的？

首先，什麼是GPU？

GPU是一個大規模並行處理器，具有幾千個並行處理單元。例如，本文中使用的Tesla k80提供4992個並行CUDA核心。 GPU在頻率，延遲和硬體功能方面與CPU完全不同，但有點類似於擁有4992個核心的慢速CPU！

640?wx_fmt=png

“Tesla K80”

可啟用並行執行緒的數量可以大幅提高GPU速度，但也讓它的使用性變得更加困難。讓我們來詳細看看在使用這種原始動力時，你會遇到哪些缺點：

GPU是一個獨立的硬體，具有自己的記憶體空間和不同的架構。因此，從RAM到GPU儲存器（VRAM）的傳輸時間很長。即使在GPU上啟動核心（換句話說，排程函式呼叫）也會帶來較大的延遲。 GPU的時間約為10us，而CPU的時間則為幾納秒。
在沒有高階包裝器的情況下，設定核心會很快變得複雜
較低的精度是預設值，而較高的精度計算可以輕鬆地消除所有效能增益
GPU函式(核心)本質上是並行的，所以編寫GPU核心至少和編寫並行CPU程式碼一樣困難，但是硬體上的差異增加了相當多的複雜性
與上述相關，許多演算法都不能很好地移植到GPU上。
核心通常是用C/ C++編寫的，這並不是寫演算法的最佳語言。
CUDA和OpenCL之間存在分歧，OpenCL是用於編寫低階GPU程式碼的主要框架。雖然CUDA只支援英偉達硬體，但OpenCL支援所有硬體，但有些粗糙。

Julia的誕生是個好訊息！它是一種高階指令碼語言，允許你在Julia本身編寫核心和周圍的程式碼，同時在大多數GPU硬體上執行！

GPUArrays

大多數高度並行的演算法需要通過相當多的資料來克服所有執行緒和延遲開銷。因此，大多數演算法都需要陣列來管理所有資料，這需要一個好的GPU陣列庫（array library）作為關鍵基礎。

GPUArrays.jl是Julia的基礎。它提供了一個抽象陣列實現，專門用於使用高度並行硬體的原始功能。它包含設定GPU所需的所有功能，啟動Julia GPU函式並提供一些基本的陣列演算法。

抽象意味著它需要以CuArrays和CLArrays形式的具體實現。由於繼承了GPUArrays的所有功能，它們都提供完全相同的介面。唯一的區別出現在分配陣列時，這會強制你決定陣列是否位於CUDA或OpenCL裝置上。關於這一點的更多資訊，請參閱記憶體部分。

GPUArrays有助於減少程式碼重複，因為它允許編寫獨立於硬體的GPU核心，可以通過CuArrays或CLArrays將其編譯為本機GPU程式碼。因此，許多通用核心可以在繼承自GPUArrays的所有packages之間共享。

一點選擇建議：CuArrays僅適用於Nvidia GPU，而CLArrays適用於大多數可用的GPU。CuArrays比CLArrays更穩定，並且已經可以在Julia 0.7上執行。速度上差異不明顯。我建議兩者都試一下，看看哪個效果最好。

對於本文，我將選擇CuArrays，因為本文是為Julia 0.7 / 1.0而寫的，CLArrays仍然不支援。

效能

讓我們用一個簡單的互動式程式碼示例來快速說明為什麼要將計算轉移到GPU上，這個示例計算julia set：

1using CuArrays, FileIO, Colors, GPUArrays, BenchmarkTools
2using CuArrays: CuArray
3"""
4The function calculating the Julia set
5"""
6function juliaset(z0, maxiter)
7    c = ComplexF32(-0.5, 0.75)
8    z = z0
9    for i in 1:maxiter
10        abs2(z) > 4f0 && return (i - 1) % UInt8
11        z = z * z + c
12    end
13    return maxiter % UInt8 # % is used to convert without overflow check
14end
15range = 100:50:2^12
16cutimes, jltimes = Float64[], Float64[]
17function run_bench(in, out)
18  # use dot syntax to apply `juliaset` to each elemt of q_converted
19  # and write the output to result
20  out .= juliaset.(in, 16)
21  # all calls to the GPU are scheduled asynchronous,
22  # so we need to synchronize
23  GPUArrays.synchronize(out)
24end
25# store a reference to the last results for plotting
26last_jl, last_cu = nothing, nothing
27for N in range
28  w, h = N, N
29  q = [ComplexF32(r, i) for i=1:-(2.0/w):-1, r=-1.5:(3.0/h):1.5]
30  for (times, Typ) in ((cutimes, CuArray), (jltimes, Array))
31    # convert to Array or CuArray - moving the calculation to CPU/GPU
32    q_converted = Typ(q)
33    result = Typ(zeros(UInt8, size(q)))
34    for i in 1:10 # 5 samples per size
35      # benchmarking macro, all variables need to be prefixed with $
36      t = [email protected] begin
37                run_bench(q_converted, result)
38      end
39      global last_jl, last_cu # we're in local scope
40      if result isa CuArray
41        last_cu = result
42      else
43          last_jl = result
44      end
45      push!(times, t)
46    end
47  end
48end
49
50cu_jl = hcat(Array(last_cu), last_jl)
51cmap = colormap("Blues", 16 + 1)
52color_lookup(val, cmap) = cmap[val + 1]
53save("results/juliaset.png", color_lookup.(cu_jl, (cmap,)))

640?wx_fmt=png

1using Plots; plotly()
2x = repeat(range, inner = 10)
3speedup = jltimes ./ cutimes
4Plots.scatter(
5  log2.(x), [speedup, fill(1.0, length(speedup))],
6  label = ["cuda" "cpu"], markersize = 2, markerstrokewidth = 0,
7  legend = :right, xlabel = "2^N", ylabel = "speedup"
8)

640?wx_fmt=png

如你所見，對於大型陣列，通過將計算移動到GPU可以獲得穩定的60-80倍的加速。而且非常簡單，只需將Julia array轉換為GPUArray。

有人可能認為GPU的效能受到像Julia這樣的動態語言的影響，但Julia的GPU效能應該與CUDA或OpenCL的原始效能相當。Tim Besard在整合LLVM Nvidia編譯pipeline方面做得非常出色，達到了與純CUDA C程式碼相同（有時甚至更好）的效能。Tim發表了一篇非常詳細的博文，裡面進一步解釋了這一點[1]。CLArrays方法有點不同，它直接從Julia生成OpenCL C程式碼，具有與OpenCL C相同的效能！

為了更好地瞭解效能並檢視與多執行緒CPU程式碼的比較，我收集了一些基準測試[2]。

記憶體（Memory）

GPU具有自己的儲存空間，包括視訊儲存器（VRAM），不同的快取記憶體和暫存器。無論你做什麼，任何Julia物件都必須先轉移到GPU才能使用。並非Julia中的所有型別都可以在GPU上工作。

首先讓我們看一下Julia的型別：

1struct Test # an immutable struct
2# that only contains other immutable, which makes
3# isbitstype(Test) == true
4    x::Float32
5end
6
7# the isbits property is important, since those types can be used
8# without constraints on the GPU!
9@assert isbitstype(Test) == true
10x = (2, 2)
11isa(x, Tuple{Int, Int}) # tuples are also immutable
12mutable struct Test2 #-> mutable, isbits(Test2) == false
13    x::Float32
14end
15struct Test3
16    # contains a heap allocation/ reference, not isbits
17    x::Vector{Float32}
18    y::Test2 # Test2 is mutable and also heap allocated / a reference
19end
20Vector{Test} # <-  An Array with isbits elements is contigious in memory
21Vector{Test2} # <- An Array with mutable elements is basically an array of heap pointers. Since it just contains cpu heap pointers, it won't work on the GPU.

"Array{Test2,1}"

所有這些Julia型別在轉移到GPU或在GPU上建立時表現都不同。下表概述了預期結果：

640?wx_fmt=png

建立位置描述了物件是否在CPU上建立然後傳輸到GPU核心，或者是否在核心的GPU上建立。這個表顯示了是否可以建立型別的例項，並且對於從CPU到GPU的傳輸，該表還指示物件是否通過引用複製或傳遞。

Garbage Collection

使用GPU時的一個很大的區別是GPU上沒有垃圾回收（ garbage collector, GC）。這不是什麼大問題，因為為GPU編寫的高效能核心不應該一開始就建立任何GC-tracked memory。

為GPU實現GC是可能的，但請記住，每個執行的核心都是大規模並行的。在~1000 GPU執行緒中的每一個執行緒建立和跟蹤大量堆記憶體將很快破壞效能增益，因此這實際上是不值得的。

作為核心中堆分配陣列的替代方法，你可以使用GPUArrays。GPUArray建構函式將建立GPU緩衝區並將資料傳輸到VRAM。如果呼叫Array(gpu_array)，陣列將被轉移回RAM，表示為普通的Julia陣列。這些GPU陣列的Julia控制代碼由Julia的GC跟蹤，如果它不再使用，GPU記憶體將被釋放。

因此，只能在裝置上使用堆疊分配，並且對其餘的預先分配的GPU緩衝區使用。由於傳輸非常昂貴的，因此在程式設計GPU時儘可能多地重用和預分配是很常見的。

The GPUArray Constructors

1using CuArrays, LinearAlgebra
2
3# GPU Arrays can be constructed from all Julia arrays containing isbits types!
4A1D = cu([1, 2, 3]) # cl for CLArrays
5A1D = fill(CuArray{Int}, 0, (100,)) # CLArray for CLArrays
6# Float32 array - Float32 is usually preferred and can be up to 30x faster on most GPUs than Float64
7diagonal_matrix = CuArray{Float32}(I, 100, 100)
8filled = fill(CuArray, 77f0, (4, 4, 4)) # 3D array filled with Float32 77
9randy = rand(CuArray, Float32, 42, 42) # random numbers generated on the GPU
10# The array constructor also accepts isbits iterators with a known size
11# Note, that since you can also pass isbits types to a gpu kernel directly, in most cases you won't need to materialize them as an gpu array
12from_iter = CuArray(1:10)
13# let's create a point type to further illustrate what can be done:
14struct Point
15 x::Float32
16 y::Float32
17end
18Base.convert(::Type{Point}, x::NTuple{2, Any}) = Point(x[1], x[2])
19# because we defined the above convert from a tuple to a point
20# [Point(2, 2)] can be written as Point[(2,2)] since all array
21# elements will get converted to Point
22custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])
23typeof(custom_types)

"CuArray{Point, 1}"

Array Operations

許多操作是已經定義好的。最重要的是，GPUArrays支援Julia的fusing dot broadcasting notation。這種標記法允許你將函式應用於陣列的每個元素，並使用f的返回值建立一個新陣列。這個功能通常稱為對映（map）。 broadcast 指的是具有不同形狀的陣列被散佈到相同的形狀。

它的工作方式如下：

1x = zeros(4, 4) # 4x4 array of zeros
2y = zeros(4) # 4 element array
3z = 2 # a scalar
4# y's 1st dimension gets repeated for the 2nd dimension in x
5# and the scalar z get's repeated for all dimensions
6# the below is equal to `broadcast(+, broadcast(+, xx, y), z)`
7x .+ y .+ z

640?wx_fmt=png

關於broadcasting如何工作的更多解釋，可以看看這個指南：

julia.guide/broadcasting

這意味著在不分配堆記憶體（僅建立isbits型別）的情況下執行的任何Julia函式都可以應用於GPUArray的每個元素，並且多個dot呼叫將融合到一個核心呼叫中。由於核心呼叫延遲很高，這種融合是一個非常重要的優化。

1using CuArrays
2A = cu([1, 2, 3])
3B = cu([1, 2, 3])
4C = rand(CuArray, Float32, 3)
5result = A .+ B .- C
6test(a::T) where T = a * convert(T, 2) # convert to same type as `a`
7
8# inplace broadcast, writes directly into `result`
9result .= test.(A) # custom function work
10
11# The cool thing is that this composes well with custom types and custom functions.
12# Let's go back to our Point type and define addition for it
13Base.:(+)(p1::Point, p2::Point) = Point(p1.x + p2.x, p1.y + p2.y)
14
15# now this works:
16custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])
17
18# This particular example also shows the power of broadcasting:
19# Non array types are broadcasted and repeated for the whole length
20result = custom_types .+ Ref(Point(2, 2))
21
22# So the above is equal to (minus all the allocations):
23# this allocates a new array on the gpu, which we can avoid with the above broadcast
24broadcasted = fill(CuArray, Point(2, 2), (3,))
25
26result == custom_types .+ broadcasted

ture

現實世界中的GPUArrays

讓我們直接看看一些很酷的用例。

如下面的視訊所示，這個GPU加速煙霧模擬是使用GPUArrays + CLArrays建立的，可在GPU或CPU上執行，GPU版本的速度提高了15倍：

還有更多的用例，包括求解微分方程，有限元模擬和求解偏微分方程。

讓我們來看一個簡單的機器學習示例，看看如何使用GPUArrays：

1using Flux, Flux.Data.MNIST, Statistics
2using Flux: onehotbatch, onecold, crossentropy, throttle
3using Base.Iterators: repeated, partition
4using CuArrays
5
6# Classify MNIST digits with a convolutional network
7
8imgs = MNIST.images()
9
10labels = onehotbatch(MNIST.labels(), 0:9)
11
12# Partition into batches of size 1,000
13train = [(cat(float.(imgs[i])..., dims = 4), labels[:,i])
14 for i in partition(1:60_000, 1000)]
15
16use_gpu = true # helper to easily switch between gpu/cpu
17
18todevice(x) = use_gpu ? gpu(x) : x
19
20train = todevice.(train)
21
22# Prepare test set (first 1,000 images)
23tX = cat(float.(MNIST.images(:test)[1:1000])..., dims = 4) |> todevice
24tY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> todevice
25
26m = Chain(
27 Conv((2,2), 1=>

手把手教你如何用Julia做GPU程式設計（附程式碼）

【新智元導讀】本文旨在快速介紹GPU的工作原理，詳細介紹當前的Julia GPU生態系統，並讓讀者瞭解簡單的GPU程式設計是多麼的容易。

手把手教你如何用Julia做GPU程式設計（附程式碼）

手把手教你用R實現標記化（附程式碼、學習資料、語料庫）

教你用TensorFlow實現神經網路（附程式碼）

【震驚】手把手教你用python做繪圖工具（一）

傻瓜教程：手把手教你解決多個應用例項（附程式碼、手繪圖）

【Python量化】手把手教你用python做股票分析入門

手把手教你用matlab做深度學習(一)- --CNN

手把手教你用電腦做伺服器建網站並讓外網訪問

手把手教你用C#做疫情傳播模擬

手把手教你如何玩轉面試題（Java基礎）

ChainDesk : 手把手教你實現簡易比特幣（Java版）

手把手教你如何自定義DAO框架（重量級乾貨）（yet）

自創資料集，用TensorFlow預測股票教程 !（附程式碼）

Swift面向協議程式設計（附程式碼）

手把手 | 教你用幾行Python和消費資料做客戶細分

手把手教你用幾行Python和消費資料做客戶細分

手把手教你用Delphi實現硬體版hello world程式設計控制點亮電燈泡

人工智慧應用-手把手教你用Python硬體程式設計實現開啟或關閉電燈泡

零基礎可上手 | 手把手教你用Cloud AutoML做毒蜘蛛分類器

手把手教你用圖靈機器人做微信公眾號自動回覆助手

手把手教你如何用Julia做GPU程式設計（附程式碼）

【新智元導讀】本文旨在快速介紹GPU的工作原理，詳細介紹當前的Julia GPU生態系統，並讓讀者瞭解簡單的GPU程式設計是多麼的容易。

相關推薦