This is Chuanqi's Blog

1. Fermi architecture




  • SM: streaming multiprocessors, each with multiple processing cores
    • Each SM contains 32 processing cores
    • Executes in a Single Instruction Multiple Thread (SIMT) fashion
    • Up to 16 SMs on a card, for a maximum of 512 compute cores
    • Instruction Cache (?K): caches instructions
    • Warp Scheduler: schedules warps
    • Dispatch Unit: sends instructions to the warp that is about to execute
    • Register File
    • Core: also called a streaming processor (SP), roughly equivalent to an ALU in a CPU
    • LD/ST: load and store units, responsible for memory access
    • SFU: special function unit, for functions such as cos and sin
    • L1 cache / shared memory: 64 KB, configurable
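
The figures in the list above can be checked against a real card. The following is a minimal sketch, not a definitive tool: `totalCores` is a hypothetical helper introduced here, and the 32 cores per SM is taken from the list above, since `cudaDeviceProp` does not report cores per SM directly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): total cores = SM count x cores per SM.
constexpr int totalCores(int smCount, int coresPerSM) {
    return smCount * coresPerSM;
}

int main() {
    // Figures from the list above: 16 SMs x 32 cores per SM.
    std::printf("Full Fermi card: %d cores\n", totalCores(16, 32));

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No CUDA device available\n");
        return 0;
    }
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    std::printf("SMs: %d, warp size: %d\n", prop.multiProcessorCount, prop.warpSize);
    std::printf("Registers per block: %d\n", prop.regsPerBlock);
    std::printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("L2 cache: %d bytes\n", prop.l2CacheSize);
    return 0;
}
```

On a non-Fermi card the queried values will differ from the list, which describes compute capability 2.x only.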

Cache description for compute capability 2.x (Fermi)

Constant cache

A multiprocessor also has a read-only constant cache that is shared by all functional units and speeds up reads 
from the constant memory space, which resides in device memory.
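
As an illustration of the constant cache described above, here is a minimal sketch in which every thread reads the same small coefficient table from constant memory. The kernel `evalPoly`, the array `coeffs`, and the sizes are hypothetical names and values chosen for this example only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Coefficient table in constant memory; reads go through the constant cache.
__constant__ float coeffs[4];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 with Horner's rule.
// All threads in a warp read the same coeffs[] entries.
__global__ void evalPoly(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = ((coeffs[3] * v + coeffs[2]) * v + coeffs[1]) * v + coeffs[0];
    }
}

int main() {
    const int n = 256;
    float hostCoeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float hx[n], hy[n];
    for (int i = 0; i < n; ++i) hx[i] = 0.01f * i;

    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));

    // Constant memory is written from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(hostCoeffs));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    evalPoly<<<(n + 127) / 128, 128>>>(dx, dy, n);
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("y[0] = %f, y[%d] = %f\n", hy[0], n - 1, hy[n - 1]);

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Constant memory suits this pattern because the table is small, read-only during the kernel, and accessed at the same address by all threads.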

data cache

There is an L1 cache for each multiprocessor and an L2 cache shared by all multiprocessors, 
both of which are used to cache accesses to local or global memory, including temporary register spills. 
The cache behavior (e.g., whether reads are cached in both L1 and L2 or in L2 only) can be partially configured on 
a per-access basis using modifiers to the load or store instruction.

The same on-chip memory is used for both L1 and shared memory: it can be configured as 48 KB of shared memory and 16 KB of L1 cache
or as 16 KB of shared memory and 48 KB of L1 cache, using cudaFuncSetCacheConfig()/cuFuncSetCacheConfig().
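
A minimal sketch of the runtime API call follows. The kernel `reverseTile` and the sizes are hypothetical, chosen only because the kernel relies on shared memory. The setting is a preference that the runtime applies when it can, so the return code is checked rather than assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses shared memory: reverses each block's tile in place.
__global__ void reverseTile(int* data) {
    __shared__ int tile[256];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();
    data[base + t] = tile[blockDim.x - 1 - t];
}

int main() {
    const int n = 1024;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    // Request the larger shared memory split (48 KB shared / 16 KB L1 on Fermi).
    // cudaFuncCachePreferL1 would request the opposite split.
    cudaError_t err = cudaFuncSetCacheConfig(reverseTile, cudaFuncCachePreferShared);
    if (err != cudaSuccess) {
        std::printf("cudaFuncSetCacheConfig failed: %s\n", cudaGetErrorString(err));
    }

    reverseTile<<<n / 256, 256>>>(dev);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);

    std::printf("host[0] = %d\n", host[0]);
    cudaFree(dev);
    return 0;
}
```

A kernel that spills registers or has little shared memory use would more plausibly ask for `cudaFuncCachePreferL1` instead.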

b) Kepler architecture
c) Maxwell
d) The latest Pascal architecture
e) Overview of SP, SM, SFU, and LD/ST
f) Register file
g) Shared memory / L1 cache
h) L2 cache
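2. GPU computation flow
a) Instruction fetch
b) Decode
c) Execute
d) Write-back
e) Characteristics of warp scheduling
f) Characteristics of memory request coalescing
g) Handling of warp divergence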
3. Memory hierarchy overview: the main characteristics of each level and the problems found
a) On-chip memory
i. Register file
ii. Shared memory
iii. L1D cache
iv. Bypass
b) Off-chip memory
i. L2 cache
ii. DRAM scheduling


