【PyTorch for Beginners】9 Tensor Data Types and Storage Structure

阿新 • Published: 2020-09-12

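This article comes from the WeChat official account 【機器學習煉丹術】. The previous lesson walked through a small hands-on MNIST image classification project. Now we dig a little deeper into some assorted PyTorch details to build up background knowledge.

## 1 PyTorch data types

### 1.1 Default integer and floating-point types

**【PyTorch's default integer type is int64】** By default an integer is stored in 64 bits, that is, 8 bytes.

**【PyTorch's default floating-point type is float32】** By default a float is stored in 32 bits, that is, 4 bytes.

```python
import torch
import numpy as np
#----------------------
print('Default dtypes of torch floats and integers')
a = torch.tensor([1,2,3])
b = torch.tensor([1.,2.,3.])
print(a,a.dtype)
print(b,b.dtype)
```

Output:

```python
Default dtypes of torch floats and integers
tensor([1, 2, 3]) torch.int64
tensor([1., 2., 3.]) torch.float32
```

### 1.2 Changing the type with dtype

```python
print('Default dtypes of torch floats and integers')
a = torch.tensor([1,2,3],dtype=torch.int8)
b = torch.tensor([1.,2.,3.],dtype = torch.float64)
print(a,a.dtype)
print(b,b.dtype)
```

Output:

```python
Default dtypes of torch floats and integers
tensor([1, 2, 3], dtype=torch.int8) torch.int8
tensor([1., 2., 3.], dtype=torch.float64) torch.float64
```

### 1.3 Which data types are available

Tensor data types correspond almost one-to-one with those of numpy.array, except that ```str``` is not supported. The main ones are:

```python
torch.float64 # same as torch.double
torch.float32 # default float type, FloatTensor
torch.float16
torch.int64 # default integer type, same as torch.long
torch.int32 # same as torch.int, IntTensor
torch.int16
torch.int8
torch.uint8 # unsigned 8-bit integer, values 0-255
torch.bool
```

To create a variable of a specific type you can use the dtype keyword shown above, but I personally prefer the type-specific constructors:

```python
print('torch constructors')
a = torch.IntTensor([1,2,3])
b = torch.LongTensor([1,2,3])
c = torch.FloatTensor([1,2,3])
d = torch.DoubleTensor([1,2,3])
e = torch.tensor([1,2,3])
f = torch.tensor([1.,2.,3.])
print(a.dtype)
print(b.dtype)
print(c.dtype)
print(d.dtype)
print(e.dtype)
print(f.dtype)
```

Output:

```python
torch constructors
torch.int32
torch.int64
torch.float32
torch.float64
torch.int64
torch.float32
```

From this we can conclude:

- ```torch.IntTensor``` corresponds to ```torch.int32```
- ```torch.LongTensor``` corresponds to ```torch.int64```. **LongTensor is commonly used for label values in deep learning**: for example, the class labels 0, 1, 2, 3 in a classification task are required to be int64.
- ```torch.FloatTensor``` corresponds to ```torch.float32```. **FloatTensor is commonly used for learnable parameters and input data in deep learning.**
- ```torch.DoubleTensor``` corresponds to ```torch.float64```
- ```torch.tensor``` infers the type from its input: if the input data are integers, the result defaults to int64, equivalent to LongTensor; if the input data are floats, it defaults to float32, equivalent to FloatTensor. **These happen to match the data types used for labels and parameters in deep learning, so in most cases plain tensor is all you need. When a dtype error does show up, though, you should know how to use dtype or the constructors to make the types match.**

### 1.4 Data type conversion

**【Using the .float() family of methods】**

```python
print('Data type conversion')
a = torch.tensor([1,2,3])
b = a.float()
c = a.double()
d = a.long()
print(b.dtype) # torch.float32
print(c.dtype) # torch.float64
print(d.dtype) # torch.int64
```

This is the approach I personally use most.

**【Using the type method】**

```python
b = a.type(torch.float32)
c = a.type(torch.float64)
d = a.type(torch.int64)
print(b.dtype) # torch.float32
print(c.dtype) # torch.float64
print(d.dtype) # torch.int64
```

## 2 torch vs numpy

PyTorch is a Python package built for **deep learning applications**. torch implements most of the essential functionality of numpy, and tensors can additionally use the GPU to accelerate training.

### 2.1 Converting between the two

Conversion is very simple:

```python
import torch
import numpy as np
a = np.array([1.,2.,3.])
b = torch.tensor(a)
c = b.numpy()
print(a)
print(b)
print(c)
```

Output:

```python
[1. 2. 3.]
tensor([1., 2., 3.], dtype=torch.float64)
[1. 2. 3.]
```

The next part gets a bit more interesting, because it is about memory copying. If two variables a and b share the same memory, changing a also changes b. If the memory was copied, a and b live in separate buffers, so changing a does not affect b. **The following explains when converting between numpy and torch shares memory and when it copies.** (Honestly, this is just nice-to-know trivia.)

**【Tensor() conversion】** By default ```torch.Tensor``` always produces a float32 tensor, since it is the same as ```torch.FloatTensor```. When the numpy dtype matches that torch dtype, memory is shared; when it differs, the data is copied.

```python
print('numpy <-> torch conversion 1')
a = np.array([1,2,3],dtype=np.float64)
b = torch.Tensor(a)
b[0] = 999
print('shared memory' if a[0]==b[0] else 'not shared')
# not shared
```

This is because np.float64 and torch.float32 are different data types.

```python
print('numpy <-> torch conversion 2')
a = np.array([1,2,3],dtype=np.float32)
b = torch.Tensor(a)
b[0] = 999
print('shared memory' if a[0]==b[0] else 'not shared')
# shared memory
```

This is because np.float32 and torch.float32 are the same data type.

**【from_numpy() conversion】**

```python
print('from_numpy()')
a = np.array([1,2,3],dtype=np.float64)
b = torch.from_numpy(a)
b[0] = 999
print('shared memory' if a[0]==b[0] else 'not shared')
# shared memory
a = np.array([1,2,3],dtype=np.float32)
b = torch.from_numpy(a)
b[0] = 999
print('shared memory' if a[0]==b[0] else 'not shared')
# shared memory
```

With from_numpy(), memory is shared regardless of the data type; the resulting tensor simply takes the dtype of the array.

**【tensor() conversion】**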
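**The lowercase tensor() is the one used more often; note the case of the T.** With torch.tensor, the data is always copied regardless of the input type, so memory is never shared.

**【.numpy()】** When converting a tensor to numpy, the ```.numpy()``` method shares memory. If you want a copy instead, use ```.numpy().copy()```, or use ```.clone().numpy()``` to achieve the same effect. clone is a tensor method; copy is a numpy method.

**【Summary】** If you can't remember all of this, just remember: **tensor() copies the data, and .numpy() shares memory.**

### 2.2 Differences between the two

**【Naming】** Although PyTorch implements much of NumPy's functionality, **the same functionality often goes by a different name, which can confuse users.** For example, creating a random tensor:

```python
print('Naming conventions')
a = torch.rand(2,3,4)
b = np.random.rand(2,3,4)
```

**【Reshaping tensors】** This is covered in detail in the next section.

## 3 Tensors

- **Scalar**: the data is a single number
- **Vector**: the data is a sequence of numbers, i.e. a one-dimensional tensor
- **Matrix**: the data is a two-dimensional array, i.e. a two-dimensional tensor
- **Tensor**: when the data has more than 2 dimensions, it is called a multi-dimensional tensor

### 3.1 Changing the shape of a tensor

- PyTorch commonly uses reshape and view
- numpy uses resize and reshape
- PyTorch also has resize_, but it is rarely used

**【reshape and view share memory (commonly used)】**

```python
a = torch.arange(0,6)
b = a.reshape((2,3))
print(b)
c = a.view((2,3))
print(c)
a[0] = 999
print(b)
print(c)
```

Output:

```python
tensor([[0, 1, 2],
        [3, 4, 5]])
tensor([[0, 1, 2],
        [3, 4, 5]])
tensor([[999,   1,   2],
        [  3,   4,   5]])
tensor([[999,   1,   2],
        [  3,   4,   5]])
```

The three variables a, b and c above actually share the same memory, so changing one changes them all. (For reshape this holds when the input is contiguous, as it is here; when a view is not possible, reshape returns a copy.) There is also a rule to follow: **the original data has 6 elements, so it can be reshaped to $2\times 3$ but not to $2\times 4$.** Let's try:

```python
a = torch.arange(0,6)
b = a.reshape((2,4))
```

This raises the following error:

![](https://img-service.csdnimg.cn/img_convert/40eeac9569430237bcdc2f247c8d194a.png)

**【torch's resize_ (rarely used)】** PyTorch does have a rarely used function (at least I rarely use it), ```resize_```, which does not have to follow this rule:

```python
a = torch.arange(0,6)
a.resize_(2,4)
print(a)
```

The output is:

![](https://img-service.csdnimg.cn/img_convert/fd02402930d243c95615cc5686aeba71.png)

Two extra elements were appended automatically; their values are uninitialized. I'm not sure what this function is really good for...

**Note the trailing \_ after resize. It means the same thing as inplace=True: whenever a function has this \_ or an inplace argument, the modification is made directly on the original variable, so there is no need to assign the result to a new variable.**

**【numpy's resize and reshape (commonly used)】**

```python
import numpy as np
a = np.arange(0,6)
a.resize(2,3)
print(a)
```

```python
import numpy as np
a = np.arange(0,6)
b = a.reshape(2,3)
print(b)
```

Both code blocks print the output below. The difference is that numpy's resize has no return value and modifies the original variable directly, which is equivalent to inplace=True, while reshape returns a new array object and leaves the shape of the original variable unchanged (although the result still shares memory with it):

```python
[[0 1 2]
 [3 4 5]]
```

### 3.2 How a tensor is stored in memory

The data structure of a ```tensor``` consists of two parts:

- Header (Tensor): stores the tensor's shape (size), stride, data type and similar information
- Storage: stores the actual data

The header takes up little memory; most of the memory is used by the Storage. **Every tensor has a corresponding storage. Different tensors generally have different headers, yet they may use the same storage.** (This is what happens with the memory-sharing view and reshape methods from earlier: the shape recorded in the header changes, but the data is stored in the same storage.)

### 3.3 The storage

Let's look at the storage of a tensor:

```python
import torch
a = torch.arange(0,6)
print(a.storage())
```

The output is:

```python
 0
 1
 2
 3
 4
 5
[torch.LongStorage of size 6]
```

Now apply a view to the tensor:

```python
b = a.view(2,3)
```

Printing ```b.storage()``` gives the same result as ```a.storage()```, which is why a view shares memory.

```python
# id() returns the identity of a Python object (its memory address in CPython)
print(id(a)==id(b)) # False
# compare the address of the underlying buffer; id(a.storage) would only
# compare temporary bound-method objects and says nothing about the data
print(a.storage().data_ptr()==b.storage().data_ptr()) # True
```

So although a and b have the same storage, they are different objects as a whole. The difference lies in the header, where the size has changed. **The headers differ while the storage is shared, which saves a lot of memory.**

Let's go one step further. If we slice a tensor, does the slice share memory with the original, and what does its storage look like?

```python
print('Exploring tensor slicing')
a = torch.arange(0,6)
b = a[2]
print(a.storage().data_ptr()==b.storage().data_ptr())
```

The output is:

```python
True
```

That's right: even after slicing, the two tensors still use the same storage, so they share memory, and modifying one changes the other.

```python
# .data_ptr() returns the memory address of the tensor's first element.
print(a.data_ptr(),b.data_ptr())
print(b.data_ptr()-a.data_ptr())
```

The output is:

```python
2080207827328 2080207827344
16
```

The first element of b is 16 bytes past the first element of a. The default integer tensor is int64, 8 bytes per element, so the two addresses are 2 integer elements apart.

### 3.4 The header

We keep using the same two tensors, a and b:

```python
a = torch.arange(0,6)
b = a.view(2,3)
print(a.stride(),b.stride())
```

The output is:

```python
(1,) (3, 1)
```

The variable a is a one-dimensional array, [0,1,2,3,4,5], so its stride is 1. b is a two-dimensional array, [[0,1,2],[3,4,5]], so moving one step along the first dimension skips 3 elements in the storage, and moving one step along the second dimension skips 1. As this shows, most operations do not modify a tensor's data; they only modify its header. This saves memory and also improves processing speed.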
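The header/storage split can be poked at a little further. The sketch below is not from the original article: it assumes a recent PyTorch and uses ```t()```, ```contiguous()```, ```is_contiguous()```, ```storage_offset()``` and ```element_size()```, which the text above does not cover. It is only meant to illustrate that a transpose rewrites the strides in the header, a slice shifts the starting offset, and ```contiguous()``` is the point where a real copy happens.

```python
import torch

# A 2x3 tensor backed by six contiguous int64 elements.
a = torch.arange(0, 6).view(2, 3)

# Transposing only rewrites the header: same buffer, swapped strides.
t = a.t()
same_buffer = t.data_ptr() == a.data_ptr()
print(a.stride(), t.stride())   # (3, 1) (1, 3)
print(same_buffer)              # True
print(t.is_contiguous())        # False

# Slicing moves the starting offset inside the shared storage.
row = a[1]
offset_bytes = row.storage_offset() * row.element_size()
print(row.storage_offset(), offset_bytes)   # 3 24

# contiguous() on a non-contiguous tensor copies into a fresh buffer.
c = t.contiguous()
print(c.data_ptr() == a.data_ptr())   # False
print(c.stride())                     # (2, 1)

# Writing through the original shows which tensors still share memory.
a[0, 0] = 999
print(t[0, 0].item(), c[0, 0].item())   # 999 0
```

Here ```t``` follows the write to ```a``` because it is just another header over the same storage, while ```c``` keeps the old value because it owns its own copy of the data.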
