PaddlePaddle 2.0: Data Loading and Processing

阿新 • Published: 2020-12-25

Tags: San Sui Plain Talk paddle2.0, python, deep learning, paddlepaddle


Hi everyone, San Sui the beginner here, with episode 7 of the San Sui Plain Talk series!

AI Studio project:

https://aistudio.baidu.com/aistudio/projectdetail/1349615

References:

Paddle official site: https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-rc/tutorial/quick_start/getting_started/getting_started.html#id3

Paddle API reference:

https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-rc1/api/index_cn.html

CSDN links

San Sui Plain Talk series on CSDN: https://blog.csdn.net/weixin_45623093/category_10616602.html

PaddlePaddle community account: https://blog.csdn.net/PaddlePaddle

# Import paddle and check the version
import paddle
print(paddle.__version__)

2.0.0-rc1

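Datasets

Datasets fall into two groups: those bundled with the framework and custom ones (data you upload yourself).

Data processing

Paddle handles built-in and non-built-in datasets in two different ways.

Let's take a look at both!

Built-in datasets

paddle.vision.datasets holds the computer vision (CV) datasets.

paddle.text.datasets holds the natural language processing (NLP) datasets.

Each module's __all__ attribute (a plain module-level list, not a magic method) shows what it exports: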

print('Vision datasets:', paddle.vision.datasets.__all__)
print('NLP datasets:', paddle.text.datasets.__all__)

Vision datasets: ['DatasetFolder', 'ImageFolder', 'MNIST', 'FashionMNIST', 'Flowers', 'Cifar10', 'Cifar100', 'VOC2012']
NLP datasets: ['Conll05st', 'Imdb', 'Imikolov', 'Movielens', 'UCIHousing', 'WMT14', 'WMT16']

ToTensor

ToTensor is an API under paddle.vision.transforms.

It converts a PIL.Image or numpy.ndarray into a paddle.Tensor.
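To get a feel for what such a conversion involves, here is a NumPy-only sketch. It assumes the behaviour described in the ToTensor documentation for the 2.0 release (H x W x C input becomes C x H x W, and uint8 pixels are rescaled to [0, 1]); 2.0-rc1 may differ in details. The helper name to_chw_float is made up for this illustration and is not part of Paddle.

```python
import numpy as np


def to_chw_float(image_hwc):
    """Hypothetical helper: imitate the documented ToTensor conversion with NumPy only."""
    array = np.asarray(image_hwc)
    if array.ndim == 2:
        # Grayscale H x W is treated as H x W x 1.
        array = array[:, :, np.newaxis]
    chw = array.transpose((2, 0, 1))  # H x W x C -> C x H x W
    if array.dtype == np.uint8:
        # uint8 pixel values in [0, 255] are rescaled to [0.0, 1.0].
        return chw.astype(np.float32) / 255.0
    return chw.astype(np.float32)


fake_image = np.full((28, 28), 255, dtype=np.uint8)  # an all-white 28x28 "digit"
converted = to_chw_float(fake_image)
print(converted.shape, converted.dtype, converted.max())
```

The real ToTensor returns a paddle.Tensor rather than a NumPy array; the sketch only illustrates the layout and value-range change.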

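Next, let's load the handwritten digit recognition dataset.

Episode 6 covered digit recognition in detail; here we simply load the data again.

Handwritten digit recognition API notes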

from paddle.vision.transforms import ToTensor  # import the ToTensor API

# Training dataset: ToTensor converts each sample to a Tensor; mode selects the train or test split
train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())

# Validation dataset
val_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())

Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-images-idx3-ubyte.gz 
Begin to download

Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-labels-idx1-ubyte.gz 
Begin to download
........
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz 
Begin to download

Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz 
Begin to download
..
Download finished

Processing built-in datasets

paddle.vision.transforms provides the relevant processing methods.

Its __all__ attribute lists all of them:

print('Data processing methods:', paddle.vision.transforms.__all__)

Data processing methods: ['BaseTransform', 'Compose', 'Resize', 'RandomResizedCrop', 'CenterCrop', 'RandomHorizontalFlip', 'RandomVerticalFlip', 'Transpose', 'Normalize', 'BrightnessTransform', 'SaturationTransform', 'ContrastTransform', 'HueTransform', 'ColorJitter', 'RandomCrop', 'Pad', 'RandomRotation', 'Grayscale', 'ToTensor', 'to_tensor', 'hflip', 'vflip', 'resize', 'pad', 'rotate', 'to_grayscale', 'crop', 'center_crop', 'adjust_brightness', 'adjust_contrast', 'adjust_hue', 'normalize']

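Some examples

Compose: combines dataset preprocessing interfaces, passed in as a list.

Resize: resizes the input to the specified size.

ColorJitter: randomly adjusts image brightness, contrast, saturation, and hue.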

from paddle.vision.transforms import Compose, Resize, ColorJitter


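# Choose which data augmentations to use: here, random brightness, contrast and saturation changes (ColorJitter) and resizing (Resize)
transform = Compose([ColorJitter(), Resize(size=100)])

# Pass the augmentation pipeline through the transform argument to apply it to a built-in dataset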
train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=transform)

Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-images-idx3-ubyte.gz 
Begin to download

Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-labels-idx1-ubyte.gz 
Begin to download
........
Download finished
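Conceptually, composing transforms just means calling each one in list order on the output of the previous one. The sketch below shows that idea in plain Python with toy numeric "transforms"; SimpleCompose, add_one and double are hypothetical names for illustration, and this is not Paddle's actual Compose implementation.

```python
class SimpleCompose:
    """Hypothetical stand-in that chains callables in list order."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, data):
        # Feed the output of each transform into the next one.
        for transform in self.transforms:
            data = transform(data)
        return data


# Two toy "transforms" that work on plain numbers instead of images.
def add_one(x):
    return x + 1


def double(x):
    return x * 2


pipeline = SimpleCompose([add_one, double])
result = pipeline(3)                                   # (3 + 1) * 2
reversed_result = SimpleCompose([double, add_one])(3)  # 3 * 2 + 1
print(result, reversed_result)
```

The two results differ, which is why the order of entries in the list matters, for instance whether an image is resized before or after a colour adjustment.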

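Defining and loading non-built-in datasets

Defining a non-built-in dataset

paddle.io.Dataset

An abstract class that describes the methods and behavior of a Dataset.

Map-style datasets must inherit from this base class. A map-style dataset is one whose samples can be looked up by a key or index. Every map-style dataset must implement the following methods:

__getitem__: returns the sample at the given index; paddle.io.DataLoader calls it to fetch samples by index.

__len__: returns the number of samples; paddle.io.BatchSampler needs that count to generate the index sequence.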

from paddle.io import Dataset  # import the Dataset base class


class MyDataset(Dataset):
    """
    Step 1: inherit from paddle.io.Dataset
    """
    def __init__(self, mode='train'):
        """
        Step 2: implement the constructor, define how the data is read, and split it into training and test sets
        """
        super(MyDataset, self).__init__()

        if mode == 'train':
            self.data = [
                ['traindata1', 'label1'],
                ['traindata2', 'label2'],
                ['traindata3', 'label3'],
                ['traindata4', 'label4'],
            ]
        else:
            self.data = [
                ['testdata1', 'label1'],
                ['testdata2', 'label2'],
                ['testdata3', 'label3'],
                ['testdata4', 'label4'],
            ]

    def __getitem__(self, index):
        """
        Step 3: implement __getitem__, which defines how to fetch the sample at a given index and returns one item (training data, matching label)
        """
        data = self.data[index][0]
        label = self.data[index][1]

        return data, label

    def __len__(self):
        """
        Step 4: implement __len__, which returns the total number of samples
        """
        return len(self.data)

# Try out the dataset we just defined
train_dataset2 = MyDataset(mode='train')
val_dataset2 = MyDataset(mode='test')

print('=============train dataset=============')
for data, label in train_dataset2:
    print(data, label)

print('=============evaluation dataset=============')
for data, label in val_dataset2:
    print(data, label)

=============train dataset=============
traindata1 label1
traindata2 label2
traindata3 label3
traindata4 label4
=============evaluation dataset=============
testdata1 label1
testdata2 label2
testdata3 label3
testdata4 label4
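The two required methods are all a consumer needs. The plain-Python sketch below (no Paddle involved; TinyDataset is a hypothetical stand-in) shows index-based access, and also why a for-loop over such an object works: when a class has no __iter__, Python falls back to calling __getitem__ with 0, 1, 2, ... until IndexError is raised. I am assuming the loops in the test above rely on the same fallback, which holds only if paddle.io.Dataset does not define __iter__ itself.

```python
class TinyDataset:
    """Hypothetical stand-in with the same two methods (no Paddle base class)."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __getitem__(self, index):
        # List indexing raises IndexError past the end, which ends a for-loop.
        data, label = self.samples[index]
        return data, label

    def __len__(self):
        return len(self.samples)


dataset = TinyDataset([("traindata1", "label1"), ("traindata2", "label2")])

# A loader-like consumer only needs len() and integer indexing.
fetched = [dataset[i] for i in range(len(dataset))]

# Iteration without __iter__: Python uses the __getitem__ fallback.
iterated = [pair for pair in dataset]

print(fetched == iterated)
```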

Loading data

class paddle.io.DataLoader(dataset, feed_list=None, places=None, return_list=False, batch_sampler=None, batch_size=1, shuffle=False, drop_last=False, collate_fn=None, num_workers=0, use_buffer_reader=True, use_shared_memory=False, timeout=0, worker_init_fn=None)

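DataLoader returns an iterator that walks through the given dataset once, in the order produced by batch_sampler.

DataLoader supports both single-process and multi-process loading. When num_workers is greater than 0, data is loaded asynchronously in multiple processes.

Details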

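# For now, demonstrate with the handwritten digit data.
# Assumes train_dataset is the MNIST training set created with transform=ToTensor(),
# so that every sample can be collated into a batch.
train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)
for batch_id, data in enumerate(train_loader()):
    x_data = data[0]
    y_data = data[1]

    print(x_data.numpy().shape)
    print(y_data.numpy().shape)
'''
This defines a data iterator, train_loader, for loading the training data.
batch_size=64 sets the batch size to 64,
and shuffle=True shuffles the data before batches are drawn.
Setting num_workers additionally enables multi-process data loading, which can speed up loading.
'''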
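How batch_size, shuffle and drop_last from the signature above interact can be sketched in plain Python. This is only an illustration of the arithmetic, not Paddle's sampler; batch_indices and its seed parameter are hypothetical.

```python
import random


def batch_indices(num_samples, batch_size=1, shuffle=False, drop_last=False, seed=None):
    """Hypothetical helper: group sample indices into batches."""
    indices = list(range(num_samples))
    if shuffle:
        # Shuffle the index order, not the data itself.
        random.Random(seed).shuffle(indices)
    batches = []
    for start in range(0, num_samples, batch_size):
        batch = indices[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            continue  # discard the incomplete final batch
        batches.append(batch)
    return batches


kept = batch_indices(10, batch_size=4)
dropped = batch_indices(10, batch_size=4, drop_last=True)
shuffled = batch_indices(10, batch_size=4, shuffle=True, seed=0)

print(kept)
print(len(dropped))
print(sorted(i for batch in shuffled for i in batch))
```

By the same arithmetic, the 60000-sample MNIST training set with batch_size=64 yields 938 batches, the last one holding only 32 samples unless drop_last=True is set.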

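Processing non-built-in datasets

Method 1: define the data augmentation in the dataset's constructor, then apply it to the data returned from __getitem__.

Method 2: expose a constructor argument on the custom dataset class and pass the data augmentation in when the class is instantiated.

Here is an example of method 1: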

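from paddle.io import Dataset  # import the Dataset base class
from paddle.vision.transforms import Compose, Resize, ColorJitter  # transforms used in __init__ below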


class MyDataset(Dataset):  # define MyDataset as a subclass of Dataset
    def __init__(self, mode='train'):
        super(MyDataset, self).__init__()

        if mode == 'train':
            self.data = [
                ['traindata1', 'label1'],
                ['traindata2', 'label2'],
                ['traindata3', 'label3'],
                ['traindata4', 'label4'],
            ]
        else:
            self.data = [
                ['testdata1', 'label1'],
                ['testdata2', 'label2'],
                ['testdata3', 'label3'],
                ['testdata4', 'label4'],
            ]

        # Define the preprocessing to use; these operations are meant for images
        self.transform = Compose([ColorJitter(), Resize(size=100)])  # same idea as for the built-in dataset

    def __getitem__(self, index):
        data = self.data[index][0]

        # Apply the transform to the training data here
        # This is only a sketch: to actually run it, replace the placeholder strings with image data
        data = self.transform(data)

        label = self.data[index][1]

        return data, label

    def __len__(self):
        return len(self.data)
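
Method 2 could look roughly like the sketch below: the transform is handed in when the dataset is instantiated instead of being fixed inside the constructor. TransformDataset and its samples parameter are hypothetical names, and str.upper stands in for a real image transform so the placeholder strings can be processed. The import fallback is only there so the sketch also runs where Paddle is not installed.

```python
try:
    from paddle.io import Dataset
except ImportError:  # assumption: fall back to a plain class when Paddle is missing
    Dataset = object


class TransformDataset(Dataset):
    """Hypothetical dataset that receives its transform as a constructor argument."""

    def __init__(self, samples, transform=None):
        super().__init__()
        self.samples = list(samples)
        self.transform = transform  # any callable, or None for no processing

    def __getitem__(self, index):
        data, label = self.samples[index]
        if self.transform is not None:
            data = self.transform(data)
        return data, label

    def __len__(self):
        return len(self.samples)


samples = [("traindata1", "label1"), ("traindata2", "label2")]
plain = TransformDataset(samples)                        # no transform
upper = TransformDataset(samples, transform=str.upper)   # toy transform on strings
print(plain[0], upper[0])
```

With image data, the same argument would take something like the Compose pipeline defined earlier, which makes it easy to use different augmentations for training and evaluation without editing the class.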