The torch.utils.data.DataLoader class
阿新 • Published: 2018-12-15
class DataLoader(object):
    r"""
    Data loader. Combines a dataset and a sampler, and provides
    single- or multi-process iterators over the dataset.

    Arguments:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: 1).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: False).
        sampler (Sampler, optional): defines the strategy to draw samples from
            the dataset. If specified, ``shuffle`` must be False.
        batch_sampler (Sampler, optional): like sampler, but returns a batch of
            indices at a time. Mutually exclusive with batch_size, shuffle,
            sampler, and drop_last.
        num_workers (int, optional): how many subprocesses to use for data
            loading. 0 means that the data will be loaded in the main process.
            (default: 0)
        collate_fn (callable, optional): merges a list of samples to form a
            mini-batch.
        pin_memory (bool, optional): If ``True``, the data loader will copy
            tensors into CUDA pinned memory before returning them.
        drop_last (bool, optional): set to ``True`` to drop the last incomplete
            batch, if the dataset size is not divisible by the batch size.
            If ``False`` and the size of dataset is not divisible by the batch
            size, then the last batch will be smaller. (default: False)
        timeout (numeric, optional): if positive, the timeout value for
            collecting a batch from workers. Should always be non-negative.
            (default: 0)
        worker_init_fn (callable, optional): If not None, this will be called
            on each worker subprocess with the worker id (an int in
            ``[0, num_workers - 1]``) as input, after seeding and before data
            loading. (default: None)

    .. note:: By default, each worker will have its PyTorch seed set to
              ``base_seed + worker_id``, where ``base_seed`` is a long
              generated by the main process using its RNG. However, seeds for
              other libraries may be duplicated upon initializing workers
              (e.g., NumPy), causing each worker to return identical random
              numbers. (See :ref:`dataloader-workers-random-seed` section in
              FAQ.) You may use ``torch.initial_seed()`` to access the PyTorch
              seed for each worker in :attr:`worker_init_fn`, and use it to
              set other seeds before data loading.

    .. warning:: If ``spawn`` start method is used, :attr:`worker_init_fn`
                 cannot be an unpicklable object, e.g., a lambda function.
""" __initialized = False def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=default_collate, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None): self.dataset = dataset self.batch_size = batch_size self.num_workers = num_workers self.collate_fn = collate_fn self.pin_memory = pin_memory self.drop_last = drop_last self.timeout = timeout self.worker_init_fn = worker_init_fn if timeout < 0: raise ValueError('timeout option should be non-negative') if batch_sampler is not None: if batch_size > 1 or shuffle or sampler is not None or drop_last: raise ValueError('batch_sampler option is mutually exclusive ' 'with batch_size, shuffle, sampler, and ' 'drop_last') self.batch_size = None self.drop_last = None if sampler is not None and shuffle: raise ValueError('sampler option is mutually exclusive with ' 'shuffle') if self.num_workers < 0: raise ValueError('num_workers option cannot be negative; ' 'use num_workers=0 to disable multiprocessing.') if batch_sampler is None: if sampler is None: if shuffle: sampler = RandomSampler(dataset) else: sampler = SequentialSampler(dataset) batch_sampler = BatchSampler(sampler, batch_size, drop_last) self.sampler = sampler self.batch_sampler = batch_sampler self.__initialized = True def __setattr__(self, attr, val): if self.__initialized and attr in ('batch_size', 'sampler', 'drop_last'): raise ValueError('{} attribute should not be set after {} is ' 'initialized'.format(attr, self.__class__.__name__)) super(DataLoader, self).__setattr__(attr, val) def __iter__(self): return _DataLoaderIter(self) def __len__(self): return len(self.batch_sampler)
The data loader combines a dataset and a sampler, and provides single- or multi-process iteration over the dataset.
It is used when training a model: it splits the training data into mini-batches, and each iteration yields one batch until the whole dataset has been consumed. In effect, it prepares the data for the training loop.
Parameters:
dataset: the dataset from which to load the data.
batch_size: how many samples each batch contains (default: 1).
shuffle: whether to shuffle the data. When True, the data is reshuffled at every epoch, i.e., after all batches have been yielded, the next pass over the loader draws them in a new order (default: False).
sampler: a custom strategy for drawing samples from the dataset. If a sampler is specified, shuffle must be False.
batch_sampler: like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
num_workers: how many worker subprocesses to use for loading. When 0, the data is loaded in the main process (default: 0).
collate_fn: a callable that merges a list of samples into a mini-batch (see the sketch after this list).
pin_memory: a boolean; when True, the loader copies tensors into CUDA pinned memory before returning them.
drop_last: a boolean; when True, the last incomplete batch is dropped if the dataset size is not divisible by batch_size; when False, the leftover samples form a smaller final batch.
timeout: defaults to 0. When positive, it is the upper bound on the time to collect one batch from the workers; exceeding it raises an error. The value must be non-negative.
worker_init_fn: a callable invoked in each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading; useful for per-worker setup such as seeding other libraries (see the seeding sketch below).
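Regarding collate_fn: per the docstring, it merges the list of individual samples into one mini-batch; the default (default_collate) stacks tensors along a new batch dimension. A minimal sketch of a custom collate_fn, reproducing that stacking by hand plus one extra field (the function name and extra field are my own illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.linspace(1, 10, 10)
y = torch.linspace(10, 1, 10)
dataset = TensorDataset(x, y)

def my_collate(batch):
    # `batch` is a list of dataset items, here (x_i, y_i) tuples.
    # Stack them manually (what default_collate would do) and also
    # return the actual batch size as a third element.
    xs = torch.stack([item[0] for item in batch])
    ys = torch.stack([item[1] for item in batch])
    return xs, ys, len(batch)

loader = DataLoader(dataset, batch_size=4, collate_fn=my_collate)
for xs, ys, n in loader:
    print(n, xs, ys)   # the last batch has n == 2, since 10 % 4 != 0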
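Regarding worker_init_fn: the docstring's note explains that each worker gets the PyTorch seed base_seed + worker_id, but other libraries such as NumPy may end up with duplicated seeds across workers, and it suggests using torch.initial_seed() inside worker_init_fn. A sketch along those lines (the % 2**32 is my addition, since np.random.seed only accepts 32-bit values):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.linspace(1, 10, 10)
y = torch.linspace(10, 1, 10)
dataset = TensorDataset(x, y)

def seed_numpy(worker_id):
    # torch.initial_seed() returns this worker's PyTorch seed
    # (base_seed + worker_id); reuse it to seed NumPy so each
    # worker draws different random numbers.
    np.random.seed(torch.initial_seed() % 2**32)

loader = DataLoader(dataset, batch_size=5, num_workers=2,
                    worker_init_fn=seed_numpy)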
Example:
'''
Batch training: split the data into mini-batches for training.
DataLoader wraps the data to be used; in this program the data is
packed five samples at a time, and each iteration yields one batch.
'''
import torch
import torch.utils.data as Data

torch.manual_seed(1)

BATCH_SIZE = 5

x = torch.linspace(1, 10, 10)
y = torch.linspace(10, 1, 10)

torch_dataset = Data.TensorDataset(x, y)  # wrap the tensors in a dataset

loader = Data.DataLoader(      # draw batch_size samples from the dataset each time
    dataset=torch_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,              # shuffle the data
    num_workers=2,             # use two worker processes
)

def show_batch():
    for epoch in range(3):     # iterate over the full dataset 3 times
        for step, (batch_x, batch_y) in enumerate(loader):
            # each step yields one batch of batch_size samples
            # training ...
            # print the batch to inspect the data
            print('Epoch:', epoch, '|Step:', step, '|batch x:',
                  batch_x.numpy(), '|batch y:', batch_y.numpy())

if __name__ == '__main__':
    show_batch()
Output:
Epoch: 0 |Step: 0 |batch x: [ 5. 7. 10. 3. 4.] |batch y: [6. 4. 1. 8. 7.]
Epoch: 0 |Step: 1 |batch x: [2. 1. 8. 9. 6.] |batch y: [ 9. 10. 3. 2. 5.]
Epoch: 1 |Step: 0 |batch x: [ 4. 6. 7. 10. 8.] |batch y: [7. 5. 4. 1. 3.]
Epoch: 1 |Step: 1 |batch x: [5. 3. 2. 1. 9.] |batch y: [ 6. 8. 9. 10. 2.]
Epoch: 2 |Step: 0 |batch x: [ 4. 2. 5. 6. 10.] |batch y: [7. 9. 6. 5. 1.]
Epoch: 2 |Step: 1 |batch x: [3. 9. 1. 8. 7.] |batch y: [ 8. 2. 10. 3. 4.]
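As a quick check of the drop_last behavior described above, here is a short sketch (my own addition) with a batch size that does not divide the 10 samples:

import torch
import torch.utils.data as Data

x = torch.linspace(1, 10, 10)
y = torch.linspace(10, 1, 10)
torch_dataset = Data.TensorDataset(x, y)

# 10 samples with batch_size=4: the final batch is incomplete.
keep = Data.DataLoader(torch_dataset, batch_size=4, drop_last=False)
drop = Data.DataLoader(torch_dataset, batch_size=4, drop_last=True)
print([len(bx) for bx, _ in keep])   # [4, 4, 2] -> smaller last batch kept
print([len(bx) for bx, _ in drop])   # [4, 4]    -> incomplete last batch dropped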