pytorch學習筆記（十四）： DataLoader原始碼閱讀

阿新 • • 發佈：2019-01-20

pytorch 資料載入部分的介面可以說是現存深度學習框架中設計的最好的，給了我們足夠的靈活性。本博文就對 pytorch 的多執行緒載入模組（DataLoader）進行原始碼上的註釋。

輸入流水線

pytorch 的輸入流水線的操作順序是這樣的：

建立一個 Dataset 物件
建立一個 DataLoader 物件
不停的迴圈這個 DataLoader 物件

dataset = MyDataset()
dataloader = DataLoader(dataset)
num_epoches = 100
for epoch in range(num_epoches):
    for 
 data in dataloader:
        ....

在之前文章也提到過，如果現有的 Dataset 不能夠滿足需求，我們也可以自定義 Dataset，通過繼承 torch.utils.data.Dataset。在繼承的時候，需要 override 三個方法。

__init__：用來初始化資料集
__getitem__
__len__

從本文中，您可以看到 __getitem__ 和 __len__ 在 DataLoader 中是如何被使用的。

DataLoader

從DataLoader 看起，下面是原始碼。為了方便起見，採用在原始碼中添加註釋的形式進行解讀。

class DataLoader(object):
    def __init__(self, dataset, batch_size=1, shuffle=False, sampler=None, 
                 batch_sampler=None,
                 num_workers=0, collate_fn=default_collate, pin_memory=False, 
                 drop_last=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.collate_fn = collate_fn
        self.pin_memory = pin_memory
        self.drop_last = drop_last

        if 
 batch_sampler is not None:
            if batch_size > 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler is mutually exclusive with '
                                 'batch_size, shuffle, sampler, and drop_last')

        if sampler is not None and shuffle:
            raise ValueError('sampler is mutually exclusive with shuffle')

        if batch_sampler is None:
            if sampler is None:
                if shuffle:
                    # dataset.__len__() 在 Sampler 中被使用。
                    # 目的是生成一個 長度為 len(dataset) 的 序列索引（隨機的）。
                    sampler = RandomSampler(dataset)
                else:
                    # dataset.__len__() 在 Sampler 中被使用。
                    # 目的是生成一個 長度為 len(dataset) 的 序列索引（順序的）。
                    sampler = SequentialSampler(dataset)
            # Sampler 是個迭代器，一次之只返回一個 索引
            # BatchSampler 也是個迭代器，但是一次返回 batch_size 個 索引
            batch_sampler = BatchSampler(sampler, batch_size, drop_last)

        self.sampler = sampler
        self.batch_sampler = batch_sampler

    def __iter__(self):
        return DataLoaderIter(self)

    def __len__(self):
        return len(self.batch_sampler)

# 以下兩個程式碼是等價的
for data in dataloader:
    ...
# 等價與
iterr = iter(dataloader)
while True:
    try:
        next(iterr)
    except StopIteration:
        break

在 DataLoader 中，iter(dataloader) 返回的是一個 DataLoaderIter 物件，這個才是我們一直 next的物件。

下面會先介紹一下幾個 Sampler，然後介紹核心部分 DataLoaderIter。

RandomSampler, SequentialSampler, BatchSampler

首先，是 RandomSampler， iter(randomSampler) 會返回一個可迭代物件，這個可迭代物件每次 next 都會輸出當前要取樣的 index，SequentialSampler也是一樣，只不過她產生的 index 是順序的

class RandomSampler(Sampler):

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(torch.randperm(len(self.data_source)).long())

    def __len__(self):
        return len(self.data_source)

BatchSampler 是一個普通 Sampler 的 wrapper，普通Sampler 一次僅產生一個 index，而 BatchSampler 一次產生一個 batch 的 indices。

class BatchSampler(object):
    def __init__(self, sampler, batch_size, drop_last):
        # 這裡的 sampler 是 RandomSampler 或者 SequentialSampler
        # 他們每一次吐出一個 idx
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size

DataLoaderIter

self.index_queue 中存放是 (batch_idx, sample_indices) ，其中 batch_idx 是個 int 值， sample_indices 是個 list ，存放了組成 batch 的 sample indices。
self.data_queue 中存放的是 (batch_idx, samples), 其中 samples 是一個 mini-batch 的樣本
self.send_idx 表示：這次放到 self.index_queue 中的 batch_id
self.rcvd_idx 表示：這次要取的 batch_id
self.batches_outstanding 表示：

class DataLoaderIter(object):
    "Iterates once over the DataLoader's dataset, as specified by the sampler"

    def __init__(self, loader):
        # loader 是 DataLoader 物件
        self.dataset = loader.dataset
        # 這個留在最後一個部分介紹
        self.collate_fn = loader.collate_fn
        self.batch_sampler = loader.batch_sampler
        # 表示 開 幾個程序。
        self.num_workers = loader.num_workers
        # 是否使用 pin_memory
        self.pin_memory = loader.pin_memory
        self.done_event = threading.Event()

        # 這樣就可以用 next 操作 batch_sampler 了
        self.sample_iter = iter(self.batch_sampler)

        if self.num_workers > 0:
            # 用來放置 batch_idx 的佇列，其中元素的是 一個 list，其中放了一個 batch 內樣本的索引
            self.index_queue = multiprocessing.SimpleQueue()
            # 用來放置 batch_data 的佇列，裡面的 元素的 一個 batch的 資料
            self.data_queue = multiprocessing.SimpleQueue()

            # 當前已經準備好的 batch 的數量（可能有些正在 準備中）
            # 當為 0 時， 說明， dataset 中已經沒有剩餘資料了。
            # 初始值為 0, 在 self._put_indices() 中 +1,在 self.__next__ 中減一
            self.batches_outstanding = 0 
            self.shutdown = False
            # 用來記錄 這次要放到 index_queue 中 batch 的 idx
            self.send_idx = 0
            # 用來記錄 這次要從的 data_queue 中取出 的 batch 的 idx
            self.rcvd_idx = 0
            # 因為多執行緒，可能會導致 data_queue 中的 batch 亂序
            # 用這個來保證 batch 的返回 是 idx 升序出去的。
            self.reorder_dict = {}
            # 這個地方就開始 開多程序了，一共開了 num_workers 個程序
            # 執行 _worker_loop ， 下面將介紹 _worker_loop
            self.workers = [
                multiprocessing.Process(
                    target=_worker_loop,
                    args=(self.dataset, self.index_queue, self.data_queue, self.collate_fn))
                for _ in range(self.num_workers)]

            for w in self.workers:
                w.daemon = True  # ensure that the worker exits on process exit
                w.start()

            if self.pin_memory:
                in_data = self.data_queue
                self.data_queue = queue.Queue()
                self.pin_thread = threading.Thread(
                    target=_pin_memory_loop,
                    args=(in_data, self.data_queue, self.done_event))
                self.pin_thread.daemon = True
                self.pin_thread.start()

            # prime the prefetch loop
            # 初始化的時候，就將 2*num_workers 個 (batch_idx, sampler_indices) 放到 index_queue 中。
            for _ in range(2 * self.num_workers):
                self._put_indices()

    def __len__(self):
        return len(self.batch_sampler)

    def __next__(self):
        if self.num_workers == 0:  # same-process loading
            indices = next(self.sample_iter)  # may raise StopIteration
            batch = self.collate_fn([self.dataset[i] for i in indices])
            if self.pin_memory:
                batch = pin_memory_batch(batch)
            return batch

        # check if the next sample has already been generated
        if self.rcvd_idx in self.reorder_dict:
            batch = self.reorder_dict.pop(self.rcvd_idx)
            return self._process_next_batch(batch)

        if self.batches_outstanding == 0:
            # 說明沒有 剩餘 可操作資料了， 可以停止 worker 了
            self._shutdown_workers()
            raise StopIteration

        while True:
            # 這裡的操作就是 給 亂序的 data_queue 排一排 序
            assert (not self.shutdown and self.batches_outstanding > 0)
            idx, batch = self.data_queue.get()
            # 一個 batch 被 返回，batches_outstanding -1
            self.batches_outstanding -= 1
            if idx != self.rcvd_idx:
                # store out-of-order samples
                self.reorder_dict[idx] = batch
                continue
            # 返回的時候，再向 indice_queue 中 放下一個 (batch_idx, sample_indices)
            return self._process_next_batch(batch)

    next = __next__  # Python 2 compatibility

    def __iter__(self):
        return self

    def _put_indices(self):
        assert self.batches_outstanding < 2 * self.num_workers
        indices = next(self.sample_iter, None)
        if indices is None:
            return
        self.index_queue.put((self.send_idx, indices))
        self.batches_outstanding += 1
        self.send_idx += 1

    def _process_next_batch(self, batch):
        self.rcvd_idx += 1
        # 放下一個 (batch_idx, sample_indices)
        self._put_indices()
        if isinstance(batch, ExceptionWrapper):
            raise batch.exc_type(batch.exc_msg)
        return batch

    def __getstate__(self):
        # TODO: add limited pickling support for sharing an iterator
        # across multiple threads for HOGWILD.
        # Probably the best way to do this is by moving the sample pushing
        # to a separate thread and then just sharing the data queue
        # but signalling the end is tricky without a non-blocking API
        raise NotImplementedError("DataLoaderIterator cannot be pickled")

    def _shutdown_workers(self):
        if not self.shutdown:
            self.shutdown = True
            self.done_event.set()
            for _ in self.workers:
                # shutdown 的時候， 會將一個 None 放到 index_queue 中
                # 如果 _worker_loop 獲得了這個 None， _worker_loop 將會跳出無限迴圈，將會結束執行
                self.index_queue.put(None)

    def __del__(self):
        if self.num_workers > 0:
            self._shutdown_workers()

`__worker_loop`

這部分是多程序執行的程式碼：他從index_queue 中取索引，然後處理資料，然後再將處理好的 batch 資料放到 data_queue 中。

def _worker_loop(dataset, index_queue, data_queue, collate_fn):
    global _use_shared_memory
    _use_shared_memory = True

    torch.set_num_threads(1)
    while True:
        r = index_queue.get()
        if r is None:
            # 想 data_queue 中放 None
            data_queue.put(None)
            break
        idx, batch_indices = r
        try:
            # 這裡就可以看到 dataset.__getiterm__ 的作用了。
            # 傳到 collate_fn 的資料是 list of ...
            samples = collate_fn([dataset[i] for i in batch_indices])
        except Exception:
            data_queue.put((idx, ExceptionWrapper(sys.exc_info())))
        else:
            data_queue.put((idx, samples))

collate_fn

我們 __getiterm__ 經常返回的是（img_tensor, label）,
所以放入 collate_fn 的引數就是 [(img_tensor, label), ....] .
batch[0] 就是 (img_tensor, label） ，也就是 collections.Sequence 型別。

def default_collate(batch):
    "Puts each data field into a tensor with outer dimension batch size"
    if torch.is_tensor(batch[0]):
        out = None
        if _use_shared_memory:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            # 計算 batch 中所有 元素的個數 
            numel = sum([x.numel() for x in batch])
            # 沒有找到對應的 api 。。。。。。
            storage = batch[0].storage()._new_shared(numel)
            out = batch[0].new(storage)
        return torch.stack(batch, 0, out=out)
    elif type(batch[0]).__module__ == 'numpy':
        elem = batch[0]
        if type(elem).__name__ == 'ndarray':
            return torch.stack([torch.from_numpy(b) for b in batch], 0)
        if elem.shape == ():  # scalars
            py_type = float if elem.dtype.name.startswith('float') else int
            return numpy_type_map[elem.dtype.name](list(map(py_type, batch)))
    elif isinstance(batch[0], int):
        return torch.LongTensor(batch)
    elif isinstance(batch[0], float):
        return torch.DoubleTensor(batch)
    elif isinstance(batch[0], string_classes):
        return batch
    elif isinstance(batch[0], collections.Mapping):
        return {key: default_collate([d[key] for d in batch]) for key in batch[0]}
    elif isinstance(batch[0], collections.Sequence):
        transposed = zip(*batch)
        return [default_collate(samples) for samples in transposed]

    raise TypeError(("batch must contain tensors, numbers, dicts or lists; found {}"
                     .format(type(batch[0]))))

對於 image captioning 任務，既有圖片，又有文字，pytorch 官方開源的工具箱 torchtext 使得文字資料的處理非常簡單，可以通過自定義 collate_fn 的方式將 DataLoader 與 torchtext 完美的整合起來。

總結

data_queue 中最多有 2*num_worker 個 batch

注意

Queue的特點

當裡面沒有資料時： queue.get() 會阻塞，阻塞的時候，其它程序/執行緒如果有 queue.put() 操作，本執行緒/程序會被通知，然後就可以 get 成功。
當資料滿了: queue.put() 會阻塞

pytorch學習筆記（十四）： DataLoader原始碼閱讀

輸入流水線

DataLoader

RandomSampler, SequentialSampler, BatchSampler

DataLoaderIter

`__worker_loop`

collate_fn

總結

注意

Queue的特點

pytorch學習筆記（十四）： DataLoader原始碼閱讀

Java框架spring Boot學習筆記（十四）：log4j介紹

javaweb學習筆記（十四）：JSP（4）

機器學習筆記（十四）：TensorFlow實戰六（經典卷積神經網路：AlexNet ）

機器學習筆記（十四）：異常檢測

pytorch學習筆記（十七）：python 端擴充套件 pytorch

pytorch學習筆記（十一）：fine-tune 預訓練的模型

pytorch學習筆記（十二）：詳解 Module 類

pytorch學習筆記（十六）：pytorch 寫程式碼時應該注意

機器學習筆記（十二）：TensorFlow實戰四（影象識別與卷積神經網路）

機器學習筆記（十二）：TensorFlow實現四（影象識別與卷積神經網路）

OpenCV2學習筆記（十五）：利用Cmake高速查找OpenCV函數源代碼

如鵬網學習筆記（十四）ASP.NET

EF學習筆記（十一）：實施繼承

Java學習筆記（十五）：import關鍵字

Java學習筆記（十五）：this關鍵字

Java學習筆記（十六）：static關鍵字

Java學習筆記（十七）：super關鍵字

R語言學習筆記（十一）：廣義線性模型

R語言學習筆記（十六）：處理缺失值

pytorch學習筆記（十四）： DataLoader原始碼閱讀

輸入流水線

DataLoader

RandomSampler, SequentialSampler, BatchSampler

DataLoaderIter

__worker_loop

collate_fn

總結

注意

Queue的特點

相關推薦

`__worker_loop`