Reading large files in chunks with pandas.read_csv

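import time
import pandas as pd
from tqdm import tqdm


# @execution_time
def reader_pandas(file, chunkSize=100000, patitions=10 ** 4):
    # iterator=True returns a TextFileReader instead of loading the whole file at once
    reader = pd.read_csv(file, iterator=True)
    chunks = []
    # patitions is only an upper bound on the number of chunks; it is also the progress bar total
    with tqdm(range(patitions), 'Reading ...') as t:
        for _ in t:
            try:
                # read up to chunkSize rows from the current position
                chunk = reader.get_chunk(chunkSize)
                chunks.append(chunk)
            except StopIteration:
                # the file is exhausted
                break
    # merge all chunks into one DataFrame with a fresh 0..n-1 index
    return pd.concat(chunks, ignore_index=True)

print(reader_pandas("./data/train_set.csv"))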

Output:

D:\software\Anaconda3\python.exe D:/Competitions/DaGuanBei/test.py
Reading ...:   0%|          | 2/10000 [00:41<79:10:31, 28.51s/it] 
            id  ...  class
0            0  ...     14
1            1  ...      3
2            2  ...     12
3            3  ...     13
4            4  ...     12
5            5  ...     13
6            6  ...      1
7            7  ...     10
8            8  ...     10
9            9  ...     19
10          10  ...     18
11          11  ...      7
12          12  ...      9
13          13  ...      4
14          14  ...     17
15          15  ...      9
16          16  ...     13
17          17  ...     10
18          18  ...     10
19          19  ...     14
20          20  ...     10
21          21  ...      9
22          22  ...      1
23          23  ...      2
24          24  ...     13
25          25  ...      1
26          26  ...      7
27          27  ...     17
28          28  ...     10
29          29  ...      8
...        ...  ...    ...
102247  102247  ...      9
102248  102248  ...     18
102249  102249  ...     13
102250  102250  ...      9
102251  102251  ...      1
102252  102252  ...     14
102253  102253  ...     12
102254  102254  ...     11
102255  102255  ...     19
102256  102256  ...      2
102257  102257  ...      4
102258  102258  ...      3
102259  102259  ...      6
102260  102260  ...      9
102261  102261  ...      1
102262  102262  ...     18
102263  102263  ...      6
102264  102264  ...      8
102265  102265  ...     16
102266  102266  ...     18
102267  102267  ...     15
102268  102268  ...      3
102269  102269  ...      3
102270  102270  ...      3
102271  102271  ...      8
102272  102272  ...     14
102273  102273  ...      8
102274  102274  ...     12
102275  102275  ...      4
102276  102276  ...     11

[102277 rows x 4 columns]

Process finished with exit code 0

The code above uses pandas' read_csv(). Its default argument sep=',' sets the delimiter to ',', which matches the comma-separated CSV format.
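As a minimal sketch of the sep default, the snippet below parses the same small table twice, once comma-separated with the default and once tab-separated with an explicit sep. The in-memory strings and variable names are made up for illustration and stand in for files on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
comma_text = "id,class\n0,14\n1,3\n2,12\n"
tab_text = "id\tclass\n0\t14\n1\t3\n2\t12\n"

# sep defaults to ',', so a comma-separated file needs no extra argument.
comma_df = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter has to be passed explicitly.
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Both calls should yield the same two-column table.
print(comma_df.equals(tab_df))
```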

iterator : boolean, default False

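Returns a TextFileReader object so that the file can be processed chunk by chunk.

iterator=True means the file is read chunk by chunk.

reader.get_chunk(chunkSize) reads the next chunk, returning up to chunkSize rows on each call.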
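The behaviour of iterator=True and get_chunk can be seen on a tiny example. This is a sketch on made-up in-memory data (10 rows, chunk size 4), not the author's dataset. It also shows the chunksize argument of read_csv, which is not used in the code above but gives the same chunks through a plain for loop.

```python
import io
import pandas as pd

# Hypothetical 10-row CSV held in memory instead of a real file.
csv_text = "id,class\n" + "\n".join(f"{i},{i % 5}" for i in range(10)) + "\n"

# Variant 1: iterator=True plus explicit get_chunk calls.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)
sizes_get_chunk = []
while True:
    try:
        chunk = reader.get_chunk(4)  # read up to 4 rows
    except StopIteration:            # raised once no rows are left
        break
    sizes_get_chunk.append(len(chunk))
reader.close()

# Variant 2: chunksize makes read_csv return an iterable of DataFrames.
sizes_chunksize = [len(c) for c in pd.read_csv(io.StringIO(csv_text), chunksize=4)]

print(sizes_get_chunk, sizes_chunksize)
```

The last chunk is simply shorter than the requested size, so the loop does not need to know the row count in advance.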

The tqdm module prints a progress bar while the file is being read; see the references for details.
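In the output above the bar stops at 2/10000 because its total is the patitions argument, not the real number of chunks. One possible alternative, sketched here on made-up in-memory data, is to wrap the chunk iterator itself in tqdm so that the bar just counts the chunks actually read. The fallback definition of tqdm is an assumption added only so the sketch still runs where tqdm is not installed.

```python
import io
import pandas as pd

try:
    from tqdm import tqdm
except ImportError:
    # Assumed fallback for environments without tqdm: no progress bar, same iteration.
    def tqdm(iterable, **kwargs):
        return iterable

# Hypothetical 10-row CSV held in memory instead of a real file.
csv_text = "id,class\n" + "\n".join(f"{i},{i % 5}" for i in range(10)) + "\n"

chunks = []
# tqdm wraps any iterable; with no known total it shows a running count.
for chunk in tqdm(pd.read_csv(io.StringIO(csv_text), chunksize=4), desc="Reading ..."):
    chunks.append(chunk)

# Same final step as reader_pandas: one DataFrame with a fresh index.
df = pd.concat(chunks, ignore_index=True)
print(df.shape)
```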

References: