Reading large files in chunks with pandas.read_csv
阿新 • Published: 2019-01-27
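import pandas as pd
from tqdm import tqdm


def reader_pandas(file, chunkSize=100000, partitions=10 ** 4):
    # iterator=True returns a TextFileReader instead of loading the whole file.
    reader = pd.read_csv(file, iterator=True)
    chunks = []
    # partitions is only an upper bound on the number of chunks read,
    # so the progress bar total is a guess, not the real chunk count.
    with tqdm(range(partitions), 'Reading ...') as t:
        for _ in t:
            try:
                # Read up to chunkSize rows per call.
                chunk = reader.get_chunk(chunkSize)
                chunks.append(chunk)
            except StopIteration:
                # The file is exhausted.
                break
    return pd.concat(chunks, ignore_index=True)


print(reader_pandas("./data/train_set.csv"))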
Output:
D:\software\Anaconda3\python.exe D:/Competitions/DaGuanBei/test.py
Reading ...:   0%|          | 2/10000 [00:41<79:10:31, 28.51s/it]
            id  ...    class
0            0  ...       14
1            1  ...        3
2            2  ...       12
3            3  ...       13
4            4  ...       12
5            5  ...       13
6            6  ...        1
7            7  ...       10
8            8  ...       10
9            9  ...       19
10          10  ...       18
11          11  ...        7
12          12  ...        9
13          13  ...        4
14          14  ...       17
15          15  ...        9
16          16  ...       13
17          17  ...       10
18          18  ...       10
19          19  ...       14
20          20  ...       10
21          21  ...        9
22          22  ...        1
23          23  ...        2
24          24  ...       13
25          25  ...        1
26          26  ...        7
27          27  ...       17
28          28  ...       10
29          29  ...        8
...        ...  ...      ...
102247  102247  ...        9
102248  102248  ...       18
102249  102249  ...       13
102250  102250  ...        9
102251  102251  ...        1
102252  102252  ...       14
102253  102253  ...       12
102254  102254  ...       11
102255  102255  ...       19
102256  102256  ...        2
102257  102257  ...        4
102258  102258  ...        3
102259  102259  ...        6
102260  102260  ...        9
102261  102261  ...        1
102262  102262  ...       18
102263  102263  ...        6
102264  102264  ...        8
102265  102265  ...       16
102266  102266  ...       18
102267  102267  ...       15
102268  102268  ...        3
102269  102269  ...        3
102270  102270  ...        3
102271  102271  ...        8
102272  102272  ...       14
102273  102273  ...        8
102274  102274  ...       12
102275  102275  ...        4
102276  102276  ...       11

[102277 rows x 4 columns]

Process finished with exit code 0
The code above uses pandas' read_csv(). Its default separator argument, sep=',', matches the comma that CSV files use as their delimiter.
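To illustrate the sep argument, here is a minimal sketch that parses the same two rows once as comma-separated text and once as tab-separated text. The in-memory strings are made-up stand-ins for real files; they are not the article's data.

```python
import io

import pandas as pd

# Hypothetical sample data, held in memory instead of on disk.
comma_text = "id,class\n0,14\n1,3\n"
tab_text = "id\tclass\n0\t14\n1\t3\n"

# The default sep=',' handles the comma-separated text as is.
df_comma = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter has to be passed explicitly.
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Both calls should yield the same two-column frame.
print(df_comma.equals(df_tab))
```

If the tab-separated text were read without sep="\t", each line would most likely end up in a single column, which is why the default only fits comma-delimited files.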
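iterator : boolean, default False
Returns a TextFileReader object so that the file can be processed chunk by chunk.
With iterator=True, read_csv returns this reader instead of a DataFrame, so the file is read chunk by chunk.
reader.get_chunk(chunkSize) reads the next chunk, which holds at most chunkSize rows, on each call.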
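The behaviour of iterator=True and get_chunk described above can be checked on a small scale. The sketch below assumes a tiny in-memory CSV of 7 rows (hypothetical data) and a chunk size of 3, and records how many rows each call returns until the reader raises StopIteration.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for a CSV file on disk: 7 data rows.
csv_text = "id,class\n" + "".join(f"{i},{i % 5}\n" for i in range(7))

# iterator=True gives back a reader object rather than a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)
print(type(reader).__name__)

sizes = []
while True:
    try:
        # Ask for up to 3 rows; the final chunk may be shorter.
        chunk = reader.get_chunk(3)
    except StopIteration:
        # Raised once no rows are left.
        break
    sizes.append(len(chunk))

print(sizes)
```

The last chunk contains whatever rows remain, so the chunk size does not need to divide the row count evenly.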
The tqdm module prints a progress bar while the file is read; see the references for details.
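As a variation on the progress bar, the sketch below wraps the reader itself in tqdm instead of a range with a guessed upper bound. It swaps in the chunksize argument of read_csv, which the original code does not use, and runs on hypothetical in-memory data. Because the number of chunks is not known in advance, tqdm is expected to show a running count without a percentage.

```python
import io

import pandas as pd
from tqdm import tqdm

# Hypothetical in-memory data standing in for a large CSV file: 10 data rows.
csv_text = "id,class\n" + "".join(f"{i},{i % 3}\n" for i in range(10))

chunks = []
# chunksize=4 makes read_csv return an iterable reader that yields
# DataFrames of up to 4 rows each; tqdm counts the chunks as they arrive.
for chunk in tqdm(pd.read_csv(io.StringIO(csv_text), chunksize=4),
                  desc="Reading", unit="chunk"):
    chunks.append(chunk)

# Stitch the pieces back together with a fresh 0..n-1 index.
df = pd.concat(chunks, ignore_index=True)
print(df.shape)
```

This form needs no try/except around StopIteration, since the for loop ends on its own when the reader is exhausted.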
References: