Iteration and Parallel Processing in pandas

阿新 • Published: 2020-09-21

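Iteration and parallel processing are common operations when working with data in pandas. This article summarizes several ways to iterate over a DataFrame and several methods for processing its data in bulk. "Parallel processing" is used loosely here: methods such as map() and apply() call a function on every element, row, or column without an explicit loop, but they still run in a single process.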

1. Preparing the sample data

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(40, 100, (5, 10)), columns=[f's{i}' for i in range(10)], index=['john', 'bob', 'mike', 'bill', 'lisa'])
df['is_passed'] = df.s9.map(lambda x: True if x > 60 else False)

Output of df (the scores are random, so your values will differ):

      s0  s1  s2  s3  s4  s5  s6  s7  s8  s9  is_passed
john  56  70  85  91  92  80  63  81  45  57      False
bob   99  93  80  42  91  81  53  75  61  78       True
mike  76  92  76  80  57  98  94  79  87  94       True
bill  81  83  92  91  51  55  40  77  96  90       True
lisa  85  82  56  57  54  56  49  43  99  51      False
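Because the scores above come from an unseeded random generator, the table cannot be reproduced exactly. The following is a minimal sketch of a repeatable variant; the seed value 0 and the use of NumPy's Generator API are assumptions for illustration, not part of the original setup.

```python
import numpy as np
import pandas as pd

# Assumption: seed 0 is arbitrary; any fixed seed makes the table repeatable.
rng = np.random.default_rng(0)
names = ['john', 'bob', 'mike', 'bill', 'lisa']
scores = rng.integers(40, 100, size=(len(names), 10))  # integers in [40, 100)

df = pd.DataFrame(scores, columns=[f's{i}' for i in range(10)], index=names)
# A vectorized comparison yields the same booleans as the lambda-based map.
df['is_passed'] = df['s9'] > 60
print(df)
```

The seeded values will not match the table shown above, but they stay the same from run to run.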

2. Iteration

pandas provides three methods for iterating over data:

2.1. iterrows

Iterates row by row, yielding each row of the DataFrame as an (index, Series) pair. Elements can be accessed as row['col'] or row.col, where col is a column name.

>>> for index, row in df.iterrows():
...     print(row['s0'])  # 也可使用 row.s0
    
56
99
76
81
85

2.2. itertuples

Iterates row by row, yielding each row of the DataFrame as a namedtuple. Elements can be accessed as row.col, and this method is more efficient than iterrows.

>>> for row in df.itertuples():
...     print(row.s0)
    
56
99
76
81
85
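Besides speed, the two row iterators differ in how they treat dtypes: iterrows builds one Series per row, so mixed columns are upcast to a common dtype, while itertuples keeps each value as it is. The small frame below, with hypothetical columns qty and ratio, is only an illustration of that difference.

```python
import pandas as pd

# Hypothetical frame mixing an integer column and a float column.
mixed = pd.DataFrame({'qty': [1, 2], 'ratio': [0.5, 1.5]})

_, first_row = next(mixed.iterrows())
first_tuple = next(mixed.itertuples())

print(first_row.dtype)        # dtype of the row Series built by iterrows
print(type(first_tuple.qty))  # type of the value carried by the namedtuple
print(first_tuple.Index)      # itertuples exposes the index label as .Index
```

If the original integer type of a value matters inside the loop, itertuples is the safer choice of the two.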

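2.3. items (formerly iteritems)

Iterates column by column, yielding each column of the DataFrame as a (column name, Series) pair. Elements can be accessed by position with col.iloc[i]. DataFrame.iteritems() was deprecated in pandas 1.5 and removed in 2.0; items() behaves the same way and is also available in older versions.

>>> for name, col in df.items():
...     print(col.iloc[0])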
    
56
70
85
91
92
80
63
81
45
57
False
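Column-wise iteration is mostly useful when each column needs its own summary. As a rough sketch, the loop below records which student has the top score in every column; the three-row frame and the highest dictionary are made up for this example.

```python
import pandas as pd

# Hypothetical subset of the score table.
scores = pd.DataFrame(
    {'s0': [56, 99, 76], 's1': [70, 93, 92]},
    index=['john', 'bob', 'mike'],
)

highest = {}
for name, col in scores.items():
    # idxmax returns the index label of the largest value in the column.
    highest[name] = col.idxmax()

print(highest)
```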

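3. Parallel processing

3.1. The map method

Like Python's built-in map() function, the pandas Series.map() method takes a function, a dictionary, or another object that accepts a single input value, applies it to each element of a single column in turn, and returns the results as a new Series. map() also has an na_action parameter, similar to na.action in R, which takes None (the default) or 'ignore' and controls how missing values are handled: with 'ignore', NaN values are passed through unchanged instead of being handed to the mapping.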
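The effect of na_action is easiest to see on a column that contains a missing value. This is a minimal sketch with a made-up grades Series; without na_action='ignore', str.upper would be called on NaN and raise a TypeError.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry.
grades = pd.Series(['a', np.nan, 'b'], dtype=object)

# With na_action='ignore' the NaN is skipped and propagated as-is.
upper = grades.map(str.upper, na_action='ignore')
print(upper)
```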

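For example, to replace True with 1 and False with 0 in the is_passed column, any of the following approaches works:

3.1.1. Dictionary mapping

>>> # Define the mapping dictionary
... score_map = {True: 1, False: 0}

>>> # Use map() to get the mapped version of the is_passed column
... df.is_passed.map(score_map)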

john    0
bob     1
mike    1
bill    1
lisa    0
Name: is_passed, dtype: int64
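One detail worth keeping in mind with dictionary mapping: values that have no matching key become NaN rather than raising an error. The sketch below uses a hypothetical answers Series and an arbitrary fill value of -1 to show one way of handling that.

```python
import pandas as pd

# Hypothetical column; 'maybe' has no entry in the mapping below.
answers = pd.Series(['yes', 'no', 'maybe'])
coded = answers.map({'yes': 1, 'no': 0})

# The unmapped value is NaN, which forces a float dtype;
# fill it (here with -1, an arbitrary choice) before casting back to int.
filled = coded.fillna(-1).astype(int)
print(filled)
```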

3.1.2. `lambda` function

>>> # Same style as the lambda used when the column was created
... df.is_passed.map(lambda x: 1 if x else 0)

john    0
bob     1
mike    1
bill    1
lisa    0
Name: is_passed, dtype: int64

3.1.3. Regular function

>>> def bool_to_num(x):
...     return 1 if x else 0

>>> df.is_passed.map(bool_to_num)
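For this particular conversion, a vectorized cast should give the same values as the element-wise function. The comparison below is a small sketch on a hand-written is_passed Series rather than the random table above.

```python
import pandas as pd

def bool_to_num(x):
    return 1 if x else 0

# Hand-written stand-in for the is_passed column.
is_passed = pd.Series([False, True, True],
                      index=['john', 'bob', 'mike'], name='is_passed')

mapped = is_passed.map(bool_to_num)  # element-wise Python call
cast = is_passed.astype(int)         # vectorized dtype conversion

print(mapped.tolist(), cast.tolist())
```

map() remains the more general tool, since it accepts arbitrary logic that a dtype cast cannot express.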

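3.1.4. Special objects

Other callables that take a single input value and return an output can also be passed to map(), such as a bound str.format method: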

>>> df.is_passed.map('is passed: {}'.format)

john    is passed: False
bob      is passed: True
mike     is passed: True
bill     is passed: True
lisa    is passed: False
Name: is_passed, dtype: object

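3.2. The apply method

apply() is used much like map(): its main argument is also a callable that takes an input and returns an output. But whereas map() works on a single Series, one apply() call can operate on a single column or on several columns at once, which covers a wide range of use cases. These are described below:

3.2.1. Single column

Passing a lambda function: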

df.is_passed.apply(lambda x: 1 if x else 0)
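Series.apply() can also forward extra arguments to the function, which avoids wrapping it in a lambda. In this sketch, the passed function and its threshold parameter are hypothetical names introduced only for illustration.

```python
import pandas as pd

def passed(score, threshold):
    # Hypothetical helper: True when the score exceeds the threshold.
    return score > threshold

s9 = pd.Series([57, 78, 94], index=['john', 'bob', 'mike'])

by_keyword = s9.apply(passed, threshold=60)  # extra keyword argument
by_position = s9.apply(passed, args=(60,))   # extra positional argument

print(by_keyword)
```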

3.2.2. Multiple input columns

>>> def gen_describe(s9, is_passed):
...     return f"s9's score is {s9}, so {'passed' if is_passed else 'failed'}"

>>> df.apply(lambda r: gen_describe(r['s9'], r['is_passed']), axis=1)

john    s9's score is 57, so failed
bob     s9's score is 78, so passed
mike    s9's score is 94, so passed
bill    s9's score is 90, so passed
lisa    s9's score is 51, so failed
dtype: object

3.2.3. Multiple output columns

>>> df.apply(lambda row: (row['s9'], row['s8']), axis=1)

john    (57, 45)
bob     (78, 61)
mike    (94, 87)
bill    (90, 96)
lisa    (51, 99)
dtype: object
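The result above is a single Series of tuples. To get separate columns instead, one option is the result_type='expand' argument of DataFrame.apply, and another is to return a Series with named entries. The frame and the column names s9_copy, s8_copy, total, and best below are made up for this sketch.

```python
import pandas as pd

# Hypothetical two-student subset of the score table.
scores = pd.DataFrame({'s8': [45, 61], 's9': [57, 78]}, index=['john', 'bob'])

# Option 1: expand each returned tuple into columns 0, 1, ... and rename them.
expanded = scores.apply(lambda row: (row['s9'], row['s8']),
                        axis=1, result_type='expand')
expanded.columns = ['s9_copy', 's8_copy']

# Option 2: return a Series; its labels become the output column names.
named = scores.apply(
    lambda row: pd.Series({'total': row['s8'] + row['s9'],
                           'best': max(row['s8'], row['s9'])}),
    axis=1,
)

print(expanded)
print(named)
```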

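3.3. The applymap method

applymap is the DataFrame counterpart of the Series map method. It takes a function and returns the corresponding results, but it applies that function to the element at every position in the whole DataFrame. In pandas 2.1 and later, applymap is deprecated in favor of the equivalent DataFrame.map. For example, to raise every score in df that is below 50 to 50: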

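>>> def at_least_get_50(x):
...     # bool is a subclass of int, so skip it explicitly to leave is_passed untouched
...     if isinstance(x, (bool, np.bool_)):
...         return x
...     if isinstance(x, (int, np.integer)) and x < 50:
...         return 50
...     return x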

>>> df.applymap(at_least_get_50)

      s0  s1  s2  s3  s4  s5  s6  s7  s8  s9  is_passed
john  56  70  85  91  92  80  63  81  50  57      False
bob   99  93  80  50  91  81  53  75  61  78       True
mike  76  92  76  80  57  98  94  79  87  94       True
bill  81  83  92  91  51  55  50  77  96  90       True
lisa  85  82  56  57  54  56  50  50  99  51      False
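For a simple floor like this, a vectorized alternative avoids calling a Python function per cell. The sketch below swaps in clip() on the numeric columns only; it is an alternative technique, not the element-wise method described above, and the small frame is made up for the example.

```python
import pandas as pd

# Hypothetical subset with two score columns and the boolean flag.
scores = pd.DataFrame(
    {'s0': [56, 42], 's1': [49, 93], 'is_passed': [False, True]},
    index=['john', 'bob'],
)

# 'number' selects integer and float columns but not boolean ones.
numeric_cols = scores.select_dtypes(include='number').columns

clipped = scores.copy()
clipped[numeric_cols] = clipped[numeric_cols].clip(lower=50)
print(clipped)
```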

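Appendix: adding a progress bar to apply with tqdm

When processing large amounts of data in Jupyter, you usually have to wait after starting a run until it either raises an error or finishes. tqdm shows the processing progress in real time; install it first with pip install tqdm. Example usage: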

from tqdm import tqdm

def gen_describe(s9, is_passed):
    return f"s9's score is {s9}, so {'passed' if is_passed else 'failed'}"

# Enable progress monitoring for the apply call that follows
tqdm.pandas(desc='apply')
df.progress_apply(lambda r: gen_describe(r['s9'], r['is_passed']), axis=1)
