How to iterate over the rows of a pandas DataFrame
from:https://blog.csdn.net/tanzuozhev/article/details/76713387
How to iterate over rows in a DataFrame in Pandas (row-by-row DataFrame iteration)
https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
http://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas
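When working with a DataFrame, we inevitably need to inspect or process the data row by row. What efficient, convenient ways are there to do this?
Indexing by position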
import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)

# Walk the rows by integer position and read column 'c1' from each one
for x in range(len(df.index)):
    print(df['c1'].iloc[x])
This seems to be the most conventional approach, and it lets you modify the DataFrame while you iterate.
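As a minimal sketch of modifying the DataFrame during a positional loop, the snippet below adds c1 to c2 cell by cell. The update rule and the helper name `c2_pos` are illustrative assumptions, not part of the original post.

```python
import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)

# Look up the integer position of column 'c2' once, outside the loop
c2_pos = df.columns.get_loc('c2')

for x in range(len(df.index)):
    # df.iloc[row, col] reads and writes a single cell by integer position
    df.iloc[x, c2_pos] = df.iloc[x, c2_pos] + df['c1'].iloc[x]

print(df)
```

Writing through `df.iloc[row, col]` in a single indexing step changes the DataFrame itself, whereas assigning through a chained expression such as `df['c2'].iloc[x] = ...` may only change a temporary object.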
enumerate
for i, row in enumerate(df.values):
    index = df.index[i]  # index label of the i-th row
    print(row)
df.values is a numpy.ndarray.
Here i is the position of the row within the index, and row is a numpy.ndarray.
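Because each row comes out of a single numpy.ndarray, all columns share one dtype when iterating this way. The sketch below uses a made-up three-column frame (the column names n, x and s are assumptions) to show what that looks like.

```python
import pandas as pd

# Hypothetical frame with an int, a float and a string column
df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5], 's': ['a', 'b']})

numeric = df[['n', 'x']].values  # int + float columns
mixed = df.values                # numeric + string columns

print(numeric.dtype)  # float64
print(mixed.dtype)    # object

for i, row in enumerate(numeric):
    # row is a 1-D ndarray; df.index[i] recovers the index label
    print(df.index[i], row[0], row[1])
```

If the columns have very different types, selecting only the columns you need before calling `.values` keeps the rows from falling back to the generic object dtype.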
iterrows
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html
import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)

for index, row in df.iterrows():
    print(row['c1'], row['c2'])
# 10 100
# 11 110
# 12 120
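Each iteration of df.iterrows() yields a tuple containing the index label and the data of that row.
- With iterrows, row is a Series, so the dtypes of the DataFrame's columns are not preserved across a row.
- The returned Series is generally a copy of the data in the original DataFrame, so changes made to it should not be relied on to modify the original DataFrame.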
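The two points above can be sketched as follows. The frame with one integer and one float column is an assumption chosen so that the dtype change is visible.

```python
import pandas as pd

# Hypothetical frame: c1 is int64, c2 is float64
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [1.5, 2.5, 3.5]})

# A row Series has a single dtype, so the integer column is upcast to float
index, row = next(df.iterrows())
row_dtype = row.dtype
first_c1 = row['c1']
print(row_dtype, first_c1)

# Writing to the row Series does not reach the DataFrame here
for index, row in df.iterrows():
    row['c1'] = 0

unchanged = df['c1'].tolist()
print(unchanged)
```

When the original dtypes matter, itertuples or column-wise access is usually a better fit than iterrows.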
itertuples
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html
import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)

for row in df.itertuples():
    # Equivalent to: print(row[0], row[1], row[2])
    print(row.Index, row.c1, row.c2)
itertuples yields namedtuples of type pandas.core.frame.Pandas.
itertuples is generally considered faster than iterrows, since it does not build a Series for every row.
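A short sketch of what itertuples hands back, using the `index` and `name` parameters of `DataFrame.itertuples`. The variable names `first` and `plain` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

first = next(df.itertuples())
print(type(first).__name__)  # Pandas
print(first._fields)         # ('Index', 'c1', 'c2')
print(first.Index, first.c1, first.c2)

# index=False drops the index field; name=None yields plain tuples
plain = list(df.itertuples(index=False, name=None))
print(plain)
```

Plain tuples skip the namedtuple construction, which may help a little when only positional access is needed.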
zip / itertools.izip
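zip and itertools.izip are used in the same way, but in Python 2 zip returns a list whereas izip returns an iterator, so with large amounts of data zip performs worse than izip. In Python 3, itertools.izip no longer exists and the built-in zip already returns an iterator.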
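try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: itertools.izip is gone and zip is already lazy

import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)

for row in izip(df.index, df['c1'], df['c2']):
    print(row)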
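On Python 3, where the built-in zip is lazy, the same idea can be extended to every column without naming each one. This is a sketch under that assumption; the names `columns`, `lazy_rows` and `rows` are illustrative.

```python
import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)

# One iterable per column, zipped together with the index
columns = [df[name] for name in df.columns]
lazy_rows = zip(df.index, *columns)

# In Python 3, zip returns an iterator rather than a list
is_iterator = iter(lazy_rows) is lazy_rows

rows = [tuple(r) for r in lazy_rows]
print(is_iterator, rows)
```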
Timing comparison
import time

import pandas as pd
from numpy.random import randn

try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: itertools.izip is gone and zip is already lazy

df = pd.DataFrame({'a': randn(100000), 'b': randn(100000)})
time_stat = []

# range(index)
test_list = []
t = time.time()
for r in range(len(df)):
    test_list.append((df.index[r], df.iloc[r, 0], df.iloc[r, 1]))
time_stat.append(time.time() - t)

# enumerate
test_list = []
t = time.time()
for i, r in enumerate(df.values):
    test_list.append((df.index[i], r[0], r[1]))
time_stat.append(time.time() - t)

# iterrows
test_list = []
t = time.time()
for i, r in df.iterrows():
    # i is already the index label; df.index[i] only works with the default index
    test_list.append((df.index[i], r['a'], r['b']))
time_stat.append(time.time() - t)

# itertuples
test_list = []
t = time.time()
for ir in df.itertuples():
    test_list.append((ir[0], ir[1], ir[2]))
time_stat.append(time.time() - t)

# zip
test_list = []
t = time.time()
for r in zip(df.index, df['a'], df['b']):
    test_list.append((r[0], r[1], r[2]))
time_stat.append(time.time() - t)

# izip
test_list = []
t = time.time()
for r in izip(df.index, df['a'], df['b']):
    test_list.append((r[0], r[1], r[2]))
time_stat.append(time.time() - t)

time_df = pd.DataFrame({'items': ['range(index)', 'enumerate', 'iterrows', 'itertuples', 'zip', 'izip'], 'time': time_stat})
print(time_df.sort_values('time'))
          items       time
5          izip   0.034869
4           zip   0.040440
3    itertuples   0.072604
1     enumerate   0.174094
2      iterrows   4.026293
0  range(index)  21.921407
The measured times (in seconds) show that, from fastest to slowest, the methods rank as: izip, zip, itertuples, enumerate, iterrows, range(index).
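For readers on Python 3, a comparison of this kind could be repeated with the standard timeit module. The sketch below is an assumption-laden variant, not the original benchmark: it uses a smaller frame, a fixed random seed, and the hypothetical helper names `via_iterrows`, `via_itertuples` and `via_zip`. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(2000),
                   'b': rng.standard_normal(2000)})


def via_iterrows():
    return [(i, r['a'], r['b']) for i, r in df.iterrows()]


def via_itertuples():
    return [(t[0], t[1], t[2]) for t in df.itertuples()]


def via_zip():
    return list(zip(df.index, df['a'], df['b']))


# Total time of three runs per method, in seconds
timings = {f.__name__: timeit.timeit(f, number=3)
           for f in (via_iterrows, via_itertuples, via_zip)}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(name, round(seconds, 4))
```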