Some Common DataFrame Operations
阿新 • Published: 2021-10-11
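DataFrame arithmetic
- When two objects are added and their index labels do not all match, the index of the result is the union of the two indexes. For a DataFrame this alignment applies to both the rows and the columns. It behaves like an automatic outer join on the index labels in a database: positions that do not overlap are filled with NA values.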
# Assumes numpy and pandas were imported earlier in the session:
# import numpy as np; import pandas as pd
In [4]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   ...:                    index=['Ohio', 'Texas', 'Colorado'])
In [5]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   ...:                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [6]: df1
Out[6]:
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
In [7]: df2
Out[7]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [8]: df1 + df2
Out[8]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
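The alignment rule above can be checked directly. The following is a minimal sketch that rebuilds the two frames from the session, compares the labels of the sum against the union of the operands' labels, and then shows `DataFrame.add` with `fill_value=0` as one possible way to avoid some of the NaN values. The variable names (`total`, `overlap`, `filled`) are illustrative and do not appear in the original.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

total = df1 + df2

# The labels of the result are the union of both operands' labels.
row_union = df1.index.union(df2.index)
col_union = df1.columns.union(df2.columns)

# Only cells whose row AND column label exist in both frames hold a value.
shared_rows = df1.index.intersection(df2.index)
shared_cols = df1.columns.intersection(df2.columns)
overlap = total.loc[shared_rows, shared_cols]

# add() with fill_value=0 treats a value missing from ONE frame as 0.
# A cell missing from BOTH frames still comes out as NaN.
filled = df1.add(df2, fill_value=0)

print(total)
print(filled)
```

Whether filling with 0 is appropriate depends on what a missing label means in the data; the plain `+` operator keeps the gap visible instead.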
- To sort lexicographically by row or column labels, use sort_index, which returns a new, sorted object. Note that sort_index sorts the index labels, not the data.
In [15]: obj = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
    ...:                    columns=['d', 'a', 'b', 'c'])
In [16]: obj
Out[16]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
In [17]: obj.sort_index()
Out[17]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [18]: obj.sort_index(axis=1)
Out[18]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
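As a small sketch of the point that sort_index returns a new object, the snippet below rebuilds `obj` and confirms that the original keeps its order. It also shows the `ascending=False` option for descending label order. The names `by_rows`, `by_cols` and `by_cols_desc` are illustrative only.

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame(np.arange(8).reshape((2, 4)),
                   index=['three', 'one'],
                   columns=['d', 'a', 'b', 'c'])

by_rows = obj.sort_index()          # rows ordered by label
by_cols = obj.sort_index(axis=1)    # columns ordered by label
by_cols_desc = obj.sort_index(axis=1, ascending=False)  # reverse label order

# obj itself is unchanged: sort_index returned new frames.
print(list(obj.index))
print(list(by_rows.index))
print(list(by_cols_desc.columns))
```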
- To sort rows by their values, use sort_values and pass by to name the column to sort on.
In [20]: obj.sort_values(by='b')
Out[20]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
In [21]: obj.sort_values(by='b', ascending=False)
Out[21]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
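The `by` argument also accepts a list of column names, which helps when the first column contains ties. The sketch below uses a hypothetical frame (`frame`, not from the original) to show tie-breaking and per-column sort direction.

```python
import pandas as pd

# Hypothetical data with ties in column 'a'.
frame = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [4, 7, -3, 2]})

# Sort by 'a' first; rows with equal 'a' are then ordered by 'b'.
by_a_then_b = frame.sort_values(by=['a', 'b'])

# A list passed to ascending sets the direction for each column in turn.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(by_a_then_b)
print(mixed)
```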
DataFrame intersection, union, and complement
In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
   ...:         'year': [2000, 2001, 2002, 2001, 2002],
   ...:         'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
In [4]: data
Out[4]:
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002],
 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
In [5]: df1 = pd.DataFrame(data)
In [6]: df1
Out[6]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9
In [11]: data1 = {'state': ['Ohio1', 'Ohio1', 'Ohio', 'Nevada1', 'Nevada1'],
    ...:          'year': [2000, 2021, 2022, 2021, 2002],
    ...:          'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}
In [12]: df2 = pd.DataFrame(data1)
In [14]: df2
Out[14]:
     state  year  pop
0    Ohio1  2000  1.5
1    Ohio1  2021  1.7
2     Ohio  2022  3.6
3  Nevada1  2021  2.4
4  Nevada1  2002  3.9
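- Intersection: with how='inner' and no on argument, pd.merge joins on every column the two frames share (here state, year and pop) and keeps only the rows found in both.
In [22]: pd.merge(df1, df2, how='inner')
Out[22]:
  state  year  pop
0  Ohio  2002  3.6
# Note: this output requires row 2 of df2 to be ('Ohio', 2002, 3.6). With df2 as printed above (year 2022), no row matches and the result is empty.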
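To make the intersection behaviour easier to reproduce, here is a minimal sketch that merges `df1` from the session with a hypothetical second frame, `other`, which shares two rows with it. The frame `other` and its values are assumptions made for illustration. The `on` list spells out the join keys that pd.merge would otherwise infer from the common column names.

```python
import pandas as pd

df1 = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                    'year': [2000, 2001, 2002, 2001, 2002],
                    'pop': [1.5, 1.7, 3.6, 2.4, 3.9]})

# Hypothetical frame: the first two rows also occur in df1, the third does not.
other = pd.DataFrame({'state': ['Ohio', 'Nevada', 'Utah'],
                      'year': [2002, 2001, 2000],
                      'pop': [3.6, 2.4, 1.0]})

# Inner merge on all three columns keeps only rows present in both frames.
common = pd.merge(df1, other, how='inner', on=['state', 'year', 'pop'])

print(common)
```

Matching on a float column such as `pop` relies on exact equality, so in practice it may be safer to join on the discrete keys only.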
- Union: how='outer' keeps every row from both frames.
In [16]: pd.merge(df1, df2, how='outer')
Out[16]:
     state  year  pop
0     Ohio  2000  1.5
1     Ohio  2001  1.7
2     Ohio  2002  3.6
3   Nevada  2001  2.4
4   Nevada  2002  3.9
5    Ohio1  2000  1.5
6    Ohio1  2021  1.7
7     Ohio  2022  3.6
8  Nevada1  2021  2.4
9  Nevada1  2002  3.9
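The section title also mentions the complement, which the session does not show. One possible approach, sketched here with two small hypothetical frames (`left` and `right`), is an outer merge with `indicator=True`. pandas then adds a `_merge` column whose values are `left_only`, `right_only` or `both`, and filtering on it yields the rows that exist in only one frame.

```python
import pandas as pd

# Hypothetical frames sharing two rows: (Ohio, 2002) and (Nevada, 2001).
left = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                     'year': [2001, 2002, 2001]})
right = pd.DataFrame({'state': ['Ohio', 'Nevada', 'Utah'],
                      'year': [2002, 2001, 2000]})

# Union of the rows, with a _merge column recording where each row came from.
union = pd.merge(left, right, how='outer', indicator=True)

# Complement of right within left: rows that appear only in left.
only_left = union.loc[union['_merge'] == 'left_only', ['state', 'year']]

# Rows that appear only in right.
only_right = union.loc[union['_merge'] == 'right_only', ['state', 'year']]

print(union)
print(only_left)
print(only_right)
```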
Boolean indexing on a DataFrame
- Select the rows whose value in a given column is greater than 2
In [23]: df1
Out[23]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9
In [24]: df1[df1['pop'] > 2]
Out[24]:
    state  year  pop
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9
# Conditions can be combined with logical operators; wrap each condition in parentheses
In [27]: df1[(df1['pop'] > 2) & (df1['year'] > 2001)]
Out[27]:
    state  year  pop
2    Ohio  2002  3.6
4  Nevada  2002  3.9
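Besides `&`, boolean masks can be combined with `|` (or) and inverted with `~` (not). The sketch below rebuilds `df1` from the session and shows both, plus `.loc` with a mask and a column label. The result names (`either`, `not_ohio`, `names`) are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                    'year': [2000, 2001, 2002, 2001, 2002],
                    'pop': [1.5, 1.7, 3.6, 2.4, 3.9]})

# | is element-wise OR: keep rows with pop > 3 or year == 2000.
either = df1[(df1['pop'] > 3) | (df1['year'] == 2000)]

# ~ negates a mask: keep rows whose state is not Ohio.
not_ohio = df1[~(df1['state'] == 'Ohio')]

# .loc accepts the same mask together with a column selection.
names = df1.loc[df1['pop'] > 2, 'state']

print(either)
print(not_ohio)
print(names)
```

The parentheses matter because `&`, `|` and `~` bind more tightly than comparison operators in Python. Python's `and` and `or` keywords do not work element-wise on a Series.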