
Some Common DataFrame Operations

DataFrame arithmetic

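  • When two objects are added and their index labels do not all match, the index of the result is the union of the two indexes. This behaves like an automatic outer join on the index labels in a database: positions that do not overlap are filled with NA values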
In [1]: import numpy as np

In [2]: import pandas as pd

In [4]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   ...:                    index=['Ohio', 'Texas', 'Colorado'])

In [5]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   ...:                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [6]: df1
Out[6]: 
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

In [7]: df2
Out[7]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [8]: df1 + df2
Out[8]: 
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
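
If the NaN values produced by non-overlapping labels are not wanted, one option is the add method with a fill_value. The sketch below reuses df1 and df2 from above; the names total and filled are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Plain addition: any label present in only one frame yields NaN.
total = df1 + df2

# add() with fill_value=0 treats a value missing from ONE frame as 0.
# Cells missing from BOTH frames (for example Colorado/e) are still NaN.
filled = df1.add(df2, fill_value=0)
print(filled)
```

The result still carries the union of both indexes; only the cells that exist in at least one of the two frames are filled in.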
  • To sort lexicographically by row or column index, use sort_index, which returns a new, sorted object. Note that sort_index sorts the index labels, not the contents
In [15]: obj = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
    ...:                    columns=['d', 'a', 'b', 'c'])

In [16]: obj
Out[16]: 
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [17]: obj.sort_index()
Out[17]: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [18]: obj.sort_index(axis=1)
Out[18]: 
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
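
sort_index also accepts ascending=False, and because it returns a new object the original is left untouched. A minimal sketch, rebuilding the same obj as above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                   columns=['d', 'a', 'b', 'c'])

# Sort the column labels in descending order.
cols_desc = obj.sort_index(axis=1, ascending=False)

# Sort the row labels; obj itself keeps its original order.
rows_sorted = obj.sort_index()

print(cols_desc)
print(rows_sorted)
```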
  • To sort rows by their contents, use sort_values; the by argument specifies which column to sort on
In [20]: obj.sort_values(by='b')
Out[20]: 
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [21]: obj.sort_values(by='b', ascending=False)
Out[21]: 
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
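
by can also be a list of columns, in which case later columns break ties in earlier ones. The frame below is a hypothetical example, not data from this article:

```python
import pandas as pd

# Hypothetical data with repeated values in column 'a'.
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' first, then by 'b' within each group of equal 'a'.
by_a_then_b = frame.sort_values(by=['a', 'b'])

# A list for ascending sets the direction per column:
# 'a' ascending, 'b' descending.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(by_a_then_b)
print(mixed)
```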

DataFrame intersection, union, and complement

In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year':
   ...:  [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}

In [4]: data
Out[4]: 
{'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002],
 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}

In [5]: df1 = pd.DataFrame(data)

In [6]: df1
Out[6]: 
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9

In [11]: data1 = {'state': ['Ohio1', 'Ohio1', 'Ohio', 'Nevada1', 'Nevada1'], 'year': [2000, 2021
    ...: , 2022, 2021, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 3.9]}

In [12]: df2 = pd.DataFrame(data1)
  
In [14]: df2
Out[14]: 
     state  year  pop
0    Ohio1  2000  1.5
1    Ohio1  2021  1.7
2     Ohio  2022  3.6
3  Nevada1  2021  2.4
4  Nevada1  2002  3.9
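  • Intersection: with no on argument, pd.merge joins on every column the two frames share, so how='inner' keeps only the rows present in both
In [22]: pd.merge(df1, df2, how='inner')
Out[22]: 
Empty DataFrame
Columns: [state, year, pop]
Index: []

# df1 and df2 as defined above share no complete row (df1 has Ohio/2002/3.6, df2 has Ohio/2022/3.6), so the intersection is empty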
  • Union: how='outer' keeps the rows of both frames
In [16]: pd.merge(df1, df2, how='outer')
Out[16]: 
     state  year  pop
0     Ohio  2000  1.5
1     Ohio  2001  1.7
2     Ohio  2002  3.6
3   Nevada  2001  2.4
4   Nevada  2002  3.9
5    Ohio1  2000  1.5
6    Ohio1  2021  1.7
7     Ohio  2022  3.6
8  Nevada1  2021  2.4
9  Nevada1  2002  3.9
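
The complement (rows in one frame but not the other) is not shown above. One possible approach is a left merge with indicator=True, keeping the rows marked left_only. The sketch below uses two small hypothetical frames, left and right, that do share rows, and assumes rows are unique within each frame:

```python
import pandas as pd

# Hypothetical frames that overlap in two rows.
left = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                     'year': [2000, 2001, 2001],
                     'pop': [1.5, 1.7, 2.4]})
right = pd.DataFrame({'state': ['Ohio', 'Nevada', 'Utah'],
                      'year': [2001, 2001, 2002],
                      'pop': [1.7, 2.4, 3.0]})

# Intersection: rows found in both frames (joined on all shared columns).
both = pd.merge(left, right, how='inner')

# Union: rows found in either frame.
either = pd.merge(left, right, how='outer')

# Complement: rows of left that do not appear in right.
# indicator=True adds a '_merge' column naming each row's origin.
marked = pd.merge(left, right, how='left', indicator=True)
only_left = marked[marked['_merge'] == 'left_only'].drop(columns='_merge')

print(only_left)
```

Note that the merge compares the float column pop for exact equality, which is fine for literal values like these but can be fragile for computed floats.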

DataFrame boolean indexing

  • Select the rows in which a given column's value is greater than 2
In [23]: df1
Out[23]: 
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9

In [24]: df1[df1['pop'] > 2]
Out[24]: 
    state  year  pop
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  3.9

# Conditions can be combined with logical operators; wrap each condition in parentheses
In [27]: df1[(df1['pop'] > 2) & (df1['year'] > 2001)]
Out[27]: 
    state  year  pop
2    Ohio  2002  3.6
4  Nevada  2002  3.9
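
Besides &, masks can be combined with | (or) and negated with ~, and isin tests membership in a list of values. A short sketch on the same df1 data (the result names are illustrative); each comparison needs its own parentheses because & and | bind more tightly than > or ==:

```python
import pandas as pd

df1 = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                    'year': [2000, 2001, 2002, 2001, 2002],
                    'pop': [1.5, 1.7, 3.6, 2.4, 3.9]})

# OR: rows where pop > 3 or the year is 2000.
either = df1[(df1['pop'] > 3) | (df1['year'] == 2000)]

# NOT: rows whose state is not Ohio.
not_ohio = df1[~(df1['state'] == 'Ohio')]

# Membership: rows whose year is one of the listed values.
picked = df1[df1['year'].isin([2000, 2002])]

print(either)
print(not_ohio)
print(picked)
```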