Pandas DataFrame: Selecting and Filtering Data

A chainable row filter would allow operations like:

pd.read_csv('imdb.txt')
  .sort_values(by='year')
  .filter(lambda x: x['year'] > 1990)   # <--- this is missing in Pandas
  .to_csv('filtered.csv')

For current alternatives see:

http://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining

You can do it like this:

import pandas as pd

df = pd.read_csv('imdb.txt').sort_values(by='year')
df[df['year'] > 1990].to_csv('filtered.csv')

  

# however, could potentially do something like this:

pd.read_csv('imdb.txt')
  .sort_values(by='year')
  .[lambda x: x['year'] > 1990]
  .to_csv('filtered.csv')
or

pd.read_csv('imdb.txt')
  .sort_values(by='year')
  .loc[lambda x: x['year'] > 1990]
  .to_csv('filtered.csv')
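
The `.loc[lambda ...]` form is accepted by current pandas releases, so the chain can be written without an intermediate variable. Below is a minimal sketch under stated assumptions: the `movies` frame, its `title` column, and its values are made up to stand in for imdb.txt, and the chain is wrapped in parentheses so that Python allows the line breaks. `DataFrame.query` is shown as one alternative way to express the same filter.

```python
import pandas as pd

# Hypothetical stand-in for imdb.txt; titles and years are illustrative only.
movies = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'year': [1995, 1985, 2001, 1990],
})

# .loc with a callable: the callable receives the frame as it is at that
# point in the chain and returns a boolean mask.
recent = (
    movies
    .sort_values(by='year')
    .loc[lambda d: d['year'] > 1990]
)

# The same filter written as a query string.
recent_q = movies.sort_values(by='year').query('year > 1990')

print(recent)
```

Appending `.to_csv('filtered.csv')` to either chain would write the result to disk, as in the snippets above.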

  

Source: https://yangjin795.github.io/pandas_df_selection.html

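Pandas (the Python Data Analysis Library) is a Python library built on NumPy and designed for data analysis. It provides many tools and methods that make working with large amounts of data in Python efficient and convenient.

This article covers the methods and tools Pandas offers for filtering and selecting data in a DataFrame. The raw data used throughout is as follows: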

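import numpy as np
import pandas as pd

# dates is not defined in the original; a six-day range matching the output below is assumed.
dates = pd.date_range('2017-04-01', periods=6)
# The values are random, so your numbers will differ from the ones shown.
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df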
    Out[9]: 
                     A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
    2017-04-04  1.700309  0.287588 -0.012103  0.525291
    2017-04-05  0.526615 -0.417645  0.405853 -0.835213
    2017-04-06  1.143858 -0.326720  1.425379  0.531037

Selection

Selecting with []

Select one or more columns:

df['A']
Out:
    2017-04-01    0.522241
    2017-04-02    2.104572
    2017-04-03    0.480507
    2017-04-04    1.700309
    2017-04-05    0.526615
    2017-04-06    1.143858
df[['A', 'B']]
Out:
                       A         B
    2017-04-01  0.522241  0.495106
    2017-04-02  2.104572 -0.977768
    2017-04-03  0.480507  1.215048
    2017-04-04  1.700309  0.287588
    2017-04-05  0.526615 -0.417645
    2017-04-06  1.143858 -0.326720

Select one or more rows:

df['2017-04-01':'2017-04-01']
Out:
                       A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
df['2017-04-01':'2017-04-03']
Out:
                       A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
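
What `[]` returns depends on what goes inside it: a single label gives a Series, a list of labels gives a DataFrame, and a slice selects rows. The sketch below illustrates this; it assumes a six-day index like the article's but uses integers instead of the random values so the shapes are easy to check.

```python
import numpy as np
import pandas as pd

# Illustrative data: integers instead of the article's random values.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

col = df['A']                          # one label -> Series
one_col_frame = df[['A']]              # list with one label -> DataFrame
cols = df[['A', 'B']]                  # list of labels -> DataFrame
rows = df['2017-04-01':'2017-04-03']   # slice -> rows, both endpoints included

print(type(col).__name__, one_col_frame.shape, cols.shape, rows.shape)
```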

loc: selecting data by row label

df.loc['2017-04-01', 'A']
df.loc['2017-04-01']
Out:
    A    0.522241
    B    0.495106
    C   -0.268194
    D   -0.035003
df.loc['2017-04-01':'2017-04-03']
Out:
                       A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
df.loc['2017-04-01':'2017-04-04', ['A', 'B']]
Out:
                       A         B
    2017-04-01  0.522241  0.495106
    2017-04-02  2.104572 -0.977768
    2017-04-03  0.480507  1.215048
    2017-04-04  1.700309  0.287588
df.loc[:, ['A', 'B']]
Out:
                       A         B
    2017-04-01  0.522241  0.495106
    2017-04-02  2.104572 -0.977768
    2017-04-03  0.480507  1.215048
    2017-04-04  1.700309  0.287588
    2017-04-05  0.526615 -0.417645
    2017-04-06  1.143858 -0.326720
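
One detail of the `loc` examples above is that label slices include both endpoints, for rows and for columns alike. A small sketch, assuming the same six-day index with illustrative integer values, and using `dates[0]` (a Timestamp) rather than a string for the single-label lookups:

```python
import numpy as np
import pandas as pd

# Illustrative data: integers instead of the article's random values.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

cell = df.loc[dates[0], 'A']                            # a single scalar value
row = df.loc[dates[0]]                                  # Series indexed by column name
block = df.loc['2017-04-01':'2017-04-04', ['A', 'B']]   # the slice includes 04-04
picked = df.loc[[dates[0], dates[5]], 'B':'D']          # row label list, column label slice

print(block.shape, list(picked.columns))
```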

iloc: selecting data by integer position

df.iloc[2]
Out:
    A    0.480507
    B    1.215048
    C    1.313314
    D   -0.072320
df.iloc[1:3]
Out:
                       A         B         C         D
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
df.iloc[1,1]

df.iloc[1:3,1]

df.iloc[1:3,1:2]

df.iloc[[1,3],[2,3]]
Out:
                       C         D
    2017-04-02 -0.139632 -0.735926
    2017-04-04 -0.012103  0.525291

df.iloc[[1,3],:]

df.iloc[:,[2,3]]
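
Several of the `iloc` calls above are shown without output. The sketch below fills in what they return, again assuming illustrative integer data: positional slices exclude the end position (unlike `loc`), an integer column position gives a Series, and a one-element slice keeps the result a DataFrame.

```python
import numpy as np
import pandas as pd

# Illustrative data: integers instead of the article's random values.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

cell = df.iloc[1, 1]            # scalar at row 1, column 1
as_series = df.iloc[1:3, 1]     # rows 1 and 2 of column 1, as a Series
as_frame = df.iloc[1:3, 1:2]    # same cells, kept as a 2x1 DataFrame

# iloc[1:3] stops before position 3; the equivalent label slice names
# its last row explicitly.
by_label = df.loc['2017-04-02':'2017-04-03']

print(cell, as_series.shape, as_frame.shape)
```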

iat: getting the value of a single cell

df.iat[1,2]
Out:
    -0.13963224781812655
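
`iat` addresses a cell by position; pandas also provides `at`, which addresses it by label. A brief sketch with illustrative integer data, showing that both reach the same cell:

```python
import numpy as np
import pandas as pd

# Illustrative data: integers instead of the article's random values.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_position = df.iat[1, 2]         # row position 1, column position 2
by_label = df.at[dates[1], 'C']    # the same cell, addressed by labels

print(by_position, by_label)
```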

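Filtering

Filtering with []

Put a boolean expression inside []; every row for which it evaluates to True is selected. When the mask is a whole DataFrame, as in df[df>1] below, the shape is kept and elements that fail the test become NaN.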

df[df.A>1]
Out:
                       A         B         C         D
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-04  1.700309  0.287588 -0.012103  0.525291
    2017-04-06  1.143858 -0.326720  1.425379  0.531037
df[df>1]
Out:
                       A         B         C   D
    2017-04-01       NaN       NaN       NaN NaN
    2017-04-02  2.104572       NaN       NaN NaN
    2017-04-03       NaN  1.215048  1.313314 NaN
    2017-04-04  1.700309       NaN       NaN NaN
    2017-04-05       NaN       NaN       NaN NaN
    2017-04-06  1.143858       NaN  1.425379 NaN

df[df.A+df.B>1.5]
Out:
                       A         B         C         D      
    2017-04-03  0.480507  1.215048  1.313314 -0.072320  
    2017-04-04  1.700309  0.287588 -0.012103  0.525291  
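
Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`. The sketch below rebuilds the article's table from the printed values (rounded to six decimals) so the results can be checked against it.

```python
import pandas as pd

# The article's sample values, copied from the printed table.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    {
        'A': [0.522241, 2.104572, 0.480507, 1.700309, 0.526615, 1.143858],
        'B': [0.495106, -0.977768, 1.215048, 0.287588, -0.417645, -0.326720],
        'C': [-0.268194, -0.139632, 1.313314, -0.012103, 0.405853, 1.425379],
        'D': [-0.035003, -0.735926, -0.072320, 0.525291, -0.835213, 0.531037],
    },
    index=dates,
)

both = df[(df['A'] > 1) & (df['B'] < 0)]      # A above 1 AND B negative
either = df[(df['A'] > 1) | (df['C'] > 1)]    # A above 1 OR C above 1
neither = df[~(df['A'] > 1)]                  # rows where A is NOT above 1

print(both)
```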

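Here is a more involved example. It selects the rows whose index lies between '2017-04-01' and '2017-04-04', and keeps only the columns whose sum is greater than 1 (df.sum() adds up each column, not each row):

df.loc['2017-04-01':'2017-04-04', df.sum() > 1]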
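
The difference between filtering on column sums and on row sums is easy to miss, so here is a sketch of both, using the article's printed values. `df.sum()` totals each column over all six rows; `sum(axis=1)` totals each row. The `window` name is introduced here only for illustration.

```python
import pandas as pd

# The article's sample values, copied from the printed table.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    {
        'A': [0.522241, 2.104572, 0.480507, 1.700309, 0.526615, 1.143858],
        'B': [0.495106, -0.977768, 1.215048, 0.287588, -0.417645, -0.326720],
        'C': [-0.268194, -0.139632, 1.313314, -0.012103, 0.405853, 1.425379],
        'D': [-0.035003, -0.735926, -0.072320, 0.525291, -0.835213, 0.531037],
    },
    index=dates,
)

# Columns whose total over ALL six rows exceeds 1, restricted to four rows.
by_col = df.loc['2017-04-01':'2017-04-04', df.sum() > 1]

# Rows within the same date window whose own total exceeds 1.
window = df.loc['2017-04-01':'2017-04-04']
by_row = window[window.sum(axis=1) > 1]

print(list(by_col.columns))
print(by_row)
```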

You can also combine [] with the apply method to build more complex filters, using any function that returns a boolean as the filter condition:

df[df.apply(lambda x: x['B'] > x['C'], axis=1)]
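
For a simple comparison like this one, `apply` is not strictly needed: comparing the two columns directly yields the same mask without calling a Python function per row. The sketch below checks that the two forms agree, using only the B and C columns from the article's printed values; `apply` remains the more general tool when the condition cannot be written as column arithmetic.

```python
import pandas as pd

# Columns B and C from the article's printed table.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    {
        'B': [0.495106, -0.977768, 1.215048, 0.287588, -0.417645, -0.326720],
        'C': [-0.268194, -0.139632, 1.313314, -0.012103, 0.405853, 1.425379],
    },
    index=dates,
)

# Row-wise: the lambda is called once per row.
with_apply = df[df.apply(lambda x: x['B'] > x['C'], axis=1)]

# Vectorized: the comparison runs on whole columns at once.
vectorized = df[df['B'] > df['C']]

print(vectorized)
```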

Using isin

df['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
                       A         B         C         D      E
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003    one
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926    one
    2017-04-03  0.480507  1.215048  1.313314 -0.072320    two
    2017-04-04  1.700309  0.287588 -0.012103  0.525291  three
    2017-04-05  0.526615 -0.417645  0.405853 -0.835213   four
    2017-04-06  1.143858 -0.326720  1.425379  0.531037  three

df[df.E.isin(['one'])]
    Out:
                       A         B         C         D    E
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003  one
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926  one
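
`isin` accepts several values at once, and its mask can be inverted with `~` to exclude values instead. A minimal sketch, assuming a small frame that keeps the article's E column and replaces the numeric columns with an illustrative column A:

```python
import pandas as pd

# Column E as in the article; column A holds illustrative values.
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6],
    'E': ['one', 'one', 'two', 'three', 'four', 'three'],
})

one_or_three = df[df['E'].isin(['one', 'three'])]   # keep rows matching any listed value
not_one = df[~df['E'].isin(['one'])]                # keep rows matching none of them

print(one_or_three)
```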
