
Pandas Cookbook, Chapter 5: Boolean Indexing


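Boolean Indexing

This is based on the translation by SeanCheney on Jianshu. I adjusted the formatting and reorganized the table of contents to suit my own reading and to make things easier to look up later.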

import pandas as pd
import numpy as np

Set the maximum number of columns and rows to display

pd.set_option('display.max_columns', 5, 'display.max_rows', 5)

1 Boolean statistics

movie = pd.read_csv('data/movie.csv', index_col='movie_title')
movie.head()
                                            color      director_name  ...  aspect_ratio  movie_facebook_likes
movie_title
Avatar                                      Color      James Cameron  ...          1.78                 33000
Pirates of the Caribbean: At World's End    Color     Gore Verbinski  ...          2.35                     0
Spectre                                     Color         Sam Mendes  ...          2.35                 85000
The Dark Knight Rises                       Color  Christopher Nolan  ...          2.35                164000
Star Wars: Episode VII - The Force Awakens    NaN        Doug Walker  ...           NaN                     0

5 rows × 27 columns

1.1 Basic approach

Check whether each movie runs longer than two hours

movie_2_hours = movie['duration'] > 120
movie_2_hours.head(10)
movie_title
Avatar                                      True
Pirates of the Caribbean: At World's End    True
                                            ... 
Avengers: Age of Ultron                     True
Harry Potter and the Half-Blood Prince      True
Name: duration, Length: 10, dtype: bool

Count the movies that run longer than two hours

movie_2_hours.sum()
1039

Proportion of movies that run longer than two hours

movie_2_hours.mean()
0.2113506916192026

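In fact, the duration column has missing values. A comparison against NaN evaluates to False, so those rows are counted as "not longer than two hours". To get the true proportion, drop the missing values first

movie['duration'].dropna().gt(120).mean()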
0.21199755152009794
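
The gap between the two proportions can be reproduced on a tiny Series. This is a minimal sketch with made-up durations (the values are hypothetical and not taken from movie.csv); it only illustrates how a missing value changes the denominator.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes; one value is missing.
duration = pd.Series([178.0, 90.0, np.nan, 150.0, 100.0])

# NaN > 120 evaluates to False, so the missing row counts as "not over two hours".
over_two_hours = duration > 120

# Denominator includes the NaN row: 2 of 5 rows.
share_all_rows = over_two_hours.mean()

# Denominator excludes the NaN row: 2 of 4 rows.
share_known = duration.dropna().gt(120).mean()

print(share_all_rows, share_known)
```

Which of the two numbers is appropriate depends on whether rows with unknown duration should count toward the total.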

1.2 Summary statistics

Use describe() to print some information about this boolean Series

movie_2_hours.describe()
count      4916
unique        2
top       False
freq       3877
Name: duration, dtype: object

Compute the proportions of False and True values

movie_2_hours.value_counts(normalize=True)
False    0.788649
True     0.211351
Name: duration, dtype: float64

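2 Boolean indexing

2.1 Boolean conditions

pandas combines boolean Series with the bitwise operators (&, |, ~). In Python these have higher precedence than the comparison operators, so each comparison must be wrapped in parentheses before it is combined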
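
A small sketch of why the parentheses matter. The DataFrame below is hypothetical and uses integer years so the example stays self-contained; with other dtypes the unparenthesized form may fail with a different exception.

```python
import pandas as pd

# Hypothetical scores and years, for illustration only.
df = pd.DataFrame({"imdb_score": [8.5, 7.0, 9.1],
                   "title_year": [1999, 2005, 2012]})

# Parenthesized: each comparison is evaluated first, then combined with |.
ok = (df["title_year"] < 2000) | (df["title_year"] >= 2010)

# Without parentheses Python parses the expression as
#   df["title_year"] < (2000 | df["title_year"]) >= 2010
# which is a chained comparison and needs the truth value of a Series.
try:
    bad = df["title_year"] < 2000 | df["title_year"] >= 2010
    failed = False
except ValueError:
    failed = True

print(ok.tolist(), failed)
```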

2.1.1 Creating several boolean conditions

criteria1 = movie.imdb_score > 8
criteria2 = movie.content_rating == 'PG-13'
criteria3 = (movie.title_year < 2000) | (movie.title_year >= 2010)
criteria3.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
Name: title_year, dtype: bool

2.1.2 Combining the boolean conditions into one

criteria_final = criteria1 & criteria2 & criteria3
criteria_final.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

2.2 Boolean filtering

Create the first set of boolean conditions

crit_a1 = movie.imdb_score > 8
crit_a2 = movie.content_rating == 'PG-13'
crit_a3 = (movie.title_year < 2000) | (movie.title_year > 2009)
final_crit_a = crit_a1 & crit_a2 & crit_a3

Create the second set of boolean conditions

crit_b1 = movie.imdb_score < 5
crit_b2 = movie.content_rating == 'R'
crit_b3 = (movie.title_year >= 2000) & (movie.title_year <= 2010)
final_crit_b = crit_b1 & crit_b2 & crit_b3

Combine the two sets of conditions

final_crit_all = final_crit_a | final_crit_b
final_crit_all.head()
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

Filter the data

movie[final_crit_all].head()
color director_name ... aspect_ratio movie_facebook_likes
movie_title
The Dark Knight Rises Color Christopher Nolan ... 2.35 164000
The Avengers Color Joss Whedon ... 1.85 123000
Captain America: Civil War Color Anthony Russo ... 2.35 72000
Guardians of the Galaxy Color James Gunn ... 2.35 96000
Interstellar Color Christopher Nolan ... 2.35 349000

5 rows × 27 columns

Verify the filter by looking only at the columns used in the conditions

cols = ['imdb_score', 'content_rating', 'title_year']
movie_filtered = movie.loc[final_crit_all, cols]
movie_filtered.head(10)
imdb_score content_rating title_year
movie_title
The Dark Knight Rises 8.5 PG-13 2012.0
The Avengers 8.1 PG-13 2012.0
... ... ... ...
Sex and the City 2 4.3 R 2010.0
Rollerball 3.0 R 2002.0

10 rows × 3 columns

2.3 Comparison with label-based indexing

college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')

2.3.1 Single label

In college2, STABBR is the row index, so rows can be selected with loc

college2.loc['TX'].head()
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
STABBR
TX Abilene Christian University Abilene ... 40200 25985
TX Alvin Community College Alvin ... 34500 6750
TX Amarillo College Amarillo ... 31700 10950
TX Angelina College Lufkin ... 26900 PrivacySuppressed
TX Angelo State University San Angelo ... 37700 21319.5

5 rows × 26 columns

In college, use boolean indexing to select all schools in Texas

college[college['STABBR'] == 'TX'].head()
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
3610 Abilene Christian University Abilene ... 40200 25985
3611 Alvin Community College Alvin ... 34500 6750
3612 Amarillo College Amarillo ... 31700 10950
3613 Angelina College Lufkin ... 26900 PrivacySuppressed
3614 Angelo State University San Angelo ... 37700 21319.5

5 rows × 27 columns

Compare the speed of the two approaches

Method 1

%timeit college[college['STABBR'] == 'TX']
937 μs ± 58.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 2

%timeit college2.loc['TX']
520 μs ± 21.2 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Setting the index has a cost of its own, which should be counted when comparing the two

%timeit college2 = college.set_index('STABBR')
2.11 ms ± 185 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

2.3.2 Multiple labels

Select several states with boolean indexing and with label-based selection

states = ['TX', 'CA', 'NY']
college[college['STABBR'].isin(states)]
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
192 Academy of Art University San Francisco ... 36000 35093
193 ITT Technical Institute-Rancho Cordova Rancho Cordova ... 38800 25827.5
... ... ... ... ... ...
7533 Bay Area Medical Academy - San Jose Satellite ... San Jose ... NaN PrivacySuppressed
7534 Excel Learning Center-San Antonio South San Antonio ... NaN 12125

1704 rows × 27 columns

college2.loc[states].head()
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
STABBR
TX Abilene Christian University Abilene ... 40200 25985
TX Alvin Community College Alvin ... 34500 6750
TX Amarillo College Amarillo ... 31700 10950
TX Angelina College Lufkin ... 26900 PrivacySuppressed
TX Angelo State University San Angelo ... 37700 21319.5

5 rows × 26 columns
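
The two selections return the same rows but not in the same order: isin keeps the original row order, while loc with a list follows the order of the labels. A minimal sketch on a hypothetical five-row frame (the institution names are placeholders):

```python
import pandas as pd

# Hypothetical miniature version of the college data.
college = pd.DataFrame({
    "INSTNM": ["A", "B", "C", "D", "E"],
    "STABBR": ["CA", "TX", "FL", "NY", "TX"],
})
college2 = college.set_index("STABBR")
states = ["TX", "CA", "NY"]

# Boolean indexing: rows stay in their original order.
by_mask = college[college["STABBR"].isin(states)]

# Label selection: rows are grouped in the order given by `states`.
by_label = college2.loc[states]

print(by_mask["INSTNM"].tolist())
print(by_label["INSTNM"].tolist())
```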

3 The query method

The query method can make boolean indexing more readable

# Read the employee data and choose the departments and columns to select
employee = pd.read_csv('data/employee.csv')
depts = ['Houston Police Department-HPD', 'Houston Fire Department (HFD)']
select_columns = ['UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY']
# Build the query string and run the query method
qs = "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"
emp_filtered = employee.query(qs)
emp_filtered[select_columns].head()
UNIQUE_ID DEPARTMENT GENDER BASE_SALARY
61 61 Houston Fire Department (HFD) Female 96668.0
136 136 Houston Police Department-HPD Female 81239.0
367 367 Houston Police Department-HPD Female 86534.0
474 474 Houston Police Department-HPD Female 91181.0
513 513 Houston Police Department-HPD Female 81239.0
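
For comparison, the same kind of filter can be written with plain boolean indexing. This is a sketch on hypothetical employee rows (shortened department names, invented salaries), assuming only that query, isin and between behave as documented.

```python
import pandas as pd

# Hypothetical employee records, for illustration only.
employee = pd.DataFrame({
    "DEPARTMENT": ["Police", "Fire", "Library", "Police"],
    "GENDER": ["Female", "Male", "Female", "Female"],
    "BASE_SALARY": [85000.0, 90000.0, 95000.0, 60000.0],
})
depts = ["Police", "Fire"]

# query: @depts refers to the Python variable of that name.
via_query = employee.query(
    "DEPARTMENT in @depts and GENDER == 'Female' "
    "and 80000 <= BASE_SALARY <= 120000"
)

# The same conditions as explicit boolean Series.
via_mask = employee[
    employee["DEPARTMENT"].isin(depts)
    & (employee["GENDER"] == "Female")
    & employee["BASE_SALARY"].between(80000, 120000)
]

print(via_query)
```

The query string reads closer to the intent, while the mask version makes every intermediate condition available as a separate Series.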

4 Unique and sorted indexes

4.1 Single-column index

college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')
college2.index.is_monotonic_increasing
False

Sort college2, store the result as a new object, and check whether its index is ordered

college3 = college2.sort_index()
college3.index.is_monotonic_increasing
True

Use INSTNM as the row index and check whether the index is unique

college_unique = college.set_index('INSTNM')
college_unique.index.is_unique
True
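
The two index properties are independent of each other. A short sketch on a hypothetical index of state codes shows an index that becomes ordered after sorting but still contains duplicates:

```python
import pandas as pd

# Hypothetical state codes with one duplicate.
idx_unsorted = pd.Index(["TX", "CA", "NY", "CA"])
idx_sorted = idx_unsorted.sort_values()

# Sorting makes the index monotonic (non-decreasing)...
print(idx_unsorted.is_monotonic_increasing, idx_sorted.is_monotonic_increasing)

# ...but it does not remove duplicates, so the index is still not unique.
print(idx_sorted.is_unique)
```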

4.2 Concatenated index

Build the row index from the CITY and STABBR columns, then sort it

college.index = college['CITY'] + ', ' + college['STABBR']
college = college.sort_index()
college.head()
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
ARTESIA, CA Angeles Institute ARTESIA ... NaN 16850
Aberdeen, SD Presentation College Aberdeen ... 35900 25000
Aberdeen, SD Northern State University Aberdeen ... 33600 24847
Aberdeen, WA Grays Harbor College Aberdeen ... 27000 11490
Abilene, TX Hardin-Simmons University Abilene ... 38700 25864

5 rows × 27 columns

college.index.is_unique
False

Select all colleges in Miami, FL

Method 1

college.loc['Miami, FL'].head()
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
Miami, FL New Professions Technical Institute Miami ... 18700 8682
Miami, FL Management Resources College Miami ... PrivacySuppressed 12182
Miami, FL Strayer University-Doral Miami ... 49200 36173.5
Miami, FL Keiser University- Miami Miami ... 29700 26063
Miami, FL George T Baker Aviation Technical College Miami ... 38600 PrivacySuppressed

5 rows × 27 columns

Method 2

crit1 = college['CITY'] == 'Miami'
crit2 = college['STABBR'] == 'FL'
college[crit1 & crit2]
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
Miami, FL New Professions Technical Institute Miami ... 18700 8682
Miami, FL Management Resources College Miami ... PrivacySuppressed 12182
... ... ... ... ... ...
Miami, FL Advanced Technical Centers Miami ... PrivacySuppressed PrivacySuppressed
Miami, FL Lindsey Hopkins Technical College Miami ... 29800 PrivacySuppressed

50 rows × 27 columns

5 Using booleans with loc/iloc

movie = pd.read_csv('data/movie.csv', index_col='movie_title')

5.1 Rows

c1 = movie['content_rating'] == 'G'
c2 = movie['imdb_score'] < 4
criteria = c1 & c2
bool_movie = movie[criteria]
bool_movie
color director_name ... aspect_ratio movie_facebook_likes
movie_title
The True Story of Puss'N Boots Color Jérôme Deschamps ... NaN 90
Doogal Color Dave Borthwick ... 1.85 346
... ... ... ... ... ...
Justin Bieber: Never Say Never Color Jon M. Chu ... 1.85 62000
Sunday School Musical Color Rachel Goldenberg ... 1.85 777

6 rows × 27 columns

Using a boolean with loc

Method 1

movie_loc = movie.loc[criteria]

Check that the DataFrame produced by loc is identical to the one produced by plain boolean indexing

movie_loc.equals(movie[criteria])
True

Method 2

movie_loc2 = movie.loc[criteria.values]
movie_loc2.equals(movie[criteria])
True

Using a boolean with iloc

Because criteria is a Series that carries a row index, iloc cannot use it directly; pass the underlying ndarray instead

movie_iloc = movie.iloc[criteria.values]
movie_iloc.equals(movie_loc)
True
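
The three row-selection forms can be compared on a tiny frame. This is a sketch with hypothetical ratings and scores; to_numpy() is used here in place of .values, which returns the same underlying array in this case.

```python
import pandas as pd

# Hypothetical three-movie frame, for illustration only.
movie = pd.DataFrame(
    {"content_rating": ["G", "PG", "G"], "imdb_score": [3.5, 6.0, 7.2]},
    index=["a", "b", "c"],
)
criteria = (movie["content_rating"] == "G") & (movie["imdb_score"] < 4)

by_brackets = movie[criteria]              # plain boolean indexing
by_loc = movie.loc[criteria]               # loc accepts the boolean Series
by_iloc = movie.iloc[criteria.to_numpy()]  # iloc needs a plain boolean array

print(by_iloc)
```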

5.2 Columns

Boolean indexing can also be used to select columns

criteria_col = movie.dtypes == np.int64
criteria_col.head()
color                      False
director_name              False
num_critic_for_reviews     False
duration                   False
director_facebook_likes    False
dtype: bool
movie.loc[:, criteria_col].head()
num_voted_users cast_total_facebook_likes movie_facebook_likes
movie_title
Avatar 886204 4834 33000
Pirates of the Caribbean: At World's End 471220 48350 0
Spectre 275868 11700 85000
The Dark Knight Rises 1144337 106759 164000
Star Wars: Episode VII - The Force Awakens 8 143 0
movie.iloc[:, criteria_col.values].head()
num_voted_users cast_total_facebook_likes movie_facebook_likes
movie_title
Avatar 886204 4834 33000
Pirates of the Caribbean: At World's End 471220 48350 0
Spectre 275868 11700 85000
The Dark Knight Rises 1144337 106759 164000
Star Wars: Episode VII - The Force Awakens 8 143 0
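
When the column mask is built from dtypes, select_dtypes is a possible shortcut. The sketch below uses a hypothetical frame with explicitly typed int64 columns so the result does not depend on the platform's default integer size.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the integer columns are forced to int64.
df = pd.DataFrame({
    "title": ["x", "y"],
    "votes": np.array([10, 20], dtype="int64"),
    "score": [7.5, 8.0],
    "likes": np.array([1, 2], dtype="int64"),
})

# Boolean mask over the columns, as in the text above.
mask = df.dtypes == np.int64
by_mask = df.loc[:, mask]

# The same selection expressed with select_dtypes.
by_dtype = df.select_dtypes(include="int64")

print(by_mask.columns.tolist())
```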

6 Using boolean values: where/mask

mask() is the inverse boolean operation of where.

DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None)
Parameters:

  • cond : boolean NDFrame, array-like, or callable

    • Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
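    • In other words, cond is a boolean DataFrame with the same shape as df. Where cond is True the original value is kept; otherwise the corresponding value from other is used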
  • other : scalar, NDFrame, or callable
  • inplace : boolean, default False
    • Whether to perform the operation in place on the data
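
A minimal sketch of how where and mask mirror each other, using a hypothetical three-element Series:

```python
import pandas as pd

# Hypothetical values, for illustration only.
s = pd.Series([1.0, 5.0, 10.0])
cond = s < 6

kept = s.where(cond)    # keeps values where cond is True, NaN elsewhere
hidden = s.mask(cond)   # replaces values where cond is True, keeps the rest

# where(cond) gives the same result as mask with the negated condition.
same = kept.equals(s.mask(~cond))

print(kept.tolist(), hidden.tolist(), same)
```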

6.1 Using where on a Series

movie = pd.read_csv('data/movie.csv', index_col='movie_title')
fb_likes = movie['actor_1_facebook_likes'].dropna()
fb_likes.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      40000.0
Spectre                                       11000.0
The Dark Knight Rises                         27000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

Use describe to get a feel for the data

fb_likes.describe(percentiles=[.1, .25, .5, .75, .9]).astype(int)
count      4909
mean       6494
          ...  
90%       18000
max      640000
Name: actor_1_facebook_likes, Length: 10, dtype: int64

Check the proportion of values with fewer than 20,000 likes

criteria_high = fb_likes < 20000
criteria_high.mean().round(2)
0.91

where returns a Series of the same size, but every value for which the condition is False is replaced with a missing value

fb_likes.where(criteria_high).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End          NaN
Spectre                                       11000.0
The Dark Knight Rises                             NaN
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

The second parameter, other, lets you control the replacement value

fb_likes.where(criteria_high, other=20000).head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

Chain two where calls to set an upper and a lower bound

criteria_low = fb_likes > 300
fb_likes_cap = fb_likes.where(criteria_high, other=20000).where(criteria_low, 300)
fb_likes_cap.head()
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      300.0
Name: actor_1_facebook_likes, dtype: float64

The original Series and the modified Series have the same length

len(fb_likes), len(fb_likes_cap)
(4909, 4909)
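
The chained where calls amount to clamping the values to a range, which pandas also offers as Series.clip. A sketch on the five values shown in the head() output above, assuming the same bounds of 300 and 20000:

```python
import pandas as pd

# The first five like counts from the output above.
fb_likes = pd.Series([1000.0, 40000.0, 11000.0, 27000.0, 131.0])

# Upper bound first, then lower bound, as in the chained where calls.
capped_where = (
    fb_likes.where(fb_likes < 20000, other=20000)
            .where(fb_likes > 300, other=300)
)

# clip applies both bounds in a single call.
capped_clip = fb_likes.clip(lower=300, upper=20000)

print(capped_clip.tolist())
```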

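6.2 Using where on a DataFrame

The output below comes from the original environment. On Python 3 with a recent pandas, comparing the string columns with an integer (df < 2) may raise a TypeError, in which case the comparison has to be limited to the numeric columns.

df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'], 'ids2': ['a', 'n', 'c', 'n']})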
print(df)
print(df < 2)
df.where(df<2,1000)
   vals ids ids2
0     1   a    a
1     2   b    n
2     3   f    c
3     4   n    n
    vals   ids  ids2
0   True  True  True
1  False  True  True
2  False  True  True
3  False  True  True
   vals ids ids2
0     1   a    a
1  1000   b    n
2  1000   f    c
3  1000   n    n

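The code below gives the same values as df.where(df < 2, 1000), except that the numeric column is upcast to float because of the intermediate NaN values.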

print(df[df < 2])
df[df < 2].fillna(1000)
   vals ids ids2
0   1.0   a    a
1   NaN   b    n
2   NaN   f    c
3   NaN   n    n
     vals ids ids2
0     1.0   a    a
1  1000.0   b    n
2  1000.0   f    c
3  1000.0   n    n
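
The relation between the two forms can be checked on a numeric-only frame, which avoids comparing strings with numbers. The frame below is hypothetical; the sketch compares values only, since the masked version passes through NaN and ends up as float.

```python
import pandas as pd

# Hypothetical numeric-only frame, for illustration only.
df_num = pd.DataFrame({"vals": [1, 2, 3, 4], "more": [0, 5, 1, 9]})

# One step: keep values below 2, replace everything else with 1000.
via_where = df_num.where(df_num < 2, 1000)

# Two steps: mask to NaN first, then fill the gaps.
via_fillna = df_num[df_num < 2].fillna(1000)

# The values agree even though the second result is stored as float.
values_match = bool((via_where == via_fillna).all().all())

print(via_where)
print(via_fillna.dtypes)
```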
