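Pandas Cookbook: Basic DataFrame Operations
Basic Pandas Operations
This is based on the translation by SeanCheney on Jianshu. I adjusted the formatting and reorganized the table of contents so that it is easier for me to read and to look things up later.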
import pandas as pd
import numpy as np
Set the maximum number of displayed columns and rows
pd.set_option('display.max_columns', 5, 'display.max_rows', 5)
Selecting multiple DataFrame columns
Select multiple columns with a list
movie = pd.read_csv('data/movie.csv')
cols = ['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
movie_actor_director = movie[cols]
movie_actor_director
| | actor_1_name | actor_2_name | actor_3_name | director_name |
---|---|---|---|---|
0 | CCH Pounder | Joel David Moore | Wes Studi | James Cameron |
1 | Johnny Depp | Orlando Bloom | Jack Davenport | Gore Verbinski |
... | ... | ... | ... | ... |
4914 | Alan Ruck | Daniel Henney | Eliza Coupe | Daniel Hsia |
4915 | John August | Brian Herzlinger | Jon Gunn | Jon Gunn |
4916 rows × 4 columns
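As a minimal sketch of the difference between indexing with a list and indexing with a single name, the snippet below builds a small stand-in frame. The name `movie_demo` and the `duration` values are assumptions for illustration, not part of the original dataset code.

```python
import pandas as pd

# Toy stand-in for the movie data; the duration values are illustrative only.
movie_demo = pd.DataFrame({
    "actor_1_name": ["CCH Pounder", "Johnny Depp"],
    "actor_2_name": ["Joel David Moore", "Orlando Bloom"],
    "director_name": ["James Cameron", "Gore Verbinski"],
    "duration": [178.0, 169.0],
})

cols = ["actor_1_name", "director_name"]
subset = movie_demo[cols]              # a list of names returns a DataFrame
single = movie_demo["director_name"]   # a single name returns a Series

print(type(subset).__name__, subset.shape)
print(type(single).__name__, single.shape)
```

Passing a one-element list such as `movie_demo[["director_name"]]` also returns a DataFrame, which is handy when later code expects two dimensions.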
Selecting columns by type with select_dtypes
select_dtypes(include=None, exclude=None)
- To select all numeric types, use np.number or 'number'
- To select strings you must use the object dtype, but note that this returns all object dtype columns. See the NumPy dtype hierarchy
- To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
- To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
- To select pandas categorical dtypes, use 'category'
movie.shape
(4916, 28)
Select the integer columns
movie.select_dtypes(include=['int']).head()
| | num_voted_users | cast_total_facebook_likes | movie_facebook_likes |
---|---|---|---|
0 | 886204 | 4834 | 33000 |
1 | 471220 | 48350 | 0 |
2 | 275868 | 11700 | 85000 |
3 | 1144337 | 106759 | 164000 |
4 | 8 | 143 | 0 |
Select the non-integer columns
movie.select_dtypes(exclude=['int']).head()
| | color | director_name | ... | imdb_score | aspect_ratio |
---|---|---|---|---|---|
0 | Color | James Cameron | ... | 7.9 | 1.78 |
1 | Color | Gore Verbinski | ... | 7.1 | 2.35 |
2 | Color | Sam Mendes | ... | 6.8 | 2.35 |
3 | Color | Christopher Nolan | ... | 8.5 | 2.35 |
4 | NaN | Doug Walker | ... | 7.1 | NaN |
5 rows × 25 columns
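The include and exclude selections can be tried on a small frame. This is a sketch with made-up column names (`title`, `votes`, `score`). It spells out 'int64' rather than 'int' on the assumption that the generic alias may map to a different integer width on some platforms.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: one text, one integer, one float.
df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "votes": np.array([10, 20, 30], dtype="int64"),
    "score": [7.9, 7.1, np.nan],
})

numeric = df.select_dtypes(include="number")      # integers and floats
integers = df.select_dtypes(include=["int64"])    # integers only
non_numeric = df.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))
print(list(integers.columns))
print(list(non_numeric.columns))
```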
Selecting multiple columns with the filter function
filter(items=None, like=None, regex=None, axis=None)
- items : list-like
  - List of labels on the info axis to keep (not all of them need to be present)
  - Pass a list of column names or row labels
- like : string
  - Keep info axis where "arg in col == True"
  - Similar to Python's str.find(): a label is kept when col.find(arg) finds a match
- regex : string (regular expression)
  - Keep info axis with re.search(regex, col) == True
Select columns whose names contain a substring with filter(like=...)
movie.filter(like='facebook').head()
| | director_facebook_likes | actor_3_facebook_likes | ... | actor_2_facebook_likes | movie_facebook_likes |
---|---|---|---|---|---|
0 | 0.0 | 855.0 | ... | 936.0 | 33000 |
1 | 563.0 | 1000.0 | ... | 5000.0 | 0 |
2 | 0.0 | 161.0 | ... | 393.0 | 85000 |
3 | 22000.0 | 23000.0 | ... | 23000.0 | 164000 |
4 | 131.0 | NaN | ... | 12.0 | 0 |
5 rows × 6 columns
Select multiple columns with a regular expression
movie.filter(regex=r'\d').head()
| | actor_3_facebook_likes | actor_2_name | ... | actor_3_name | actor_2_facebook_likes |
---|---|---|---|---|---|
0 | 855.0 | Joel David Moore | ... | Wes Studi | 936.0 |
1 | 1000.0 | Orlando Bloom | ... | Jack Davenport | 5000.0 |
2 | 161.0 | Rory Kinnear | ... | Stephanie Sigman | 393.0 |
3 | 23000.0 | Christian Bale | ... | Joseph Gordon-Levitt | 23000.0 |
4 | NaN | Rob Walker | ... | NaN | 12.0 |
5 rows × 6 columns
Pass a list to the items parameter of filter() to select multiple columns
movie.filter(items=['actor_1_name', 'actor_3_name']).head()
| | actor_1_name | actor_3_name |
---|---|---|
0 | CCH Pounder | Wes Studi |
1 | Johnny Depp | Jack Davenport |
2 | Christoph Waltz | Stephanie Sigman |
3 | Tom Hardy | Joseph Gordon-Levitt |
4 | Doug Walker | NaN |
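The three filter modes can be compared side by side. The sketch below uses a one-row frame whose column names mimic the movie data. The label `no_such_column` is a deliberate, hypothetical miss to show that `items` ignores labels that are not present.

```python
import pandas as pd

# One-row frame; only the column names matter for filter().
df = pd.DataFrame(
    [[1, 2.0, "x", "y", 3]],
    columns=["movie_facebook_likes", "actor_1_facebook_likes",
             "actor_1_name", "director_name", "duration"],
)

by_like = df.filter(like="facebook")   # substring match on the column name
by_regex = df.filter(regex=r"\d")      # column names that contain a digit
by_items = df.filter(items=["actor_1_name", "director_name", "no_such_column"])

print(list(by_like.columns))
print(list(by_regex.columns))
print(list(by_items.columns))
```

Unlike `df[[...]]`, which raises a KeyError for an unknown label, `filter(items=...)` silently drops it, so it suits cases where some columns are optional.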
Operating on a DataFrame
Basic methods
Shape, total number of elements, and number of dimensions
movie.shape, movie.size, movie.ndim
((4916, 28), 137648, 2)
Number of non-missing values in each column
movie.count()
color 4897
director_name 4814
...
aspect_ratio 4590
movie_facebook_likes 4916
Length: 28, dtype: int64
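To see how these attributes relate to each other, here is a small sketch on a made-up three-row frame (the column names and values are assumptions): size is rows times columns, and count reports non-missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each of the first two columns.
df = pd.DataFrame({
    "color": ["Color", None, "Color"],
    "duration": [178.0, 169.0, np.nan],
    "likes": [1, 2, 3],
})

n_rows, n_cols = df.shape
print(df.shape, df.size, df.ndim)

non_missing = df.count()   # one entry per column
print(non_missing)
```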
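Statistics
movie.shape
(4916, 28)
Maximum and minimum values
Numeric columns
# min max quantile
# numeric_only=True restricts the result to numeric columns; recent pandas versions no longer drop string columns silently
movie_min = movie.min(numeric_only=True)
movie_min.name = 'minimum'
movie_min
num_critic_for_reviews 1.00
duration 7.00
...
aspect_ratio 1.18
movie_facebook_likes 0.00
Name: minimum, Length: 16, dtype: float64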
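Missing values are skipped by default. Setting skipna=False includes them, but the result is rarely useful because every column that contains a missing value returns NaN
movie.min(skipna=False, numeric_only=True)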
num_critic_for_reviews NaN
duration NaN
...
aspect_ratio NaN
movie_facebook_likes 0.0
Length: 16, dtype: float64
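The effect of skipna is easy to check on a tiny frame. In this sketch the column names and values are made up. Only `duration` contains a missing value, so only that column turns into NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; "duration" has one missing value.
df = pd.DataFrame({
    "duration": [7.0, np.nan, 178.0],
    "likes": [0.0, 5.0, 10.0],
})

default_min = df.min()             # NaN is skipped
strict_min = df.min(skipna=False)  # NaN propagates

print(default_min)
print(strict_min)
```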
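String columns
When a string column contains missing values, the aggregation methods min, max and sum return nothing for it in the pandas version used here (newer versions may raise a TypeError instead).
movie[['color', 'movie_title', 'color']].max()
Series([], dtype: float64)
To force pandas to return a value for each column, fill in the missing values first. Here they are filled with the empty string
movie[['color', 'movie_title', 'color']].fillna('').max()
color Color
movie_title Æon Flux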
color Color
dtype: object
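A minimal sketch of the fill-then-aggregate pattern, using a made-up two-column frame (the values are placeholders). After fillna(''), every cell is a string, so max compares them lexicographically.

```python
import pandas as pd

# Hypothetical text columns, each with one missing value.
df = pd.DataFrame({
    "color": ["Color", None, "Black and White"],
    "movie_title": ["Avatar", "Spectre", None],
})

filled_max = df.fillna("").max()
print(filled_max)
```

The empty string sorts before any other string, so it never wins max. For min it always wins, which is worth keeping in mind before reusing the same trick there.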
Summary statistics
Numeric columns
Use the percentiles parameter to specify quantiles
movie.describe(percentiles=[.01, .3, .99])
| | num_critic_for_reviews | duration | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
count | 4867.000000 | 4901.000000 | ... | 4590.000000 | 4916.000000 |
mean | 137.988905 | 107.090798 | ... | 2.222349 | 7348.294142 |
... | ... | ... | ... | ... | ... |
99% | 546.680000 | 189.000000 | ... | 4.000000 | 93850.000000 |
max | 813.000000 | 511.000000 | ... | 16.000000 | 349000.000000 |
9 rows × 16 columns
String columns
movie.select_dtypes(include='object').describe()
| | color | director_name | ... | country | content_rating |
---|---|---|---|---|---|
count | 4897 | 4814 | ... | 4911 | 4616 |
unique | 2 | 2397 | ... | 65 | 18 |
top | Color | Steven Spielberg | ... | USA | R |
freq | 4693 | 26 | ... | 3710 | 2067 |
4 rows × 12 columns
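Both flavours of describe can be reproduced on a small frame. This sketch uses made-up `duration` and `rating` columns and selects the text column with exclude='number', an assumption made so the example does not depend on how a given pandas version stores strings.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric and one text column, each with a missing value.
df = pd.DataFrame({
    "duration": [90.0, 100.0, 110.0, 120.0, np.nan],
    "rating": ["R", "PG", "R", None, "R"],
})

# Numeric summary with custom percentiles.
numeric_summary = df.describe(percentiles=[.01, .3, .99])
print(numeric_summary)

# Text summary: count, unique, top, freq.
text_summary = df.select_dtypes(exclude="number").describe()
print(text_summary)
```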
Chaining methods
Use the isnull method to convert every value to a boolean
movie.isnull().head()
| | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
0 | False | False | ... | False | False |
1 | False | False | ... | False | False |
2 | False | False | ... | False | False |
3 | False | False | ... | False | False |
4 | True | False | ... | True | False |
5 rows × 28 columns
sum counts the True values in each column and returns a Series
movie.isnull().sum().head()
color 19
director_name 102
num_critic_for_reviews 49
duration 15
director_facebook_likes 102
dtype: int64
Calling sum again on this Series returns the number of missing values in the whole DataFrame as a scalar
movie.isnull().sum().sum()
2654
To check whether the DataFrame contains any missing values at all, chain two calls to any
movie.isnull().any().any()
True
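The chain above can be followed step by step on a tiny frame. In this sketch the column names and values are made up. Each call reduces one dimension: DataFrame to Series, then Series to scalar.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three missing values in total.
df = pd.DataFrame({
    "color": ["Color", None],
    "duration": [np.nan, np.nan],
    "likes": [1, 2],
})

per_column = df.isnull().sum()      # Series: missing values per column
total = df.isnull().sum().sum()     # scalar: missing values overall
has_any = df.isnull().any().any()   # scalar: is anything missing?

print(per_column)
print(total, has_any)
```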
Operators
Use the institution name (INSTNM) as the row index, then keep only the UGDS_ columns, which hold the proportion of undergraduates by race
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')
college_ugds_
INSTNM | UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN |
---|---|---|---|---|---|
Alabama A & M University | 0.0333 | 0.9353 | ... | 0.0059 | 0.0138 |
University of Alabama at Birmingham | 0.5922 | 0.2600 | ... | 0.0179 | 0.0100 |
... | ... | ... | ... | ... | ... |
Bay Area Medical Academy - San Jose Satellite Location | NaN | NaN | ... | NaN | NaN |
Excel Learning Center-San Antonio South | NaN | NaN | ... | NaN | NaN |
7535 rows × 9 columns
Every column of college_ugds_ has dtype float64, so arithmetic operators can be applied to the whole DataFrame
college_ugds_.dtypes
UGDS_WHITE float64
UGDS_BLACK float64
...
UGDS_NRA float64
UGDS_UNKN float64
Length: 9, dtype: object
Addition, subtraction, multiplication and division
college_ugds_.head() + .00501
INSTNM | UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN |
---|---|---|---|---|---|
Alabama A & M University | 0.03831 | 0.94031 | ... | 0.01091 | 0.01881 |
University of Alabama at Birmingham | 0.59721 | 0.26501 | ... | 0.02291 | 0.01501 |
Amridge University | 0.30401 | 0.42421 | ... | 0.00501 | 0.27651 |
University of Alabama in Huntsville | 0.70381 | 0.13051 | ... | 0.03821 | 0.04001 |
Alabama State University | 0.02081 | 0.92581 | ... | 0.02931 | 0.01871 |
5 rows × 9 columns
Rounding the sample data to whole percentages
Method 1
college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
college_ugds_op_round.head()
INSTNM | UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN |
---|---|---|---|---|---|
Alabama A & M University | 0.03 | 0.94 | ... | 0.01 | 0.01 |
University of Alabama at Birmingham | 0.59 | 0.26 | ... | 0.02 | 0.01 |
Amridge University | 0.30 | 0.42 | ... | 0.00 | 0.27 |
University of Alabama in Huntsville | 0.70 | 0.13 | ... | 0.03 | 0.04 |
Alabama State University | 0.02 | 0.92 | ... | 0.02 | 0.01 |
5 rows × 9 columns
Method 2
college_ugds_round = (college_ugds_ + .00001).round(2)
college_ugds_round.head()
INSTNM | UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN |
---|---|---|---|---|---|
Alabama A & M University | 0.03 | 0.94 | ... | 0.01 | 0.01 |
University of Alabama at Birmingham | 0.59 | 0.26 | ... | 0.02 | 0.01 |
Amridge University | 0.30 | 0.42 | ... | 0.00 | 0.27 |
University of Alabama in Huntsville | 0.70 | 0.13 | ... | 0.03 | 0.04 |
Alabama State University | 0.02 | 0.92 | ... | 0.02 | 0.01 |
5 rows × 9 columns
Method 3
college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
college_ugds_op_round_methods.head()
INSTNM | UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN |
---|---|---|---|---|---|
Alabama A & M University | 0.03 | 0.94 | ... | 0.01 | 0.01 |
University of Alabama at Birmingham | 0.59 | 0.26 | ... | 0.02 | 0.01 |
Amridge University | 0.30 | 0.42 | ... | 0.00 | 0.27 |
University of Alabama in Huntsville | 0.70 | 0.13 | ... | 0.03 | 0.04 |
Alabama State University | 0.02 | 0.92 | ... | 0.02 | 0.01 |
5 rows × 9 columns
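The three methods can be checked against each other on the first two rows shown above. This is a sketch: the frame `ugds` is a hand-built stand-in for college_ugds_ with only two of its columns, and the variable names are assumptions.

```python
import pandas as pd

# Two rows and two columns copied from the output above.
ugds = pd.DataFrame(
    {"UGDS_WHITE": [0.0333, 0.5922], "UGDS_BLACK": [0.9353, 0.2600]},
    index=["Alabama A & M University", "University of Alabama at Birmingham"],
)

via_operators = (ugds + .00501) // .01 / 100             # Method 1
via_round = (ugds + .00001).round(2)                     # Method 2
via_methods = ugds.add(.00501).floordiv(.01).div(100)    # Method 3

print(via_operators)
print(via_operators.equals(via_methods))
```

Methods 1 and 3 perform the same floating-point operations, so they match exactly. Method 2 takes a different route, so comparing it with a small tolerance is the safer check.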
Comparing missing values
Pandas uses the NumPy NaN object (np.nan) to represent missing values. It is a special object that is not equal to itself:
np.nan == np.nan
False
Every comparison with np.nan returns False, except not-equal:
5 > np.nan
False
5 != np.nan
True
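Since equality cannot detect NaN, a dedicated test is needed. The sketch below contrasts the comparison operators with math.isnan and pd.isna. The variable name `value` is just for illustration.

```python
import math

import numpy as np
import pandas as pd

value = np.nan

print(value == value)      # False: NaN is not equal to itself
print(value != value)      # True
print(5 > value)           # False
print(math.isnan(value))   # True: explicit NaN test for floats
print(pd.isna(value))      # True: also works on Series and DataFrames
```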
Because of this, == cannot tell you whether two DataFrames that contain missing values are identical
movie_equal = movie == movie
movie_equal.all().all()
False
movie_equal.size - movie_equal.sum().sum()
2654
movie.isnull().sum().sum()
2654
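The same bookkeeping can be reproduced on a small frame. This sketch uses made-up float columns and shows that the number of False cells in df == df equals the number of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical float data with three missing values.
df = pd.DataFrame({
    "duration": [178.0, np.nan, 169.0],
    "aspect_ratio": [1.78, np.nan, np.nan],
})

elementwise = df == df                      # NaN cells compare as False
all_equal = elementwise.all().all()
mismatches = elementwise.size - elementwise.sum().sum()
missing = df.isnull().sum().sum()

print(all_equal, mismatches, missing)
```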
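The most direct way to compare two DataFrames is the equals() method, which treats missing values in the same position as equal. In tests, assert_frame_equal does the same check and raises an AssertionError on a mismatch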
from pandas.testing import assert_frame_equal
assert_frame_equal(movie, movie)
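A short sketch of both checks on made-up frames (`left`, `right`, and `changed` are hypothetical names): equals returns a boolean, while assert_frame_equal stays silent on a match and raises on a difference.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"duration": [178.0, np.nan]})
right = left.copy()

same = left.equals(right)          # NaN in the same position counts as equal
assert_frame_equal(left, right)    # passes silently

changed = right.copy()
changed.loc[0, "duration"] = 169.0

try:
    assert_frame_equal(left, changed)
    raised = False
except AssertionError:
    raised = True

print(same, left.equals(changed), raised)
```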