
Pandas Cookbook: Basic DataFrame Operations


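Pandas Basics

This is based on a translation by SeanCheney on Jianshu. I adjusted the formatting and reorganized the table of contents so that it suits my own reading and is easier to look things up in later.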

import pandas as pd
import numpy as np

Set the maximum number of columns and rows to display

pd.set_option('display.max_columns', 5, 'display.max_rows', 5)

Selecting multiple DataFrame columns

Select multiple columns with a list

movie = pd.read_csv('data/movie.csv')
cols = ['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
movie_actor_director = movie[cols]
movie_actor_director
actor_1_name actor_2_name actor_3_name director_name
0 CCH Pounder Joel David Moore Wes Studi James Cameron
1 Johnny Depp Orlando Bloom Jack Davenport Gore Verbinski
... ... ... ... ...
4914 Alan Ruck Daniel Henney Eliza Coupe Daniel Hsia
4915 John August Brian Herzlinger Jon Gunn Jon Gunn

4916 rows × 4 columns
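
As a small illustration of the step above, here is a minimal sketch on made-up data (the `toy` frame is hypothetical and only stands in for movie.csv). It shows that indexing with a list returns a DataFrame, even for a one-item list, while indexing with a single string returns a Series.

```python
import pandas as pd

# Hypothetical toy data standing in for data/movie.csv
toy = pd.DataFrame({
    "actor_1_name": ["A", "B"],
    "director_name": ["X", "Y"],
    "duration": [100, 120],
})

as_frame = toy[["actor_1_name", "director_name"]]  # list of labels -> DataFrame
single_col_frame = toy[["director_name"]]          # one-item list -> still a DataFrame
as_series = toy["director_name"]                   # single label -> Series

print(type(as_frame).__name__, type(single_col_frame).__name__, type(as_series).__name__)
```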

Select columns by data type with select_dtypes

select_dtypes(include=None, exclude=None)

  • To select all numeric types, use np.number or 'number'
  • To select strings you must use the object dtype, but note that this will return all object dtype columns. See the numpy dtype hierarchy
  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
  • To select Pandas categorical dtypes, use 'category'
movie.shape
(4916, 28)

Select integer columns

movie.select_dtypes(include=['int']).head()
num_voted_users cast_total_facebook_likes movie_facebook_likes
0 886204 4834 33000
1 471220 48350 0
2 275868 11700 85000
3 1144337 106759 164000
4 8 143 0

Select non-integer columns

movie.select_dtypes(exclude=['int']).head()
color director_name ... imdb_score aspect_ratio
0 Color James Cameron ... 7.9 1.78
1 Color Gore Verbinski ... 7.1 2.35
2 Color Sam Mendes ... 6.8 2.35
3 Color Christopher Nolan ... 8.5 2.35
4 NaN Doug Walker ... 7.1 NaN

5 rows × 25 columns
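
The include and exclude arguments can be tried out on a small frame. The following is only a sketch: the `toy` data and its column names are invented for illustration, and the integer column is created explicitly as int64 so the result does not depend on the platform.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "votes" is forced to int64 on purpose
toy = pd.DataFrame({
    "title": ["a", "b"],
    "votes": np.array([10, 20], dtype="int64"),
    "score": [7.9, 7.1],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
int_cols = toy.select_dtypes(include="int64").columns.tolist()
non_int_cols = toy.select_dtypes(exclude="int64").columns.tolist()

print(numeric_cols, int_cols, non_int_cols)
```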

Select multiple columns with the filter method

filter(items=None, like=None, regex=None, axis=None)

  • items : list-like
    • List of labels along the info axis to keep (not all of them need to be present)
    • Pass a list of column or row labels
  • like : string
    • Keep labels for which "arg in col" is True
    • Works like a substring test on each label, similar to checking col.find(arg) != -1 in Python
  • regex : string (regular expression)
    • Keep labels for which re.search(regex, col) finds a match

Use filter() with like to select multiple columns

movie.filter(like='facebook').head()
director_facebook_likes actor_3_facebook_likes ... actor_2_facebook_likes movie_facebook_likes
0 0.0 855.0 ... 936.0 33000
1 563.0 1000.0 ... 5000.0 0
2 0.0 161.0 ... 393.0 85000
3 22000.0 23000.0 ... 23000.0 164000
4 131.0 NaN ... 12.0 0

5 rows × 6 columns

Select multiple columns with a regular expression

movie.filter(regex=r'\d').head()
actor_3_facebook_likes actor_2_name ... actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore ... Wes Studi 936.0
1 1000.0 Orlando Bloom ... Jack Davenport 5000.0
2 161.0 Rory Kinnear ... Stephanie Sigman 393.0
3 23000.0 Christian Bale ... Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker ... NaN 12.0

5 rows × 6 columns

Pass a list to the items parameter of filter() to select multiple columns

movie.filter(items=['actor_1_name', 'actor_3_name']).head()
actor_1_name actor_3_name
0 CCH Pounder Wes Studi
1 Johnny Depp Jack Davenport
2 Christoph Waltz Stephanie Sigman
3 Tom Hardy Joseph Gordon-Levitt
4 Doug Walker NaN
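
To compare the three filter parameters side by side, here is a minimal sketch on a one-row frame. The `toy` frame and the label "missing_col" are hypothetical; the point is that like does a substring match, regex uses re.search, and items quietly skips labels that do not exist.

```python
import pandas as pd

# Hypothetical one-row frame; only the column labels matter here
toy = pd.DataFrame({
    "actor_1_name": ["A"],
    "actor_1_facebook_likes": [10],
    "director_name": ["X"],
    "movie_facebook_likes": [5],
})

like_cols = toy.filter(like="facebook").columns.tolist()   # substring match
regex_cols = toy.filter(regex=r"\d").columns.tolist()      # labels containing a digit
items_cols = toy.filter(items=["director_name", "missing_col"]).columns.tolist()

print(like_cols, regex_cols, items_cols)
```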

Operating on a DataFrame

Basic methods

Shape of the dataset, total number of elements, and number of dimensions

movie.shape, movie.size, movie.ndim
((4916, 28), 137648, 2)
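
The relationship between these attributes can be checked on a tiny example. This is a sketch with an invented `toy` frame: size is simply rows times columns (just as 4916 × 28 = 137648 above), and len() returns the number of rows.

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape
print(toy.shape, toy.size, toy.ndim, len(toy))
```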

Number of non-missing values in each column

movie.count()
color                   4897
director_name           4814
                        ... 
aspect_ratio            4590
movie_facebook_likes    4916
Length: 28, dtype: int64

Statistics

movie.shape
(4916, 28)

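Maximum and minimum values

Numeric types

# min (max and quantile are called the same way); numeric_only=True keeps only the
# numeric columns, which newer pandas versions no longer select automatically
movie_min = movie.min(numeric_only=True)
movie_min.name = 'minimum'
movie_min
num_critic_for_reviews    1.00
duration                  7.00
                          ... 
aspect_ratio              1.18
movie_facebook_likes      0.00
Name: minimum, Length: 16, dtype: float64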

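Missing values are skipped by default in these computations. Setting skipna=False includes them, but then every column that contains a missing value returns NaN, which is rarely useful.

movie.min(skipna=False, numeric_only=True)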
num_critic_for_reviews    NaN
duration                  NaN
                         ... 
aspect_ratio              NaN
movie_facebook_likes      0.0
Length: 16, dtype: float64
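
The effect of skipna can be seen on a small frame. This is a sketch with made-up values: the column with a NaN returns NaN once skipna=False is set, while the complete column is unaffected.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "duration" has one missing value, "likes" has none
toy = pd.DataFrame({
    "duration": [7.0, np.nan, 90.0],
    "likes": [0.0, 5.0, 3.0],
})

default_min = toy.min()             # NaN is skipped
strict_min = toy.min(skipna=False)  # NaN propagates

print(default_min)
print(strict_min)
```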

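String types

When a string column contains missing values, the aggregation methods min, max and sum return no value for it in the pandas version used here (newer versions may raise a TypeError instead).

movie[['color', 'movie_title', 'color']].max()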
Series([], dtype: float64)

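To force pandas to return a value for each column, the missing values have to be filled in first. Here they are filled with an empty string.

movie[['color', 'movie_title', 'color']].fillna('').max()
color             Color
movie_title    Æon Flux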
color             Color
dtype: object
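
The fill-then-aggregate pattern can be sketched on invented data as follows. Because the empty string sorts before every other string, it never wins the max, so filling with '' does not distort the result.

```python
import pandas as pd

# Hypothetical string columns with missing entries
toy = pd.DataFrame({
    "color": ["Color", None, "Black and White"],
    "movie_title": ["Avatar", "Spectre", None],
})

# Replace missing values with '' so that only strings are compared
filled_max = toy.fillna("").max()
print(filled_max)
```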

Summary statistics

Numeric columns

Use the percentiles parameter to specify which quantiles to report

movie.describe(percentiles=[.01, .3, .99])
num_critic_for_reviews duration ... aspect_ratio movie_facebook_likes
count 4867.000000 4901.000000 ... 4590.000000 4916.000000
mean 137.988905 107.090798 ... 2.222349 7348.294142
... ... ... ... ... ...
99% 546.680000 189.000000 ... 4.000000 93850.000000
max 813.000000 511.000000 ... 16.000000 349000.000000

9 rows × 16 columns
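
How the percentiles argument turns into row labels can be checked with a short sketch. The `toy` frame below is hypothetical; each requested fraction shows up as a percentage label in the index of the result.

```python
import pandas as pd

# Hypothetical column holding the integers 1..100
toy = pd.DataFrame({"duration": range(1, 101)})

stats = toy.describe(percentiles=[.01, .3, .99])
print(stats)
```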

String columns

movie.select_dtypes(include='object').describe()
color director_name ... country content_rating
count 4897 4814 ... 4911 4616
unique 2 2397 ... 65 18
top Color Steven Spielberg ... USA R
freq 4693 26 ... 3710 2067

4 rows × 12 columns

Chaining methods

Use the isnull method to convert every value to a boolean

movie.isnull().head()
color director_name ... aspect_ratio movie_facebook_likes
0 False False ... False False
1 False False ... False False
2 False False ... False False
3 False False ... False False
4 True False ... True False

5 rows × 28 columns

sum counts the True values in each column and returns a Series

movie.isnull().sum().head()
color                       19
director_name              102
num_critic_for_reviews      49
duration                    15
director_facebook_likes    102
dtype: int64

Calling sum again on this Series gives the total number of missing values in the whole DataFrame, as a scalar

movie.isnull().sum().sum()
2654

To check whether the DataFrame contains any missing value at all, chain two any calls

movie.isnull().any().any()
True
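
The whole chain can be followed step by step on a small frame. This is a minimal sketch with invented data: the first reduction works per column, the second collapses the resulting Series to a single value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three missing values in total
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
})

per_column = toy.isnull().sum()          # Series: missing count per column
total_missing = toy.isnull().sum().sum() # scalar: missing count overall
has_missing = toy.isnull().any().any()   # scalar: is anything missing?

print(per_column, total_missing, has_missing)
```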

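Operators

Set INSTNM as the row index, then use the UGDS_ prefix to filter out the columns holding the racial composition of undergraduates

college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')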
college_ugds_
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.0333 0.9353 ... 0.0059 0.0138
University of Alabama at Birmingham 0.5922 0.2600 ... 0.0179 0.0100
... ... ... ... ... ...
Bay Area Medical Academy - San Jose Satellite Location NaN NaN ... NaN NaN
Excel Learning Center-San Antonio South NaN NaN ... NaN NaN

7535 rows × 9 columns

All columns in college_ugds_ are of type float64, so arithmetic operations can be applied to the whole DataFrame

college_ugds_.dtypes
UGDS_WHITE    float64
UGDS_BLACK    float64
               ...   
UGDS_NRA      float64
UGDS_UNKN     float64
Length: 9, dtype: object

Arithmetic (addition, subtraction, multiplication, division)

college_ugds_.head() + .00501
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.03831 0.94031 ... 0.01091 0.01881
University of Alabama at Birmingham 0.59721 0.26501 ... 0.02291 0.01501
Amridge University 0.30401 0.42421 ... 0.00501 0.27651
University of Alabama in Huntsville 0.70381 0.13051 ... 0.03821 0.04001
Alabama State University 0.02081 0.92581 ... 0.02931 0.01871

5 rows × 9 columns

Rounding the sample data to whole percentages

Method 1

college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
college_ugds_op_round.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.03 0.94 ... 0.01 0.01
University of Alabama at Birmingham 0.59 0.26 ... 0.02 0.01
Amridge University 0.30 0.42 ... 0.00 0.27
University of Alabama in Huntsville 0.70 0.13 ... 0.03 0.04
Alabama State University 0.02 0.92 ... 0.02 0.01

5 rows × 9 columns

Method 2

college_ugds_round = (college_ugds_ + .00001).round(2)
college_ugds_round.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.03 0.94 ... 0.01 0.01
University of Alabama at Birmingham 0.59 0.26 ... 0.02 0.01
Amridge University 0.30 0.42 ... 0.00 0.27
University of Alabama in Huntsville 0.70 0.13 ... 0.03 0.04
Alabama State University 0.02 0.92 ... 0.02 0.01

5 rows × 9 columns

Method 3

college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
college_ugds_op_round_methods.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM
Alabama A & M University 0.03 0.94 ... 0.01 0.01
University of Alabama at Birmingham 0.59 0.26 ... 0.02 0.01
Amridge University 0.30 0.42 ... 0.00 0.27
University of Alabama in Huntsville 0.70 0.13 ... 0.03 0.04
Alabama State University 0.02 0.92 ... 0.02 0.01

5 rows × 9 columns
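
The three approaches can be compared directly. The sketch below uses a hypothetical two-row frame holding four of the values shown above. The operator form and the method form perform the same operations, so they should match exactly; the round-based form is compared with a tolerance, since it takes a different floating-point path.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the UGDS_ proportions
toy = pd.DataFrame({
    "UGDS_WHITE": [0.0333, 0.5922],
    "UGDS_BLACK": [0.9353, 0.2600],
})

by_ops = (toy + .00501) // .01 / 100                  # method 1: operators
by_round = (toy + .00001).round(2)                    # method 2: round
by_methods = toy.add(.00501).floordiv(.01).div(100)   # method 3: method chain

print(by_ops)
```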

Comparing missing values

Pandas uses the NumPy NaN (np.nan) object to represent missing values. It is a special object that is not equal to itself:

np.nan == np.nan
False

Every comparison with np.nan returns False, except for not-equal:

5 > np.nan
False
5 != np.nan
True
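
Since == cannot detect NaN, missing values are normally located with isna (or its alias isnull). A minimal sketch, using a made-up Series `s`:

```python
import numpy as np
import pandas as pd

nan_equals_itself = np.nan == np.nan  # False: NaN never equals anything
nan_detected = pd.isna(np.nan)        # True: the dedicated check works

s = pd.Series([1.0, np.nan])          # hypothetical data
self_comparison = (s == s).tolist()   # the NaN position compares as False
missing_mask = s.isna().tolist()      # the NaN position is flagged as True

print(nan_equals_itself, nan_detected, self_comparison, missing_mask)
```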

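A direct comparison cannot tell whether two DataFrames containing missing values are identical, because every NaN compares as unequal, even to itself. The number of mismatches below is exactly the number of missing values.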

movie_equal = movie == movie
movie_equal.all().all()
False
movie_equal.size - movie_equal.sum().sum()
2654
movie.isnull().sum().sum()
2654

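The most direct way to compare two DataFrames is the equals() method, which treats missing values in the same position as equal. In tests, assert_frame_equal performs a similar check and raises an AssertionError if the frames differ.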

from pandas.testing import assert_frame_equal
assert_frame_equal(movie, movie)
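
The difference between the element-wise comparison and the two NaN-aware checks can be sketched on a small frame. The names `left` and `right` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frame with one missing value, and an identical copy
left = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})
right = left.copy()

elementwise_all_equal = bool((left == right).all().all())  # NaN != NaN breaks this
frames_equal = left.equals(right)                          # NaN-aware comparison

assert_frame_equal(left, right)  # passes silently; would raise AssertionError on a mismatch
print(elementwise_all_equal, frames_equal)
```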
