pandas基礎屬性方法隨機整理(四)---例項梳理(多知識點)
源資料格式:
“”
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
61 1 1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
61 1 2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
61 1 3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
“”“匯入:
data = pd.read_table(‘https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data‘,sep=’\s+’, parse_dates = [ [0,1,2] ])
注:
pd.read_table() 引數解析: parse_dates
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.更改column列名
data.rename(columns = {‘Yr_Mo_Dy’:’DATE’}, inplace=True)
data.rename(columns = {'Yr_Mo_Dy':'DATE'}, inplace=True)
data.iloc[:,0:9].head(3)
Out[158]:
DATE RPT VAL ROS KIL SHA BIR DUB CLA
0 1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25
1 1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04
2 1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN
- 更改日期資料型別: object – > datetime64[ns]
data.info():
DATE 列為日期,但是資料型別是’object’, 需將其更改為 ‘datetime64[ns]’型別
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6574 entries, 0 to 6573
Data columns (total 13 columns):
DATE 6574 non-null object
RPT 6568 non-null float64
VAL 6571 non-null float64
ROS 6572 non-null float64
KIL 6569 non-null float64
SHA 6572 non-null float64
BIR 6574 non-null float64
DUB 6571 non-null float64
CLA 6572 non-null float64
MUL 6571 non-null float64
CLO 6573 non-null float64
BEL 6574 non-null float64
MAL 6570 non-null float64
dtypes: float64(12), object(1)
memory usage: 719.0+ KB
方法1: .astype()
data['DATE'].astype('datetime64[ns]').dtype
Out[169]: dtype('<M8[ns]')
方法2:pd.to_datetime(…) # 特殊格式,推薦使用
data['DATE'] = pd.to_datetime(data['DATE'])
data['DATE'].dtype
Out[178]: dtype('<M8[ns]')
- 將日期列設定為索引(column –> index):
data = data.set_index(‘DATE’)
data = data.set_index('DATE')
data.iloc[:,0:9].head(3)
Out[186]:
RPT VAL ROS KIL SHA BIR DUB CLA MUL
DATE
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50
- 缺失值數量:
data.isnull().sum() # notnull isnull的否定式
data.isnull().sum()
Out[192]:
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
- 描述與統計資訊:
問題1:
Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days
方法:data.describe().loc[ [‘min’,’max’,’mean’,’std’], :].T
loc_stats = data.describe().loc[['min','max','mean','std'],:].T
loc_stats
Out[207]:
min max mean std
RPT 0.67 35.80 12.362987 5.618413
VAL 0.21 33.37 10.644314 5.267356
ROS 1.50 33.84 11.660526 5.008450
KIL 0.00 28.46 6.306468 3.605811
SHA 0.13 37.54 10.455834 4.936125
BIR 0.00 26.16 7.092254 3.968683
DUB 0.00 30.37 9.797343 4.977555
CLA 0.00 31.08 8.495053 4.499449
MUL 0.00 25.88 8.493590 4.166872
CLO 0.04 28.21 8.707332 4.503954
BEL 0.13 42.38 13.121007 5.835037
MAL 0.67 42.54 15.599079 6.699794
問題2:
Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.
方法:data.T.describe().loc[ [‘min’,’max’,’mean’,’std’], :].T
note: 巧妙利用轉置函式 df.T 實現維度轉換
days_stats = data.T.describe().loc[['min','max','mean','std'],:].T
days_stats.head()
Out[229]:
min max mean std
DATE
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356
- 條件查詢: df.query(‘cond…’)
問題:
Find the average windspeed in January for each location.
方法1:輔助列 data[‘Mon’], data[
data['Mon']==1
].mean() # 利用掩碼
,即bool作為篩選條件
即 陣列[關係表示式]:
關係表示式是一個布林型書序,其中為True的元素對應於陣列中滿足關係表示式的元素,以上下標運算的值就是從陣列中挑選與布林陣列中為True的元素相對應的元素- 方法2:輔助列data[‘Mon’], # 查詢.query, 功能與索引一樣,有時更方便
- 方法3: 不新增輔助列,利用groupby()方法實現降取樣
data.groupby (lambda x: x.month
). mean(). T[1]
data['Mon'] = data['date_col'].apply(lambda x: x.month)
data[data['Mon']==1].mean() # data['Mon'==1] 掩碼
Out[248]:
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
Mon 1.000000
Year 1969.500000
day 16.000000
dtype: float64
相比於條件查詢方式data[data[‘Mon’]==1].mean(), .query查詢方式更簡潔直觀data.query(‘Mon == 1’).mean()
data.query('Mon == 1').mean()
Out[267]:
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
Mon 1.000000
Year 1969.500000
day 16.000000
dtype: float64
方法3說明:
a) 刪除輔助列:
data.drop(‘Mon’], axis=1, inplace=True) #
b) .groupby() 傳入lambda函式處理時間序列進行重取樣
- mondata_loc_per = data.groupby(lambda x: x.month).mean().T
mondata_loc_per.head()
Out[311]:
1 2 3 4 5 6 \
RPT 14.847325 13.710906 13.158687 12.555648 11.724032 10.451317
VAL 12.914560 12.111122 11.505842 10.429759 10.145619 8.949704
ROS 13.299624 12.879132 12.648118 12.204815 11.550394 10.361315
KIL 7.199498 6.942411 7.265907 6.898037 6.307487 5.652278
SHA 11.667734 11.551772 11.554516 10.677667 10.224301 9.529926
7 8 9 10 11 12
RPT 9.992007 10.213411 11.458519 12.660610 13.200722 14.446398
VAL 8.357778 8.415143 9.981002 11.010681 11.639500 12.353602
ROS 9.349642 9.993441 10.756883 11.453943 12.293407 13.212276
KIL 5.416935 5.270681 5.615176 6.065215 6.247611 6.829910
SHA 9.302634 8.901559 9.766315 10.550251 10.501130 11.301254
c) 索引:
mondata_loc_per[4]: [N] 中N的數字代表月份Mon
mondata_loc_per[4]
Out[312]:
RPT 12.555648
VAL 10.429759
ROS 12.204815
KIL 6.898037
SHA 10.677667
BIR 7.441389
DUB 10.221315
CLA 8.909056
MUL 8.930870
CLO 9.158019
BEL 12.664759
MAL 14.937611
Mon 4.000000
Year 1969.500000
day 15.500000
Name: 4, dtype: float64
- resample重取樣:
data_rew = data.resample('W',closed='right',kind='period').agg(['min','max','mean','std'])
data_rew.iloc[:,0:7].head()
Out[418]:
RPT VAL \
min max mean std min max
DATE
1960-12-26/1961-01-01 15.04 15.04 15.040000 NaN 14.96 14.96
1961-01-02/1961-01-08 10.58 18.50 13.541429 2.631321 6.63 16.88
1961-01-09/1961-01-15 9.04 19.75 12.468571 3.555392 3.54 12.08
1961-01-16/1961-01-22 4.92 19.83 13.204286 5.337402 3.42 14.37
1961-01-23/1961-01-29 13.62 25.04 19.880000 4.619061 9.96 23.91
data.resample('A', kind='period',axis=0,label='right').mean().head(3)
Out[411]:
RPT VAL ROS KIL SHA BIR \
DATE
1961 12.299583 10.351796 11.362369 6.958227 10.881763 7.729726
1962 12.246923 10.110438 11.732712 6.960440 10.657918 7.393068
1963 12.813452 10.836986 12.541151 7.330055 11.724110 8.434712
help()程式碼資訊擷取:
parse_dates : boolean or list of ints or names or list of lists or dict, default False
* boolean. If True -> try parsing the index.
* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as
a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
'foo'
If a column or index contains an unparseable date, the entire column or
index will be returned unaltered as an object data type. For non-standard
datetime parsing, use ``pd.to_datetime`` after ``pd.read_csv``