Pandas Advanced Processing
By 阿新 • Published 2021-01-23
1. Handling Missing Values
1.1 Ways to handle missing values
1.1.1 Two approaches
1. Drop the samples (rows) that contain missing values (NaN)
2. Replace/impute them
1.1.2 Handling NaN
1. Check whether the data contains NaN
- pd.isnull(df): element-wise check, True where a value is missing
- pd.notnull(df): element-wise check, True where a value is present
# Returns True, so the data contains missing values
np.any(movie.isnull())
True
# Returns False, so the data contains missing values
np.all(movie.notnull())
False
movie.isnull().any()
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) True
Metascore True
dtype: bool
pd.notnull(movie).all()
Rank True
Title True
Genre True
Description True
Director True
Actors True
Year True
Runtime (Minutes) True
Rating True
Votes True
Revenue (Millions) False
Metascore False
dtype: bool
2. Handling NaN
Drop the samples with missing values
- df.dropna()
- axis=0 (the default): drop rows that contain NaN
- inplace=True drops the rows from the original DataFrame; inplace=False (the default) leaves the original untouched and returns a new DataFrame with those rows removed
Replace missing values
- df.fillna(value, inplace=)
- value: the value to fill in
- inplace=True fills the original DataFrame in place; inplace=False (the default) leaves the original untouched and returns a new, filled DataFrame
movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean(),inplace=True)
movie["Metascore"].fillna(movie["Metascore"].mean(),inplace=True)
pd.isnull(movie).any()
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool
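The dropna path listed above is not demonstrated on the movie data, so here is a minimal sketch on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Drop rows containing NaN; inplace defaults to False,
# so the original df is left untouched and a new frame is returned
cleaned = df.dropna()
print(cleaned.shape)  # (2, 2): the row with NaN is gone
print(df.shape)       # (3, 2): original unchanged
```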
1.1.3 Missing values marked with a placeholder instead of NaN
1. Replace the marker, e.g. "?", with np.nan
- df.replace(to_replace="?", value=np.nan)
2. Handle np.nan
- same as the NaN handling above
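A minimal sketch of the two steps above, assuming a hypothetical column where "?" marks missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical data using "?" as a missing-value marker
df = pd.DataFrame({"score": ["85", "?", "90"]})

# Step 1: turn the marker into a real NaN
df = df.replace(to_replace="?", value=np.nan)

# Step 2: handle it like any other NaN, e.g. fill with the mean
df["score"] = df["score"].astype(float)
df["score"] = df["score"].fillna(df["score"].mean())
print(df["score"].tolist())  # [85.0, 87.5, 90.0]
```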
2. Data Discretization
2.1 Definition and purpose
Definition: discretizing a continuous attribute means partitioning its value range into a number of discrete intervals and representing the values that fall into each sub-interval with a distinct symbol or integer.
Purpose: discretization simplifies the data structure by reducing the number of distinct values a continuous attribute takes. It is a common preprocessing tool in data mining.
2.2 Implementation
2.2.1 Binning
1. Automatic (quantile) binning
- sr = pd.qcut(data, q): the data and the number of quantile bins
data = pd.Series([165,174,160,180,159,163,192,184],index=['No1:165','No1:174','No1:160','No1:180','No1:159','No1:163','No1:192','No1:184'])
No1:165 165
No1:174 174
No1:160 160
No1:180 180
No1:159 159
No1:163 163
No1:192 192
No1:184 184
dtype: int64
# Bin the data
sr = pd.qcut(data,3)
# Count the number of values in each bin
sr.value_counts()
(178.0, 192.0] 3
(158.999, 163.667] 3
(163.667, 178.0] 2
dtype: int64
# Convert to one-hot encoding
pd.get_dummies(sr,prefix="height")
height_(158.999, 163.667] height_(163.667, 178.0] height_(178.0, 192.0]
No1:165 0 1 0
No1:174 0 1 0
No1:160 1 0 0
No1:180 0 0 1
No1:159 1 0 0
No1:163 1 0 0
No1:192 0 0 1
No1:184 0 0 1
2. Custom binning
- sr = pd.cut(data, bins): the data and a list of bin edges
# Custom bin edges
bins = [150,165,180,195]
sr = pd.cut(data,bins)
No1:165 (150, 165]
No1:174 (165, 180]
No1:160 (150, 165]
No1:180 (165, 180]
No1:159 (150, 165]
No1:163 (150, 165]
No1:192 (180, 195]
No1:184 (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
pd.get_dummies(sr,prefix="身高")
身高_(150, 165] 身高_(165, 180] 身高_(180, 195]
No1:165 1 0 0
No1:174 0 1 0
No1:160 1 0 0
No1:180 0 1 0
No1:159 1 0 0
No1:163 1 0 0
No1:192 0 0 1
No1:184 0 0 1
3. Convert the binned result to one-hot encoding
- pd.get_dummies(sr, prefix= )
3. Merging
3.1 Concatenation along an axis
pd.concat([data1, data2], axis=1)
- axis=0: vertical concatenation (stack rows)
- axis=1: horizontal concatenation (place side by side)
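A minimal sketch of both directions, using two small hypothetical frames:

```python
import pandas as pd

d1 = pd.DataFrame({"a": [1, 2]})
d2 = pd.DataFrame({"b": [3, 4]})

# axis=1: place the frames side by side, aligned on the index
wide = pd.concat([d1, d2], axis=1)
print(wide.shape)  # (2, 2) -> columns a and b

# axis=0: stack the frames vertically, aligned on column names
tall = pd.concat([d1, d1], axis=0)
print(tall.shape)  # (4, 1) -> four rows of column a
```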
3.2 Joining on keys
pd.merge(left, right, how="inner", on=[keys])
- how=: left (left join), right (right join), outer (outer join), inner (inner join)
left=pd.DataFrame({"key1":["K0","K0","K1","K2"],
"key2":["K0","K1","K0","K1"],
"A":["A0","A1","A2","A3"],
"B":["B0","B1","B2","B3"]})
right = pd.DataFrame({'key1':["K0","K1","K1","K2"],
"key2":["K0","K0","K0","K0"],
"C":["C0","C1",'C2',"C3"],
"D":["D0","D1","D2","D3"]})
left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
# Inner join
pd.merge(left,right,how="inner",on=["key1","key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
# Left join
pd.merge(left,right,how="left",on=["key1","key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
# Right join
pd.merge(left,right,how="right",on=["key1","key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
# Outer join
pd.merge(left,right,how="outer",on=["key1","key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
4. Crosstabs and Pivot Tables
Both are used to find and explore the relationship between two variables
4.1 Using crosstab
A crosstab counts the group frequencies of one column against another (revealing the relationship between the two columns)
pd.crosstab(value1, value2)
date.weekday
Int64Index([1, 0, 4, 3, 2, 1, 0, 4, 3, 2,
...
4, 3, 2, 1, 0, 4, 3, 2, 1, 0],
dtype='int64', length=643)
stock["week"] = date.weekday
# Rise/fall indicator column: 1 if p_change > 0, else 0
stock["pona"] = np.where(stock["p_change"] > 0, 1, 0)
# Crosstab of the weekday column against the rise/fall indicator
data = pd.crosstab(stock["week"], stock["pona"])
# Frequency of rises and falls for each weekday
pona 0 1
week
0 63 62
1 55 76
2 61 71
3 63 65
4 59 68
# Convert the counts to row percentages
data.div(data.sum(axis=1),axis=0)
pona 0 1
week
0 0.504000 0.496000
1 0.419847 0.580153
2 0.462121 0.537879
3 0.492188 0.507812
4 0.464567 0.535433
4.2 Pivot tables: pivot_table
DataFrame.pivot_table([ ], index=[ ])
Gives the percentages directly: pivot_table aggregates with the mean by default, and the mean of a 0/1 column is the share of 1s
stock.pivot_table(["pona"],index=["week"])
pona
week
0 0.496000
1 0.580153
2 0.537879
3 0.507812
4 0.535433
5. Grouping and Aggregation
5.1 DataFrame method
DataFrame.groupby(key, as_index=False)
key: the column(s) to group by; multiple keys are allowed
Note: results only appear once grouping is followed by an aggregation
col = pd.DataFrame({'color':['white','red','green','red','green'],
                    'object':['pen','pencil','pencil','ashtray','pen'],
                    'price1':[5.56,4.20,1.30,0.56,2.75],
                    'price2':[4.75,4.12,1.60,0.75,3.15]})
color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
# Group by color and take the maximum of price1
# Using the DataFrame method
col.groupby(by="color")["price1"].max()
color
green    2.75
red      4.20
white    5.56
Name: price1, dtype: float64
5.2 Series method
Series.groupby(key)
# Using the Series method
col["price1"].groupby(col["color"]).max()
color
green    2.75
red      4.20
white    5.56
Name: price1, dtype: float64
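The as_index parameter in the DataFrame.groupby signature above is never demonstrated. A minimal sketch, rebuilding a small version of the col frame with numeric prices:

```python
import pandas as pd

col = pd.DataFrame({"color": ["white", "red", "green", "red", "green"],
                    "price1": [5.56, 4.20, 1.30, 0.56, 2.75]})

# as_index=True (the default): the group key becomes the result's index
s = col.groupby("color")["price1"].max()

# as_index=False: the key stays an ordinary column and a DataFrame is returned
df = col.groupby("color", as_index=False)["price1"].max()
print(df["color"].tolist())   # ['green', 'red', 'white']
print(df["price1"].tolist())  # [2.75, 4.2, 5.56]
```

as_index=False is convenient when the grouped result feeds into a later merge, since the key is already a regular column.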