
Advanced Pandas Processing

Tags: deep learning, pandas, machine learning

1. Handling Missing Values

1.1 Ways to handle missing values

1.1.1 Strategy

1. Drop the samples that contain missing values (NaN)

2. Replace / impute

1.1.2 Handling NaN

1. Check whether the data contains NaN

  • pd.isnull(df) — returns True where a value is missing
  • pd.notnull(df) — returns True where a value is not missing
# returns True: the data contains missing values
np.any(movie.isnull())
True
# returns False: the data contains missing values
np.all(movie.notnull())
False

movie.isnull().any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool

pd.notnull(movie).all()
Rank                   True
Title                  True
Genre                  True
Description            True
Director               True
Actors                 True
Year                   True
Runtime (Minutes)      True
Rating                 True
Votes                  True
Revenue (Millions)    False
Metascore             False
dtype: bool

2. Handling NaN

Drop the samples containing missing values

  • df.dropna( )
    • axis=0 — the default; drops rows
    • inplace= (True: drops the rows with missing values from the original DataFrame; False: leaves the original untouched and returns a new DataFrame with those rows dropped; default False)
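The two inplace modes can be sketched on a toy DataFrame (the column names here are illustrative, not from the movie dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in column "a"
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Default (inplace=False): returns a new DataFrame; df itself is unchanged
cleaned = df.dropna()
print(len(df), len(cleaned))   # 3 2

# inplace=True mutates df directly and returns None
df.dropna(inplace=True)
print(len(df))                 # 2
```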

Replace missing values

  • df.fillna(value, inplace=)
    • value — the value to fill in
    • inplace= (True: fills the missing values in the original DataFrame; False: leaves the original untouched and returns a new filled DataFrame; default False)
movie["Revenue (Millions)"] = movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean())
movie["Metascore"] = movie["Metascore"].fillna(movie["Metascore"].mean())
pd.isnull(movie).any()
Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)    False
Metascore             False
dtype: bool

1.1.3 Missing values not stored as NaN but marked with a placeholder

1. Replace "?" with np.nan

  • df.replace(to_replace="?", value=np.nan)

2. Handle the resulting np.nan

  • Same as the NaN handling above
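The two steps together can be sketched as follows (toy data with "?" as the placeholder; the column name is illustrative):

```python
import numpy as np
import pandas as pd

# Toy column that uses "?" to mark missing values
df = pd.DataFrame({"score": ["5", "?", "7"]})

# Step 1: turn the placeholder into np.nan
df = df.replace(to_replace="?", value=np.nan)
print(df["score"].isnull().sum())  # 1

# Step 2: treat it like any other NaN, e.g. impute with the column mean
df["score"] = pd.to_numeric(df["score"])
df["score"] = df["score"].fillna(df["score"].mean())
print(df["score"].tolist())        # [5.0, 6.0, 7.0]
```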

2. Data Discretization

2.1 Definition and purpose

Definition: discretizing a continuous attribute means partitioning its value range into a number of discrete intervals, then representing the attribute values that fall into each sub-interval with distinct symbols or integer labels.

Purpose: discretization simplifies the data structure and reduces the number of distinct values a continuous attribute can take. Discretization methods are a common tool in data mining.

2.2 Implementation

2.2.1 Binning

1. Automatic binning

  • sr = pd.qcut(data, q) — the data and the number of equal-frequency bins
data = pd.Series([165,174,160,180,159,163,192,184],index=['No1:165','No1:174','No1:160','No1:180','No1:159','No1:163','No1:192','No1:184'])
No1:165    165
No1:174    174
No1:160    160
No1:180    180
No1:159    159
No1:163    163
No1:192    192
No1:184    184
dtype: int64
# bin into 3 equal-frequency groups
sr = pd.qcut(data, 3)
# count the number of samples in each bin
sr.value_counts()

(178.0, 192.0]        3
(158.999, 163.667]    3
(163.667, 178.0]      2
dtype: int64
# convert to one-hot encoding
pd.get_dummies(sr, prefix="height")
		height_(158.999, 163.667]	height_(163.667, 178.0]	height_(178.0, 192.0]
No1:165	0							1						0
No1:174	0							1						0
No1:160	1							0						0
No1:180	0							0						1
No1:159	1							0						0
No1:163	1							0						0
No1:192	0							0						1
No1:184	0							0						1

2. Custom binning

  • sr = pd.cut(data, [ ]) — the data and a list of bin edges
# custom bin edges
bins = [150, 165, 180, 195]
sr = pd.cut(data, bins)
No1:165    (150, 165]
No1:174    (165, 180]
No1:160    (150, 165]
No1:180    (165, 180]
No1:159    (150, 165]
No1:163    (150, 165]
No1:192    (180, 195]
No1:184    (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
pd.get_dummies(sr, prefix="height")
		height_(150, 165]	height_(165, 180]	height_(180, 195]
No1:165	1	0	0
No1:174	0	1	0
No1:160	1	0	0
No1:180	0	1	0
No1:159	1	0	0
No1:163	1	0	0
No1:192	0	0	1
No1:184	0	0	1

3. Converting the binned result to one-hot encoding

  • pd.get_dummies(sr, prefix= )

3. Merging

3.1 Concatenation along an axis

pd.concat([data1, data2], axis=1)

  • axis=0: vertical concatenation (stacks rows)
  • axis=1: horizontal concatenation (places columns side by side)
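The effect of the two axis values can be sketched on two small frames (the frames and column name are illustrative):

```python
import pandas as pd

# Two small frames with the same column
d1 = pd.DataFrame({"x": [1, 2]})
d2 = pd.DataFrame({"x": [3, 4]})

# axis=0 stacks vertically: 4 rows, 1 column
v = pd.concat([d1, d2], axis=0)
print(v.shape)  # (4, 1)

# axis=1 aligns on the index and places the frames side by side: 2 rows, 2 columns
h = pd.concat([d1, d2], axis=1)
print(h.shape)  # (2, 2)
```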

3.2 Joining on keys

pd.merge(left, right, how="inner", on=[key columns])

  • how=: left (left join), right (right join), outer (outer join), inner (inner join)
left=pd.DataFrame({"key1":["K0","K0","K1","K2"],
                   "key2":["K0","K1","K0","K1"],
                   "A":["A0","A1","A2","A3"],
                   "B":["B0","B1","B2","B3"]})
right = pd.DataFrame({'key1':["K0","K1","K1","K2"],
                      "key2":["K0","K0","K0","K0"],
                       "C":["C0","C1",'C2',"C3"],
                       "D":["D0","D1","D2","D3"]})
left
	key1 key2	A	B
0	K0	K0	A0	B0
1	K0	K1	A1	B1
2	K1	K0	A2	B2
3	K2	K1	A3	B3
right
	key1	key2	C	D
0	K0	K0	C0	D0
1	K1	K0	C1	D1
2	K1	K0	C2	D2
3	K2	K0	C3	D3
# inner join
pd.merge(left,right,how="inner",on=["key1","key2"])
	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K1	K0	A2	B2	C1	D1
2	K1	K0	A2	B2	C2	D2
# left join
pd.merge(left,right,how="left",on=["key1","key2"])
	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN
# right join
pd.merge(left,right,how="right",on=["key1","key2"])
	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K1	K0	A2	B2	C1	D1
2	K1	K0	A2	B2	C2	D2
3	K2	K0	NaN	NaN	C3	D3
# outer join
pd.merge(left,right,how="outer",on=["key1","key2"])
	key1	key2	A	B	C	D
0	K0	K0	A0	B0	C0	D0
1	K0	K1	A1	B1	NaN	NaN
2	K1	K0	A2	B2	C1	D1
3	K1	K0	A2	B2	C2	D2
4	K2	K1	A3	B3	NaN	NaN
5	K2	K0	NaN	NaN	C3	D3

4. Crosstabs and Pivot Tables

Used to find and explore the relationship between two variables

4.1 Implementation with crosstab

A crosstab counts, for each group in one column, the number of occurrences in another column (revealing the relationship between the two columns)

pd.crosstab(value1, value2)

date.weekday
Int64Index([1, 0, 4, 3, 2, 1, 0, 4, 3, 2,
            ...
            4, 3, 2, 1, 0, 4, 3, 2, 1, 0],
           dtype='int64', length=643)
stock["week"] = date.weekday
# rise/fall indicator column: 1 if p_change > 0, else 0
stock["pona"] = np.where(stock["p_change"] > 0, 1, 0)
# crosstab of the weekday column against the rise/fall column
data = pd.crosstab(stock["week"], stock["pona"])
# frequency of rises/falls for each weekday
pona   0   1
week
0     63  62
1     55  76
2     61  71
3     63  65
4     59  68
# convert the counts to row percentages
data.div(data.sum(axis=1), axis=0)
pona         0         1
week
0     0.504000  0.496000
1     0.419847  0.580153
2     0.462121  0.537879
3     0.492188  0.507812
4     0.464567  0.535433

4.2 Pivot tables with pivot_table

DataFrame.pivot_table([ ], index=[ ])

The percentages are obtained directly: pivot_table aggregates with the mean by default, and the mean of a 0/1 column is the proportion of 1s

stock.pivot_table(["pona"],index=["week"])
	pona
week	
0	0.496000
1	0.580153
2	0.537879
3	0.507812
4	0.535433

5. Grouping and Aggregation

5.1 DataFrame method

DataFrame.groupby(key, as_index=False)

key: the column(s) to group by; multiple keys are allowed

Note: a groupby only produces a result once an aggregation is applied to it

col = pd.DataFrame({'color':['white','red','green','red','green'],
                    'object':['pen','pencil','pencil','ashtray','pen'],
                    'price1':[5.56, 4.20, 1.30, 0.56, 2.75],
                    'price2':[4.75, 4.12, 1.60, 0.75, 3.15]})

	color	object	price1	price2
0	white	pen	5.56	4.75
1	red	pencil	4.20	4.12
2	green	pencil	1.30	1.60
3	red	ashtray	0.56	0.75
4	green	pen	2.75	3.15
# group by color and aggregate price1
# grouping with the DataFrame method
col.groupby(by="color")["price1"].max()
color
green    2.75
red      4.20
white    5.56
Name: price1, dtype: float64

5.2 Series method

Series.groupby(key)

# the same grouping with the Series method
col["price1"].groupby(col["color"]).max()
color
green    2.75
red      4.20
white    5.56
Name: price1, dtype: float64
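The as_index=False parameter mentioned above is not shown in the examples; a minimal sketch of its effect, on a toy frame with numeric prices (names are illustrative):

```python
import pandas as pd

# Toy frame: numeric prices so max() is a numeric aggregation
col = pd.DataFrame({"color": ["white", "red", "green", "red", "green"],
                    "price1": [5.56, 4.20, 1.30, 0.56, 2.75]})

# Default (as_index=True): the group key becomes the index; result is a Series
s = col.groupby("color")["price1"].max()
print(s.index.tolist())  # ['green', 'red', 'white']

# as_index=False keeps the key as an ordinary column; result is a DataFrame
d = col.groupby("color", as_index=False)["price1"].max()
print(list(d.columns))   # ['color', 'price1']
```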