Data Cleaning and Preparation: Data Transformation
1.1 Removing Duplicate Values
Duplicate rows can appear in a DataFrame for a variety of reasons:
import numpy as np
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 1]})
print(data)
----------
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 1
The duplicated method of a DataFrame returns a boolean Series indicating whether each row is a duplicate (i.e., identical to a row that appeared earlier):
print(data.duplicated())
-----------
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
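Because the result is an ordinary boolean Series, it can also be used directly as a mask, for example to inspect the rows flagged as duplicates before dropping them (a small sketch, not part of the original walkthrough):

# Select the rows that duplicated() marks as True
print(data[data.duplicated()])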
drop_duplicates returns a DataFrame containing only the rows for which the array returned by duplicated is False:
print(data.drop_duplicates())
----------
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
Suppose we add an additional column of values and want to drop duplicates based only on the 'k1' column:
data['v1'] = range(7)
print(data.drop_duplicates(['k1']))
--------------
k1 k2 v1
0 one 1 0
1 two 1 1
By default, duplicated and drop_duplicates both keep the first observed value combination; passing keep='last' will keep the last one instead:
print(data.drop_duplicates(['k1','k2'], keep='last'))
--------------
    k1  k2  v1
0  one   1   0
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   1   6
1.2 Transforming Data Using a Function or Mapping
Consider the following hypothetical data collected about various kinds of meat:
data = pd.DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],
'ounces':[4,3,12,6,7.5,8,3,5,6]})
print(data)
----------------------
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
Suppose you wanted to add a column indicating the type of animal that each food came from. Let's write down a mapping of each distinct meat type to the kind of animal:
meat_to_animal = {
    'bacon':'pig',
    'pulled pork':'pig',
    'pastrami':'cow',
    'corned beef':'cow',
    'honey ham':'pig',
    'nova lox':'salmon'
}
The map method on a Series accepts a function or a dictionary-like object containing a mapping. Here we have a small problem in that some of the meats are capitalized and others are not, so we first need to convert each value to lowercase using the str.lower method of the Series:
lowercased = data['food'].str.lower()
print(lowercased)
----------------
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
print(data)
------------------------------
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
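Since map also accepts a function, the lowercasing and the dictionary lookup can be combined into a single step by passing a function that does all the work (a brief sketch, reusing the data and meat_to_animal objects defined above):

# Equivalent to lowercased.map(meat_to_animal): lowercase each food name, then look it up
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
print(data)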
1.3 Replacing Values
Filling in missing data with fillna is a special case of more general value replacement. Consider the following Series:
data = pd.Series([1.,-999.,2.,-999.,-1000.,3.])
print(data)
--------------
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
The -999 values might be sentinel values for missing data. To replace them with NA values that pandas understands, we can use the replace method, which produces a new Series (unless inplace=True is passed):
print(data.replace(-999,np.nan))
--------------
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
If you want to replace multiple values at once, you can pass a list of the values to replace along with the substitute value:
print(data.replace([-999,-1000],np.nan))
--------------
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
To use a different replacement for each value, pass a list of substitutes instead:
print(data.replace([-999,-1000],[np.nan,0]))
---------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
The argument can also be passed as a dictionary:
print(data.replace({-999:np.nan,-1000:0}))
--------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
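As noted above, replace returns a new Series unless inplace=True is passed; a minimal sketch of in-place modification:

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data.replace([-999, -1000], np.nan, inplace=True)  # modifies data in place and returns None
print(data)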
1.4 Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or a mapping of some form to produce new, differently labeled objects. The axes can also be modified in place without creating a new data structure. Here is a simple example:
data = pd.DataFrame(np.arange(12).reshape(3,4),index = ['Ohio','Colorado','New York'],
columns = ['one','two','three','four'])
Like a Series, the axis indexes have a map method:
transform = lambda x:x[:4].upper()
print(data.index.map(transform))
-----------------------------------------------
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
We can also assign the result to index, modifying the DataFrame in place:
data.index = data.index.map(transform)
print(data)
---------------------------
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
If you want to create a transformed version of the dataset without modifying the original, a useful method is rename:
print(data.rename(index = str.title,columns = str.upper))
-------------------------------
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
Notably, rename can be used with a dictionary-like object that provides new values for a subset of the axis labels:
print(data.rename(index = {'Ohio':'INDIANA'},columns={'three':'peekaboo'}))
----------------------------------
one two peekaboo four
INDIANA 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
rename saves you from the chore of copying the DataFrame manually and assigning new values to its index and columns attributes. If you want to modify the dataset in place, pass inplace=True:
data.rename(index = {'Ohio':'INDIANA'},inplace = True)
print(data)
-------------------------------
one two three four
INDIANA 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
1.5 Discretization and Binning
Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:
ages = [20,22,25,27,21,23,37,31,61,45,41,32]
Let's divide these ages into bins of 18 to 25, 26 to 35, 36 to 60, and 61 and older. To do so, we can use cut from pandas:
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
print(cats)
---------------------------------------------------------------------------------------------------------
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The object pandas returns is a special Categorical object. The output describes the bins computed by pandas.cut. You can treat it like an array of strings giving the bin name for each value; internally it contains a categories array specifying the distinct category names, along with labels for the ages data in its codes attribute:
print(cats.codes)
-------------------------
[0 0 0 1 0 0 2 1 3 2 2 1]
print(cats.categories)
-------------------------------------------------------
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
print(pd.value_counts(cats))
--------------
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
Here, pd.value_counts(cats) gives the bin counts for the result of pandas.cut.
Consistent with mathematical notation for intervals, a parenthesis means that side is open (exclusive), while a square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:
print(pd.cut(ages,[18,26,36,61,100],right = False))
---------------------------------------------------------------------------------------------------------
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
You can also pass your own bin names by passing a list or array to the labels option:
group_names = ['Youth','YoungAdult','MiddleAged','Senior']
print(pd.cut(ages,bins,labels=group_names))
-----------------------------------------------------------------------------
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
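As an aside, if you only need integer bin indicators rather than labelled categories, cut also accepts labels=False (a small sketch, not part of the original text):

# labels=False returns an array of integer bin codes instead of a Categorical
print(pd.cut(ages, bins, labels=False))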
If you pass an integer number of bins to cut instead of explicit bin edges, pandas will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:
data = np.random.rand(20)
print(pd.cut(data,4,precision=2))
---------------------------------------------------------------------------------------------------------------------------------------------------
[(0.72, 0.95], (0.25, 0.48], (0.019, 0.25], (0.72, 0.95], (0.25, 0.48], ..., (0.72, 0.95], (0.48, 0.72], (0.72, 0.95], (0.019, 0.25], (0.25, 0.48]]
Length: 20
Categories (4, interval[float64]): [(0.019, 0.25] < (0.25, 0.48] < (0.48, 0.72] < (0.72, 0.95]]
The precision=2 option limits the decimal precision of the bin edges to two digits.
qcut is a closely related function that bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, it yields roughly equal-size bins, each containing about the same number of data points:
data = np.random.randn(1000)
cats = pd.qcut(data,4)
print(cats)
---------------------------------------------------------------------------------------------------------------------------------------------------
[(-0.633, 0.0281], (0.732, 2.705], (0.0281, 0.732], (-3.221, -0.633], (-3.221, -0.633], ..., (-0.633, 0.0281], (0.0281, 0.732], (-3.221, -0.633], (-0.633, 0.0281], (-3.221, -0.633]]
Length: 1000
Categories (4, interval[float64]): [(-3.221, -0.633] < (-0.633, 0.0281] < (0.0281, 0.732] < (0.732, 2.705]]
print(cats.value_counts())
------------------------------------
(-3.221, -0.633]    250
(-0.633, 0.0281]    250
(0.0281, 0.732]     250
(0.732, 2.705]      250
dtype: int64
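Similar to cut, qcut can also be given explicit quantiles (numbers between 0 and 1, inclusive) instead of a number of bins; a brief sketch using the same data:

# Custom quantiles: 10%, 40%, 40% and 10% of the data points per bin
print(pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]).value_counts())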
1.6 Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with normally distributed data:
data = pd.DataFrame(np.random.randn(1000,4))
print(data.describe())
---------------------------------------------------------
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.000511 -0.021196 -0.031225 -0.010533
std 0.990803 0.990293 0.947879 1.009659
min -3.577692 -2.771380 -2.759002 -3.021722
25% -0.668135 -0.704517 -0.700162 -0.671600
50% 0.008604 -0.060045 -0.072687 -0.035403
75% 0.693142 0.618066 0.601413 0.693484
max 3.419051 3.714937 3.881218 3.615359
Suppose we want to find the values in one of the columns whose absolute value exceeds 3:
col = data[2]
print(col[np.abs(col) > 3])
-----------------------
721 3.052446
Name: 2, dtype: float64
To select all rows having a value exceeding 3 or -3, we can use the any method on a boolean DataFrame:
print(data[(np.abs(data) > 3).any(axis=1)])
-------------------------------------------
0 1 2 3
66 0.442000 -3.667465 0.134274 1.486624
296 0.111119 0.253513 3.058447 -2.098602
535 0.787788 3.190138 -0.741357 0.391135
538 -0.591268 -1.335684 -3.062085 0.679055
979 3.112996 0.512741 -1.307721 -0.389606
Values can be set based on these criteria. The following code caps values outside the interval -3 to 3:
data[np.abs(data) > 3] = np.sign(data) * 3
print(data.describe())
---------------------------------------------------------
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.040334 0.011034 -0.005806 -0.041304
std 1.041037 1.016363 0.969064 1.005545
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.676272 -0.691892 -0.652303 -0.679903
50% 0.035862 0.000163 0.002989 -0.018314
75% 0.761366 0.700559 0.647850 0.646082
max 3.000000 3.000000 3.000000 3.000000
The expression np.sign(data) produces 1 and -1 values according to whether each value in data is positive or negative:
print(np.sign(data).head())
---------------------
0 1 2 3
0 1.0 1.0 -1.0 -1.0
1 -1.0 1.0 1.0 1.0
2 -1.0 1.0 -1.0 1.0
3 1.0 -1.0 1.0 1.0
4 -1.0 -1.0 -1.0 -1.0
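The same kind of capping can also be expressed with DataFrame.clip, which bounds values to a given range; a minimal alternative sketch (not from the original text) applied to a fresh DataFrame:

# clip replaces values below -3 with -3 and values above 3 with 3
fresh = pd.DataFrame(np.random.randn(1000, 4))
print(fresh.clip(lower=-3, upper=3).describe())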
1.7 Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows of a DataFrame is easy to do with numpy.random.permutation. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
df = pd.DataFrame(np.arange(5 * 4).reshape((5,4)))
sampler = np.random.permutation(5)
print(sampler)
-----------
[4 0 2 3 1]
That integer array can then be used in iloc-based indexing or with the equivalent take function:
print(df)
-----------------
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
print(df.take(sampler))
-----------------
0 1 2 3
4 16 17 18 19
0 0 1 2 3
2 8 9 10 11
3 12 13 14 15
1 4 5 6 7
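As mentioned above, the same reordering can be obtained with iloc-based indexing; a minimal sketch:

# iloc with the permutation array gives the same rows as df.take(sampler)
print(df.iloc[sampler])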
1.8 Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns of 1s and 0s. pandas has a get_dummies function for doing this:
df = pd.DataFrame({'key':['b','b','a','c','a','b'],
'data1':range(6)})
print(pd.get_dummies(df['key']))
----------
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In some cases, you may want to add a prefix to the columns of the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:
dummies = pd.get_dummies(df['key'],prefix = 'key')
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)
-----------------------------
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:
np.random.seed(12345)
values = np.random.randn(10)
print(values)
------------------------------------------------------------------------
[-0.20470766 0.47894334 -0.51943872 -0.5557303 1.96578057 1.39340583
0.09290788 0.28174615 0.76902257 1.24643474]
bins = [0,0.2,0.4,0.6,0.8,1]
print(pd.get_dummies(pd.cut(values,bins)))
-------------------------------------------------------------
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 0 0 0
1 0 0 1 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 0 1 0
9 0 0 0 0 0
We use numpy.random.seed to set the random seed so that the example is reproducible. Note that values falling outside every bin (here, anything outside the interval (0, 1]) produce rows of all zeros.