Data Analysis with Python — Data Wrangling: Cleaning, Transforming, Merging, Reshaping (7)(4)
阿新 · Published 2019-01-17
1. Data Transformation
So far we have covered rearranging data. Another class of important operations is filtering, cleaning, and other transformations.
2. Removing Duplicates
Duplicate rows often appear in a DataFrame. Here is an example:
In [4]: data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,
                             'k2':[1, 1, 2, 3, 3, 4, 4]})
In [5]: data
Out[5]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
[7 rows x 2 columns]
The duplicated method of DataFrame returns a boolean Series indicating whether each row is a duplicate:
In [6]: data.duplicated()
Out[6]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
A related method, drop_duplicates, returns a DataFrame with the duplicate rows removed:
In [7]: data.drop_duplicates()
Out[7]:
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
[4 rows x 2 columns]
Both of these methods consider all of the columns by default; alternatively, you can specify a subset of columns on which to detect duplicates. Suppose you have an additional column of values and want to filter duplicates based only on the 'k1' column:
In [8]: data['v1'] = range(7)
In [9]: data
Out[9]:
k1 k2 v1
0 one 1 0
1 one 1 1
2 one 2 2
3 two 3 3
4 two 3 4
5 two 4 5
6 two 4 6
[7 rows x 3 columns]
In [10]: data.drop_duplicates(['k1'])
Out[10]:
k1 k2 v1
0 one 1 0
3 two 3 3
[2 rows x 3 columns]
By default, duplicated and drop_duplicates keep the first observed value combination. Passing take_last=True keeps the last one instead:
In [11]: data.drop_duplicates(['k1', 'k2'], take_last=True)
Out[11]:
k1 k2 v1
1 one 1 1
2 one 2 2
4 two 3 4
6 two 4 6
[4 rows x 3 columns]
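Note that take_last=True comes from an older pandas release; it was later replaced by the keep parameter. A minimal sketch of the modern equivalent, reusing the frame built above:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data['v1'] = range(7)

# keep='last' is the current spelling of take_last=True:
# within each (k1, k2) group, only the last row survives
last = data.drop_duplicates(['k1', 'k2'], keep='last')
```

keep also accepts 'first' (the default) and False, which drops every member of a duplicated group.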
3. Transforming Data Using a Function or Mapping
For many data sets, you may wish to perform a transformation based on the values in an array, Series, or DataFrame column. Consider the following data about various kinds of meat:
In [12]: data = pd.DataFrame({'food':['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
....: 'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
In [13]: data
Out[13]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
[9 rows x 2 columns]
Suppose you want to add a column indicating the type of animal that each food came from. Let's first write a mapping from each meat to the animal:
In [14]: meat_to_animal = {
....: 'bacon': 'pig',
....: 'pulled pork': 'pig',
....: 'pastrami': 'cow',
....: 'corned beef': 'cow',
....: 'honey ham': 'pig',
....: 'nova lox': 'salmon'
....: }
The map method on a Series accepts a function or a dict-like object containing a mapping, but here we have a small problem: some of the meats are capitalized while others are not. We therefore also need to convert each value to lowercase:
In [15]: data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
In [16]: data
Out[16]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
[9 rows x 3 columns]
We could also have passed a function that does all of the work:
In [17]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[17]:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Note:
Using map is a convenient way to perform element-wise transformations and other data cleaning operations.
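The chained map(str.lower) above can also be written with the vectorized str accessor, which propagates missing values instead of raising on them; a small sketch using a subset of the foods above:

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'Pastrami', 'nova lox'],
                     'ounces': [4, 6, 6]})
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}

# .str.lower() lowercases each string element; NaN values pass through as NaN
data['animal'] = data['food'].str.lower().map(meat_to_animal)
```

This matters when the column may contain missing values: map(str.lower) would raise a TypeError on NaN, while the str accessor simply yields NaN.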
4. Replacing Values
Filling in missing data with the fillna method can be seen as a special case of more general value replacement. While map, as shown above, can be used to modify a subset of values in an object, replace provides a simpler and more flexible way to do so. Let's look at this Series:
In [18]: data = pd.Series([1., -999, 2., -999, -1000., 3.])
In [19]: data
Out[19]:
0 1
1 -999
2 2
3 -999
4 -1000
5 3
dtype: float64
The -999 values might be sentinel values for missing data. To replace them with NA values that pandas understands, we can use replace, producing a new Series:
In [20]: data.replace(-999, np.nan)
Out[20]:
0 1
1 NaN
2 2
3 NaN
4 -1000
5 3
dtype: float64
If you want to replace multiple values at once, pass a list of the values to be replaced along with the substitute value:
In [21]: data.replace([-999, -1000], np.nan)
Out[21]:
0 1
1 NaN
2 2
3 NaN
4 NaN
5 3
dtype: float64
To use a different replacement for each value, pass a list of substitutes:
In [22]: data.replace([-999, -1000], [np.nan, 0])
Out[22]:
0 1
1 NaN
2 2
3 NaN
4 0
5 3
dtype: float64
The argument passed can also be a dict:
In [23]: data.replace({-999: np.nan, -1000: 0})
Out[23]:
0 1
1 NaN
2 2
3 NaN
4 0
5 3
dtype: float64
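replace is not limited to Series: the same dict form works column-wise on a DataFrame, and in every variant above a new object is returned while the original is left untouched. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1., -999., 3.],
                   'b': [-1000., 2., -999.]})

# same sentinel-to-replacement dict as in the Series example;
# replace returns a new DataFrame and leaves df unchanged
cleaned = df.replace({-999: np.nan, -1000: 0})
```

To mutate the original object instead, replace also accepts inplace=True, like most pandas methods.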
5. Renaming Axis Indexes
Like values in a Series, axis labels can be transformed by a function or some mapping to produce a new, differently labeled object. The axes can also be modified in place without creating a new data structure. Here's a simple example:
In [24]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
....: index=['Ohio', 'Colorado', 'New York'],
....: columns=['one', 'two', 'three', 'four'])
Like a Series, the axis indexes have a map method:
In [25]: data.index.map(str.upper)
Out[25]: array(['OHIO', 'COLORADO', 'NEW YORK'], dtype=object)
You can assign the result to index, modifying the DataFrame in place:
In [26]: data.index = data.index.map(str.upper)
In [27]: data
Out[27]:
one two three four
OHIO 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11
[3 rows x 4 columns]
If you want to create a transformed version of a data set without modifying the original, a more practical approach is rename:
In [28]: data.rename(index=str.title, columns=str.upper)
Out[28]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
[3 rows x 4 columns]
Notably, rename can be used in conjunction with a dict-like object to update only a subset of the axis labels:
In [31]: data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
Out[31]:
one two peekaboo four
INDIANA 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11
[3 rows x 4 columns]
rename saves you the chore of copying the DataFrame and assigning new values to its index and columns. If you wish to modify a data set in place, pass inplace=True:
In [32]: # always returns a reference to the DataFrame
In [33]: _ = data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
In [34]: data
Out[34]:
one two three four
INDIANA 0 1 2 3
COLORADO 4 5 6 7
NEW YORK 8 9 10 11
[3 rows x 4 columns]
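The "always returns a reference" comment above reflects older pandas; in recent versions rename with inplace=True returns None, so capture the result only when inplace is left False. A sketch of the modern behavior:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['OHIO', 'COLORADO', 'NEW YORK'],
                    columns=['one', 'two', 'three', 'four'])

# with inplace=True, data is mutated and the return value is None
result = data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
```

Because of this, idioms like `_ = data.rename(..., inplace=True)` are harmless but no longer capture anything useful.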
6. Discretization and Binning
Continuous data is often discretized or otherwise split into "bins" for analysis. Suppose you have data about a group of people and want to group them into discrete age buckets:
In [35]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Next, let's divide these into bins of "18 to 25", "26 to 35", "35 to 60", and "60 and above". To do so, you use cut, a function in pandas:
In [36]: bins = [18, 25, 35
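The snippet above is cut off; a complete sketch of the binning it describes follows, where the upper edge of 100 is an assumption chosen only to cover the "60 and above" group:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]  # upper edge 100 is an assumed cap

# cut returns a Categorical of intervals; by default each bin is
# open on the left and closed on the right, e.g. (18, 25]
cats = pd.cut(ages, bins)
```

Each element of `cats` is the interval its age falls into, so downstream grouping or counting can treat the ages as categorical data.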