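Data Cleaning, Merging, Transformation, and Reshaping
- Data cleaning is a key step in data analysis and directly affects all subsequent processing.
- Does the data need to be modified? If so, what needs to change? How should the data be adjusted to suit the analysis and mining that follow?
- Cleaning is an iterative process; in a real project these operations may need to be run more than once.
Handling missing data: DataFrame.fillna(), DataFrame.dropna()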
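A minimal sketch of the two missing-data methods named above. The small DataFrame and the fill value 0 are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing values
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

filled = df.fillna(0)   # replace every NaN with 0
dropped = df.dropna()   # keep only the rows that contain no NaN

print(filled)
print(dropped)
```

Which of the two is appropriate depends on the analysis: filling keeps every row but introduces artificial values, while dropping discards whole rows.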
1. Joining data (pd.merge)
- pd.merge
- Joins the rows of different DataFrames on one or more keys
- Similar to a database join
Example code:
import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2': np.random.randint(0, 10, 3)})
print(df_obj1)
print(df_obj2)
Output (the values are random, so they differ from run to run):
data1 key
0 8 b
1 8 b
2 3 a
3 5 c
4 4 a
5 9 a
6 6 b
data2 key
0 9 a
1 0 b
2 3 d
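1. By default, the overlapping column names are used as the join keys
Example code:
# By default, the overlapping column names are used as the join keys
print(pd.merge(df_obj1, df_obj2))
Output:
data1 key data2
0 8 b 0
1 8 b 0
2 6 b 0
3 3 a 9
4 4 a 9
5 9 a 9
2. on explicitly specifies the join key
Example code:
# on explicitly specifies the join key
print(pd.merge(df_obj1, df_obj2, on='key'))
Output: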
data1 key data2
0 8 b 0
1 8 b 0
2 6 b 0
3 3 a 9
4 4 a 9
5 9 a 9
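3. left_on names the join key of the left DataFrame; right_on names the join key of the right DataFrame
Example code:
# left_on and right_on specify the join keys of the left and right DataFrames
# Rename the key columns so that they no longer overlap
df_obj1 = df_obj1.rename(columns={'key': 'key1'})
df_obj2 = df_obj2.rename(columns={'key': 'key2'})
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))
Output: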
data1 key1 data2 key2
0 8 b 0 b
1 8 b 0 b
2 6 b 0 b
3 3 a 9 a
4 4 a 9 a
5 9 a 9 a
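The default is an inner join (inner), so the result contains only the keys present in both DataFrames (the intersection)
how specifies the join type
4. Outer join (outer): the result contains the union of the keys
Example code:
# Outer join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))
Output: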
data1 key1 data2 key2
0 8.0 b 0.0 b
1 8.0 b 0.0 b
2 6.0 b 0.0 b
3 3.0 a 9.0 a
4 4.0 a 9.0 a
5 9.0 a 9.0 a
6 5.0 c NaN NaN
7 NaN NaN 3.0 d
5. Left join (left): every row of the left DataFrame is kept
Example code:
# Left join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))
Output:
data1 key1 data2 key2
0 8 b 0.0 b
1 8 b 0.0 b
2 3 a 9.0 a
3 5 c NaN NaN
4 4 a 9.0 a
5 9 a 9.0 a
6 6 b 0.0 b
6. Right join (right): every row of the right DataFrame is kept
Example code:
# Right join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))
Output:
data1 key1 data2 key2
0 8.0 b 0 b
1 8.0 b 0 b
2 6.0 b 0 b
3 3.0 a 9 a
4 4.0 a 9 a
5 9.0 a 9 a
6 NaN NaN 3 d
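To compare the four join types side by side, here is a small sketch with fixed (non-random) data that uses the same keys as the examples above. The names left, right and row_counts are made up for illustration; indicator=True is a standard pd.merge option that adds a _merge column recording where each row came from.

```python
import pandas as pd

# Hypothetical fixed data with the same keys as the examples above
left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# Number of result rows for each join type
row_counts = {how: len(pd.merge(left, right, on='key', how=how))
              for how in ['inner', 'outer', 'left', 'right']}
print(row_counts)  # {'inner': 6, 'outer': 8, 'left': 7, 'right': 7}

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
left_only = outer.loc[outer['_merge'] == 'left_only', 'key'].tolist()
right_only = outer.loc[outer['_merge'] == 'right_only', 'key'].tolist()
print(left_only, right_only)  # ['c'] ['d']
```

The key 'c' exists only on the left and 'd' only on the right, which is why the inner join has two rows fewer than the outer join.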
7. Handling duplicate column names
suffixes sets the suffixes appended to overlapping column names; the default is ('_x', '_y')
Example code:
# Handling duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data': np.random.randint(0, 10, 3)})
print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
Output:
data_left key data_right
0 9 b 1
1 5 b 1
2 1 b 1
3 2 a 8
4 2 a 8
5 5 a 8
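For comparison, a short sketch of what happens when suffixes is left at its default. The two tiny DataFrames are invented for illustration.

```python
import pandas as pd

# Hypothetical frames that share a non-key column named 'data'
left = pd.DataFrame({'key': ['a', 'b'], 'data': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'data': [3, 4]})

merged = pd.merge(left, right, on='key')  # suffixes defaults to ('_x', '_y')
print(list(merged.columns))  # ['key', 'data_x', 'data_y']
```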
8. Joining on the index
left_index=True or right_index=True
Example code:
# Join on the index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'data2': np.random.randint(0, 10, 3)}, index=['a', 'b', 'd'])
print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))
Output:
data1 key data2
0 3 b 6
1 4 b 6
6 8 b 6
2 6 a 0
4 3 a 0
5 0 a 0
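Both flags can also be set at once, in which case the two row indexes act as the join keys. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical frames whose row index holds the key
left = pd.DataFrame({'data1': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'data2': [10, 20]}, index=['a', 'b'])

# Inner join on the two indexes; 'c' has no match and is dropped
merged = pd.merge(left, right, left_index=True, right_index=True)
print(merged)
```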
2. Combining data (pd.concat)
- Concatenates multiple objects along an axis
1. Concatenation in NumPy
np.concatenate
Example code:
import numpy as np
import pandas as pd

arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))
print(arr1)
print(arr2)
print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))
Output:
# print(arr1)
[[3 3 0 8]
[2 0 3 1]
[4 8 8 2]]
# print(arr2)
[[6 8 7 3]
[1 6 8 7]
[1 4 7 1]]
# print(np.concatenate([arr1, arr2]))
[[3 3 0 8]
[2 0 3 1]
[4 8 8 2]
[6 8 7 3]
[1 6 8 7]
[1 4 7 1]]
# print(np.concatenate([arr1, arr2], axis=1))
[[3 3 0 8 6 8 7 3]
[2 0 3 1 1 6 8 7]
[4 8 8 2 1 4 7 1]]
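The effect of axis is easiest to see in the result shapes. A small sketch with fixed arrays (the names a, b, rows and cols are illustrative only):

```python
import numpy as np

a = np.zeros((3, 4), dtype=int)
b = np.ones((3, 4), dtype=int)

rows = np.concatenate([a, b])          # default axis=0: rows are appended
cols = np.concatenate([a, b], axis=1)  # axis=1: columns are appended

print(rows.shape)  # (6, 4)
print(cols.shape)  # (3, 8)
```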
2. pd.concat
- Pay attention to the axis; the default is axis=0
- join specifies how the other axis is combined; the default is outer
- When concatenating Series, check whether the row index contains duplicates
1) Index without duplicates
Example code:
# Index without duplicates
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0, 5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5, 9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9, 12))
print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
Output:
# print(ser_obj1)
0 1
1 8
2 4
3 9
4 4
dtype: int64
# print(ser_obj2)
5 2
6 6
7 4
8 2
dtype: int64
# print(ser_obj3)
9 6
10 2
11 7
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0 1
1 8
2 4
3 9
4 4
5 2
6 6
7 4
8 2
9 6
10 2
11 7
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
0 1 2
0 1.0 NaN NaN
1 8.0 NaN NaN
2 4.0 NaN NaN
3 9.0 NaN NaN
4 4.0 NaN NaN
5 NaN 2.0 NaN
6 NaN 6.0 NaN
7 NaN 4.0 NaN
8 NaN 2.0 NaN
9 NaN NaN 6.0
10 NaN NaN 2.0
11 NaN NaN 7.0
2) Index with duplicates
Example code:
# Index with duplicates
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))
print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
Output:
# print(ser_obj1)
0 0
1 3
2 7
3 2
4 5
dtype: int64
# print(ser_obj2)
0 5
1 1
2 9
3 9
dtype: int64
# print(ser_obj3)
0 8
1 7
2 9
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0 0
1 3
2 7
3 2
4 5
0 5
1 1
2 9
3 9
0 8
1 7
2 9
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
# join='inner' keeps only the index labels shared by all inputs, which drops the rows or columns that would contain NaN
0 1 2
0 0 5 8
1 3 1 7
2 7 9 9
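When the inputs share index labels, the concatenated result has a non-unique index. The sketch below, with made-up Series, shows one way to detect this and two standard pd.concat options for dealing with it: ignore_index renumbers the result, and keys adds an outer index level that labels each input.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5])

combined = pd.concat([s1, s2])
print(combined.index.is_unique)  # False: labels 0 and 1 appear twice

# Option 1: discard the old labels and renumber from 0
renumbered = pd.concat([s1, s2], ignore_index=True)

# Option 2: keep the old labels under an extra outer level
labelled = pd.concat([s1, s2], keys=['first', 'second'])
print(labelled.loc[('second', 1)])  # 5
```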
3) When concatenating DataFrames, check both the row index and the column index for duplicates
Example code:
df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
                       columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
                       columns=['C', 'D'])
print(df_obj1)
print(df_obj2)
print(pd.concat([df_obj1, df_obj2]))
print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
Output:
# print(df_obj1)
A B
a 3 3
b 5 4
c 8 6
# print(df_obj2)
C D
a 1 9
b 6 8
# print(pd.concat([df_obj1, df_obj2]))
A B C D
a 3.0 3.0 NaN NaN
b 5.0 4.0 NaN NaN
c 8.0 6.0 NaN NaN
a NaN NaN 1.0 9.0
b NaN NaN 6.0 8.0
# print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
A B C D
a 3 3 1 9
b 5 4 6 8
3. Reshaping data
1. stack
- Rotates the column index into the row index, producing a hierarchical index
- DataFrame -> Series
Example code:
import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0, 10, (5, 2)), columns=['data1', 'data2'])
print(df_obj)

stacked = df_obj.stack()
print(stacked)
Output:
# print(df_obj)
data1 data2
0 7 9
1 7 8
2 8 9
3 4 1
4 1 2
# print(stacked)
0 data1 7
data2 9
1 data1 7
data2 8
2 data1 8
data2 9
3 data1 4
data2 1
4 data1 1
data2 2
dtype: int64
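2. unstack
- Expands a hierarchical index
- Series -> DataFrame
- By default it operates on the innermost index level, i.e. level=-1
Example code:
# By default, the innermost index level is unstacked
print(stacked.unstack())
# level selects which index level to unstack
print(stacked.unstack(level=0))
Output: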
# print(stacked.unstack())
data1 data2
0 7 9
1 7 8
2 8 9
3 4 1
4 1 2
# print(stacked.unstack(level=0))
0 1 2 3 4
data1 7 7 8 4 1
data2 9 8 9 1 2
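A small fixed-data sketch of the round trip: stack produces a two-level index, unstack with the default level restores the original layout, and level=0 yields the transposed layout. The DataFrame here is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'data1': [1, 2], 'data2': [3, 4]})

stacked = df.stack()
print(stacked.index.nlevels)  # 2: (row label, column name)

restored = stacked.unstack()           # innermost level becomes the columns again
by_row = stacked.unstack(level=0)      # outer level becomes the columns instead

print(restored)
print(by_row)
```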
4. Transforming data
I. Handling duplicate data
1. duplicated() returns a boolean Series indicating whether each row duplicates an earlier row
Example code:
import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
                       'data2': np.random.randint(0, 4, 8)})
print(df_obj)
print(df_obj.duplicated())
Output:
# print(df_obj)
data1 data2
0 a 3
1 a 2
2 a 3
3 a 3
4 b 1
5 b 0
6 b 3
7 b 0
# print(df_obj.duplicated())
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 True
dtype: bool
2. drop_duplicates() removes duplicate rows
By default, all columns are compared
A subset of columns can be specified instead
Example code:
print(df_obj.drop_duplicates())
print(df_obj.drop_duplicates('data2'))
Output:
# print(df_obj.drop_duplicates())
data1 data2
0 a 3
1 a 2
4 b 1
5 b 0
6 b 3
# print(df_obj.drop_duplicates('data2'))
data1 data2
0 a 3
1 a 2
4 b 1
5 b 0
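Both methods keep the first occurrence by default. The sketch below uses made-up data to show the subset and keep parameters of drop_duplicates, which control which columns are compared and which occurrence survives.

```python
import pandas as pd

df = pd.DataFrame({'data1': ['a', 'a', 'b', 'b'],
                   'data2': [1, 1, 2, 3]})

# Row 1 repeats row 0 in every column
print(df.duplicated().tolist())  # [False, True, False, False]

# Compare only 'data1' and keep the last row of each group
last_per_group = df.drop_duplicates(subset=['data1'], keep='last')
print(last_per_group)
```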
3. Transforming values with the function passed to map
- Series.map applies the function passed to it to each element of the Series
Example code:
ser_obj = pd.Series(np.random.randint(0, 10, 10))
print(ser_obj)
print(ser_obj.map(lambda x: x ** 2))
Output:
# print(ser_obj)
0 1
1 4
2 8
3 6
4 8
5 6
6 6
7 4
8 7
9 3
dtype: int64
# print(ser_obj.map(lambda x : x ** 2))
0 1
1 16
2 64
3 36
4 64
5 36
6 36
7 16
8 49
9 9
dtype: int64
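Series.map also accepts a dict in place of a function. A short sketch with invented values; note that elements without an entry in the dict become NaN.

```python
import pandas as pd

ser = pd.Series(['a', 'b', 'c'])

# Hypothetical lookup table; 'c' is deliberately missing
mapped = ser.map({'a': 1, 'b': 2})
print(mapped)
```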
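II. Replacing data
replace substitutes values based on their content
Example code:
# Replace a single value with a single value
print(ser_obj.replace(1, -100))
# Replace several values with one value
print(ser_obj.replace([6, 8], -100))
# Replace several values with several values
print(ser_obj.replace([4, 7], [-100, -200]))
Output: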
# print(ser_obj.replace(1, -100))
0 -100
1 4
2 8
3 6
4 8
5 6
6 6
7 4
8 7
9 3
dtype: int64
# print(ser_obj.replace([6, 8], -100))
0 1
1 4
2 -100
3 -100
4 -100
5 -100
6 -100
7 4
8 7
9 3
dtype: int64
# print(ser_obj.replace([4, 7], [-100, -200]))
0 1
1 -100
2 8
3 6
4 8
5 6
6 6
7 -100
8 -200
9 3
dtype: int64
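The several-to-several form can also be written with a dict, which keeps each old value next to its replacement. A minimal sketch with made-up data:

```python
import pandas as pd

ser = pd.Series([1, 4, 7, 4])

# Each key is replaced by its value; other elements are left unchanged
replaced = ser.replace({4: -100, 7: -200})
print(replaced.tolist())  # [1, -100, -200, -100]
```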
III. Global food data analysis
Project reference: https://www.kaggle.com/bhouwens/d/openfoodfacts/world-food-facts/how-much-sugar-do-we-eat/discussion
# -*- coding: utf-8 -*-
import os
import zipfile  # handles zip archives

import pandas as pd
import matplotlib.pyplot as plt


def unzip(zip_filepath, dest_path):
    """
    Extract a zip file into dest_path.
    """
    with zipfile.ZipFile(zip_filepath) as zf:
        zf.extractall(path=dest_path)


def get_dataset_filename(zip_filepath):
    """
    Return the name of the dataset file (the first entry in the zip).
    """
    with zipfile.ZipFile(zip_filepath) as zf:
        return zf.namelist()[0]


def main():
    """
    Main function.
    """
    # Declare variables
    dataset_path = './data'  # dataset directory
    zip_filename = 'open-food-facts.zip'  # zip file name
    zip_filepath = os.path.join(dataset_path, zip_filename)  # path to the zip file
    dataset_filename = get_dataset_filename(zip_filepath)  # dataset file name (inside the zip)
    dataset_filepath = os.path.join(dataset_path, dataset_filename)  # path to the extracted dataset

    print('Extracting zip...', end='')
    unzip(zip_filepath, dataset_path)
    print('Done.')

    # Read the data
    data = pd.read_csv(dataset_filepath, usecols=['countries_en', 'additives_n'])

    # Analyze the number of food additives in each country's food
    # 1. Data cleaning
    # Drop rows with missing data
    data = data.dropna()  # or data.dropna(inplace=True)

    # Convert country names to lowercase
    data['countries_en'] = data['countries_en'].str.lower()

    # 2. Group by country and compute the mean
    country_additives = data['additives_n'].groupby(data['countries_en']).mean()

    # 3. Sort by value in descending order
    result = country_additives.sort_values(ascending=False)

    # 4. Plot the top 10 with pandas
    result.iloc[:10].plot.bar()
    plt.show()

    # 5. Save the result
    result.to_csv('./country_additives.csv')

    # Delete the extracted data to free space (optional)
    if os.path.exists(dataset_filepath):
        os.remove(dataset_filepath)


if __name__ == '__main__':
    main()
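The clean, group, and sort steps of the script can be tried without downloading the dataset. The sketch below runs them on a tiny in-memory DataFrame; the rows are invented for illustration and are not taken from the Open Food Facts data.

```python
import pandas as pd

# Hypothetical stand-in for the two columns read from the CSV
data = pd.DataFrame({
    'countries_en': ['France', 'france', 'Germany', None, 'Germany'],
    'additives_n': [2.0, 4.0, 1.0, 3.0, None],
})

# 1. Clean: drop incomplete rows, then normalize the country names
data = data.dropna()
data['countries_en'] = data['countries_en'].str.lower()

# 2. Group by country and average; 3. sort in descending order
result = (data.groupby('countries_en')['additives_n']
              .mean()
              .sort_values(ascending=False))
print(result)
```

Lowercasing before grouping matters here: without it, 'France' and 'france' would be counted as two different countries.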