
Data Cleaning, Merging, Transformation, and Reshaping


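  • Data cleaning is a key step in data analysis and directly affects all subsequent processing.

  • Does the data need to be modified? What exactly needs to change? How should the data be adjusted to suit the analysis and mining that follow?

  • It is an iterative process; in real projects these cleaning operations may need to be run more than once.

  • Handling missing data: DataFrame.fillna(), DataFrame.dropna()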
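To make the missing-data step concrete, here is a minimal sketch on a small made-up table (the column names and values are assumptions for illustration only). It drops incomplete rows in one case and fills the gaps with per-column values in the other.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'score': [1.0, np.nan, 3.0],
                   'age': [20.0, 30.0, np.nan]})

# Drop every row that contains at least one NaN
dropped = df.dropna()

# Fill NaN with a per-column replacement value
filled = df.fillna({'score': 0.0, 'age': df['age'].mean()})

print(dropped)
print(filled)
```

Both calls return a new DataFrame and leave df untouched. Which one is appropriate depends on how much data you can afford to lose.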

1. Joining data (pd.merge)

  • pd.merge

  • Joins the rows of different DataFrames based on one or more keys

  • Similar to a database join

Example code:

import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2': np.random.randint(0, 10, 3)})

print(df_obj1)
print(df_obj2)

Output:

   data1 key
0      8   b
1      8   b
2      3   a
3      5   c
4      4   a
5      9   a
6      6   b

   data2 key
0      9   a
1      0   b
2      3   d

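1. By default, the overlapping column names are used as the join key

Example code:

# By default, the overlapping column names are used as the join key
print(pd.merge(df_obj1, df_obj2))

Output: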

   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9

2. on explicitly specifies the join key

Example code:

# on explicitly specifies the join key
print(pd.merge(df_obj1, df_obj2, on='key'))

Output:

   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9
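pd.merge can also join on several keys at once by passing a list to on. The sketch below uses two made-up tables (left, right, and their columns are hypothetical names, not part of the example above) to show that a row matches only when every key column agrees.

```python
import pandas as pd

# Hypothetical tables that share two key columns
left = pd.DataFrame({'key1': ['a', 'a', 'b'],
                     'key2': ['x', 'y', 'x'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['a', 'b', 'b'],
                      'key2': ['x', 'x', 'y'],
                      'rval': [10, 20, 30]})

# A row matches only when both key1 and key2 agree
merged = pd.merge(left, right, on=['key1', 'key2'])
print(merged)
```

Only the pairs (a, x) and (b, x) occur in both tables, so the inner join keeps two rows.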

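3. left_on is the join key of the left DataFrame; right_on is the join key of the right DataFrame

Example code:

# left_on and right_on specify the join keys of the left and right DataFrames

# Rename the key columns
df_obj1 = df_obj1.rename(columns={'key': 'key1'})
df_obj2 = df_obj2.rename(columns={'key': 'key2'})

print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))

Output: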

   data1 key1  data2 key2
0      8    b      0    b
1      8    b      0    b
2      6    b      0    b
3      3    a      9    a
4      4    a      9    a
5      9    a      9    a

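The default is an inner join (inner): the keys in the result are the intersection of the keys on both sides.

how specifies the join type

4. Outer join (outer): the keys in the result are the union

Example code:

# Outer join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))

Output: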

   data1 key1  data2 key2
0    8.0    b    0.0    b
1    8.0    b    0.0    b
2    6.0    b    0.0    b
3    3.0    a    9.0    a
4    4.0    a    9.0    a
5    9.0    a    9.0    a
6    5.0    c    NaN  NaN
7    NaN  NaN    3.0    d

5. Left join (left)

Example code:

# Left join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))

Output:

   data1 key1  data2 key2
0      8    b    0.0    b
1      8    b    0.0    b
2      3    a    9.0    a
3      5    c    NaN  NaN
4      4    a    9.0    a
5      9    a    9.0    a
6      6    b    0.0    b

6. Right join (right)

Example code:

# Right join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))

Output:

   data1 key1  data2 key2
0    8.0    b      0    b
1    8.0    b      0    b
2    6.0    b      0    b
3    3.0    a      9    a
4    4.0    a      9    a
5    9.0    a      9    a
6    NaN  NaN      3    d
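The random values above make the four join types hard to compare at a glance. As a rough sketch with fixed data (left, right, and row_counts are names assumed here for illustration), the loop below runs the same merge with each value of how and records the number of rows and the keys that survive.

```python
import pandas as pd

# Same key layout as the example tables, but with fixed values
left = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                     'data1': range(7)})
right = pd.DataFrame({'key': ['a', 'b', 'd'],
                      'data2': range(3)})

row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    result = pd.merge(left, right, on='key', how=how)
    row_counts[how] = len(result)
    # Show which keys are present in the result of each join type
    print(how, len(result), sorted(result['key'].unique()))
```

The key c exists only on the left and d only on the right, so they appear only in the join types that keep unmatched rows from that side.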

7. Handling duplicate column names

suffixes; the default is _x, _y

Example code:

# Handling duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data': np.random.randint(0, 10, 3)})

print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))

Output:

   data_left key  data_right
0          9   b           1
1          5   b           1
2          1   b           1
3          2   a           8
4          2   a           8
5          5   a           8

8. Joining on the index

left_index=True or right_index=True

Example code:

# Joining on the index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'data2': np.random.randint(0, 10, 3)}, index=['a', 'b', 'd'])

print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))

Output:

   data1 key  data2
0      3   b      6
1      4   b      6
6      8   b      6
2      6   a      0
4      3   a      0
5      0   a      0
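The example above matches a column on the left against the index on the right. When both tables carry their keys in the index, left_index and right_index can be set together. A minimal sketch with made-up data (left and right are hypothetical names):

```python
import pandas as pd

# Hypothetical tables whose row labels act as the join keys
left = pd.DataFrame({'data1': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'data2': [10, 20]}, index=['b', 'd'])

# Both sides are matched on their index labels
merged = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(merged)
```

With how='outer' every label from either index is kept, and cells without a match are filled with NaN.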

2. Concatenating data (pd.concat)

  • Concatenates multiple objects along an axis

1. Concatenation in NumPy

np.concatenate

Example code:

import numpy as np
import pandas as pd

arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))

print(arr1)
print(arr2)

print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))

Output:

# print(arr1)
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]]

# print(arr2)
[[6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2]))
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]
 [6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2], axis=1)) 
[[3 3 0 8 6 8 7 3]
 [2 0 3 1 1 6 8 7]
 [4 8 8 2 1 4 7 1]]

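2. pd.concat

  • Mind the axis; the default is axis=0

  • join specifies how the other axis is handled; the default is outer

  • When concatenating Series, check whether the row index has duplicate labels

1) Index without duplicates
Example code:

# Index without duplicates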
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)

print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))

Output:

# print(ser_obj1)
0    1
1    8
2    4
3    9
4    4
dtype: int64

# print(ser_obj2)
5    2
6    6
7    4
8    2
dtype: int64

# print(ser_obj3)
9     6
10    2
11    7
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0     1
1     8
2     4
3     9
4     4
5     2
6     6
7     4
8     2
9     6
10    2
11    7
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
      0    1    2
0   1.0  NaN  NaN
1   8.0  NaN  NaN
2   4.0  NaN  NaN
3   9.0  NaN  NaN
4   4.0  NaN  NaN
5   NaN  2.0  NaN
6   NaN  6.0  NaN
7   NaN  4.0  NaN
8   NaN  2.0  NaN
9   NaN  NaN  6.0
10  NaN  NaN  2.0
11  NaN  NaN  7.0

2) Index with duplicates
Example code:

# Index with duplicates
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)

print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))

Output:

# print(ser_obj1)
0    0
1    3
2    7
3    2
4    5
dtype: int64

# print(ser_obj2)
0    5
1    1
2    9
3    9
dtype: int64

# print(ser_obj3)
0    8
1    7
2    9
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0    0
1    3
2    7
3    2
4    5
0    5
1    1
2    9
3    9
0    8
1    7
2    9
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
# join='inner' keeps only the index labels shared by all objects, so rows or columns that would contain NaN are dropped
   0  1  2
0  0  5  8
1  3  1  7
2  7  9  9
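When concatenating along axis=0 leaves repeated labels, as in the output above, there are a few common ways to deal with them. The sketch below uses fixed made-up values (ser1, ser2, and the key names are assumptions) and shows two options: renumbering with ignore_index, or tagging each piece with keys.

```python
import pandas as pd

ser1 = pd.Series([0, 3, 7], index=range(3))
ser2 = pd.Series([5, 1], index=range(2))

# Default: original labels are kept, so 0 and 1 appear twice
combined = pd.concat([ser1, ser2])
print(combined.index.is_unique)

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([ser1, ser2], ignore_index=True)
print(renumbered.index.tolist())

# keys= adds an outer index level recording where each piece came from
labelled = pd.concat([ser1, ser2], keys=['first', 'second'])
print(labelled.loc['second'])
```

ignore_index is the simpler choice when the old labels carry no meaning; keys is useful when you need to trace each value back to its source.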

3) When concatenating DataFrames, check both the row index and the column index for duplicates
Example code:

df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
                       columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
                       columns=['C', 'D'])
print(df_obj1)
print(df_obj2)

print(pd.concat([df_obj1, df_obj2]))
print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))

Output:

# print(df_obj1)
   A  B
a  3  3
b  5  4
c  8  6

# print(df_obj2)
   C  D
a  1  9
b  6  8

# print(pd.concat([df_obj1, df_obj2]))
     A    B    C    D
a  3.0  3.0  NaN  NaN
b  5.0  4.0  NaN  NaN
c  8.0  6.0  NaN  NaN
a  NaN  NaN  1.0  9.0
b  NaN  NaN  6.0  8.0

# print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
   A  B  C  D
a  3  3  1  9
b  5  4  6  8
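The join argument always applies to the axis that is not being concatenated. A small sketch with fixed made-up frames (df1, df2, rows, and cols are hypothetical names) shows the effect in both directions.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]}, index=['a', 'c'])

# axis=0 stacks rows; join='inner' keeps only the columns both frames have
rows = pd.concat([df1, df2], join='inner')
print(rows)

# axis=1 places frames side by side; join='inner' keeps only shared row labels
cols = pd.concat([df1, df2], axis=1, join='inner')
print(cols)
```

Note that cols ends up with two columns named B, because concat does not rename overlapping columns the way merge does with suffixes.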

3. Reshaping data

1. stack

  • Pivots the column index into the row index, producing a hierarchical index

  • DataFrame->Series

Example code:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0, 10, (5, 2)), columns=['data1', 'data2'])
print(df_obj)

stacked = df_obj.stack()
print(stacked)

Output:

# print(df_obj)
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2

# print(stacked)
0  data1    7
   data2    9
1  data1    7
   data2    8
2  data1    8
   data2    9
3  data1    4
   data2    1
4  data1    1
   data2    2
dtype: int64

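2. unstack

  • Expands a hierarchical index

  • Series->DataFrame

  • By default operates on the innermost level, i.e. level=-1

Example code:

# By default, operate on the innermost index level
print(stacked.unstack())

# Use level to choose which index level to unstack
print(stacked.unstack(level=0))

Output: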

# print(stacked.unstack())
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2

# print(stacked.unstack(level=0))
       0  1  2  3  4
data1  7  7  8  4  1
data2  9  8  9  1  2
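To see how stack and unstack relate, here is a short sketch on a fixed frame (the values are taken from the first three rows of the output above; round_trip and transposed are names assumed for illustration). It selects one cell through the two-level index and then reverses the operation in both directions.

```python
import pandas as pd

df = pd.DataFrame({'data1': [7, 7, 8], 'data2': [9, 8, 9]})

# stack() gives a Series indexed by (row label, column label)
stacked = df.stack()
print(stacked.loc[(1, 'data2')])

# Unstacking the inner level restores the original layout
round_trip = stacked.unstack()
print(round_trip)

# Unstacking the outer level gives the transposed layout
transposed = stacked.unstack(level=0)
print(transposed)
```

For a frame without missing values, stack followed by unstack is a lossless round trip.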

4. Transforming data

I. Handling duplicate data

1. duplicated() returns a Boolean Series indicating whether each row is a duplicate
Example code:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
                       'data2': np.random.randint(0, 4, 8)})
print(df_obj)

print(df_obj.duplicated())

Output:

# print(df_obj)
  data1  data2
0     a      3
1     a      2
2     a      3
3     a      3
4     b      1
5     b      0
6     b      3
7     b      0

# print(df_obj.duplicated())
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7     True
dtype: bool

2. drop_duplicates() filters out duplicate rows

By default all columns are compared

You can also specify which columns to compare

Example code:

print(df_obj.drop_duplicates())
print(df_obj.drop_duplicates('data2'))

Output:

# print(df_obj.drop_duplicates())
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
6     b      3

# print(df_obj.drop_duplicates('data2'))
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
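Both duplicated and drop_duplicates keep the first occurrence by default. The keep and subset parameters change that behavior; the following sketch uses a small fixed table (values are made up) to show the alternatives.

```python
import pandas as pd

df = pd.DataFrame({'data1': ['a', 'a', 'a', 'b', 'b'],
                   'data2': [3, 2, 3, 1, 1]})

# keep='last' marks every occurrence except the last one as a duplicate
flags = df.duplicated(keep='last')
print(flags.tolist())

# keep=False drops every row that has a duplicate anywhere
unique_only = df.drop_duplicates(keep=False)
print(unique_only)

# subset= restricts the comparison to the listed columns
last_per_group = df.drop_duplicates(subset=['data1'], keep='last')
print(last_per_group)
```

Passing the columns through the subset keyword rather than positionally also makes the intent easier to read.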

3. Transforming values with the function passed to map

  • Series.map applies the given function to each element of the Series

Example code:

ser_obj = pd.Series(np.random.randint(0, 10, 10))
print(ser_obj)

print(ser_obj.map(lambda x: x ** 2))

Output:

# print(ser_obj)
0    1
1    4
2    8
3    6
4    8
5    6
6    6
7    4
8    7
9    3
dtype: int64

# print(ser_obj.map(lambda x : x ** 2))
0     1
1    16
2    64
3    36
4    64
5    36
6    36
7    16
8    49
9     9
dtype: int64
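Besides a function, Series.map also accepts a dict, which then acts as a lookup table. A minimal sketch with made-up values (ser, codes, and upper are hypothetical names):

```python
import pandas as pd

ser = pd.Series(['a', 'b', 'c', 'a'])

# A dict works as a lookup table; values without an entry become NaN
codes = ser.map({'a': 1, 'b': 2})
print(codes)

# A function is called once per element
upper = ser.map(str.upper)
print(upper.tolist())
```

Because 'c' has no entry in the dict, the result contains a NaN and is therefore stored as floats rather than integers.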

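II. Replacing data

replace substitutes values based on their content

Example code:

# Replace a single value with a single value
print(ser_obj.replace(1, -100))

# Replace several values with one value
print(ser_obj.replace([6, 8], -100))

# Replace several values with several values
print(ser_obj.replace([4, 7], [-100, -200]))

Output: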

# print(ser_obj.replace(1, -100))
0   -100
1      4
2      8
3      6
4      8
5      6
6      6
7      4
8      7
9      3
dtype: int64

# print(ser_obj.replace([6, 8], -100))
0      1
1      4
2   -100
3   -100
4   -100
5   -100
6   -100
7      4
8      7
9      3
dtype: int64

# print(ser_obj.replace([4, 7], [-100, -200]))
0      1
1   -100
2      8
3      6
4      8
5      6
6      6
7   -100
8   -200
9      3
dtype: int64
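The two-list form pairs values by position, which is easy to misalign. A dict states each pairing explicitly; the sketch below uses fixed made-up values to show it.

```python
import pandas as pd

ser = pd.Series([1, 4, 8, 6, 4])

# A dict maps each old value to its own replacement
replaced = ser.replace({4: -100, 8: -200})
print(replaced.tolist())

# replace returns a new object; the original Series is unchanged
print(ser.tolist())
```

Values that do not appear as dict keys, such as 1 and 6 here, pass through untouched.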

III. Global food data analysis

Project reference: https://www.kaggle.com/bhouwens/d/openfoodfacts/world-food-facts/how-much-sugar-do-we-eat/discussion

# -*- coding: utf-8 -*-

# Handle zip archives
import zipfile
import os
import pandas as pd
import matplotlib.pyplot as plt


def unzip(zip_filepath, dest_path):
    """
        Extract a zip file
    """
    with zipfile.ZipFile(zip_filepath) as zf:
        zf.extractall(path=dest_path)


def get_dataset_filename(zip_filepath):
    """
        Get the dataset file name (the first entry in the archive)
    """
    with zipfile.ZipFile(zip_filepath) as zf:
        return zf.namelist()[0]


def main():
    """
        Main function
    """
    # Declare variables
    dataset_path = './data'  # dataset directory
    zip_filename = 'open-food-facts.zip'  # zip file name
    zip_filepath = os.path.join(dataset_path, zip_filename)  # path to the zip file
    dataset_filename = get_dataset_filename(zip_filepath)  # dataset file name (inside the zip)
    dataset_filepath = os.path.join(dataset_path, dataset_filename)  # path to the dataset file

    print('Extracting zip...', end='')
    unzip(zip_filepath, dataset_path)
    print('done.')

    # Read the data
    data = pd.read_csv(dataset_filepath, usecols=['countries_en', 'additives_n'])

    # Analyze the number of food additives in foods from each country
    # 1. Data cleaning
    # Drop missing data
    data = data.dropna()    # or data.dropna(inplace=True)

    # Convert country names to lowercase
    data['countries_en'] = data['countries_en'].str.lower()

    # 2. Group and aggregate
    country_additives = data['additives_n'].groupby(data['countries_en']).mean()

    # 3. Sort by value in descending order
    result = country_additives.sort_values(ascending=False)

    # 4. Plot the top 10 with pandas
    result.iloc[:10].plot.bar()
    plt.show()

    # 5. Save the result
    result.to_csv('./country_additives.csv')

    # Delete the extracted data to free space (optional)
    if os.path.exists(dataset_filepath):
        os.remove(dataset_filepath)

if __name__ == '__main__':
    main()
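The cleaning and grouping steps can be tried without downloading the dataset. The sketch below replaces the CSV with a few made-up rows (the values are invented for illustration; only the two column names come from the script above) and runs the same dropna, lowercase, group, and sort sequence.

```python
import pandas as pd

# Hypothetical rows standing in for the real CSV; values are made up
data = pd.DataFrame({
    'countries_en': ['France', 'france', 'Germany', None, 'Germany'],
    'additives_n': [2.0, 4.0, 1.0, 3.0, None],
})

# Drop rows with a missing country or a missing additive count
data = data.dropna()

# Normalize case so 'France' and 'france' fall into the same group
data['countries_en'] = data['countries_en'].str.lower()

# Mean additive count per country, largest first
result = (data.groupby('countries_en')['additives_n']
              .mean()
              .sort_values(ascending=False))
print(result)
```

Without the lowercase step, 'France' and 'france' would be counted as two different countries, which is why the normalization comes before the groupby.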
