4 Pandas: Cleaning Duplicate Values and NaN Values
阿新 • Published: 2021-06-16
import pandas as pd
from pandas import DataFrame
import numpy as np
Handling missing data
- There are two kinds of missing data:
  - None
  - np.nan (NaN)
- Differences between the two kinds of missing data
type(None)
NoneType
type(np.nan)
float
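- Why does data analysis need a float-typed null rather than an object-typed one?
  - Data analysis routinely applies some form of computation to the raw data. If the nulls in the raw data are in NaN form, they neither interfere with nor interrupt the computation.
  - NaN can take part in arithmetic
  - None cannot take part in arithmetic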
np.nan + 1
nan
None + 1
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-3fd8740bf8ab> in <module>
----> 1 None + 1
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
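The contrast can be seen in a small sketch (the sample values are made up for illustration): NaN propagates through arithmetic instead of raising, and pandas reductions skip it by default.

```python
import numpy as np
import pandas as pd

# NaN is a float, so arithmetic with it succeeds and yields NaN
result = np.nan + 1
print(np.isnan(result))             # True

# NaN never compares equal to itself, so test for it with isnull/isna
print(np.nan == np.nan)             # False
print(pd.isnull(np.nan))            # True

# pandas reductions skip NaN by default (skipna=True)
s = pd.Series([1.0, np.nan, 3.0])
total = s.sum()                     # 4.0
total_strict = s.sum(skipna=False)  # nan
print(total, total_strict)
```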
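- When pandas encounters a null in None form in a numeric column, it converts it to NaN. Because NaN is a float, the affected columns (2, 3 and 4 below) are displayed as floats.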
df = DataFrame(data=np.random.randint(0,100,size=(7,5)))
df.iloc[2,3] = None
df.iloc[4,2] = np.nan
df.iloc[5,4] = None
df
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 23 | 67 | 72.0 | 72.0 | 78.0 |
1 | 37 | 26 | 28.0 | 44.0 | 19.0 |
2 | 0 | 69 | 90.0 | NaN | 28.0 |
3 | 76 | 67 | 15.0 | 15.0 | 50.0 |
4 | 54 | 52 | NaN | 69.0 | 53.0 |
5 | 38 | 85 | 0.0 | 80.0 | NaN |
6 | 86 | 98 | 88.0 | 70.0 | 84.0 |
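As a minimal sketch of this conversion (the Series below are illustrative, not part of the original data): in a numeric Series, None becomes NaN and the dtype becomes float64, while in a Series created with an explicit object dtype None is stored as-is but is still reported as missing.

```python
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64
numeric = pd.Series([1, None, 3])
print(numeric.dtype)              # float64
print(numeric.isnull().tolist())  # [False, True, False]

# With an explicit object dtype, None is kept as None,
# but isnull() still treats it as a missing value
text = pd.Series(["a", None, "c"], dtype=object)
print(text[1] is None)            # True
print(text.isnull().tolist())     # [False, True, False]
```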
Pandas operations for handling null values
- isnull
- notnull
- any
- all
- dropna
- fillna
- Approach 1: filter out the nulls (delete the rows that contain them)
  - Tools: isnull, notnull, any, all
df.isnull()
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | False | False | False | False | False |
1 | False | False | False | False | False |
2 | False | False | False | True | False |
3 | False | False | False | False | False |
4 | False | False | True | False | False |
5 | False | False | False | False | True |
6 | False | False | False | False | False |
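# Which rows contain nulls?
# any(axis=1) checks which rows contain at least one null
df.isnull().any(axis=1)  # any is applied to each row of the result returned by isnull
# Rows marked True are the rows that contain missing data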
0 False
1 False
2 True
3 False
4 True
5 True
6 False
dtype: bool
df.notnull()
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | True | True | True | True | True |
1 | True | True | True | True | True |
2 | True | True | True | False | True |
3 | True | True | True | True | True |
4 | True | True | False | True | True |
5 | True | True | True | True | False |
6 | True | True | True | True | True |
df.notnull().all(axis=1)
0 True
1 True
2 False
3 True
4 False
5 False
6 True
dtype: bool
# Use the boolean values as a row indexer on the source data
df.loc[df.notnull().all(axis=1)]
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 23 | 67 | 72.0 | 72.0 | 78.0 |
1 | 37 | 26 | 28.0 | 44.0 | 19.0 |
3 | 76 | 67 | 15.0 | 15.0 | 50.0 |
6 | 86 | 98 | 88.0 | 70.0 | 84.0 |
# Get the rows that contain nulls
df.loc[df.isnull().any(axis=1)]
# Get the row index labels of the rows that contain nulls
indexs = df.loc[df.isnull().any(axis=1)].index
indexs
Int64Index([2, 4, 5], dtype='int64')
df.drop(labels=indexs,axis=0)
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 23 | 67 | 72.0 | 72.0 | 78.0 |
1 | 37 | 26 | 28.0 | 44.0 | 19.0 |
3 | 76 | 67 | 15.0 | 15.0 | 50.0 |
6 | 86 | 98 | 88.0 | 70.0 | 84.0 |
- Approach 2:
  - dropna: directly drops the rows or columns that contain missing values
df.dropna(axis=0)
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 23 | 67 | 72.0 | 72.0 | 78.0 |
1 | 37 | 26 | 28.0 | 44.0 | 19.0 |
3 | 76 | 67 | 15.0 | 15.0 | 50.0 |
6 | 86 | 98 | 88.0 | 70.0 | 84.0 |
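The two filtering approaches keep the same rows, and dropna offers a few parameters for finer control. The sketch below uses a small made-up frame (`df_demo` is a hypothetical name) to compare them; it is an illustration, not a replacement for the steps above.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
})

# Boolean filtering and dropna() keep the same rows
by_mask = df_demo.loc[df_demo.notnull().all(axis=1)]
by_dropna = df_demo.dropna(axis=0)
print(by_mask.equals(by_dropna))                      # True

# how='all' drops a row only when every value in it is null
rows_how_all = df_demo.dropna(how="all").index.tolist()      # [0, 1, 2]

# thresh=2 keeps rows that have at least 2 non-null values
rows_thresh = df_demo.dropna(thresh=2).index.tolist()        # [0, 1]

# subset restricts the null check to the listed columns
rows_subset = df_demo.dropna(subset=["c"]).index.tolist()    # [0, 1, 2]

# axis=1 drops columns instead of rows (checked here on the first 3 rows)
cols_kept = df_demo.iloc[:3].dropna(axis=1).columns.tolist() # ['c']
print(rows_how_all, rows_thresh, rows_subset, cols_kept)
```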
- Overwrite the missing values
  - fillna
df.fillna(value=999)  # fill every null in the source data with the specified value
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 23 | 67 | 72.0 | 72.0 | 78.0 |
1 | 37 | 26 | 28.0 | 44.0 | 19.0 |
2 | 0 | 69 | 90.0 | 999.0 | 28.0 |
3 | 76 | 67 | 15.0 | 15.0 | 50.0 |
4 | 54 | 52 | 999.0 | 69.0 | 53.0 |
5 | 38 | 85 | 0.0 | 80.0 | 999.0 |
6 | 86 | 98 | 88.0 | 70.0 | 84.0 |
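# Fill each null with a neighboring value
# method='ffill' fills forward (from the previous valid value), 'bfill' fills backward (from the next valid value); axis=0 fills down each column
df.fillna(axis=0,method='bfill')  # fillna(method=...) is deprecated since pandas 2.1; df.bfill(axis=0) is the equivalent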
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 23 | 67 | 72.0 | 72.0 | 78.0 |
1 | 37 | 26 | 28.0 | 44.0 | 19.0 |
2 | 0 | 69 | 90.0 | 15.0 | 28.0 |
3 | 76 | 67 | 15.0 | 15.0 | 50.0 |
4 | 54 | 52 | 0.0 | 69.0 | 53.0 |
5 | 38 | 85 | 0.0 | 80.0 | 84.0 |
6 | 86 | 98 | 88.0 | 70.0 | 84.0 |
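The direction of the fill is easiest to see on a tiny example. The following sketch uses made-up values and the `ffill`/`bfill` methods directly; the `limit` argument shown at the end is an optional extra that the text above does not use.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# ffill copies the last valid value forward; a leading null has nothing to copy
forward = s.ffill()
# bfill copies the next valid value backward; a trailing null stays null
backward = s.bfill()
print(forward.tolist())   # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 1.0, 4.0, 4.0, 4.0, nan]

# limit caps how many consecutive nulls are filled
limited = s.ffill(limit=1)
print(limited.tolist())   # [nan, 1.0, 1.0, nan, 4.0, 4.0]

# On a DataFrame, axis=0 fills down each column, axis=1 fills across each row
frame = pd.DataFrame([[1.0, np.nan], [np.nan, 4.0]])
down = frame.ffill(axis=0)    # [[1.0, nan], [1.0, 4.0]]
across = frame.ffill(axis=1)  # [[1.0, 1.0], [nan, 4.0]]
```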
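- When to use dropna and when to use fillna
  - Prefer dropna; if dropping is too costly (too much data would be lost), use fillna instead
- Fill each null with the mean of the column it belongs to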
for col in df.columns:
    # Check which columns contain nulls
    if df[col].isnull().sum() > 0:  # df[col] contains at least one null
        mean_value = df[col].mean()
        df[col] = df[col].fillna(value=mean_value)
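For an all-numeric frame, the loop above can likely be written in one step, because `fillna` accepts a Series and aligns it with the columns by label. The sketch below uses a small made-up frame (`df_demo` is a hypothetical name) and assumes every column is numeric.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [10.0, 20.0, np.nan],
})

# df_demo.mean() returns one mean per column (nulls are skipped);
# fillna matches each mean to its column by label
filled = df_demo.fillna(df_demo.mean())
print(filled)
```

Here the null in `x` becomes 2.0 and the null in `y` becomes 15.0.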
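Example:
- Data description:
  - The data is temperature data from one cold-storage room. Columns 1-7 correspond to 7 temperature sensors, each sampled once per minute.
- Processing goal:
  - Using the 4 required sensors (1-4), build a model of the cold room's temperature field and estimate the readings of sensors 5-7.
  - In the end, each cold room needs only 4 sensors instead of 7.
  - f(1-4) --> y(5-7)
- Processing steps:
  - 1. The raw data has dropped frames and needs preprocessing;
  - 2. Plot with matplotlib;
  - 3. Build a logistic regression model.
- There is no standard answer; proceed according to your own understanding, and briefly describe your process in writing. Thank you for your cooperation.
- The test data is testData.xlsx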
data = pd.read_excel('./data/testData.xlsx').drop(labels=['none','none1'],axis=1)
data
| | time | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
0 | 2019-01-27 17:00:00 | -24.8 | -18.2 | -20.8 | -18.8 | NaN | NaN | NaN |
1 | 2019-01-27 17:01:00 | -23.5 | -18.8 | -20.5 | -19.8 | -15.2 | -14.5 | -16.0 |
2 | 2019-01-27 17:02:00 | -23.2 | -19.2 | NaN | NaN | -13.0 | NaN | -14.0 |
3 | 2019-01-27 17:03:00 | -22.8 | -19.2 | -20.0 | -20.5 | NaN | -12.2 | -9.8 |
4 | 2019-01-27 17:04:00 | -23.2 | -18.5 | -20.0 | -18.8 | -10.2 | -10.8 | -8.8 |
5 | 2019-01-27 17:05:00 | NaN | NaN | -19.0 | -18.2 | -10.0 | -10.5 | -10.8 |
6 | 2019-01-27 17:06:00 | NaN | -18.5 | -18.2 | -17.5 | NaN | NaN | NaN |
7 | 2019-01-27 17:07:00 | -24.8 | -18.0 | -17.5 | -17.2 | -14.2 | -14.0 | -12.5 |
8 | 2019-01-27 17:08:00 | -25.2 | -17.8 | NaN | NaN | -16.2 | NaN | -14.5 |
9 | 2019-01-27 17:09:00 | -24.8 | -18.2 | NaN | -17.5 | NaN | -15.5 | -16.0 |
10 | 2019-01-27 17:10:00 | -24.5 | -18.5 | -16.0 | -18.5 | -17.5 | -16.5 | -17.2 |
11 | 2019-01-27 17:11:00 | NaN | NaN | -16.0 | -18.5 | -17.8 | -16.8 | -12.0 |
12 | 2019-01-27 17:12:00 | NaN | -18.5 | -15.8 | -18.8 | NaN | NaN | NaN |
13 | 2019-01-27 17:13:00 | -23.8 | -18.5 | NaN | NaN | 4.5 | NaN | 0.0 |
14 | 2019-01-27 17:14:00 | -23.2 | -18.2 | NaN | -19.0 | NaN | 5.8 | 6.8 |
15 | 2019-01-27 17:15:00 | -23.5 | -17.8 | -15.0 | -18.0 | 10.5 | 10.5 | 10.8 |
16 | 2019-01-27 17:16:00 | NaN | NaN | -14.2 | -17.2 | 14.0 | 13.5 | 13.0 |
17 | 2019-01-27 17:17:00 | NaN | -18.2 | -13.8 | -17.8 | 15.8 | 15.2 | 14.2 |
18 | 2019-01-27 17:18:00 | -23.2 | -19.0 | -13.8 | -18.2 | NaN | NaN | NaN |
19 | 2019-01-27 17:19:00 | -23.2 | -19.5 | NaN | NaN | 17.8 | NaN | 15.2 |
20 | 2019-01-27 17:20:00 | -23.2 | -19.8 | NaN | -19.0 | 18.2 | 17.2 | 15.8 |
21 | 2019-01-27 17:21:00 | -23.5 | -20.0 | -13.8 | -19.5 | NaN | 17.8 | 16.0 |
22 | 2019-01-27 17:22:00 | NaN | NaN | -14.0 | -19.5 | 18.8 | 18.0 | 16.2 |
23 | 2019-01-27 17:23:00 | -23.2 | -20.2 | -14.0 | -19.5 | 19.0 | 18.2 | 16.5 |
24 | 2019-01-27 17:24:00 | NaN | -20.2 | -14.2 | -19.5 | NaN | NaN | NaN |
25 | 2019-01-27 17:25:00 | -22.8 | -20.5 | -14.5 | -19.5 | 19.2 | NaN | 16.5 |
26 | 2019-01-27 17:26:00 | -22.8 | -20.8 | -15.0 | -16.8 | NaN | 17.2 | 16.8 |
27 | 2019-01-27 17:27:00 | -22.0 | -16.0 | NaN | -16.0 | 18.8 | 17.2 | 16.2 |
28 | 2019-01-27 17:28:00 | -22.8 | -15.2 | -14.8 | -15.2 | 18.8 | 17.2 | 16.2 |
29 | 2019-01-27 17:29:00 | -22.5 | -15.0 | -14.8 | -15.2 | 18.8 | 17.2 | 16.5 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1030 | 2019-01-28 10:10:00 | -30.5 | -27.5 | -29.5 | -27.8 | -3.8 | -3.5 | -8.2 |
1031 | 2019-01-28 10:11:00 | -30.8 | -27.0 | -29.2 | -27.8 | -3.8 | -3.2 | -8.5 |
1032 | 2019-01-28 10:12:00 | -30.5 | -26.2 | -29.0 | -26.8 | -3.5 | -3.0 | -8.8 |
1033 | 2019-01-28 10:13:00 | -28.8 | -25.2 | -28.2 | -26.2 | -3.5 | -3.0 | -8.8 |
1034 | 2019-01-28 10:14:00 | -25.2 | -25.2 | -28.2 | -25.8 | -3.0 | -2.5 | -8.8 |
1035 | 2019-01-28 10:15:00 | -25.2 | -25.8 | -28.5 | -26.2 | -3.0 | -2.2 | -8.5 |
1036 | 2019-01-28 10:16:00 | -25.8 | -26.2 | -28.8 | -26.8 | -2.8 | -2.0 | -8.2 |
1037 | 2019-01-28 10:17:00 | -26.2 | -26.8 | -29.0 | -27.2 | -2.5 | -1.8 | -8.2 |
1038 | 2019-01-28 10:18:00 | -26.5 | -27.0 | -29.2 | -27.5 | NaN | NaN | NaN |
1039 | 2019-01-28 10:19:00 | -27.0 | -27.2 | -29.5 | -28.0 | -2.2 | -1.5 | -7.8 |
1040 | 2019-01-28 10:20:00 | -26.5 | -26.8 | -29.0 | -28.0 | -2.2 | -1.5 | -7.5 |
1041 | 2019-01-28 10:21:00 | -25.0 | -25.8 | -28.5 | -27.2 | -2.2 | -1.5 | -7.5 |
1042 | 2019-01-28 10:22:00 | -24.0 | -25.2 | -28.2 | -26.5 | -2.0 | -1.5 | -7.2 |
1043 | 2019-01-28 10:23:00 | -23.8 | -25.0 | -28.0 | -26.0 | -2.0 | -1.5 | -7.2 |
1044 | 2019-01-28 10:24:00 | -24.0 | -25.2 | -28.0 | -25.5 | -2.0 | -1.5 | -7.0 |
1045 | 2019-01-28 10:25:00 | -25.0 | -26.0 | -28.2 | -26.2 | -2.2 | -1.5 | -6.8 |
1046 | 2019-01-28 10:26:00 | -25.8 | -26.5 | -28.8 | -26.8 | -2.2 | -1.5 | -6.5 |
1047 | 2019-01-28 10:27:00 | -26.2 | -26.5 | -28.8 | -27.2 | -2.2 | -1.5 | -6.5 |
1048 | 2019-01-28 10:28:00 | -25.0 | -25.8 | -28.5 | -27.0 | -2.2 | -1.8 | -6.2 |
1049 | 2019-01-28 10:29:00 | -24.8 | -25.2 | -28.0 | -26.2 | -2.2 | -1.8 | -6.0 |
1050 | 2019-01-28 10:30:00 | -24.5 | -24.8 | -27.8 | -25.8 | -2.0 | -2.0 | -6.0 |
1051 | 2019-01-28 10:31:00 | -24.0 | -24.8 | -27.8 | -25.5 | -2.0 | -2.0 | -5.8 |
1052 | 2019-01-28 10:32:00 | -24.2 | -25.5 | -28.0 | -26.0 | -2.0 | -2.0 | -5.5 |
1053 | 2019-01-28 10:33:00 | -25.0 | -26.2 | -28.2 | -26.8 | -2.0 | -2.0 | -5.2 |
1054 | 2019-01-28 10:34:00 | -25.8 | -26.8 | -28.5 | -27.0 | -2.0 | -2.2 | -5.2 |
1055 | 2019-01-28 10:35:00 | -26.2 | -27.2 | -28.8 | -27.5 | -2.0 | NaN | -5.0 |
1056 | 2019-01-28 10:36:00 | -26.8 | -27.5 | -29.0 | -27.8 | -2.2 | NaN | -5.0 |
1057 | 2019-01-28 10:37:00 | -27.2 | -27.8 | -29.0 | -28.0 | -2.2 | NaN | -5.0 |
1058 | 2019-01-28 10:38:00 | -27.5 | -27.0 | -29.0 | -28.0 | -3.5 | -3.2 | -5.8 |
1059 | 2019-01-28 10:39:00 | -27.0 | -27.2 | -29.0 | -27.8 | -5.0 | NaN | -7.0 |
1060 rows × 8 columns
data.shape
(1060, 8)
# Drop the rows that contain nulls
data.dropna(axis=0).shape
(927, 8)
# Fill: forward-fill first, then backward-fill the nulls that remain at the top
data.fillna(method='ffill',axis=0).fillna(method='bfill',axis=0)
| | time | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
0 | 2019-01-27 17:00:00 | -24.8 | -18.2 | -20.8 | -18.8 | -15.2 | -14.5 | -16.0 |
1 | 2019-01-27 17:01:00 | -23.5 | -18.8 | -20.5 | -19.8 | -15.2 | -14.5 | -16.0 |
2 | 2019-01-27 17:02:00 | -23.2 | -19.2 | -20.5 | -19.8 | -13.0 | -14.5 | -14.0 |
3 | 2019-01-27 17:03:00 | -22.8 | -19.2 | -20.0 | -20.5 | -13.0 | -12.2 | -9.8 |
4 | 2019-01-27 17:04:00 | -23.2 | -18.5 | -20.0 | -18.8 | -10.2 | -10.8 | -8.8 |
5 | 2019-01-27 17:05:00 | -23.2 | -18.5 | -19.0 | -18.2 | -10.0 | -10.5 | -10.8 |
6 | 2019-01-27 17:06:00 | -23.2 | -18.5 | -18.2 | -17.5 | -10.0 | -10.5 | -10.8 |
7 | 2019-01-27 17:07:00 | -24.8 | -18.0 | -17.5 | -17.2 | -14.2 | -14.0 | -12.5 |
8 | 2019-01-27 17:08:00 | -25.2 | -17.8 | -17.5 | -17.2 | -16.2 | -14.0 | -14.5 |
9 | 2019-01-27 17:09:00 | -24.8 | -18.2 | -17.5 | -17.5 | -16.2 | -15.5 | -16.0 |
10 | 2019-01-27 17:10:00 | -24.5 | -18.5 | -16.0 | -18.5 | -17.5 | -16.5 | -17.2 |
11 | 2019-01-27 17:11:00 | -24.5 | -18.5 | -16.0 | -18.5 | -17.8 | -16.8 | -12.0 |
12 | 2019-01-27 17:12:00 | -24.5 | -18.5 | -15.8 | -18.8 | -17.8 | -16.8 | -12.0 |
13 | 2019-01-27 17:13:00 | -23.8 | -18.5 | -15.8 | -18.8 | 4.5 | -16.8 | 0.0 |
14 | 2019-01-27 17:14:00 | -23.2 | -18.2 | -15.8 | -19.0 | 4.5 | 5.8 | 6.8 |
15 | 2019-01-27 17:15:00 | -23.5 | -17.8 | -15.0 | -18.0 | 10.5 | 10.5 | 10.8 |
16 | 2019-01-27 17:16:00 | -23.5 | -17.8 | -14.2 | -17.2 | 14.0 | 13.5 | 13.0 |
17 | 2019-01-27 17:17:00 | -23.5 | -18.2 | -13.8 | -17.8 | 15.8 | 15.2 | 14.2 |
18 | 2019-01-27 17:18:00 | -23.2 | -19.0 | -13.8 | -18.2 | 15.8 | 15.2 | 14.2 |
19 | 2019-01-27 17:19:00 | -23.2 | -19.5 | -13.8 | -18.2 | 17.8 | 15.2 | 15.2 |
20 | 2019-01-27 17:20:00 | -23.2 | -19.8 | -13.8 | -19.0 | 18.2 | 17.2 | 15.8 |
21 | 2019-01-27 17:21:00 | -23.5 | -20.0 | -13.8 | -19.5 | 18.2 | 17.8 | 16.0 |
22 | 2019-01-27 17:22:00 | -23.5 | -20.0 | -14.0 | -19.5 | 18.8 | 18.0 | 16.2 |
23 | 2019-01-27 17:23:00 | -23.2 | -20.2 | -14.0 | -19.5 | 19.0 | 18.2 | 16.5 |
24 | 2019-01-27 17:24:00 | -23.2 | -20.2 | -14.2 | -19.5 | 19.0 | 18.2 | 16.5 |
25 | 2019-01-27 17:25:00 | -22.8 | -20.5 | -14.5 | -19.5 | 19.2 | 18.2 | 16.5 |
26 | 2019-01-27 17:26:00 | -22.8 | -20.8 | -15.0 | -16.8 | 19.2 | 17.2 | 16.8 |
27 | 2019-01-27 17:27:00 | -22.0 | -16.0 | -15.0 | -16.0 | 18.8 | 17.2 | 16.2 |
28 | 2019-01-27 17:28:00 | -22.8 | -15.2 | -14.8 | -15.2 | 18.8 | 17.2 | 16.2 |
29 | 2019-01-27 17:29:00 | -22.5 | -15.0 | -14.8 | -15.2 | 18.8 | 17.2 | 16.5 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1030 | 2019-01-28 10:10:00 | -30.5 | -27.5 | -29.5 | -27.8 | -3.8 | -3.5 | -8.2 |
1031 | 2019-01-28 10:11:00 | -30.8 | -27.0 | -29.2 | -27.8 | -3.8 | -3.2 | -8.5 |
1032 | 2019-01-28 10:12:00 | -30.5 | -26.2 | -29.0 | -26.8 | -3.5 | -3.0 | -8.8 |
1033 | 2019-01-28 10:13:00 | -28.8 | -25.2 | -28.2 | -26.2 | -3.5 | -3.0 | -8.8 |
1034 | 2019-01-28 10:14:00 | -25.2 | -25.2 | -28.2 | -25.8 | -3.0 | -2.5 | -8.8 |
1035 | 2019-01-28 10:15:00 | -25.2 | -25.8 | -28.5 | -26.2 | -3.0 | -2.2 | -8.5 |
1036 | 2019-01-28 10:16:00 | -25.8 | -26.2 | -28.8 | -26.8 | -2.8 | -2.0 | -8.2 |
1037 | 2019-01-28 10:17:00 | -26.2 | -26.8 | -29.0 | -27.2 | -2.5 | -1.8 | -8.2 |
1038 | 2019-01-28 10:18:00 | -26.5 | -27.0 | -29.2 | -27.5 | -2.5 | -1.8 | -8.2 |
1039 | 2019-01-28 10:19:00 | -27.0 | -27.2 | -29.5 | -28.0 | -2.2 | -1.5 | -7.8 |
1040 | 2019-01-28 10:20:00 | -26.5 | -26.8 | -29.0 | -28.0 | -2.2 | -1.5 | -7.5 |
1041 | 2019-01-28 10:21:00 | -25.0 | -25.8 | -28.5 | -27.2 | -2.2 | -1.5 | -7.5 |
1042 | 2019-01-28 10:22:00 | -24.0 | -25.2 | -28.2 | -26.5 | -2.0 | -1.5 | -7.2 |
1043 | 2019-01-28 10:23:00 | -23.8 | -25.0 | -28.0 | -26.0 | -2.0 | -1.5 | -7.2 |
1044 | 2019-01-28 10:24:00 | -24.0 | -25.2 | -28.0 | -25.5 | -2.0 | -1.5 | -7.0 |
1045 | 2019-01-28 10:25:00 | -25.0 | -26.0 | -28.2 | -26.2 | -2.2 | -1.5 | -6.8 |
1046 | 2019-01-28 10:26:00 | -25.8 | -26.5 | -28.8 | -26.8 | -2.2 | -1.5 | -6.5 |
1047 | 2019-01-28 10:27:00 | -26.2 | -26.5 | -28.8 | -27.2 | -2.2 | -1.5 | -6.5 |
1048 | 2019-01-28 10:28:00 | -25.0 | -25.8 | -28.5 | -27.0 | -2.2 | -1.8 | -6.2 |
1049 | 2019-01-28 10:29:00 | -24.8 | -25.2 | -28.0 | -26.2 | -2.2 | -1.8 | -6.0 |
1050 | 2019-01-28 10:30:00 | -24.5 | -24.8 | -27.8 | -25.8 | -2.0 | -2.0 | -6.0 |
1051 | 2019-01-28 10:31:00 | -24.0 | -24.8 | -27.8 | -25.5 | -2.0 | -2.0 | -5.8 |
1052 | 2019-01-28 10:32:00 | -24.2 | -25.5 | -28.0 | -26.0 | -2.0 | -2.0 | -5.5 |
1053 | 2019-01-28 10:33:00 | -25.0 | -26.2 | -28.2 | -26.8 | -2.0 | -2.0 | -5.2 |
1054 | 2019-01-28 10:34:00 | -25.8 | -26.8 | -28.5 | -27.0 | -2.0 | -2.2 | -5.2 |
1055 | 2019-01-28 10:35:00 | -26.2 | -27.2 | -28.8 | -27.5 | -2.0 | -2.2 | -5.0 |
1056 | 2019-01-28 10:36:00 | -26.8 | -27.5 | -29.0 | -27.8 | -2.2 | -2.2 | -5.0 |
1057 | 2019-01-28 10:37:00 | -27.2 | -27.8 | -29.0 | -28.0 | -2.2 | -2.2 | -5.0 |
1058 | 2019-01-28 10:38:00 | -27.5 | -27.0 | -29.0 | -28.0 | -3.5 | -3.2 | -5.8 |
1059 | 2019-01-28 10:39:00 | -27.0 | -27.2 | -29.0 | -27.8 | -5.0 | -3.2 | -7.0 |
1060 rows × 8 columns
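Why both fills are chained can be sketched on a tiny made-up frame with the same kind of gap as the real data (the values and the name `sample` are illustrative only): a forward fill cannot repair a null in the first row, so a backward fill handles what is left.

```python
import numpy as np
import pandas as pd

# Sensor 5 has no reading in the first row, as in the real data
sample = pd.DataFrame({
    "time": pd.date_range("2019-01-27 17:00", periods=4, freq="min"),
    1: [-24.8, -23.5, np.nan, -22.8],
    5: [np.nan, -15.2, -13.0, np.nan],
})

print(sample.isnull().sum())  # null count per column before filling

step1 = sample.ffill(axis=0)  # row 0 of column 5 is still null
step2 = step1.bfill(axis=0)   # the leading null takes the next valid value
print(step2)
```

Counting nulls per column with `isnull().sum()` before and after is a quick way to confirm that nothing was missed.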
Handling duplicate data
df = DataFrame(data=np.random.randint(0,100,size=(8,6)))
df.iloc[1] = [1,1,1,1,1,1]
df.iloc[3] = [1,1,1,1,1,1]
df.iloc[5] = [1,1,1,1,1,1]
df
| | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 3 | 29 | 47 | 11 | 69 | 7 |
1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | 6 | 43 | 65 | 79 | 52 | 82 |
3 | 1 | 1 | 1 | 1 | 1 | 1 |
4 | 2 | 67 | 90 | 8 | 96 | 76 |
5 | 1 | 1 | 1 | 1 | 1 | 1 |
6 | 38 | 6 | 56 | 50 | 71 | 30 |
7 | 9 | 16 | 12 | 67 | 32 | 0 |
# Detect which rows are duplicates
df.duplicated(keep='first')
0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 False
dtype: bool
df.loc[~df.duplicated(keep='first')]
| | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 3 | 29 | 47 | 11 | 69 | 7 |
1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | 6 | 43 | 65 | 79 | 52 | 82 |
4 | 2 | 67 | 90 | 8 | 96 | 76 |
6 | 38 | 6 | 56 | 50 | 71 | 30 |
7 | 9 | 16 | 12 | 67 | 32 | 0 |
# Remove the duplicates in one step
df.drop_duplicates(keep='first')
| | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 3 | 29 | 47 | 11 | 69 | 7 |
1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | 6 | 43 | 65 | 79 | 52 | 82 |
4 | 2 | 67 | 90 | 8 | 96 | 76 |
6 | 38 | 6 | 56 | 50 | 71 | 30 |
7 | 9 | 16 | 12 | 67 | 32 | 0 |
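The `keep` argument decides which copy of a duplicated row counts as the original. The sketch below, on a small made-up frame (`df_demo` is a hypothetical name), compares the three options and also shows `subset`, which the text above does not use.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "k": [1, 1, 2, 1],
    "v": [9, 9, 5, 9],
})

# keep='first': every copy after the first is flagged
first = df_demo.duplicated(keep="first").tolist()  # [False, True, False, True]
# keep='last': every copy before the last is flagged
last = df_demo.duplicated(keep="last").tolist()    # [True, True, False, False]
# keep=False: all copies are flagged
every = df_demo.duplicated(keep=False).tolist()    # [True, True, False, True]

# drop_duplicates returns a new frame; df_demo itself is unchanged
deduped = df_demo.drop_duplicates(keep="first")
print(deduped.index.tolist())   # [0, 2]

# subset limits the comparison to the listed columns
by_key = df_demo.drop_duplicates(subset=["k"], keep="last")
print(by_key.index.tolist())    # [2, 3]
```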
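Handling outliers
- Create a custom data source with 1000 rows and 3 columns (A, B, C) whose values fall in the range 0-1, then clean out the outliers in column C, defined here as values greater than twice the column's standard deviation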
df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df.head()
| | A | B | C |
---|---|---|---|
0 | 0.093731 | 0.358117 | 0.599607 |
1 | 0.415552 | 0.124499 | 0.820207 |
2 | 0.278742 | 0.222851 | 0.786832 |
3 | 0.537346 | 0.339687 | 0.276611 |
4 | 0.414794 | 0.179321 | 0.094958 |
# Define the condition for identifying outliers
twice_std = df['C'].std() * 2
twice_std
0.5664309886908782
df.loc[~(df['C'] > twice_std)]
| | A | B | C |
---|---|---|---|
3 | 0.537346 | 0.339687 | 0.276611 |
4 | 0.414794 | 0.179321 | 0.094958 |
5 | 0.397169 | 0.610316 | 0.420824 |
7 | 0.740718 | 0.730160 | 0.302804 |
8 | 0.483627 | 0.146781 | 0.126308 |
9 | 0.530481 | 0.426309 | 0.126464 |
11 | 0.450225 | 0.399873 | 0.564089 |
12 | 0.149199 | 0.706758 | 0.271892 |
14 | 0.503967 | 0.098280 | 0.239464 |
15 | 0.630785 | 0.661415 | 0.139548 |
18 | 0.055859 | 0.917004 | 0.285184 |
19 | 0.153187 | 0.876151 | 0.499308 |
20 | 0.743452 | 0.679769 | 0.563917 |
21 | 0.151130 | 0.469532 | 0.382700 |
22 | 0.941487 | 0.967286 | 0.234784 |
24 | 0.459680 | 0.080923 | 0.249299 |
26 | 0.973956 | 0.553966 | 0.094996 |
28 | 0.084031 | 0.689861 | 0.067293 |
29 | 0.484953 | 0.188724 | 0.160719 |
31 | 0.512300 | 0.392566 | 0.100742 |
32 | 0.237209 | 0.004658 | 0.449384 |
34 | 0.345122 | 0.658354 | 0.391020 |
35 | 0.709282 | 0.236713 | 0.499434 |
39 | 0.055908 | 0.924785 | 0.490992 |
41 | 0.047936 | 0.205894 | 0.160804 |
45 | 0.929671 | 0.799439 | 0.335439 |
46 | 0.169237 | 0.037535 | 0.494065 |
50 | 0.808850 | 0.804284 | 0.223080 |
51 | 0.320385 | 0.385899 | 0.189706 |
52 | 0.926416 | 0.748975 | 0.183758 |
... | ... | ... | ... |
949 | 0.596174 | 0.224759 | 0.280238 |
951 | 0.382255 | 0.906360 | 0.401972 |
954 | 0.434587 | 0.983834 | 0.118732 |
955 | 0.914734 | 0.279118 | 0.078705 |
957 | 0.977942 | 0.194291 | 0.253350 |
958 | 0.139654 | 0.683716 | 0.118146 |
959 | 0.308191 | 0.612879 | 0.445845 |
961 | 0.081917 | 0.100586 | 0.116678 |
962 | 0.390555 | 0.762205 | 0.083272 |
964 | 0.712109 | 0.870591 | 0.393287 |
965 | 0.107085 | 0.056523 | 0.304899 |
968 | 0.588549 | 0.535405 | 0.248742 |
971 | 0.498968 | 0.489234 | 0.080411 |
973 | 0.165779 | 0.110859 | 0.384091 |
974 | 0.778701 | 0.489504 | 0.533272 |
975 | 0.057621 | 0.839546 | 0.275676 |
976 | 0.605409 | 0.293276 | 0.482304 |
977 | 0.555336 | 0.287849 | 0.468799 |
978 | 0.484669 | 0.993484 | 0.151512 |
979 | 0.418097 | 0.858759 | 0.220208 |
983 | 0.033246 | 0.539796 | 0.128987 |
984 | 0.973549 | 0.277905 | 0.311013 |
986 | 0.469728 | 0.046535 | 0.274008 |
987 | 0.037183 | 0.136681 | 0.279782 |
989 | 0.824013 | 0.938513 | 0.022778 |
992 | 0.683002 | 0.567619 | 0.003076 |
993 | 0.493820 | 0.617086 | 0.202174 |
996 | 0.177156 | 0.248502 | 0.096410 |
998 | 0.914179 | 0.470827 | 0.129195 |
999 | 0.276110 | 0.942467 | 0.510295 |
573 rows × 3 columns
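The rule above compares each value directly with twice the standard deviation. A commonly used variant, shown here only as an assumption for comparison and not as the rule used above, measures the distance from the column mean instead. The sketch uses a fixed, arbitrarily chosen seed so that it is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
df_demo = pd.DataFrame(rng.random(size=(1000, 3)), columns=["A", "B", "C"])

c = df_demo["C"]

# Rule used above: a value is an outlier when it exceeds 2 * std
kept_source_rule = df_demo.loc[~(c > c.std() * 2)]

# Variant (an assumption, not the rule used above): a value is an outlier
# when it lies more than 2 standard deviations away from the column mean
kept_mean_rule = df_demo.loc[(c - c.mean()).abs() <= c.std() * 2]

print(len(kept_source_rule), len(kept_mean_rule))
```

For values spread uniformly over 0-1 the standard deviation is about 0.29, so the interval of the mean plus or minus two standard deviations covers the whole range and the variant typically removes nothing, whereas the rule above removes roughly the upper 40% of the values.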