pandas duplicated(): marking duplicate rows, and drop_duplicates(): removing them
阿新 • Published: 2021-01-13
DataFrame.duplicated(subset=None, keep='first')
Returns a boolean Series indicating which rows are duplicates.
Parameters:
1) subset : column label or sequence of labels, optional
# Restricts the comparison to specific columns; all columns are used by default
Only consider certain columns for identifying duplicates, by default use all of the columns.
2) keep : {'first', 'last', False}, default 'first'
# With the default, the first occurrence is kept and only the later repeats are marked as duplicates
Determines which duplicates (if any) to mark.
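- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
# keep='last' effectively scans from the end toward the start, so within each group of duplicates the rows with the smaller index are the ones that return True.
- False : Mark all duplicates as True.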
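To make the parameter descriptions concrete, here is a minimal sketch comparing the three keep options and a subset restriction. The DataFrame `df` and its values are made up for illustration only (they are not the data used in the example further down); the last row repeats a district but not a count, so that subset actually changes the result.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical; row 3 repeats only 'district'.
df = pd.DataFrame({
    'district': ['A', 'A', 'B', 'A'],
    'count': [50, 50, 60, 70],
})

# Default keep='first': the first occurrence is not marked, later repeats are.
first_mask = df.duplicated()
# keep='last': the last occurrence is not marked, earlier repeats are.
last_mask = df.duplicated(keep='last')
# keep=False: every row that has a duplicate anywhere is marked.
all_mask = df.duplicated(keep=False)
# subset: compare only the 'district' column when looking for duplicates.
district_mask = df.duplicated(subset=['district'])

print(first_mask.tolist())     # [False, True, False, False]
print(last_mask.tolist())      # [True, False, False, False]
print(all_mask.tolist())       # [True, True, False, False]
print(district_mask.tolist())  # [False, True, False, True]
```

Note that row 3 is only flagged once the comparison is limited to 'district', because its count differs from the other 'A' rows.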
Example:
import pandas as pd
data = pd.DataFrame({'district': ['A', 'A', 'B', 'B', 'C', 'C'], 'count': [50, 50, 60, 60, 80, 80]})
Duplicate rows return True:
data.duplicated()
Use drop_duplicates() to remove the duplicate rows:
data.drop_duplicates()
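One way to see how the two methods relate: with default arguments, drop_duplicates() should give the same rows as filtering the frame with the negated duplicated() mask. The sketch below rebuilds the sample data so that it runs on its own; the variable names `mask`, `via_mask` and `via_method` are illustrative.

```python
import pandas as pd

# Same sample data as in the example above.
data = pd.DataFrame({
    'district': ['A', 'A', 'B', 'B', 'C', 'C'],
    'count': [50, 50, 60, 60, 80, 80],
})

# Boolean Series: True for every row that repeats an earlier row.
mask = data.duplicated()

# Keep only the rows that are NOT marked as duplicates.
via_mask = data[~mask]

# The built-in shortcut with its default keep='first'.
via_method = data.drop_duplicates()

print(mask.tolist())                # [False, True, False, True, False, True]
print(via_mask.equals(via_method))  # True
```

The mask form is handy when you want to inspect or count the duplicates before deciding to drop them.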
The row index is not renumbered after the duplicates are removed, so use reset_index(drop=True) to rebuild it:
data.drop_duplicates().reset_index(drop=True)
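The following sketch shows the index before and after the reset. It also shows an alternative that is not part of the original post: assuming pandas 1.0 or later, drop_duplicates() accepts ignore_index=True, which should renumber the result in a single call.

```python
import pandas as pd

# Same sample data as in the example above.
data = pd.DataFrame({
    'district': ['A', 'A', 'B', 'B', 'C', 'C'],
    'count': [50, 50, 60, 60, 80, 80],
})

# After dropping duplicates, the surviving rows keep their original labels.
deduped = data.drop_duplicates()
print(deduped.index.tolist())     # [0, 2, 4]

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# Assumption: pandas >= 1.0, where ignore_index is available on drop_duplicates.
one_step = data.drop_duplicates(ignore_index=True)
print(one_step.index.tolist())    # [0, 1, 2]
```

Passing drop=True matters here: without it, reset_index() would keep the old labels as a new column named index.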