pandas Data Processing in Practice, Part 3 (preprocessing with DataFrame.apply, deduplication with DataFrame.drop_duplicates)
阿新 • Published: 2018-12-13
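Preprocessing data with apply (the session below assumes `import pandas as pd` and `from pandas import Series` were run earlier):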
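DataFrame.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
Note: this is the signature as of pandas 0.23. The broadcast and reduce parameters were deprecated in that release and removed in pandas 1.0; result_type replaces them.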
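To make the role of `axis` concrete, here is a minimal sketch on made-up toy data (the frame and its column names `x` and `y` are assumptions for illustration, not the article's `apply_demo.csv`). It contrasts `DataFrame.apply`, which hands the function a whole column or row, with `Series.apply`, which hands it one element at a time.

```python
import pandas as pd

# Hypothetical toy data, not the article's CSV file
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_sums = df.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series
row_sums = df.apply(lambda row: row["x"] + row["y"], axis=1)

# Series.apply: the function receives one element at a time
letters = pd.Series(["a", "b", "c"])
upper = letters.apply(str.upper)

print(col_sums.tolist())  # [6, 60]
print(row_sums.tolist())  # [11, 22, 33]
print(upper.tolist())     # ['A', 'B', 'C']
```

The session in this article only uses the `Series.apply` form, called on a single column.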
In [70]: df = pd.read_csv('apply_demo.csv')

In [71]: df.head()  # returns the first 5 rows by default
Out[71]:
         time                               data
0  1473411962  Symbol: APPL Seqno: 0 Price: 1623
1  1473411962  Symbol: APPL Seqno: 0 Price: 1623
2  1473411963  Symbol: APPL Seqno: 0 Price: 1623
3  1473411963  Symbol: APPL Seqno: 0 Price: 1623
4  1473411963  Symbol: APPL Seqno: 1 Price: 1649

In [72]: df.shape  # 3989 rows (samples), each with 2 columns (features)
Out[72]: (3989, 2)

In [73]: df.size  # total number of elements: 3989 * 2 = 7978
Out[73]: 7978

In [74]: s1 = Series(['a'] * 3992)  # Repeated experiments show: if s1 is longer than df (3989 rows),
    ...: # the new column still has df's length and the extra values are dropped; if s1 is shorter,
    ...: # df's length still wins and the missing rows are filled with NaN

In [75]: df['A'] = s1

In [76]: df.head()
Out[76]:
         time                               data  A
0  1473411962  Symbol: APPL Seqno: 0 Price: 1623  a
1  1473411962  Symbol: APPL Seqno: 0 Price: 1623  a
2  1473411963  Symbol: APPL Seqno: 0 Price: 1623  a
3  1473411963  Symbol: APPL Seqno: 0 Price: 1623  a
4  1473411963  Symbol: APPL Seqno: 1 Price: 1649  a

In [77]: df['A'] = df['A'].apply(str.upper)  # Series.apply calls the function on each element of the column;
    ...: # here it converts column A from lower case to upper case
    ...: # (DataFrame.apply, by contrast, passes whole columns by default, axis=0)

In [78]: df.head()
Out[78]:
         time                               data  A
0  1473411962  Symbol: APPL Seqno: 0 Price: 1623  A
1  1473411962  Symbol: APPL Seqno: 0 Price: 1623  A
2  1473411963  Symbol: APPL Seqno: 0 Price: 1623  A
3  1473411963  Symbol: APPL Seqno: 0 Price: 1623  A
4  1473411963  Symbol: APPL Seqno: 1 Price: 1649  A

In [79]: # each value in data holds three fields; we want to extract them into separate columns

In [80]: l1 = df['data'][0].strip().split(' ')  # strip() removes leading/trailing whitespace; split(' ') splits on spaces

In [81]: l1
Out[81]: ['Symbol:', 'APPL', 'Seqno:', '0', 'Price:', '1623']

In [82]: l1[1], l1[3], l1[5]
Out[82]: ('APPL', '0', '1623')

In [83]: # define a function that extracts the wanted fields and returns them as a Series

In [84]: def foo(line):
    ...:     items = line.strip().split(' ')
    ...:     return Series([items[1], items[3], items[5]])
    ...:

In [85]: df_tmp = df['data'].apply(foo)  # process every row; the result is a DataFrame

In [86]: df_tmp = df_tmp.rename(columns={0: 'Symbol', 1: 'Seqno', 2: 'Price'})  # rename the columns

In [87]: df_tmp.head()
Out[87]:
  Symbol Seqno Price
0   APPL     0  1623
1   APPL     0  1623
2   APPL     0  1623
3   APPL     0  1623
4   APPL     1  1649

In [88]: df_new = df.combine_first(df_tmp)  # use combine_first to add the new columns to the target data

In [89]: df_new.head()
Out[89]:
   A   Price  Seqno Symbol                               data        time
0  A  1623.0    0.0   APPL  Symbol: APPL Seqno: 0 Price: 1623  1473411962
1  A  1623.0    0.0   APPL  Symbol: APPL Seqno: 0 Price: 1623  1473411962
2  A  1623.0    0.0   APPL  Symbol: APPL Seqno: 0 Price: 1623  1473411963
3  A  1623.0    0.0   APPL  Symbol: APPL Seqno: 0 Price: 1623  1473411963
4  A  1649.0    1.0   APPL  Symbol: APPL Seqno: 1 Price: 1649  1473411963

In [90]: del df_new['A'], df_new['data']  # delete the columns that are no longer needed

In [91]: df_new.head()
Out[91]:
    Price  Seqno Symbol        time
0  1623.0    0.0   APPL  1473411962
1  1623.0    0.0   APPL  1473411962
2  1623.0    0.0   APPL  1473411963
3  1623.0    0.0   APPL  1473411963
4  1649.0    1.0   APPL  1473411963

In [92]: df_new.to_csv('demo_duplicate.csv')  # the index is written as well, since index=False is not passed
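The extraction step above can be reproduced without the CSV file. The sketch below uses two made-up rows in the same `Symbol: ... Seqno: ... Price: ...` layout; the helper name `parse` is hypothetical. It also swaps `combine_first` for `pd.concat(..., axis=1)`, which is one alternative way to attach the new columns while keeping the column order, and it converts the numeric fields explicitly with `astype`.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of the article's data column
df = pd.DataFrame({
    "time": [1473411962, 1473411963],
    "data": ["Symbol: APPL Seqno: 0 Price: 1623",
             "Symbol: APPL Seqno: 1 Price: 1649"],
})

def parse(line):
    """Split one 'Symbol: X Seqno: N Price: P' string into a labelled Series."""
    items = line.strip().split(" ")
    return pd.Series([items[1], items[3], items[5]],
                     index=["Symbol", "Seqno", "Price"])

# Returning a Series from the function makes Series.apply build a DataFrame
parsed = df["data"].apply(parse)

# The fields are still strings at this point; convert the numeric ones
parsed = parsed.astype({"Seqno": "int64", "Price": "int64"})

# Attach the parsed columns next to the original time column
result = pd.concat([df[["time"]], parsed], axis=1)
print(result)
```

Labelling the Series inside the function removes the need for the separate `rename` call used in the session.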
Deduplication:
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Returns a DataFrame with duplicate rows removed.
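Parameters:
subset: column label or sequence of labels, optional. Only these columns are considered when identifying duplicates; by default all columns are used.
keep: {'first', 'last', False}, default 'first'. 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False drops every duplicated row.
inplace: bool, default False. If True, the DataFrame is modified in place and None is returned.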
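As a rough illustration of the three `keep` settings and of `inplace`, here is a small sketch on made-up data (the values are assumptions chosen so that the surviving row labels are easy to follow).

```python
import pandas as pd

# Hypothetical data: Seqno 0 and 1 each appear twice, Seqno 2 once
df = pd.DataFrame({"Seqno": [0, 0, 1, 1, 2],
                   "Price": [10, 11, 20, 21, 30]})

first = df.drop_duplicates(subset="Seqno")               # keep='first' is the default
last = df.drop_duplicates(subset="Seqno", keep="last")   # keep the last occurrence
unique_only = df.drop_duplicates(subset="Seqno", keep=False)  # drop every repeated value

print(first.index.tolist())        # [0, 2, 4]
print(last.index.tolist())         # [1, 3, 4]
print(unique_only.index.tolist())  # [4]

# inplace=True modifies the frame itself and returns None
work = df.copy()
returned = work.drop_duplicates(subset="Seqno", inplace=True)
print(returned, len(work))         # None 3
```

Note that the original row labels are preserved in the result; nothing is renumbered unless `reset_index` is called afterwards.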
In [93]: df = pd.read_csv('demo_duplicate.csv')

In [94]: df.head()
Out[94]:
   Unnamed: 0   Price  Seqno Symbol        time
0           0  1623.0    0.0   APPL  1473411962
1           1  1623.0    0.0   APPL  1473411962
2           2  1623.0    0.0   APPL  1473411963
3           3  1623.0    0.0   APPL  1473411963
4           4  1649.0    1.0   APPL  1473411963

In [95]: del df['Unnamed: 0']  # delete the 'Unnamed: 0' column (the index saved by to_csv)

In [96]: df.head()  # Seqno contains many repeated values; remove them below
Out[96]:
    Price  Seqno Symbol        time
0  1623.0    0.0   APPL  1473411962
1  1623.0    0.0   APPL  1473411962
2  1623.0    0.0   APPL  1473411963
3  1623.0    0.0   APPL  1473411963
4  1649.0    1.0   APPL  1473411963

In [97]: df.shape  # how much data is there
Out[97]: (3989, 4)

In [98]: len(df['Seqno'].unique())  # how many distinct values the column has
Out[98]: 1000

In [99]: df['Seqno'].duplicated().head()  # flags repeated values: the first occurrence is False, later repeats are True
Out[99]:
0    False
1     True
2     True
3     True
4    False
Name: Seqno, dtype: bool

In [100]: df['Seqno'].drop_duplicates().head()  # drops repeats, keeping the first occurrence by default; returns a Series
Out[100]:
0     0.0
4     1.0
8     2.0
12    3.0
16    4.0
Name: Seqno, dtype: float64

In [101]: df.drop_duplicates().head()  # repeated Seqno values remain: without subset, whole rows are compared,
     ...: # and these rows still differ in time
Out[101]:
    Price  Seqno Symbol        time
0  1623.0    0.0   APPL  1473411962
2  1623.0    0.0   APPL  1473411963
4  1649.0    1.0   APPL  1473411963
6  1649.0    1.0   APPL  1473411964
8  1642.0    2.0   APPL  1473411964

In [102]: df.drop_duplicates(['Seqno']).head()  # passing the column makes duplicates be judged by Seqno alone
Out[102]:
     Price  Seqno Symbol        time
0   1623.0    0.0   APPL  1473411962
4   1649.0    1.0   APPL  1473411963
8   1642.0    2.0   APPL  1473411964
12  1636.0    3.0   APPL  1473411965
16  1669.0    4.0   APPL  1473411966

In [103]: df.drop_duplicates(['Seqno'], keep='last').head()  # keep='last' keeps the last row of each group of duplicates
Out[103]:
     Price  Seqno Symbol        time
3   1623.0    0.0   APPL  1473411963
7   1649.0    1.0   APPL  1473411964
11  1642.0    2.0   APPL  1473411965
15  1636.0    3.0   APPL  1473411966
19  1669.0    4.0   APPL  1473411967
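The difference between whole-row and per-column deduplication seen in the session can be checked on a few made-up rows. The sketch below is only an illustration under that assumption; the values imitate the shape of the article's data but are not taken from the file. It also adds `reset_index(drop=True)`, which the session does not use, to renumber the surviving rows.

```python
import pandas as pd

# Hypothetical rows: rows 0 and 1 are identical, row 2 differs only in time
df = pd.DataFrame({
    "Seqno": [0, 0, 0, 1, 1],
    "Price": [1623, 1623, 1623, 1649, 1649],
    "time": [1473411962, 1473411962, 1473411963, 1473411963, 1473411964],
})

# True marks a value that has already appeared earlier in the column
mask = df["Seqno"].duplicated()

# Without subset, a row is dropped only if every column matches an earlier row
whole_rows = df.drop_duplicates()

# With subset, rows are compared on Seqno alone
by_seqno = df.drop_duplicates(subset=["Seqno"]).reset_index(drop=True)

print(mask.tolist())              # [False, True, True, False, True]
print(whole_rows.index.tolist())  # [0, 2, 3, 4]
print(by_seqno["Seqno"].tolist()) # [0, 1]
print(df["Seqno"].nunique())      # 2
```

The number of rows left after `drop_duplicates(subset=["Seqno"])` equals `df["Seqno"].nunique()` here, which mirrors the 1000 distinct values counted in the session.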