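/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py:816: Handling NaN in pandas
This post records an extremely silly mistake I made (ha, I am speechless), and takes the chance to discuss how NaN values should be handled.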
/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py:816: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
result = getattr(x, name)(y)
....................
TypeError: invalid type comparison
Here is a CSV table of coupon records:
import numpy as np
import pandas as pd
dfoff = pd.read_csv("datalab/4901/ccf_offline_stage1_train.csv")
dfofftest = pd.read_csv("datalab/4901/ccf_offline_stage1_test_revised.csv")
dfoff.head()
My goal here is to count the users (rows) whose Coupon_id is not NaN (not empty) and whose Date is NaN (empty).
----------------------------------------------------------------------------------------------------------------------------------------------------------------
In general, to count the rows where Discount_rate is 20:1 and Distance is not 1.0, we can write:
dfoff.info()
print('Count:', dfoff[(dfoff['Discount_rate'] == '20:1') & (dfoff['Distance'] != 1.0)].shape[0])
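The pattern above, combining boolean masks with & and counting the rows that survive, can be sketched on a small made-up frame. The toy DataFrame and its values are hypothetical and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy data that mimics two columns of the coupon table.
toy = pd.DataFrame({
    "Discount_rate": ["20:1", "20:1", "150:20", None],
    "Distance": [1.0, 0.0, 2.0, 1.0],
})

# Each comparison yields a boolean Series. The parentheses are required
# because & binds more tightly than == and !=.
mask = (toy["Discount_rate"] == "20:1") & (toy["Distance"] != 1.0)

count_via_shape = toy[mask].shape[0]  # filter first, then count the rows
count_via_sum = int(mask.sum())       # each True counts as 1
print(count_via_shape, count_via_sum)
```

Summing the mask avoids materializing the filtered frame, which may matter on a table with well over a million rows.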
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
So I wrote the filter like this:
dfoff.info()
print('Customers who received a coupon but made no purchase with it:',dfoff[(dfoff['Coupon_id']!='NaN')&(dfoff['Date']=='NaN')].shape[0])
It raised an error:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id float64
Discount_rate object
Distance float64
Date_received float64
Date float64
dtypes: float64(4), int64(2), object(1)
memory usage: 93.7+ MB
/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py:816: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
result = getattr(x, name)(y)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-c27c94978405> in <module>()
1 dfoff.info()
----> 2 print('Customers who received a coupon but made no purchase with it:',dfoff[(dfoff['Coupon_id']!='NaN')&(dfoff['Date']=='NaN')].shape[0])
/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
877
878 with np.errstate(all='ignore'):
--> 879 res = na_op(values, other)
880 if is_scalar(res):
881 raise TypeError('Could not compare {typ} type with Series'
/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
816 result = getattr(x, name)(y)
817 if result is NotImplemented:
--> 818 raise TypeError("invalid type comparison")
819 except AttributeError:
820 result = op(x, y)
TypeError: invalid type comparison
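The cause is actually simple. Look at the dtypes in the info() output above: Coupon_id and Date are both float64, yet the code compares them with 'NaN' (dfoff['Coupon_id']!='NaN'), and that is a string!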
print(type('NaN'))
<class 'str'>
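Comparing a float64 column with a str fails in the pandas version used here, and even where such a comparison does not raise, a float NaN never equals the string 'NaN'. Sigh! Comparing them directly like that makes me quite a piece of work, haha.
The built-in methods solve the problem: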
dfoff.info()
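print('Customers who received a coupon but made no purchase with it:',dfoff[(dfoff['Coupon_id'].notnull())&(dfoff['Date'].isnull())].shape[0])
That is, it uses these two methods:
.notnull()
.isnull()
They test whether each value is missing. The same approach works if the NaN entries in the CSV are written as null instead, because read_csv parses both as missing values by default.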
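As a small illustration of the point about null and NaN in the CSV, the sketch below reads a made-up CSV from a string and counts the same kind of rows. The CSV text and the name toy are hypothetical; only the column names follow the table above.

```python
import io
import pandas as pd

# Hypothetical CSV text: missing values written as "null", "NaN", or left empty.
csv_text = """User_id,Coupon_id,Date
1,11002,null
2,NaN,20160217
3,8591,
4,null,null
"""
toy = pd.read_csv(io.StringIO(csv_text))

has_coupon = toy["Coupon_id"].notnull()  # True where a coupon id is present
not_used = toy["Date"].isnull()          # True where no purchase date exists
unused_coupon_count = int((has_coupon & not_used).sum())

print(toy.dtypes)            # the columns with missing values become float64
print(unused_coupon_count)
```

With the default read_csv settings all three spellings end up as NaN, so a single .isnull() check covers them.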
While we are here, this is how to replace NaN with another value, for example 0.0:
dfoff['Coupon_id'] = dfoff['Coupon_id'].replace(np.nan, 0.0)
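For comparison, a minimal sketch showing that fillna gives the same result as the replace call above. The Series values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical coupon ids with two missing entries.
coupon = pd.Series([np.nan, 11002.0, 8591.0, np.nan])

filled_replace = coupon.replace(np.nan, 0.0)  # the approach used in this post
filled_fillna = coupon.fillna(0.0)            # the dedicated missing-value method

print(filled_replace.equals(filled_fillna))
```

Both calls return a new Series and leave coupon unchanged, which is why the post assigns the result back to the column.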
-----------------------------------------------------------------------------------------------------------------------------------------------------------
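Now a few words about NaN itself. The name stands for "not a number", and it is worth mentioning another special value alongside it: inf.
What they have in common: both are special float values that stand for something an ordinary number cannot express.
How they differ: inf means infinity, for example the result of a computation that overflows the floating-point range, while NaN marks an undefined result (such as inf - inf) and is the value pandas uses for missing data.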
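The difference between the two values can be seen in a few lines. This is only a sketch; the variable names are made up for illustration.

```python
import math
import numpy as np

overflowed = 1e308 * 10      # too large for a 64-bit float, so it becomes inf
undefined = np.inf - np.inf  # has no meaningful answer, so it becomes nan

print(type(overflowed), type(undefined))  # both are plain Python floats
print(math.isinf(overflowed), math.isnan(undefined))

# inf still takes part in ordering comparisons; nan does not.
inf_is_largest = overflowed > 1.7e308
nan_less = undefined < 0
nan_greater = undefined > 0
nan_equal_self = undefined == undefined
print(inf_is_largest, nan_less, nan_greater, nan_equal_self)
```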
Suppose we have this piece of code:
def ConvertRate(row):
    if row.isnull():
        return 0
    elif ':' in str(row):
        rows = str(row).split(':')
        return 1.0-float(rows[1])/float(rows[0])
    else:
        return float(row)
dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
print(dfoff.head(3))
It raises an error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-0aa06185ee75> in <module>()
7 else:
8 return float(row)
----> 9 dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
10 print(dfoff.head(3))
/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
2549 else:
2550 values = self.asobject
-> 2551 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-3-0aa06185ee75> in ConvertRate(row)
1 def ConvertRate(row):
----> 2 if row.isnull():
3 return 0
4 elif ':' in str(row):
5 rows = str(row).split(':')
AttributeError: 'float' object has no attribute 'isnull'
So what type is it, then?
print(type(np.nan))
print(type(np.inf))
<class 'float'>
<class 'float'>
'NaN' is just an ordinary string, whereas np.nan is the real NaN. So can we write it like this?
def ConvertRate(row):
    if row==np.nan:
        return 0
    elif ':' in str(row):
        rows = str(row).split(':')
        return 1.0-float(rows[1])/float(rows[0])
    else:
        return float(row)
dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
print(dfoff.head(3))
   User_id  Merchant_id  Coupon_id Discount_rate  Distance  Date_received  \
0  1439408         2632        NaN           NaN       0.0            NaN
1  1439408         4663    11002.0        150:20       1.0     20160528.0
2  1439408         2632     8591.0          20:1       0.0     20160217.0

         Date  discount_rate
0  20160217.0            NaN
1         NaN       0.866667
2         NaN       0.950000
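As you can see, the value is still NaN rather than 0, so this is still wrong: NaN compares unequal to everything, including itself, so row==np.nan is always False.
Let's try this instead:
def ConvertRate(row):
    if row==float('NaN'):
        return 0
    elif ':' in str(row):
        rows = str(row).split(':')
        return 1.0-float(rows[1])/float(rows[0])
    else:
        return float(row)
dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
print(dfoff.head(3))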
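The result is the same as above. NaN is simply a special float value, and float('NaN') just converts the string into that same value, so the comparison is still always False.
So what should we do? To test whether a value is NaN, use:
row!=row
If the result is True the value is NaN; if it is False the value is not NaN.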
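To see why the self-comparison works where == does not, here is a small sketch on scalar values. The helper is_nan_by_self_compare and the sample list are hypothetical names introduced only for this illustration; pd.isna is shown as the pandas equivalent for scalars.

```python
import numpy as np
import pandas as pd

def is_nan_by_self_compare(value):
    """Return True only for NaN, relying on the fact that NaN != NaN."""
    return value != value

samples = [np.nan, float("nan"), 0.5, "20:1", np.inf]

equality = [v == np.nan for v in samples]                # never True
self_compare = [is_nan_by_self_compare(v) for v in samples]
pandas_check = [bool(pd.isna(v)) for v in samples]       # same answer here

print(equality)
print(self_compare)
print(pandas_check)
```

Note that inf is not reported as NaN by either check, and that ordinary strings such as '20:1' pass through safely.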
Here is the result:
def ConvertRate(row):
    if row!=row:
        return 0
    elif ':' in str(row):
        rows = str(row).split(':')
        return 1.0-float(rows[1])/float(rows[0])
    else:
        return float(row)
dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
print(dfoff.head(3))
   User_id  Merchant_id  Coupon_id Discount_rate  Distance  Date_received  \
0  1439408         2632        NaN           NaN       0.0            NaN
1  1439408         4663    11002.0        150:20       1.0     20160528.0
2  1439408         2632     8591.0          20:1       0.0     20160217.0

         Date  discount_rate
0  20160217.0       0.000000
1         NaN       0.866667
2         NaN       0.950000
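The same conversion can also be written with pd.isna in place of the self-comparison, which some readers may find easier to follow. This is a sketch and not the code used for the output above: convert_rate and the sample Series are hypothetical, and only the 'x:y' format comes from the table.

```python
import numpy as np
import pandas as pd

def convert_rate(value):
    """Turn 'x:y' into 1 - y/x, a plain number into float, and NaN into 0."""
    if pd.isna(value):
        return 0.0
    text = str(value)
    if ":" in text:
        first, second = text.split(":")
        return 1.0 - float(second) / float(first)
    return float(text)

# Hypothetical sample of Discount_rate values, including a missing one.
rates = pd.Series([np.nan, "150:20", "20:1", "0.95"]).apply(convert_rate)
print(rates)
```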
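So my original problem can also be solved like this:
print('Customers who received a coupon but made no purchase with it:',dfoff[(dfoff['Coupon_id']==dfoff['Coupon_id'])&(dfoff['Date']!=dfoff['Date'])].shape[0])
Customers who received a coupon but made no purchase with it: 977900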
---------------------------------------------------------------------------------------------------------------------------------------------------------------
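apply sometimes raises an error when the function expects a whole row rather than a single value. In that case call apply on the DataFrame with axis = 1, which passes each row to the function as a Series. Note that axis belongs to DataFrame.apply: Series.apply works element by element and forwards an axis keyword to the function itself, which raises a TypeError.
Applied to the example above, that means changing:
dfoff['discount_rate'] = dfoff['Discount_rate'].apply(ConvertRate)
to:
dfoff['discount_rate'] = dfoff.apply(lambda row: ConvertRate(row['Discount_rate']), axis = 1)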
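The sketch below contrasts apply on a single column with apply on the whole frame using axis=1, and shows what happens when axis=1 is passed to a single column. The function rate_or_zero and the toy frame are hypothetical stand-ins for ConvertRate and dfoff.

```python
import numpy as np
import pandas as pd

def rate_or_zero(value):
    """Return 0.0 for NaN, otherwise the value as a float."""
    return 0.0 if value != value else float(value)

toy = pd.DataFrame({"Discount_rate": [np.nan, 0.9], "Distance": [0.0, 1.0]})

# Series.apply: the function receives one value per call, no axis needed.
by_element = toy["Discount_rate"].apply(rate_or_zero)

# DataFrame.apply with axis=1: the function receives one row (a Series) per call.
by_row = toy.apply(lambda row: rate_or_zero(row["Discount_rate"]), axis=1)

# On a Series, axis=1 is forwarded to rate_or_zero as a keyword argument,
# which the function does not accept.
try:
    toy["Discount_rate"].apply(rate_or_zero, axis=1)
    series_axis_error = None
except TypeError as exc:
    series_axis_error = exc

print(by_element.tolist(), by_row.tolist(), type(series_axis_error).__name__)
```

The row-wise form is mainly useful when the function needs more than one column; for a single column the element-wise form is simpler and usually faster.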
------------------------------------------------------------------------------------------------------------------------------------------------------------
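To summarize:
NaN and inf are both special float values.
A check of the form row!=row tells you whether a value is NaN: True means NaN, False means not NaN. Equivalently, row==row works too, with the logic reversed.
If a row-wise apply raises an error, remember axis = 1, and remember that it belongs to DataFrame.apply, not Series.apply.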