DataFrame: computing row or column means based on the missing-data rate
阿新 • Published: 2022-05-13
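By default, the built-in DataFrame.mean() simply skips NaN values; it has no option to make the result depend on how often NaN occurs, for example by returning NaN when too many values are missing.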
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,np.nan,np.nan,np.nan,np.nan],[2,3,np.nan,np.nan,np.nan],[3,4,5,np.nan,np.nan],[4,5,6,7,np.nan]], index=['a', 'b', 'c', 'd'], columns=['A','B','C','D','E'])
print(df)
Output:
   A    B    C    D    E
a  1  NaN  NaN  NaN  NaN
b  2  3.0  NaN  NaN  NaN
c  3  4.0  5.0  NaN  NaN
d  4  5.0  6.0  7.0  NaN
Below is the custom function I wrote:
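def df_mean(u0, axis, limit):
    """
    Compute row or column means of a DataFrame, subject to a missing-rate threshold.
    :param u0: the DataFrame to average
    :param axis: 1 gives one mean per row (averaging across columns); 0 gives one mean per column (averaging across rows)
    :param limit: missing-rate threshold, between 0 and 1. Where the missing rate is greater than or equal to limit, the mean is set to NaN
    :return: a Series of means, indexed by the row labels (axis=1) or the column labels (axis=0)
    """
    umean = []
    if axis == 1:
        for ij in u0.index:
            row = u0.loc[ij, :]
            # Missing rate = number of NaN values / number of values in the row
            if row.isna().sum() / len(row) >= limit:
                umean.append(np.nan)
            else:
                umean.append(row.mean())
        umean = pd.Series(umean, index=u0.index)
    elif axis == 0:
        for ij in u0.columns:
            col = u0.loc[:, ij]
            # Missing rate = number of NaN values / number of values in the column
            if col.isna().sum() / len(col) >= limit:
                umean.append(np.nan)
            else:
                umean.append(col.mean())
        umean = pd.Series(umean, index=u0.columns)
    else:
        # Fail loudly instead of printing a message and returning an empty list
        raise ValueError('axis must be 0 or 1')
    return umean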
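The same rule can also be written without explicit loops. The following is only a minimal sketch, not the author's method: the name `df_mean_vectorized` is hypothetical, and it assumes the same semantics as `df_mean` (a mean becomes NaN when the missing rate is greater than or equal to `limit`).

```python
import numpy as np
import pandas as pd


def df_mean_vectorized(u0, axis, limit):
    """Hypothetical vectorized variant: mean along `axis`, NaN where missing rate >= limit."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")
    # isna() gives a boolean frame; its mean along the axis is the missing rate.
    missing_rate = u0.isna().mean(axis=axis)
    # Keep the ordinary NaN-skipping mean only where the missing rate is below the limit.
    return u0.mean(axis=axis).where(missing_rate < limit)


df = pd.DataFrame(
    [[1, np.nan, np.nan, np.nan, np.nan],
     [2, 3, np.nan, np.nan, np.nan],
     [3, 4, 5, np.nan, np.nan],
     [4, 5, 6, 7, np.nan]],
    index=['a', 'b', 'c', 'd'],
    columns=['A', 'B', 'C', 'D', 'E'],
)
result = df_mean_vectorized(df, axis=0, limit=0.6)
print(result)
```

Because `Series.where` replaces entries with NaN wherever the condition is False, no explicit branching is needed; the result is indexed by column labels for axis=0 and by row labels for axis=1.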
Result of computing the mean directly with df.mean():
print(df.mean(axis=0))
A    2.5
B    4.0
C    5.5
D    7.0
E    NaN
dtype: float64
Result of using the custom function:
print(df_mean(df, axis=0, limit=0.6))
A    2.5
B    4.0
C    5.5
D    NaN
E    NaN
dtype: float64
As the output shows, the missing rates of columns 'D' and 'E' (0.75 and 1.0) are at least 0.6, so their means are set to NaN.
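The same threshold applies row-wise. As a sketch that assumes the example DataFrame from this post and a limit of 0.6 (the variable names `missing_rate` and `row_mean` are illustrative), the per-row missing rates are 0.8, 0.6, 0.4 and 0.2, so rows 'a' and 'b' should come out as NaN. Row 'b' sits exactly on the threshold and is excluded because the rule is "greater than or equal to".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, np.nan, np.nan, np.nan],
     [2, 3, np.nan, np.nan, np.nan],
     [3, 4, 5, np.nan, np.nan],
     [4, 5, 6, 7, np.nan]],
    index=['a', 'b', 'c', 'd'],
    columns=['A', 'B', 'C', 'D', 'E'],
)

limit = 0.6
# Fraction of NaN values in each row: NaN count divided by the number of columns.
missing_rate = df.isna().sum(axis=1) / df.shape[1]
# Per-row mean (NaN values skipped), kept only where the missing rate is below the limit.
row_mean = df.mean(axis=1).where(missing_rate < limit)
print(row_mean)
```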