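Filtering out rows that occur infrequently: pandas.groupby
阿新 • Published: 2018-12-13
Goal: filter out the rows of a pandas DataFrame whose key occurs too rarely. One way to write it, with df as the DataFrame to filter (this keeps only the groups that have more than 500 and fewer than 1000 rows):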
df_family_car = df.groupby("PLATE_INFO_EX").filter(lambda x: (len(x) > 500 and len(x)<1000))
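To see what the filter does on something small, here is a minimal sketch. The toy data and the bounds 1 and 4 (standing in for 500 and 1000) are assumptions for illustration only. It also shows an alternative that broadcasts each group's size back onto its rows with transform("size") and then applies a boolean mask, which tends to be faster than a Python lambda on large frames.

```python
import pandas as pd

# Hypothetical toy data; the bounds 1 and 4 stand in for 500 and 1000.
df = pd.DataFrame({"PLATE_INFO_EX": list("AAABBCCCCC"), "value": range(10)})

# Approach from the text: keep whole groups whose size lies within the bounds.
kept_filter = df.groupby("PLATE_INFO_EX").filter(lambda g: 1 < len(g) < 4)

# Alternative: attach each group's size to its rows, then mask.
sizes = df.groupby("PLATE_INFO_EX")["PLATE_INFO_EX"].transform("size")
kept_mask = df[(sizes > 1) & (sizes < 4)]

print(kept_filter)                    # rows of groups A (3 rows) and B (2 rows)
print(kept_filter.equals(kept_mask))  # both approaches select the same rows
```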
For a more detailed look at groupby usage, see: https://blog.csdn.net/songbinxu/article/details/79839363
https://blog.csdn.net/youngbit007/article/details/54288603/
Create some sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': list('aabba'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df
Out[83]:
  key1 key2     data1     data2
0    a  one -0.643930 -0.856232
1    a  two  0.863575 -0.577838
2    b  one  0.261961 -1.045156
3    b  two  0.820736  0.790127
4    a  one -0.991311 -0.999499
Iterating over a groupby: each step yields a (name, group) tuple, so the result can be unpacked in a for loop.
for name, group in df.groupby('key1'):
    print(name)
    print(group)
# Result:
# a
#       data1     data2 key1 key2
# 0 -1.389589  0.605121    a  one
# 1  0.057731  1.387236    a  two
# 4  0.973961 -1.540356    a  one
# b
#       data1     data2 key1 key2
# 2 -0.476933 -0.110656    b  one
# 3 -0.015403  0.117257    b  two

# With multiple keys, the group name is itself a tuple
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)
# Result:
# a one
#       data1     data2 key1 key2
# 0 -0.474012  0.159072    a  one
# 4 -2.049148  0.389898    a  one
# a two
#       data1     data2 key1 key2
# 1  2.471597  1.335773    a  two
# b one
#       data1     data2 key1 key2
# 2  0.249875  0.181691    b  one
# b two
#       data1     data2 key1 key2
# 3  0.458725  0.040619    b  two
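Because iteration yields (name, group) pairs, the groups can also be collected into a dict, or a single group can be fetched directly with get_group. A small sketch, using fixed values in data1 (an assumption, so that the output is reproducible) instead of random numbers:

```python
import pandas as pd

# Same keys as the sample data, with fixed values instead of random ones.
df = pd.DataFrame({"key1": list("aabba"),
                   "key2": ["one", "two", "one", "two", "one"],
                   "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

grouped = df.groupby("key1")

# Collect the (name, group) tuples into a dict keyed by group name.
pieces = {name: group for name, group in grouped}
print(pieces["b"])

# get_group fetches one group without looping over all of them.
print(grouped.get_group("a"))
```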
1: Sum, cumulative sum, and product of the values within each key
# sum within each key
gp = df.groupby(["key1"])["data1"].sum().reset_index()  # reset_index turns the group key back into a column
gp.rename(columns={"data1": "sum_of_value"}, inplace=True)  # rename changes the column name
gp
Out[85]:
  key1  sum_of_value
0    a     -0.771667
1    b      1.082697

# cumulative sum of the values within each key (one result per original row)
gp = df.groupby(["key1"])["data1"].cumsum().reset_index()
gp.rename(columns={"data1": "cumsum_of_value"}, inplace=True)
gp
Out[88]:
   index  cumsum_of_value
0      0        -0.643930
1      1         0.219645
2      2         0.261961
3      3         1.082697
4      4        -0.771667

# product of all values within each key
gp = df.groupby(["key1"])["data1"].prod().reset_index()
gp.rename(columns={"data1": "prod_of_value"}, inplace=True)
gp
Out[91]:
  key1  prod_of_value
0    a       0.551250
1    b       0.215001
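Note the difference in shape above: sum and prod return one row per key, while cumsum returns one row per original row. Since the cumsum result keeps the original index, it can be assigned straight back as a column, and transform("sum") does the same for an aggregate. A minimal sketch with fixed values (an assumption, to keep the numbers checkable):

```python
import pandas as pd

df = pd.DataFrame({"key1": list("aabba"),
                   "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

# cumsum is indexed like df, so it can be added directly as a new column.
df["cumsum_of_value"] = df.groupby("key1")["data1"].cumsum()

# sum gives one value per group; transform("sum") broadcasts it to every row.
df["sum_of_value"] = df.groupby("key1")["data1"].transform("sum")

print(df)
```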
2: Mean (.mean()), maximum (.max()), minimum (.min()), and index of the maximum (.idxmax()) of the values within each key
# mean within each key
gp = df.groupby(["key1"])["data1"].mean().reset_index()
gp.rename(columns={"data1":"mean_of_value"},inplace=True)
gp
Out[93]:
key1 mean_of_value
0 a -0.257222
1 b 0.541349
# ... max and min are written the same way
# index in the original DataFrame of the maximum value within each key
gp = df.groupby(["key1"])["data1"].idxmax().reset_index()
gp.rename(columns={"data1":"maxidx_of_value"},inplace=True)
gp
Out[95]:
key1 maxidx_of_value
0 a 1
1 b 3
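The labels returned by idxmax are index labels of the original DataFrame, so they can be passed to df.loc to pull out the full row that holds each group's maximum. A short sketch with fixed, made-up values:

```python
import pandas as pd

df = pd.DataFrame({"key1": list("aabba"),
                   "key2": ["one", "two", "one", "two", "one"],
                   "data1": [1.0, 5.0, 3.0, 4.0, 2.0]})

# One index label per key: the row where data1 is largest in that group.
idx = df.groupby("key1")["data1"].idxmax()

# Use those labels to select the complete rows from the original frame.
top_rows = df.loc[idx]
print(top_rows)
```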
3: Rank of the values within each key. Tied values share the average of the ranks they span, so fractional ranks such as 2.5 can appear.
# rank of each value within its key
gp = df.groupby(["key1"])["data1"].rank().reset_index()
gp.rename(columns={"data1":"rank_of_value"},inplace=True)
gp
Out[97]:
index rank_of_value
0 0 2.0
1 1 3.0
2 2 1.0
3 3 2.0
4 4 1.0
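The sample data above has no ties, so no fractional ranks show up. The sketch below uses made-up values with ties to show where 2.5 comes from, and how the method argument of rank changes the tie handling:

```python
import pandas as pd

# Made-up data: the two 20s in group "a" tie for ranks 2 and 3.
df = pd.DataFrame({"key1": ["a"] * 4 + ["b"] * 2,
                   "data1": [10, 20, 20, 30, 7, 7]})

g = df.groupby("key1")["data1"]
df["rank_avg"] = g.rank()                  # default method="average"
df["rank_min"] = g.rank(method="min")      # ties take the lowest rank
df["rank_dense"] = g.rank(method="dense")  # like min, but with no gaps

print(df)
```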
4: size() counts the number of rows in each group, similar to a grouped value_counts()
gp =df.groupby(["key1","key2"]).size().reset_index()
gp.rename(columns={0:"count"},inplace=True)
gp
Out[110]:
key1 key2 count
0 a one 2
1 a two 1
2 b one 1
3 b two 1
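As a sketch of two equivalent spellings: reset_index(name="count") names the count column in one step instead of renaming column 0 afterwards, and value_counts on a grouped column produces the same counts as a Series with a MultiIndex.

```python
import pandas as pd

df = pd.DataFrame({"key1": list("aabba"),
                   "key2": ["one", "two", "one", "two", "one"]})

# Name the count column directly instead of renaming column 0.
counts = df.groupby(["key1", "key2"]).size().reset_index(name="count")
print(counts)

# The grouped value_counts gives the same numbers, indexed by (key1, key2).
vc = df.groupby("key1")["key2"].value_counts()
print(vc)
```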
----------------------------------------
Another example:
Count how many times each user visited each brand_id on each day
gp = data.groupby(["user_id","brand_id","day"]).size().reset_index()
gp.rename(columns={0:"count"},inplace=True)
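The original does not show what data looks like, so the sketch below assumes a small hypothetical visit log with one row per visit. It runs the same two lines and then, as an optional extra, reshapes the counts into one column per day with unstack, filling combinations that never occurred with 0.

```python
import pandas as pd

# Hypothetical visit log: one row per visit (not the author's real data).
data = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "brand_id": [10, 10, 10, 10, 20],
    "day": ["2018-12-01", "2018-12-01", "2018-12-02",
            "2018-12-01", "2018-12-01"],
})

gp = data.groupby(["user_id", "brand_id", "day"]).size().reset_index()
gp.rename(columns={0: "count"}, inplace=True)
print(gp)

# Optional: one column per day; missing combinations become 0.
wide = data.groupby(["user_id", "brand_id", "day"]).size().unstack("day", fill_value=0)
print(wide)
```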
-------------------------------------------