
Filtering out rows that occur rarely: pandas.groupby

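Requirement: filter out the rows of a pandas DataFrame whose key value occurs too rarely. One way to write it is shown below, where df is the data to be filtered. Note that this particular condition keeps only groups with more than 500 and fewer than 1000 rows, so very frequent values are dropped as well: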

df_family_car = df.groupby("PLATE_INFO_EX").filter(lambda x: (len(x) > 500 and len(x)<1000))
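
To check the behaviour by hand, here is a minimal sketch on a tiny frame. The column names `plate` and `speed`, the sample values, and the threshold of 2 are made up for illustration. The second form, based on `transform("size")`, is an alternative that can be faster when there are many groups; it is not the method used above.

```python
import pandas as pd

# Hypothetical data: plate "A" appears 3 times, "B" twice, "C" once.
df_demo = pd.DataFrame({"plate": ["A", "A", "A", "B", "B", "C"],
                        "speed": [50, 60, 55, 70, 65, 80]})

# Keep only rows whose plate occurs at least 2 times.
kept = df_demo.groupby("plate").filter(lambda g: len(g) >= 2)

# Alternative: compute each row's group size, then apply a boolean mask.
sizes = df_demo.groupby("plate")["plate"].transform("size")
kept_mask = df_demo[sizes >= 2]

print(kept)
```

Both forms keep the original row order and index, so the single "C" row is the only one dropped.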

For a closer look at groupby usage, see these links: https://blog.csdn.net/songbinxu/article/details/79839363

https://blog.csdn.net/youngbit007/article/details/54288603/

Create some sample data:

import numpy as np  # needed for np.random.randn below
import pandas as pd
df = pd.DataFrame({'key1':list('aabba'),
                  'key2': ['one','two','one','two','one'],
                  'data1': np.random.randn(5),
                  'data2': np.random.randn(5)})

df
Out[83]: 
  key1 key2     data1     data2
0    a  one -0.643930 -0.856232
1    a  two  0.863575 -0.577838
2    b  one  0.261961 -1.045156
3    b  two  0.820736  0.790127
4    a  one -0.991311 -0.999499

Iterating over a groupby: the groupby object is iterable, and each step yields a tuple (name, group).

for name,group in df.groupby('key1'):
    print(name)
    print(group)
# Result (from a different random run, so the numbers differ from the table above):
#a
#      data1     data2 key1 key2
#0 -1.389589  0.605121    a  one
#1  0.057731  1.387236    a  two
#4  0.973961 -1.540356    a  one
#b
#      data1     data2 key1 key2
#2 -0.476933 -0.110656    b  one
#3 -0.015403  0.117257    b  two
 
# With multiple keys, the name is itself a tuple of key values
for (k1,k2),group in df.groupby(['key1','key2']):
    print(k1,k2)
    print(group)
# Result:
#a one
#       data1     data2 key1 key2
# 0 -0.474012  0.159072    a  one
# 4 -2.049148  0.389898    a  one
# a two
#       data1     data2 key1 key2
# 1  2.471597  1.335773    a  two
# b one
#       data1     data2 key1 key2
# 2  0.249875  0.181691    b  one
# b two
#       data1     data2 key1 key2
# 3  0.458725  0.040619    b  two
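
Because each step yields a (name, group) pair, the groups can be collected into a dictionary of sub-DataFrames, and a single group can be fetched with `get_group`. A small sketch with fixed numbers (`df_demo` and its values are made up here so the result is reproducible):

```python
import pandas as pd

df_demo = pd.DataFrame({"key1": list("aabba"),
                        "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Each item of the iteration is a (name, group) tuple.
pieces = {name: group for name, group in df_demo.groupby("key1")}
print(pieces["b"])

# get_group returns the sub-DataFrame for one key directly.
group_a = df_demo.groupby("key1").get_group("a")
print(group_a.index.tolist())
```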

 

1: Sum, cumulative sum, and product of the values within each key

# Sum within each key
gp = df.groupby(["key1"])["data1"].sum().reset_index() # reset_index turns the group key back into a regular column
gp.rename(columns={"data1":"sum_of_value"},inplace=True) # rename changes the column name

gp
Out[85]: 
  key1  sum_of_value
0    a     -0.771667
1    b      1.082697

# Cumulative sum of the values within each key (one result per original row)
gp = df.groupby(["key1"])["data1"].cumsum().reset_index() 
gp.rename(columns={"data1":"cumsum_of_value"},inplace=True)

gp
Out[88]: 
   index  cumsum_of_value
0      0        -0.643930
1      1         0.219645
2      2         0.261961
3      3         1.082697
4      4        -0.771667

# Product of all values within each key
gp = df.groupby(["key1"])["data1"].prod().reset_index()
gp.rename(columns={"data1":"prod_of_value"},inplace=True)
gp
Out[91]: 
  key1  prod_of_value
0    a       0.551250
1    b       0.215001
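
Note the difference in shape: sum() and prod() return one row per key, while cumsum() returns one value per original row, indexed like the original DataFrame. The sketch below uses fixed, made-up numbers (`df_demo` is hypothetical) and attaches the cumulative sum back as a new column:

```python
import pandas as pd

df_demo = pd.DataFrame({"key1": list("aabba"),
                        "data1": [1.0, 2.0, 3.0, 4.0, 5.0]})

# One row per key: a -> 1 + 2 + 5, b -> 3 + 4
totals = df_demo.groupby("key1")["data1"].sum()

# One value per original row, so it can be assigned directly as a column.
df_demo["cumsum_of_value"] = df_demo.groupby("key1")["data1"].cumsum()

print(totals)
print(df_demo)
```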

2: Mean .mean(), maximum .max(), minimum .min(), and index of the maximum .idxmax() within each key

# Mean within each key
gp = df.groupby(["key1"])["data1"].mean().reset_index()
gp.rename(columns={"data1":"mean_of_value"},inplace=True)
gp
Out[93]: 
  key1  mean_of_value
0    a      -0.257222
1    b       0.541349

# ... max and min are written the same way

# Index in the original DataFrame of the maximum value within each key
gp = df.groupby(["key1"])["data1"].idxmax().reset_index()
gp.rename(columns={"data1":"maxidx_of_value"},inplace=True)
gp
Out[95]: 
  key1  maxidx_of_value
0    a                1
1    b                3
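
Two follow-ups that may be useful here, sketched with made-up numbers (`df_demo` is hypothetical): `agg` computes several statistics per key in one call, and the labels returned by `idxmax` can be passed to `loc` to pull out the full rows holding each key's maximum.

```python
import pandas as pd

df_demo = pd.DataFrame({"key1": list("aabba"),
                        "data1": [1.0, 5.0, 3.0, 4.0, 2.0]})

# Several statistics per key in one call.
stats = df_demo.groupby("key1")["data1"].agg(["mean", "max", "min"])

# idxmax gives index labels; loc turns them into the full rows.
max_rows = df_demo.loc[df_demo.groupby("key1")["data1"].idxmax()]

print(stats)
print(max_rows)
```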

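3: Rank of the values within each key. Tied values receive the average of the positions they occupy, so fractional ranks such as 2.5 can appear.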

# Rank of each value within its key
gp = df.groupby(["key1"])["data1"].rank().reset_index()
gp.rename(columns={"data1":"rank_of_value"},inplace=True)

gp
Out[97]: 
   index  rank_of_value
0      0            2.0
1      1            3.0
2      2            1.0
3      3            2.0
4      4            1.0
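
The random data above has no ties, so no fractional rank shows up. A sketch with made-up values that do tie (`df_demo` is hypothetical) shows the 2.5 case, and also `method="min"`, one of the options of `rank` that keeps tied ranks whole:

```python
import pandas as pd

df_demo = pd.DataFrame({"key1": ["a", "a", "a", "a", "b", "b"],
                        "data1": [10, 20, 20, 30, 7, 7]})

# Default method="average": tied values share the mean of their positions.
avg_rank = df_demo.groupby("key1")["data1"].rank()

# method="min" gives ties the lowest position instead.
min_rank = df_demo.groupby("key1")["data1"].rank(method="min")

print(avg_rank.tolist())  # [1.0, 2.5, 2.5, 4.0, 1.5, 1.5]
print(min_rank.tolist())  # [1.0, 2.0, 2.0, 4.0, 1.0, 1.0]
```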

4: size() counts the occurrences of each key combination, similar to a grouped value_counts()

gp = df.groupby(["key1","key2"]).size().reset_index()
gp.rename(columns={0:"count"},inplace=True)

gp
Out[110]: 
  key1 key2  count
0    a  one      2
1    a  two      1
2    b  one      1
3    b  two      1
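
As a variation, `reset_index(name=...)` names the count column in one step, which avoids renaming column 0 afterwards. The sketch below (with a hypothetical `df_demo`) also compares the result with `DataFrame.value_counts`, which assumes pandas 1.1 or later:

```python
import pandas as pd

df_demo = pd.DataFrame({"key1": list("aabba"),
                        "key2": ["one", "two", "one", "two", "one"]})

# reset_index(name=...) names the count column directly.
counts = df_demo.groupby(["key1", "key2"]).size().reset_index(name="count")

# value_counts on the same columns gives the same counts, sorted by frequency.
vc = df_demo[["key1", "key2"]].value_counts()

print(counts)
print(vc)
```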

----------------------------------------
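Another example:
Count how many times each user visited each brand_id on each day (data is a DataFrame with user_id, brand_id, and day columns)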
gp = data.groupby(["user_id","brand_id","day"]).size().reset_index() 
gp.rename(columns={0:"count"},inplace=True)
-------------------------------------------
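
The `data` frame itself is not shown, so here is a sketch of the same two lines run on a hypothetical visit log with one row per visit; the values are made up for illustration:

```python
import pandas as pd

# Hypothetical visit log: one row per visit.
data = pd.DataFrame({"user_id": [1, 1, 1, 2, 2],
                     "brand_id": [10, 10, 20, 10, 10],
                     "day": ["d1", "d1", "d1", "d1", "d2"]})

gp = data.groupby(["user_id", "brand_id", "day"]).size().reset_index()
gp.rename(columns={0: "count"}, inplace=True)

print(gp)
```

User 1 visited brand 10 twice on day d1, so that combination gets a count of 2; every other combination appears once.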