Viewing Spark processes / telling pyspark and pandas table merges apart: pyspark uses join, pandas uses merge
阿新 • Published: 2019-02-01
Commands:

vim ~/.bashrc
source ~/.bashrc
ps aux | grep spark                                          # view running Spark processes
pkill -f "spark"                                             # kill them
sudo chown -R sc:sc spark-2.3.1-bin-hadoop2.7/
sudo mv /home/sc/Downloads/spark-2.3.1-bin-hadoop2.7 /opt/
locate *punish*                                              # look up a file's path

Join error with pandas: I wrote the join like this:

df22 = df1.join(df2, df2.company_name_a == df1.company_name, 'left_outer')

and got this error:

ValueError: Can only compare identically-labeled Series objects
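That three-argument join(df2, condition, 'left_outer') is PySpark's DataFrame API; in pandas the expression df2.company_name_a == df1.company_name is evaluated immediately as a Series comparison, which is what raises the ValueError. In pandas the same left join is written with pd.merge. A minimal sketch, with the column names taken from the error above and the data invented purely for illustration:

import pandas as pd

# Hypothetical frames that only mirror the column names from the error message.
df1 = pd.DataFrame({'company_name': ['a', 'b'], 'x': [1, 2]})
df2 = pd.DataFrame({'company_name_a': ['a', 'c'], 'y': [10, 30]})

# pandas joins by column name with merge(); how='left' keeps every row of df1.
df22 = pd.merge(df1, df2, left_on='company_name', right_on='company_name_a', how='left')
print(df22)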
>>> df1.join(df2, df1["value"] == df2["value"]).count()
0
>>> df1.join(df2, df1["value"].eqNullSafe(df2["value"])).count()
import pandas as pd

train_x = pd.read_csv('/home/sc/PycharmProjects/sc/risk_rules/sklearn_result_02/the_check_shixin_train.csv')
print(train_x.columns)
train_x['add_companyname'] = train_x['company_name']
print(train_x.columns)

df_check_1000 = pd.read_csv('/home/sc/Desktop/shixin_detect_result_shixin_cnt.csv')
df_check_1000 = df_check_1000.drop_duplicates()

df_ch1 = pd.merge(df_check_1000, train_x, on='company_name', how='left')
print(df_ch1.head(2))

# 248 companies: dishonest (shixin) more than once and never seen in the training set
df_ch2 = df_ch1[(df_ch1['add_companyname'].isnull()) & (df_ch1['shixin_cnt'] != 1)]
print(df_ch2.groupby(['id']).size())
print(df_ch2.groupby(['shixin_cnt']).size())
print(len(df_ch2))

df_ch2 = pd.merge(df_ch2, df_check_1000, on='company_name', how='left')
print(len(df_ch2))

cols = ['company_name', 'established_years', 'industry_dx_rate', 'regcap_change_cnt',
        'industry_dx_cnt', 'address_change_cnt', 'network_share_cancel_cnt', 'cancel_cnt',
        'fr_change_cnt', 'network_share_zhixing_cnt', 'network_share_judge_doc_cnt',
        'judge_doc_cnt', 'share_change_cnt', 'industry_all_cnt',
        'network_share_or_pos_shixin_cnt', 'judgedoc_cnt']
print("hahahhaha")
print(df_ch2.columns)
df_ch22 = df_ch2.loc[:, cols]  # .ix is removed in current pandas; .loc does the same label-based selection
print(df_ch22.columns)
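For contrast with the title's point (pandas merges with merge, PySpark merges with join), the same left join on company_name would look roughly like this in PySpark. This is only a sketch: the spark session, the *_sp names, and reading the same csv paths with spark.read.csv are assumptions, not the author's code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed: load the same two csv files as Spark DataFrames.
train_x_sp = spark.read.csv(
    '/home/sc/PycharmProjects/sc/risk_rules/sklearn_result_02/the_check_shixin_train.csv',
    header=True, inferSchema=True)
df_check_sp = spark.read.csv(
    '/home/sc/Desktop/shixin_detect_result_shixin_cnt.csv',
    header=True, inferSchema=True).dropDuplicates()

# PySpark spells the merge as join(); the key column and the join type are arguments.
df_ch1_sp = df_check_sp.join(train_x_sp, on='company_name', how='left')
print(df_ch1_sp.count())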