FDDC2018 Financial Algorithm Challenge 01: Quarterly Revenue Prediction for A-Share Listed Companies
Data used
1. income_gb_2 was exported from the "general business" sheet of the original Tianchi income_statement table; balance_gb_2 and cash_gb_2 were produced the same way from the balance-sheet and cash-flow tables.
2. Macro holds the macroeconomic data; Market holds the market data.
Import the required packages and change the working directory to the data directory.
from pandas import DataFrame, Series
from numpy import nan as NA
import os
import pandas as pd
import numpy as np
import random
import time
import threading as td
import multiprocessing as mp
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier, XGBRegressor
#sklearn.cross_validation and sklearn.grid_search were removed in later sklearn versions
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.decomposition import PCA
#Change the working directory to where the data lives
os.chdir('E://kaggle//tc2')
1. The line items of the balance sheet, income statement and cash-flow statement are linked by accounting identities, e.g. assets = liabilities + shareholders' equity, and total profit = operating revenue − operating costs − expenses. Rows that violate these identities are treated as outliers and removed.
2. Because of these identities, missing values are simply filled with 0, which leaves the equalities intact.
3. Some columns are sums of the columns before them; these total columns are dropped to avoid multicollinearity.
#Remove outliers and linearly dependent columns using financial-statement identities
#Income statement
#Load the data
income_gb2=pd.read_csv('income_gb_2.csv')
#Fill missing values
income_gb2=income_gb2.fillna(0)
#Collect the row indices of samples that violate the identities
income_drop_index=[]
for i in range(np.shape(income_gb2)[0]):
    if (income_gb2.ix[i,9]-income_gb2.ix[i,10:16].sum()) >1000 or \
       (income_gb2.ix[i,9]-income_gb2.ix[i,10:16].sum()) <-1000 or \
       (income_gb2.ix[i,16]-income_gb2.ix[i,17:32].sum()) >1000 or \
       (income_gb2.ix[i,16]-income_gb2.ix[i,17:32].sum()) <-1000 or \
       (income_gb2.ix[i,10:16].sum()-income_gb2.ix[i,17:32].sum()+income_gb2.ix[i,32:34].sum()+ \
        income_gb2.ix[i,35:40].sum()-income_gb2.ix[i,40]) > 1000 or \
       (income_gb2.ix[i,10:16].sum()-income_gb2.ix[i,17:32].sum()+income_gb2.ix[i,32:34].sum()+ \
        income_gb2.ix[i,35:40].sum()-income_gb2.ix[i,40]) < -1000 :
        income_drop_index.append(i)
    print((i/np.shape(income_gb2)[0])*100)
#Drop the offending rows
income_gb2_drop=income_gb2.drop(income_drop_index,axis=0)
#Drop the total columns (linear combinations of the others) to avoid multicollinearity
#(fixed: the original dropped the columns from income_gb2, silently discarding the row filter above)
income_gb2_drop=income_gb2_drop.drop(['T_REVENUE','T_COGS','OPERATE_PROFIT','N_INCOME','T_COMPR_INCOME'],axis=1)
income_gb2_drop.to_csv('income_gb2_drop.csv',index=None)
#Balance sheet, same approach
balance_gb2=pd.read_csv('balance_gb_2.csv')
balance_gb2=balance_gb2.fillna(0)
balance_gb2_drop=balance_gb2
balance_gb2_drop1=balance_gb2.drop(['T_CA','T_NCA','T_ASSETS','T_CL','T_NCL','T_LIAB',
                                    'PREFERRED_STOCK_E','PREFERRED_STOCK_L','T_EQUITY_ATTR_P',
                                    'T_SH_EQUITY','T_LIAB_EQUITY'],axis=1)
idx=list(balance_gb2_drop.columns).index
idx1=list(balance_gb2_drop1.columns).index
n_rows=np.shape(balance_gb2_drop)[0]
#Asset-side items vs liability-and-equity-side items (on the frame without total columns)
balance_drop_index_total=[]
for i in range(n_rows):
    diff=balance_gb2_drop1.ix[i,9:idx1('ST_BORR')].sum()-balance_gb2_drop1.ix[i,idx1('ST_BORR'):].sum()
    if diff >10000 or diff < -10000 :
        balance_drop_index_total.append(i)
    print((i+1)/n_rows)
#T_ASSETS vs T_LIAB_EQUITY
balance_drop_index_sum=[]
for i in range(n_rows):
    diff=balance_gb2_drop.ix[i,idx('T_ASSETS')]-balance_gb2_drop.ix[i,idx('T_LIAB_EQUITY')]
    if diff >10000 or diff < -10000 :
        balance_drop_index_sum.append(i)
    print((i+1)/n_rows)
#Current-asset items vs T_CA
balance_drop_index_TCA=[]
for i in range(n_rows):
    diff=balance_gb2_drop.ix[i,9:idx('T_CA')].sum()-balance_gb2_drop.ix[i,idx('T_CA')]
    if diff >10000 or diff < -10000 :
        balance_drop_index_TCA.append(i)
    print((i+1)/n_rows)
#Non-current-asset items vs T_NCA
balance_drop_index_TNCA=[]
for i in range(n_rows):
    diff=balance_gb2_drop.ix[i,idx('DISBUR_LA'):idx('T_NCA')].sum()-balance_gb2_drop.ix[i,idx('T_NCA')]
    if diff >10000 or diff < -10000 :
        balance_drop_index_TNCA.append(i)
    print((i+1)/n_rows)
#Current-liability items vs T_CL
balance_drop_index_T_CL=[]
for i in range(n_rows):
    diff=balance_gb2_drop.ix[i,idx('ST_BORR'):idx('T_CL')].sum()-balance_gb2_drop.ix[i,idx('T_CL')]
    if diff >10000 or diff < -10000 :
        balance_drop_index_T_CL.append(i)
    print((i+1)/n_rows)
#Non-current-liability items vs T_NCL (preferred stock excluded)
balance_drop_index_T_NCL=[]
for i in range(n_rows):
    diff=balance_gb2_drop.ix[i,idx('LT_BORR'):idx('T_NCL')].sum()- \
         balance_gb2_drop.ix[i,idx('T_NCL')]-balance_gb2_drop.ix[i,idx('PREFERRED_STOCK_L')]
    if diff >10000 or diff < -10000 :
        balance_drop_index_T_NCL.append(i)
    print((i+1)/n_rows)
#Parent-company-equity items vs T_EQUITY_ATTR_P (preferred stock excluded)
balance_drop_index_T_EQUITY_ATTR_P=[]
for i in range(n_rows):
    diff=balance_gb2_drop.ix[i,idx('PAID_IN_CAPITAL'):idx('T_EQUITY_ATTR_P')].sum()- \
         balance_gb2_drop.ix[i,idx('T_EQUITY_ATTR_P')]-balance_gb2_drop.ix[i,idx('PREFERRED_STOCK_E')]
    if diff >10000 or diff < -10000 :
        balance_drop_index_T_EQUITY_ATTR_P.append(i)
    print((i+1)/n_rows)
#Equity components vs T_SH_EQUITY
balance_drop_index_T_SH_EQUITY=[]
for i in range(n_rows):
    diff=balance_gb2_drop.ix[i,idx('T_EQUITY_ATTR_P'):idx('T_SH_EQUITY')].sum()-balance_gb2_drop.ix[i,idx('T_SH_EQUITY')]
    if diff >10000 or diff < -10000 :
        balance_drop_index_T_SH_EQUITY.append(i)
    print((i+1)/n_rows)
balance_drop_index_final=balance_drop_index_sum+balance_drop_index_TCA+balance_drop_index_TNCA+ \
                         balance_drop_index_T_CL+balance_drop_index_T_NCL+ \
                         balance_drop_index_T_EQUITY_ATTR_P+balance_drop_index_T_SH_EQUITY
balance_drop_index_final=list(set(balance_drop_index_final))
balance_gb2_drop_final=balance_gb2.drop(balance_drop_index_final,axis=0)
balance_gb2_drop_final=balance_gb2_drop_final.drop(['T_CA','T_NCA','T_ASSETS','T_CL','T_NCL','T_LIAB',
                                                    'PREFERRED_STOCK_E','PREFERRED_STOCK_L','T_EQUITY_ATTR_P',
                                                    'T_SH_EQUITY','T_LIAB_EQUITY'],axis=1)
balance_gb2_drop_final.to_csv('balance_gb2_drop.csv',index=None)
#Cash-flow statement, same approach
cash_gb2=pd.read_csv('cash_gb_2.csv')
cash_gb2=cash_gb2.fillna(0)
cidx=list(cash_gb2.columns).index
#Operating inflows minus outflows vs net operating cash flow
cash_drop_index_OPERATE_A=[]
for i in range(np.shape(cash_gb2)[0]):
    if abs(cash_gb2.ix[i,cidx('C_FR_SALE_G_S'):cidx('C_INF_FR_OPERATE_A')].sum()- \
           cash_gb2.ix[i,cidx('C_PAID_G_S'):cidx('C_OUTF_OPERATE_A')].sum()+ \
           cash_gb2.ix[i,cidx('ANOCF')]-cash_gb2.ix[i,cidx('N_CF_OPERATE_A')]) >10000 :
        cash_drop_index_OPERATE_A.append(i)
    print((i+1)/np.shape(cash_gb2)[0])
#Investing activities
cash_drop_index_INVEST_A=[]
for i in range(np.shape(cash_gb2)[0]):
    if abs(cash_gb2.ix[i,cidx('PROC_SELL_INVEST'):cidx('C_INF_FR_INVEST_A')].sum()- \
           cash_gb2.ix[i,cidx('PUR_FIX_ASSETS_OTH'):cidx('C_OUTF_FR_INVEST_A')].sum()+ \
           cash_gb2.ix[i,cidx('ANICF')]-cash_gb2.ix[i,cidx('N_CF_FR_INVEST_A')]) >10000 :
        cash_drop_index_INVEST_A.append(i)
    print((i+1)/np.shape(cash_gb2)[0])
#Financing activities (minority-shareholder items adjusted out)
cash_drop_index_FINAN_A=[]
for i in range(np.shape(cash_gb2)[0]):
    if abs(cash_gb2.ix[i,cidx('C_FR_CAP_CONTR'):cidx('C_INF_FR_FINAN_A')].sum()- \
           cash_gb2.ix[i,cidx('C_PAID_FOR_DEBTS'):cidx('C_OUTF_FR_FINAN_A')].sum()+ \
           cash_gb2.ix[i,cidx('ANFCF')]-cash_gb2.ix[i,cidx('N_CF_FR_FINAN_A')]- \
           cash_gb2.ix[i,cidx('C_FR_MINO_S_SUBS')]+cash_gb2.ix[i,cidx('DIV_PROF_SUBS_MINO_S')]) >10000 :
        cash_drop_index_FINAN_A.append(i)
    print((i+1)/np.shape(cash_gb2)[0])
#Net change in cash vs closing balance
cash_drop_index_BAL=[]
for i in range(np.shape(cash_gb2)[0]):
    if abs(cash_gb2.ix[i,cidx('N_CHANGE_IN_CASH'):cidx('N_CE_END_BAL')].sum()- \
           cash_gb2.ix[i,cidx('N_CE_END_BAL')]) >10000 :
        cash_drop_index_BAL.append(i)
    print((i+1)/np.shape(cash_gb2)[0])
cash_drop_index_final=list(set(cash_drop_index_OPERATE_A+cash_drop_index_INVEST_A+cash_drop_index_FINAN_A+cash_drop_index_BAL))
cash_gb2_drop_final=cash_gb2.drop(cash_drop_index_final,axis=0)
cash_gb2_drop_final=cash_gb2_drop_final.drop(['C_INF_FR_OPERATE_A','C_OUTF_OPERATE_A','N_CF_OPERATE_A',
                                              'C_INF_FR_INVEST_A','C_OUTF_FR_INVEST_A','N_CF_FR_INVEST_A',
                                              'C_INF_FR_FINAN_A','C_OUTF_FR_FINAN_A','N_CF_FR_FINAN_A',
                                              'N_CHANGE_IN_CASH','N_CE_END_BAL'],axis=1)
cash_gb2_drop_final.to_csv('cash_gb2_drop.csv',index=False)
1. Tidy the data in preparation for building the training set: when the same report period has multiple records, keep the most recently updated one.
#Select, for each ticker and reporting period, the most recently updated report,
#then clean up and write the result. The same routine is applied to the cash-flow,
#balance-sheet and income statements (the original repeated this code three times verbatim).
def select_latest(in_file, tmp_file, out_file):
    df=pd.read_csv(in_file)
    #1. Extract the date columns, 2. convert the strings to Unix timestamps, 3. join them back
    for col in ['PUBLISH_DATE','END_DATE_REP','END_DATE']:
        stamped=df[col].map(lambda x:int(time.mktime(time.strptime(x,'%Y-%m-%d'))))
        df=pd.concat([df,DataFrame(stamped.values,columns=[col+'_mktime'])],axis=1)
    keys=['TICKER_SYMBOL','END_DATE_mktime','PUBLISH_DATE_mktime','END_DATE_REP_mktime']
    df.sort_index(by=keys,inplace=True)
    df.set_index(keys,inplace=True,drop=False)
    #Mark the most recently published report for each ticker/period with PARTY_ID = -1
    sum1=0
    ticker_unique=df['TICKER_SYMBOL'].unique()
    for i in ticker_unique:
        end_date_unique=df.ix[i]['END_DATE_mktime'].unique()
        sum1 += 1
        for j in end_date_unique:
            index_t1=df.ix[i,j]['PUBLISH_DATE_mktime'].values[-1]
            index_t2=df.ix[i,j]['END_DATE_REP_mktime'].values[-1]
            df.ix[(i,j,index_t1,index_t2),'PARTY_ID']=-1
        print(sum1/len(ticker_unique))
    df=df.drop(keys,axis=1)
    #Blank out every row whose PARTY_ID is not -1 (i.e. superseded reports)
    df=df.reset_index()
    for i in range(np.shape(df)[0]):
        if df.ix[i,'PARTY_ID'] !=-1:
            df.iloc[i]=NA
        print(i/np.shape(df)[0])
    df.to_csv(tmp_file)
    #Drop the blanked rows, set a (ticker, period) hierarchical index,
    #drop bookkeeping columns and write the result
    df=df.dropna(how='all')
    df=df.set_index(['TICKER_SYMBOL','END_DATE'])
    df=df.drop(['PARTY_ID','EXCHANGE_CD','REPORT_TYPE','FISCAL_PERIOD','MERGED_FLAG','PUBLISH_DATE',
                'END_DATE_REP','END_DATE_mktime','PUBLISH_DATE_mktime','END_DATE_REP_mktime'],axis=1)
    df=df.sortlevel(0)
    df.to_csv(out_file)

select_latest('cash_gb2_drop.csv','cash_gb1.csv','cash_data.csv')
select_latest('balance_gb2_drop.csv','balance_gb1.csv','balance_data.csv')
select_latest('income_gb2_drop.csv','income_gb1.csv','income_data.csv')
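The per-ticker loop above can be expressed much more directly with pandas' sort/deduplicate primitives. A minimal sketch on hypothetical toy data (column names mirror the real tables, values are made up):

```python
import pandas as pd

# Keep, for each ticker and reporting period, only the most recently
# published row: sort so the newest publication comes last within each
# (ticker, period) group, then drop earlier duplicates.
df = pd.DataFrame({
    'TICKER_SYMBOL': [1, 1, 1, 2],
    'END_DATE':      ['2017-12-31', '2017-12-31', '2018-03-31', '2017-12-31'],
    'PUBLISH_DATE':  ['2018-01-20', '2018-04-10', '2018-04-25', '2018-02-01'],
    'REVENUE':       [100, 105, 30, 50],
})

latest = (df.sort_values(['TICKER_SYMBOL', 'END_DATE', 'PUBLISH_DATE'])
            .drop_duplicates(['TICKER_SYMBOL', 'END_DATE'], keep='last'))
print(latest['REVENUE'].tolist())  # [105, 30, 50]
```

This replaces the PARTY_ID marking, NA-blanking and dropna steps with a single pass, and avoids the per-row loop entirely.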
1. Split the string date column into year and month columns with small helper functions.
2. Inner-join the balance sheet, income statement and cash-flow statement on the ticker symbol and period-end date index.
#Merge the balance sheet, income statement and cash-flow statement
balance_gb3=pd.read_csv('balance_data.csv',index_col=['TICKER_SYMBOL','END_DATE'])
cash_gb3=pd.read_csv('cash_data.csv',index_col=['TICKER_SYMBOL','END_DATE'])
income_gb3=pd.read_csv('income_data.csv',index_col=['TICKER_SYMBOL','END_DATE'])
merge1=pd.merge(income_gb3,balance_gb3,left_index=True,right_index=True,how='inner')
merge2=pd.merge(merge1,cash_gb3,left_index=True,right_index=True,how='inner')
#Financial statement data, second pass
merge3=merge2.reset_index()
def f1(x):
    #Year: the first four characters
    return int(x[:4])
def f2(x):
    #Month: handle 'YYYY-MM-DD', 'YYYY-M-D' and 'YYYY/M/D' style date strings
    if len(x) ==10:
        return int(x[5:7])
    elif len(x) ==8:
        return int(x[5:6])
    elif x[4:7].count('/') == 2 :
        return int(x[5:6])
    else:
        return int(x[5:7])
#Split the date column into year and month columns
merge3['YEAR']=merge3['END_DATE'].map(f1)
merge3['MONTH']=merge3['END_DATE'].map(f2)
merge3.drop(['END_DATE'],axis=1,inplace=True)
merge3=merge3.set_index(['TICKER_SYMBOL','YEAR','MONTH'])
merge3=merge3.sortlevel(0)
#Save the data
merge3.to_csv('merge_data(7.26).csv')
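The f1/f2 string slicing is fragile across date formats; `pd.to_datetime` parses the strings and exposes year and month directly. A small sketch on made-up dates:

```python
import pandas as pd

# Parse date strings once, then read year/month from the datetime accessor
dates = pd.Series(['2017-12-31', '2018-03-31'])
parsed = pd.to_datetime(dates)
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())  # [2017, 2018] [12, 3]
```

The same two lines would replace every f1/f2 pair defined later in the post.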
Preprocess the market data
#Market data, second pass
market=pd.read_csv('Market.csv')
#Check for missing values
market.isnull().sum()
#Split the date into year and month columns with element-wise functions
def f1(x):
    return int(x[:4])
def f2(x):
    if len(x) ==10:
        return int(x[5:7])
    elif len(x) ==8:
        return int(x[5:6])
    elif x[4:7].count('/') == 2 :
        return int(x[5:6])
    else:
        return int(x[5:7])
market['YEAR']=market['END_DATE_'].map(f1)
market['MONTH']=market['END_DATE_'].map(f2)
market.drop(['SECURITY_ID','TYPE_ID','TYPE_NAME_CN','END_DATE_'],axis=1,inplace=True)
market=market.set_index(['TICKER_SYMBOL','YEAR','MONTH'])
market=market.sortlevel(0)
market.to_csv('market_final.csv')
Preprocess the macro data
#Macro data
macro=pd.read_csv('Macro.csv')
#Check for missing values and drop them
macro.isnull().sum()
macro.dropna(how='any',inplace=True)
macro=macro.set_index('FREQUENCY_CD')
macro=macro.sortlevel(0)
def f1(x):
    return int(x[:4])
def f2(x):
    if len(x) ==10:
        return int(x[5:7])
    elif len(x) ==8:
        return int(x[5:6])
    elif x[4:7].count('/') == 2 :
        return int(x[5:6])
    else:
        return int(x[5:7])
year_test=[2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,2018]
#Annual data
macro_A=macro.ix['A']
macro_A['YEAR']=macro_A['PERIOD_DATE'].map(f1)
macro_A.drop(['name_cn','PERIOD_DATE'],axis=1,inplace=True)
macro_A=macro_A.set_index(['indic_id','YEAR'])
macro_A=macro_A.sortlevel(0)
macro_A.to_csv('macro_A_final.csv')
#Convert monthly data to quarterly data
#Extract the monthly data
macro_M=macro.ix['M']
#Split the string date column into year and month columns
macro_M['YEAR']=macro_M['PERIOD_DATE'].map(f1)
macro_M['MONTH']=macro_M['PERIOD_DATE'].map(f2)
macro_M.drop(['PERIOD_DATE'],axis=1,inplace=True)
macro_M=macro_M.set_index(['indic_id','YEAR','MONTH'],drop=False)
macro_M=macro_M.sortlevel(0)
#Find missing observations
macro_M_na_year=[]
macro_M_na_month=[]
for i in list(macro_M.index.levels[0]):
    try:
        for j in year_test:
            macro_M.ix[i,j]
    except:
        macro_M_na_year.append(i)
    else:
        for j in year_test:
            for k in list(set(macro_M.ix[i,j]['MONTH'].values)):
                try :
                    macro_M.ix[i,j,k].values
                except:
                    macro_M_na_month.append([i,j,k])
macro_M.drop(['indic_id','YEAR','MONTH'],axis=1,inplace=True)
macro_M=macro_M.reset_index(['YEAR','MONTH'])
#If a whole year of data is missing, drop that indicator
macro_M=macro_M.drop(macro_M_na_year,axis=0)
macro_M=macro_M.reset_index()
#If an indicator is missing only some months, fill them with the yearly mean
for i in macro_M_na_month:
    part1=DataFrame([[i[0],i[1],i[2],0,NA]],columns=list(macro_M.columns))
    macro_M=pd.concat([macro_M,part1],ignore_index=True)
for i in range(np.shape(macro_M)[0]):
    if macro_M.ix[i,1] < 2006:
        macro_M.ix[i]=NA
macro_M.dropna(how='all',inplace=True)
macro_M=macro_M.set_index(['indic_id','YEAR'])
macro_M=macro_M.sortlevel(0)
macro_M=macro_M.fillna(macro_M.mean(level=[0,1]))
macro_M=macro_M.reset_index()
macro_M=macro_M.set_index(['indic_id','YEAR','MONTH'],drop=False)
macro_M=macro_M.sortlevel(0)
#Bucket months into the 4 quarters (map each month to its quarter-end month)
for i in list(macro_M.index.levels[0]):
    for j in year_test:
        try:
            for k in list(set(macro_M.ix[i,j]['MONTH'].values)):
                if k <= 3:
                    macro_M.ix[(i,j,k),'name_cn'] = 3
                elif k <= 6:
                    macro_M.ix[(i,j,k),'name_cn'] = 6
                elif k <= 9 :
                    macro_M.ix[(i,j,k),'name_cn'] = 9
                else:
                    macro_M.ix[(i,j,k),'name_cn'] = 12
        except:
            print(i,j)
macro_M.drop(['indic_id','YEAR','MONTH'],axis=1,inplace=True)
macro_M=macro_M.reset_index()
macro_M=macro_M.set_index(['indic_id','YEAR','name_cn'])
#Sum to quarterly values for each indicator and year
macro_M=macro_M.sum(level=[0,1,2])
macro_M.drop(['MONTH'],axis=1,inplace=True)
macro_M.to_csv('macro_M_final.csv')
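The if/elif month-to-quarter-end mapping (1-3 → 3, 4-6 → 6, 7-9 → 9, 10-12 → 12), which recurs for the monthly, weekly and daily series, can be written as one vectorized expression. A small sketch:

```python
import pandas as pd

# Map each month number to its quarter-end month with integer arithmetic
months = pd.Series([1, 4, 7, 11])
quarter_end = ((months - 1) // 3 + 1) * 3
print(quarter_end.tolist())  # [3, 6, 9, 12]
```

Applied to the 'MONTH' column directly, this removes the triple-nested loops and the try/except around them.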
#Convert weekly data to quarterly data
#Same approach as above
macro_W=macro.ix['W']
macro_W['YEAR']=macro_W['PERIOD_DATE'].map(f1)
macro_W['MONTH']=macro_W['PERIOD_DATE'].map(f2)
macro_W.drop(['PERIOD_DATE'],axis=1,inplace=True)
macro_W['name_cn']=0
macro_W=macro_W.set_index(['indic_id','YEAR','MONTH'])
macro_W=macro_W.sortlevel(0)
macro_W=macro_W.sum(level=('indic_id','YEAR','MONTH'))
macro_W=macro_W.reset_index()
macro_W=macro_W.set_index(['YEAR'])
macro_W=macro_W.drop(list(range(2002,2006,1)))
macro_W=macro_W.reset_index()
macro_W=macro_W.set_index(['indic_id','YEAR','MONTH'],drop=False)
for i in list(macro_W.index.levels[0]):
    for j in year_test:
        try:
            for k in list(set(macro_W.ix[i,j]['MONTH'].values)):
                if k <= 3:
                    macro_W.ix[(i,j,k),'name_cn'] = 3
                elif k <= 6:
                    macro_W.ix[(i,j,k),'name_cn'] = 6
                elif k <= 9 :
                    macro_W.ix[(i,j,k),'name_cn'] = 9
                else:
                    macro_W.ix[(i,j,k),'name_cn'] = 12
        except:
            print(i,j)
macro_W.drop(['indic_id','YEAR','MONTH'],axis=1,inplace=True)
macro_W=macro_W.reset_index()
macro_W=macro_W.set_index(['indic_id','YEAR','name_cn'])
macro_W=macro_W.sum(level=[0,1,2])
macro_W.drop(['MONTH'],axis=1,inplace=True)
macro_W.to_csv('macro_W_final.csv')
#Convert daily data to quarterly data
#Same approach as above
macro_D=macro.ix['D']
macro_D['YEAR']=macro_D['PERIOD_DATE'].map(f1)
macro_D['MONTH']=macro_D['PERIOD_DATE'].map(f2)
macro_D.drop(['PERIOD_DATE'],axis=1,inplace=True)
macro_D=macro_D.set_index('YEAR')
#Drop data from before 2006
macro_D.drop(list(range(1995,2006,1)),inplace=True)
macro_D=macro_D.reset_index()
macro_D=macro_D.set_index(['indic_id','YEAR','MONTH'],drop=False)
macro_D=macro_D.sortlevel(0)
macro_D_na_year=[]
macro_D_na_month=[]
#Check that each daily series is complete from 2006 onward, and whether any month is missing observations
for i in list(macro_D.index.levels[0]):
    try:
        for j in year_test:
            macro_D.ix[i,j]
    except:
        macro_D_na_year.append(i)
    else:
        for j in year_test:
            for k in list(set(macro_D.ix[i,j]['MONTH'])):
                try :
                    len(macro_D.ix[i,j,k]['MONTH'].values)>=20
                    l=len(macro_D.ix[i,j,k]['MONTH'].values)
                except:
                    macro_D_na_month.append([i,j,k,l])
macro_D=macro_D.drop(['indic_id','YEAR','MONTH'],axis=1)
macro_D=macro_D.reset_index()
macro_D=macro_D.set_index('indic_id')
macro_D=macro_D.drop(macro_D_na_year)
macro_D=macro_D.reset_index()
macro_D=macro_D.set_index(['indic_id','YEAR','MONTH'],drop=False)
macro_D=macro_D.sortlevel(0)
for i in list(macro_D.index.levels[0]):
    for j in year_test:
        try:
            for k in list(set(macro_D.ix[i,j]['MONTH'].values)):
                if k <= 3:
                    macro_D.ix[(i,j,k),'name_cn'] = 3
                elif k <= 6:
                    macro_D.ix[(i,j,k),'name_cn'] = 6
                elif k <= 9 :
                    macro_D.ix[(i,j,k),'name_cn'] = 9
                else:
                    macro_D.ix[(i,j,k),'name_cn'] = 12
        except:
            print(i,j)
macro_D.drop(['indic_id','YEAR','MONTH'],axis=1,inplace=True)
macro_D=macro_D.reset_index()
macro_D=macro_D.set_index(['indic_id','YEAR','name_cn'])
macro_D=macro_D.sum(level=[0,1,2])
macro_D.drop(['MONTH'],axis=1,inplace=True)
macro_D.to_csv('macro_D_final.csv')
Build the training set and derive the column names
#Build the training set
train2=DataFrame()
sum_count_lost=0
list_count_lost=[]
sum_count_lost1=0
list_count_lost1=[]
sum_na=0
list_na=[]
year_test=[2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]
ticker_list=list(merge3.index.levels[0])
#Derive the column names
train_columns=[]
merge_columns=[]
for i in [0,'0_3_','1_12_','1_9_','1_6_','1_3_','2_12_','2_9_','2_6_','2_3_','3_12_','3_9_','3_6_','3_3_']:
    if i == 0:
        merge_columns.append('0_6_'+merge3.columns[0])
    else:
        for k in range(len(list(merge3.columns))):
            merge_columns.append(i+list(merge3.columns)[k])
market_columns=[]
for i in ['0_','1_','2_','3_']:
    if i == '0_':
        for j in range(5,0,-1):
            for k in range(len(list(market.columns))):
                market_columns.append(str(i)+str(j)+'_'+list(market.columns)[k])
    else:
        for j in range(12,0,-1):
            for k in range(len(list(market.columns))):
                market_columns.append(str(i)+str(j)+'_'+list(market.columns)[k])
macro_A_columns=[]
for i in list(macro_A.index.levels[0]):
    for j in ['1_','2_','3_']:
        macro_A_columns.append(j+str(i))
macro_M_columns=[]
for i in list(macro_M.index.levels[0]):
    for j in ['0_','1_','2_','3_']:
        if j =='0_':
            for k in [3]:
                macro_M_columns.append(j+str(k)+'_'+str(i))
        else:
            for k in sorted(list(macro_M.index.levels[2]),reverse=True):
                macro_M_columns.append(j+str(k)+'_'+str(i))
macro_W_columns=[]
for j in ['1_','2_','3_']:
    for k in sorted(list(macro_W.index.levels[2]),reverse=True):
        macro_W_columns.append(j+str(k)+'_'+'2160000101')
macro_D_columns=[]
for i in list(macro_D.index.levels[0]):
    for j in ['1_','2_','3_']:
        for k in sorted(list(macro_D.index.levels[2]),reverse=True):
            macro_D_columns.append(j+str(k)+'_'+str(i))
train_columns=train_columns+merge_columns+market_columns+macro_A_columns+macro_M_columns+macro_W_columns+macro_D_columns
test_columns=train_columns[1:]
#Assemble the training set
for i in ticker_list :
    label_unique=[]
    for q in list(set(merge3.ix[i].index.labels[0])):
        label_unique.append(list(merge3.ix[i].index.levels[0])[q])
    year_list=sorted(label_unique,reverse=True)
    year_len=len(year_list)
    for j in year_list:
        if year_len > 3:
            year_len=year_len-1
            if j <2018:
                train=DataFrame()
                try:#guards tickers whose reports were dropped during cleaning
                    #Current-year H1 revenue (the target) and Q1 report, plus the past three years' quarterly reports
                    for l in [j,j-1,j-2,j-3]:
                        if l == j:
                            for k in [6,3]:
                                if k == 6:
                                    train=pd.concat([train,Series(merge3.ix[(i,l,k),:].iloc[0,0],index=[i])],axis=1,ignore_index=True)
                                else:
                                    train=pd.concat([train,DataFrame(merge3.ix[(i,l,k),:].iloc[0]).T.unstack().unstack()],axis=1,ignore_index=True)
                        else:
                            for k in [12,9,6,3]:
                                train=pd.concat([train,DataFrame(merge3.ix[(i,l,k),:].iloc[0]).T.unstack().unstack()],axis=1,ignore_index=True)
                    #Market data: the current year's first five months, plus monthly data for the prior three years
                    for l in [j,j-1,j-2,j-3]:
                        if l == j:
                            for o in list(range(5,0,-1)):
                                train=pd.concat([train,DataFrame(market.ix[(int(i),l,o),:]).T.unstack().unstack()],axis=1,ignore_index=True)
                        else:
                            for o in list(range(12,0,-1)):
                                train=pd.concat([train,DataFrame(market.ix[(int(i),l,o),:]).T.unstack().unstack()],axis=1,ignore_index=True)
                except Exception as e:
                    print(e)
                    sum_na +=1
                    list_na.append([i,l,k])
                    print(i,l,k,sum_na)
                    continue
                #Annual macro data for the past three years
                for m in list(macro_A.index.levels[0]):
                    for l in [j-1,j-2,j-3]:
                        train=pd.concat([train,Series(macro_A.ix[(m,l)][0],index=[i])],axis=1,ignore_index=True)
                #Quarterly data derived from monthly macro data
                for m in list(macro_M.index.levels[0]):
                    for l in [j,j-1,j-2,j-3]:
                        if l == j:
                            for k in [3]:
                                train=pd.concat([train,Series(macro_M.ix[(m,l,k),:][0],index=[i])],axis=1,ignore_index=True)
                        else:
                            for k in sorted(list(macro_M.index.levels[2]),reverse=True):
                                train=pd.concat([train,Series(macro_M.ix[(m,l,k),:][0],index=[i])],axis=1,ignore_index=True)
                #Quarterly data derived from weekly macro data
                for l in [j-1,j-2,j-3]:
                    for k in sorted(list(macro_W.index.levels[2]),reverse=True):
                        train=pd.concat([train,Series(macro_W.ix[(2160000101,l,k),:][0],index=[i])],axis=1,ignore_index=True)
                #Quarterly data derived from daily macro data
                for m in list(macro_D.index.levels[0]):
                    for l in [j-1,j-2,j-3]:
                        for k in sorted(list(macro_D.index.levels[2]),reverse=True):
                            train=pd.concat([train,Series(macro_D.ix[(m,l,k),:][0],index=[i])],axis=1,ignore_index=True)
                train2=pd.concat([train2,train],axis=0,ignore_index=True)
    completed=((ticker_list.index(i)+1)/len(ticker_list))*100
    print('Completed:',completed,'%')
train2.to_csv('train_set.csv',header=False,index=False)
Build the test set
#Build the test set
submit_nes=pd.read_csv('submit_nes.csv')
submit_bank=pd.read_csv('submit_bank.csv')
submit_sec=pd.read_csv('submit_sec.csv')
submit_ins=pd.read_csv('submit_ins.csv')
submit_bank=list(submit_bank['TICKER_SYMBOL2'])
submit_sec=list(submit_sec['TICKER_SYMBOL2'])
submit_ins=list(submit_ins['TICKER_SYMBOL2'])
#Strip the exchange suffix from the submission tickers (Shenzhen .XSHE first, then Shanghai .XSHG)
submit_nes_change=[]
for i in list(submit_nes.values.tolist()):
    if list(submit_nes.values.tolist()).index(i) <872:
        submit_nes_change.append(int((i[0].strip('.XSHE'))))
    else:
        submit_nes_change.append(int((i[0].strip('.XSHG'))))
#Skip banks, securities firms and insurers, which are handled separately
submit_gb_id=[]
no_count=0
for i in submit_nes_change:
    if i in submit_bank or i in submit_sec or i in submit_ins:
        no_count +=1
        continue
    else:
        submit_gb_id.append(i)
list_test_na=[]
sum_test_na=0
test=DataFrame()
for i in submit_gb_id :
    j=2018
    train=DataFrame()
    try:#guards tickers whose reports were dropped during cleaning
        #The current year's Q1 report, plus the past three years' quarterly reports
        for l in [j,j-1,j-2,j-3]:
            if l == j:
                #Only the Q1 report exists for the prediction year (the dead k == 6 branch copied from the training loop is removed)
                for k in [3]:
                    train=pd.concat([train,DataFrame(merge3.ix[(i,l,k),:].iloc[0]).T.unstack().unstack()],axis=1,ignore_index=True)
            else:
                for k in [12,9,6,3]:
                    train=pd.concat([train,DataFrame(merge3.ix[(i,l,k),:].iloc[0]).T.unstack().unstack()],axis=1,ignore_index=True)
        #Market data: the current year's first five months, plus monthly data for the prior three years
        for l in [j,j-1,j-2,j-3]:
            if l == j:
                for o in list(range(5,0,-1)):
                    train=pd.concat([train,DataFrame(market.ix[(int(i),l,o),:]).T.unstack().unstack()],axis=1,ignore_index=True)
            else:
                for o in list(range(12,0,-1)):
                    train=pd.concat([train,DataFrame(market.ix[(int(i),l,o),:]).T.unstack().unstack()],axis=1,ignore_index=True)
    except Exception as e:
        print(e)
        sum_test_na +=1
        list_test_na.append([i,l,k])
        print(i,l,k,sum_test_na)
        continue
    #Annual macro data for the past three years
    for m in list(macro_A.index.levels[0]):
        for l in [j-1,j-2,j-3]:
            train=pd.concat([train,Series(macro_A.ix[(m,l)][0],index=[i])],axis=1,ignore_index=True)
    #Quarterly data derived from monthly macro data
    for m in list(macro_M.index.levels[0]):
        for l in [j,j-1,j-2,j-3]:
            if l == j:
                for k in [3]:
                    train=pd.concat([train,Series(macro_M.ix[(m,l,k),:][0],index=[i])],axis=1,ignore_index=True)
            else:
                for k in sorted(list(macro_M.index.levels[2]),reverse=True):
                    train=pd.concat([train,Series(macro_M.ix[(m,l,k),:][0],index=[i])],axis=1,ignore_index=True)
    #Quarterly data derived from weekly macro data
    for l in [j-1,j-2,j-3]:
        for k in sorted(list(macro_W.index.levels[2]),reverse=True):
            train=pd.concat([train,Series(macro_W.ix[(2160000101,l,k),:][0],index=[i])],axis=1,ignore_index=True)
    #Quarterly data derived from daily macro data
    for m in list(macro_D.index.levels[0]):
        for l in [j-1,j-2,j-3]:
            for k in sorted(list(macro_D.index.levels[2]),reverse=True):
                train=pd.concat([train,Series(macro_D.ix[(m,l,k),:][0],index=[i])],axis=1,ignore_index=True)
    test=pd.concat([test,train],axis=0,ignore_index=True)
    #Progress
    completed=((submit_gb_id.index(i)+1)/len(submit_gb_id))*100
    print('Completed:',completed,'%')
test.to_csv('test_set.csv',header=False,index=False)
1. Build the full training set with its column names.
2. Align it with the test set and drop duplicated columns.
3. Encode the categorical variable as integer codes.
#Assemble the training DataFrame
train_df=pd.read_csv('train_set.csv',header=None)
test_df=pd.read_csv('test_set.csv',header=None)
train_df=DataFrame(np.array(train_df),columns=train_columns)
test_df=DataFrame(np.array(test_df),columns=test_columns)
drop_columns=[]
#Drop the duplicated columns: keep a single TYPE_NAME_EN column, since its value is identical across periods
for i in train_columns:
    if i == '0_5_TYPE_NAME_EN':
        continue
    else:
        if 'TYPE_NAME_EN' in i:
            drop_columns.append(i)
train_df=train_df.drop(drop_columns,axis=1)
test_df=test_df.drop(drop_columns,axis=1)
#Encode the categorical variable as integer codes
#Fit ONE encoder on the union of train and test, so the same industry gets the same code in both
#(the original fit a separate encoder per set, which can map the same category to different integers)
lbl=LabelEncoder()
lbl.fit(list(train_df['0_5_TYPE_NAME_EN'].values)+list(test_df['0_5_TYPE_NAME_EN'].values))
train_df['0_5_TYPE_NAME_EN']=lbl.transform(list(train_df['0_5_TYPE_NAME_EN'].values))
test_df['0_5_TYPE_NAME_EN']=lbl.transform(list(test_df['0_5_TYPE_NAME_EN'].values))
train_df.to_csv('train_final.csv',index=False)
test_df.to_csv('test_final.csv',index=False)
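A toy illustration of why the categorical encoding must use one shared category list across train and test (fitting a separate encoder per split can map the same industry name to different integers). The category names here are made up:

```python
import pandas as pd

# One fixed, shared category list; the same label always gets the same code
cats = sorted(['Bank', 'Retail', 'Tech'])  # union of categories seen in train and test
train_codes = pd.Categorical(['Bank', 'Tech', 'Bank'], categories=cats).codes
test_codes = pd.Categorical(['Tech', 'Retail'], categories=cats).codes
print(list(train_codes), list(test_codes))  # [0, 2, 0] [2, 1]
```

`pd.Categorical` with an explicit `categories` list plays the same role as a single `LabelEncoder` fitted on the concatenation of both sets.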
1. Drop one of every pair of features whose absolute pairwise correlation is at least 0.99 (the correlation is computed among the features, not against the dependent variable).
2. Reduce dimensionality with PCA.
#Drop highly correlated features and reduce dimensionality with PCA
train_df=pd.read_csv('train_final.csv')
test_df=pd.read_csv('test_final.csv')
y_train=train_df.ix[:,0:1]
x_train=train_df.ix[:,1:]
drop_corr_columns=[]
check_corr_columns=[]
thresh_hold=0.99
x_train_corr=x_train.corr().abs()
for i in range(np.shape(x_train.columns)[0]):
for j in range(i+1,np.shape(x_train.columns)[0]):
if x_train_corr.ix[i,j]>=thresh_hold:
if x_train.columns[i] not in drop_corr_columns:
drop_corr_columns.append(list(x_train.columns)[i])
check_corr_columns.append([str(x_train.columns[i])+'+'+str(x_train.columns[j])+'='+str(round(x_train_corr.ix[i,j],2))])
print('已完成:',((i+1)/3482)*100,'%')
print('有%f個多餘特徵' % len(drop_corr_columns))
x_train_afcorr=x_train.drop(drop_corr_columns,axis=1)
test_df_afcorr=test_df.drop(drop_corr_columns,axis=1)
train_list_tempo=x_train_afcorr['0_5_TYPE_NAME_EN']
test_list_tempo=test_df_afcorr['0_5_TYPE_NAME_EN']
x_train_afcorr.drop('0_5_TYPE_NAME_EN',axis=1,inplace=True)
test_df_afcorr.drop('0_5_TYPE_NAME_EN',axis=1,inplace=True)
pca=PCA(n_components=92)
pca.fit(x_train_afcorr)
pca_var_rat=pca.explained_variance_ratio_
pca_var=pca.explained_variance_
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
x_train_new=pca.fit_transform(x_train_afcorr)
x_train_new=pd.concat([DataFrame(x_train_new),train_list_tempo],axis=1)
test_new=pca.fit_transform(test_df_afcorr)
test_new=pd.concat([DataFrame(test_new),test_list_tempo],axis=1)
x_train_new.to_csv('x_train_pca.csv',index=False)
test_new.to_csv('test_pca.csv',index=False)
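The choice of `n_components=92` above can be cross-checked against the cumulative explained variance ratio: fit a full PCA once, then keep the smallest number of components that retains, say, 95% of the variance. A sketch on synthetic data (the 95% threshold is an assumption, not the author's stated criterion):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# synthetic correlated features standing in for x_train_afcorr
X = rng.randn(200, 20) @ rng.randn(20, 20)

pca = PCA().fit(X)                                  # fit all components once
cumvar = np.cumsum(pca.explained_variance_ratio_)   # cumulative variance kept
n_keep = int(np.searchsorted(cumvar, 0.95) + 1)     # smallest k with >= 95%
print('components to keep:', n_keep)
```

`n_keep` can then be passed as `n_components` instead of a hand-picked constant, which makes the choice reproducible when the feature set changes.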
1. Train the model with the XGBoost algorithm
2. Use cross_val_score to find the best number of estimators
3. Use GridSearchCV to tune the tree depth, minimum child weight, per-tree row subsample ratio, per-tree feature subsample ratio, and the regularization parameters
#build the XGBoost model
train_df=pd.read_csv('train_final.csv')
x_train=pd.read_csv('x_train_pca.csv')
test=pd.read_csv('test_pca.csv')
y_train=train_df.iloc[:,0:1]
#search for the best number of estimators
k_estimators=list(range(1,1000,2))
k_score_mean=[]
k_score_std=[]
for i in k_estimators:
    xgb3=XGBRegressor(objective='reg:linear',
                      learning_rate=0.1,
                      max_depth=8,
                      min_child_weight=1,
                      subsample=0.3,
                      colsample_bytree=0.8,
                      colsample_bylevel=0.7,
                      seed=3,
                      eval_metric='rmse',
                      reg_alpha=2,
                      reg_lambda=0.1,
                      n_estimators=i)
    score=cross_val_score(xgb3,x_train.values,y_train.values.ravel(),scoring='neg_mean_squared_error',cv=5,n_jobs=-1)
    print(i)
    print(score.mean())
    print(score.std())
    k_score_mean.append(score.mean())
    k_score_std.append(score.std())
plt.plot(k_estimators,k_score_mean)
plt.xlabel('n_estimators')
plt.ylabel('neg_mean_squared_error')
plt.show()
#tune max_depth and min_child_weight
xgb2=XGBRegressor(objective='reg:linear',
                  learning_rate=0.1,
                  max_depth=6,
                  min_child_weight=1,
                  subsample=0.3,
                  colsample_bytree=0.8,
                  colsample_bylevel=0.7,
                  seed=3,
                  eval_metric='rmse',
                  n_estimators=216)
param_test={'max_depth':list(range(6,10,1)),'min_child_weight':list(range(1,3,1))}
clf=GridSearchCV(estimator=xgb2,param_grid=param_test,cv=5,scoring='neg_mean_squared_error')
clf.fit(x_train.values,y_train.values.ravel())
clf.grid_scores_  #use clf.cv_results_ on modern scikit-learn
clf.best_params_
clf.best_score_
#tune subsample and colsample_bytree
xgb2=XGBRegressor(objective='reg:linear',
                  learning_rate=0.1,
                  max_depth=8,
                  min_child_weight=1,
                  subsample=0.3,
                  colsample_bytree=0.8,
                  colsample_bylevel=0.7,
                  seed=3,
                  eval_metric='rmse',
                  n_estimators=401)
param_test={'subsample':[i/10 for i in range(3,9)],'colsample_bytree':[i/10 for i in range(6,10)]}
clf=GridSearchCV(estimator=xgb2,param_grid=param_test,cv=5,scoring='neg_mean_squared_error')
clf.fit(x_train.values,y_train.values.ravel())
clf.grid_scores_
clf.best_params_
clf.best_score_
#search for better regularization parameters
reg_alpha=[2,2.5,3]      #previously tested [0.1, 1, 1.5, 2]
reg_lambda=[0,0.05,0.1]  #previously tested [0.1, 0.5, 1, 2]
xgb2=XGBRegressor(objective='reg:linear',
                  learning_rate=0.1,
                  max_depth=8,
                  min_child_weight=1,
                  subsample=0.3,
                  colsample_bytree=0.8,
                  colsample_bylevel=0.7,
                  seed=3,
                  eval_metric='rmse',
                  n_estimators=401)
param_test={'reg_alpha':reg_alpha,'reg_lambda':reg_lambda}
clf=GridSearchCV(estimator=xgb2,param_grid=param_test,cv=5,scoring='neg_mean_squared_error')
clf.fit(x_train.values,y_train.values.ravel())
clf.grid_scores_
clf.best_params_
clf.best_score_
#train the final model with the tuned parameters and predict on the test set
xgb2=XGBRegressor(objective='reg:linear',
                  learning_rate=0.1,
                  max_depth=8,
                  min_child_weight=3,
                  subsample=0.3,
                  colsample_bytree=0.8,
                  colsample_bylevel=0.7,
                  seed=3,
                  eval_metric='rmse',
                  reg_alpha=2,
                  reg_lambda=0.1,
                  n_estimators=466)
xgb2.fit(x_train.values,y_train.values.ravel())
pred=xgb2.predict(test.values)