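How Machine Learning Can Be Applied to Quantitative Investing, Part 2
Preface
Research on Deep Learning Techniques in Trading
Deep learning has attracted a lot of attention recently, particularly in image classification and speech recognition. However, it does not yet seem to be widely applied in trading. This survey covers the work related to systematic trading that the author (Greg Harris) has found so far.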
Some terms:
DBN = Deep Belief Network
LSTM = Long Short-Term Memory, a type of recurrent neural network
MLP = Multi-layer Perceptron
RBM = Restricted Boltzmann Machine
ReLU = Rectified Linear Unit, an activation function
CNN = Convolutional Neural Network
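Limit Order Book Models
Sirignano (2016) predicts changes in limit order books. He designs a “spatial neural network” that takes advantage of local spatial structure; it can be used as a classifier and is more computationally efficient than a standard neural network. He models the joint distribution of the best bid and ask prices at the next state change. The model also captures how a change in one of them (bid or ask) affects the other.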
Architecture – Each neural network has 4 layers. The standard neural network has 250 neurons per hidden layer, and the spatial neural network has 50. He uses the tanh activation function on the hidden layer neurons.
Training – He trained and tested on order books from 489 stocks from 2014 to 2015 (a separate model for each stock). He uses Level III limit order book data from the NASDAQ with event times having nanosecond decimal precision. Training involved 50 TB of data and used a cluster with 50 GPUs. He includes 200 features: the price and size of the limit order book across the first 50 non-zero bid and ask levels. He uses dropout to prevent overfitting. He uses batch normalization between each hidden layer to prevent internal covariate shift. Training is done with the RMSProp algorithm. RMSProp is similar to stochastic gradient descent with momentum, but it normalizes the gradient by the root of a running average of past squared gradients. He uses an adaptive learning rate where the learning rate is decreased by a constant factor whenever the training error increases over a training epoch. He uses early stopping imposed via a validation set to reduce overfitting. He also includes an L2 penalty when training in order to reduce overfitting.
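The RMSProp update mentioned in the training notes can be sketched in a few lines of NumPy. This is only an illustration, not the paper's training code: the helper name `rmsprop_step`, the decay rate of 0.9, the learning rate, and the toy quadratic objective are all assumptions made for the example.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update (hypothetical helper; hyperparameters are assumed)."""
    # Running average of squared gradients.
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    # Scale the step by the root of that running average.
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# Toy objective f(w) = sum(w**2), whose gradient is 2*w.
w = np.array([1.0, -2.0])
avg_sq = np.zeros_like(w)
for _ in range(500):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq)
```

Because the gradient is divided by its own recent magnitude, the step size stays close to `lr` regardless of how large or small the raw gradient is.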
Results – He shows that limit order books exhibit some degree of local spatial structure. He predicts the order book 1 second ahead and also at the time of the next bid/ask change. The spatial neural network outperforms the standard neural network and logistic regression with non-linear features. Both neural networks have 10% lower error than logistic regression.
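Price-Based Classification Models
Dixon et al. (2016) use a deep neural network to predict the sign of the price change over the next 5 minutes for 43 commodity and FX futures.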
Architecture – Their input layer has 9,896 neurons for input features made up of lagged price differences and co-movements between contracts. There are 5 learned fully-connected layers. The first of the four hidden layers contains 1,000 neurons, and each subsequent layer tapers by 100 neurons. The output layer has 135 neurons (3 for each class {-1, 0, 1} times 43 contracts).
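To get a feel for the size of this network, the layer widths above can be turned into a rough parameter count. This is a back-of-the-envelope sketch: the widths are taken as stated in the text, and the assumption that every layer has a bias term is mine, not the paper's.

```python
# Layer widths as described: input, four tapering hidden layers, output.
n_inputs = 9896
hidden = [1000 - 100 * i for i in range(4)]  # [1000, 900, 800, 700]
n_outputs = 135
sizes = [n_inputs] + hidden + [n_outputs]

# Each fully-connected layer has fan_in * fan_out weights plus fan_out
# biases (the bias terms are an assumption for this estimate).
n_params = sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))
print(sizes, n_params)
```

Under these assumptions the model has roughly 12 million parameters, most of them in the first layer.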
Training – They used the standard back-propagation with stochastic gradient descent. They speed up training by using mini-batching (computing the gradient on several training examples at once rather than individual examples). Rather than an NVIDIA GPU, they used an Intel Xeon Phi co-processor.
Results – They report 42% accuracy, overall, for three-class classification. They do some walk-forward training instead of a traditional backtest. Their boxplot shows some generally positive Sharpe ratios from the mini-backtests for each contract. They did not include transaction costs or crossing the bid-ask spread. All their predictions and features were based on the mid-price at the end of each 5-minute time period.
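Walk-forward training means repeatedly fitting on one window of history and testing on the period that immediately follows it. A minimal sketch of how such windows could be generated is given here; the helper `walk_forward_splits` and the window sizes are hypothetical, since the text does not give the actual lengths used.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train, test) index ranges for rolling windows (hypothetical helper)."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # slide forward by one test window

# Toy example: 10 observations, train on 4, test on the next 2.
splits = list(walk_forward_splits(n_samples=10, train_size=4, test_size=2))
for train, test in splits:
    print(list(train), list(test))
```

Each test window lies strictly after its training window, so no future data leaks into the fit.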
Takeuchi and Lee (2013) study how momentum effects can be used to predict monthly stock returns.
Architecture – They use an auto-encoder composed of stacked RBMs to extract features from stock prices which they then pass to a feed-forward neural network classifier. Each RBM consists of one layer of visible units and one layer of hidden units connected by symmetric links. The first layer has 33 units for input features from one stock at a time. For every month t, the features include the 12 monthly returns for month t-2 through t-13 and the 20 daily returns approximately corresponding to month t. They normalize each of the return features by calculating the z-score relative to the cross-section of all stocks for each month or day. The number of hidden units in the final layer of the encoder is sharply reduced, forcing dimensionality reduction. The output layer has 2 units, corresponding to whether the stock ended up above or below the median return for the month. Final layer sizes are 33-40-4-50-2.
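The cross-sectional z-score normalization can be illustrated with NumPy. This is a sketch on made-up numbers: the helper name `cross_sectional_zscore` and the use of the population standard deviation (NumPy's default) are assumptions.

```python
import numpy as np

def cross_sectional_zscore(returns):
    """Z-score each column (one period) across rows (stocks)."""
    mean = returns.mean(axis=0, keepdims=True)
    std = returns.std(axis=0, keepdims=True)  # population std (assumed)
    return (returns - mean) / std

# Made-up returns: 4 stocks (rows) by 3 periods (columns).
returns = np.array([
    [0.01, 0.03, -0.02],
    [0.02, 0.01, 0.00],
    [-0.01, 0.02, 0.01],
    [0.04, -0.02, 0.03],
])
z = cross_sectional_zscore(returns)
```

After this step, each feature describes how a stock performed relative to all other stocks in the same period, rather than its raw return.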
Training – During pre-training, they split the dataset into smaller, non-overlapping mini-batches. Afterwards, they un-roll the RBMs to form an encoder-decoder, which is fine-tuned using back-propagation. They consider all stocks trading on the NYSE, AMEX, or NASDAQ with a price greater than $5. They train on data from 1965 to 1989 (848,000 stock-month samples) and test on data from 1990 to 2009 (924,300 stock-month samples). Some training data is held out as a validation set for choosing the number of layers and the number of units per layer.
Results – Their overall accuracy is around 53%. When they consider the difference between the top decile and the bottom decile predictions, they get 3.35% per month, or 45.93% annualized return.
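The top-minus-bottom decile comparison can be sketched as follows. The helper `decile_spread` and the toy inputs are illustrative assumptions; the paper's exact portfolio construction may differ.

```python
import numpy as np

def decile_spread(scores, realized_returns):
    """Mean return of the top-scored decile minus that of the bottom decile."""
    order = np.argsort(scores)   # ascending by predicted score
    n = len(scores) // 10        # number of stocks per decile
    bottom = realized_returns[order[:n]].mean()
    top = realized_returns[order[-n:]].mean()
    return top - bottom

# Toy data: 20 stocks whose realized returns happen to rise with the score.
scores = np.arange(20, dtype=float)
realized = np.arange(20, dtype=float) / 100.0
spread = decile_spread(scores, realized)
```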
Batres-Estrada (2015) predicts which S&P 500 stocks will have above-median returns on a given trading day. His work is influenced by Takeuchi and Lee (2013).
Architecture – He uses a 3-layer DBN coupled to an MLP. He uses 400 neurons in each hidden layer, and he uses a sigmoid activation function. The output layer is a softmax layer with two output neurons for binary classification (above median or below). The DBN is composed of stacked RBMs, each trained sequentially.
Training – He first pre-trains the DBN module, then fine-tunes the entire DBN-MLP using back-propagation. The input includes 33 features: monthly log-returns for months t-2 to t-13, 20 daily log-returns for each stock at month t, and an indicator variable for the January effect. The features are normalized using the Z-score for each time period. He uses S&P 500 constituent data from 1985 to 2006 with a 70-15-15 split for training-validation-test. He uses the validation data to choose the number of layers, the number of neurons, and the regularization parameters. He uses early-stopping to prevent over-fitting.
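A 70-15-15 split could look like the sketch below. The text does not say whether the split is chronological; keeping the samples in time order is an assumption here, as is the helper name `chronological_split`.

```python
def chronological_split(n_samples, train_pct=70, val_pct=15):
    """Return (train, validation, test) index ranges in time order (assumed)."""
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_samples)  # remainder goes to the test set
    return train, val, test

train, val, test = chronological_split(1000)
print(len(train), len(val), len(test))
```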
Results – His model has 53% accuracy, which outperforms regularized logistic regression and a few MLP baselines.
Sharang and Rao (2015) use a DBN trained on technical indicators to classify the direction of a portfolio.
Architecture – They use a DBN consisting of 2 stacked RBMs. The first RBM is Gaussian-Bernoulli (15 nodes), and the second RBM is Bernoulli (20 nodes). The DBN produces latent features which they try feeding into three different classifiers: regularized logistic regression, support vector machines, and a neural network with 2 hidden layers. They predict 1 if the portfolio goes up over 5 days, and -1 otherwise.
Training – They train the DBN using a contrastive divergence algorithm. They calculate signals based on open, high, low, close, open interest, and volume data, beginning in 1985, with some points removed during the 2008 financial crisis. They use 20 features: the “daily trend” calculated over different time frames, and then normalized. All parameters are chosen using a validation dataset. When training the neural net classifier, they mention using a momentum parameter during mini-batch gradient descent training to shrink the coefficients by half during every update.
Results – The portfolio is constructed using PCA to be neutral to the first principal component. The portfolio is an artificial spread of instruments, so actually trading it is done with a spread between the ZF and ZN contracts. All input prices are mid-prices, meaning the bid-ask spread is ignored. The results look profitable, with all three classification models performing 5-10% more accurately than a random predictor.
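Zhu et al. (2016) make trading decisions using oscillation box theory with DBNs. Oscillation box theory says that a stock price oscillates within a certain range (the box); if the price moves out of that range, it enters a new box. Their trading strategy buys when the price breaks through the top of the box and sells when it falls through the bottom.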
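The breakout rule can be written down as a tiny signal function. This is a simplified sketch with a single fixed box; the helper `box_signals` and the box bounds are hypothetical, and in the paper the box boundaries come from the model rather than being fixed in advance.

```python
def box_signals(prices, box_low, box_high):
    """Return +1 (buy) on a break above the box, -1 (sell) on a break below, else 0."""
    signals = []
    for price in prices:
        if price > box_high:
            signals.append(1)    # breakout through the top of the box
        elif price < box_low:
            signals.append(-1)   # breakdown through the bottom of the box
        else:
            signals.append(0)    # still oscillating inside the box
    return signals

signals = box_signals([10.2, 10.8, 11.3, 9.7], box_low=10.0, box_high=11.0)
print(signals)
```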
Architecture – They use a DBN made up of stacked RBMs and a final back-propagation layer.
Training – They used block Gibbs sampling to greedily train each layer from lowest to highest in an unsupervised way. They then train the back-propagation layer in a supervised way, which fine-tunes the whole model. They chose 400 stocks out of the S&P 500 for testing, and the test set covers 400 days from 2004 to 2005. They use open, high, low, close prices as well as technical analysis indicators, for a total of 14 model inputs. Some indicators are given more influence in the prediction through the use of “gray relation analysis” or “gray correlation degree.”
Results – In their trading strategy, they charge 0.5% transaction costs per trade and add a couple of parameters for stop-loss and “transaction rate.” I don’t fully understand the result tables, but they seem to be reporting significant profits.
Volatility Prediction
Xiong et al. (2015) predict the daily volatility of the S&P 500, where volatility is estimated from open, high, low, and close prices.
Architecture – They use a single LSTM hidden layer consisting of one LSTM block. For inputs they use daily S&P 500 returns and volatilities. They also include 25 domestic Google trends, covering sectors and major areas of the economy.
Training – They used the “Adam” method with 32 samples per batch and mean absolute percent error (MAPE) as the objective loss function. They set the maximum lag of the LSTM to include 10 successive observations.
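For reference, MAPE can be computed as in the sketch below. The helper name `mape` and the toy volatility values are illustrative; the function assumes no true value is zero.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percent error, in percent (assumes no zero targets)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Toy example: each prediction is off by 10% of the true value.
loss = mape([0.10, 0.20], [0.11, 0.18])
```

Because the error is scaled by the true value, MAPE penalizes a miss on a low-volatility day more heavily than the same absolute miss on a high-volatility day.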
Results – They show their LSTM method outperforms GARCH, Ridge, and LASSO techniques.
Text-Based Classification Models
Rönnqvist and Sarlin (2016) use news articles to predict bank distress. Specifically, they build a classifier that judges whether a given sentence indicates distress or tranquility.
Architecture – They use two neural networks in this paper. The first is for semantic pre-training to reduce dimensionality. For this, they run a sliding window over text, taking a sequence of 5 words and learning to predict the next word. They use a feed-forward topology where a projection layer in the middle provides the semantic vectors once the connection weights have been learned. They also include the sentence ID as an input to the model, to provide context and inform the prediction of the next word. They use binary Huffman coding to map sentence IDs and words to activation patterns in the input layer, which organizes the words roughly by frequency. They say feed-forward topologies with fixed context sizes are more efficient than recurrent neural networks for modeling text sequences. The second neural network is for classification. Instead of a million inputs (one for each word), they use 600 inputs from the learned semantic model. The first layer has 600 nodes, the middle layer has 50 rectified linear hidden nodes, and the output layer has 2 nodes (distress/tranquil).
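The sliding-window setup for the first network can be sketched as a function that turns a token list into (context, next word) training pairs. The helper `context_target_pairs` and the example sentence are made up for illustration; the sentence-ID input and the Huffman coding are left out.

```python
def context_target_pairs(tokens, window=5):
    """Pair every run of `window` consecutive words with the word that follows it."""
    return [
        (tokens[i:i + window], tokens[i + window])
        for i in range(len(tokens) - window)
    ]

tokens = "the bank said it would raise new capital".split()
pairs = context_target_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```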
Training – They train it with 243 distress events over 101 banks observed during the financial crisis of 2007-2009. They use 716k sentences mentioning the banks, taken from 6.6m Reuters news articles published during and after the crisis.
Results – They evaluate their classification model using a custom “Usefulness” measure. The evaluation is done using cross-validation, leaving N banks out in each fold. They aggregate the distress counts into various time series but don’t go so far as to consider creating a trading strategy.
Fehrer and Feuerriegel (2015) train a model to predict German stock returns based on news headlines.
Architecture – They use a recursive autoencoder with an additional softmax layer in each autoencoder for estimating probabilities. They perform three-class prediction {-1, 0, 1} for the following day’s return of the stock associated with the headline.
Training – They initialize the weights with Gaussian noise, and then update through back-propagation. They use an English ad-hoc news announcement dataset (8,359 headlines) for the German market covering 2004 to 2011.
Results – Their recursive autoencoder has 56% accuracy, which is an improvement over a more traditional random forest modeling approach with 53% accuracy. They do not develop a trading strategy. They have made a Java implementation of their code publicly available.
Ding et al. (2015) use structured information extracted from news headlines to predict movements of the S&P 500 index. They process the headlines with Open IE (Open Information Extraction, not “open Internet Explorer”) to obtain structured event representations (actor, action, object, time). Unlike ordinary networks, they use a neural tensor network to learn the semantic composition of the event arguments.
Architecture – They combine short-term and long-term effects of events, using a CNN to perform semantic composition over the input event sequence. They use a max pooling layer on top of the convolutional layer, which makes the network retain only the most useful features produced by the convolutional layer. They have separate convolutional layers for long-term events and mid-term events. Both of these layers, along with an input layer for short-term events, feed into a hidden layer which then feeds into two output nodes.
Training – They extracted 10 million events from Reuters and Bloomberg news. For training, they corrupt events by replacing one event argument with a random argument. During training, they assume that the actual event should be given a higher score than the corrupted event. When it isn’t, model parameters get updated.
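The corrupt-and-compare training signal can be sketched as a margin-based ranking loss. This is an illustration under stated assumptions: the helpers `corrupt_event` and `margin_loss`, the three-argument event tuple, the toy vocabulary, and the margin of 1.0 are all hypothetical, and the scores would come from the model rather than being typed in by hand.

```python
import random

def corrupt_event(event, vocabulary, rng):
    """Replace one argument of an (actor, action, object) tuple with a random word."""
    position = rng.randrange(len(event))
    # Exclude the original word so the corrupted event really differs.
    candidates = [word for word in vocabulary if word != event[position]]
    corrupted = list(event)
    corrupted[position] = rng.choice(candidates)
    return tuple(corrupted)

def margin_loss(score_true, score_corrupt, margin=1.0):
    """Zero when the true event outscores the corrupted one by at least `margin`."""
    return max(0.0, margin - score_true + score_corrupt)

rng = random.Random(0)
event = ("Apple", "sues", "Samsung")
vocabulary = ["Apple", "Samsung", "Google", "buys", "sues", "Nokia"]
corrupted = corrupt_event(event, vocabulary, rng)

no_update = margin_loss(score_true=2.5, score_corrupt=0.5)  # true event wins clearly
update = margin_loss(score_true=0.2, score_corrupt=0.6)     # corrupted event scored higher
```

A positive loss is what triggers a parameter update; a loss of zero leaves the parameters unchanged.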
Results – They find that structured events are better features than words for stock market prediction. Their approach outperforms baseline methods by 6%. They make predictions for the S&P 500 index and 15 individual stocks, and a table appears to show that they can predict the S&P 500 with 65% accuracy.
Portfolio Models
Heaton et al. (2016) try to build a portfolio that outperforms the biotech index IBB. They aim to track the index using a small number of stocks, and they also try to beat the index during periods of large drawdowns. Rather than modeling the covariance matrix directly, they fit a model that allows for nonlinear structure.
Architecture – They use auto-encoding with regularization and ReLUs. Their auto-encoder has one hidden layer with 5 neurons.
Training – They use weekly return data for the component stocks of IBB from 2012 to 2016. They auto-encode all stocks in the index and evaluate the difference between each stock and its auto-encoded version. They keep the 10 most “communal” stocks that are most similar to the auto-encoded version. They also keep a varying number of other stocks, where the number is chosen with cross-validation.
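The idea of ranking stocks by how closely they match their auto-encoded version can be sketched with NumPy. Note the substitution: instead of the regularized ReLU auto-encoder described above, this sketch uses a rank-5 SVD reconstruction, which is only a linear stand-in for a 5-unit bottleneck. The data is synthetic, and the numbers of weeks and stocks are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(size=(200, 30))  # synthetic: 200 weeks x 30 stocks

# Linear stand-in for a 5-unit auto-encoder: rank-5 reconstruction via SVD.
centered = returns - returns.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
k = 5
reconstructed = (u[:, :k] * s[:k]) @ vt[:k]

# Per-stock reconstruction error; a smaller error means a more "communal" stock.
errors = np.linalg.norm(centered - reconstructed, axis=0)
communal = np.argsort(errors)[:10]  # indices of the 10 most communal stocks
```

Stocks with a small reconstruction error are largely explained by the shared factors, which is why they are natural candidates for a tracking portfolio.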
Results – They show the tracking error as a function of the number of stocks included in the portfolio, but don’t seem to compare against traditional methods. They also replace index drawdowns with positive returns and find portfolios that track this modified index.