
Tencent AI Lab AAAI 2018 Oral Paper: Training L1-Regularized Models with Orthant-Wise Passive Descent Algorithms

Tencent AI · Artificial Intelligence

Preface: Tencent AI Lab had 12 papers accepted at AAAI 2018, a top international academic conference in artificial intelligence held in New Orleans, USA. The Tencent Technology Engineering official account exclusively translated the paper "Training L1-Regularized Models with Orthant-Wise Passive Descent Algorithms", which was accepted by AAAI 2018 as an oral presentation. The work was completed independently by Tencent AI Lab, and the author is Jianqiao Wangni.



Summary

L1-regularized models are a widely used tool for analyzing high-dimensional data. For such models on modern large-scale internet data, studying their optimization algorithms can improve convergence speed, and thereby significantly improve model accuracy within a limited time budget or reduce the dependence on server resources. Classical stochastic gradient descent (SGD), although applicable to many models such as neural networks, cannot handle the non-differentiability of the L1 norm.


In this paper, we propose a new stochastic optimization method, the Orthant-Wise Passive Descent Algorithm (OPDA). The starting point of the algorithm is that the L1 norm is continuously differentiable within any single orthant; therefore, after each update, the model parameters are projected back onto the orthant they occupied at the previous iteration. We use the stochastic variance-reduced gradient (SVRG) method to generate the gradient direction, combined with a quasi-Newton method to exploit second-order information of the loss function. We also propose a new orthant projection scheme that makes the algorithm converge to a sparser optimum. Theoretically, we prove that the algorithm converges linearly for strongly convex and smooth loss functions. On typical sparse datasets such as RCV1, we evaluate the algorithm on L1/L2-regularized logistic regression under various parameter settings, where it significantly outperforms the existing linearly convergent algorithm Proximal-SVRG; it also outperforms Proximal-SGD and related algorithms in convolutional neural network (CNN) experiments, demonstrating that the algorithm performs well on both convex and nonconvex objectives.

Full Text of the English Presentation

Hello, everyone, I am Jianqiao Wangni, from Tencent AI Lab.


1. Introduction: Learning sparse representations has been a very important task in data analysis. For example, genetic analysis in biology usually involves millions of genes for a single individual. In financial time-series prediction and online advertising, there are also many cases where the number of samples is even smaller than the data dimension, which is an ill-conditioned problem without a sparsity prior. So, for conventional models such as logistic regression and linear regression, we add an L1-norm regularization, the sum of the absolute values of the parameters, to build robust applications on high-dimensional data; it is very powerful for learning sparse representations. To give an intuitive example: the blue areas are the constraint regions, the L1-norm ball on the left and the L2-norm ball on the right, while the red circles are the contours of the average of squared loss functions. The intersection point between a ball and the contours is the solution of the corresponding regularized model. We can see that the solution of the L1-regularized model lies near the y-axis, which means that its x-dimension element is pushed closer to zero.


2. Formal Definition: Now we go to the analytical part. We study a regularized function P(x), which equals F(x) + R(x), where F(x) is the average of N loss functions, each depending on one data sample, and R(x) is the L1 regularization. We also assume that each loss function is twice differentiable, strongly convex, and smooth, which are standard assumptions in convex optimization. The L1 norm is not differentiable. One of the most representative optimization methods is the proximal method, which iteratively takes a gradient descent step and then solves a proximal problem at the current point.
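
For concreteness, the setup described here can be written out as follows; the notation (η for the step size, λ for the regularization weight) is my own choice rather than copied from the slide, and the proximal update is the standard soft-thresholding form:

```latex
% L1-regularized objective and the proximal-gradient update it motivates.
\[
P(x) = F(x) + R(x), \qquad
F(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x), \qquad
R(x) = \lambda \lVert x \rVert_1 .
\]
\[
x_{k+1} = \operatorname{prox}_{\eta R}\!\bigl(x_k - \eta \nabla F(x_k)\bigr),
\qquad
\bigl[\operatorname{prox}_{\eta R}(z)\bigr]_i
   = \operatorname{sign}(z_i)\,\max\!\bigl(\lvert z_i\rvert - \eta\lambda,\ 0\bigr).
\]
```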


3. Reference 1: Our primary reference is the orthant-wise limited-memory quasi-Newton method, OWL-QN, which is based on L-BFGS, a representative quasi-Newton method, and overcomes the obstacle that the L1 norm is not differentiable. The method restricts the updated parameter to stay within a certain orthant, because within any single orthant the absolute value function is actually differentiable. A key component of OWL-QN is the subgradient at zero points. The subgradient of the L1 regularization R(x) can be either positive lambda or negative lambda. Take the third branch as an example: we study a single dimension, the i-th dimension, of the current point, X_i, and of the gradient, V_i. If X_i equals zero and V_i plus lambda is negative, then the subgradient is set to V_i plus lambda, since after subtracting this subgradient X_i becomes a positive value, and the subgradient of R(x) will then still be positive lambda; this makes the subgradient of the L1 norm consistent across one iteration.
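
A minimal NumPy sketch of the orthant-wise pseudo-gradient described above; the function and argument names are my own, and `grad_f` stands for the gradient of the smooth loss at the current point:

```python
import numpy as np

def owlqn_pseudo_gradient(x, grad_f, lam):
    """Orthant-wise pseudo-gradient of F(x) + lam * ||x||_1, as in OWL-QN."""
    pg = np.zeros_like(x)
    # Where x_i != 0, the L1 term is differentiable: add lam * sign(x_i).
    nonzero = x != 0
    pg[nonzero] = grad_f[nonzero] + lam * np.sign(x[nonzero])
    # At x_i == 0, pick the one-sided derivative that still allows descent,
    # so the sign of the L1 subgradient stays consistent after the step.
    zero = ~nonzero
    right = grad_f + lam   # derivative if x_i moves into the positive orthant
    left = grad_f - lam    # derivative if x_i moves into the negative orthant
    pg[zero & (right < 0)] = right[zero & (right < 0)]
    pg[zero & (left > 0)] = left[zero & (left > 0)]
    # Otherwise pg_i stays 0: no descent is possible along that coordinate.
    return pg
```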


4. Reference 2: To process massive internet data, many optimization methods have been proposed to speed up training. The stochastic gradient descent method, SGD, is a popular choice for optimization. However, SGD generally needs a decreasing step size to converge, so it only has a sublinear convergence rate. Recently, stochastic variance-reduction methods such as SVRG and SAGA have been shown to converge without decreasing step sizes and to achieve linear convergence rates on smooth and strongly convex models. In SGD, the descent direction at the k-th step, v_k, is evaluated on a stochastic subset S_k of the dataset. In SVRG, we periodically calculate a full gradient that depends on all the data, at a reference point, say tilde-X. This full gradient forms the third term of the SVRG direction v_k; then, to keep the estimate unbiased in expectation, we subtract a stochastic gradient at tilde-X computed on the same subset S_k.
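
A small sketch of the variance-reduced direction described here; the names are mine, and `grad_on_batch` is a hypothetical function returning the mini-batch gradient at a given point with the batch S_k held fixed:

```python
def svrg_direction(x_k, x_tilde, grad_on_batch, full_grad_tilde):
    """SVRG estimate of the gradient of F at x_k.

    grad_on_batch(x): gradient of the loss on the sampled subset S_k at x
    x_tilde:          reference point where the full gradient was computed
    full_grad_tilde:  full gradient of F over all data at x_tilde
    """
    # The same mini-batch S_k is used at x_k and at x_tilde, so the
    # estimate is unbiased: E[v_k] = grad F(x_k), with reduced variance.
    return grad_on_batch(x_k) - grad_on_batch(x_tilde) + full_grad_tilde
```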


5. The First Step: Now we proceed to our method. Although efficient, OWL-QN can be further improved, for example by dropping the line-search procedure, or by using a stochastic gradient instead of the accurate but costly full gradient. Inspired by SVRG and OWL-QN, we develop a stochastic optimization method for L1-regularized models. This is more complicated than the smooth case. In the first step, we calculate the SVRG estimate of the gradient of the loss function F(x); then we use an idea from OWL-QN to push the descent direction toward maintaining the same orthant after one iteration, and this orthant will be used as the reference orthant. Our actual descent direction V_k is calculated as in the third equation here.
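
A sketch of how this first step might compose the two helpers defined above (`svrg_direction` and `owlqn_pseudo_gradient` are the hypothetical functions from the previous slides); the paper's exact equation for V_k may differ in details:

```python
def opda_first_step(x_k, x_tilde, grad_on_batch, full_grad_tilde, lam):
    # 1. Variance-reduced estimate of the gradient of F at x_k.
    svrg_grad = svrg_direction(x_k, x_tilde, grad_on_batch, full_grad_tilde)
    # 2. Orthant-wise adjustment in the style of OWL-QN: choose the L1
    #    subgradient whose sign remains valid after one descent step;
    #    this also fixes the reference orthant used later.
    v_k = owlqn_pseudo_gradient(x_k, svrg_grad, lam)
    return v_k
```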


6. The Second-Order Information: At this point, we have an optional choice for the descent direction: we can use the second-order information of the loss function by calculating the Hessian matrix at the current point, or by estimating an approximate Hessian matrix, as in quasi-Newton methods. The descent direction D_k can then be obtained by minimizing the following quadratic expansion around the current point. If we use L-BFGS, we do not actually need to invert the matrix as in the equation; this can be done through efficient matrix-vector multiplications. We may also directly assign V_k to D_k, as in a typical first-order method. After this step, the orthant of the direction D_k has to be consistent with V_k, which means that if some dimensions of D_k have different signs from V_k, they have to be aligned to zero; we denote the aligned version by P_k.
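
A sketch of the optional second-order step and the sign alignment between D_k and V_k; the names are mine, and the inverse-Hessian approximation is passed in explicitly here, whereas an L-BFGS implementation would compute the product with the two-loop recursion instead of forming a matrix:

```python
import numpy as np

def second_order_direction(v_k, inv_hessian_approx=None):
    """Scale V_k with (approximate) second-order information, or fall back
    to the first-order direction when no approximation is available."""
    if inv_hessian_approx is None:
        return v_k
    return inv_hessian_approx @ v_k

def align_with_reference(d_k, v_k):
    """P_k: zero out coordinates of D_k whose sign disagrees with V_k."""
    return np.where(np.sign(d_k) == np.sign(v_k), d_k, 0.0)
```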


7. Final Updates: The aforementioned calculation does not explicitly involve the partial derivative of R(x), the L1 regularization, except as the alignment reference, since we want to avoid adding extra variance to the stochastic gradient. To make the solution sparse, we introduce a novel alignment operator that encourages zero elements: if X and Y have different signs, or the absolute value of X is less than a threshold, X is forced to be zero. With this alignment operator, each time we complete the previous calculation we examine whether the next point lies in the same orthant as the current point; if not, some dimensions of the next point are set to zero. After this step, clearly more dimensions of X_k become exactly zero instead of remaining small but nonzero values.
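
A sketch of the alignment operator following the slide's literal description; the function names, the step size `eta`, and the exact choice of reference point and threshold are placeholders here and follow the paper in the real algorithm:

```python
import numpy as np

def alignment_operator(x, y, threshold):
    """A coordinate of x is forced to zero if it disagrees in sign with the
    reference y, or if its magnitude falls below the threshold."""
    keep = (np.sign(x) == np.sign(y)) & (np.abs(x) >= threshold)
    return np.where(keep, x, 0.0)

# Example of how it would be applied after the descent step along P_k
# (x_k is used as the reference point purely for this sketch):
# x_next = alignment_operator(x_k - eta * p_k, x_k, threshold)
```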


8. Convergence Analysis: In the paper, we prove that under the assumptions of smoothness and strong convexity, our method converges at a linear rate. Due to the lack of time, we will not go into the details of this heavy mathematical analysis.


9. Synthetic Experiments: To visualize a comparison, we plot the optimization trajectories on a simple two-dimensional synthetic function in this figure. Our method, OPDA, is shown by the red line, and proximal gradient descent, the baseline, is shown by the blue line. After the same number of iterations, we see that our method converges to the minimum faster.


10. Experiments on Convex Cases: We also run experiments on logistic regression with both L2 and L1 regularization for binary classification. In this part, we compare our method with the proximal-SVRG method, which is also linearly convergent. In the figure, the Y-axis denotes the suboptimality, i.e., how far the current objective value is from the minimum, and the X-axis represents the number of data passes. We observe a consistent result: OPDA runs faster than proximal-SVRG across different step sizes, L2 regularization weights, and L1 regularization weights.


11. Experiments on Deep Learning: We also conducted experiments with L1-regularized convolutional neural networks, or sparse CNNs, to demonstrate the efficiency in the nonconvex case. This application is also useful for reducing the parameter size of neural networks. The red line represents our method, and the blue line is proximal-SVRG. We test different scales of L1 regularization. We see that OPDA converges faster than proximal-SVRG, and the gap is larger when the L1 regularization is stronger. Actually, due to the orthant-wise nature of our method, many dimensions of the descent direction and of the updated points are forced to zero during the alignment, which makes the effective step much smaller; nevertheless, our method still converges much faster than the proximal methods with the same step size. This shows that our proposed alignment operator does calibrate the direction for the better, making the overall framework more efficient in terms of iterations, with only negligible extra arithmetic operations.
