Why Gradient Descent and the Normal Equation Are Bad for Linear Regression

Introduction

Most ML courses start with linear regression and gradient descent and/or the normal equation for this problem. Probably the most well-known course, Andrew Ng's, also introduces linear regression as a very basic machine learning algorithm and shows how to solve it using the gradient descent and normal equation methods. Unfortunately, these are usually quite terrible ways to do it. In fact, if you have ever used LinearRegression from Scikit-learn, you have used alternative methods!

Problem outline

In linear regression, we have to estimate the parameters theta, the coefficients of the linear combination of terms used for the regression (where x_0 = 1 and theta_0 is a free term/bias):

$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$

We do it by minimizing the residual sum of squares (RSS), i.e. the average of the squared differences between the output of our model and the true values:

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
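
As a concrete reference, here is a minimal NumPy sketch of this setup; the names (a design matrix X with a leading column of ones, targets y, parameters theta) are illustrative toy data, not taken from any particular library.

    import numpy as np

    def cost(theta, X, y):
        """Squared-error cost J(theta) = 1/(2n) * sum((X @ theta - y)**2)."""
        n = len(y)
        residuals = X @ theta - y
        return (residuals @ residuals) / (2 * n)

    # toy data: 5 samples, 1 feature, plus a bias column x_0 = 1
    X = np.c_[np.ones(5), np.arange(5.0)]
    y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])
    print(cost(np.zeros(2), X, y))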

Gradient descent method

Gradient descent, a very general method for function optimization, iteratively approaches the local minimum of the function. Since the loss function for linear regression is quadratic, it is also convex, i.e. there is a unique local and global minimum. We approach it by taking steps based on the negative gradient and a chosen learning rate alpha.

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$

$\theta_j := \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}$
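
A minimal gradient descent sketch for this loss, continuing the toy NumPy setup from the problem outline above (illustrative code, not any library's implementation):

    def gradient_descent(X, y, alpha=0.01, n_iters=1000):
        """Take repeated steps along the negative gradient of J(theta)."""
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(n_iters):
            grad = X.T @ (X @ theta - y) / n   # gradient of the squared-error cost
            theta -= alpha * grad
        return theta

    theta_gd = gradient_descent(X, y, alpha=0.1, n_iters=5000)
    print(theta_gd, cost(theta_gd, X, y))   # only approaches the exact minimizer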

Why is this approach bad in most cases? The main reasons are:

  1. It’s slow — iteratively approaching the minimum takes quite a bit of time, especially computing the gradient. While there are methods of speeding this up (stochastic gradient descent, parallel computing, using other gradient methods), this is an inherently slow algorithm for general convex optimization.

  2. It does not arrive exactly at the minimum — with gradient descent, you are guaranteed to never reach the exact minimum, be it a local or the global one. That's because you are only as precise as the gradient and learning rate alpha allow. This may be quite a problem if you want a really accurate solution.

  3. It introduces a new hyperparameter alpha — you have to tune the learning rate alpha, which is a tradeoff between speed (approaching the minimum faster) and accuracy (arriving closer to the minimum). While you can use an adaptive learning rate, it's more complicated and still introduces new hyperparameters. The short example below illustrates the tradeoff.
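
A quick illustration of that tradeoff, reusing the toy data and the gradient_descent and cost sketches from above (the specific alpha values are arbitrary, chosen just to show the three regimes):

    for alpha in (0.001, 0.1, 0.5):
        theta_try = gradient_descent(X, y, alpha=alpha, n_iters=200)
        print(alpha, cost(theta_try, X, y))
    # 0.001: still far from the minimum after 200 steps (too slow)
    # 0.1:   close to the minimum
    # 0.5:   steps too large for this data; the iterates blow up (diverges)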

So why do we even bother with gradient descent for linear regression? There are two main reasons:

  1. Educational purpose — since linear regression is so simple, it’s easy to introduce the concept of gradient descent with this algorithm. While it’s not good for this particular purpose in practice, it’s very important for neural networks. That’s most probably why Andrew Ng chose this way in his course, and everyone else blindly followed, without explicitly stating that you should not do this in practice.

  2. Extremely big data — if you have huge amounts of data and have to use parallel and/or distributed computing, the gradient approach is very easy to apply. You just partition the data into chunks, send them to different machines, and compute the gradient elements on many cores/machines, as sketched below. Most often, though, you don't have such needs or computational capabilities.
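
A sketch of that idea in the same toy NumPy setting, splitting rows into chunks and summing per-chunk gradient contributions (in a real system each chunk would live on a different core or machine):

    def chunked_gradient(X, y, theta, n_chunks=4):
        """Accumulate the gradient chunk by chunk, map/reduce style."""
        n = len(y)
        grad = np.zeros_like(theta)
        for X_part, y_part in zip(np.array_split(X, n_chunks),
                                  np.array_split(y, n_chunks)):
            grad += X_part.T @ (X_part @ theta - y_part)   # per-chunk contribution
        return grad / n

    print(chunked_gradient(X, y, np.zeros(2)))   # equals the full-data gradient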

Normal equation method

The quadratic cost function was originally chosen for linear regression because of its nice mathematical properties. It's easy to use and we are able to get a closed-form solution, i.e. a mathematical formula for the theta parameters: the normal equation. In the derivation below, we drop the 1/2n factor, since it vanishes anyway.

$\nabla_\theta J(\theta) = \nabla_\theta \left( \tfrac{1}{2} (X\theta - y)^T (X\theta - y) \right) = X^T (X\theta - y)$

$X^T (X\theta - y) = 0$

We arrive at a system of linear equations and finally at the normal equation:

$X^T X \, \theta = X^T y \quad \Rightarrow \quad \theta = (X^T X)^{-1} X^T y$
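
In NumPy the normal equation is a one-liner on the toy data from earlier (using np.linalg.solve on X^T X rather than forming the inverse explicitly, which is already the gentler variant):

    theta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) theta = X^T y
    print(theta_ne)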

And why is this approach bad too? The main reasons are:

  1. It's slow — having a short, nice equation does not mean that computing it is fast. Matrix multiplication is O(n³) and inversion is also O(n³). This is actually slower than gradient descent for even modest-sized datasets.

  2. It's numerically unstable — the matrix multiplication X^T * X squares the condition number of the matrix, and later we additionally have to multiply the result by X^T. This can make the results extremely unstable, and it is the main reason why this method is almost never used outside of pen-and-paper linear algebra or statistics courses. Not even solving with a Cholesky decomposition saves it.
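
A quick way to see the conditioning problem: with two nearly collinear feature columns, the condition number of X^T X is the square of that of X (a small synthetic example in NumPy; the column values are arbitrary):

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    X_bad = np.c_[np.ones(100), x1, x1 + 1e-6 * rng.normal(size=100)]
    print(np.linalg.cond(X_bad))            # already large
    print(np.linalg.cond(X_bad.T @ X_bad))  # its square, far larger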

This method should never be used in practice in machine learning. It is nice for mathematical analysis, but that’s it. However, it has become the basis for methods that are actually used by Scikit-learn and other libraries.

So what does everyone use?

Now that we've seen the downsides of the approaches shown in ML courses, let's see what is used in practice. In Scikit-learn's LinearRegression we can see:

[Scikit-learn LinearRegression source: the fit is delegated to scipy.linalg.lstsq]
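
As a quick sanity check, LinearRegression recovers the same coefficients as the toy computations above (it fits the intercept itself, so we pass only the feature column, not the bias column):

    from sklearn.linear_model import LinearRegression

    lr = LinearRegression()          # fit_intercept=True by default
    lr.fit(X[:, 1:], y)              # features only, without the column of ones
    print(lr.intercept_, lr.coef_)   # matches theta_0 and theta_1 from the normal equation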

So Scikit-learn does not bother with its own implementation; instead, it just uses Scipy. In scipy.linalg.lstsq we can see that even this library does not use its own implementation, instead calling LAPACK:

[scipy.linalg.lstsq source: the computation is dispatched to LAPACK least squares drivers]

Finally, we arrive at the gelsd, gelsy and gelss entries in the Intel LAPACK documentation:

[Intel LAPACK documentation entries for gelsd, gelsy and gelss]

2 out of those 3 methods use Singular Value Decomposition (SVD), a very important algorithm in both numerical methods and machine learning. You may have heard about it in the context of NLP or recommender systems, where it's used for dimensionality reduction. It turns out it's also used for practical linear regression, where it provides a reasonably fast and very accurate way of solving the least squares problem that lies at the heart of linear regression.
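
You can call this layer directly: scipy.linalg.lstsq exposes the choice of LAPACK driver through its lapack_driver argument (shown here on the small X, y used earlier):

    from scipy.linalg import lstsq

    theta_lstsq, residues, rank, singular_values = lstsq(X, y, lapack_driver="gelsd")
    print(theta_lstsq, rank, singular_values)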

SVD and Moore-Penrose pseudoinverse

If we stop one step before the normal equations, we get a regular least squares problem:

$X \theta = y$

Since X is almost never square (usually we have more samples than features, i.e. a “tall and skinny” matrix X), this equation does not have an exact solution. Instead, we use the least squares approximation, i.e. the theta vector that is as close as possible to a solution in terms of Euclidean distance (L2 norm):

$\hat{\theta} = \arg\min_{\theta} \| X\theta - y \|_2^2$

This problem (OLS, Ordinary Least Squares) can be solved in many ways, but it turns out that we have a very useful theorem to help us:

Theorem: the least squares problem $\min_{\theta} \| X\theta - y \|_2$ is solved by $\hat{\theta} = X^{+} y$, where $X^{+}$ is the Moore-Penrose pseudoinverse of $X$.

The Moore-Penrose pseudoinverse is the matrix inverse approximation for arbitrary matrices — even non-square ones! In practice it's calculated through SVD, the Singular Value Decomposition. We decompose the matrix X into a product of 3 matrices:

$X = U \Sigma V^T$

The Moore-Penrose pseudoinverse is then defined as:

$X^{+} = V \Sigma^{+} U^T$

As you can see, if we have the SVD, computing the pseudoinverse is quite a trivial operation, since the sigma matrix is diagonal.

Finally, we arrive at a very practical formula for the linear regression coefficient vector:

$\hat{\theta} = X^{+} y = V \Sigma^{+} U^T y$
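
A sketch of this pipeline in plain NumPy on the toy X, y from earlier, building the pseudoinverse from the SVD explicitly (np.linalg.pinv and np.linalg.lstsq do essentially this for you, with more careful handling of tiny singular values):

    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U @ diag(sigma) @ Vt
    sigma_pinv = np.zeros_like(sigma)
    nonzero = sigma > 1e-12
    sigma_pinv[nonzero] = 1.0 / sigma[nonzero]              # invert only the nonzero singular values
    theta_svd = Vt.T @ (sigma_pinv * (U.T @ y))             # theta = V Sigma^+ U^T y
    print(theta_svd)
    print(np.linalg.pinv(X) @ y)                            # same result via the built-in pseudoinverse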

This is what is used in practice by Scikit-learn, Scipy, Numpy and a number of other packages. There are of course some optimizations that can enhance the performance, like a divide-and-conquer approach for faster SVD computation (used by Scikit-learn and Scipy by default), but those are implementation details. The main idea remains: use SVD and the Moore-Penrose pseudoinverse.

The advantages of this method are:

  1. Reasonably fast — while the SVD is quite costly to compute, it is still fast enough in practice. Many years of research have also contributed to the speed of modern implementations, allowing parallel and distributed computation of the decomposition.

  2. Extremely numerically stable — numerical stability of the computations is not an issue when using SVD. What's more, it allows us to get very precise results.

  3. Arrives exactly at the global minimum — this method is accurate almost down to machine epsilon, so we really get the best solution possible.

Beware though — this article is about linear regression, not about regularized versions like LASSO or ElasticNet! While this method works wonders for linear regression, with regularization we no longer have the nice least squares minimization and have to use e.g. coordinate descent.

Summary

In this article you've learned what's really happening under the hood of Scikit-learn's LinearRegression. While gradient descent and the normal equation have their applications (education and mathematical properties), in practice we use the Moore-Penrose pseudoinverse computed via SVD to get accurate predictions from our linear regression models.

Sources:

https://towardsdatascience.com/why-gradient-descent-and-normal-equation-are-bad-for-linear-regression-928f8b32fa4f
