
[Andrew Ng] Regularization

Chapter 7 Regularization

The problem of overfitting

If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples.

Addressing overfitting

  1. Reduce the number of features
  • Manually select which features to keep.
  • Use a model selection algorithm.
  2. Regularization
  • Keep all the features, but reduce the magnitude/values of the parameters \(\theta_j\), which gives a simpler hypothesis and a smoother function.

Cost function

\[J(\theta)=\frac{1}{2m} \left[\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum^n_{j=1}\theta_j^2\right] \]

\(\lambda\) is the regularization parameter; it controls a trade-off between two goals: fitting the training set well, and keeping the parameters small to avoid overfitting.

We don't need to shrink \(\theta_0\), because \(\theta_0\) corresponds to the constant term, which has little influence on overfitting.
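A minimal NumPy sketch of this regularized cost (illustrative, not the lecture's code; `X` is assumed to be the \(m\times(n+1)\) design matrix with a leading column of ones, `y` the label vector, and `lam` stands for \(\lambda\)):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost J(theta) (illustrative sketch)."""
    m = len(y)
    residual = X @ theta - y                            # h_theta(x^(i)) - y^(i)
    fit_term = (residual @ residual) / (2 * m)          # (1/2m) * sum of squared errors
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)   # skip theta_0 (bias term)
    return fit_term + reg_term
```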

Linear regression

Gradient descent

repeat until convergence{

\[\begin{aligned} \theta_0&=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta)\\ &=\theta_0-\frac{\alpha}{m} \sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]x_0^{(i)}\\ \theta_j&=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ &=\theta_j-\alpha[\frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j]\\ &=\theta_j(1-\alpha\frac{\lambda}{m})-\frac{\alpha}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \end{aligned} \]

(\(j=1,\cdots,n\))

}

\(1-\alpha\frac{\lambda}{m}<1\) but very close to \(1\), because \(\alpha\) is small and \(m\) is large. Multiplying \(\theta_j\) by this factor shrinks it slightly on every iteration, reducing the influence of \(\theta_j\).
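A minimal sketch of one such update in NumPy (assumed names: `alpha` for \(\alpha\), `lam` for \(\lambda\); same design-matrix convention as above):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update (illustrative sketch)."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m               # unregularized gradient for all j
    new_theta = theta - alpha * grad               # ordinary update, including theta_0
    new_theta[1:] -= alpha * lam / m * theta[1:]   # extra shrinkage (lambda/m)*theta_j, j >= 1
    return new_theta
```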

Normal equation

\[\theta=(X^TX+\lambda \left[ \begin{matrix} 0&&&&\\ &1&&&\\ &&1&&\\ &&&\ddots&\\ &&&&1 \end{matrix} \right]_{(n+1)\times(n+1)} )^{-1}X^Ty \]

If \(\lambda>0\), the matrix being inverted is provably invertible, even when \(X^TX\) itself is singular (e.g. when \(m\le n\)).
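A sketch of this closed-form solution, assuming the same design-matrix convention as above and using `np.linalg.solve` rather than an explicit inverse:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """Closed-form regularized solution (illustrative sketch)."""
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0                      # do not regularize the bias term theta_0
    # For lam > 0, X^T X + lam*L is invertible, so the solve succeeds.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```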

Logistic regression

\[\begin{aligned} J(\theta)=-\frac{1}{m}\sum_{i=1}^m[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2 \end{aligned} \]
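A sketch of this cost in NumPy, under the same assumptions as the linear-regression example (illustrative helper names, not lecture code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """Regularized logistic-regression cost (illustrative sketch)."""
    m = len(y)
    h = sigmoid(X @ theta)                                  # h_theta(x^(i)) for all i
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)       # skip theta_0
    return cross_entropy + reg_term
```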

Gradient descent

repeat until convergence{

\[\begin{aligned} \theta_0&=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta)\\ &=\theta_0-\frac{\alpha}{m} \sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]x_0^{(i)}\\ \theta_j&=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ &=\theta_j-\alpha[\frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j]\\ &=\theta_j(1-\alpha\frac{\lambda}{m})-\frac{\alpha}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \end{aligned} \]

(\(j=1,\cdots,n\))

}
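The update looks identical to the linear-regression one, but here \(h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}\) is the sigmoid hypothesis. A matching sketch of one update step (illustrative, not lecture code):

```python
import numpy as np

def logistic_gradient_step(theta, X, y, alpha, lam):
    """One regularized update for logistic regression (illustrative sketch)."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))         # sigmoid hypothesis
    grad = X.T @ (h - y) / m                       # unregularized gradient
    new_theta = theta - alpha * grad
    new_theta[1:] -= alpha * lam / m * theta[1:]   # extra shrinkage, skipping theta_0
    return new_theta
```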

Words and expressions

ameliorate: to improve

wiggly: curvy, bending