
A Simple Intuition for Overfitting, or Why Testing on Training Data is a Bad Idea

When you first start out with machine learning, you load a dataset and try models. You might think to yourself: why can't I just build a model with all of the data and evaluate it on the same dataset?

It seems reasonable. More data to train the model is better, right? Evaluating the model and reporting results on the same dataset will tell you how good the model is, right?

Wrong.

In this post you will discover the difficulties with this reasoning and develop an intuition for why it is important to test a model on unseen data.

Train and Test on the Same Dataset

If you have a dataset, say the iris flower dataset, what is the best model of that dataset?

Irises. Photo by dottieg2007, some rights reserved.

The best model is the dataset itself. If you take a given data instance and ask for its classification, you can look that instance up in the dataset and report the correct result every time.

This is the problem you are solving when you train and test a model on the same dataset.

You are asking the model to make predictions on data that it has “seen” before, data that was used to create the model. The best model for this problem is the look-up model described above.
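As a minimal sketch of that look-up model (my own illustration using scikit-learn's bundled iris data, not code from the original post):

```python
# A "look-up" model: memorize every training instance and its label.
# Sketch only; assumes scikit-learn is available for the iris data.
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Map each instance (as a tuple of attribute values) to its class label.
lookup = {tuple(row): label for row, label in zip(X, y)}

# "Evaluating" on the training data: every instance is found in the table,
# so accuracy is 100% (as long as no two identical instances carry different labels).
correct = sum(lookup[tuple(row)] == label for row, label in zip(X, y))
print(correct / len(y))  # 1.0

# A genuinely new instance is simply absent; the look-up model has no answer.
print(lookup.get((0.0, 0.0, 0.0, 0.0), "unseen instance, no prediction"))
```

Perfect on the data it memorized, useless on anything else.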

Descriptive Model

There are some circumstances where you do want to train a model and evaluate it with the same dataset.

You may want to simplify the explanation of a predictive variable from data. For example, you may want a set of simple rules or a decision tree that best describes the observations you have collected.

In this case, you are building a descriptive model.

These models can be very useful and can help you in your project or your business to better understand how the attributes relate to the predicted value. You can add meaning to the results with the domain expertise that you have.

The important limitation of a descriptive model is that it is limited to describing the data on which it was trained. You have no idea how accurate a predictive model it is.

Modeling a Target Function

Consider a made-up classification problem, the goal of which is to classify data instances as either red or green.

Modeling a Target Function. Photo by seantoyer, some rights reserved.

For this problem, assume that there exists a perfect model, or a perfect function that can correctly discriminate any data instance from the domain as red or green. In the context of a specific problem, the perfect discrimination function very likely has profound meaning in the problem domain to the domain experts. We want to think about that and try to tap into that perspective. We want to deliver that result.

Our goal when making a predictive model for this problem is to best approximate this perfect discrimination function.

We build our approximation of the perfect discrimination function using sample data collected from the domain. It’s not all the possible data, it’s a sample or subset of all possible data. If we had all the data, there would be no need to make predictions because the answers could just be looked up.

The data we use to build our approximate model contains structure within it pertaining to the ideal discrimination function. Your goal with data preparation is to best expose that structure to the modeling algorithm. The data also contains things that are irrelevant to the discrimination function, such as biases from the selection of the data and random noise that perturbs and hides the structure. The model you select to approximate the function must navigate these obstacles.
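To make this concrete, here is a small sketch of my own (not from the original post): an assumed “perfect” discrimination function over two attributes, and the noisy, finite sample drawn from it, which is all the modeling algorithm ever gets to see.

```python
# A made-up "perfect" discrimination function and a noisy sample drawn from it.
# The function, sample size and noise level are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(1)

def perfect_function(x1, x2):
    """The ideal we are trying to approximate: red inside a circle, green outside."""
    return "red" if x1 ** 2 + x2 ** 2 < 1.0 else "green"

# A finite sample from the domain, with a little label noise mixed in.
X = rng.uniform(-2.0, 2.0, size=(100, 2))
labels = [perfect_function(x1, x2) for x1, x2 in X]
flips = rng.random(len(labels)) < 0.05  # roughly 5% of labels are corrupted
y = [("green" if label == "red" else "red") if flip else label
     for label, flip in zip(labels, flips)]

# A predictive model only ever sees (X, y); its job is to recover
# perfect_function as closely as possible despite the noise and sampling bias.
```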

The framework helps us understand the deeper difference between a descriptive and predictive model.

Descriptive vs Predictive Models

The descriptive model is only concerned with modeling the structure in the observed data. It makes sense to train and evaluate it on the same dataset.

The predictive model is attempting a much more difficult problem: approximating the true discrimination function from a sample of data. We want to use algorithms that do not pick out and model all of the noise in our sample. We do want to choose algorithms that generalize beyond the observed data. It follows that we can only evaluate a model's ability to generalize from a data sample by testing it on data it had not seen during training.

The best descriptive model is accurate on the observed data. The best predictive model is accurate on unobserved data.
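A quick sketch of that difference (my own example with scikit-learn and the iris data, not from the original post): an unconstrained decision tree is a near-perfect descriptive model of the observed data, but only its accuracy on held-out data tells you anything about prediction.

```python
# Training accuracy vs. held-out accuracy for an unconstrained decision tree.
# Illustrative sketch; the dataset and split proportions are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

print("accuracy on observed (training) data:", tree.score(X_train, y_train))
print("accuracy on unobserved (test) data:  ", tree.score(X_test, y_test))
# The first number is typically 1.0; the second is the honest estimate of skill.
```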

Overfitting

The flaw with evaluating a predictive model on training data is that it does not tell you how well the model has generalized to new, unseen data.

A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely to have lower accuracy on an unseen test dataset. The reason is that the model is not as generalized. It has specialized to the structure in the training dataset. This is called overfitting, and it's more insidious than you think.

For example, you may want to stop training your model once the accuracy stops improving. In this situation, there will be a point where the accuracy on the training set continues to improve but the accuracy on unseen data starts to degrade.
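The original example is about training iterations; as a stand-in, here is a sketch of my own on synthetic data, where letting a decision tree grow deeper plays the role of “training for longer”. Training accuracy keeps climbing while test accuracy stalls and then degrades.

```python
# Training accuracy rises with model complexity while test accuracy degrades.
# Synthetic data and the depth schedule are assumptions chosen for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=1)

for depth in (1, 2, 4, 8, 16, None):  # a deeper tree ~ "training for longer"
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 3),  # keeps improving
          round(tree.score(X_test, y_test), 3))    # peaks, then falls away
```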

You may be thinking to yourself: “so I'll train on the training dataset and peek at the test dataset as I go”. A fine idea, but now the test dataset is no longer unseen data, because it has been involved in and has influenced the training process.

Tackling Overfitting

A split of the data of 66%/34% for training and test datasets is a good start. Using cross validation is better, and using multiple runs of cross validation is better again. You want to spend the time and get the best estimate of the model's accuracy on unseen data.
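As an illustration (my own sketch with scikit-learn, not code from the original post), a 66%/34% split, 10-fold cross validation, and repeated runs of 10-fold cross validation can each be set up in a few lines:

```python
# Three estimates of accuracy on unseen data, from quickest to most thorough.
# Sketch only; the dataset and the model choice are assumptions for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import (RepeatedStratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=1)

# 1. A single 66%/34% train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=1)
print(model.fit(X_train, y_train).score(X_test, y_test))

# 2. 10-fold cross validation.
print(cross_val_score(model, X, y, cv=10).mean())

# 3. Multiple runs (repeats) of 10-fold cross validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
print(cross_val_score(model, X, y, cv=cv).mean())
```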

You can also improve the accuracy of your model on unseen data by decreasing its complexity.

In the case of decision trees, for example, you can prune the tree (delete leaves) after training. This will decrease the amount of specialisation to the specific training dataset and increase generalisation on unseen data. If you are using regression, for example, you can use regularisation to constrain the complexity (the magnitude of the coefficients) during the training process.
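As a rough sketch of both ideas (my own, with scikit-learn; the iris data, the ccp_alpha value and the regularisation strength are arbitrary assumptions you would tune using the training data only, and a regularised linear classifier stands in here for the post's regression example):

```python
# Constraining complexity: prune a decision tree after training, and
# regularise a linear classifier's coefficients during training.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=1)

# Cost-complexity pruning (recent scikit-learn versions): a larger ccp_alpha
# removes more of the fitted tree, trading training accuracy for generalisation.
full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=1).fit(X_train, y_train)
print(full.get_n_leaves(), pruned.get_n_leaves())  # fewer leaves after pruning
print(pruned.score(X_test, y_test))

# Regularisation: a smaller C constrains the magnitude of the coefficients
# more strongly during training.
logit = LogisticRegression(C=0.1, max_iter=1000).fit(X_train, y_train)
print(logit.score(X_test, y_test))
```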

Summary

In this post you learned the important framework of phrasing the development of a predictive model as an approximation of an unknown ideal discrimination function.

Under this framework you learned that evaluating the model on training data alone is insufficient. You learned that the best and most meaningful way to evaluate the ability of a predictive model to generalize is to evaluate it on unseen data.

This intuition provided the basis for why it is critical to use train/test splits, cross validation and, ideally, multiple runs of cross validation in your test harness when evaluating predictive models.
