
Common Pitfalls In Machine Learning Projects

In a recent presentation, Ben Hamner described the common pitfalls in machine learning projects he and his colleagues have observed during competitions on Kaggle.

In this post we take a look at the pitfalls from Ben’s talk, what they look like and how to avoid them.

Machine Learning Process

Early in the talk, Ben presented a snapshot of the process for working a machine learning problem end-to-end.

Machine Learning Process
Taken from “Machine Learning Gremlins” by Ben Hamner

This snapshot included 9 steps, as follows:

  1. Start with a business problem
  2. Source data
  3. Split data
  4. Select an evaluation metric
  5. Perform feature extraction
  6. Model training
  7. Feature selection
  8. Model selection
  9. Production system

He commented that the process is iterative rather than linear.

He also commented that each step in this process can go wrong, derailing the whole project.

Discriminating Dogs and Cats

Ben presented a case study problem for building an automatic cat door that can let the cat in and keep the dog out. This was an instructive example as it touched on a number of key problems in working a data problem.

Discriminating Dogs and Cats
Taken from “Machine Learning Gremlins” by Ben Hamner

Sample Size

The first great takeaway from this example was that he plotted model accuracy against training sample size and showed that accuracy increased as more samples were added.

He then added more data until accuracy leveled off. This was a great example of how easy it can be to gauge the sensitivity of your system to sample size and adjust accordingly.
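The sample-size study described above can be sketched as a simple learning curve. This is only an illustration under stated assumptions, not the experiment from the talk: the data is synthetic (two overlapping 2-D Gaussian blobs), the model is a hand-written nearest-centroid classifier, and the names `make_data`, `fit_centroids`, and `learning_curve` are hypothetical helpers introduced here.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data: two overlapping 2-D Gaussian blobs (an assumption)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        point = (rng.gauss(center, 1.5), rng.gauss(center, 1.5))
        data.append((point, label))
    return data


def fit_centroids(train):
    """Nearest-centroid classifier: store the mean point of each class."""
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}
    for (x, y), label in train:
        sums[label][0] += x
        sums[label][1] += y
        sums[label][2] += 1
    # max(n, 1) avoids division by zero if a class is missing from a tiny sample.
    return {label: (sx / max(n, 1), sy / max(n, 1))
            for label, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    return min(
        centroids,
        key=lambda c: (point[0] - centroids[c][0]) ** 2
        + (point[1] - centroids[c][1]) ** 2,
    )


def accuracy(centroids, test):
    """Fraction of held-out points classified correctly."""
    correct = sum(predict(centroids, p) == label for p, label in test)
    return correct / len(test)


def learning_curve(train_pool, test, sizes):
    """Train on the first n samples for each n and score on the same held-out set."""
    return [(n, accuracy(fit_centroids(train_pool[:n]), test)) for n in sizes]


rng = random.Random(0)  # fixed seed so runs are repeatable
train_pool = make_data(2000, rng)
test = make_data(1000, rng)
curve = learning_curve(train_pool, test, [5, 20, 100, 500, 2000])
for n, acc in curve:
    print(f"{n:5d} training samples -> held-out accuracy {acc:.3f}")
```

If the accuracy stops improving between the larger sizes, collecting more data of the same kind is unlikely to help much; if it is still climbing, more samples may be worth the cost.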

Wrong Problem

The second great takeaway from this example was that the system failed: it let in all the cats in the neighborhood.

It was a clever example highlighting the importance of understanding the constraints of the problem that needs to be solved, rather than the problem that you want to solve.

Pitfalls In Machine Learning Projects

Ben went on to discuss four common pitfalls when working on machine learning problems.

Although these problems are common, he pointed out that they can be identified and addressed relatively easily.

Overfitting
Taken from “Machine Learning Gremlins” by Ben Hamner

  • Data Leakage: Using data in the model that a production system would not have access to. This is particularly common in time series problems. It can also happen with fields such as system IDs that may indicate a class label. Run a model and take a careful look at the attributes that contribute most to its success, then sanity check whether they make sense. (See the referenced paper “Leakage in Data Mining”.)
  • Overfitting: Modeling the training data so closely that the model also fits the noise in it. The result is a poor ability to generalize. This becomes more of a problem in higher dimensions with more complex class boundaries.
  • Data Sampling and Splitting: Related to data leakage, you need to be very careful that the train/test/validation sets are indeed independent samples. Time series problems require much thought and work to ensure that you can replay data to the system chronologically and validate model accuracy.
  • Data Quality: Check the consistency of your data. Ben gave an example of flight data in which some aircraft were landing before taking off. Inconsistent, duplicate, and corrupt data need to be identified and handled explicitly, because they can directly hurt the modeling problem and the ability of a model to generalize.
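Two of the checks above, the consistency check on flight data and the chronological split for time series, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the records and their field names (`id`, `takeoff`, `landing`) are made up for the example, and `find_quality_issues`, `clean`, and `chronological_split` are hypothetical helpers, not anything shown in the talk.

```python
from datetime import datetime

# Hypothetical flight records; the field names and values are illustrative only.
flights = [
    {"id": "F1", "takeoff": datetime(2014, 1, 1, 8, 0), "landing": datetime(2014, 1, 1, 10, 0)},
    {"id": "F2", "takeoff": datetime(2014, 1, 2, 9, 0), "landing": datetime(2014, 1, 2, 8, 30)},  # lands before takeoff
    {"id": "F3", "takeoff": datetime(2014, 1, 3, 7, 0), "landing": datetime(2014, 1, 3, 9, 0)},
    {"id": "F1", "takeoff": datetime(2014, 1, 1, 8, 0), "landing": datetime(2014, 1, 1, 10, 0)},  # exact duplicate
    {"id": "F4", "takeoff": datetime(2014, 1, 4, 6, 0), "landing": datetime(2014, 1, 4, 8, 0)},
    {"id": "F5", "takeoff": datetime(2014, 1, 5, 6, 0), "landing": datetime(2014, 1, 5, 8, 0)},
    {"id": "F6", "takeoff": datetime(2014, 1, 6, 6, 0), "landing": datetime(2014, 1, 6, 8, 0)},
]


def find_quality_issues(records):
    """Return ids of records that land before takeoff and ids of exact duplicates."""
    seen = set()
    inconsistent, duplicates = [], []
    for r in records:
        if r["landing"] < r["takeoff"]:
            inconsistent.append(r["id"])
        key = (r["id"], r["takeoff"], r["landing"])
        if key in seen:
            duplicates.append(r["id"])
        seen.add(key)
    return inconsistent, duplicates


def clean(records):
    """Drop inconsistent rows and keep only the first copy of each duplicate."""
    seen = set()
    kept = []
    for r in records:
        key = (r["id"], r["takeoff"], r["landing"])
        if r["landing"] < r["takeoff"] or key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept


def chronological_split(records, train_fraction=0.8):
    """Sort by time and cut once, so every training row precedes every test row."""
    ordered = sorted(records, key=lambda r: r["takeoff"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


inconsistent, duplicates = find_quality_issues(flights)
print("inconsistent:", inconsistent, "duplicates:", duplicates)

train, test = chronological_split(clean(flights))
print("train:", [r["id"] for r in train], "test:", [r["id"] for r in test])
```

Dropping bad rows is only one way to handle them explicitly; depending on the problem, repairing or flagging them may be more appropriate. The single time-ordered cut is the simplest guard against training on the future; a random shuffle before splitting would reintroduce the leakage described above.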

Summary

Ben’s “Machine Learning Gremlins” is a quick and practical talk.

You will get a useful crash course in the common pitfalls we are all susceptible to when working on a data problem.
