
How to Kick Ass in Competitive Machine Learning

David Kofoed Wind posted an article to the Kaggle blog No Free Hunch titled “Learning from the best”. In the post, David summarized 6 key areas related to participating in and doing well at competitive machine learning, with quotes from top-performing Kagglers.

In this post you will discover the key heuristics for doing well in competitive machine learning distilled from that post.


Learning from the best
Photo by Lida, some rights reserved

Learning from Kaggle Masters

David is a PhD student at the Technical University of Denmark in the Cognitive Systems department. Before that he was a masters student, and the title of his thesis was “Concepts in Predictive Machine Learning”.

You would not know it from the title, but it is a great thesis. In it, David distills advice from 3 Kaggle masters (Tim Salimans, Steve Donoho and Anil Thomas) and an analysis of the results from 10 competitions into an approach for performing well in Kaggle competitions, then tests these lessons by participating in 2 case-study competitions.

His framework has 5 components:

  1. Feature engineering is the most important part of predictive machine learning
  2. Overfitting to the leaderboard is a real issue
  3. Simple models can get you very far
  4. Ensembling is a winning strategy
  5. Predicting the right thing is important

David summarizes these five areas in his blog post and adds a sixth: general advice that does not fit into the other categories.

Predictive Modeling Competitive Framework

In this section we look at key lessons from each of the five parts of the framework and additional heuristics to consider.

Feature Engineering

Feature engineering is the data preparation step that involves the transformation, aggregation and decomposition of attributes into those features that best characterize the structure in the data for the modeling problem.

  • Data matters more than the algorithms you apply.
  • Spend most of your time on feature engineering.
  • Exploit automatic methods for generating, extracting, removing and altering attributes.
  • Semi-supervised learning methods used in deep learning can automatically model features.
  • Sometimes a careful denormalization of the dataset can outperform complex feature engineering.
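As a minimal sketch of what this can look like in practice (the column names and data below are hypothetical), aggregation and transformation of attributes into features with pandas might be:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one row per transaction.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 5.0, 20.0],
})

# Aggregation: roll transactions up into per-user summary features.
user_features = df.groupby("user_id")["amount"].agg(["mean", "max", "count"])

# Transformation: log-scale a skewed attribute.
df["log_amount"] = np.log1p(df["amount"])
```

The point is not these particular features, but that most of the modeling leverage comes from reshaping attributes like this before any algorithm sees them.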

Overfitting

Overfitting refers to creating models that perform well on the training data but not as well (or far worse) on unseen test data. This extends to the scores observed on the public leaderboard, which evaluates models on a sample of the test dataset (typically around 20%); the remainder is held back to identify the competition winners.

  • Small and noisy training datasets can result in a larger mismatch between leaderboard and final results.
  • The leaderboard does contain information: it can be used for model selection and hyper-parameter tuning.
  • Kaggle makes the dangers of overfitting painfully real.
  • Spend a lot of time on your test harness for estimating model accuracy, and even ignore the leaderboard.
  • Correlate test harness scores with leaderboard scores to evaluate the trust you can put in the leaderboard.
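The last heuristic can be checked directly. A minimal sketch, assuming you have recorded paired scores for the same set of submissions (the numbers below are made up):

```python
import numpy as np

# Hypothetical paired scores for the same submissions:
# local cross-validation score vs public leaderboard score.
cv_scores = np.array([0.81, 0.84, 0.86, 0.88, 0.90])
lb_scores = np.array([0.80, 0.83, 0.85, 0.86, 0.91])

# Pearson correlation: values near 1.0 suggest the local test
# harness tracks the leaderboard and can be trusted.
corr = np.corrcoef(cv_scores, lb_scores)[0, 1]
```

If the correlation is weak, trust your own test harness and treat leaderboard movements as noise.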

Simple Models

Using simple models refers to the use of classical or well-understood algorithms on a dataset rather than state-of-the-art methods that are typically more complex. The simplicity or complexity of a model refers to the number of terms required and the processes used to optimize those terms.

  • Simpler methods are commonly used by the best competitors.
  • Beginners move to complex models too soon, which can slow down learning about the problem.
  • Simpler models are faster to train, easier to understand and adapt, and in turn provide more insights.
  • Simpler models force you to work on the data first, rather than tune parameters.
  • Simple models may be the reproduction of the benchmark, such as an average response by segment.
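The benchmark mentioned above (an average response by segment) can be sketched in a few lines; the data and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical training data: predict `target` from a segment.
train = pd.DataFrame({
    "segment": ["a", "a", "b", "b"],
    "target":  [1.0, 3.0, 10.0, 20.0],
})

# Benchmark model: predict the mean target of each segment.
segment_means = train.groupby("segment")["target"].mean()
overall_mean = train["target"].mean()

def predict(segment):
    # Fall back to the global mean for unseen segments.
    return segment_means.get(segment, overall_mean)
```

A benchmark this simple gives you a working submission and a floor that any more complex model must beat.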

Ensembles

Ensembles refer to the combination of the predictions from multiple models into a single set of predictions, typically a blend weighted by the skill of each contributing model (such as on the public leaderboard).

  • Most prize winning models are ensembles of multiple models.
  • Ensembling highly tuned models as well as mediocre models gives good results.
  • Combining models constrained in diverse ways leads to better results.
  • Get the most out of algorithms before considering ensembles.
  • Investigate ensembles as a last step before the competition concludes.
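A minimal sketch of a skill-weighted blend, assuming three models' predictions on the same test rows and weights derived from each model's validation performance (all values hypothetical):

```python
import numpy as np

# Hypothetical predictions from three models on three test rows.
preds = np.array([
    [0.2, 0.9, 0.4],  # model A
    [0.3, 0.8, 0.5],  # model B
    [0.1, 1.0, 0.3],  # model C
])

# Weights proportional to each model's validation skill; they sum to 1.
weights = np.array([0.5, 0.3, 0.2])

# Weighted average per test row.
blend = weights @ preds
```

Blends of models constrained in different ways tend to beat any single member, which is why this is usually the last step before a deadline.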

Predict the Right Thing

Each competition has a specified model evaluation function that will be used to compare predictions made by a model against the actual values. This can inform the choice of loss function and how you frame the dataset for modeling, but it does not have to.

  • Brainstorm many different ways that the data could be used to model the problem.
  • For example, a flight landing time could be modeled directly, as a total flight time, or as a ratio of the expected flight time.
  • Explore the preparation of models using different loss functions (e.g. RMSE vs. MAE).
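The difference between loss functions is easy to demonstrate: RMSE punishes large errors far more than MAE, so the choice changes what a model is rewarded for. A toy example with made-up numbers:

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.0, 100.0])     # one large outlier
predicted = np.array([2.0, 5.0, 3.0, 10.0])

# Mean absolute error: all errors weighted equally.
mae = np.mean(np.abs(actual - predicted))

# Root mean squared error: large errors dominate the score.
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
```

If the competition scores on RMSE, a model tuned to minimize MAE is predicting the wrong thing.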

Additional Advice

This section lists additional insights from David and his interviewees for doing well in competitive machine learning.

  • Get something on the leaderboard as fast as possible.
  • Build a pipeline that loads data and reliably evaluates a model; it is harder than you think.
  • Have a toolbox with a lot of tools and know when and how to use them.
  • Make good use of the forums, both give and take.
  • Optimize model parameters late, after you know you are getting the most from the dataset.
  • Competitions are not won by one insight, but several chained together.
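The pipeline bullet above deserves emphasis. A minimal sketch of such a harness, using a synthetic dataset and a least-squares baseline as stand-ins for real competition data and models:

```python
import numpy as np

def load_data():
    # Stand-in for reading a competition CSV: synthetic features
    # and a noisy linear target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
    return X, y

def evaluate(fit, predict, X, y, k=5):
    # Simple k-fold cross-validation returning the mean RMSE.
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        params = fit(X[train], y[train])
        pred = predict(params, X[test])
        scores.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    return float(np.mean(scores))

# Baseline model: least-squares linear regression.
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w

X, y = load_data()
score = evaluate(fit, predict, X, y)
```

Once a harness like this is in place, every idea in the sections above (new features, simpler models, blends) can be evaluated the same way before it ever touches the leaderboard.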

Summary

In this post you discovered a framework of 5 concerns when participating in competitive machine learning: feature engineering, overfitting, use of simple models, ensembles and predicting the right thing.

We reviewed David’s framework as a set of key rules of thumb that you can use to get the most from data and algorithms when participating in Kaggle competitions.
