Applied Machine Learning Process

The Systematic Process For Working Through Predictive Modeling Problems That Delivers Above Average Results

Over time, as you work on applied machine learning problems, you develop a pattern or process for quickly getting to good, robust results.

Once developed, you can use this process again and again on project after project. The more robust and developed your process, the faster you can get to reliable results.

In this post, I want to share with you the skeleton of my process for working a machine learning problem.

You can use this as a starting point or template on your next project.

5-Step Systematic Process

I like to use a 5-step process:

  1. Define the Problem
  2. Prepare Data
  3. Spot Check Algorithms
  4. Improve Results
  5. Present Results

There is a lot of flexibility in this process. For example, the “Prepare Data” step is typically broken down into analyzing the data (summarizing and graphing it) and preparing the data (preparing samples for experiments). The “Spot Check Algorithms” step may involve multiple formal experiments.

It’s a great big production line that I try to move through in a linear manner. The great thing about using automated tools is that you can go back a few steps (say from “Improve Results” back to “Prepare Data”), insert a new transform of the dataset, and re-run the experiments in the intervening steps to see what interesting results come out and how they compare to the experiments you executed before.

Production Line

Photo by East Capital, some rights reserved

The process I use has been adapted from the standard data mining process of knowledge discovery in databases (KDD). See the post What is Data Mining and KDD for more details.

1. Define the Problem

I like to use a three-step process to define the problem. I like to move quickly, and this mini-process lets me see the problem from a few different perspectives:

  • Step 1: What is the problem? Describe the problem informally and formally and list assumptions and similar problems.
  • Step 2: Why does the problem need to be solved? List your motivation for solving the problem, the benefits a solution provides and how the solution will be used.
  • Step 3: How would I solve the problem? Describe how the problem would be solved manually to flush out domain knowledge.

You can learn more about this process in the post:

2. Prepare Data

I preface data preparation with a data analysis phase that involves summarizing the attributes and visualizing them using scatter plots and histograms. I also like to describe in detail each attribute and the relationships between attributes. This grunt work forces me to think about the data in the context of the problem before it is lost to the algorithms.
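As a rough illustration of this analysis phase, here is a minimal sketch that uses only the Python standard library. The toy dataset, the attribute names and the helper functions are all hypothetical, and the text histogram is a crude stand-in for a real plot.

```python
import math
import statistics
from collections import Counter

# Hypothetical toy dataset: each row is one observation.
rows = [
    {"age": 23, "income": 31000.0},
    {"age": 35, "income": 52000.0},
    {"age": 41, "income": 61000.0},
    {"age": 29, "income": 38000.0},
    {"age": 52, "income": 75000.0},
]

def summarize(rows, attribute):
    """Return min, max, mean and sample standard deviation of one attribute."""
    values = [row[attribute] for row in rows]
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

def text_histogram(rows, attribute, bin_width):
    """Count values per fixed-width bin and render one text line per bin."""
    bins = Counter((row[attribute] // bin_width) * bin_width for row in rows)
    return [f"{start}-{start + bin_width - 1}: {'#' * count}"
            for start, count in sorted(bins.items())]

def pearson(rows, a, b):
    """Pearson correlation, a simple measure of how two attributes relate."""
    xs = [row[a] for row in rows]
    ys = [row[b] for row in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

age_summary = summarize(rows, "age")
age_histogram = text_histogram(rows, "age", 10)
age_income_corr = pearson(rows, "age", "income")

print(age_summary)
print("\n".join(age_histogram))
print(f"correlation(age, income) = {age_income_corr:.3f}")
```

Writing the summaries out by hand like this is mostly useful for the thinking it forces. On a real project a data analysis library would normally do the same job.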

The actual data preparation process has three steps:

  • Step 1: Data Selection: Consider what data is available, what data is missing and what data can be removed.
  • Step 2: Data Preprocessing: Organize your selected data by formatting, cleaning and sampling from it.
  • Step 3: Data Transformation: Transform the preprocessed data so that it is ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation.
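The three steps above can be sketched as three small functions chained together. This is only an illustration under stated assumptions: the records, the attribute names and the choice to clean by dropping incomplete rows are all hypothetical, and a real project would pick each rule to suit the problem.

```python
from datetime import date

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "signup": date(2014, 1, 15), "visits": 4, "spend": 20.0},
    {"id": 2, "signup": date(2014, 3, 2), "visits": None, "spend": 35.0},
    {"id": 3, "signup": date(2014, 7, 21), "visits": 10, "spend": 80.0},
    {"id": 4, "signup": date(2014, 11, 9), "visits": 7, "spend": 50.0},
]

def select(rows, keep):
    """Step 1, data selection: keep only the attributes judged relevant."""
    return [{key: row[key] for key in keep} for row in rows]

def preprocess(rows):
    """Step 2, preprocessing: here cleaning means dropping incomplete rows."""
    return [row for row in rows if all(v is not None for v in row.values())]

def transform(rows):
    """Step 3, transformation: decompose, combine and scale attributes."""
    out = []
    for row in rows:
        out.append({
            # Attribute decomposition: pull the month out of the date.
            "signup_month": row["signup"].month,
            # Attribute aggregation: combine two attributes into one ratio
            # (assumes visits is never zero in this toy data).
            "spend_per_visit": row["spend"] / row["visits"],
            "visits": row["visits"],
            "spend": row["spend"],
        })
    # Scaling: min-max scale every attribute to the range [0, 1].
    for key in out[0]:
        low = min(r[key] for r in out)
        high = max(r[key] for r in out)
        for r in out:
            r[key] = (r[key] - low) / (high - low) if high > low else 0.0
    return out

prepared = transform(preprocess(select(raw, ["signup", "visits", "spend"])))
for row in prepared:
    print(row)
```

Keeping each step as its own function makes it easy to swap in a different transform later and re-run the downstream experiments on the new version of the dataset.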

You can learn more about this process for preparing data in the post:

3. Spot Check Algorithms

I use 10-fold cross-validation in my test harnesses by default. All experiments (algorithm and dataset combinations) are repeated 10 times, and the mean and standard deviation of the accuracy are collected and reported. I also use statistical significance tests to separate meaningful results from noise. Box plots are very useful for summarizing the distribution of accuracy results for each algorithm and dataset pair.
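A test harness of this kind can be sketched in a few lines of standard-library Python. The function names, the synthetic data and the majority-class baseline are assumptions made for illustration. In scikit-learn, RepeatedStratifiedKFold together with cross_val_score covers similar ground.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class(train_X, train_y):
    """Baseline 'algorithm': always predict the most common training label."""
    label = max(sorted(set(train_y)), key=train_y.count)
    return lambda x: label

def evaluate(fit, X, y, k=10, repeats=10, seed=1):
    """Return mean and standard deviation of accuracy over repeats x k folds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for fold in k_fold_indices(len(X), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            predict = fit([X[i] for i in train], [y[i] for i in train])
            correct = sum(predict(X[i]) == y[i] for i in fold)
            scores.append(correct / len(fold))
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical dataset: 100 rows, 70 of class 0 and 30 of class 1.
X = [[float(i)] for i in range(100)]
y = [0] * 70 + [1] * 30

mean_acc, std_acc = evaluate(majority_class, X, y)
print(f"accuracy {mean_acc:.3f} (+/- {std_acc:.3f})")
```

Any algorithm can be plugged in as long as it follows the same convention: a function that takes training data and returns a predict function. The baseline is worth keeping in the harness because it shows what accuracy a model has to beat.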

I spot check algorithms, which means loading up a bunch of standard machine learning algorithms into my test harness and performing a formal experiment. I typically run 10-20 standard algorithms from all the major algorithm families across all the transformed and scaled versions of the dataset I have prepared.

The goal of spot checking is to flush out the types of algorithms and dataset combinations that are good at picking out the structure of the problem so that they can be studied in more detail with focused experiments.

More focused experiments with well-performing families of algorithms may be performed in this step, but algorithm tuning is left for the next step.
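To make the idea of spot checking concrete, here is a small sketch that crosses a few algorithms with two views of one dataset and ranks the combinations by cross-validated accuracy. Everything in it is an assumption for illustration: the three toy algorithms stand in for the 10-20 standard ones, the data is synthetic, and scaling the whole dataset before cross-validation is a simplification.

```python
import math
import random
import statistics

def majority(train_X, train_y):
    """Baseline: always predict the most common training label."""
    label = max(sorted(set(train_y)), key=train_y.count)
    return lambda x: label

def one_nearest_neighbor(train_X, train_y):
    """Predict the label of the closest training point."""
    def predict(x):
        nearest = min(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
        return train_y[nearest]
    return predict

def nearest_centroid(train_X, train_y):
    """Predict the label whose class mean is closest."""
    centroids = {}
    for label in sorted(set(train_y)):
        members = [x for x, t in zip(train_X, train_y) if t == label]
        centroids[label] = [statistics.mean(col) for col in zip(*members)]
    return lambda x: min(centroids, key=lambda c: math.dist(x, centroids[c]))

def cross_val_accuracy(fit, X, y, k=10, seed=1):
    """Mean and standard deviation of accuracy over k folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    scores = []
    for fold in (indices[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(predict(X[i]) == y[i] for i in fold) / len(fold))
    return statistics.mean(scores), statistics.stdev(scores)

def min_max_scale(X):
    """Scale each column to [0, 1] (assumes no column is constant)."""
    cols = list(zip(*X))
    lows, highs = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in X]

# Hypothetical data: the first feature separates the two classes, the second
# is pure noise on a much larger scale.
rng = random.Random(7)
X = [[rng.gauss(c, 1.0), rng.gauss(0.0, 1000.0)]
     for c in (0.0, 4.0) for _ in range(50)]
y = [label for label in (0, 1) for _ in range(50)]

algorithms = {"majority": majority, "1-NN": one_nearest_neighbor,
              "centroid": nearest_centroid}
datasets = {"raw": X, "scaled": min_max_scale(X)}

# One formal experiment: every algorithm on every view of the dataset.
results = {(algo, view): cross_val_accuracy(fit, data, y)
           for algo, fit in algorithms.items()
           for view, data in datasets.items()}

for (algo, view), (mean, std) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{algo:>8} on {view:<6} accuracy {mean:.3f} (+/- {std:.3f})")
```

The ranked table is the output of a spot check. It shows which algorithm and dataset combinations deserve a closer look, and it gives no tuned final model.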

You can discover more about defining your test harness in the post:

You can discover the importance of spot checking algorithms in the post:

4. Improve Results

After spot checking, it’s time to squeeze out the best result from the rig. I do this by running an automated sensitivity analysis on the parameters of the top-performing algorithms. I also design and run experiments using standard ensemble methods that combine the top-performing algorithms. I put a lot of time into thinking about how to get more out of the dataset or out of the family of algorithms that has been shown to perform well.

Again, statistical significance of results is critical here. It is so easy to focus on the methods and play with algorithm configurations. The results are only meaningful if they are significant, all configurations are thought out in advance, and the experiments are executed in batch. I also like to maintain my own personal leaderboard of top results on a problem.
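The post does not say which significance test is used, so the following is only one possible choice: an exact paired sign-flip permutation test on per-fold accuracy differences, written with the standard library. The fold scores are invented for illustration. Scores from cross-validation folds are not fully independent, so the p-value should be read as approximate.

```python
from itertools import product

def paired_sign_flip_test(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired differences.

    Null hypothesis: the two configurations perform the same, so each paired
    difference is equally likely to be positive or negative. Returns the
    fraction of sign assignments whose mean difference is at least as extreme
    as the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):  # 2**n assignments, fine for n=10
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical per-fold accuracies for two configurations on the same folds.
config_a = [0.90, 0.88, 0.92, 0.91, 0.89, 0.93, 0.90, 0.87, 0.91, 0.92]
config_b = [0.85, 0.86, 0.88, 0.87, 0.86, 0.90, 0.85, 0.84, 0.88, 0.89]

p_value = paired_sign_flip_test(config_a, config_b)
print(f"p-value: {p_value:.4f}")
```

Configuration A wins on every fold here, so only the two all-same-sign assignments are as extreme as the observed result, which gives a small p-value. Two configurations with identical fold scores would give a p-value of 1.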

In summary, the process of improving results involves:

  • Algorithm Tuning: where discovering the best models is treated like a search problem through model parameter space.
  • Ensemble Methods: where the predictions made by multiple models are combined.
  • Extreme Feature Engineering: where the attribute decomposition and aggregation seen in data preparation is pushed to the limits.
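The first two items can be sketched together: tuning treated as a search over a small grid declared in advance, followed by a voting ensemble of the best configurations. The k-nearest-neighbour model, the grid values and the synthetic data are assumptions chosen to keep the example dependency-free. In scikit-learn, GridSearchCV and VotingClassifier play comparable roles.

```python
import math
import random
import statistics
from collections import Counter

def knn(k):
    """Return a fit function for k-nearest neighbours with the given k."""
    def fit(train_X, train_y):
        def predict(x):
            order = sorted(range(len(train_X)),
                           key=lambda i: math.dist(x, train_X[i]))
            votes = Counter(train_y[i] for i in order[:k])
            return max(sorted(votes), key=votes.get)
        return predict
    return fit

def voting_ensemble(fits):
    """Combine several fit functions by majority vote over their predictions."""
    def fit(train_X, train_y):
        models = [f(train_X, train_y) for f in fits]
        def predict(x):
            votes = Counter(model(x) for model in models)
            return max(sorted(votes), key=votes.get)
        return predict
    return fit

def cross_val_accuracy(fit, X, y, k=5, seed=1):
    """Mean accuracy over k folds, using the same folds for every model."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    scores = []
    for fold in (indices[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(predict(X[i]) == y[i] for i in fold) / len(fold))
    return statistics.mean(scores)

# Hypothetical two-class data: two overlapping Gaussian blobs.
rng = random.Random(3)
X = [[rng.gauss(c, 1.5), rng.gauss(c, 1.5)]
     for c in (0.0, 3.0) for _ in range(40)]
y = [label for label in (0, 1) for _ in range(40)]

# Algorithm tuning: exhaustive search over a grid fixed before any run.
grid = [1, 3, 5, 7, 9]
scores = {k: cross_val_accuracy(knn(k), X, y) for k in grid}
best_k = max(scores, key=scores.get)

# Ensemble method: majority vote over the three best configurations.
top3 = sorted(grid, key=scores.get, reverse=True)[:3]
ensemble_score = cross_val_accuracy(voting_ensemble([knn(k) for k in top3]), X, y)

print(f"best k = {best_k} with accuracy {scores[best_k]:.3f}")
print(f"ensemble of k in {top3}: accuracy {ensemble_score:.3f}")
```

Because the best configuration is picked on the same folds it is scored on, the reported accuracy is somewhat optimistic. A held-out validation set would give a fairer estimate.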

You can discover more about this process in the post:

5. Present Results

The results of a complex machine learning problem are meaningless unless they are put to work. This typically means a presentation to stakeholders. Even if it is a competition or a problem I am working on for myself, I still go through the process of presenting the results. It’s good practice and gives me clear lessons I can build upon next time.

The template I use to present results is below and may take the form of a text document, formal report or presentation slides.

  • Context (Why): Define the environment in which the problem exists and set up the motivation for the research question.
  • Problem (Question): Concisely describe the problem as a question that you went out and answered.
  • Solution (Answer): Concisely describe the solution as an answer to the question you posed in the previous section. Be specific.
  • Findings: Bulleted lists of discoveries you made along the way that interest the audience. They may be discoveries in the data, methods that did or did not work, or the model performance benefits you achieved along your journey.
  • Limitations: Consider where the model does not work or questions that the model does not answer. Do not shy away from these questions. Your account of where the model excels is more credible if you can also define where it does not excel.
  • Conclusions (Why+Question+Answer): Revisit the “why”, research question and the answer you discovered in a tight little package that is easy to remember and repeat for yourself and others.

You can discover more about using the results of a machine learning project in the post:

Summary

In this post, you have learned my general template for working through a machine learning problem.

I use this process almost without fail, and I use it across platforms, from Weka, R and scikit-learn to newer platforms I have been playing around with, like pylearn2.

What is your process? Leave a comment and share.

Will you copy this process, and if so, what changes will you make to it?
