
How to Identify Outliers in your Data

Bojan Miletic asked a question about outlier detection in datasets when working with machine learning algorithms. This post is in answer to his question.

If you have a question about machine learning, sign up for the newsletter and reply to an email, or use the contact form and ask. I will answer your question and may even turn it into a blog post.

Outliers

Many machine learning algorithms are sensitive to the range and distribution of attribute values in the input data. Outliers in input data can skew and mislead the training process of machine learning algorithms resulting in longer training times, less accurate models and ultimately poorer results.

Outlier
Photo by Robert S. Donovan, some rights reserved

Even before predictive models are prepared on training data, outliers can result in misleading representations and in turn misleading interpretations of collected data. Outliers can skew the summary distribution of attribute values in descriptive statistics like mean and standard deviation and in plots such as histograms and scatterplots, compressing the body of the data.
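As a small illustration of that skew, the sketch below compares summary statistics with and without a single extreme value. The attribute values are made up for the example and do not come from any real dataset.

```python
import statistics

# Hypothetical attribute values; the final value in `with_outlier` is an injected extreme.
clean = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 13.0]
with_outlier = clean + [100.0]


def summarize(values):
    """Return (mean, sample standard deviation, median) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.stdev(values),
        statistics.median(values),
    )


clean_mean, clean_std, clean_median = summarize(clean)
out_mean, out_std, out_median = summarize(with_outlier)

print(f"clean:        mean={clean_mean:.2f} std={clean_std:.2f} median={clean_median:.2f}")
print(f"with outlier: mean={out_mean:.2f} std={out_std:.2f} median={out_median:.2f}")
```

The mean and standard deviation move a long way in response to the one extreme value, while the median barely changes.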

Finally, outliers can represent examples of data instances that are relevant to the problem such as anomalies in the case of fraud detection and computer security.

Outlier Modeling

Outliers are extreme values that fall a long way outside of the other observations. For example, in a normal distribution, outliers may be values on the tails of the distribution.

The process of identifying outliers has many names in data mining and machine learning, such as outlier mining, outlier modeling, novelty detection and anomaly detection.

In his book Outlier Analysis (affiliate link), Aggarwal provides a useful taxonomy of outlier detection methods, as follows:

  • Extreme Value Analysis: Determine the statistical tails of the underlying distribution of the data. For example, statistical methods such as z-scores on univariate data.
  • Probabilistic and Statistical Models: Determine unlikely instances from a probabilistic model of the data. For example, Gaussian mixture models optimized using expectation-maximization.
  • Linear Models: Projection methods that model the data in lower dimensions using linear correlations. For example, with principal component analysis, data with large residual errors may be outliers.
  • Proximity-based Models: Data instances that are isolated from the mass of the data as determined by cluster, density or nearest neighbor analysis.
  • Information Theoretic Models: Outliers are detected as data instances that increase the complexity (minimum code length) of the dataset.
  • High-Dimensional Outlier Detection: Methods that search subspaces for outliers, given the breakdown of distance-based measures in higher dimensions (curse of dimensionality).

Aggarwal comments that the interpretability of an outlier model is critically important. Context or rationale is required to explain why a specific data instance is or is not an outlier.

In his contributing chapter to Data Mining and Knowledge Discovery Handbook (affiliate link), Irad Ben-Gal proposes a taxonomy of outlier models as univariate or multivariate and parametric and nonparametric. This is a useful way to structure methods based on what is known about the data. For example:

  • Are you concerned with outliers in one attribute or in more than one attribute (univariate or multivariate methods)?
  • Can you assume a statistical distribution from which the observations were sampled or not (parametric or nonparametric)?

Get Started

There are many methods for outlier detection and much research has been put into it. Start by making some assumptions and design experiments where you can clearly observe the effects of those assumptions against some performance or accuracy measure.

I recommend working through a stepped process from extreme value analysis, to proximity methods, to projection methods.

Extreme Value Analysis

You do not need to know advanced statistical methods to look for, analyze and filter out outliers from your data. Start out simple with extreme value analysis.

  • Focus on univariate methods
  • Visualize the data using scatterplots, histograms and box and whisker plots and look for extreme values
  • Assume a distribution (Gaussian) and look for values more than 2 or 3 standard deviations from the mean, or values more than 1.5 times the interquartile range below the first quartile or above the third quartile
  • Filter out candidate outliers from the training dataset and assess your model's performance
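The two numeric rules in the list above can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: the function names, the default thresholds and the sample values are all hypothetical, and the quartiles use the default method of `statistics.quantiles`, which is only one of several common conventions.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumes the values are not all identical
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical univariate attribute: 30 values between 47 and 53, plus one extreme value.
values = [50 + (i % 7) - 3 for i in range(30)] + [95]

z_flagged = zscore_outliers(values)
iqr_flagged = iqr_outliers(values)
print("z-score candidates:", z_flagged)
print("IQR candidates:", iqr_flagged)

# Candidate outliers removed, ready for a model to be re-assessed on the filtered data.
filtered = [v for v in values if v not in iqr_flagged]
```

Note that on very small samples the z-score rule can never fire: the largest possible z-score in a sample of n values is (n - 1) / sqrt(n), which is below 3 when n is 10 or fewer.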

Proximity Methods

Once you have explored simpler extreme value methods, consider moving on to proximity-based methods.

  • Use clustering methods to identify the natural clusters in the data (such as the k-means algorithm)
  • Identify and mark the cluster centroids
  • Identify data instances that are more than a fixed distance or percentage distance from their cluster centroids
  • Filter out candidate outliers from the training dataset and assess your model's performance
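The steps above can be sketched as follows. This is an illustrative toy, not a recommended implementation: the k-means routine is a bare-bones Lloyd's algorithm with a deliberately simplistic deterministic initialization, and the function names, the sample points and the distance cut-off of 2.0 are all assumptions made for the example. In practice a library implementation such as scikit-learn's `KMeans` would normally be a better choice.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20):
    """Minimal Lloyd's-algorithm k-means, for illustration only."""
    # Simplistic initialization (an assumption): k evenly spaced points from the input.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # New centroid is the per-axis mean of the cluster members.
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assign(points, centroids)


def centroid_outliers(points, k, max_distance):
    """Return points farther than `max_distance` from their own cluster centroid."""
    centroids, labels = kmeans(points, k)
    return [
        p for p, label in zip(points, labels)
        if math.dist(p, centroids[label]) > max_distance
    ]


# Hypothetical data: two tight clusters plus one isolated point at the end.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
    (0.9, 0.9), (1.3, 1.0), (1.0, 0.7), (0.7, 1.2),
    (8.0, 8.0), (8.2, 7.9), (7.8, 8.1), (8.1, 8.3),
    (7.9, 7.7), (8.3, 8.0), (8.0, 8.2), (7.7, 7.8),
    (4.5, 9.5),
]

candidates = centroid_outliers(points, k=2, max_distance=2.0)
print("candidate outliers:", candidates)
```

One caveat worth keeping in mind: the outlier itself pulls its centroid toward it, and with an unlucky initialization it can end up with a cluster of its own, at distance zero from its centroid. That is a reason to try several values of k and thresholds rather than trusting a single run.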

Projection Methods

Projection methods are relatively simple to apply and quickly highlight extraneous values.

  • Use projection methods to summarize your data to two dimensions (such as PCA, SOM or Sammon’s mapping)
  • Visualize the mapping and identify outliers by hand
  • Use proximity measures from projected values or codebook vectors to identify outliers
  • Filter out candidate outliers from the training dataset and assess your model's performance
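A rough sketch of the PCA variant of these steps, using only the standard library, is shown below. It approximates the top two principal components by power iteration with deflation, projects the data onto them, and then applies one possible proximity measure: the distance from each projected point to its nearest projected neighbour. The function names, the synthetic data and the gap threshold of 3.0 are assumptions for illustration; SOM and Sammon's mapping are not shown, and a numerical library would normally be used for the decomposition.

```python
import math


def covariance_matrix(rows):
    """Center the rows; return (centered rows, sample covariance matrix)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return centered, cov


def principal_components(cov, count, iterations=500):
    """Approximate the top `count` eigenvectors by power iteration with deflation."""
    d = len(cov)
    cov = [row[:] for row in cov]  # work on a copy
    components = []
    for _ in range(count):
        v = [1.0 / math.sqrt(d)] * d  # fixed start vector (an assumption)
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d)
        )
        components.append(v)
        # Deflate: remove the variance explained by this component.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= eigenvalue * v[i] * v[j]
    return components


def project(rows, n_components=2):
    """Project the centered rows onto the leading principal components."""
    centered, cov = covariance_matrix(rows)
    components = principal_components(cov, n_components)
    return [
        tuple(sum(c[j] * comp[j] for j in range(len(comp))) for comp in components)
        for c in centered
    ]


def isolated_points(rows, min_gap, n_components=2):
    """Return rows whose nearest neighbour in the projection is farther than `min_gap`."""
    projected = project(rows, n_components)
    flagged = []
    for i, p in enumerate(projected):
        nearest = min(math.dist(p, q) for j, q in enumerate(projected) if j != i)
        if nearest > min_gap:
            flagged.append(rows[i])
    return flagged


# Hypothetical 3-attribute data: a regular grid with a little noise in the
# third attribute, plus one isolated instance appended at the end.
rows = [
    (float(x), float(y), 0.1 * ((x + 2 * y) % 3 - 1))
    for x in range(10)
    for y in range(3)
]
rows.append((4.5, 9.0, 0.0))

candidates = isolated_points(rows, min_gap=3.0)
print("candidate outliers:", candidates)
```

The projected coordinates returned by `project` are also what you would plot to identify outliers by hand.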

Methods Robust to Outliers

An alternative strategy is to move to models that are robust to outliers. There are robust forms of regression that minimize the median of the squared errors rather than the mean (so-called robust regression), but they are more computationally intensive. There are also methods like decision trees that are robust to outliers.

You could spot check some methods that are robust to outliers. If there are significant model accuracy benefits then there may be an opportunity to model and filter out outliers from your training data.
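To make the idea of a robust fit concrete, the sketch below compares ordinary least squares with the Theil-Sen estimator on a line with one corrupted observation. Theil-Sen (the median of all pairwise slopes) is a different robust estimator from the median-of-squared-errors approach mentioned above; it is used here only because it fits in a few lines. The function names and data are hypothetical.

```python
import statistics
from itertools import combinations


def ols_fit(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def theil_sen_fit(xs, ys):
    """Theil-Sen fit: median of pairwise slopes, then median of the implied intercepts."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


# Hypothetical data on the line y = 2x + 1, with the last observation corrupted.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[-1] = 60.0  # injected outlier (the uncorrupted value would be 19.0)

ols_slope, ols_intercept = ols_fit(xs, ys)
ts_slope, ts_intercept = theil_sen_fit(xs, ys)
print(f"OLS:       slope={ols_slope:.2f} intercept={ols_intercept:.2f}")
print(f"Theil-Sen: slope={ts_slope:.2f} intercept={ts_intercept:.2f}")
```

The least squares slope is pulled well away from 2 by the single corrupted point, while the Theil-Sen fit recovers the underlying line. The pairwise approach costs O(n²) slopes, which reflects the extra computation that robust methods tend to need.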

Resources

There are a lot of webpages that discuss outlier detection, but I recommend reading through a good book on the subject, something more authoritative. Even looking through introductory books on machine learning and data mining won’t be that useful to you. For a classical treatment of outliers by statisticians, check out:

For a modern treatment of outliers by the data mining community, see:
