
Improve Model Accuracy with Data Pre-Processing

Data preparation can make or break the predictive ability of your model.

In Chapter 3 of their book Applied Predictive Modeling, Kuhn and Johnson introduce the process of data preparation. They refer to it as the addition, deletion or transformation of training set data.

In this post you will discover the data pre-processing steps that you can use to improve the predictive ability of your models.

I Love Spreadsheets. Photo by Craig Chew-Moulding, some rights reserved.

Data Preparation

You must pre-process your raw data before you model your problem. The specific preparation may depend on the data that you have available and the machine learning algorithms you want to use.

Sometimes, pre-processing of data can lead to unexpected improvements in model accuracy. This may be because a relationship in the data has been simplified or unobscured.

Data preparation is an important step and you should experiment with data pre-processing steps that are appropriate for your data to see if you can get that desirable boost in model accuracy.

There are three types of pre-processing you can consider for your data:

  • Add attributes to your data
  • Remove attributes from your data
  • Transform attributes in your data

We will dive into each of these three types of pre-processing and review some specific examples of operations that you can perform.

Add Data Attributes

Advanced models can extract the relationships from complex attributes, although some models require those relationships to be spelled out plainly. Deriving new attributes from your training data to include in the modeling process can give you a boost in model performance.

  • Dummy Attributes: Categorical attributes can be converted into n binary attributes, where n is the number of categories (or levels) that the attribute has. These denormalized or decomposed attributes are known as dummy attributes or dummy variables.
  • Transformed Attribute: A transformed variation of an attribute can be added to the dataset in order to allow a linear method to exploit possible linear and non-linear relationships between attributes. Simple transforms like log, square and square root can be used.
  • Missing Data: Attributes with missing data can have that missing data imputed using a reliable method, such as k-nearest neighbors.
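
As a rough sketch of how these additions might look in practice, the snippet below uses pandas and scikit-learn on a small, hypothetical dataset: it decomposes a categorical attribute into dummy variables, adds a log-transformed copy of a skewed attribute, and imputes a missing value with k-nearest neighbors. The column names, values and n_neighbors setting are illustrative only.

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer

    # Small hypothetical dataset: a categorical attribute, a skewed numeric
    # attribute, and a numeric attribute with a missing value.
    df = pd.DataFrame({
        "color": ["red", "green", "blue", "green"],
        "income": [30000.0, 45000.0, 1200000.0, 52000.0],
        "age": [25.0, np.nan, 47.0, 33.0],
    })

    # Dummy attributes: decompose the categorical attribute into binary columns.
    df = pd.get_dummies(df, columns=["color"])

    # Transformed attribute: add a log copy so a linear method can exploit a
    # possible non-linear relationship on the original scale.
    df["log_income"] = np.log(df["income"])

    # Missing data: impute the missing value using k-nearest neighbors.
    imputer = KNNImputer(n_neighbors=2)
    df[df.columns] = imputer.fit_transform(df)

    print(df)

In a real project you would fit the imputer on the training set only and apply it to new data with transform(), so that no information leaks from the test set into the preparation step.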

Remove Data Attributes

Some methods perform poorly with redundant or duplicate attributes. You can get a boost in model accuracy by removing attributes from your data.

  • Projection: Training data can be projected into lower-dimensional spaces that still characterize the inherent relationships in the data. A popular approach is Principal Component Analysis (PCA), where the principal components found by the method can be taken as a reduced set of input attributes.
  • Spatial Sign: A spatial sign projection of the data will transform data onto the surface of a multidimensional sphere. The results can be used to highlight the existence of outliers that can be modified or removed from the data.
  • Correlated Attributes: Some algorithms degrade in performance when highly correlated attributes are present. Pairs of attributes with high correlation can be identified and the most correlated attributes removed from the data.
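
As a sketch of how two of these operations might look, the snippet below uses pandas and scikit-learn on hypothetical data: it projects the attributes onto enough principal components to explain 95% of the variance, then drops one attribute from each highly correlated pair. The generated data, the 0.95 variance target and the 0.9 correlation threshold are arbitrary choices for illustration.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    # Hypothetical numeric training data with one redundant, highly correlated column.
    rng = np.random.default_rng(7)
    X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
    X["b"] = X["a"] * 0.95 + rng.normal(scale=0.05, size=100)

    # Projection: keep enough principal components to explain 95% of the variance
    # and use them as a reduced set of input attributes.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)

    # Correlated attributes: identify highly correlated pairs and drop one of each.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    X_pruned = X.drop(columns=to_drop)
    print(X_pruned.columns.tolist())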

Transform Data Attributes

Transformations of training data can reduce the skewness of data as well as the prominence of outliers in the data. Many models expect data to be transformed before you can apply the algorithm.

  • Centering: Transform the data so that it has a mean of zero. Combined with scaling to a standard deviation of one, this is typically called data standardization.
  • Scaling: A standard scaling transformation is to map the data from the original scale to a scale between zero and one. This is typically called data normalization.
  • Remove Skew: Skewed data is data that has a distribution that is pushed to one side or the other (larger or smaller values) rather than being normally distributed. Some methods assume normally distributed data and can perform better if the skew is removed. Try replacing the attribute with the log, square root or inverse of the values.
  • Box-Cox: A Box-Cox transform or family of transforms can be used to reliably adjust data to remove skew.
  • Binning: Numeric data can be made discrete by grouping values into bins. This is typically called data discretization. This process can be performed manually, although it is more reliable when performed systematically and automatically using a heuristic that makes sense in the domain.
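
The snippet below sketches these transforms with scikit-learn on a single, hypothetical skewed attribute: standardization, normalization to [0, 1], a Box-Cox power transform, and quantile-based binning. The generated data and the choice of five bins are illustrative only.

    import numpy as np
    from sklearn.preprocessing import (KBinsDiscretizer, MinMaxScaler,
                                       PowerTransformer, StandardScaler)

    # A single hypothetical attribute with a skewed, strictly positive distribution.
    rng = np.random.default_rng(7)
    x = rng.lognormal(mean=3.0, sigma=0.8, size=(200, 1))

    # Centering and scaling to unit standard deviation: data standardization.
    x_std = StandardScaler().fit_transform(x)

    # Scaling to the range [0, 1]: data normalization.
    x_norm = MinMaxScaler().fit_transform(x)

    # Box-Cox power transform to remove skew (requires strictly positive values).
    x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)

    # Binning: discretize the values into five ordinal, quantile-based bins.
    x_bins = KBinsDiscretizer(n_bins=5, encode="ordinal",
                              strategy="quantile").fit_transform(x)

    print(x_std.mean(), x_norm.min(), x_norm.max(), x_bins[:5].ravel())

If the attribute can take zero or negative values, the Yeo-Johnson variant (method="yeo-johnson") is a common alternative to Box-Cox.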

Summary

Data pre-processing is an important step that may be required to prepare raw data for modeling, to meet the expectations of specific machine learning algorithms, and it can give unexpected boosts in model accuracy.

In this post we discovered three groups of data pre-processing methods:

  • Adding Attributes
  • Removing Attributes
  • Transforming Attributes

The next time you are looking for a boost in model accuracy, consider what new perspectives you can engineer on your data for your models to explore and exploit.
