Dealing with Noisy Data in Data Science
While working on a dataset for a data science project, we noticed that our model was not performing up to the mark. Performance is a subjective term and there can be many reasons for an under-performing model, but our hunch was that the cause was noise in the dataset.
We tried many approaches to identify and reduce this noise. Some of them worked and some didn't, depending on the specific nature of the problem and the patterns in the data.
Based on that experience, I am going to discuss the various types of noise in data, and the approaches & methods to identify and reduce noise in a given dataset.
Understanding Noise in Data
Noise (in the data science space) consists of unwanted data items, features or records which don't help in explaining the feature itself, or the relationship between feature & target. Noise often causes algorithms to miss patterns in the data.
Noise in tabular data can be of three types:
- Anomalies in certain data items (Noise 1: certain anomalies in features & target)
- Features that don’t help in explaining the target (Noise 2: irrelevant/weak features)
- Records which don't follow the form or relation that the rest of the records do (Noise 3: noisy records)
Benefits of identifying & treating noise in data:
- enables the algorithm to train faster
- reduces the complexity of the model and makes it easier to interpret
- improves the accuracy of the model if the right subset is chosen
- reduces overfitting
These are the ways of dealing with noise in data, based on the type of noise:
Noise as an item
We can analyse the features & target and identify the noise in terms of outliers.
Outlier detection & treatment: either remove the offending records, or cap the values with an upper and lower ceiling (winsorization).
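A minimal sketch of the capping approach; the 1.5×IQR rule, the function name, and the synthetic column are all illustrative choices, not fixed by the text:

```python
import numpy as np
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds (winsorization)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Synthetic column with a few injected extreme values
rng = np.random.default_rng(0)
col = pd.Series(rng.normal(50, 5, 1000))
col.iloc[:3] = [200, -100, 500]
capped = cap_outliers_iqr(col)
print(col.describe(), capped.describe(), sep="\n\n")
```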
Noise as a feature
This type of noise is introduced when there are features in the data which are not related to the target, or don't help in explaining it.
Feature Selection or Elimination
Not all features are important, so we can use various methods to find the best subset of features:
Filter method
We can perform various statistical tests between each feature & the response (e.g. the chi-squared test, the ANOVA F-test, or mutual information) to identify which features are more relevant than others.
Please note that the above methods don't identify or deal with multicollinearity; we need to figure that out separately.
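A minimal sketch of the filter approach using scikit-learn, assuming a classification problem; the ANOVA F-test and k=10 are illustrative choices, and the built-in breast cancer dataset just stands in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names

# Score each feature against the class label with the ANOVA F-test;
# mutual_info_classif or chi2 could be swapped in instead.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
kept = feature_names[selector.get_support()]
print("Top 10 features by F-score:", list(kept))
```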
Wrapper method
Here we add or remove features from a baseline model and compare the model's performance (a sketch follows the list):
- Forward selection
- Backward elimination
- Recursive feature elimination
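A minimal sketch of the wrapper idea via recursive feature elimination in scikit-learn; the estimator and n_features_to_select=10 are illustrative, and scikit-learn's SequentialFeatureSelector covers forward selection and backward elimination in the same spirit:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Recursively refit the model and drop the weakest feature
# (by coefficient magnitude) until only 10 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
print("Selected feature indices:", np.flatnonzero(rfe.support_))
```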
Embedded Methods (Regularization)
These methods combine the qualities of the filter & wrapper methods; they are implemented using algorithms that have their own built-in feature selection, such as Lasso regression, whose L1 penalty shrinks the coefficients of weak features to exactly zero.
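A minimal sketch of an embedded method, assuming Lasso as the algorithm with built-in selection; alpha=1.0 and the diabetes dataset are illustrative:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives coefficients of weak features exactly to zero,
# so feature selection happens as part of model fitting.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Surviving feature indices:", np.flatnonzero(lasso.coef_))
```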
Noise as a record
With these methods, we try to find the set of records which carry the noise.
K-fold validation
In this method, we look at the cross-validation score of each fold, analyse the folds with poor CV scores, and ask what the records in those folds have in common.
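A minimal sketch of inspecting per-fold CV scores; Ridge, the R² metric, and the diabetes dataset are stand-ins for your own model and data:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2")

for fold, score in enumerate(scores):
    print(f"Fold {fold}: R^2 = {score:.3f}")
# A fold that scores far below the others is worth inspecting:
# cv.split(X) yields its test indices, which point at the records
# that may carry the noise.
```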
Manual method
Here we can evaluate the cross-validated error of each record (predicted vs. actual) and filter/analyse the records with a large error. This helps us understand why those records are mispredicted in the first place.
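A minimal sketch of the per-record idea using out-of-fold predictions; the 5% cutoff is an arbitrary illustrative threshold:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

X, y = load_diabetes(return_X_y=True)

# Out-of-fold prediction for every record: each row is predicted
# by a model that never saw it during training.
pred = cross_val_predict(Ridge(), X, y, cv=5)
errors = np.abs(y - pred)

# Flag the 5% of records with the largest out-of-fold error for review.
worst = np.argsort(errors)[-int(0.05 * len(y)):]
print("Indices of suspect records:", worst)
```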
Unsupervised Methods (Anomaly Detection)
We can also use unsupervised learning algorithms to identify anomalies in data; these are mostly categorized as anomaly detection techniques.
Density-based anomaly detection
This method assumes that normal data points occur in a dense neighborhood, while abnormalities lie far away, e.g. kNN- and LOF-based methods.
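A minimal sketch using scikit-learn's LocalOutlierFactor on synthetic 2-D data; n_neighbors and contamination are illustrative settings:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # dense "normal" cloud
               rng.uniform(-6, 6, (10, 2))])  # scattered anomalies

# LOF compares each point's local density to that of its neighbours;
# points in much sparser regions are labelled -1 (outlier).
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)
print("Flagged as anomalies:", np.flatnonzero(labels == -1))
```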
Clustering-based anomaly detection
Using a clustering technique, we can examine which clusters contain noise. Data instances falling outside (or far from) the clusters can be marked as anomalies, e.g. k-Means clustering.
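A minimal sketch of the clustering idea: fit k-Means and flag points far from their assigned centroid. The 99th-percentile cutoff and the synthetic data are illustrative choices, not part of the original method:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (150, 2)),
               rng.normal(8, 1, (150, 2)),
               [[4, 20], [-10, -10]]])        # two injected anomalies

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Distance from each point to its assigned centroid; points far
# from every centroid sit "outside" the clusters.
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
threshold = np.percentile(dists, 99)
print("Anomaly indices:", np.flatnonzero(dists > threshold))
```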
SVM-based anomaly detection
This technique uses an SVM to learn a soft boundary around the training set and tunes it on a validation set to identify anomalies. It reduces the large-sample requirement of the previous approach while maintaining the quality of clustering-based anomaly detection methods, e.g. One-class SVM.
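A minimal sketch using scikit-learn's OneClassSVM; nu=0.05 and the synthetic data are illustrative, and in practice nu would be tuned on a validation set as the text suggests:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(0, 1, (300, 2))          # assumed-clean training data
X_new = np.vstack([rng.normal(0, 1, (5, 2)),
                   [[6, 6], [-7, 5]]])        # mix of normal and anomalous

# Learn a soft boundary around the training distribution; nu bounds
# the fraction of training points allowed to fall outside it.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print("Labels for new points (-1 = anomaly):", ocsvm.predict(X_new))
```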
Autoencoder-based anomaly detection
Autoencoders are used in deep learning for unsupervised learning, and we can use them for anomaly detection to identify noisy records. These methods are more advanced and can outperform traditional anomaly detection methods, e.g. Variational Autoencoder based Anomaly Detection using Reconstruction Probability.
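The cited VAE-with-reconstruction-probability method needs a deep learning framework; as a minimal stand-in illustrating the same reconstruction-error idea, here is a plain bottleneck autoencoder built from scikit-learn's MLPRegressor (architecture and thresholding are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (500, 10)),
               rng.uniform(-8, 8, (10, 10))])  # injected anomalies
X = StandardScaler().fit_transform(X)

# An MLP trained to reproduce its own input through a narrow
# bottleneck; records that don't fit the learned structure
# reconstruct poorly.
ae = MLPRegressor(hidden_layer_sizes=(6, 2, 6), max_iter=2000,
                  random_state=3).fit(X, X)
recon_error = np.mean((X - ae.predict(X)) ** 2, axis=1)
print("Highest-error records:", np.argsort(recon_error)[-10:])
```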