How to Handle Missing Data with Python
Real-world data often has missing values.
Data can have missing values for a number of reasons, such as observations that were not recorded and data corruption.
Handling missing data is important as many machine learning algorithms do not support data with missing values.
In this tutorial, you will discover how to handle missing data for machine learning with Python.
Specifically, after completing this tutorial you will know:
- How to mark invalid or corrupt values as missing in your dataset.
- How to remove rows with missing data from your dataset.
- How to impute missing values with mean values in your dataset.
Let’s get started.
Note: The examples in this post assume that you have Python 2 or 3 with Pandas, NumPy and Scikit-Learn installed, specifically scikit-learn version 0.18 or higher.
- Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.
Overview
This tutorial is divided into 6 parts:
- Pima Indians Diabetes Dataset: where we look at a dataset that has known missing values.
- Mark Missing Values: where we learn how to mark missing values in a dataset.
- Missing Values Cause Problems: where we see how a machine learning algorithm can fail when the dataset contains missing values.
- Remove Rows With Missing Values: where we see how to remove rows that contain missing values.
- Impute Missing Values: where we replace missing values with sensible values.
- Algorithms that Support Missing Values: where we learn about algorithms that support missing values.
First, let’s take a look at our sample dataset with missing values.
1. Pima Indians Diabetes Dataset
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:
- 0. Number of times pregnant.
- 1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- 2. Diastolic blood pressure (mm Hg).
- 3. Triceps skinfold thickness (mm).
- 4. 2-Hour serum insulin (mu U/ml).
- 5. Body mass index (weight in kg/(height in m)^2).
- 6. Diabetes pedigree function.
- 7. Age (years).
- 8. Class variable (0 or 1).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.
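Once you have downloaded the dataset (covered below), you can verify the roughly 65% baseline yourself. It corresponds to always predicting the majority class in column 8; this quick sketch assumes the file name used throughout this tutorial:

from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# the proportion of the most frequent class value is the baseline accuracy
print(dataset[8].value_counts(normalize=True).max())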
A sample of the first 5 rows is listed below.
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
This dataset is known to have missing values.
Specifically, there are missing observations for some columns that are marked as a zero value.
We can corroborate this with the definitions of those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a zero for body mass index or blood pressure is invalid.
Download the dataset from here and save it to your current working directory with the file name pima-indians-diabetes.csv (update: download from here).
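Once saved, it is worth quickly confirming that the file loads as expected; this assumes the file is in your current working directory under the name above:

from pandas import read_csv
# load the file and confirm its dimensions
dataset = read_csv('pima-indians-diabetes.csv', header=None)
print(dataset.shape)  # expect (768, 9): 8 inputs plus 1 output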
2. Mark Missing Values
In this section, we will look at how we can identify and mark values as missing.
We can use plots and summary statistics to help identify missing or corrupt data.
We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.
from pandas import read_csv
# load the dataset and print summary statistics for each attribute
dataset = read_csv('pima-indians-diabetes.csv', header=None)
print(dataset.describe())
Running this example produces the following output:
                0           1           2           3           4           5  \
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578
std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000
50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000
75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000

                6           7           8
count  768.000000  768.000000  768.000000
mean     0.471876   33.240885    0.348958
std      0.331329   11.760232    0.476951
min      0.078000   21.000000    0.000000
25%      0.243750   24.000000    0.000000
50%      0.372500   29.000000    0.000000
75%      0.626250   41.000000    1.000000
max      2.420000   81.000000    1.000000
This is useful.
We can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.
Specifically, the following columns have an invalid zero minimum value:
- 1: Plasma glucose concentration
- 2: Diastolic blood pressure
- 3: Triceps skinfold thickness
- 4: 2-Hour serum insulin
- 5: Body mass index
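If you prefer to find these columns programmatically rather than reading them off the describe() output, a short check like the following works; note that it also flags columns 0 and 8, where a zero is a legitimate value:

from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# list every column whose minimum value is zero
zero_min = dataset.min() == 0
print(zero_min[zero_min].index.tolist())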
Let’s confirm this by looking at the raw data. The example below prints the first 20 rows of data.
from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# print the first 20 rows of data
print(dataset.head(20))
Running the example, we can clearly see 0 values in the columns 2, 3, 4, and 5.
     0    1   2   3    4     5      6   7  8
0    6  148  72  35    0  33.6  0.627  50  1
1    1   85  66  29    0  26.6  0.351  31  0
2    8  183  64   0    0  23.3  0.672  32  1
3    1   89  66  23   94  28.1  0.167  21  0
4    0  137  40  35  168  43.1  2.288  33  1
5    5  116  74   0    0  25.6  0.201  30  0
6    3   78  50  32   88  31.0  0.248  26  1
7   10  115   0   0    0  35.3  0.134  29  0
8    2  197  70  45  543  30.5  0.158  53  1
9    8  125  96   0    0   0.0  0.232  54  1
10   4  110  92   0    0  37.6  0.191  30  0
11  10  168  74   0    0  38.0  0.537  34  1
12  10  139  80   0    0  27.1  1.441  57  0
13   1  189  60  23  846  30.1  0.398  59  1
14   5  166  72  19  175  25.8  0.587  51  1
15   7  100   0   0    0  30.0  0.484  32  1
16   0  118  84  47  230  45.8  0.551  31  1
17   7  107  74   0    0  29.6  0.254  31  1
18   1  103  30  38   83  43.3  0.183  33  0
19   1  115  70  30   96  34.6  0.529  32  1
We can get a count of the number of missing values in each of these columns. We can do this by marking all of the values in the subset of the DataFrame we are interested in that have zero values as True. We can then count the number of True values in each column.
from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# count the zero values in each of the suspect columns
print((dataset[[1,2,3,4,5]] == 0).sum())
Running the example prints the following output:
1      5
2     35
3    227
4    374
5     11
We can see that columns 1, 2 and 5 have just a few zero values, whereas columns 3 and 4 show many more, with column 4 missing values in nearly half of the rows.
This highlights that different “missing value” strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.
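To make this concrete, below is a minimal sketch of one such per-column policy (an illustration, not a recommendation), using the NaN marking described next: drop the few rows missing glucose, blood pressure or BMI (columns 1, 2 and 5), and leave the heavily affected columns 3 and 4 to be imputed later.

from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zeros in the suspect columns as missing
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
# drop rows only where the sparsely missing columns (1, 2, 5) are NaN
dataset = dataset.dropna(subset=[1, 2, 5])
# columns 3 and 4 keep their NaNs, to be imputed in a later step
print(dataset.shape)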
In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.
Values marked as NaN are ignored by operations like sum and count.
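To see this NaN-skipping behavior in isolation, here is a tiny standalone example using throwaway data rather than the diabetes dataset:

import numpy
from pandas import Series
values = Series([1.0, numpy.nan, 3.0])
print(values.sum())    # 4.0 - the NaN is skipped
print(values.count())  # 2 - only non-NaN values are counted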
We can mark values as NaN easily with the Pandas DataFrame by using the replace() function on a subset of the columns we are interested in.
After we have marked the missing values, we can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.
from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# count the number of NaN values in each column
print(dataset.isnull().sum())
Running the example prints the number of missing values in each column. We can see that columns 1 to 5 have the same number of missing values as the zero values identified above. This is a sign that we have marked the identified missing values correctly.
0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
This is a useful summary. I always like to look at the actual data though, to confirm that I have not fooled myself.
Below is the same example, except we print the first 20 rows of data.
from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# print the first 20 rows of data
print(dataset.head(20))
Running the example, we can clearly see NaN values in the columns 2, 3, 4 and 5. There are only 5 missing values in column 1, so it is not surprising we did not see an example in the first 20 rows.
It is clear from the raw data that marking the missing values had the intended effect.
     0      1     2     3      4     5      6   7  8
0    6  148.0  72.0  35.0    NaN  33.6  0.627  50  1
1    1   85.0  66.0  29.0    NaN  26.6  0.351  31  0
2    8  183.0  64.0   NaN    NaN  23.3  0.672  32  1
3    1   89.0  66.0  23.0   94.0  28.1  0.167  21  0
4    0  137.0  40.0  35.0  168.0  43.1  2.288  33  1
5    5  116.0  74.0   NaN    NaN  25.6  0.201  30  0
6    3   78.0  50.0  32.0   88.0  31.0  0.248  26  1
7   10  115.0   NaN   NaN    NaN  35.3  0.134  29  0
8    2  197.0  70.0  45.0  543.0  30.5  0.158  53  1
9    8  125.0  96.0   NaN    NaN   NaN  0.232  54  1
10   4  110.0  92.0   NaN    NaN  37.6  0.191  30  0
11  10  168.0  74.0   NaN    NaN  38.0  0.537  34  1
12  10  139.0  80.0   NaN    NaN  27.1  1.441  57  0
13   1  189.0  60.0  23.0  846.0  30.1  0.398  59  1
14   5  166.0  72.0  19.0  175.0  25.8  0.587  51  1
15   7  100.0   NaN   NaN    NaN  30.0  0.484  32  1
16   0  118.0  84.0  47.0  230.0  45.8  0.551  31  1
17   7  107.0  74.0   NaN    NaN  29.6  0.254  31  1
18   1  103.0  30.0  38.0   83.0  43.3  0.183  33  0
19   1  115.0  70.0  30.0   96.0  34.6  0.529  32  1
Before we look at handling missing values, let’s first demonstrate that having missing values in a dataset can cause problems.
3. Missing Values Cause Problems
Having missing values in a dataset can cause errors with some machine learning algorithms.
In this section, we will try to evaluate the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values.
This is an algorithm that does not work when there are missing values in the dataset.
The example below marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross-validation and print the mean accuracy.
from pandas import read_csv
import numpy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())
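Running this example fails: scikit-learn raises a ValueError complaining that the input contains NaN values (the exact message varies by version). This confirms that we cannot feed data with missing values to this algorithm directly, which motivates the removal and imputation strategies in the next sections.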