How to Handle Missing Data with Python
Real-world data often has missing values.
Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.
Handling missing data is important as many machine learning algorithms do not support data with missing values.
In this tutorial, you will discover how to handle missing data for machine learning with Python.
Specifically, after completing this tutorial you will know:
- How to mark invalid or corrupt values as missing in your dataset.
- How to remove rows with missing data from your dataset.
- How to impute missing values with mean values in your dataset.
Let’s get started.
Note: The examples in this post assume that you have Python 2 or 3 with Pandas, NumPy and Scikit-Learn installed, specifically scikit-learn version 0.18 or higher.
- Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.
This tutorial is divided into 6 parts:
- Pima Indians Diabetes Dataset: where we look at a dataset that has known missing values.
- Mark Missing Values: where we learn how to mark missing values in a dataset.
- Missing Values Cause Problems: where we see how a machine learning algorithm can fail when the data contains missing values.
- Remove Rows With Missing Values: where we see how to remove rows that contain missing values.
- Impute Missing Values: where we replace missing values with sensible values.
- Algorithms that Support Missing Values: where we learn about algorithms that support missing values.
First, let’s take a look at our sample dataset with missing values.
1. Pima Indians Diabetes Dataset
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:
- 0. Number of times pregnant.
- 1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- 2. Diastolic blood pressure (mm Hg).
- 3. Triceps skinfold thickness (mm).
- 4. 2-Hour serum insulin (mu U/ml).
- 5. Body mass index (weight in kg/(height in m)^2).
- 6. Diabetes pedigree function.
- 7. Age (years).
- 8. Class variable (0 or 1).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.
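As a quick check of the baseline figure, the majority-class accuracy follows directly from the class counts. The sketch below assumes 268 positive and 500 negative observations, which is consistent with the mean of 0.348958 for column 8 in the summary statistics shown later in this tutorial; the variable names are illustrative only.

```python
# class counts assumed from the dataset summary: 268 of 768 rows are class 1
n_total = 768
n_positive = 268
n_negative = n_total - n_positive

# always predicting the most prevalent class is correct for every row of that class
baseline_accuracy = max(n_positive, n_negative) / float(n_total)
print('Baseline accuracy: %.3f' % baseline_accuracy)
```

This gives roughly 0.65, matching the approximate baseline quoted above.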
A sample of the first 5 rows is listed below.
    6,148,72,35,0,33.6,0.627,50,1
    1,85,66,29,0,26.6,0.351,31,0
    8,183,64,0,0,23.3,0.672,32,1
    1,89,66,23,94,28.1,0.167,21,0
    0,137,40,35,168,43.1,2.288,33,1
This dataset is known to have missing values.
Specifically, there are missing observations for some columns that are marked as a zero value.
We can corroborate this by the definition of those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a zero for body mass index or blood pressure is invalid.
Download the dataset from here and save it to your current working directory with the file name pima-indians-diabetes.csv (update: download from here).
2. Mark Missing Values
In this section, we will look at how we can identify and mark values as missing.
We can use plots and summary statistics to help identify missing or corrupt data.
We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.
    from pandas import read_csv
    dataset = read_csv('pima-indians-diabetes.csv', header=None)
    print(dataset.describe())
Running this example produces the following output:
                    0           1           2           3           4           5  \
    count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
    mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578
    std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160
    min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000
    50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000
    75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000
    max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000

                    6           7           8
    count  768.000000  768.000000  768.000000
    mean     0.471876   33.240885    0.348958
    std      0.331329   11.760232    0.476951
    min      0.078000   21.000000    0.000000
    25%      0.243750   24.000000    0.000000
    50%      0.372500   29.000000    0.000000
    75%      0.626250   41.000000    1.000000
    max      2.420000   81.000000    1.000000
This is useful.
We can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.
Specifically, the following columns have an invalid zero minimum value:
- 1: Plasma glucose concentration
- 2: Diastolic blood pressure
- 3: Triceps skinfold thickness
- 4: 2-Hour serum insulin
- 5: Body mass index
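The same check can be sketched programmatically. The example below copies the minimum values from the describe() output above into a Series (rather than loading the file) and lists the columns whose minimum is zero. Note that this also flags columns 0 and 8, where zero is a valid value (no pregnancies, negative class), so domain knowledge is still needed to arrive at the final list. The names minimums, valid_zero and suspect_columns are illustrative.

```python
from pandas import Series

# minimum value of each column, copied from the describe() output above
minimums = Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.078, 21.0, 0.0])

# columns whose minimum is zero are candidates for hidden missing values
zero_min_columns = minimums[minimums == 0].index.tolist()
print(zero_min_columns)

# zero is a valid value for column 0 (pregnancies) and column 8 (class)
valid_zero = [0, 8]
suspect_columns = [c for c in zero_min_columns if c not in valid_zero]
print(suspect_columns)
```

With the dataset loaded as a DataFrame, dataset.min() would provide the same Series of minimums directly.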
Let’s confirm this by looking at the raw data. The example below prints the first 20 rows of data.
    from pandas import read_csv
    import numpy
    dataset = read_csv('pima-indians-diabetes.csv', header=None)
    # print the first 20 rows of data
    print(dataset.head(20))
Running the example, we can clearly see 0 values in the columns 2, 3, 4, and 5.
         0    1   2   3    4     5      6   7  8
    0    6  148  72  35    0  33.6  0.627  50  1
    1    1   85  66  29    0  26.6  0.351  31  0
    2    8  183  64   0    0  23.3  0.672  32  1
    3    1   89  66  23   94  28.1  0.167  21  0
    4    0  137  40  35  168  43.1  2.288  33  1
    5    5  116  74   0    0  25.6  0.201  30  0
    6    3   78  50  32   88  31.0  0.248  26  1
    7   10  115   0   0    0  35.3  0.134  29  0
    8    2  197  70  45  543  30.5  0.158  53  1
    9    8  125  96   0    0   0.0  0.232  54  1
    10   4  110  92   0    0  37.6  0.191  30  0
    11  10  168  74   0    0  38.0  0.537  34  1
    12  10  139  80   0    0  27.1  1.441  57  0
    13   1  189  60  23  846  30.1  0.398  59  1
    14   5  166  72  19  175  25.8  0.587  51  1
    15   7  100   0   0    0  30.0  0.484  32  1
    16   0  118  84  47  230  45.8  0.551  31  1
    17   7  107  74   0    0  29.6  0.254  31  1
    18   1  103  30  38   83  43.3  0.183  33  0
    19   1  115  70  30   96  34.6  0.529  32  1
We can get a count of the number of missing values in each of these columns. We can do this by marking all of the zero values in the subset of the DataFrame we are interested in as True. We can then count the number of True values in each column.
    from pandas import read_csv
    dataset = read_csv('pima-indians-diabetes.csv', header=None)
    print((dataset[[1,2,3,4,5]] == 0).sum())
Running the example prints the following output:
    1      5
    2     35
    3    227
    4    374
    5     11
We can see that columns 1, 2 and 5 have just a few zero values, whereas columns 3 and 4 show a lot more: about 30% of the rows for column 3 and nearly half of the rows for column 4.
This highlights that different “missing value” strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.
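To put these counts in proportion, the sketch below divides each count by the 768 rows in the dataset. The counts are copied from the output above rather than recomputed from the file, and the variable names are illustrative.

```python
from pandas import Series

# zero counts per column, copied from the output above
zero_counts = Series([5, 35, 227, 374, 11], index=[1, 2, 3, 4, 5])
n_rows = 768

# percentage of rows with a zero value in each column
zero_percent = (zero_counts / n_rows * 100).round(1)
print(zero_percent)
```

Columns with only a small percentage of zeros may tolerate row removal, whereas removing rows for a column like 4 would discard close to half of the data.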
In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.
Pandas ignores NaN values in operations such as sum and count.
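A small illustration of this behaviour, using a made-up three-value Series rather than the dataset: Pandas skips the NaN in sum() and count(), whereas a plain NumPy sum() propagates it.

```python
import numpy
from pandas import Series

# toy data (not from the dataset): one value is marked as missing
values = Series([1.0, numpy.nan, 3.0])

print(values.sum())    # NaN is skipped by Pandas
print(values.count())  # only non-missing values are counted
print(numpy.sum(values.values))  # plain NumPy propagates NaN
```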
We can mark values as NaN easily with the Pandas DataFrame by using the replace() function on a subset of the columns we are interested in.
After we have marked the missing values, we can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.
    from pandas import read_csv
    import numpy
    dataset = read_csv('pima-indians-diabetes.csv', header=None)
    # mark zero values as missing or NaN (numpy.nan; the numpy.NaN alias was removed in NumPy 2.0)
    dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
    # count the number of NaN values in each column
    print(dataset.isnull().sum())
Running the example prints the number of missing values in each column. We can see that columns 1 to 5 have the same number of missing values as zero values identified above. This is a sign that we have marked the identified missing values correctly.
    0      0
    1      5
    2     35
    3    227
    4    374
    5     11
    6      0
    7      0
    8      0
This is a useful summary. I always like to look at the actual data though, to confirm that I have not fooled myself.
Below is the same example, except we print the first 20 rows of data.
    from pandas import read_csv
    import numpy
    dataset = read_csv('pima-indians-diabetes.csv', header=None)
    # mark zero values as missing or NaN (numpy.nan; the numpy.NaN alias was removed in NumPy 2.0)
    dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
    # print the first 20 rows of data
    print(dataset.head(20))
Running the example, we can clearly see NaN values in the columns 2, 3, 4 and 5. There are only 5 missing values in column 1, so it is not surprising we did not see an example in the first 20 rows.
It is clear from the raw data that marking the missing values had the intended effect.
         0      1     2     3      4     5      6   7  8
    0    6  148.0  72.0  35.0    NaN  33.6  0.627  50  1
    1    1   85.0  66.0  29.0    NaN  26.6  0.351  31  0
    2    8  183.0  64.0   NaN    NaN  23.3  0.672  32  1
    3    1   89.0  66.0  23.0   94.0  28.1  0.167  21  0
    4    0  137.0  40.0  35.0  168.0  43.1  2.288  33  1
    5    5  116.0  74.0   NaN    NaN  25.6  0.201  30  0
    6    3   78.0  50.0  32.0   88.0  31.0  0.248  26  1
    7   10  115.0   NaN   NaN    NaN  35.3  0.134  29  0
    8    2  197.0  70.0  45.0  543.0  30.5  0.158  53  1
    9    8  125.0  96.0   NaN    NaN   NaN  0.232  54  1
    10   4  110.0  92.0   NaN    NaN  37.6  0.191  30  0
    11  10  168.0  74.0   NaN    NaN  38.0  0.537  34  1
    12  10  139.0  80.0   NaN    NaN  27.1  1.441  57  0
    13   1  189.0  60.0  23.0  846.0  30.1  0.398  59  1
    14   5  166.0  72.0  19.0  175.0  25.8  0.587  51  1
    15   7  100.0   NaN   NaN    NaN  30.0  0.484  32  1
    16   0  118.0  84.0  47.0  230.0  45.8  0.551  31  1
    17   7  107.0  74.0   NaN    NaN  29.6  0.254  31  1
    18   1  103.0  30.0  38.0   83.0  43.3  0.183  33  0
    19   1  115.0  70.0  30.0   96.0  34.6  0.529  32  1
Before we look at handling missing values, let’s first demonstrate that having missing values in a dataset can cause problems.
3. Missing Values Cause Problems
Having missing values in a dataset can cause errors with some machine learning algorithms.
In this section, we will try to evaluate the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values.
This is an algorithm that does not work when there are missing values in the dataset.
The example below marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross-validation and print the mean accuracy.
    from pandas import read_csv
    import numpy
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    dataset = read_csv('pima-indians-diabetes.csv', header=None)
    # mark zero values as missing or NaN (numpy.nan; the numpy.NaN alias was removed in NumPy 2.0)
    dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
    # split dataset into inputs and outputs
    values = dataset.values
    X = values[:,0:8]
    y = values[:,8]
    # evaluate an LDA model on the dataset using k-fold cross validation
    model = LinearDiscriminantAnalysis()
    # shuffle=True is required by recent scikit-learn versions when random_state is set
    kfold = KFold(n_splits=3, shuffle=True, random_state=7)
    result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    print(result.mean())