How to Handle Missing Data with Python

Real-world data often has missing values.

Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.

Handling missing data is important as many machine learning algorithms do not support data with missing values.

In this tutorial, you will discover how to handle missing data for machine learning with Python.

Specifically, after completing this tutorial you will know:

  • How to mark invalid or corrupt values as missing in your dataset.
  • How to remove rows with missing data from your dataset.
  • How to impute missing values with mean values in your dataset.

Let’s get started.

Note: The examples in this post assume that you have Python 2 or 3 with Pandas, NumPy and Scikit-Learn installed, specifically scikit-learn version 0.18 or higher.
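
If you are not sure which versions you have installed, a short script like the one below (a minimal check, not part of the tutorial itself) will print them:

import sys
import numpy
import pandas
import sklearn
print('Python: %s' % sys.version)
print('NumPy: %s' % numpy.__version__)
print('Pandas: %s' % pandas.__version__)
print('scikit-learn: %s' % sklearn.__version__)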

  • Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.
How to Handle Missing Values with Python. Photo by CoCreatr, some rights reserved.

Overview

This tutorial is divided into 6 parts:

  1. Pima Indians Diabetes Dataset: where we look at a dataset that has known missing values.
  2. Mark Missing Values: where we learn how to mark missing values in a dataset.
  3. Missing Values Causes Problems: where we see how a machine learning algorithm can fail when the dataset contains missing values.
  4. Remove Rows With Missing Values: where we see how to remove rows that contain missing values.
  5. Impute Missing Values: where we replace missing values with sensible values.
  6. Algorithms that Support Missing Values: where we learn about algorithms that support missing values.

First, let’s take a look at our sample dataset with missing values.

1. Pima Indians Diabetes Dataset

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

  • 0. Number of times pregnant.
  • 1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  • 2. Diastolic blood pressure (mm Hg).
  • 3. Triceps skinfold thickness (mm).
  • 4. 2-Hour serum insulin (mu U/ml).
  • 5. Body mass index (weight in kg/(height in m)^2).
  • 6. Diabetes pedigree function.
  • 7. Age (years).
  • 8. Class variable (0 or 1).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

A sample of the first 5 rows is listed below.

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1

This dataset is known to have missing values.

Specifically, there are missing observations for some columns that are marked as a zero value.

We can corroborate this by the definition of those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a zero for body mass index or blood pressure is invalid.

Download the dataset from here and save it to your current working directory with the file name pima-indians-diabetes.csv (update: download from here).
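
As an optional sanity check (a minimal sketch using scikit-learn's DummyClassifier, which is not used elsewhere in this tutorial), you can reproduce the roughly 65% baseline mentioned above by scoring a model that always predicts the most frequent class:

from pandas import read_csv
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
values = dataset.values
X = values[:, 0:8]
y = values[:, 8]
# a classifier that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
print(cross_val_score(baseline, X, y, cv=3, scoring='accuracy').mean())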

2. Mark Missing Values

In this section, we will look at how we can identify and mark values as missing.

We can use plots and summary statistics to help identify missing or corrupt data.
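
For example, a quick set of histograms (a minimal sketch that assumes matplotlib is installed; plots are not used in the rest of this tutorial) makes the spike of zero values in several columns easy to spot:

from pandas import read_csv
from matplotlib import pyplot
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# histograms of the five measurement columns; note the bar at zero in each
dataset[[1, 2, 3, 4, 5]].hist(figsize=(8, 6))
pyplot.show()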

We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
print(dataset.describe())

Running this example produces the following output:

                 0           1           2           3           4           5  \
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578
std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000
50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000
75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000

                6           7           8
count  768.000000  768.000000  768.000000
mean     0.471876   33.240885    0.348958
std      0.331329   11.760232    0.476951
min      0.078000   21.000000    0.000000
25%      0.243750   24.000000    0.000000
50%      0.372500   29.000000    0.000000
75%      0.626250   41.000000    1.000000
max      2.420000   81.000000    1.000000

This is useful.

We can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.

Specifically, the following columns have an invalid zero minimum value:

  • 1: Plasma glucose concentration
  • 2: Diastolic blood pressure
  • 3: Triceps skinfold thickness
  • 4: 2-Hour serum insulin
  • 5: Body mass index

Let’s confirm this by looking at the raw data; the following example prints the first 20 rows of data.

from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# print the first 20 rows of data
print(dataset.head(20))

Running the example, we can clearly see 0 values in the columns 2, 3, 4, and 5.

      0    1   2   3    4     5      6   7  8
0     6  148  72  35    0  33.6  0.627  50  1
1     1   85  66  29    0  26.6  0.351  31  0
2     8  183  64   0    0  23.3  0.672  32  1
3     1   89  66  23   94  28.1  0.167  21  0
4     0  137  40  35  168  43.1  2.288  33  1
5     5  116  74   0    0  25.6  0.201  30  0
6     3   78  50  32   88  31.0  0.248  26  1
7    10  115   0   0    0  35.3  0.134  29  0
8     2  197  70  45  543  30.5  0.158  53  1
9     8  125  96   0    0   0.0  0.232  54  1
10    4  110  92   0    0  37.6  0.191  30  0
11   10  168  74   0    0  38.0  0.537  34  1
12   10  139  80   0    0  27.1  1.441  57  0
13    1  189  60  23  846  30.1  0.398  59  1
14    5  166  72  19  175  25.8  0.587  51  1
15    7  100   0   0    0  30.0  0.484  32  1
16    0  118  84  47  230  45.8  0.551  31  1
17    7  107  74   0    0  29.6  0.254  31  1
18    1  103  30  38   83  43.3  0.183  33  0
19    1  115  70  30   96  34.6  0.529  32  1

We can get a count of the number of missing values in each of these columns. We can do this by marking all of the values in the subset of the DataFrame we are interested in that have zero values as True. We can then count the number of True values in each column.

from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
print((dataset[[1,2,3,4,5]] == 0).sum())

Running the example prints the following output:

1      5
2     35
3    227
4    374
5     11

We can see that columns 1, 2 and 5 have just a few zero values, whereas columns 3 and 4 show a lot more, nearly half of the rows.

This highlights that different “missing value” strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.
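
For instance, a rough count of how many rows would survive a strict "drop anything missing" policy on different column subsets (a small sketch, treating zero as missing for this check) shows the difference:

from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# rows with no zero value in columns 1, 2 and 5 (most rows survive)
print(((dataset[[1, 2, 5]] != 0).all(axis=1)).sum())
# rows with no zero value in columns 3 and 4 (far fewer rows survive)
print(((dataset[[3, 4]] != 0).all(axis=1)).sum())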

In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.

Values marked as NaN are ignored by operations such as sum and count.
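
As a small illustration (a toy Series, not the diabetes data), pandas skips NaN entries when computing these statistics:

import numpy
from pandas import Series
s = Series([1.0, numpy.NaN, 3.0])
print(s.sum())    # 4.0 - the NaN is skipped
print(s.count())  # 2 - only non-missing values are counted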

We can mark values as NaN easily with the Pandas DataFrame by using the replace() function on a subset of the columns we are interested in.

After we have marked the missing values, we can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.

from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# count the number of NaN values in each column
print(dataset.isnull().sum())

Running the example prints the number of missing values in each column. We can see that columns 1 to 5 have the same number of missing values as the zero values identified above. This is a sign that we have marked the identified missing values correctly.

0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0

This is a useful summary. I always like to look at the actual data though, to confirm that I have not fooled myself.

Below is the same example, except we print the first 20 rows of data.

from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# print the first 20 rows of data
print(dataset.head(20))

Running the example, we can clearly see NaN values in the columns 2, 3, 4 and 5. There are only 5 missing values in column 1, so it is not surprising we did not see an example in the first 20 rows.

It is clear from the raw data that marking the missing values had the intended effect.

      0      1     2     3      4     5      6   7  8
0     6  148.0  72.0  35.0    NaN  33.6  0.627  50  1
1     1   85.0  66.0  29.0    NaN  26.6  0.351  31  0
2     8  183.0  64.0   NaN    NaN  23.3  0.672  32  1
3     1   89.0  66.0  23.0   94.0  28.1  0.167  21  0
4     0  137.0  40.0  35.0  168.0  43.1  2.288  33  1
5     5  116.0  74.0   NaN    NaN  25.6  0.201  30  0
6     3   78.0  50.0  32.0   88.0  31.0  0.248  26  1
7    10  115.0   NaN   NaN    NaN  35.3  0.134  29  0
8     2  197.0  70.0  45.0  543.0  30.5  0.158  53  1
9     8  125.0  96.0   NaN    NaN   NaN  0.232  54  1
10    4  110.0  92.0   NaN    NaN  37.6  0.191  30  0
11   10  168.0  74.0   NaN    NaN  38.0  0.537  34  1
12   10  139.0  80.0   NaN    NaN  27.1  1.441  57  0
13    1  189.0  60.0  23.0  846.0  30.1  0.398  59  1
14    5  166.0  72.0  19.0  175.0  25.8  0.587  51  1
15    7  100.0   NaN   NaN    NaN  30.0  0.484  32  1
16    0  118.0  84.0  47.0  230.0  45.8  0.551  31  1
17    7  107.0  74.0   NaN    NaN  29.6  0.254  31  1
18    1  103.0  30.0  38.0   83.0  43.3  0.183  33  0
19    1  115.0  70.0  30.0   96.0  34.6  0.529  32  1
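
If you do want to see the few missing values in column 1, a simple boolean filter on the marked dataset (a minimal sketch) prints just those rows:

from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# select only the rows where column 1 (plasma glucose) is missing
print(dataset[dataset[1].isnull()])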

Before we look at handling missing values, let’s first demonstrate that having missing values in a dataset can cause problems.

3. Missing Values Causes Problems

Having missing values in a dataset can cause errors with some machine learning algorithms.

In this section, we will try to evaluate the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values.

This is an algorithm that does not work when there are missing values in the dataset.

The below example marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross validation and print the mean accuracy.

from pandas import read_csv
import numpy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())