Naive Bayes Classifier From Scratch in Python

The Naive Bayes algorithm is simple and effective and should be one of the first methods you try on a classification problem.

In this tutorial you are going to learn about the Naive Bayes algorithm including how it works and how to implement it from scratch in Python.

Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.

Naive Bayes Classifier
Photo by Matt Buck, some rights reserved

About Naive Bayes

The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a prediction problem probabilistically.

Naive Bayes simplifies the calculation of probabilities by assuming that, for a given class value, the probability of each attribute value is independent of all other attributes. This is a strong assumption but results in a fast and effective method.

The probability of an attribute value given a class value is called a conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we get the probability of the data instance given that class, which we use as a measure of how likely the instance is to belong to it.

To make a prediction we can calculate probabilities of the instance belonging to each class and select the class value with the highest probability.
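As a rough illustration of the multiply-then-compare idea described above, here is a minimal sketch. The class labels and the probability values are made up for illustration and do not come from the diabetes dataset.

```python
# Hypothetical conditional probabilities P(attribute value | class) for one
# data instance with two attributes. The numbers are invented for this sketch.
conditionals = {
    "yes": [0.5, 0.4],
    "no": [0.2, 0.5],
}

# Multiply the per-attribute probabilities for each class (the "naive" step).
scores = {}
for label, probabilities in conditionals.items():
    score = 1.0
    for p in probabilities:
        score *= p
    scores[label] = score

# Predict the class value with the highest score.
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Here the "yes" class scores higher, so it would be the predicted class value.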

Naive Bayes is often described using categorical data because it is easy to describe and calculate using ratios. A more useful version of the algorithm for our purposes supports numeric attributes and assumes the values of each numerical attribute are normally distributed (fall somewhere on a bell curve). Again, this is a strong assumption, but still gives robust results.
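To make the bell curve assumption concrete, the sketch below evaluates the normal (Gaussian) density for a numeric attribute value. The function name gaussian_pdf and the example numbers are assumptions chosen for illustration, not part of the tutorial's code.

```python
import math

def gaussian_pdf(x, mean, stdev):
    """Density of a normal distribution with the given mean and stdev at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

# Hypothetical attribute value, class mean and class standard deviation.
print(gaussian_pdf(71.5, 73.0, 6.2))
```

The density is largest when the value equals the mean and shrinks as the value moves away from it, which is what lets a mean and a standard deviation per attribute and class stand in for the ratios used with categorical data.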

Predict the Onset of Diabetes

The test problem we will use in this tutorial is the Pima Indians Diabetes problem.

This problem is comprised of 768 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patient, such as their age, the number of times pregnant and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute.

Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0).

This is a standard dataset that has been studied a lot in machine learning literature. A good prediction accuracy is 70%-76%.

Below is a sample from the pima-indians-diabetes.data.csv file to give a sense of the data we will be working with.

Sample from the pima-indians-diabetes.data.csv file

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1

Naive Bayes Algorithm Tutorial

This tutorial is broken down into the following steps:

Handle Data: Load the data from CSV file and split it into training and test datasets.
Summarize Data: Summarize the properties in the training dataset so that we can calculate probabilities and make predictions.
Make a Prediction: Use the summaries of the dataset to generate a single prediction.
Make Predictions: Generate predictions given a test dataset and a summarized training dataset.
Evaluate Accuracy: Evaluate the accuracy of predictions made for a test dataset as the percentage correct out of all predictions made.
Tie it Together: Use all of the code elements to present a complete and standalone implementation of the Naive Bayes algorithm.

1. Handle Data

The first thing we need to do is load our data file. The data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module.

We also need to convert the attributes that were loaded as strings into numbers so that we can work with them. Below is the loadCsv() function for loading the Pima Indians dataset.

Load a CSV file of scalars into memory Python

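import csv
def loadCsv(filename):
	# Open in text mode; newline="" lets the csv module handle line endings.
	with open(filename, "r", newline="") as csvfile:
		dataset = list(csv.reader(csvfile))
	# Convert every attribute from a string to a float.
	for i in range(len(dataset)):
		dataset[i] = [float(x) for x in dataset[i]]
	return dataset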

We can test this function by loading the Pima Indians dataset and printing the number of data instances that were loaded.

Test the loadCsv() function Python

filename = 'pima-indians-diabetes.data.csv'
dataset = loadCsv(filename)
print('Loaded data file {0} with {1} rows'.format(filename, len(dataset)))

Running this test, you should see something like:

Example output of testing the loadCsv() function

Loaded data file pima-indians-diabetes.data.csv with 768 rows
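If you do not have the data file to hand, the loading step can still be tried on a couple of the sample rows shown earlier. The sketch below is one way to do that under Python 3; the helper name load_csv and the use of a temporary file are assumptions made for this illustration.

```python
import csv
import os
import tempfile

def load_csv(filename):
    """Read a headerless numeric CSV file into a list of rows of floats."""
    with open(filename, "r", newline="") as handle:
        # Skip empty lines so a trailing newline does not produce an empty row.
        return [[float(x) for x in row] for row in csv.reader(handle) if row]

# Two rows copied from the sample of the dataset.
sample = (
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# Write the sample to a temporary file, load it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

rows = load_csv(path)
os.remove(path)
print('Loaded {0} rows, first row: {1}'.format(len(rows), rows[0]))
```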

Next we need to split the data into a training dataset that Naive Bayes can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model. We need to split the dataset randomly into train and test datasets with a ratio of 67% train and 33% test (this is a common ratio for testing an algorithm on a dataset).

Below is the splitDataset() function that will split a given dataset into a given split ratio.

Split a loaded dataset into a train and test datasets Python

import random
def splitDataset(dataset, splitRatio):
	trainSize = int(len(dataset) * splitRatio)
	trainSet = []
	copy = list(dataset)
	while len(trainSet) < trainSize:
		index = random.randrange(len(copy))
		trainSet.append(copy.pop(index))
	return [trainSet, copy]

We can test this out by defining a mock dataset with 5 instances, splitting it into training and testing datasets and printing them out to see which data instances ended up where.

Test the splitDataset() function Python

dataset = [[1], [2], [3], [4], [5]]
splitRatio = 0.67
train, test = splitDataset(dataset, splitRatio)
print('Split {0} rows into train with {1} and test with {2}'.format(len(dataset), train, test))

Running this test, you should see something like:

Example output from testing the splitDataset() function

Split 5 rows into train with [[4], [3], [5]] and test with [[1], [2]]
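Because the split is random, the rows you see in each set will usually differ from the output above and from run to run. If you want a repeatable split, one option is to seed the random number generator. The sketch below is a variation on the idea rather than the tutorial's splitDataset() function: the name split_dataset, the seed parameter and the shuffle-then-slice approach are all assumptions for illustration.

```python
import random

def split_dataset(dataset, split_ratio, seed=None):
    """Shuffle a copy of the dataset and slice it into train and test lists."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * split_ratio)
    return shuffled[:train_size], shuffled[train_size:]

data = [[1], [2], [3], [4], [5]]
train, test = split_dataset(data, 0.67, seed=1)
print('Split {0} rows into train with {1} and test with {2}'.format(len(data), train, test))
```

Calling the function again with the same seed returns the same train and test rows, which makes results easier to compare between runs.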

2. Summarize Data

The Naive Bayes model is comprised of a summary of the data in the training dataset. This summary is then used when making predictions.

The summary of the training data collected involves the mean and the standard deviation for each attribute, by class value. For example, if there are two class values and 7 numerical attributes, then we need a mean and standard deviation for each attribute (7) and class value (2) combination, that is 14 attribute summaries.

These are required when making predictions to calculate the probability of specific attribute values belonging to each class value.

We can break the preparation of this summary data down into the following sub-tasks:

Separate Data By Class
Calculate Mean
Calculate Standard Deviation
Summarize Dataset
Summarize Attributes By Class

Separate Data By Class

The first task is to separate the training dataset instances by class value so that we can calculate statistics for each class. We can do that by creating a map of each class value to a list of instances that belong to that class and sort the entire dataset of instances into the appropriate lists.

The separateByClass() function below does just this.

The separateByClass() function

def separateByClass(dataset):
	separated = {}
	for i in range(len(dataset)):
		vector = dataset[i]
		if (vector[-1] not in separated):
			separated[vector[-1]] = []
		separated[vector[-1]].append(vector)
	return separated

You can see that the function assumes that the last attribute (-1) is the class value. The function returns a map of class values to lists of data instances.

We can test this function with some sample data, as follows:

Testing the separateByClass() function

dataset = [[1,20,1], [2,21,0], [3,22,1]]
separated = separateByClass(dataset)
print('Separated instances: {0}'.format(separated))

Running this test, you should see something like:

Output when testing the separateByClass() function

Separated instances: {0: [[2, 21, 0]], 1: [[1, 20, 1], [3, 22, 1]]}

Calculate Mean

We need to calculate the mean of each attribute for a class value. The mean is the middle or central tendency of the data, and we will use it as the middle of our Gaussian distribution when calculating probabilities.

We also need to calculate the standard deviation of each attribute for a class value. The standard deviation describes the variation or spread of the data, and we will use it to characterize the expected spread of each attribute in our Gaussian distribution when calculating probabilities.

The standard deviation is calculated as the square root of the variance. The variance is calculated as the average of the squared differences for each attribute value from the mean. Note we are using the N-1 method, which subtracts 1 from the number of attribute values when calculating the variance.

Functions to calculate the mean and standard deviations of attributes

import math
def mean(numbers):
	return sum(numbers)/float(len(numbers))

def stdev(numbers):
	avg = mean(numbers)
	variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
	return math.sqrt(variance)

We can test this by taking the mean and standard deviation of the numbers from 1 to 5.

Code to test the mean() and stdev() functions

numbers = [1,2,3,4,5]
print('Summary of {0}: mean={1}, stdev={2}'.format(numbers, mean(numbers), stdev(numbers)))

Running this test, you should see something like:

Output of testing the mean() and stdev() functions

Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.58113883008
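One way to sanity check the N-1 calculation is to compare it against Python's standard statistics module, where statistics.stdev() uses the N-1 (sample) method and statistics.pstdev() divides by N instead. The sketch below repeats the two functions so that it runs on its own; the comparison itself is an addition for illustration.

```python
import math
import statistics

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    # N-1 method: divide the summed squared differences by len(numbers) - 1.
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

numbers = [1, 2, 3, 4, 5]

# The hand-written functions should agree with the sample statistics.
print(math.isclose(mean(numbers), statistics.mean(numbers)))
print(math.isclose(stdev(numbers), statistics.stdev(numbers)))

# Dividing by N instead gives a smaller value for the same numbers.
print(statistics.pstdev(numbers))
```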

Summarize Dataset

Now we have the tools to summarize a dataset. For a given list of instances (for a class value) we can calculate the mean and the standard deviation for each attribute.

The zip function groups the values for each attribute across our data instances into their own lists so that we can compute the mean and standard deviation values for the attribute.
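To see what that grouping looks like, the short sketch below applies zip to a small made-up dataset in which the last value of each row is the class value.

```python
# Three rows, each with two attributes followed by a class value.
dataset = [[1, 20, 0], [2, 21, 1], [3, 22, 0]]

# Unpacking the rows with * hands each row to zip as a separate argument,
# so zip yields one tuple per column (attribute) instead of per row.
columns = list(zip(*dataset))
print(columns)  # [(1, 2, 3), (20, 21, 22), (0, 1, 0)]
```

Note that the final tuple holds the class values rather than an attribute, so a summary computed for it is not meaningful and needs to be discarded.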

The summarize() function

def summarize(dataset):
	summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
	del summaries[-1]
	return summaries

We can test this summarize() function with some test data that shows markedly different mean and standard deviation values for the first and second data attributes.

Code to test the summarize() function

dataset = [[1,20,0], [2,21,1], [3,22,0]]
summary = summarize(dataset)
print('Attribute summaries: {0}'.format(summary))
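For reference, here is a self-contained sketch that repeats the mean(), stdev() and summarize() functions from this section and runs them on the same test data, so the result can be checked by hand.

```python
import math

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    # One (mean, stdev) pair per column of the dataset.
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    # The last column is the class value, so its summary is dropped.
    del summaries[-1]
    return summaries

dataset = [[1, 20, 0], [2, 21, 1], [3, 22, 0]]
summary = summarize(dataset)
print('Attribute summaries: {0}'.format(summary))
# Attribute summaries: [(2.0, 1.0), (21.0, 1.0)]
```

The first attribute (1, 2, 3) has a mean of 2 and the second (20, 21, 22) a mean of 21, while both have a standard deviation of 1 because their values are spaced identically around the mean.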
