Written on January 13, 2017
Author: Lewis Gavin

Naive Bayes Example using Golf Dataset

This notebook works through a simple Naive Bayes example.

The aim of this machine learning application is to predict whether or not to play golf based on the weather conditions.

1. Import the required Libraries

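import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# provides the same train_test_split, so it is aliased to the old name here
from sklearn import model_selection as cross_validation
from sklearn.naive_bayes import GaussianNB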

2. Read in the Data file

Here we read the Golf.csv data file into a pandas DataFrame.

golf_file = "Golf.csv"

# Read the CSV into a DataFrame (read_csv opens and closes the file itself)
golf_data = pd.read_csv(golf_file, sep=",")

golf_data.head(10)
   Row No   Outlook  Temperature  Humidity   Wind  Play
0       1     Sunny           85        85  False    No
1       2     Sunny           80        90   True    No
2       3  Overcast           83        78  False   Yes
3       4      Rain           70        96  False   Yes
4       5      Rain           68        80  False   Yes
5       6      Rain           65        70   True    No
6       7  Overcast           64        65   True   Yes
7       8     Sunny           72        95  False    No
8       9     Sunny           69        70  False   Yes
9      10      Rain           75        80  False   Yes

3. Data Cleansing and Feature Selection

As with any Data Science application, data cleansing and feature selection play a vital role.

  1. We need to select the columns from the dataset that we expect to give the best prediction score. This is called feature selection. Here we use every column except the first, which is just a row number.
  2. We need to ensure the data is in the correct format for the Naive Bayes algorithm.
    1. We need to map the string column Outlook to numbers, because the Naive Bayes implementation cannot work with strings.

# Remove the Row No column as it is not a useful feature
golf_data = golf_data.drop("Row No", axis=1)

# Map string values for Outlook column to numbers
d = {'Sunny': 1, 'Overcast': 2, 'Rain': 3}
golf_data.Outlook = [d[item] for item in golf_data.Outlook.astype(str)]

golf_data.head(10)
   Outlook  Temperature  Humidity   Wind  Play
0        1           85        85  False    No
1        1           80        90   True    No
2        2           83        78  False   Yes
3        3           70        96  False   Yes
4        3           68        80  False   Yes
5        3           65        70   True    No
6        2           64        65   True   Yes
7        1           72        95  False    No
8        1           69        70  False   Yes
9        3           75        80  False   Yes
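
The list comprehension above raises a KeyError as soon as it meets an Outlook value that is missing from the dictionary. As a rough sketch of an alternative, pandas' `Series.map` performs the same lookup and leaves unknown values as NaN, so they can be checked explicitly. The series below is stand-in data holding only the Outlook values from the ten rows shown earlier, and `outlook_codes` and `encoded` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Stand-in data: the Outlook values from the ten rows displayed earlier
outlook = pd.Series(["Sunny", "Sunny", "Overcast", "Rain", "Rain",
                     "Rain", "Overcast", "Sunny", "Sunny", "Rain"])

# Same mapping as the notebook's dictionary (name is illustrative)
outlook_codes = {"Sunny": 1, "Overcast": 2, "Rain": 3}

# Series.map looks each value up in the dict; unknown values become NaN
encoded = outlook.map(outlook_codes)

# Fail loudly if any value was missing from the mapping
if encoded.isna().any():
    raise ValueError("Unmapped Outlook values found")

encoded = encoded.astype(int)
print(encoded.tolist())
```

Note that integer codes give Outlook an artificial order (Sunny < Overcast < Rain), and GaussianNB will treat the column as a continuous number. That is acceptable for a small demonstration like this one, but worth keeping in mind.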

4. Splitting Data into Training and Test sets

We now need to randomly split the data into two sets. The first set is used to train the algorithm. The second is used to test whether the model can predict the answer without being told it.

# split the data into training and test data
train, test = cross_validation.train_test_split(golf_data, test_size=0.3, random_state=0)

# initialise Gaussian Naive Bayes
naive_b = GaussianNB()

# Use all columns apart from the Play column as features
# (.ix has been removed from pandas, so positional .iloc is used instead)
train_features = train.iloc[:, 0:4]
# Use the Play column as the label
train_label = train.iloc[:, 4]

# Repeat the above for the test data
test_features = test.iloc[:, 0:4]
test_label = test.iloc[:, 4]
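
To make the effect of `test_size` and `random_state` more concrete, here is a minimal sketch on a stand-in frame of ten rows (not the golf data itself). It assumes scikit-learn 0.18 or later, where `train_test_split` lives in `sklearn.model_selection`; the names `demo`, `train_a` and so on are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame: ten rows, just to show how the split behaves
demo = pd.DataFrame({"row": range(10)})

# Two splits with the same settings
train_a, test_a = train_test_split(demo, test_size=0.3, random_state=0)
train_b, test_b = train_test_split(demo, test_size=0.3, random_state=0)

# 30% of ten rows go to the test set, the rest to training
print(len(train_a), len(test_a))

# A fixed random_state selects the same rows on every run
print(test_a.index.tolist() == test_b.index.tolist())

# No row appears in both sets
print(set(train_a.index) & set(test_a.index))
```

Fixing `random_state` is what makes the notebook's results repeatable. Without it, each run would produce a different split and, on a dataset this small, a noticeably different accuracy.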

5. Training and Prediction

First we need to train our model. The features are passed in along with the actual answer (the Play column), and the algorithm uses them to build a model.

The test data set is then used to test the model. We give the model the test data (which it has never seen before) without the answer (the Play column). It then tries to predict, based on the features, whether or not golf would be played.
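
For Gaussian Naive Bayes, the "model" is essentially a prior probability for each class plus a mean and variance for each feature within each class. The sketch below shows where scikit-learn stores some of these values. It uses only the Temperature, Humidity and Play values from the ten rows displayed in section 2 as stand-in training data, so the numbers will differ from those of the notebook's actual training split. The names `demo` and `model` are illustrative.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Stand-in data: three columns from the ten rows shown in section 2
demo = pd.DataFrame({
    "Temperature": [85, 80, 83, 70, 68, 65, 64, 72, 69, 75],
    "Humidity":    [85, 90, 78, 96, 80, 70, 65, 95, 70, 80],
    "Play": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"],
})

model = GaussianNB()
model.fit(demo[["Temperature", "Humidity"]], demo["Play"])

# One prior per class: the fraction of training rows with that label
print(dict(zip(model.classes_, model.class_prior_)))

# One mean per class and feature (rows follow the order of model.classes_)
print(model.theta_)
```

At prediction time the classifier combines these per-feature Gaussian likelihoods with the class priors and picks the class with the highest resulting probability.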

A sample of the test data, along with the prediction and the overall accuracy, is shown below.

# Train the naive bayes model
naive_b.fit(train_features, train_label)

# build a dataframe to show the expected vs predicted values
test_data = pd.concat([test_features, test_label], axis=1)
test_data["prediction"] = naive_b.predict(test_features)

print(test_data)

# Use the score function and output the prediction accuracy
print("Naive Bayes Accuracy:", naive_b.score(test_features, test_label))
    Outlook  Temperature  Humidity   Wind Play prediction
8         1           69        70  False  Yes        Yes
6         2           64        65   True  Yes         No
4         3           68        80  False  Yes        Yes
11        2           72        90   True  Yes         No
2         2           83        78  False  Yes        Yes
Naive Bayes Accuracy: 0.6
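
The accuracy reported by `score` is simply the fraction of test rows where the prediction matches the label. As a quick check, the sketch below recomputes it by hand from the Play and prediction columns of the five rows printed above; the variable names are illustrative.

```python
# Play and prediction columns copied from the five test rows printed above
actual = ["Yes", "Yes", "Yes", "Yes", "Yes"]
predicted = ["Yes", "No", "Yes", "No", "Yes"]

# Count the rows where the prediction matches the label
correct = sum(a == p for a, p in zip(actual, predicted))

# Accuracy is the matching fraction
accuracy = correct / len(actual)
print(correct, accuracy)  # 3 0.6
```

With only five test rows, each row moves the score by 0.2, so this figure is only a rough indication of how well the model performs.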
