Naive Bayes Example using Golf Dataset
The following notebook works through a simple example of a Naive Bayes implementation.
The aim of this machine learning application is to predict whether or not to play golf based on weather conditions.
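For reference, the rule a Naive Bayes classifier applies can be written as follows. The notation is an assumption made for this note and does not come from the notebook: $y$ is the class (Play = Yes or No), $x_1, \dots, x_n$ are the feature values, and $\mu_{y,i}$, $\sigma_{y,i}^2$ are the mean and variance of feature $i$ within class $y$. The second line is the Gaussian likelihood that a Gaussian Naive Bayes model assumes for each feature.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
                \exp\!\left( -\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}} \right)
```

The predicted class is the $y$ with the largest value of the first expression; the "naive" part is the assumption that the features are independent given the class.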
1. Import the required Libraries
import pandas as pd
import numpy
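# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides train_test_split
from sklearn import model_selection as cross_validation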
from sklearn.naive_bayes import GaussianNB
2. Read in the Data file
Here we read the Golf.csv data file into a pandas DataFrame using the pandas library.
golf_file = "Golf.csv"
# Read the CSV file directly into a DataFrame
golf_data = pd.read_csv(golf_file, sep=",")
golf_data.head(10)
Index | Row No | Outlook | Temperature | Humidity | Wind | Play |
---|---|---|---|---|---|---|
0 | 1 | Sunny | 85 | 85 | False | No |
1 | 2 | Sunny | 80 | 90 | True | No |
2 | 3 | Overcast | 83 | 78 | False | Yes |
3 | 4 | Rain | 70 | 96 | False | Yes |
4 | 5 | Rain | 68 | 80 | False | Yes |
5 | 6 | Rain | 65 | 70 | True | No |
6 | 7 | Overcast | 64 | 65 | True | Yes |
7 | 8 | Sunny | 72 | 95 | False | No |
8 | 9 | Sunny | 69 | 70 | False | Yes |
9 | 10 | Rain | 75 | 80 | False | Yes |
3. Data Cleansing and Feature Selection
As with any Data Science application, data cleansing and feature selection play a vital role.
- We need to select the columns from the dataset that we feel will give us the best prediction score - this is called feature selection. Here we will use all columns apart from the first one, as this is a row number column.
- We need to ensure the data is in the correct format for the Naive Bayes algorithm.
- We need to map the string column Outlook to numbers, because the Naive Bayes implementation used here cannot handle string features.
# Remove the Row No column as it is not an important feature
golf_data = golf_data.drop("Row No", axis=1)
# Map string values in the Outlook column to numbers
d = {'Sunny': 1, 'Overcast': 2, 'Rain': 3}
golf_data["Outlook"] = golf_data["Outlook"].map(d)
golf_data.head(10)
Index | Outlook | Temperature | Humidity | Wind | Play |
---|---|---|---|---|---|
0 | 1 | 85 | 85 | False | No |
1 | 1 | 80 | 90 | True | No |
2 | 2 | 83 | 78 | False | Yes |
3 | 3 | 70 | 96 | False | Yes |
4 | 3 | 68 | 80 | False | Yes |
5 | 3 | 65 | 70 | True | No |
6 | 2 | 64 | 65 | True | Yes |
7 | 1 | 72 | 95 | False | No |
8 | 1 | 69 | 70 | False | Yes |
9 | 3 | 75 | 80 | False | Yes |
4. Splitting Data into Training and Test sets
We now need to randomly split the data into two sets. The first set is used to train the algorithm. The second is used to test the model, to see whether it can predict the outcome without being told the answer.
# Split the data into training and test data
train, test = cross_validation.train_test_split(golf_data, test_size=0.3, random_state=0)
# initialise Gaussian Naive Bayes
naive_b = GaussianNB()
# Use all columns apart from the Play column as features
# (.ix has been removed from pandas; .iloc selects by position)
train_features = train.iloc[:, 0:4]
# Use the Play column as the label
train_label = train.iloc[:, 4]
# Repeat the above for the test data
test_features = test.iloc[:, 0:4]
test_label = test.iloc[:, 4]
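To make the effect of `test_size=0.3` concrete, here is a minimal sketch of a random split using only the standard library. It is not how scikit-learn implements `train_test_split`; the helper name `split_indices`, the choice to round the test share up, and the row count of 14 (the size of the classic golf dataset; only the first ten rows are shown above) are all assumptions made for illustration.

```python
import math
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices and hold out the first ceil(test_size * n_rows) as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded so the split is repeatable
    n_test = math.ceil(test_size * n_rows)    # assumption: round the test share up
    return indices[n_test:], indices[:n_test]


# Assumed row count of 14; the notebook itself only displays ten rows
train_idx, test_idx = split_indices(14, 0.3, seed=0)
print(len(train_idx), len(test_idx))
```

Under these assumptions the split yields nine training rows and five test rows, and every row lands in exactly one of the two sets.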
5. Training and Prediction
First we train the model. The features are passed in together with the actual answer (the Play column), and the algorithm uses them to build a model.
The test set is then used to evaluate the model. We give the model the test data (which it has never seen before) without the answer (the Play column), and it tries to predict from the features whether or not golf would be played.
A sample of the test data, along with the predictions and the overall accuracy, is shown below.
# Train the naive bayes model
naive_b.fit(train_features, train_label)
# build a dataframe to show the expected vs predicted values
test_data = pd.concat([test_features, test_label], axis=1)
test_data["prediction"] = naive_b.predict(test_features)
print(test_data)
# Use the score function and output the prediction accuracy
print("Naive Bayes Accuracy:", naive_b.score(test_features, test_label))
Outlook Temperature Humidity Wind Play prediction
8 1 69 70 False Yes Yes
6 2 64 65 True Yes No
4 3 68 80 False Yes Yes
11 2 72 90 True Yes No
2 2 83 78 False Yes Yes
Naive Bayes Accuracy: 0.6
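To show roughly what "building a model" means for Gaussian Naive Bayes, the following is a simplified sketch in plain Python. It is not scikit-learn's implementation and is not guaranteed to reproduce its predictions. The function names `fit`, `log_posterior` and `predict` are hypothetical, the data is the ten encoded rows displayed earlier (with Wind written as 0/1), and the small constant added to each variance is an assumption of this sketch.

```python
import math
from collections import defaultdict

# The ten rows shown earlier: (Outlook code, Temperature, Humidity, Wind as 0/1), Play
rows = [
    ((1, 85, 85, 0), "No"),
    ((1, 80, 90, 1), "No"),
    ((2, 83, 78, 0), "Yes"),
    ((3, 70, 96, 0), "Yes"),
    ((3, 68, 80, 0), "Yes"),
    ((3, 65, 70, 1), "No"),
    ((2, 64, 65, 1), "Yes"),
    ((1, 72, 95, 0), "No"),
    ((1, 69, 70, 0), "Yes"),
    ((3, 75, 80, 0), "Yes"),
]


def fit(rows):
    """Estimate a prior and per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for features, label in rows:
        by_class[label].append(features)
    model = {}
    for label, samples in by_class.items():
        n = len(samples)
        columns = list(zip(*samples))
        means = [sum(col) / n for col in columns]
        # Assumption: a tiny constant keeps every variance strictly positive
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model


def log_posterior(model, features):
    """Unnormalised log posterior per class: log prior plus summed Gaussian log likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        scores[label] = score
    return scores


def predict(model, features):
    """Return the class with the highest log posterior."""
    scores = log_posterior(model, features)
    return max(scores, key=scores.get)


model = fit(rows)
print(predict(model, (1, 85, 85, 0)))
```

Working in log space avoids multiplying many small probabilities together, which is why the scores are sums rather than products.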