Written on January 13, 2017
Author: Lewis Gavin

Naive Bayes Example using Golf Dataset

This notebook works through a simple example of a Naive Bayes classifier.

The aim of this machine learning application is to predict whether or not to play golf based on weather conditions.

1. Import the required Libraries

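import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection
# provides the same train_test_split, aliased here to keep the name used later on
from sklearn import model_selection as cross_validation
from sklearn.naive_bayes import GaussianNB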

2. Read in the Data file

Here we read the Golf.csv data file into a pandas data frame using the pandas library.

golf_file = "Golf.csv"

# read_csv opens and closes the file itself when given a path
golf_data = pd.read_csv(golf_file, sep=",")

golf_data.head(10)

	Row No	Outlook	Temperature	Humidity	Wind	Play
0	1	Sunny	85	85	False	No
1	2	Sunny	80	90	True	No
2	3	Overcast	83	78	False	Yes
3	4	Rain	70	96	False	Yes
4	5	Rain	68	80	False	Yes
5	6	Rain	65	70	True	No
6	7	Overcast	64	65	True	Yes
7	8	Sunny	72	95	False	No
8	9	Sunny	69	70	False	Yes
9	10	Rain	75	80	False	Yes
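To see how pandas interprets a file like this, here is a minimal sketch that parses a few of the rows shown above from an in-memory string. The comma delimiter and the exact header line are assumptions about what Golf.csv looks like on disk; `csv_text` and `sample` are names made up for this illustration.

```python
import io

import pandas as pd

# A hypothetical excerpt of Golf.csv (header and delimiter are assumed)
csv_text = """Row No,Outlook,Temperature,Humidity,Wind,Play
1,Sunny,85,85,False,No
2,Sunny,80,90,True,No
3,Overcast,83,78,False,Yes
"""

# read_csv accepts any file-like object, so StringIO stands in for the file
sample = pd.read_csv(io.StringIO(csv_text))

# Inspect the inferred column types before doing any cleansing
print(sample.dtypes)
```

Checking the inferred types early shows which columns are already numeric and which, like Outlook and Play, are still strings.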

3. Data Cleansing and Feature Selection

As with any data science application, data cleansing and feature selection play a vital role.

We need to select the columns from the dataset that we feel will give us the best prediction score. This is called feature selection. Here we use all columns apart from the first one, as it is only a row number.
We also need to ensure the data is in the correct format for the Naive Bayes algorithm.
1. We need to map our string column Outlook to numbers, because the Naive Bayes implementation cannot deal with string features.

# Remove the Row No column as it is not an important feature
golf_data = golf_data.drop("Row No", axis=1)

# Map string values for Outlook column to numbers
d = {'Sunny': 1, 'Overcast': 2, 'Rain': 3}
golf_data.Outlook = [d[item] for item in golf_data.Outlook.astype(str)]

golf_data.head(10)

	Outlook	Temperature	Humidity	Wind	Play
0	1	85	85	False	No
1	1	80	90	True	No
2	2	83	78	False	Yes
3	3	70	96	False	Yes
4	3	68	80	False	Yes
5	3	65	70	True	No
6	2	64	65	True	Yes
7	1	72	95	False	No
8	1	69	70	False	Yes
9	3	75	80	False	Yes
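The same mapping can also be written with the pandas `Series.map` method. The sketch below is one possible alternative, not the approach used in this notebook; `outlook_codes` and `sample` are hypothetical names. Because `map` leaves unknown categories as NaN rather than raising an error, the sketch checks for them explicitly.

```python
import pandas as pd

# Same codes as in the notebook; the variable name is illustrative
outlook_codes = {"Sunny": 1, "Overcast": 2, "Rain": 3}

# A small stand-in for the Outlook column
sample = pd.DataFrame({"Outlook": ["Sunny", "Overcast", "Rain", "Sunny"]})

# map() looks each value up in the dict; values missing from it become NaN
encoded = sample["Outlook"].map(outlook_codes)

# Fail early if any category was not covered by the mapping
if encoded.isna().any():
    raise ValueError("Outlook contains values missing from outlook_codes")

sample["Outlook"] = encoded.astype(int)
print(sample)
```

Note that coding the categories as 1, 2, 3 gives them an order, and Gaussian Naive Bayes will treat the column as a continuous number. That keeps the example simple, but it is a modelling shortcut rather than a faithful representation of a categorical feature.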

4. Splitting Data into Training and Test sets

We now need to randomly split the data into two sets. The first set is used to train the algorithm. The second is used to test the model, to see whether it can predict the answer without being told it.

# Split the data into training and test data
train, test = cross_validation.train_test_split(golf_data, test_size=0.3, random_state=0)

# initialise Gaussian Naive Bayes
naive_b = GaussianNB()

# Use all columns apart from the Play column as features
# (.ix was removed in pandas 1.0, so positional indexing uses .iloc)
train_features = train.iloc[:, 0:4]
# Use the Play column as the label
train_label = train.iloc[:, 4]

# Repeat the above for the test data
test_features = test.iloc[:, 0:4]
test_label = test.iloc[:, 4]
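To make the split sizes concrete, here is a small sketch using a placeholder data frame. The 14-row size, the column names and the values are all assumptions made for illustration, not the real golf data. It uses `train_test_split` from `sklearn.model_selection`, which is where the function lives in current scikit-learn releases.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data: 14 rows with one feature and a label (values are made up)
frame = pd.DataFrame({
    "feature": range(14),
    "label": ["Yes", "No"] * 7,
})

# test_size=0.3 reserves 30% of the rows (rounded up) for testing;
# random_state fixes the shuffle so the split is repeatable
train_part, test_part = train_test_split(frame, test_size=0.3, random_state=0)

# Separate features (all but the last column) from the label (last column)
features = train_part.iloc[:, :-1]
label = train_part.iloc[:, -1]

print(len(train_part), len(test_part))
```

With so few rows, a different `random_state` can change the test set, and therefore the reported accuracy, quite noticeably.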

5. Training and Prediction

First we need to train our model. Here our features are passed in along with the actual answer (the Play column). The algorithm then uses this to build a model.

Our test data set is then used to test the model. We give the model the test data (which it has never seen before) without the answer (the Play column). It then tries to predict, based on the features, whether or not golf would be played.
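As a rough idea of what "building a model" means for Gaussian Naive Bayes, the sketch below fits per-class priors, means and variances by hand on the Temperature and Humidity values of the ten rows shown earlier, then scores a new point. It is a simplified illustration, not scikit-learn's implementation (which, for example, also adds a small smoothing term to the variances); the function names and the query point are hypothetical.

```python
import numpy as np

# Temperature and Humidity for the ten rows shown earlier, with their Play labels
X = np.array([[85, 85], [80, 90], [83, 78], [70, 96], [68, 80],
              [65, 70], [64, 65], [72, 95], [69, 70], [75, 80]], dtype=float)
y = np.array(["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"])


def fit_gaussian_nb(X, y):
    """Store the prior, per-feature mean and per-feature variance of each class."""
    model = {}
    for label in sorted(set(y.tolist())):
        rows = X[y == label]
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": rows.mean(axis=0),
            "var": rows.var(axis=0),
        }
    return model


def log_posterior(model, x):
    """Unnormalised log posterior per class: log prior + sum of log Gaussian densities."""
    scores = {}
    for label, p in model.items():
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"]
        )
        scores[label] = np.log(p["prior"]) + log_likelihood
    return scores


def predict(model, x):
    """Pick the class with the highest log posterior."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)


model = fit_gaussian_nb(X, y)
# A made-up day: temperature 70, humidity 75
print(predict(model, np.array([70.0, 75.0])))
```

The "naive" part is the sum over features in `log_posterior`: each feature is treated as independent of the others once the class is known.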

We can see a sample of the test data along with the prediction and the overall accuracy below.

# Train the naive bayes model
naive_b.fit(train_features, train_label)

# build a dataframe to show the expected vs predicted values
test_data = pd.concat([test_features, test_label], axis=1)
test_data["prediction"] = naive_b.predict(test_features)

print(test_data)

# Use the score function and output the prediction accuracy
print("Naive Bayes Accuracy:", naive_b.score(test_features, test_label))

 Outlook  Temperature  Humidity   Wind Play prediction
       1           69        70  False  Yes        Yes
       2           64        65   True  Yes         No
       3           68        80  False  Yes        Yes
       2           72        90   True  Yes         No
       2           83        78  False  Yes        Yes
Naive Bayes Accuracy: 0.6
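
The accuracy figure is simply the fraction of test rows where the prediction matches the actual label. As a quick check, this sketch recomputes it from the Play and prediction columns printed above (the variable names are illustrative):

```python
# Actual labels and predictions copied from the output above
actual = ["Yes", "Yes", "Yes", "Yes", "Yes"]
predicted = ["Yes", "No", "Yes", "No", "Yes"]

# Count the rows where the model agreed with the true label
matches = sum(a == p for a, p in zip(actual, predicted))
accuracy = matches / len(actual)

print(matches, "of", len(actual), "correct; accuracy =", accuracy)
```

Three of the five test rows are predicted correctly, which gives the 0.6 reported by the score function. With a test set this small, each single row moves the accuracy by 0.2, so the number should be read as a rough indication only.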
