TRAINING & TESTING OF THE MODELS

We need to train and test the models and find out which model has the best accuracy.

This post will be updated soon.

Let’s get started with part 5 of the series Machine Learning in Bioinformatics With Python. In the previous videos we downloaded the data from the UCI Repository and preprocessed it.

In this video we will train and test the models. We will split the data into a training dataset and a testing dataset, and then train our models using two classifiers: SVC and logistic regression.

First of all, let’s import the few things we will need: train_test_split, LogisticRegression and SVC. See the code block below.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

Let’s move on to splitting our dataset into training and testing sets. We need four variables: X_train, X_test, y_train and y_test (by convention, the y variable for the labels is written in lowercase).

Usually we show some of the data to the model and hold back the rest to test the accuracy of the model after it is trained. Here we will use 90% of the data for training and 10% for testing. We will split the data using the train_test_split() function. test_size = 0.1 means that the testing data will be 10% of the dataset, and random_state = 0 is a random seed. The seed can be any integer, and fixing it ensures that the data is shuffled the same way on every run, so the split is reproducible. Here is the code for the split.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

The above line of code creates 4 variables. It stores the training & testing labels in y_train & y_test, and the training & testing features in X_train & X_test.
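To see what the split produces without needing the tutorial's data, here is a minimal sketch on a synthetic dataset. The names X_demo and y_demo, and the dataset size of 700 samples with 9 features, are assumptions chosen only for illustration; they are not the X and y from the previous parts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (hypothetical size: 700 samples, 9 features).
X_demo, y_demo = make_classification(n_samples=700, n_features=9, random_state=0)

# Same call as in the tutorial: 10% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)
print(X_train.shape, X_test.shape)  # (630, 9) (70, 9)

# Repeating the split with the same seed gives exactly the same test rows.
_, X_test_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)
print((X_test == X_test_again).all())  # True
```

Changing random_state to another integer would shuffle the rows differently, so the accuracy numbers later on could shift slightly.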

Now it is time to define and train our very first classifier, the SVC classifier. Defining a classifier is very easy: we just set up a variable (the name could be anything; Classifier is used in this example) and assign it the result of calling SVC(), like so.

Classifier = SVC(kernel = 'linear')

Note: You can ignore the kernel for the time being, or you can study more about it here.

Let’s train our model by simply calling the .fit function on the training data. Apart from the kernel, I am going with all the defaults in this beginner tutorial. However, you can learn more about them in the official documentation for sklearn.svm.SVC.

model = Classifier.fit(X_train, y_train)

Once our model is trained, we need to test its accuracy. We will test the model on our testing data and print the result like this:

accu = model.score(X_test, y_test)
print("Accuracy of SVC: ", accu)

The accuracy for SVC turned out to be 0.9857142857142858, which is approximately 98.6% and pretty darn good.
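If you are wondering what .score does for a classifier, it returns the fraction of test samples whose predicted label matches the true label. The sketch below checks this by hand on a synthetic dataset (X_demo and y_demo are assumptions for illustration, not the tutorial's data), so the printed numbers will differ from the accuracy above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the tutorial's X and y (hypothetical data).
X_demo, y_demo = make_classification(n_samples=700, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

model = SVC(kernel='linear').fit(X_train, y_train)

# Predict a label for every test sample, then count the matches by hand.
predictions = model.predict(X_test)
manual_accuracy = np.mean(predictions == y_test)

# All three lines print the same number.
print(manual_accuracy)
print(accuracy_score(y_test, predictions))
print(model.score(X_test, y_test))
```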

Let’s try another classifier: logistic regression. The name might confuse you, since regression usually means predicting continuous values, while here we are doing binary classification. The naming is just a bit confusing: despite its name, logistic regression is a classification algorithm, and that is how it is used in scikit-learn. We will use similar code with slight modifications:

Classifier = LogisticRegression(solver = 'liblinear')
model = Classifier.fit(X_train, y_train)
accu = model.score(X_test, y_test)
print("Accuracy of Logistic Regression : ", accu)

This time the accuracy is 0.9714285714285714, which is approximately 97%, not too shabby.
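To make the "regression used for classification" point a bit more concrete, here is a small sketch, again on synthetic data (X_demo and y_demo are assumptions for illustration). Logistic regression estimates a probability for each class, and .predict then reports the class with the higher probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
probabilities = model.predict_proba(X_demo[:5])
# The predicted label is the class with the larger probability.
labels = model.predict(X_demo[:5])

print(probabilities)
print(labels)
print(model.classes_[np.argmax(probabilities, axis=1)])  # same as labels
```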

Note: Don’t get worried about solver = 'liblinear'. You can read about its details here.
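Since the goal of this part is to find the model with the best accuracy, the two experiments can also be written as one small loop. This is only a sketch on synthetic data: X_demo, y_demo, the classifiers dictionary and the scores dictionary are names introduced here for illustration, and the numbers it prints will not match the accuracies reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the tutorial's X and y (hypothetical data).
X_demo, y_demo = make_classification(n_samples=700, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

# The same two classifiers used in this post, keyed by a display name.
classifiers = {
    "SVC": SVC(kernel='linear'),
    "Logistic Regression": LogisticRegression(solver='liblinear'),
}

# Train each classifier on the same split and record its test accuracy.
scores = {}
for name, classifier in classifiers.items():
    scores[name] = classifier.fit(X_train, y_train).score(X_test, y_test)
    print("Accuracy of", name, ":", scores[name])

# Pick the name with the highest recorded accuracy.
best = max(scores, key=scores.get)
print("Best model:", best)
```

Keep in mind that with a small test set, a small difference in accuracy may come down to only a sample or two, so it is worth being cautious before declaring a winner from a single split.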

That’s all for now. We are done with the training of our models, and in the next part we will make some predictions with the help of our trained models.