Logistic Regression Classification: Scikit-learn Tutorial

Table of Contents

  1. What is Classification?
  2. Using Logistic Regression
  3. Iris Dataset
  4. Example
  5. Brief Introduction
  6. Confusion Matrix
  7. Accuracy
  8. Video Tutorial (2 Videos)

1. What is Classification?

Although Logistic Regression sounds like a regression technique, it is actually a classification algorithm. A problem is a classification problem when the dependent variable belongs to a discrete set of values.

2. Using Logistic Regression

  1. Let’s look at the basic function behind this classification, known as the sigmoid function.
  2. Values higher than 0.5 are considered ‘1’, and values lower than 0.5 are considered ‘0’. That is by far the simplest description of how the sigmoid function is used for classification.
Figure-1

So the classifier applies a specific threshold, which for the sigmoid function is 0.5 by default: values above the threshold are classified into the category ‘1’, and values below it are classified as ‘0’.
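The sigmoid function and its 0.5 threshold can be sketched in a few lines of Python (a minimal illustration using NumPy; the variable names here are not from the tutorial's figures):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Apply the default 0.5 threshold: outputs >= 0.5 become class 1, else class 0.
z = np.array([-3.0, 0.0, 3.0])
probabilities = sigmoid(z)
labels = (probabilities >= 0.5).astype(int)

print(probabilities)  # roughly [0.047, 0.5, 0.953]
print(labels)         # [0 1 1]
```

Note that sigmoid(0) is exactly 0.5, which is why the threshold sits at the midpoint of the function's range.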

3. Iris Dataset

We import all the necessary libraries, and we’ll be using the Iris dataset that scikit-learn provides through its datasets module. The Iris dataset is a very small dataset that is widely used for basic classification exercises.

Figure-2

The Iris dataset contains three different types of flowers, namely setosa, versicolor, and virginica, distinguished by their different petal and sepal measurements.
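Loading the dataset from scikit-learn looks like this (a minimal sketch, assuming scikit-learn is installed; the original code is only shown in the figures):

```python
from sklearn.datasets import load_iris

# Bunch object holding the feature matrix, labels, and metadata.
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)            # (150, 4): sepal length/width, petal length/width
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```

Each of the 150 samples has four numeric features, and the target is an integer class label from 0 to 2.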

4. Example

The dataset is divided into the data and the target. We assign X as the data and y as the target. We need to split the dataset for training and testing purposes, so one set of values is used for training and another set for testing.

Figure-3
  1. The training data is the data on which the model learns. The split can be done using the scikit-learn function train_test_split.
  2. This function returns four values: X_train, X_test, y_train, and y_test. We pass in the X and y values, followed by the test size.
  3. The test size is typically 20% or 25%. As the dataset is very small, we’ll take 20% of the data as the test set and 80% for training. This is followed by a very important parameter, random_state.
  4. When we give a specific value to random_state, each time you run your model the data is split in exactly the same way. Without a fixed seed, every run produces a different split, so results change from run to run and cannot be reproduced or compared reliably.
  5. So random_state is a very important parameter, and it can be given any integer value.
  6. Then we use the scikit-learn preprocessing class StandardScaler, because we need to standardize the column values.
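The split-and-scale steps above can be sketched as follows (a minimal sketch; `sc` is an illustrative variable name, not taken from the figures):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 80/20 split; a fixed random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardize features: fit the scaler on the training data only,
# then apply the same learned statistics to the test data.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

Fitting the scaler on the training data alone, and only transforming the test data, keeps information about the test set from leaking into the preprocessing step.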
Figure-4

We import the LogisticRegression class, set its random_state to zero, and then fit the classifier with X_train and y_train.

Figure-5

The fit function simply trains the model on the given dataset; it works the same way across almost every scikit-learn estimator. The predict function then predicts the y values for the test data.

Figure-6

We now compare the original values with the predicted values. This comparison gives us an estimate of how accurate our model is.
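Putting the fit-and-predict steps together (a minimal end-to-end sketch; `classifier` and `y_pred` are illustrative names, since the tutorial's actual code appears only in the figures):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train the classifier on the training data...
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

# ...then predict labels for the unseen test data.
y_pred = classifier.predict(X_test)
print(y_test)  # actual labels
print(y_pred)  # predicted labels
```

Printing the two arrays side by side lets you eyeball how many predictions match before computing any formal metric.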

Figure-7

5. Brief Introduction

  1. In the first part of this logistic regression tutorial, we solved the Iris dataset classification problem: we split the data into training and test sets, performed logistic regression on the training data, and finally obtained predicted y values for the test samples.
  2. In this second part, we deal with accuracy and comparisons a little more thoroughly.
  3. Since we do not want to compare the predicted and actual values manually one by one, we use the confusion matrix and the accuracy score.

6. Confusion Matrix

The main tool is the confusion matrix, which shows how many classifications were done correctly and how many were done wrong. We can import the confusion_matrix function from scikit-learn and apply it directly to the two arrays, y_test and y_pred.

Figure-8

As we can see, the main diagonal of the matrix gives us the correctly classified counts, and the off-diagonal values are the misclassified counts.
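Computing the confusion matrix looks like this (a self-contained sketch repeating the earlier pipeline; variable names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
sc = StandardScaler()
X_train, X_test = sc.fit_transform(X_train), sc.transform(X_test)

classifier = LogisticRegression(random_state=0).fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are actual classes, columns are predicted classes;
# the main diagonal counts correct predictions.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

For a 3-class problem the matrix is 3x3, and its entries sum to the number of test samples (30 here).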

7. Accuracy

  1. For an easier summary, one can look directly at the model’s accuracy score, which is in fact derived from the confusion matrix.
  2. To get the accuracy score we go to the same scikit-learn metrics module and import the accuracy_score function.
  3. Once we pass the values of y_test and y_pred into the function, we easily get the accuracy.
Figure-9
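The accuracy computation can be sketched as follows (self-contained, repeating the earlier pipeline; with this split the score should be high, though the exact value depends on library version):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
sc = StandardScaler()
X_train, X_test = sc.fit_transform(X_train), sc.transform(X_test)

classifier = LogisticRegression(random_state=0).fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Fraction of test samples whose predicted label matches the actual label.
acc = accuracy_score(y_test, y_pred)
print(acc)
```

Accuracy is simply the sum of the confusion matrix's diagonal divided by the total number of test samples.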

Once we run the cell, the accuracy is displayed. Here we got 96% accuracy on the test data, which is a good result for the Iris dataset and the split used. As a rule of thumb for this dataset, 96% or above is considered very good.

8. Video Tutorial

Video Tutorial-1

Video Tutorial-2
