Table of Contents
- Introduction to linear regression
- Problem Statement
- Code walkthrough
- Video Tutorial
1 Introduction to linear regression
- We have one input variable X and one output variable Y, and we want to build a linear relationship between them.
- The input variable is called the independent variable and the output variable is called the dependent variable.
- We have a data set with two columns: one holds the annual franchise fee and the other the startup cost. We want to predict the value of the startup cost.
- X will be the annual franchise fee and Y will be the startup cost. We would like to find a linear relationship between the two so that we can predict the startup cost for the test data set.
2 Code walk-through
First, import the necessary libraries: Matplotlib, which will be used for plotting, and pandas, which will be used for reading the data set.
Next, load the data set with the read_csv function and initialize the x and y variables with the annual franchise fee and the startup cost, respectively.
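The loading step can be sketched roughly as follows. The column names and the values are made up for illustration, and the inline CSV text only stands in for the real file so that the snippet runs on its own. With a real file, pass its path to read_csv instead.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV file; column names are assumptions.
csv_text = """annual_franchise_fee,startup_cost
1000,1050
1125,1150
1087,1213
1070,1275
1100,1300
1150,1300
1250,1400
1150,1400
"""

# read_csv accepts a file path or a file-like object such as StringIO.
dataset = pd.read_csv(io.StringIO(csv_text))

# scikit-learn expects X as a 2-D array (rows = samples, columns = features),
# so we select the column with a list to keep two dimensions.
X = dataset[["annual_franchise_fee"]].values
y = dataset["startup_cost"].values

print(X.shape, y.shape)
```

Keeping X two-dimensional at this stage avoids a reshape later, when the data is passed to the model.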
Let’s visualize the dataset
- Here is the data set plotted, with the annual franchise fee on the x-axis and the startup cost on the y-axis. The red points are our data.
- Split the data set into a training set and a test set. We will use a function called train_test_split, which splits the whole data set into a training set and a test set.
- We then fit a straight line through the points. The best-fitting line is called the regression line.
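A minimal sketch of the scatter plot and the split described above. The data values, the 25% test size, and the random_state are assumptions chosen for illustration, not values from the original walkthrough.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop it in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up values standing in for the two columns of the data set.
X = np.array([1000, 1125, 1087, 1070, 1100, 1150, 1250, 1150], dtype=float).reshape(-1, 1)
y = np.array([1050, 1150, 1213, 1275, 1300, 1300, 1400, 1400], dtype=float)

# Scatter plot: annual franchise fee on the x-axis, startup cost on the y-axis.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], y, color="red")
ax.set_xlabel("Annual franchise fee")
ax.set_ylabel("Startup cost")
# In an interactive session, call plt.show() to display the figure.

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(len(X_train), len(X_test))
```

The test set is kept aside and only used later, to check how well the fitted line predicts data it has not seen.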
Training the model
To get the regression line, we first import the LinearRegression class and then call LinearRegression(), which returns an object of that class.
Then we fit the model to the training data by calling its fit function.
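As a sketch of the training step, the snippet below fits a LinearRegression model to a few synthetic points that lie exactly on y = 2x + 1. The points are an assumption made for illustration, chosen so that the fitted slope and intercept are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data lying exactly on the line y = 2x + 1 (illustrative only).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

# Create the model object, then fit it to the training data.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# coef_ holds one slope per feature; intercept_ is the constant term.
slope = regressor.coef_[0]
intercept = regressor.intercept_
print(slope, intercept)
```

On real data the points will not lie exactly on a line, so the fitted slope and intercept describe the line that is closest to the points in the least-squares sense.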
Visualizing the regression line
The predict function returns an array of predicted values, one for each input row.
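One way to draw the regression line, sketched under the assumption that the training data looks like the made-up values below: predict on the training inputs and plot those predictions as a line over the red data points.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop it in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data (annual franchise fee, startup cost).
X_train = np.array([1000, 1125, 1087, 1070, 1100, 1250], dtype=float).reshape(-1, 1)
y_train = np.array([1050, 1150, 1213, 1275, 1300, 1400], dtype=float)

regressor = LinearRegression().fit(X_train, y_train)

# predict returns a NumPy array with one predicted value per input row.
y_pred_train = regressor.predict(X_train)

fig, ax = plt.subplots()
ax.scatter(X_train[:, 0], y_train, color="red", label="data")
ax.plot(X_train[:, 0], y_pred_train, color="blue", label="regression line")
ax.set_xlabel("Annual franchise fee")
ax.set_ylabel("Startup cost")
ax.legend()
# In an interactive session, call plt.show() to display the figure.
```

Because every predicted value lies on the same straight line, connecting the predictions draws the regression line itself.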
Interpretation of regression line
- The predicted values lie on the regression line, and the red dots are the true points. If the points are close to the regression line, our model is good.
- The vertical distance from a point to the regression line represents the error of prediction, so the error of prediction for any red point is its vertical distance from the regression line.
- The error of prediction for a point is the actual value of the point minus the predicted value.
- As we can see, red points that are very near the regression line have a very small error of prediction.
- By contrast, a point that is much farther from the regression line has a large error of prediction.
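The error of prediction described above can be computed directly as actual minus predicted. The numbers below are made up for illustration and do not come from the original data set.

```python
import numpy as np

# Made-up actual and predicted startup costs for three points.
y_true = np.array([1050.0, 1300.0, 1400.0])
y_pred = np.array([1100.0, 1290.0, 1250.0])

# Error of prediction: actual value minus predicted value.
# A negative error means the line sits above the point.
errors = y_true - y_pred

# The vertical distance to the line is the absolute value of the error.
distances = np.abs(errors)
farthest = int(np.argmax(distances))

print(errors)
print(farthest)
```

Here the second point is only 10 away from the line, while the third is 150 away, so the third point contributes far more to the overall error.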
How to choose the best line?
- The most commonly used criterion for the best-fitting line is that it minimizes the mean squared error of prediction.
- To compute the mean squared error, we import mean_squared_error from sklearn.metrics and pass it the test targets and the predicted values as parameters.
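A short sketch of the mean squared error calculation, using made-up test targets and predictions. It also computes the same quantity by hand, as the mean of the squared errors, to show what mean_squared_error returns.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up test targets and model predictions (illustrative only).
y_test = np.array([1050.0, 1300.0, 1400.0])
y_pred = np.array([1100.0, 1290.0, 1250.0])

# scikit-learn takes the true values first, then the predictions.
mse = mean_squared_error(y_test, y_pred)

# The same quantity by hand: average of the squared prediction errors.
manual_mse = np.mean((y_test - y_pred) ** 2)

print(mse, manual_mse)
```

Squaring makes every error positive and penalizes large errors more heavily, which is why a single point far from the line can dominate the score.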