Many predictive questions imply an answer that is quantitative or continuous in nature. Linear regression is the fundamental technique for predicting such a continuous target variable on the basis of a given explanatory (input) variable.
In this article, I will cover the following topics related to linear regression:
- Simple Linear Regression
- The fitness of the model, or ordinary least squares
- Model evaluation
- Multiple linear regression
- Polynomial linear regression
Simple Linear Regression
Simple linear regression models the relationship between a single feature (the explanatory variable) and a continuous response variable. Suppose we have an explanatory variable x and a target variable y; then the linear model with one explanatory variable is defined as:

y = αx + β        (1)

where β is known as the intercept of y on x and α is the slope, or coefficient, of x.
To understand this, let's take the example of pizza price as a function of diameter, given below:
import matplotlib.pyplot as plt

# pizza diameters (inches) and corresponding prices (dollars)
x = [5, 6, 7, 8, 9, 10, 13, 14, 15, 16]
y = [6, 8, 9, 10, 11, 12, 15, 16, 17, 18]

plt.figure()
plt.xlabel("Diameter in inches")
plt.ylabel("Price in dollars")
plt.title("Pizza price against diameter")
plt.plot(x, y, 'k.')
plt.show()
In this figure, we can see that there is a positive relationship between the diameter of a pizza and its price. From the data given above, we can predict the price of a pizza with an unseen diameter using linear regression.
from sklearn.linear_model import LinearRegression

# scikit-learn expects the explanatory variable as a 2D array: one row per sample
X = [[5], [6], [7], [8], [9], [10], [13], [14], [15], [16]]
y = [[6], [8], [9], [10], [11], [12], [15], [16], [17], [18]]

lr = LinearRegression()
lr.fit(X, y)
lr.predict([[12], [15]])   # predict prices for 12-inch and 15-inch pizzas
array([[ 13.96431121], [ 17.07780157]])
Based on our simple linear regression model, the predicted price of a 12-inch pizza is $13.96 and of a 15-inch pizza is $17.08, but in our training data the price of the 15-inch pizza is $17.00. This difference is an error, called the residual error or training error. The residual error is the difference between the observed price and the predicted price.
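To make the residual concrete, here is a minimal sketch (using the same diameter and price data as above) that computes the residual for each training pizza as the observed price minus the predicted price:

import numpy as np
from sklearn.linear_model import LinearRegression

# training data: diameters (inches) and prices (dollars)
X = [[5], [6], [7], [8], [9], [10], [13], [14], [15], [16]]
y = [6, 8, 9, 10, 11, 12, 15, 16, 17, 18]

lr = LinearRegression()
lr.fit(X, y)

# residual = observed price - predicted price for each training pizza
residuals = np.array(y) - lr.predict(X)
print(residuals)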
Linear regression models the relationship between the explanatory variables and the target variable using a linear surface called a hyperplane.
The Fitness of the Model, or Ordinary Least Squares
We can find the best-fitting line by minimizing the sum of the squared residuals, so that the fitted values are as close as possible to the observed values. This measure is known as the residual sum of squares (RSS).
The formula for the residual sum of squares is

RSS = Σ (y_i − ŷ_i)²

where y_i is the observed value and ŷ_i is the predicted value.
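As a quick sketch, the residual sum of squares for the fitted pizza model can be computed directly from the residuals:

import numpy as np
from sklearn.linear_model import LinearRegression

X = [[5], [6], [7], [8], [9], [10], [13], [14], [15], [16]]
y = [6, 8, 9, 10, 11, 12, 15, 16, 17, 18]

lr = LinearRegression().fit(X, y)

# RSS = sum of squared differences between observed and predicted prices
rss = np.sum((np.array(y) - lr.predict(X)) ** 2)
print(rss)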
Now recall equation (1), y = αx + β. We have to solve for α and β so that they minimize our residual sum of squares.
To solve this, we first estimate the slope α, whose formula is α = Covariance(x, y) / Var(x).
Covariance measures how much two variables change together. It can be either positive or negative: if both values increase together it is called positive covariance, and if one value increases while the other decreases it is called negative covariance.
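For example, here is a small sketch that computes the sample covariance of the pizza diameters and prices straight from the definition; the result is positive because larger pizzas cost more:

import numpy as np

x = [5, 6, 7, 8, 9, 10, 13, 14, 15, 16]
y = [6, 8, 9, 10, 11, 12, 15, 16, 17, 18]

# sample covariance: mean of the products of deviations, with n - 1 in the denominator
cov_manual = np.sum((np.array(x) - np.mean(x)) * (np.array(y) - np.mean(y))) / (len(x) - 1)
print(cov_manual)   # positive: diameter and price increase together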
import numpy as np

x = [5, 6, 7, 8, 9, 10, 13, 14, 15, 16]
y = [6, 8, 9, 10, 11, 12, 15, 16, 17, 18]

# np.cov returns a 2x2 covariance matrix; the [0][1] entry is cov(x, y)
alpha = np.cov(x, y)[0][1] / np.var(x, ddof=1)
# alpha = 16.1555555556 / 15.5666666667
# alpha = 1.037830121342531
The intercept is then β = y_bar − α · x_bar, where y_bar and x_bar are the means of y and x.

Now we solve the equation:

β = 12.2 − 1.037830121342531 × 10.3
β = 1.510349750171931
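As a check, here is a minimal sketch that recomputes α and β with NumPy and then uses them to predict the price of a 12-inch pizza; the numbers agree with the sklearn output shown earlier:

import numpy as np

x = np.array([5, 6, 7, 8, 9, 10, 13, 14, 15, 16])
y = np.array([6, 8, 9, 10, 11, 12, 15, 16, 17, 18])

alpha = np.cov(x, y)[0][1] / np.var(x, ddof=1)   # slope = cov(x, y) / var(x)
beta = y.mean() - alpha * x.mean()               # intercept = y_bar - alpha * x_bar

print(alpha, beta)        # ~1.0378 and ~1.5103
print(alpha * 12 + beta)  # predicted price of a 12-inch pizza, ~13.96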
For instance, if we want to predict the price of a 12-inch pizza, the cost will be around $13.96, and our sklearn model predicted the same for a 12-inch pizza. So congratulations, we have made it!
Model Evaluation

In the previous section, we discussed how to fit the regression model by estimating the model's parameters from the training data. But it is crucial to test the model on data it hasn't seen during training. So here an important question arises: how do we know whether our model is a good representation of the real relationship on a test dataset?
We can introduce the MSE (mean squared error), which is the average value of the SSE cost function, that is, the residual sum of squares divided by the number of samples. It is very useful for comparing different regression models.
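A minimal sketch of computing MSE for the pizza model, either by averaging the squared residuals directly or with scikit-learn's mean_squared_error helper:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = [[5], [6], [7], [8], [9], [10], [13], [14], [15], [16]]
y = [6, 8, 9, 10, 11, 12, 15, 16, 17, 18]

lr = LinearRegression().fit(X, y)
y_pred = lr.predict(X)

# MSE = average of the squared residuals
print(np.mean((np.array(y) - y_pred) ** 2))
print(mean_squared_error(y, y_pred))   # same value via scikit-learn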
Another useful model evaluation technique is R², or r-squared. R² measures how well our observed values are predicted by the model and is also called the coefficient of determination. There are several methods for computing R², but in the case of simple linear regression, R² is equal to the square of the Pearson product-moment correlation coefficient (Pearson's r).
It is defined as

R² = 1 − RSS / TSS

where RSS is the residual sum of squares defined above and TSS is the total sum of squares, defined as

TSS = Σ (y_i − ȳ)²
R² explains how much of the variation in the data is explained by the model. R² is bounded between 0 and 1. If R² is 80% (0.80), it means 80% of the variation in the response is captured by the model and only 20% remains in the residuals, so the fitted line is good. An R² above 60% is generally considered good.
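To connect the formula with the code, this sketch computes R² on the training data as 1 − RSS/TSS and compares it with the value returned by LinearRegression's score method:

import numpy as np
from sklearn.linear_model import LinearRegression

X = [[5], [6], [7], [8], [9], [10], [13], [14], [15], [16]]
y = np.array([6, 8, 9, 10, 11, 12, 15, 16, 17, 18])

lr = LinearRegression().fit(X, y)
y_pred = lr.predict(X)

rss = np.sum((y - y_pred) ** 2)     # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
print(1 - rss / tss)                # R-squared from the formula
print(lr.score(X, y))               # same value from scikit-learn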
from sklearn.linear_model import LinearRegression

# training data: diameters (inches) and prices (dollars)
X = [[5], [6], [7], [8], [9], [10], [13], [14], [15], [16]]
y = [[6], [8], [9], [10], [11], [12], [15], [16], [17], [18]]

# test diameters and their observed prices
X_test = [[8], [9], [12], [15], [13]]
y_test = [, [8.5], , , ]

lr = LinearRegression()
lr.fit(X, y)
y_pred = lr.predict([[12], [15]])
print(lr.score(X, y))
print(y_pred)
print('R-squared: %.4f' % lr.score(X_test, y_test))
y predicted [[ 13.96431121] [ 17.07780157]] R-squared: 0.6598
Multiple Linear Regression
The R² on the test dataset is 65%, which is not very good, so we have to improve our model. We can introduce a new feature to which the price of a pizza is directly related: the toppings. The price of a pizza can also depend on its toppings. So we add a new explanatory variable, toppings, and we can now proceed with multiple linear regression with two explanatory variables: (1) diameter and (2) toppings.
In general, multiple linear regression is simply a generalization of simple linear regression that uses more than one explanatory variable: y = β + α₁x₁ + α₂x₂ + … + αₙxₙ.
from sklearn.linear_model import LinearRegression

# each training sample is [diameter in inches, number of toppings]
X = [[5, 1], [6, 0], [7, 1], [8, 0], [9, 0], [10, 1], [13, 1], [14, 1], [15, 0], [16, 1]]
Y = [[6], [8], [9], [10], [11], [12], [15], [16], [17], [18]]

X_test = [[8, 0], [9, 1], [12, 0], [15, 1], [13, 0]]
y_test = [, [8.5], , , ]

lr = LinearRegression()
lr.fit(X, Y)
print('R-squared: %.4f' % lr.score(X_test, y_test))
Now we can see the advantage of generalizing the linear model: our R² on the test set increased. Multiple linear regression improved the performance of our model.
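To see how each explanatory variable contributes, here is a short sketch (assuming the same training prices as in the single-feature example) that fits the two-feature model and prints the learned intercept and the coefficients for diameter and toppings:

from sklearn.linear_model import LinearRegression

# [diameter in inches, number of toppings] for each training pizza
X = [[5, 1], [6, 0], [7, 1], [8, 0], [9, 0], [10, 1], [13, 1], [14, 1], [15, 0], [16, 1]]
y = [6, 8, 9, 10, 11, 12, 15, 16, 17, 18]   # same training prices as before

lr = LinearRegression().fit(X, y)
print(lr.intercept_)   # learned intercept
print(lr.coef_)        # one coefficient per feature: diameter and toppings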
Polynomial Linear Regression

In the previous sections, we discussed models based on a linear relationship. Polynomial regression is a special case of multiple linear regression that relaxes the linearity assumption by adding polynomial terms; it captures curvilinear relationships between the explanatory variable and the target.
The model with polynomial terms can be written as

y = β + α₁x + α₂x² + … + α_a x^a

where a denotes the degree of the polynomial. Here I am going to use the quadratic form (degree 2) of regression with one explanatory variable:

y = β + α₁x + α₂x²
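Before the full example, here is a minimal sketch of what PolynomialFeatures with degree=2 does to a single diameter value: it expands x into the terms [1, x, x²], which the linear model then fits by ordinary least squares:

from sklearn.preprocessing import PolynomialFeatures

quadratic = PolynomialFeatures(degree=2)
# a single explanatory value x = 12 becomes [1, 12, 144], i.e. [1, x, x^2]
print(quadratic.fit_transform([[12]]))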
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# training and test sets: diameters and prices
X_train = [, , , , ]
y_train = [, , , [17.5], ]
X_test = [, , , ]
y_test = [, , , ]

# fit a simple linear model on the raw diameters
regressor = LinearRegression()
regressor.fit(X_train, y_train)
xx = np.linspace(0, 26, 100)
yy = regressor.predict(xx.reshape(xx.shape[0], 1))
plt.plot(xx, yy)

# add the squared term and fit a second model on the quadratic features
quadratic_featurizer = PolynomialFeatures(degree=2)
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
X_test_quadratic = quadratic_featurizer.transform(X_test)
regressor_quadratic = LinearRegression()
regressor_quadratic.fit(X_train_quadratic, y_train)
xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
plt.plot(xx, regressor_quadratic.predict(xx_quadratic), c='r', linestyle='--')

plt.title('Pizza price regressed on diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.axis([0, 25, 0, 25])
plt.grid(True)
plt.scatter(X_train, y_train)
plt.show()

print('Simple linear regression r-squared', regressor.score(X_test, y_test))
print('Quadratic regression r-squared', regressor_quadratic.score(X_test_quadratic, y_test))
Simple linear regression r-squared 0.84818661609 Quadratic regression r-squared 0.914832282464
Now we can clearly see that a polynomial model can give a better result than simple linear regression, even with a single explanatory variable. But if we try to fit a polynomial of too high a degree, it fits the training data almost exactly yet doesn't work well on unseen data; this is called overfitting. Overfitting means that our model performs very well on the training data but performs poorly on the test data.
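As a rough illustration (a sketch with an assumed degree of 9, not part of the original example), a polynomial with as many parameters as training points can reproduce the training prices almost exactly while extrapolating badly:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = [[5], [6], [7], [8], [9], [10], [13], [14], [15], [16]]
y = [6, 8, 9, 10, 11, 12, 15, 16, 17, 18]

# degree 9 gives the model enough freedom to pass through every training point
featurizer = PolynomialFeatures(degree=9)
X_poly = featurizer.fit_transform(X)

model = LinearRegression().fit(X_poly, y)
print(model.score(X_poly, y))                        # training R-squared is nearly perfect
print(model.predict(featurizer.transform([[20]])))   # extrapolated price for an unseen 20-inch pizza is typically far off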
These are all the basics you need to get started with linear regression, a type of supervised learning that is very useful for predicting a continuous target value. I recommend that you go through the whole process involved in linear regression and start your own linear regression project from scratch.
There are also some additional topics that come up while learning linear regression, such as gradient descent, regularization, the bias-variance tradeoff, decision tree regression, random forest regression, and other types of regression algorithms. I will cover these topics in upcoming articles.
If you liked this article, be sure to like it, and if you have any questions, I will do my best to answer them.