Regularization is a collection of techniques used to prevent over-fitting (a model that memorizes relations and structures that are really noise or coincidence). Regularization adds information to a problem, often in the form of a penalty against complexity, and it can be applied to many types of learning models, such as linear regression, logistic regression, and support vector machines.
Suppose our linear equation is y = ax + b, where a is the coefficient, b the intercept term, x the explanatory variable, and y the predicted value.
There are two common types of loss function: (1) L1 and (2) L2.
The L1 loss function is also known as least absolute deviations (LAD) or least absolute error (LAE). It minimizes the sum S of the absolute differences between the target value Yi and the estimated value f(xi): S = Σi |Yi − f(xi)|.
The L2 loss function is also known as least squares error (LSE). It minimizes the sum S of the squared differences between the target value Yi and the estimated value f(xi): S = Σi (Yi − f(xi))².
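The two losses above can be computed directly. A minimal sketch with NumPy, using made-up target and estimated values purely for illustration:

```python
import numpy as np

# Hypothetical target values Yi and model estimates f(xi), for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# L1 loss: sum of absolute differences |Yi - f(xi)|
l1_loss = np.sum(np.abs(y_true - y_pred))

# L2 loss: sum of squared differences (Yi - f(xi))^2
l2_loss = np.sum((y_true - y_pred) ** 2)

print(l1_loss)  # 2.0
print(l2_loss)  # 1.5
```

Note how the error of 1.0 on the last point contributes 1.0 to the L1 loss but is the dominant term of the L2 loss: squaring makes L2 much more sensitive to large errors.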
The difference between the L1 and L2 penalties is simply that L2 is the sum of the squares of the weights, while L1 is the sum of their absolute values.
L1 regularization on least squares:
    w* = argmin_w Σi (yi − wᵀxi)² + α Σj |wj|
This objective is used for the LASSO (Least Absolute Shrinkage and Selection Operator) implementation.
1. LASSO adds a penalty equivalent to the absolute value of the magnitude of the coefficients.
2. LASSO minimization objective = LS Obj + α * (sum of absolute values of coefficients)
1. α = 0: same coefficients as simple linear regression
2. α = ∞: all coefficients zero
3. 0 < α < ∞: coefficients between 0 and those of simple linear regression
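The sparsity that LASSO induces can be seen with scikit-learn (an assumption here, since the text names no library) on a small synthetic dataset where only the first feature matters:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (illustrative only): y depends on the first feature;
# the remaining four features are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Unregularized least squares vs. LASSO with a moderate alpha
ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# OLS keeps small nonzero weights on the noise features;
# LASSO drives them exactly to zero.
print(ols.coef_)
print(lasso.coef_)
```

The value alpha=0.5 is an arbitrary choice for the sketch; in practice it would be tuned, e.g. by cross-validation.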
L2 regularization on least squares:
    w* = argmin_w Σi (yi − wᵀxi)² + α Σj wj²
This objective is used for the Ridge regression implementation.
1. It adds a penalty equivalent to the square of the magnitude of the coefficients.
2. Ridge minimization objective = LS Obj + α * (sum of squares of coefficients)
1. α = 0: the objective becomes the same as simple linear regression.
2. α = ∞: the coefficients will be zero. Why? Because of the infinite weight on the squares of the coefficients.
3. 0 < α < ∞: the coefficients will be somewhere between 0 and those of simple linear regression.
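These three regimes can be observed by sweeping α in scikit-learn's Ridge (again an assumed library; the data and α values are arbitrary for the sketch):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative only) with known true coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Fit Ridge at a near-zero, moderate, and huge alpha
coefs = {}
for alpha in [1e-3, 1.0, 1e6]:
    coefs[alpha] = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, coefs[alpha])
```

At α ≈ 0 the coefficients match plain least squares (close to the true [2.0, -1.0, 0.5]); at a huge α they are shrunk essentially to zero; in between they are shrunk partway.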
Difference between Ridge and LASSO
- Ridge regression produces models in which most parameters are small but nonzero, while LASSO produces sparse parameters: most of the coefficients become zero, and the model depends on a small subset of the features.
- When explanatory variables are correlated, LASSO will shrink the coefficient of one variable toward zero, while Ridge regression will shrink them more uniformly.
- There is another technique, known as Elastic Net, which combines both L1 and L2 regularization to balance the advantages and disadvantages of each.
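A minimal Elastic Net sketch with scikit-learn (an assumed library; the data and hyperparameters are arbitrary illustrations), where `l1_ratio` blends the two penalties: 1.0 is pure L1 (LASSO), 0.0 is pure L2 (Ridge), and values in between mix them:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data (illustrative only): two informative features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha sets the overall penalty strength; l1_ratio=0.5 weights
# the L1 and L2 terms equally
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
```

In practice both `alpha` and `l1_ratio` would be tuned, e.g. with `ElasticNetCV`.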