Optimization refers to the task of either maximizing or minimizing some function f(x) by altering x. The function we want to minimize or maximize is called the objective function; when we are minimizing it, we may also call it the cost function, loss function, or error function.

Generally, in classical optimization problems we know what our data look like and which quantity we want to improve, but in machine learning we do not know in advance what new data will look like. So in machine learning we perform optimization on training data and then check performance on new validation data.

Many popular machine learning algorithms depend on optimization techniques, including linear regression, k-nearest neighbors, and neural networks. The applications of optimization are limitless, and it is a widely researched topic in both academia and industry.

What is Gradient Descent?

Gradient descent is sometimes described with the analogy of a blindfolded man trying to find his way from somewhere on a mountainside to the lowest point of the valley. He cannot see the topography, so he takes a step in the direction of steepest decline, then another, and so on. The sizes of his steps are proportional to the steepness of the terrain at his current position: he takes big steps when the terrain is steep and smaller steps as the terrain levels out. If he were to keep taking large steps, he might accidentally step over the valley's lowest point; he would then need to change direction and head toward the lowest point again. By taking small steps he can avoid stepping back and forth over the lowest point. The blindfolded man continues walking until he cannot take a step that decreases his altitude; at this point he has found the bottom of the valley.

Formally, gradient descent is an optimization algorithm that can be used to estimate a local minimum of a function. We can use gradient descent to find the values of a model's parameters that minimize the cost function. Gradient descent iteratively updates the parameter values by computing the partial derivatives of the cost function.
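The update rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names (`gradient_descent`) and the example objective f(w) = (w - 3)² are chosen here for demonstration only.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Iteratively move the parameter w against its gradient."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # step in the direction of steepest descent
    return w

# Minimize f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_min, 4))  # → 3.0
```

Each iteration moves the parameter a small amount opposite to the gradient, so the cost shrinks until the gradient (and hence the step) is effectively zero.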

How does Gradient Descent work?

As noted above, gradient descent estimates a local minimum of a function. A three-dimensional plot of a convex cost function over all possible values of the parameters looks like a bowl; the bottom of the bowl is the sole minimum. Non-convex cost functions have many local minima: plots of their cost functions have many peaks and valleys. Gradient descent is only guaranteed to find a local minimum; it will find a valley, but not necessarily the lowest valley.
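The dependence on the starting point can be demonstrated with a small sketch. The function f(w) = (w² - 1)² used below is an assumed example chosen because it has two valleys, at w = -1 and w = 1; the helper name `descend` is likewise illustrative.

```python
def descend(grad, w, lr=0.05, steps=200):
    """Run plain gradient descent from a given starting point w."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# f(w) = (w**2 - 1)**2 has two local minima; its gradient is 4*w*(w**2 - 1).
grad = lambda w: 4 * w * (w**2 - 1)

right_min = descend(grad, 0.5)   # starts right of the peak at w = 0
left_min = descend(grad, -0.5)   # starts left of the peak
print(round(right_min, 4), round(left_min, 4))  # → 1.0 -1.0
```

Starting on different sides of the central peak, gradient descent rolls into different valleys, which is exactly why it finds *a* minimum but not necessarily the lowest one.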

An important hyperparameter of gradient descent is the learning rate, which controls the size of the blindfolded man's steps. If the learning rate is small enough, the cost will decrease with each iteration. As the learning rate decreases, however, the time required for gradient descent to converge increases: the blindfolded man takes longer to reach the bottom if he takes small steps. But if he takes large steps, he may repeatedly overstep the bottom of the valley; that is, gradient descent could oscillate around the optimal values of the parameters.
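Both failure modes are easy to see on the simple quadratic f(w) = w², whose gradient is 2w. This is a hedged sketch: the function `run` and the specific learning rates are illustrative choices, not prescribed values.

```python
def run(lr, w=10.0, steps=50):
    """Record the parameter trajectory of gradient descent on f(w) = w**2."""
    trace = [w]
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w**2 is 2w
        trace.append(w)
    return trace

slow = run(0.05)  # small steps: w shrinks by a factor of 0.9 each iteration
fast = run(1.1)   # oversized steps: w flips sign and grows each iteration
```

With the small learning rate the trajectory decays steadily but slowly toward the minimum at 0; with the oversized learning rate each update jumps past the minimum, and the oscillations grow instead of converging.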


Gradient Descent can be classified on the basis of two methods: