The Gradient Descent Algorithm

Tanay Joshi
6 min read · Nov 5, 2020


Optimization plays a central role in Machine Learning. Gradient Descent is an optimization algorithm used to find the coefficients of a function. In Simple Linear Regression, it is used to obtain the coefficients of the linear equation, i.e., b0 and b1.

Consider a person standing at some arbitrary location on a mountain who wants to get to the bottom. He takes one step at a time in the right direction, each step bringing him closer to the base. Gradually, after many steps, he reaches the base. The mountain is analogous to the cost function (a convex cost function, to be specific), the base is analogous to the global minimum, and the objective is to reach this minimum starting from arbitrary coefficient values.

The learning step or Learning Rate (a.k.a. alpha or LR) is nothing but how big a step we take toward the global minimum. There is no standard value for it; it is entirely problem-specific, but a common starting point is 0.001. This is just an initial guess, and one has to experiment with different values of LR.

We usually initialize our coefficients b0 and b1 to (0, 0). This is standard practice in Machine Learning, but any starting values will converge to the same solution (given the loss function is convex in nature) because Gradient Descent adapts its updates at every step.
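A quick way to see this in action is to run gradient descent from two very different starting points on the same data. The sketch below uses a hypothetical toy dataset that lies exactly on the line y = 2x + 1 (the `fit` helper and its parameters are assumptions for illustration, not part of the original article):

```python
import numpy as np

def fit(b0, b1, lr=0.05, epochs=2000):
    """Run gradient descent on a toy dataset from the given start point."""
    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = 2.0 * X + 1.0          # points lying exactly on y = 2x + 1
    m = len(X)
    for _ in range(epochs):
        pred = b1 * X + b0
        b0 -= lr * (1 / m) * np.sum(pred - y)
        b1 -= lr * (1 / m) * np.sum((pred - y) * X)
    return b0, b1

# Two very different starting points land on (almost) the same line.
print(fit(0.0, 0.0))
print(fit(50.0, -30.0))
```

Because the cost surface is convex, both runs end up near b0 = 1 and b1 = 2 regardless of where they started.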

The standard Line Equation is:

y = mx + c

OR

h(x) = b1*x + b0

where h(x) is our predicted value.
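As a tiny illustration of the line equation, here is h(x) with arbitrary coefficient values chosen purely for the example:

```python
# Hypothetical coefficients, just to illustrate h(x) = b1*x + b0.
b0, b1 = 1.5, 2.0

def h(x):
    """Predicted value for input x."""
    return b1 * x + b0

print(h(4))   # 2.0*4 + 1.5 = 9.5
```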

The rule for updating the coefficients b0 and b1 is:

b0 = b0 - alpha * (1/m) * Σ (h(xi) - yi)

b1 = b1 - alpha * (1/m) * Σ (h(xi) - yi) * xi

where h(xi) is the prediction for the ith observation, yi is the corresponding actual value, xi is the ith observation, and m is the total number of observations in the data-set.

Based on the predictions calculated by the line equation, the coefficients b0 (theta0) and b1 (theta1) are adjusted. This is done by computing the slope of the cost function with respect to b0 or b1, depending on which updating rule is being applied. Our cost function is denoted by J(b0, b1), or simply J, and the slope is written in the general form dJ/d_theta or dJ/db. The slope of the cost function is simply its derivative.

Notice that the summation term in each rule is nothing but the corresponding derivative, so the rules can be written compactly as b0 = b0 - alpha * (dJ/db0) and b1 = b1 - alpha * (dJ/db1). This simplifies the equations.
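A single update step can be sketched as follows, assuming the cost is the (1/2m)-scaled MSE so that its gradients match the update rules above (the data values here are made up for the example):

```python
import numpy as np

# One gradient-descent step on toy data.
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
b0, b1 = 0.0, 0.0
alpha = 0.1
m = len(X)

pred = b1 * X + b0                       # h(xi) for every observation
dJ_db0 = (1 / m) * np.sum(pred - y)      # average residual
dJ_db1 = (1 / m) * np.sum((pred - y) * X)

b0 = b0 - alpha * dJ_db0
b1 = b1 - alpha * dJ_db1
print(b0, b1)
```

Starting from (0, 0), both residual sums are negative here, so both coefficients move upward after the step, toward the true line y = 2x.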

Calculating the Derivatives of the cost function J(b0, b1)

Let us use the Mean Squared Error (MSE) as the cost throughout the implementation. To keep the derivatives clean, it is convenient to write it with an extra factor of 1/2 (which cancels the 2 produced by differentiation):

J(b0, b1) = (1/2m) * Σ (h(xi) - yi)²

Calculating dJ/db0 and dJ/db1:

dJ/db0 = (1/m) * Σ (h(xi) - yi)

dJ/db1 = (1/m) * Σ (h(xi) - yi) * xi

Thus, dJ/db0 is the average of the differences between the predicted and the actual values, and dJ/db1 is the average of those differences weighted by xi. Each summation produces a single real value.
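One way to gain confidence in these formulas is to compare the analytic gradient against a numerical (finite-difference) estimate. This sketch assumes the (1/2m)-scaled MSE from above, with made-up data:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
b0, b1 = 0.5, 1.0
m = len(X)

def J(b0, b1):
    """Cost J = (1/2m) * sum((h(x) - y)^2)."""
    return (1 / (2 * m)) * np.sum((b1 * X + b0 - y) ** 2)

# Analytic derivative: the average of the residuals.
analytic = (1 / m) * np.sum(b1 * X + b0 - y)

# Central finite-difference estimate of dJ/db0.
eps = 1e-6
numeric = (J(b0 + eps, b1) - J(b0 - eps, b1)) / (2 * eps)
print(analytic, numeric)
```

Because J is quadratic in b0, the central difference matches the analytic value essentially to machine precision.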

What is the significance of the slope of the cost function at coefficients? Why are we interested in calculating derivatives?

In general, a slope or derivative is the rate of change of one quantity with respect to another. A derivative measures the steepness of the graph at some point ((b0, b1) in our problem). So the slope of the cost function with respect to the coefficients indicates how the cost function is changing at the current (b0, b1). We use this slope to update our coefficients.

Now, dJ/db0 or dJ/db1 could be either positive or negative, i.e., the slope could be upward or downward. For simplicity, consider only the updating rule for the coefficient b0. If the calculated derivative dJ/db0 is positive, the slope is upward. This means b0 is on the right-hand side of the global minimum. Since we want to get as close as possible to the global minimum, which lies to the left of b0, we need to subtract some value from the old b0. With dJ/db0 positive, the updating rule becomes b0 = b0 - alpha*(some positive value). This is how b0 shifts to the left when the derivative is positive.

Conversely, if dJ/db0, i.e., the slope, is negative, then we are on the left-hand side of the global minimum and we want to shift b0 to the right. Subtracting a negative quantity adds a positive one, so the rule b0 = b0 - alpha*(dJ/db0) becomes b0 = b0 + alpha*|dJ/db0|. This is how b0 shifts to the right when the derivative is negative.

The same procedure is followed for b1.

Thus, the slope is calculated which makes the algorithm adaptive in nature.
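The sign argument above can be demonstrated on a simple one-dimensional convex cost J(b) = (b - 3)², whose minimum sits at b = 3 (the function and values are illustrative, not from the article's regression problem):

```python
def dJ_db(b):
    """Derivative of J(b) = (b - 3)^2."""
    return 2 * (b - 3)

alpha = 0.1

b_right = 5.0               # right of the minimum -> slope is positive
b_right_new = b_right - alpha * dJ_db(b_right)   # moves left, toward 3

b_left = 1.0                # left of the minimum -> slope is negative
b_left_new = b_left - alpha * dJ_db(b_left)      # moves right, toward 3

print(b_right_new, b_left_new)   # 4.6 1.4
```

In both cases the update b = b - alpha*(dJ/db) moves b toward the minimum, which is exactly why subtracting the gradient works regardless of which side we start on.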

Implementing the Gradient Descent

  1. We begin with some random values for coefficients b0 and b1.
  2. A prediction is made for an observation using the line equation and substituting the values of coefficients and the ith data point.
  3. The coefficients are updated according to their updating rule.
  4. Compute the loss function. Any loss function could be used: MSE, RMSE, MAE, etc.
  5. We repeat steps 2 to 4 until our cost function is sufficiently small, i.e., until we reach convergence.

Below is the Python implementation:

import numpy as np
import matplotlib.pyplot as plt

plt.style.use('dark_background')

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([10, 20, 30, 40, 56, 60, 74, 87, 91, 102])

def gradient_descent(X, y, a, b1, b0, epochs, title):
    plt.title(title)
    plt.xlabel('b1')          # plotting b1 against the cost J
    plt.ylabel('J(b1)')
    m = len(X)
    for i in range(epochs):
        pred = b1*X + b0
        plt.scatter(b1, (1/m)*sum((pred - y)**2), color='orange')
        db0 = (1/m)*sum(pred - y)          # dJ/db0
        db1 = (1/m)*sum((pred - y)*X)      # dJ/db1
        b1 = b1 - a*db1
        b0 = b0 - a*db0
    plt.show()
    return b1, b0

b0, b1 = 0, 0
a = 0.01
b1, b0 = gradient_descent(X, y, a, b1, b0, epochs=1000,
                          title="Gradient descent with LR = {}".format(a))

# Checking the Gradient Descent plot when LR is reduced
b1, b0 = 0, 0   # reset b1 and b0 to 0
a = 0.001
b1, b0 = gradient_descent(X, y, a, b1, b0, epochs=1000,
                          title="Gradient descent with LR = {}".format(a))
print("b1 = {} and b0 = {}".format(b1, b0))

After running the code, we get:

b1 = 10.206765724453385 and b0 = 1.1452627019638673

Plot of GDA when LR/a is set to 0.01

When LR or a is set to a higher value, the algorithm takes bigger steps and reaches the optimum more quickly.

However, a very high learning rate can cause the algorithm to diverge.

Plot of GDA when LR/a is set to 0.001

On the other hand, when LR is reduced, it requires more steps to reach the global optimum because we are making very tiny updates to our parameters.
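One rough way to quantify this trade-off is to count how many epochs each learning rate needs to drive the MSE below a fixed threshold on the same data. The `epochs_to_converge` helper and the tolerance value are assumptions made for this sketch:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([10, 20, 30, 40, 56, 60, 74, 87, 91, 102], dtype=float)

def epochs_to_converge(a, tol=25.0, max_epochs=100_000):
    """Epochs of gradient descent until the MSE drops below tol."""
    b0 = b1 = 0.0
    for epoch in range(1, max_epochs + 1):
        pred = b1 * X + b0
        if np.mean((pred - y) ** 2) < tol:
            return epoch
        b0 -= a * np.mean(pred - y)
        b1 -= a * np.mean((pred - y) * X)
    return max_epochs

fast = epochs_to_converge(0.01)
slow = epochs_to_converge(0.001)
print(fast, slow)   # the smaller learning rate needs more epochs
```

With a tenfold smaller learning rate, noticeably more epochs are needed to reach the same loss level, which is exactly the behavior visible in the two plots.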

Comparison of different Learning rates
