In this article, I am going to implement the Gradient Descent Algorithm from scratch,
1- What is Linear Regression?
2- Cost Function or Loss Function
3- Gradient Descent Algorithm
4- Implementing the python code for Gradient Descent Algorithm
Without further ado, let’s dive in.
Linear Regression is a Supervised Learning Algorithm that predicts continuous values, for example: Predicting house prices?
It is a linear approach to modeling the relationship between the dependent variable (Y) and one(Simple Linear Regression) or more(Multiple Linear Regression) independent variables (X).
We will define a linear relationship between these two variables as follows:
Cost Function or Loss Function:
difference between Loss function and Cost Function: https://stats.stackexchange.com/questions/179026/objective-function-cost-function-loss-function-are-they-the-same-thing
We will use the Mean Squared Error as our cost function. MSE measures the average squared difference between an observation’s actual and predicted values. The output is a single number representing the cost, or score, associated with our current set of weights. Our goal is to minimize MSE to improve the accuracy of our model.
Given our simple linear equation y = mx + b, we can calculate MSE as:
Now, we know that our Cost function is MSE, let’s dive into the interesting part which is minimizing it in order to find the best m and b values for our hypothesis: y = mx + b.
To minimize MSE we use Gradient Descent to calculate the gradient of our cost function.
Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters(m and b) of our model. Parameters refer to coefficients in Linear Regression.
Starting at the top of the mountain, we take our first step downhill in the direction specified by the negative gradient. Next, we recalculate the negative gradient (passing in the coordinates of our new point) and take another step in the direction it specifies. We continue this process iteratively until we get to the bottom of our graph, or to a point where we can no longer move downhill–a local minimum.(Note that our Cost Function MSE is a convex function(bowl shape), which means we will eventually end up at a local minimum)
Learning Rate (alpha):
The size of these steps is called the learning rate. With a high learning rate, we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
Now let’s run gradient descent using our new cost function. There are two parameters in our cost function we can control: m (weight) and b (bias). Since we need to consider the impact each one has on the final prediction, we need to use partial derivatives. We calculate the partial derivatives of the cost function with respect to each parameter and store the results in a gradient.
To solve for the gradient, we iterate through our data points using our new mm and bb values and compute the partial derivatives. This new gradient tells us the slope of our cost function at our current position (current parameter values) and the direction we should move to update our parameters. The size of our update is controlled by the learning rate.
m = m - L * Dm
b = b - L * Db
Where: L is the learning rate(alpha), Dm is the derivative with respect to m and Db is the derivative with respect to b.
If you have any questions or clarifications please, do not hesitate to contact me, Thank you.
Full Source Code + DataSet: https://github.com/jairiidriss/GradientDescentAlgorithmFromScratch