Gradient Descent is a fundamental optimization algorithm used in machine learning and deep learning to minimize functions, particularly loss functions in training models. Understanding how it works and implementing it from scratch can provide deeper insights into machine learning algorithms and their optimization processes.
What is Gradient Descent?
Gradient Descent is an iterative optimization algorithm used to minimize a function by moving in the direction of the steepest descent, as defined by the negative of the gradient. It is extensively used in machine learning for optimizing cost functions in models like linear regression, logistic regression, and deep learning neural networks.
Key Concept
- The algorithm starts with an initial set of parameters (weights) and iteratively updates them to minimize the loss function.
- The learning rate (\(\alpha\)) determines the step size in each iteration.
- The process continues until the changes become negligible or a predefined number of iterations is reached.
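The loop described above can be sketched in a few lines of Python. This is a minimal one-dimensional illustration, not the full implementation developed later: the function \(f(x) = (x - 3)^2\), its hand-computed gradient \(2(x - 3)\), and all names here are illustrative.

```python
def minimize(x0, learning_rate=0.1, tol=1e-8, max_iters=1000):
    """Minimize f(x) = (x - 3)^2 by following its negative gradient."""
    x = x0
    for _ in range(max_iters):
        grad = 2 * (x - 3)       # gradient of f at the current x
        step = learning_rate * grad
        x -= step                # move against the gradient
        if abs(step) < tol:      # stop once changes become negligible
            break
    return x

print(minimize(0.0))  # approaches the minimizer x = 3
```

Note how the two stopping criteria from the list above both appear: a step-size threshold `tol` and a cap `max_iters` on the number of iterations.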
Types of Gradient Descent
There are three main types of Gradient Descent:
- Batch Gradient Descent
- Computes the gradient using the entire dataset.
- More stable updates but computationally expensive for large datasets.
- Stochastic Gradient Descent (SGD)
- Computes the gradient using a single random training example at each iteration.
- Faster but may have higher variance in updates.
- Mini-Batch Gradient Descent
- Computes the gradient using a small batch of data samples.
- Strikes a balance between stability and computational efficiency.
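The three variants can be contrasted compactly: each evaluates the same gradient formula and differs only in which slice of the data it sees per update. The synthetic data and the batch size of 16 below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))   # 100 examples, 2 features
y = rng.random(100)
theta = np.zeros(2)

def grad(Xb, yb, theta):
    # MSE gradient over whatever slice of the data we are given
    return Xb.T @ (Xb @ theta - yb) / len(yb)

g_batch = grad(X, y, theta)                # batch: all 100 examples
i = rng.integers(len(y))
g_sgd = grad(X[i:i+1], y[i:i+1], theta)    # stochastic: one random example
idx = rng.choice(len(y), size=16, replace=False)
g_mini = grad(X[idx], y[idx], theta)       # mini-batch: 16 examples
```

All three produce a gradient of the same shape; they trade off how noisy that estimate is against how much data each update must touch.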
The Mathematics Behind Gradient Descent
The goal is to find the optimal parameters \(\theta\) that minimize the cost function \(J(\theta)\). A common example is the mean squared error (MSE) for linear regression:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$
where:
- \(h_\theta(x)\) is the hypothesis function \(h_\theta(x) = \theta_0 + \theta_1 x\) for linear regression.
- \(y\) is the actual target value.
- \(m\) is the number of training examples.
The gradient of \(J(\theta)\) with respect to \(\theta_j\) is computed as:
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
The parameters \(\theta\) are updated using:
$$\theta_j := \theta_j - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_j}$$
where \(\alpha\) is the learning rate.
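The gradient expression follows from differentiating \(J(\theta)\) term by term with the chain rule:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{2m} \sum_{i=1}^{m} 2\,(h_\theta(x^{(i)}) - y^{(i)})\,\frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$$

since \(\partial h_\theta(x)/\partial \theta_j = x_j\) (taking \(x_0 = 1\) for the intercept term \(\theta_0\)). The factor \(\frac{1}{2}\) in the cost exists precisely to cancel the 2 produced by differentiating the square.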
Implementing Gradient Descent from Scratch in Python
Let’s implement a simple Batch Gradient Descent for linear regression.
Step 1: Import Required Libraries
import numpy as np
import matplotlib.pyplot as plt
Step 2: Define the Cost Function
def compute_cost(X, y, theta):
    # MSE cost: (1 / 2m) * sum((X.theta - y)^2)
    m = len(y)
    predictions = X.dot(theta)
    cost = (1 / (2 * m)) * np.sum(np.square(predictions - y))
    return cost
Step 3: Implement Gradient Descent
def gradient_descent(X, y, theta, learning_rate, iterations):
    m = len(y)
    theta = theta.copy()  # copy so the caller's array is not mutated in place
    cost_history = []
    for i in range(iterations):
        # Vectorized gradient: (1/m) * X^T (X.theta - y)
        gradient = (1 / m) * X.T.dot(X.dot(theta) - y)
        theta -= learning_rate * gradient
        cost_history.append(compute_cost(X, y, theta))
    return theta, cost_history
Step 4: Generate Data and Run the Algorithm
# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add bias term (column of ones)
X_b = np.c_[np.ones((100, 1)), X]

# Initialize theta
theta = np.random.randn(2, 1)

# Hyperparameters
learning_rate = 0.1
iterations = 1000

# Run gradient descent
theta_final, cost_history = gradient_descent(X_b, y, theta, learning_rate, iterations)
Step 5: Visualizing the Cost Function Convergence
plt.plot(range(iterations), cost_history, 'b-')
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('Convergence of Gradient Descent')
plt.show()
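As an optional sanity check not in the walkthrough above, the same least-squares problem can be solved in closed form; the `theta_final` from gradient descent should agree closely with this independent answer. The snippet regenerates the same synthetic data so it runs on its own.

```python
import numpy as np

# Same synthetic data as in Step 4
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

# Closed-form least-squares solution as an independent reference
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_exact.ravel())  # close to [4, 3], up to the injected noise
```

If the cost curve has flattened out, `theta_final` and `theta_exact` should match to several decimal places.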
Applications and Best Practices
Applications
- Linear Regression: Optimizing weights to fit a straight-line model.
- Logistic Regression: Minimizing the loss function for classification tasks.
- Neural Networks: Training deep learning models with backpropagation.
- Support Vector Machines (SVMs): Finding optimal hyperplanes in classification.
Best Practices
- Choosing the Right Learning Rate:
- Too high: Can overshoot and fail to converge.
- Too low: Slow convergence and higher computation time.
- Feature Scaling:
- Normalizing data improves convergence speed.
- Stopping Criteria:
- Use a threshold for cost reduction to stop training.
- Momentum-based Optimizations:
- Adam, RMSprop, and other optimizers improve convergence.
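The core momentum idea behind those optimizers can be sketched briefly: instead of stepping along the raw gradient, keep an exponentially decaying running sum of past gradients. The function name, test problem, and hyperparameters below are illustrative; Adam and RMSprop add further per-parameter scaling on top of this.

```python
import numpy as np

def gd_momentum(grad_fn, theta0, learning_rate=0.01, beta=0.9, iterations=200):
    # velocity accumulates past gradients, damping oscillation across
    # steep directions and accelerating progress along shallow ones
    theta = np.asarray(theta0, dtype=float).copy()
    velocity = np.zeros_like(theta)
    for _ in range(iterations):
        velocity = beta * velocity + grad_fn(theta)
        theta -= learning_rate * velocity
    return theta

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta
theta = gd_momentum(lambda t: 2 * t, np.array([5.0, -3.0]))
```

With `beta = 0` this reduces exactly to the plain gradient descent implemented earlier.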
Conclusion
Gradient Descent is a powerful optimization algorithm that underpins many machine learning models. Implementing it from scratch not only helps in understanding its inner workings but also provides a strong foundation for working with advanced optimizers in deep learning.
By carefully tuning hyperparameters and using advanced techniques like momentum-based optimization, one can improve performance and efficiency in training models. Understanding gradient descent thoroughly can significantly boost your expertise in machine learning and AI.