What is Backpropagation?
Backpropagation (short for "backward propagation of errors") is the algorithm that enables neural networks to learn from their mistakes. While forward propagation makes predictions, backpropagation calculates how to adjust the network's weights to improve those predictions.
Think of it like learning to throw a basketball. After each shot, you observe how far off you were and adjust your aim accordingly. Backpropagation does the same thing: it works out how much each weight contributed to the error so the weights can be adjusted to reduce future mistakes.
Backpropagation Flow
Errors flow backward: Output → Hidden Layers → Input
The Complete Learning Cycle
How Neural Networks Learn
1. Forward Pass: Input data flows through the network to produce a prediction
2. Calculate Loss: Compare the prediction with the actual answer to measure error
3. Backward Pass: Calculate gradients (how much each weight contributed to the error)
4. Update Weights: Adjust weights in the direction that reduces error
5. Repeat: Continue this process thousands of times until the network learns
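To make the cycle concrete, here is a minimal sketch that trains a single-weight model y = w · x on a few made-up points; every name and number below is illustrative rather than taken from a real dataset.
import numpy as np

# Made-up data generated by y = 2x, so the weight should learn to be about 2
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0                 # start with an uninformed weight
learning_rate = 0.05

for step in range(100):
    prediction = w * x                            # 1. forward pass
    loss = np.mean((prediction - y) ** 2)         # 2. calculate loss (MSE)
    gradient = np.mean(2 * (prediction - y) * x)  # 3. backward pass: dL/dw
    w = w - learning_rate * gradient              # 4. update the weight
                                                  # 5. repeat

print(w)   # close to 2.0 after the loop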
Understanding Loss Functions
Before we can fix errors, we need to measure them. Loss functions quantify how wrong our predictions are:
Mean Squared Error (MSE)
MSE = (1/n) × Σ(actual - predicted)²
Used for: Regression problems (predicting continuous values like house prices)
Why it works: Heavily penalizes large errors, encourages accurate predictions
Cross-Entropy Loss
CE = -Σ y × log(ŷ)
Used for: Classification problems (predicting categories like cat vs dog)
Why it works: Heavily penalizes confident wrong predictions, so the model learns to be confident only when it is correct
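Both losses take only a few lines of NumPy. This is a rough sketch with illustrative function and variable names, not code from any particular library:
import numpy as np

def mse(actual, predicted):
    # Mean squared error for regression targets
    return np.mean((actual - predicted) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Cross-entropy for one-hot labels and predicted probabilities;
    # eps keeps log() away from zero when the model is overconfident
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))                # 0.25
print(cross_entropy(np.array([0, 1, 0]), np.array([0.1, 0.8, 0.1])))  # about 0.22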
The Mathematics of Backpropagation
Backpropagation uses the chain rule from calculus to calculate gradients efficiently:
The Chain Rule in Action
To find how much a weight in an early layer affects the final error, we chain together partial derivatives through every layer in between. For a first-layer weight in a three-layer network (where z is a layer's weighted input and a its activation):
$$\frac{\partial L}{\partial w^{[1]}} = \frac{\partial L}{\partial a^{[3]}} \times \frac{\partial a^{[3]}}{\partial z^{[3]}} \times \frac{\partial z^{[3]}}{\partial a^{[2]}} \times \frac{\partial a^{[2]}}{\partial z^{[2]}} \times \frac{\partial z^{[2]}}{\partial a^{[1]}} \times \frac{\partial a^{[1]}}{\partial z^{[1]}} \times \frac{\partial z^{[1]}}{\partial w^{[1]}}$$
This allows us to calculate the gradient for any weight in any layer!
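As a sanity check, the chain can be evaluated numerically. The sketch below uses a toy network with one sigmoid neuron per layer and made-up weights; the analytic product of local derivatives agrees with a finite-difference estimate.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, w2, w3, y):
    # Three layers, one neuron each, squared-error loss
    a1 = sigmoid(w1 * x)
    a2 = sigmoid(w2 * a1)
    a3 = sigmoid(w3 * a2)
    loss = 0.5 * (a3 - y) ** 2
    return a1, a2, a3, loss

x, y = 1.5, 1.0
w1, w2, w3 = 0.4, -0.6, 0.8
a1, a2, a3, loss = forward(x, w1, w2, w3, y)

# Chain rule: multiply one local derivative per link in the chain
dL_dw1 = ((a3 - y) * a3 * (1 - a3)   # dL/da3 * da3/dz3
          * w3 * a2 * (1 - a2)       # dz3/da2 * da2/dz2
          * w2 * a1 * (1 - a1)       # dz2/da1 * da1/dz1
          * x)                       # dz1/dw1

# Finite-difference check: nudge w1 slightly and see how the loss changes
eps = 1e-6
loss_nudged = forward(x, w1 + eps, w2, w3, y)[-1]
print(dL_dw1, (loss_nudged - loss) / eps)   # the two values agree closely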
Step-by-Step Backpropagation
For the output layer:
# Error signal at the output (assumes MSE with a linear output,
# or softmax with cross-entropy, so the activation term cancels out)
dZ = prediction - actual_value
# Weight and bias gradients, averaged over the m examples in the batch
dW = (1 / m) * np.dot(dZ, previous_layer_output.T)
db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
For hidden layers:
# Propagate the error backward through the next layer's weights
dA = np.dot(W_next.T, dZ_next)
# Apply this layer's activation derivative element-wise
dZ = dA * activation_derivative(Z)
# Weight and bias gradients for this layer
dW = (1 / m) * np.dot(dZ, previous_layer_output.T)
db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
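Putting the two fragments together, a complete backward pass for a small two-layer network might look like the sketch below. It assumes a ReLU hidden layer, one column per example, and an output layer where dZ2 = predictions - targets (as with softmax plus cross-entropy); all names are illustrative.
import numpy as np

def backward(X, Y, cache, params):
    # cache holds Z1, A1, A2 from the forward pass; params holds W2
    Z1, A1, A2 = cache["Z1"], cache["A1"], cache["A2"]
    W2 = params["W2"]
    m = X.shape[1]                        # number of examples in the batch

    # Output layer
    dZ2 = A2 - Y
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)

    # Hidden layer: propagate backward, then apply ReLU's derivative
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = dA1 * (Z1 > 0)                  # ReLU derivative: 1 where Z1 > 0, else 0
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}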
Gradient Descent: The Optimization Engine
The gradients tell us which direction each weight should move; gradient descent uses them, together with the learning rate, to decide how far to move on each step.
🏔️ Interactive MNIST Training Visualization
Watch how gradient descent trains a digit classifier! The ball represents our model's accuracy rolling down the "error mountain":
The Update Rule
Weight Update Formula:
$$w_{new} = w_{old} - \alpha \times \frac{\partial L}{\partial w}$$
Where:
α (alpha): Learning rate - controls step size
∂L/∂w: Gradient - direction of steepest increase
We subtract because we want to go downhill (minimize error)
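In one dimension the rule is easy to watch in action. Here is a hedged sketch on a made-up loss L(w) = (w - 3)², whose minimum sits at w = 3:
w = 10.0            # arbitrary starting weight
alpha = 0.01        # learning rate

for step in range(1000):
    grad = 2 * (w - 3)          # dL/dw for L(w) = (w - 3)^2
    w = w - alpha * grad        # step against the gradient (downhill)

print(w)   # very close to 3.0, the minimum of the loss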
Learning Rate: The Critical Hyperparameter
Choosing the Right Learning Rate
⚠️ Too High (α = 0.1): Takes huge steps and may overshoot the minimum or even diverge
✓ Just Right (α = 0.01): Makes steady progress toward the minimum
⚡ Too Low (α = 0.0001): Takes tiny steps, so learning is very slow
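A small sketch of these three regimes on a made-up quadratic loss; the curvature, step counts, and exact rates are purely illustrative, and what counts as "too high" always depends on the shape of the actual loss surface.
# Toy loss L(w) = 12 * (w - 3)^2, so the gradient is 24 * (w - 3)
def run(alpha, steps=100):
    w = 10.0
    for _ in range(steps):
        w = w - alpha * 24 * (w - 3)
    return w

print(run(0.1))      # too high: the weight overshoots further each step and diverges
print(run(0.01))     # just right: ends up very close to 3.0
print(run(0.0001))   # too low: still far from 3.0 after 100 steps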
Advanced Training Concepts
Mini-Batch Training
Instead of using one example or all examples at once, we typically use mini-batches (32-256 examples):
Advantages: More stable updates, efficient use of hardware, faster convergence
Process: Calculate gradients for the batch, then update weights once
Epochs and Iterations
Training Terminology:
Iteration: One forward + backward pass on one mini-batch
Epoch: One complete pass through the entire training dataset
Example: 10,000 samples, batch size 100 = 100 iterations per epoch
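A sketch of how one epoch might be split into mini-batches, matching the arithmetic above (10,000 samples, batch size 100); the data shapes here are made up.
import numpy as np

num_samples, batch_size = 10_000, 100
X = np.random.randn(num_samples, 784)            # illustrative inputs
y = np.random.randint(0, 10, size=num_samples)   # illustrative labels

# One epoch: shuffle once, then walk through the data batch by batch
order = np.random.permutation(num_samples)
X, y = X[order], y[order]

iterations = 0
for start in range(0, num_samples, batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # ... forward pass, backward pass, and one weight update go here ...
    iterations += 1

print(iterations)   # 100 iterations per epoch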
Common Training Challenges
Vanishing Gradients
Problem: Gradients become very small in deep networks, causing early layers to learn slowly.
Solutions: Use ReLU activation, proper weight initialization, residual connections.
Exploding Gradients
Problem: Gradients become very large, causing unstable training.
Solutions: Gradient clipping, proper weight initialization, batch normalization.
Overfitting
Problem: Network memorizes training data but fails on new data.
Solutions: Dropout, regularization, early stopping, more training data.
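Two of these problems can be made concrete in a few lines. The sketch below shows why gradients shrink in deep sigmoid networks (the sigmoid derivative never exceeds 0.25) and one common guard against exploding gradients, clipping by the gradient norm; the layer count and clipping threshold are illustrative.
import numpy as np

# Vanishing gradients: chaining 10 sigmoid layers multiplies the gradient
# by at most 0.25 per layer, shrinking it by a factor of about a million
print(0.25 ** 10)            # roughly 1e-6

# Exploding gradients: rescale the gradient if its norm exceeds a threshold
def clip_by_norm(grad, max_norm=5.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

print(np.linalg.norm(clip_by_norm(np.array([300.0, 400.0]))))   # clipped to 5.0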
Putting It All Together: A Complete Training Example
# Training loop pseudocode
for epoch in range(num_epochs):
    for batch in training_data:
        # Forward pass: compute predictions for this mini-batch
        predictions = forward_pass(batch.inputs)
        # Calculate loss: measure how far the predictions are from the targets
        loss = loss_function(predictions, batch.targets)
        # Backward pass: compute gradients of the loss for every weight
        gradients = backward_pass(loss)
        # Update weights with gradient descent
        weights = weights - learning_rate * gradients
    # Validate performance after each epoch
    validate_model(validation_data)
Why Backpropagation Works
Backpropagation is powerful because it:
Efficiently calculates gradients for millions of parameters
Uses the chain rule to propagate error information backward
Systematically finds weight adjustments that reduce the error
Scales to very deep networks with proper techniques
Knowledge Check
1. What is the main purpose of backpropagation?
A) To make predictions
B) To calculate gradients and update weights to reduce error
C) To preprocess input data
D) To evaluate model performance
2. In which direction does information flow during backpropagation?
A) Input to output
B) Output to input (backward)
C) Bidirectional
D) Random direction
3. What mathematical concept enables backpropagation to work efficiently?
A) Integration
B) The chain rule
C) Matrix addition
D) Linear regression
4. In the weight update rule w_new = w_old - α × ∂L/∂w, why do we subtract the gradient?
A) To increase the error
B) To move in the direction of steepest descent (minimize error)
C) To normalize the weights
D) It's a mathematical convention
5. What happens if the learning rate is too high?
A) Training becomes very slow
B) The model might overshoot the minimum and fail to converge
C) The gradients become zero
D) The model achieves perfect accuracy
6. Which loss function is typically used for binary classification?
A) Mean Squared Error
B) Cross-entropy loss
C) Hinge loss
D) Absolute error
7. What is an epoch in neural network training?
A) One forward pass
B) One backward pass
C) One complete pass through the entire training dataset
D) One weight update
8. What is the vanishing gradient problem?
A) Gradients become too large
B) Gradients become very small in deep networks, slowing learning
C) Gradients disappear completely
D) Learning rate becomes zero