Forward Propagation

Learn how data flows through neural networks to make predictions

20-25 minutes · Intermediate Level · 8 Quiz Questions

What is Forward Propagation?

Forward Propagation is the process by which input data flows through a neural network from the input layer to the output layer to produce a prediction. Think of it as the "thinking" process of the neural network - data enters, gets processed layer by layer, and produces a final answer.

Imagine a factory assembly line where raw materials (input data) enter at one end, go through various processing stations (hidden layers), and emerge as a finished product (prediction) at the other end. Each station transforms the materials based on specific instructions (weights and biases).

Data Flow in Neural Networks

[Diagram: Input (raw data) → Hidden (processing) → Hidden (processing) → Output (prediction)]

Information flows in one direction: Input → Hidden Layers → Output

The Step-by-Step Process

Forward Propagation Algorithm

1 Input Layer: Feed the input data into the network. No computation happens here - just data entry.
2 Weight Multiplication: Multiply each input by its corresponding weight for connections to the next layer.
3 Sum and Add Bias: Calculate the weighted sum of inputs and add the bias term for each neuron.
4 Apply Activation: Pass the weighted sum through an activation function to get the neuron's output.
5 Repeat for Each Layer: Use the outputs from the current layer as inputs to the next layer.
6 Final Output: The output layer produces the network's prediction or classification.
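
To make steps 2-4 concrete, here is a minimal single-neuron sketch; the inputs, weights, and bias are made-up illustrative values, not trained parameters:

import numpy as np

# Hypothetical values for one neuron with three inputs
x = np.array([0.5, 0.3, 0.8])   # inputs from the previous layer
w = np.array([0.4, -0.2, 0.6])  # one weight per input connection
b = 0.1                         # bias term

z = np.dot(w, x) + b            # steps 2-3: weighted sum plus bias
a = max(0.0, z)                 # step 4: ReLU activation
print(round(float(z), 2), round(float(a), 2))   # 0.72 0.72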

Mathematical Formulation

Let's break down the mathematics behind forward propagation:

For each neuron in layer l:

$$z^{[l]} = W^{[l]} \cdot a^{[l-1]} + b^{[l]}$$ $$a^{[l]} = \sigma(z^{[l]})$$

Where:

  • z[l]: Weighted sum (pre-activation) for layer l
  • W[l]: Weight matrix for layer l
  • a[l-1]: Activations from previous layer
  • b[l]: Bias vector for layer l
  • σ: Activation function
  • a[l]: Activations for layer l
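
These two formulas translate almost line-for-line into code. Below is a minimal sketch, assuming NumPy, sigmoid activations in every layer, and a hypothetical 3-2-1 architecture with random (untrained) weights:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, weights, biases):
    # Apply z = W·a + b, then a = σ(z), for each layer in turn
    for W, b in zip(weights, biases):
        z = W @ a + b      # pre-activation for this layer
        a = sigmoid(z)     # activation passed on to the next layer
    return a               # output of the final layer

weights = [np.random.randn(2, 3), np.random.randn(1, 2)]
biases  = [np.random.randn(2), np.random.randn(1)]
print(forward(np.array([0.5, 0.1, 0.9]), weights, biases))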

Real MNIST Digit Classification

Let's follow a handwritten digit "8" through a complete neural network:

MNIST Network Architecture:

  • Input Layer: 784 neurons (28×28 pixel image of handwritten digit)
  • Hidden Layer: 128 feature detectors (curves, lines, corners)
  • Output Layer: 10 neurons (probability for each digit 0-9)

Step 1: Digit Image Input

Input Image: 28×28 pixel handwritten "8"
Raw pixels: [0.0, 0.2, 0.8, 0.9, 0.8, 0.2, 0.0, ...] (784 values)
Each pixel: 0.0 = white background, 1.0 = black ink

Step 2: Feature Detection (Hidden Layer)

Feature detectors identify digit patterns:

Curve detector (top and bottom loops): z₁ = Σ(pixels × curve_weights) + bias = 3.2
a₁ = ReLU(3.2) = 3.2 ✅ Strong curves detected

Intersection detector (middle cross): z₂ = Σ(pixels × cross_weights) + bias = 2.8
a₂ = ReLU(2.8) = 2.8 ✅ Intersection found

Step 3: Digit Classification (Output Layer)

Combining features to classify the digit:

All 10 outputs after softmax:
[0.01, 0.02, 0.03, 0.02, 0.01, 0.04, 0.02, 0.03, 0.78, 0.04]
Prediction: Digit "8" with 78% confidence! 🎯
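
A sketch of this 784 → 128 → 10 forward pass in NumPy; the weights here are random stand-ins for illustration (a real prediction would use trained parameters), with standard ReLU and softmax definitions:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

x = np.random.rand(784)              # flattened 28×28 image, pixels in [0, 1]
W1, b1 = np.random.randn(128, 784) * 0.01, np.zeros(128)
W2, b2 = np.random.randn(10, 128) * 0.01, np.zeros(10)

h = relu(W1 @ x + b1)                # hidden layer: 128 feature activations
p = softmax(W2 @ h + b2)             # output layer: 10 digit probabilities
print(p.sum(), p.argmax())           # probabilities sum to 1; argmax = predicted digit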

Common Activation Functions in Forward Pass

Activation Functions and Their Roles

📊 ReLU (Hidden Layers): f(x) = max(0, x) - Fast to compute, helps mitigate vanishing gradients
📈 Sigmoid (Binary Classification): f(x) = 1/(1 + e^(-x)) - Outputs probability between 0 and 1
🎯 Softmax (Multi-class): Converts outputs to probability distribution that sums to 1
⚖️ Tanh (Normalized Outputs): f(x) = (e^x - e^(-x))/(e^x + e^(-x)) - Outputs between -1 and 1
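
Each of these is a one-liner in NumPy. A minimal sketch using the standard textbook definitions (not tied to any particular framework's API):

import numpy as np

def relu(z):            # hidden layers
    return np.maximum(0.0, z)

def sigmoid(z):         # binary classification output
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):            # outputs normalized to (-1, 1)
    return np.tanh(z)

def softmax(z):         # multi-class output; returns a distribution summing to 1
    e = np.exp(z - np.max(z))   # shift by max for numerical stability
    return e / np.sum(e)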

Matrix Operations in Practice

Neural networks use matrix operations for efficient computation:

Vectorized Forward Pass:

import numpy as np

W = np.random.randn(128, 784)  # one weight row per neuron
b = np.random.randn(128)       # one bias per neuron
x = np.random.rand(784)        # a single input example
# Instead of computing each neuron's pre-activation individually:
z = np.zeros(128)
for i in range(128):
    z[i] = np.dot(W[i], x) + b[i]

# Use matrix multiplication:
Z = W @ x + b  # same result, much faster!

This allows processing entire batches of data simultaneously, making training much more efficient.

Practical Considerations

Batch Processing

In practice, we don't process one example at a time. Instead, we process batches of examples simultaneously using matrix operations. This is much more efficient and allows for better hardware utilization.
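
A minimal sketch of batch processing, assuming the common row-major convention of one example per row (the layer sizes are arbitrary illustrative values):

import numpy as np

batch = np.random.rand(32, 784)       # 32 examples, one per row
W, b = np.random.randn(128, 784), np.zeros(128)

# One matrix multiplication computes the pre-activations
# for all 32 examples and all 128 neurons at once.
Z = batch @ W.T + b                   # shape: (32, 128)
A = np.maximum(0.0, Z)                # ReLU applied element-wise
print(A.shape)                        # (32, 128)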

Layer Dimensions

The dimensions of weight matrices are crucial. For a layer with n inputs and m outputs, the weight matrix is m×n. This ensures proper matrix multiplication: (m×n) × (n×1) = (m×1).
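
A quick shape check of the m×n rule, using a hypothetical layer with n = 100 inputs and m = 50 outputs:

import numpy as np

n, m = 100, 50                      # inputs, outputs
W = np.random.randn(m, n)           # weight matrix is m×n = 50×100
x = np.random.rand(n)               # input vector of length n
b = np.zeros(m)

z = W @ x + b                       # (50×100) × (100×1) = (50×1)
print(W.shape, x.shape, z.shape)    # (50, 100) (100,) (50,)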

🔢 MNIST Digit Classification Network

Watch a real digit recognition network in action! Adjust pixel intensities to see classification:

[Interactive demo panels: Pixel Inputs (top, middle intensities) → Feature Detection (curves, lines) → Digit Recognition (digit "8" confidence)]

MNIST Classification Process

Curve Feature Detector:

z₁ = (0.9 × 0.6) + (0.7 × 0.4) + 0.1 = 0.92

curves = ReLU(0.92) = 0.92

Line Feature Detector:

z₂ = (0.9 × 0.2) + (0.7 × 0.3) + 0.15 = 0.54

lines = ReLU(0.54) = 0.54

Digit "8" Classifier:

z = (0.92 × 0.8) + (0.54 × 0.2) + (-0.1) = 0.74

confidence = Sigmoid(0.74) ≈ 0.68 → Predicted: "8" (68% confident)
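
These three computations can be reproduced in a few lines; the weights 0.6/0.4, 0.2/0.3, 0.8/0.2 and the biases are the illustrative values from the example above, not trained parameters:

import numpy as np

def relu(z):    return max(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

top, middle = 0.9, 0.7                               # pixel inputs

curves = relu(top * 0.6 + middle * 0.4 + 0.10)       # 0.92
lines  = relu(top * 0.2 + middle * 0.3 + 0.15)       # 0.54
conf   = sigmoid(curves * 0.8 + lines * 0.2 - 0.1)   # ≈ 0.68

print(round(curves, 2), round(lines, 2), round(conf, 2))   # 0.92 0.54 0.68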

Forward Propagation vs. Training

It's important to distinguish between forward propagation and training:

  • Forward Propagation: Used both during training and inference (making predictions)
  • During Training: Forward pass → calculate loss → backward pass (backpropagation) → update weights
  • During Inference: Forward pass only → get prediction

Knowledge Check

Test your understanding of forward propagation

1. What is the main purpose of forward propagation?

A) To update weights in the network
B) To calculate the loss function
C) To process input data through the network to make predictions
D) To optimize the learning rate

2. In which direction does information flow during forward propagation?

A) Output to input
B) Input to output
C) Bidirectional
D) Random direction

3. What happens at each neuron during forward propagation?

A) Only weight multiplication
B) Weighted sum, add bias, apply activation function
C) Only activation function application
D) Random number generation

4. What is the mathematical formula for the pre-activation value in layer l?

A) z[l] = W[l] + a[l-1] + b[l]
B) z[l] = W[l] × a[l-1] + b[l]
C) z[l] = W[l] × a[l-1] - b[l]
D) z[l] = W[l] / a[l-1] + b[l]

5. Why do we use matrix operations in forward propagation?

A) They are more accurate
B) They allow efficient batch processing and faster computation
C) They reduce memory usage
D) They are easier to understand

6. What activation function is commonly used in hidden layers?

A) Softmax
B) Linear
C) ReLU
D) Step function

7. When is forward propagation used?

A) Only during training
B) Only during inference
C) Both during training and inference
D) Only during weight initialization

8. In a layer with 100 inputs and 50 outputs, what are the dimensions of the weight matrix?

A) 100 × 50
B) 50 × 100
C) 150 × 1
D) 1 × 150

Great job! You understand forward propagation.