You now understand what neural networks are—whether feedforward, CNNs, RNNs, or Transformers—but how do they actually learn? How does a network go from making random predictions to recognizing handwritten digits, understanding language, or identifying objects in photos?
The answer lies in two interconnected processes: backpropagation and gradient descent. These algorithms work together to systematically adjust millions or billions of weights, turning a randomly initialized network into an intelligent system. This is where the mathematical concepts from Chapter 2—particularly derivatives—become essential for understanding modern AI.
Let's explore how neural networks transform from random guessers into sophisticated pattern recognition systems.
Adjusting Millions of Weights
Consider the scale of what we're trying to accomplish. A (rather small) deep network might have:
- Input layer: 784 neurons (one per pixel of a 28×28 image).
- Hidden layer 1: 512 neurons.
- Hidden layer 2: 256 neurons.
- Output layer: 10 neurons (for digit classification 0-9).
This creates approximately 535,000 weights (784×512 + 512×256 + 256×10 = 535,040, not counting biases) that need to be learned! Each of these weights influences the final prediction, but their effects are indirect, flowing through multiple layers and activation functions. The fundamental question becomes: when the network makes a mistake, which weights should we adjust and by how much?
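The count can be checked with a few lines of Python. This is a small sketch that assumes every layer is fully connected to the next; the bias count is an extra assumption, since the text counts only weights.

```python
# Layer sizes from the example network: input, two hidden layers, output.
layer_sizes = [784, 512, 256, 10]

# Fully connected layers: every neuron connects to every neuron in the next layer.
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Assumption: one bias per neuron in every non-input layer.
biases = sum(layer_sizes[1:])

print(weights)           # 535040
print(weights + biases)  # 535818
```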
Naive approaches, such as tuning each weight by trial and error, fail here because:
- We can't test each weight individually (would take millions of experiments).
- Weights interact with each other in complex ways.
- Small changes can have large effects due to non-linear activation functions.
This is where backpropagation provides an elegant mathematical solution.
Forward Pass: Making a Prediction
Before we can learn from mistakes, we need to make a prediction. This "forward pass" sends input data through the network layer by layer.
✍🏼 Handwritten Digit Example: Let's trace a handwritten "7" through our network:
- Input layer: Receives 784 pixel brightness values.
- Hidden layer 1: Each of 512 neurons computes a weighted sum of its inputs plus a bias, then applies an activation function.
- Hidden layer 2: Each of 256 neurons processes outputs from layer 1.
- Output layer: 10 neurons produce scores for digits 0-9.
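For each neuron, the computation is:

$$a = f\left(\sum_i w_i x_i + b\right)$$

In this function, $f$ denotes the activation function, $x_i$ are the neuron's inputs, $w_i$ its weights, and $b$ its bias. So we "pull" the now familiar formula $\sum_i w_i x_i + b$ through the activation function. This happens millions of times as information flows forward through the network.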
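The per-neuron computation can be sketched in plain Python for one small layer. The layer size, the weights, and the choice of ReLU as the activation are all illustrative assumptions, not values from the digit network.

```python
def relu(z):
    """ReLU activation (an assumed choice): keep positives, clamp negatives to zero."""
    return max(0.0, z)

def dense_forward(inputs, weights, biases, activation):
    """Compute activation(w . x + b) for every neuron in one layer.

    weights[j] holds the incoming weights of neuron j.
    """
    outputs = []
    for w_j, b_j in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_j, inputs)) + b_j  # weighted sum plus bias
        outputs.append(activation(z))
    return outputs

# Tiny layer: 3 inputs -> 2 neurons, with hand-picked values.
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.5],   # incoming weights of neuron 0
     [-1.0, 0.0, 0.0]]   # incoming weights of neuron 1
b = [0.5, 0.0]

print(dense_forward(x, W, b, relu))  # [0.5, 0.0]
```

A full forward pass would call `dense_forward` once per layer, feeding each layer's output into the next.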

After the forward pass, the network outputs one score per class, so ten scores in total. Suppose the highest score (0.8) sits at position 7: the network then predicts "7", which is correct! But what if it's wrong?
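Picking the prediction from the scores can be sketched as follows. The raw scores are made up for illustration, and the softmax step that turns them into probabilities is an assumption; the text does not say how the network normalizes its outputs.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw output scores for digits 0-9.
raw_scores = [0.1, 0.3, 0.2, 0.0, 0.5, 0.1, 0.4, 3.0, 0.2, 0.6]

probs = softmax(raw_scores)
# The predicted digit is the index of the largest probability.
predicted_digit = max(range(len(probs)), key=lambda i: probs[i])
print(predicted_digit)  # 7
```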
Measuring Mistakes: Loss Functions
When our network makes a wrong prediction, we need to quantify exactly how wrong it was. This is where loss functions come in—they provide a single number representing the network's error. We have seen these before during Machine Learning, remember?
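For our digit recognition task, we use something called cross-entropy loss, which heavily penalizes confident wrong predictions:

$$L = -\ln(p_{\text{correct}})$$

Here $p_{\text{correct}}$ is the probability the network assigns to the correct digit. Let's go through some examples:
- Correct and confident: Predicts 0.9 for correct digit → $-\ln(0.9) \approx 0.105$ (small penalty).
- Correct but uncertain: Predicts 0.6 for correct digit → $-\ln(0.6) \approx 0.511$ (medium penalty).
- Wrong and confident: Predicts 0.9 for wrong digit, leaving at most 0.1 for the correct one → at least $-\ln(0.1) \approx 2.303$ (large penalty).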
This loss function encourages the network to be both accurate and confident in its correct predictions.
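The three cases above can be reproduced with a short sketch. It assumes the loss for a single example is the negative natural logarithm of the probability assigned to the correct digit, and that the "wrong and confident" case leaves 0.1 for the correct digit.

```python
import math

def cross_entropy(prob_of_correct_class):
    """Cross-entropy loss for one example: -ln(p), natural log assumed."""
    return -math.log(prob_of_correct_class)

cases = [("correct and confident", 0.9),
         ("correct but uncertain", 0.6),
         ("wrong and confident", 0.1)]  # 0.9 went to a wrong digit

for label, p in cases:
    print(f"{label}: p={p} -> loss={cross_entropy(p):.3f}")
# correct and confident: p=0.9 -> loss=0.105
# correct but uncertain: p=0.6 -> loss=0.511
# wrong and confident: p=0.1 -> loss=2.303
```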
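For problems predicting continuous values (like house prices), we often use the Mean Squared Error (MSE):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the target value and $\hat{y}_i$ the prediction for example $i$. This is a function we've seen before. See how these ML techniques help us in Deep Learning?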
The choice of loss function shapes how the network learns. Cross-entropy pushes the network toward confident, correct classifications, while MSE encourages predictions close to target values.
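For comparison, the mean squared error can be sketched the same way. The house prices below are invented purely to keep the arithmetic easy to follow.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Hypothetical house prices in thousands.
targets = [200.0, 300.0, 150.0, 400.0]
predictions = [210.0, 290.0, 150.0, 420.0]

# Squared errors are 100, 100, 0 and 400, so the mean is 150.
print(mean_squared_error(targets, predictions))  # 150.0
```

Note how the single prediction that is off by 20 contributes more than the two that are off by 10 combined: squaring makes large errors dominate.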
Backpropagation: Tracing Responsibility
Here's where the magic happens. Backpropagation solves the credit assignment problem: which weights are responsible for the error, and by how much?
Remember derivatives from Chapter 2? Backpropagation uses the chain rule of calculus to trace how each weight influences the final loss through all the intermediate layers. The process, conceptually, is as follows:
- Output layer: Calculate how much each output neuron contributed to the loss.
- Hidden layer 2: Use output layer gradients to calculate hidden layer 2 gradients.
- Hidden layer 1: Use hidden layer 2 gradients to calculate hidden layer 1 gradients.
- Continue backward: Until we reach the input layer.

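Mathematically, for a weight $w_{ij}$ connecting neuron $i$ to neuron $j$:

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}}$$

Here $z_j$ is the weighted sum going into neuron $j$ and $a_j$ is its output after the activation function. This lets us compute the gradient for every single weight in the network efficiently.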
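To make the chain rule concrete, here is a minimal sketch on a deliberately tiny network: one input, one hidden neuron, one output. The sigmoid activation, the squared-error loss, and all numeric values are assumptions chosen to keep the derivatives short. The hand-derived gradients are compared against a numerical estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """x -> hidden neuron (sigmoid) -> linear output."""
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    return h, y_hat

def loss(x, y, w1, w2):
    _, y_hat = forward(x, w1, w2)
    return 0.5 * (y_hat - y) ** 2

def backward(x, y, w1, w2):
    """Apply the chain rule from the output back toward the input."""
    h, y_hat = forward(x, w1, w2)
    dL_dyhat = y_hat - y                # derivative of the loss w.r.t. the output
    dL_dw2 = dL_dyhat * h               # because y_hat = w2 * h
    dL_dh = dL_dyhat * w2               # error passed back to the hidden neuron
    dL_dw1 = dL_dh * h * (1.0 - h) * x  # sigmoid'(z) = h * (1 - h), with z = w1 * x
    return dL_dw1, dL_dw2

x, y, w1, w2 = 1.5, 1.0, 0.4, -0.3
g1, g2 = backward(x, y, w1, w2)

# Numerical check: nudge each weight slightly and measure the change in loss.
eps = 1e-6
n1 = (loss(x, y, w1 + eps, w2) - loss(x, y, w1 - eps, w2)) / (2 * eps)
n2 = (loss(x, y, w1, w2 + eps) - loss(x, y, w1, w2 - eps)) / (2 * eps)
print(abs(g1 - n1) < 1e-6, abs(g2 - n2) < 1e-6)  # True True
```

The numerical check needs two extra forward passes per weight, which is exactly the "test each weight individually" approach that does not scale; backpropagation gets all gradients from one backward sweep.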
Gradient Descent: The Weight Update Process
Once backpropagation tells us how each weight affects the loss, gradient descent uses this information to improve the weights. It follows a simple update rule:

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$$

Let's break this down:
- Gradient ($\frac{\partial L}{\partial w}$): The derivative telling us which direction increases the loss.
- Negative gradient ($-\frac{\partial L}{\partial w}$): The direction that decreases the loss (where we want to go).
- Learning rate ($\eta$): How big a step to take in that direction.
The learning rate is what we call a hyperparameter, a choice you set in advance—the model doesn't learn it. It defines how big the step we take towards decreasing the loss should be.
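The update rule can be tried on a one-weight toy problem. The loss function, starting point, and learning rate below are illustrative assumptions; a real network applies the same rule to every weight at once.

```python
def gradient(w):
    """Derivative of the toy loss L(w) = (w - 3)^2, which is smallest at w = 3."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting weight
learning_rate = 0.1  # illustrative value for this toy problem

for step in range(100):
    # new_weight = old_weight - learning_rate * gradient
    w = w - learning_rate * gradient(w)

print(round(w, 4))  # 3.0
```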

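The effect of the learning rate can be summarized as follows (the values are illustrative; the right range depends on the model, the data, and the optimizer):
- Too low (0.001): Very slow learning, may never reach optimal performance.
- Just right (0.01): Steady improvement toward optimal weights.
- Too high (0.1): Network bounces around, never settling on good weights.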
Finding the right learning rate may involve some trial and error, but there are some rules of thumb to keep in mind:
- Start with common values (0.01, 0.001).
- Monitor the training loss; it should decrease steadily.
- Adjust based on network behavior and problem complexity.
In practice, choosing an appropriate learning rate is one of the most important decisions for ensuring stable and efficient training.
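The three regimes (too low, reasonable, too high) can be seen on a toy loss. This sketch assumes the loss L(w) = w², and the specific thresholds apply only to that function; they are not the values you would use for a real network.

```python
def run(learning_rate, steps=20, w=1.0):
    """Gradient descent on L(w) = w^2 (gradient 2w); returns the final distance from 0."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return abs(w)

too_low = run(0.001)    # barely moves away from the starting point
reasonable = run(0.1)   # gets close to the minimum at w = 0
too_high = run(1.1)     # every step overshoots further: the loss grows

print(too_low > 0.9, reasonable < 0.05, too_high > 1.0)  # True True True
```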
Stochastic Gradient Descent
Processing the entire dataset for each weight update (batch gradient descent) is computationally expensive and often unnecessary. Instead, most neural networks use variations that update weights more frequently.
The most common variation is Mini-Batch Gradient Descent: rather than using all training examples for every update, each update uses a small batch (typically 32-256 examples):
- Forward pass: Process a batch of examples.
- Compute loss: Average loss across the batch.
- Backpropagation: Calculate gradients based on batch.
- Update weights: Apply gradient descent step.
- Repeat: Move to next batch.
Batching has several advantages, much like doing calculations in matrix form (remember when we did this in Chapter 2?). Among them:
- Computational efficiency: Vectorized operations are faster.
- Memory management: Entire datasets often don't fit in memory, but a single batch does.
- Regularization effect: Noise from different batches can help reduce overfitting.
- Faster convergence: More frequent updates often lead to faster learning.
This balance of efficiency, scalability, and regularization is why mini-batch stochastic gradient descent has become the standard optimization method in deep learning.
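Splitting a dataset into shuffled mini-batches can be sketched like this. The helper name `iterate_minibatches` and the stand-in dataset are hypothetical; deep learning libraries ship their own data loaders for this job.

```python
import random

def iterate_minibatches(examples, batch_size, rng):
    """Yield shuffled mini-batches that together cover the dataset exactly once."""
    indices = list(range(len(examples)))
    rng.shuffle(indices)  # a fresh order each epoch is the source of the "stochastic" noise
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]

dataset = list(range(100))  # stand-in for 100 training examples
rng = random.Random(0)      # fixed seed so the run is repeatable

batches = list(iterate_minibatches(dataset, batch_size=32, rng=rng))
print([len(batch) for batch in batches])  # [32, 32, 32, 4]
```

The last batch is smaller because 100 is not a multiple of 32; whether to keep or drop such a partial batch is a design choice.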
The Complete Training Loop
Let's put it all together in the context of training our digit recognition network and see what a full training loop looks like. One complete pass through the entire dataset is what we call an "epoch". Within each epoch, the following steps run for every batch:
- Forward Pass:
  - Feed batch of images through network.
  - Get predictions for each image.
- Loss Calculation:
  - Compare predictions to correct labels.
  - Compute cross-entropy loss.
- Backpropagation:
  - Calculate gradients for all weights.
  - Trace error backward through all layers.
- Weight Update:
  - Apply gradient descent to all weights.
  - new_weight = old_weight - lr × gradient.
- Repeat:
  - Move to next batch.
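The steps above can be sketched end to end on a problem small enough for plain Python. This is a stand-in, not the digit network: it assumes a single neuron with a sigmoid output, a made-up one-dimensional dataset, and binary cross-entropy, so the backpropagation step collapses to one line per parameter.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)

# Synthetic dataset (made up): the label is 1 when x > 0, else 0.
data = [(x, 1.0 if x > 0 else 0.0)
        for x in (rng.uniform(-1.0, 1.0) for _ in range(200))]

w, b = 0.0, 0.0   # parameters to learn
lr = 0.5          # illustrative learning rate
batch_size = 32
epoch_losses = []

for epoch in range(20):
    rng.shuffle(data)
    total_loss = 0.0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w = grad_b = 0.0
        for x, y in batch:
            p = sigmoid(w * x + b)                # forward pass
            p = min(max(p, 1e-12), 1.0 - 1e-12)   # keep log() away from 0
            total_loss += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))  # loss
            grad_w += (p - y) * x                 # backpropagation (chain rule result)
            grad_b += p - y
        w -= lr * grad_w / len(batch)             # weight update with averaged gradient
        b -= lr * grad_b / len(batch)
    epoch_losses.append(total_loss / len(data))

accuracy = sum((sigmoid(w * x + b) > 0.5) == (y == 1.0) for x, y in data) / len(data)
print(f"loss: first epoch {epoch_losses[0]:.3f}, last epoch {epoch_losses[-1]:.3f}")
print(f"training accuracy: {accuracy:.2f}")
```

In a real project the inner loop over examples would be replaced by vectorized operations, and the gradients would come from a framework's automatic differentiation rather than a hand-derived formula.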
As the loss decreases, the accuracy of the model typically increases.

A typical training timeline might look something like this:
- Epoch 1: 30% accuracy, high loss (network learning basic patterns).
- Epoch 5: 70% accuracy, decreasing loss (recognizing common digit features).
- Epoch 10: 85% accuracy, low loss (fine-tuning difficult cases).
- Epoch 20: 92% accuracy, very low loss (near-optimal performance).
Through this iterative process, the network gradually refines its parameters until it can reliably recognize digits with high accuracy.
Advanced Optimization Techniques
While basic gradient descent works, modern deep learning uses more sophisticated optimizers that adapt the learning process.
Momentum: Remembers previous gradients to smooth out updates and accelerate learning in consistent directions.
Adam Optimizer: Adapts learning rates for each parameter individually based on historical gradients. It combines the benefits of momentum with adaptive learning rates and is one of the most widely used optimizers in deep learning.
Advanced optimizers often converge more quickly than plain gradient descent. They also tend to be more robust, showing less sensitivity to the learning rate, and can adjust their behavior during training to better suit the problem at hand.
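Both ideas can be sketched on a one-weight toy loss. The momentum variant below accumulates a running sum of gradients, and the Adam variant follows the commonly published update with bias correction; the toy loss and all hyperparameter values are illustrative assumptions.

```python
import math

def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, which is smallest at w = 3."""
    return 2.0 * (w - 3.0)

def sgd_momentum(w, steps=500, lr=0.05, beta=0.9):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + gradient(w)  # remember previous gradients
        w = w - lr * velocity
    return w

def adam(w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1.0 - beta1) * g       # running mean of gradients (momentum part)
        v = beta2 * v + (1.0 - beta2) * g * g   # running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)          # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled per parameter
    return w

w_momentum = sgd_momentum(0.0)
w_adam = adam(0.0)
print(abs(w_momentum - 3.0) < 0.01, abs(w_adam - 3.0) < 0.01)  # True True
```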
Monitoring Training: How to Know It's Working
Training neural networks requires careful monitoring to ensure the process is working correctly:
- Training loss: Should decrease steadily over time.
- Validation accuracy: Should increase and track reasonably with training.
- Learning curves: Plots showing progress over time.
Signs of healthy training are:
- Loss decreases smoothly (with some noise).
- Validation performance improves along with training performance.
- Network makes increasingly sensible predictions.
We will explore some warnings of bad training in the next chapter!
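Because per-batch losses are noisy, a common trick is to smooth them before judging the trend. The sketch below uses a simple moving average over invented loss values; the helper name and the window size are arbitrary choices.

```python
def moving_average(values, window):
    """Average each value with up to (window - 1) of the values before it."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical per-batch losses: noisy, but trending down.
losses = [2.3, 2.1, 2.4, 1.9, 1.7, 1.8, 1.4, 1.5, 1.2, 1.0, 1.1, 0.8]
smooth = moving_average(losses, window=4)

# Compare the first full window with the last one.
print(smooth[-1] < smooth[3])  # True
```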
Final Takeaways
Training deep neural networks combines backpropagation and gradient descent to systematically adjust millions of parameters from random initialization to intelligent behavior. The forward pass generates predictions by flowing data through layers of weighted connections and activation functions. Loss functions quantify prediction errors in ways that guide learning toward desired outcomes. Backpropagation uses the chain rule of calculus to efficiently compute gradients for every weight, determining how each parameter contributes to the overall error.
Gradient descent uses these gradients to update weights in directions that reduce loss, with the learning rate controlling step size. Modern optimizers improve upon basic gradient descent by adding momentum and adapting learning rates. Understanding this training process reveals how neural networks transform from random functions into sophisticated pattern recognition systems through systematic mathematical optimization.