AI for Dummies

We've covered a lot of ground in this chapter—from single perceptrons to deep networks, from backpropagation to common challenges. Now it's time to see how everything fits together in a complete, practical example.

Let's build a neural network to recognize handwritten digits using the famous MNIST dataset. This example will walk through every step: preparing data, designing architecture, training the network, and evaluating results. You'll see how concepts from previous chapters—functions, matrices, statistics, gradient descent—all combine to create a working AI system.

This is the same type of problem that helped launch the Deep Learning revolution, and it remains a good introduction to practical neural network development.

Step 1: Understanding the Problem and Data

✍🏼 The MNIST Dataset: MNIST (Modified National Institute of Standards and Technology) contains 70,000 images of handwritten digits collected from American census workers and high school students.

We have talked about this dataset before, as it is a very famous example. Let's break it down:

60,000 training images for learning patterns.
10,000 test images for final evaluation.
Image size: $28×28$ pixels (784 total pixels).
Pixel values: 0 (white) to 255 (black) grayscale intensities.
Labels: Digits 0 through 9 (10 classes total).

Figure: Grid of handwritten digits 0-9 from the MNIST dataset, showing variation in writing styles.

While humans recognize handwritten digits effortlessly, it's challenging for computers because:

Variation in writing styles: everyone writes differently.
Noise and imperfections: scanned images aren't perfect.
Rotation and scaling: digits might be tilted or drawn at different sizes.
Ambiguity: some 1s look like 7s, and some 6s look like 0s.

Remember matrices from Chapter 2? Each image becomes a $28×28$ matrix of pixel intensities. For our neural network, we'll flatten this into a vector of 784 numbers—transforming visual information into mathematical input the network can process.
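The flattening step can be sketched in a few lines. This is a minimal illustration using plain Python lists; the names `image` and `flatten` are hypothetical, and the all-zero image with one dark pixel is made up for the example.

```python
# Hypothetical 28x28 image: a list of 28 rows, each holding 28 pixel intensities.
ROWS, COLS = 28, 28
image = [[0 for _ in range(COLS)] for _ in range(ROWS)]
image[5][10] = 255  # one dark pixel at row 5, column 10

def flatten(img):
    """Concatenate the rows of a 2D image into one flat list (row-major order)."""
    return [pixel for row in img for pixel in row]

vector = flatten(image)
print(len(vector))            # 784
print(vector[5 * COLS + 10])  # 255
```

The pixel at (row, column) ends up at index `row * 28 + column`, so no information is lost, although the network no longer sees which pixels were neighbors in the grid.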

Step 2: Designing the Neural Network Architecture

We'll use a multi-layer perceptron (MLP)—a feedforward network with fully connected layers:

Input Layer: 784 neurons (one per pixel).
Hidden Layer: 128 neurons with ReLU activation.
Output Layer: 10 neurons with softmax activation (one per digit).

Figure: Network diagram showing 784 input nodes, 128 hidden nodes, and 10 output nodes.

Let's go through the reasons for our design choices:

Input Layer (784 neurons): Each neuron receives one pixel intensity value. This direct pixel-to-neuron mapping is simple but effective for MNIST's small images.

Hidden Layer (128 neurons with ReLU): This layer learns to detect important features like edges, curves, and patterns that distinguish different digits. We use ReLU activation to introduce non-linearity and to reduce the risk of vanishing gradients.

Output Layer (10 neurons with softmax): Each neuron represents one digit class (0-9). Softmax activation ensures outputs sum to 1, creating a probability distribution over possible digits.
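To make the three layers concrete, here is a rough sketch of a single forward pass through an untrained 784-128-10 network. It uses only the Python standard library to stay dependency-free (a real project would more likely use NumPy or a deep learning framework). The weight initialization, the random input, and all function names are assumptions made for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N_IN, N_HIDDEN, N_OUT = 784, 128, 10

def init_layer(n_in, n_out):
    """Small random weights (n_out rows of n_in values) and zero biases."""
    weights = [[random.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(z):
    """Keep positive values, replace negative values with 0."""
    return [max(0.0, v) for v in z]

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

W1, b1 = init_layer(N_IN, N_HIDDEN)
W2, b2 = init_layer(N_HIDDEN, N_OUT)

def forward(x):
    hidden = relu(dense(x, W1, b1))
    return softmax(dense(hidden, W2, b2))

x = [random.random() for _ in range(N_IN)]  # stand-in for one normalized image
probs = forward(x)
print(len(probs), round(sum(probs), 6))  # 10 1.0
```

Because the weights are random, the ten probabilities are all close to 0.1 at this point. Training is what pushes the probability of the correct digit toward 1.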

Now let's do a final parameter count for our model:

Input to hidden → $784 \times 128 = 100{,}352 \ \text{weights} + 128 \ \text{biases}$.
Hidden to output → $128 \times 10 = 1{,}280 \ \text{weights} + 10 \ \text{biases}$.
Total → $101{,}770$ parameters to learn!
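The same count can be checked with a short script. The helper name `dense_params` is hypothetical; it simply encodes the rule that a fully connected layer has one weight per connection plus one bias per output neuron.

```python
def dense_params(n_in, n_out):
    """Parameters in a fully connected layer: n_in * n_out weights plus n_out biases."""
    return n_in * n_out + n_out

layer_sizes = [784, 128, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer)  # [100480, 1290]
print(total)      # 101770
```

Changing `layer_sizes` shows how quickly the count grows: the first layer dominates because every one of the 784 pixels connects to every hidden neuron.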

This architecture strikes a balance between simplicity and expressive power, making it well-suited for learning to recognize handwritten digits.

Step 3: Preparing the Data

An important step in building any AI algorithm is data preprocessing. Before training, we need to prepare our data properly:

Normalization: Convert pixel values from $[0, 255]$ to $[0, 1]$ by dividing by 255. This improves training stability and speeds up convergence.
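As a small sketch of the normalization step (the function name `normalize` and the sample pixel values are illustrative):

```python
def normalize(pixels):
    """Scale 8-bit intensities from [0, 255] to [0.0, 1.0]."""
    return [p / 255 for p in pixels]

raw = [0, 51, 128, 255]  # illustrative pixel values
scaled = normalize(raw)
print(scaled[0], scaled[1], scaled[3])  # 0.0 0.2 1.0
```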

One-Hot Encoding: Convert digit labels to vectors:

Label 0 → $[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]$ .
Label 3 → $[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]$ .
Label 7 → $[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]$ .
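One-hot encoding is simple enough to write by hand. The helper below is a hypothetical sketch, not a library function:

```python
def one_hot(label, num_classes=10):
    """Return a vector with 1 at the label's index and 0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
    vec = [0] * num_classes
    vec[label] = 1
    return vec

print(one_hot(3))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(one_hot(7))  # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```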

Data Split: From the 70,000 images in total, we'll use:

50,000 of the training images for training (learning patterns).
10,000 of the training images for validation (monitoring progress and preventing overfitting).
10,000 test images remain untouched until final evaluation.
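One way to carve a validation set out of the 60,000 training images is to shuffle their indices and hold some back. This is a sketch under assumptions: the source does not say how the split is made, so the random shuffle, the seed, and the name `split_indices` are all illustrative.

```python
import random

def split_indices(n_total, n_val, seed=0):
    """Shuffle indices 0..n_total-1 and hold out n_val of them for validation."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(60_000, 10_000)
print(len(train_idx), len(val_idx))        # 50000 10000
print(set(train_idx).isdisjoint(val_idx))  # True
```

The disjointness check matters: if an image appears in both sets, validation accuracy stops being an honest estimate of performance on unseen data.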

With proper preprocessing, we set up the data in a form that allows our network to learn efficiently and generalize effectively.

Step 4: Training Process

Loss Function: We use cross-entropy loss again, perfect for multi-class classification:

$$\text{Loss} = -\log(\text{predicted probability of correct class})$$

Here $\log$ is the natural logarithm. If the network predicts $0.9$ probability for the correct digit, $\text{loss} = -\log(0.9) \approx 0.105$ (low penalty). If it predicts $0.1$, $\text{loss} = -\log(0.1) \approx 2.3$ (high penalty).
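The two cases can be reproduced numerically. In this sketch the function name `cross_entropy` and both probability vectors are made up for illustration; the loss uses the natural logarithm.

```python
import math

def cross_entropy(probs, true_class):
    """Negative natural log of the probability assigned to the correct class."""
    return -math.log(probs[true_class])

# Hypothetical outputs for an image whose true label is 3.
confident = [0.9 if k == 3 else 0.1 / 9 for k in range(10)]
unsure = [0.1] * 10

print(round(cross_entropy(confident, 3), 3))  # 0.105
print(round(cross_entropy(unsure, 3), 3))     # 2.303
```

The loss depends only on the probability given to the correct class, so a confident wrong answer is punished far more heavily than an uncertain one.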

Hyperparameters: We set the following hyperparameters. Remember that these are chosen before training starts rather than learned from the data.

Learning rate: 0.001 (aggressive enough for progress, conservative enough for stability).
Batch size: 64 (balance between memory usage and gradient quality).
Epochs: 20 (enough iterations to converge without overfitting).

Training Loop: For each epoch and for each batch of 64 images:

Forward pass: images → network → predictions.
Loss calculation: compare predictions to true labels.
Backpropagation: compute gradients for all 101,770 parameters.
Weight update: apply optimizer with gradients.
Track metrics: loss, accuracy on training and validation sets.
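The loop above can be sketched end to end on a toy problem. Everything here is an assumption made to keep the example small and fast: the network is 4-8-3 instead of 784-128-10, the data is synthetic, the optimizer is plain mini-batch gradient descent (the text does not name one), and the hyperparameter values are toy values rather than the ones listed above.

```python
import math
import random

random.seed(0)

N_IN, N_HIDDEN, N_OUT = 4, 8, 3                 # toy sizes standing in for 784, 128, 10
LEARNING_RATE, BATCH_SIZE, EPOCHS = 0.1, 8, 20  # toy values for this sketch

def make_example():
    """Synthetic data: the feature at the label's index is larger than the rest."""
    label = random.randrange(N_OUT)
    x = [random.uniform(0.0, 0.3) for _ in range(N_IN)]
    x[label] += 1.0
    return x, label

data = [make_example() for _ in range(240)]

W1 = [[random.gauss(0, 0.5) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
b1 = [0.0] * N_HIDDEN
W2 = [[random.gauss(0, 0.5) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]
b2 = [0.0] * N_OUT

def forward(x):
    """Return hidden activations (ReLU) and output probabilities (softmax)."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [max(0.0, z) for z in z1]
    z2 = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]
    m = max(z2)
    exps = [math.exp(z - m) for z in z2]
    s = sum(exps)
    return h, [e / s for e in exps]

def train_batch(batch):
    """One step: forward pass, loss, backpropagation, gradient descent update."""
    gW1 = [[0.0] * N_IN for _ in range(N_HIDDEN)]
    gb1 = [0.0] * N_HIDDEN
    gW2 = [[0.0] * N_HIDDEN for _ in range(N_OUT)]
    gb2 = [0.0] * N_OUT
    loss = 0.0
    for x, label in batch:
        h, p = forward(x)
        loss += -math.log(p[label])
        # Gradient of softmax + cross-entropy at the output: p - one_hot(label).
        d2 = [p[k] - (1.0 if k == label else 0.0) for k in range(N_OUT)]
        # Chain rule back through W2, then ReLU (zero gradient where h == 0).
        d1 = [sum(d2[k] * W2[k][j] for k in range(N_OUT)) if h[j] > 0 else 0.0
              for j in range(N_HIDDEN)]
        for k in range(N_OUT):
            gb2[k] += d2[k]
            for j in range(N_HIDDEN):
                gW2[k][j] += d2[k] * h[j]
        for j in range(N_HIDDEN):
            gb1[j] += d1[j]
            for i in range(N_IN):
                gW1[j][i] += d1[j] * x[i]
    scale = LEARNING_RATE / len(batch)  # average the gradients over the batch
    for k in range(N_OUT):
        b2[k] -= scale * gb2[k]
        for j in range(N_HIDDEN):
            W2[k][j] -= scale * gW2[k][j]
    for j in range(N_HIDDEN):
        b1[j] -= scale * gb1[j]
        for i in range(N_IN):
            W1[j][i] -= scale * gW1[j][i]
    return loss / len(batch)

history = []  # mean training loss per epoch
for epoch in range(EPOCHS):
    random.shuffle(data)
    losses = [train_batch(data[i:i + BATCH_SIZE]) for i in range(0, len(data), BATCH_SIZE)]
    history.append(sum(losses) / len(losses))

print(f"epoch 1 loss {history[0]:.3f}, epoch {EPOCHS} loss {history[-1]:.3f}")
```

At the scale described in this chapter, 50,000 training images with a batch size of 64 give 782 batches per epoch (the last one partially filled), so 20 epochs amount to 15,640 weight updates. A full version would also evaluate the validation set after each epoch.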

Training Results: The following results are found during training:

Epoch 1: 85% accuracy (network learns basic digit shapes).
Epoch 5: 94% accuracy (refining feature detection).
Epoch 10: 97% accuracy (handling difficult cases).
Epoch 15: 98% accuracy (fine-tuning edge cases).
Epoch 20: 98.2% accuracy (convergence).

Through this structured training process, our network evolves from random guesses to a highly accurate digit recognizer.

Step 5: What the Network Learns

The 128 hidden neurons automatically learn to detect meaningful patterns:

Early Training: Random, noisy feature detectors.

After Training: Specialized detectors for:

Horizontal and vertical lines.
Curves and loops (important for 0, 6, 8, 9).
Corners and intersections.
Stroke endpoints and directions.

Decision Making Process: For a handwritten "8":

Input layer: Receives 784 pixel intensities.
Hidden layer: Detects features like "two loops," "vertical symmetry," "closed curves".
Output layer: Combines features to strongly activate the "8" neuron.

The network learns this hierarchy automatically—we never explicitly taught it what loops or curves are!
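The final decision is simply the output neuron with the highest probability. In the sketch below, the helper `predict_digit` is hypothetical and the probability values are invented to show the idea; they are not measured network outputs.

```python
def predict_digit(probs):
    """Return the index of the largest output, i.e. the most probable digit."""
    return max(range(len(probs)), key=lambda k: probs[k])

# Made-up softmax output for an image of an "8" (values sum to 1).
probs = [0.01, 0.00, 0.01, 0.05, 0.00, 0.01, 0.02, 0.00, 0.88, 0.02]
digit = predict_digit(probs)
print(digit, probs[digit])  # 8 0.88
```

The second-largest value sits at index 3, which hints at the kind of confusion a network can have between visually similar digits.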

Step 6: Evaluation and Results

Test Set Performance: After training on 50,000 images and validating on 10,000, we evaluate on the final 10,000 test images that the network has never seen. We get an accuracy of 98.1%!

This means our network correctly identifies 9,810 out of 10,000 handwritten digits!
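Accuracy is the fraction of predictions that match the true labels. The sketch below uses a made-up list of ten labels and predictions; the function name `accuracy` is illustrative.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: 8 of 10 predictions are right.
labels      = [7, 2, 1, 0, 4, 1, 4, 9, 5, 9]
predictions = [7, 2, 1, 0, 9, 1, 4, 9, 5, 4]
print(accuracy(predictions, labels))  # 0.8

# The chapter's figure: 98.1% of 10,000 test images.
print(round(0.981 * 10_000))          # 9810
```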

Some common mistakes we might detect if we dig deeper:

4 → 9: Some handwritten 4s with closed tops look like 9s.
7 → 1: Minimally written 7s without clear horizontal strokes.
6 → 0: Poorly formed 6s that look circular.
8 → 3: Broken or poorly formed middle connection in 8s.

Figure: Heatmap showing which digits are confused with which others.

These errors make intuitive sense—they're the same mistakes humans might make!
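A confusion matrix makes this kind of error analysis systematic. The sketch below builds one from a small set of invented labels and predictions; the function names are hypothetical and the data is not from a real model.

```python
def confusion_matrix(labels, predictions, num_classes=10):
    """matrix[t][p] counts examples whose true digit is t and predicted digit is p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(labels, predictions):
        matrix[t][p] += 1
    return matrix

def top_confusions(matrix, n=3):
    """Off-diagonal cells with the highest counts, as (count, true, predicted)."""
    size = len(matrix)
    cells = [(matrix[t][p], t, p)
             for t in range(size) for p in range(size)
             if t != p and matrix[t][p] > 0]
    return sorted(cells, reverse=True)[:n]

# Made-up labels and predictions for illustration.
labels      = [4, 4, 4, 7, 7, 6, 8, 8, 3, 0]
predictions = [9, 9, 4, 1, 7, 0, 3, 8, 3, 0]
cm = confusion_matrix(labels, predictions)
print(cm[4][9])            # 2
print(top_confusions(cm))  # [(2, 4, 9), (1, 8, 3), (1, 7, 1)]
```

Correct predictions land on the diagonal, so the off-diagonal cells are exactly the mistakes, grouped by which digit was mistaken for which.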

This evaluation shows that while the model achieves outstanding accuracy, its remaining errors highlight both the challenges of handwriting variability and the limits of even well-trained neural networks.

Step 7: Putting It All Together

Let's check what we've accomplished. Starting with raw pixel data, our network learned to:

Extract meaningful features from visual input automatically.
Build hierarchical representations from simple edges to complex digit shapes.
Make confident, accurate classifications on new, unseen examples.
Generalize beyond training data with 98%+ accuracy.

Key Concepts

From Chapter 2 (Math Foundations):

Functions: Network implements complex function mapping 784 inputs to 10 outputs.
Matrices: Efficient computation using matrix operations for batched processing.
Derivatives: Backpropagation uses chain rule to compute gradients.
Statistics: Training data provides statistical foundation for learning.

From Chapter 3 (Machine Learning):

Supervised learning: Learning from labeled input-output pairs.
Data splits: Proper evaluation methodology.
Overfitting prevention: Monitoring validation performance.
Function approximation: Network learns complex mapping function.

From Chapter 4 (Deep Learning):

Multi-layer architecture: Hidden layers enable feature learning.
Activation functions: ReLU and softmax serve different purposes.
Backpropagation: Efficient gradient computation through all layers.
Optimization: Hyperparameters such as the learning rate and batch size control how gradient updates move through the complex loss landscape.

By integrating these mathematical foundations, machine learning principles, and Deep Learning techniques, we’ve built a complete system that transforms raw data into accurate, real-world predictions.

Final Takeaways

This MNIST example demonstrates how mathematical foundations, machine learning principles, and Deep Learning techniques combine to solve real problems. Starting with 784 pixel intensities, our network learned to extract meaningful features, build hierarchical representations, and make accurate classifications through the iterative process of forward propagation, loss calculation, backpropagation, and weight updates. The 98%+ accuracy achieved shows that even a simple neural network can learn complex patterns when given sufficient data and proper training.

Most importantly, the network discovered these patterns automatically—learning that loops distinguish 0s from 1s, that intersections characterize 8s, and that curves define 3s—without explicit programming. This automatic feature learning capability explains why Deep Learning has revolutionized fields from computer vision to natural language processing, turning the MNIST digit recognition principles into the foundation for modern AI systems.

Example: Recognizing Handwritten Digits