Multi-Layer Networks

We've seen how a single perceptron can make basic decisions by drawing straight-line boundaries through data. But real-world problems—recognizing faces, understanding speech, playing chess—require much more sophisticated pattern recognition than any single perceptron can provide.

The breakthrough came from a simple but powerful idea: what if we connect multiple perceptrons together in layers? Just like how individual musicians combine to create a symphony, or how simple LEGO blocks build complex structures, stacking perceptrons unlocks capabilities far beyond what any single unit could achieve.

This layered approach transforms the linear limitations of single perceptrons into the flexible, powerful systems that drive modern AI.


The XOR Problem: Why We Need Multiple Layers

Let's revisit the XOR problem, which demonstrates why single perceptrons hit a wall:

  • (0,0) → circle
  • (0,1) → star
  • (1,0) → star
  • (1,1) → circle

No matter how you draw a straight line, you can't separate the stars from the circles. This isn't a limitation of our specific perceptron—it's a fundamental mathematical constraint. Single perceptrons can only create linear decision boundaries.

But here's the key insight: while one line can't solve XOR, two lines working together can! We can break the problem into simpler parts that individual perceptrons can handle, then combine their outputs.


Introducing Multi-Layer Perceptrons (MLPs)

A Multi-Layer Perceptron solves complex problems by organizing perceptrons into layers, where each layer processes information and passes results to the next layer.

The Three-Layer Architecture:

  • Input Layer: Receives raw data (like pixel values or sensor readings).
  • Hidden Layer(s): Extract patterns and features from the input.
  • Output Layer: Makes final decisions based on processed information.
[Figure: network diagram showing the input layer, hidden layer, and output layer, with connections between them]

You might recognize some familiar concepts: features (x_1, x_2), weights (w_{11}, w_{22}), and the output (y). Now we add the hidden units: h_1, h_2.

Each neuron in one layer connects to every neuron in the next layer, creating a "fully connected" network. This allows information to flow forward through the network, with each layer building on the work of the previous layer.
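The forward flow through fully connected layers can be sketched in a few lines of NumPy. This is a minimal illustration with random (untrained) weights, assuming a ReLU activation at each layer (activation functions are covered later in this chapter); the layer sizes are arbitrary:

```python
import numpy as np

def forward(x, weights, biases):
    """Pass input x through each fully connected layer in turn."""
    a = x
    for W, b in zip(weights, biases):
        # Every neuron sees every output of the previous layer: W @ a,
        # followed by a non-linear activation (here, ReLU).
        a = np.maximum(0, W @ a + b)
    return a

# A tiny 2 -> 3 -> 1 network with illustrative, untrained weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]

output = forward(np.array([1.0, 0.5]), weights, biases)
```

Because each weight matrix has one row per neuron in the next layer and one column per neuron in the current layer, "fully connected" is just a matrix-vector product.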


How Hidden Layers Learn Features

Hidden layers are where the real intelligence emerges. Instead of working directly with raw inputs, they learn to detect meaningful patterns that make the final decision easier.

✍🏼 Handwritten Digit Recognition Example:

  • Input Layer (784 neurons): Each neuron receives one pixel brightness value from the 28×28 image.
  • Hidden Layer 1 (128 neurons): Learn to detect simple features like edges, curves, and line segments.
  • Hidden Layer 2 (64 neurons): Combine simple features into more complex patterns like "closed loops" or "vertical lines".
  • Output Layer (10 neurons): One for each digit 0-9, with the highest activation indicating the prediction.
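The shapes in this architecture can be sketched directly. The snippet below builds the 784 → 128 → 64 → 10 stack with random placeholder weights (real values would come from training) and runs a fake image through it, just to show how each layer's output becomes the next layer's input:

```python
import numpy as np

rng = np.random.default_rng(42)

# Layer sizes from the example: 784 pixels -> 128 -> 64 -> 10 digits.
sizes = [784, 128, 64, 10]

# Random, untrained weight matrices purely to illustrate the shapes.
weights = [rng.normal(scale=0.01, size=(m, n)) for n, m in zip(sizes, sizes[1:])]

# One flattened 28x28 "image" of fake pixel intensities.
image = rng.random(784)

a = image
for W in weights:
    a = np.maximum(0, W @ a)  # ReLU keeps the sketch non-linear

# The most active of the 10 output neurons is the predicted digit.
prediction = int(np.argmax(a))
```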

Think of it like learning to recognize handwriting as a human:

  1. First, you notice basic strokes and curves.
  2. Then, you combine strokes into character parts.
  3. Finally, you recognize complete digits.

The network learns this same hierarchical process automatically from examples!


Activation Functions: Adding Non-Linear Power

Remember how we needed curved relationships for complex problems like polynomial regression in Chapter 3? Neural networks use activation functions to introduce these crucial non-linear capabilities.

Without activation functions, stacking linear perceptrons just creates another linear function—no matter how many layers you add! Activation functions let each layer add curves and bends to the decision boundary.
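This collapse is easy to verify numerically: applying two linear layers in sequence gives exactly the same result as a single linear layer whose matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # first "layer"
W2 = rng.normal(size=(4, 3))   # second "layer"
x = rng.normal(size=2)

# Two linear layers applied in sequence...
two_layers = W2 @ (W1 @ x)

# ...equal one linear layer with the combined matrix W2 @ W1.
combined = (W2 @ W1) @ x

assert np.allclose(two_layers, combined)
```

No matter how many matrices you multiply, the result is still a single matrix, hence still a straight-line decision boundary.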

Common Activation Functions:

ReLU (Rectified Linear Unit): f(x) = \max(0, x)

  • Simple rule: if input is positive, pass it through; if negative, output zero.
  • Fast to compute and works well in practice.
  • Allows networks to model complex patterns while avoiding certain mathematical problems.

Sigmoid Function: f(x) = \frac{1}{1+e^{-x}}

  • Maps any input to a value between 0 and 1.
  • Useful for probabilities and the output layer.
  • Historically important but less common in hidden layers today.
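Both functions are one-liners in NumPy; a quick sketch:

```python
import numpy as np

def relu(x):
    """max(0, x): pass positives through, zero out negatives."""
    return np.maximum(0, x)

def sigmoid(x):
    """Squash any input into the range (0, 1)."""
    return 1 / (1 + np.exp(-x))

xs = np.array([-2.0, 0.0, 3.0])
relu_out = relu(xs)        # negatives become 0, positives pass through
sig_mid = sigmoid(0.0)     # an input of 0 lands exactly at 0.5
```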

Each activation function shapes how neurons respond to inputs, creating the non-linear transformations that let networks solve complex problems. The choice of activation function affects how well and how quickly the network learns.


Solving XOR with Multiple Layers

Let's see how a two-layer network can solve the XOR problem that stumped our single perceptron.

The Solution Strategy:

  • The hidden layer contains two perceptrons that each solve part of the problem.
  • Output layer combines the results from the hidden layer.

Hidden Layer Logic:

  • Neuron 1 learns to output 1 when "A OR B" (at least one input is 1).
  • Neuron 2 learns to output 1 when "A AND B" (both inputs are 1).

Output Layer Logic:

  • Final neuron outputs 1 when "Neuron 1 AND NOT Neuron 2" (exactly one input is 1).

This demonstrates a crucial principle: complex problems can often be decomposed into simpler sub-problems that individual neurons can handle.
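The decomposition above can be wired up by hand with a step activation. The weights and thresholds below are one illustrative choice (the text doesn't specify exact values), chosen so the two hidden neurons compute OR and AND and the output computes "h1 AND NOT h2":

```python
import numpy as np

def step(x):
    # Hard threshold: fire (1) if the weighted sum reaches 0, else 0.
    return (x >= 0).astype(int)

def xor_net(a, b):
    x = np.array([a, b])
    # Hidden neuron 1: OR  -- fires when a + b >= 1.
    h1 = step(np.dot([1, 1], x) - 1)
    # Hidden neuron 2: AND -- fires when a + b >= 2.
    h2 = step(np.dot([1, 1], x) - 2)
    # Output neuron: h1 AND NOT h2 -- fires when h1 - h2 >= 1,
    # i.e. exactly one input was 1.
    return int(step(h1 - h2 - 1))
```

Running the four XOR inputs through `xor_net` reproduces the table from earlier in this chapter: (0,0) and (1,1) map to 0, while (0,1) and (1,0) map to 1.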


The Universal Approximation Theorem

Here's one of the most remarkable results in machine learning theory: with enough neurons in a single hidden layer, an MLP can approximate any continuous function to any desired accuracy.

  • Theoretically: MLPs can learn any pattern that exists in data.
  • Practically: With enough neurons and training data, MLPs can solve incredibly diverse problems.
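A tiny concrete instance of this idea: the absolute-value function can be represented *exactly* by a one-hidden-layer network with just two ReLU neurons, since |x| = ReLU(x) + ReLU(-x). (The general theorem only promises approximation; this special case happens to be exact.)

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def abs_net(x):
    # One hidden layer, two ReLU neurons (input weights +1 and -1),
    # output weights both 1: |x| = relu(x) + relu(-x) for every real x.
    return relu(x) + relu(-x)

xs = np.linspace(-5, 5, 101)
assert np.allclose(abs_net(xs), np.abs(xs))
```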

But there is a catch: while this theorem guarantees that solutions exist, it doesn't tell us:

  • How many neurons we need.
  • How to find the right weights.
  • How much training data is required.

This is why Deep Learning remains both an art and a science—the theory tells us it's possible, but experience and experimentation guide us to practical solutions.


From MLPs to Modern Deep Learning

Multi-layer perceptrons laid the foundation for all modern deep learning. Their historical impact can't be overstated:

  • Proved that neural networks could solve non-linear problems.
  • Established the layer-based architecture still used today.
  • Led to the development of backpropagation (which we'll explore next).

The core theory of MLPs is what drives modern architectures we'll encounter shortly:

  • Convolutional Neural Networks (CNNs): Specialized for images.
  • Recurrent Neural Networks (RNNs): Designed for sequences like text.
  • Transformers: The architecture behind ChatGPT and modern language models.

All these architectures build on the same fundamental principle: combine simple processing units in layers to create complex, intelligent behavior. We will catch up with them later!


Final Takeaways

Multi-layer networks transform the linear limitations of single perceptrons into flexible systems capable of learning complex patterns. By organizing perceptrons into layers—input, hidden, and output—networks can decompose difficult problems into manageable sub-problems that individual neurons can solve. Hidden layers automatically learn to extract meaningful features from raw data, building hierarchical representations from simple to complex.

Activation functions introduce the non-linear transformations essential for solving real-world problems, while the universal approximation theorem guarantees that MLPs can theoretically learn any pattern. Understanding multi-layer networks provides the conceptual foundation for all modern deep learning architectures, from image recognition to language models.