We’ve explored how images are stored as pixels, how convolutional networks detect visual features, and how classifiers make decisions. Now let’s put it all together in a complete example: building a system that can tell whether an image contains a cat or a dog.
This problem has become a benchmark in computer vision because it’s simple enough to explain but challenging enough to show why modern methods like CNNs matter. Just like the MNIST digits did for deep learning, cat-vs-dog classification demonstrates how core concepts come together into a working pipeline.
Step 1: Understanding the Problem and Data
As always, we start with data: a dataset of tens of thousands of images, each labeled “cat” or “dog”.
- Image size: varies (resized to 224×224 for training)
- Color: full RGB channels (3 numbers per pixel)
- Labels: 2 classes (cat or dog)

Why is this difficult?
- Variation: Cats and dogs appear in many poses, lighting conditions, and angles.
- Background clutter: Animals may sit on grass, couches, or laps.
- Overlap: Some breeds look surprisingly similar — a fluffy cat can resemble a small dog.
Each image is transformed into a grid of numbers, 224×224×3, which is nearly 150,000 values: the raw material our model must interpret.
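To see where that number comes from, the arithmetic is just width × height × color channels (assuming the 224×224 RGB resize used for training):

```python
# Each resized training image is width × height × RGB channels of raw values.
width, height, channels = 224, 224, 3
values_per_image = width * height * channels
print(values_per_image)  # 150528, i.e. roughly 150,000 numbers per image
```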
Step 2: Designing the Architecture
A fully connected network would drown in parameters. Instead, we use a Convolutional Neural Network (CNN). You should recognize several of its components, and the same overall structure we used earlier.
Let's sketch the architecture of the network:
- Input: 224×224×3 pixels.
- Conv Layer 1: 32 filters (3×3) → detects edges, simple textures.
- Pooling Layer: 2×2 max pooling → reduces size to 112×112.
- Conv Layer 2: 64 filters (3×3) → detects curves, fur textures, shapes.
- Pooling Layer → reduces to 56×56.
- Conv Layer 3: 128 filters (3×3) → detects object parts (ears, noses).
- Fully Connected: 256 neurons → combines all features.
- Output Layer: 2 neurons with softmax → probabilities for cat vs dog.
This layered structure builds progressively from edges → textures → parts → objects.
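A quick way to sanity-check an architecture like this is to trace tensor shapes through each layer. This sketch assumes same-padded convolutions (which preserve width and height) and 2×2 pooling that halves each spatial dimension; the helper functions are illustrative, not a real deep-learning API:

```python
def conv(shape, filters):
    # Same-padded convolution: spatial size unchanged, channel count becomes `filters`.
    h, w, _ = shape
    return (h, w, filters)

def pool(shape):
    # 2×2 max pooling: halves height and width, keeps channel count.
    h, w, c = shape
    return (h // 2, w // 2, c)

shape = (224, 224, 3)          # input image
shape = pool(conv(shape, 32))  # Conv 1 + pooling
shape = pool(conv(shape, 64))  # Conv 2 + pooling
shape = conv(shape, 128)       # Conv 3
print(shape)                   # (56, 56, 128) feeds the fully connected layers
```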
Step 3: Preparing the Data
Before training, we prepare the images so the network learns effectively:
- Normalization: Rescale pixel values from [0, 255] to [0, 1].
- Resizing: Standardize to 224×224.
- Augmentation: Random flips, rotations, and color shifts make the model robust.
These steps mimic the invariance challenges we discussed earlier — teaching the network that a rotated cat is still a cat.
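As a minimal illustration in pure Python (no image library), normalization divides raw byte values by 255, and a horizontal flip, one of the simplest augmentations, just reverses each pixel row:

```python
def normalize(pixels):
    # Map raw byte values in [0, 255] to floats in [0, 1].
    return [[v / 255.0 for v in row] for row in pixels]

def hflip(pixels):
    # Horizontal flip: reverse each row (a simple augmentation).
    return [row[::-1] for row in pixels]

# A tiny 2×3 grayscale "image" standing in for one color channel.
image = [[0, 128, 255],
         [64, 32, 16]]
print(normalize(image)[0])  # [0.0, ~0.502, 1.0]
print(hflip(image)[0])      # [255, 128, 0]
```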
Step 4: Training the Network
Loss Function: We use binary cross-entropy, a way of measuring how well the model predicts the correct class when there are only two options. It is very similar to the cross-entropy loss we have seen before:

L = −[y·log(p) + (1 − y)·log(1 − p)]
Let's break it down:
- y is the true label (0 = cat, 1 = dog).
- p is the probability the model assigns to “dog”.
This formula works by giving a small penalty when the model is confident and right, and a big penalty when it’s confident and wrong.
🏞️ Prediction Example: If the true label is dog (y = 1) and the model predicts p = 0.95 (95% dog), the loss is very small: −log(0.95) ≈ 0.05. If the model instead predicts p = 0.05 (only 5% dog), the loss is huge: −log(0.05) ≈ 3.0.
So the closer p is to the correct label, the lower the loss.
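This behavior is easy to verify directly. A minimal implementation of binary cross-entropy for a single prediction:

```python
import math

def binary_cross_entropy(y, p):
    # y: true label (0 = cat, 1 = dog); p: predicted probability of "dog".
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_cross_entropy(1, 0.95))  # ~0.051: confident and right, small penalty
print(binary_cross_entropy(1, 0.05))  # ~3.0:   confident and wrong, large penalty
```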
Training loop:
- Forward pass: image → CNN → probabilities.
- Compute loss: how wrong was the prediction?
- Backpropagate: update convolutional filters and fully connected weights.
- Repeat across thousands of batches.
Over epochs, filters evolve:
- Early filters become edge detectors.
- Later filters specialize in fur textures or snout shapes.
- Deep layers assemble these into the concept of “dog” or “cat”.
Through this iterative process, the network gradually builds from simple patterns to rich concepts, enabling accurate recognition of complex objects.
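The same forward/loss/backpropagate/update cycle can be shown end to end on a toy model. This sketch trains a single-feature logistic regression (a tiny stand-in for the CNN) by gradient descent; the data is synthetic, with one made-up “fur-texture score” per image:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic data: one feature per "image", label 1 = dog, 0 = cat.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for epoch in range(200):
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)   # forward pass: feature → probability of "dog"
        grad = p - y             # gradient of cross-entropy + sigmoid w.r.t. the logit
        w -= lr * grad * x       # backpropagate: update the weight...
        b -= lr * grad           # ...and the bias

print(w > 0)  # the learned weight is positive: higher score pushes toward "dog"
```

Over the epochs the loss on each example shrinks, just as the filters of the full CNN gradually improve across thousands of batches.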
Step 5: What the Network Learns
During training, CNNs build a hierarchy of features:
- Layer 1: horizontal/vertical edges, simple color blobs.
- Layer 2: fur-like textures, curved ear outlines.
- Layer 3: combinations of textures and edges → noses, whiskers, eyes.
- Fully Connected: “catness” vs “dogness” scores.
For a photo of a tabby cat:
- Edges and textures highlight stripes.
- Middle layers detect fur and pointed ears.
- Final layers combine them into a high probability of cat.
In this way, CNNs turn raw pixels into layered representations that culminate in confident, high-level classifications.
Step 6: Evaluation
Once trained, we evaluate on unseen test images:
- Accuracy: 94% — correctly classifies most cats and dogs.
- Confusions: Some small fluffy dogs are mistaken for cats, and some exotic cats resemble dogs.
Common mistakes reveal limits of pattern-based recognition: the model doesn’t “know” what makes an animal a dog biologically — it only compares visual patterns.
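Accuracy and a confusion breakdown like the one above can be computed from predictions in a few lines. The labels and predictions here are synthetic, purely for illustration (0 = cat, 1 = dog):

```python
# Hypothetical test-set labels and model predictions (0 = cat, 1 = dog).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
dogs_called_cats = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
cats_called_dogs = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)          # 0.75
print(dogs_called_cats)  # 1 (e.g. a small fluffy dog mistaken for a cat)
print(cats_called_dogs)  # 1
```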
Step 7: From Classification to Generation
Here’s where it gets exciting: the same principles can be reversed.
- In classification, CNNs extract features and map them to labels.
- In generative models, we sample from a latent space and map features back to pixels.
That’s how systems like GANs and diffusion models can create entirely new cats and dogs — ones that never existed. A generated cat image “looks” real because it respects the same visual rules (fur texture, ear placement, symmetry) learned during classification.
Putting It All Together
This cat vs dog example brings the entire computer vision journey full circle:
- Pixels → numbers: Every photo starts as raw RGB values.
- Feature extraction: CNNs detect edges, textures, shapes.
- Hierarchies: Local features combine into global representations.
- Classification: Softmax outputs probabilities for categories.
- Generation: Flipping the pipeline produces new, synthetic images.
What started with pixel grids becomes a full AI system that can both recognize and create.
Final Takeaways
Just like the MNIST digits illustrated the power of deep learning, cat vs dog classification illustrates the essence of modern computer vision. It shows how convolution, pooling, hierarchical features, and supervised learning transform raw pixels into meaningful predictions. And it foreshadows the generative turn: once a network can represent what makes a cat or dog look real, it can also create entirely new ones.
This journey from recognition to generation encapsulates the core story of computer vision today.