Example: Predicting the Next Word

We’ve explored how words can be represented as vectors, how neural networks capture meaning, and how transformers use attention to handle context. Now let’s see how these ideas fit together in a complete example: teaching a model to predict the next word in a sentence.

This is the foundational task behind modern NLP. Whether a system is summarizing documents, answering questions, or chatting like ChatGPT, at its core it is repeatedly predicting the most likely next token based on everything that came before.


Step 1: Understanding the Problem and Data

The basic task is simple: given a sentence like "The cat sat on the ___" the model should learn that likely continuations include mat or sofa, but not volcano.

To train this ability, we use a dataset of text—books, articles, or conversations. The data is tokenized (split into units like words or subwords) and represented as numerical IDs.

  • For example, the sentence "cats chase mice" might become the ID sequence [42, 381, 97].
  • Vocabulary size: often 30,000–50,000 tokens.
  • Input format: sequences of tokens.
  • Output: probability distribution over the vocabulary for the next token.

Illustration of predicting the next word: "The cat sat on the ___"

In essence, the model learns to turn sequences of tokens into probability distributions that capture which words are most likely to come next.
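Tokenization can be sketched in a few lines. The vocabulary and IDs below are made up for illustration; real tokenizers (which typically use subword units) learn vocabularies of tens of thousands of entries from data.

```python
# Toy word-level tokenizer: maps words to integer IDs over a tiny vocabulary.
# The IDs here are illustrative, not values any real tokenizer would produce.
vocab = {"cats": 0, "chase": 1, "mice": 2, "the": 3, "cat": 4}

def tokenize(sentence):
    """Split on whitespace and look up each word's ID."""
    return [vocab[word] for word in sentence.lower().split()]

ids = tokenize("cats chase mice")
print(ids)  # [0, 1, 2]
```

A real pipeline also handles unknown words and punctuation, but the core idea is the same: text in, integer sequence out.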


Step 2: Representing Words

Before a network can process text, words must be turned into vectors.

  • One-hot encoding: Each word is represented as a sparse vector with one “1” and the rest “0s.” Simple but inefficient.
  • Embeddings: Instead, we use dense vectors where similar words are close in space. For example:
\text{vec}(\text{cat}) \approx [0.12, -0.85, 0.33, \dots]

These embeddings are learned during training, capturing semantic relationships like king – man + woman ≈ queen.
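"Close in space" is usually measured with cosine similarity. The 3-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions and are learned during training):

```python
import math

# Hypothetical 3-dimensional embeddings; real models use hundreds of
# dimensions and learn these values from data.
embeddings = {
    "cat": [0.12, -0.85, 0.33],
    "dog": [0.10, -0.80, 0.30],
    "car": [0.90, 0.40, -0.20],
}

def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically similar words end up closer in the vector space.
print(cosine(embeddings["cat"], embeddings["dog"]))  # near 1
print(cosine(embeddings["cat"], embeddings["car"]))  # much lower
```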


Step 3: Building the Model

We use a Transformer language model architecture:

  • Embedding layer: Converts tokens into vectors.
  • Positional encoding: Adds information about word order.
  • Attention layers: Each word can “look at” others in the sentence, weighting the most relevant ones.
  • Feedforward layers: Process and transform representations.
  • Output layer: A softmax function over the vocabulary gives probabilities for the next word.

For “The cat sat on the ___” the model might output:

  • mat → 0.65.
  • floor → 0.20.
  • sofa → 0.10.
  • volcano → 0.0001.

This architecture allows Transformers to capture meaning, context, and relationships, producing coherent predictions that reflect how words fit together.
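The final softmax step can be shown concretely. The raw scores (logits) below are hypothetical, chosen so the resulting distribution resembles the example above:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in logits.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical output-layer scores for "The cat sat on the ___"
logits = {"mat": 5.0, "floor": 3.8, "sofa": 3.1, "volcano": -4.0}
probs = softmax(logits)
print(probs)  # "mat" gets the highest probability; "volcano" near zero
```

In a real model the logits come from the last feedforward layer, and the vocabulary has tens of thousands of entries rather than four.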


Step 4: Training the Network

Loss Function: We use cross-entropy loss, which measures how far off the model’s predicted probabilities are from the correct answer.

\text{Loss} = - \sum_{i=1}^{V} y_i \log(p_i)

You might remember this formula from Chapter 3; let's break it down:

  • V is the size of the vocabulary (all possible words).
  • y_i is the true label for word i (1 for the correct word, 0 for all others).
  • p_i is the probability the model assigned to word i.

🐱 Prediction Example: if the model predicts “mat” with probability 0.9 (90%), the loss is −log(0.9) ≈ 0.11 (a very small penalty). But if it only gives “mat” 0.1 (10%), the loss is −log(0.1) ≈ 2.3 (a much larger penalty).

So the model is rewarded for putting high probability on the correct word and penalized otherwise.
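Because the true label is one-hot, the sum collapses to a single term: the negative log of the probability assigned to the correct word. The distributions below are made up to mirror the example above:

```python
import math

def cross_entropy(probs, correct_word):
    """One-hot cross-entropy reduces to -log(p_correct)."""
    return -math.log(probs[correct_word])

# Confident, correct prediction: tiny penalty.
probs_good = {"mat": 0.9, "floor": 0.05, "sofa": 0.04, "volcano": 0.01}
print(round(cross_entropy(probs_good, "mat"), 2))  # 0.11

# Unconfident prediction: much larger penalty.
probs_bad = {"mat": 0.1, "floor": 0.5, "sofa": 0.3, "volcano": 0.1}
print(round(cross_entropy(probs_bad, "mat"), 2))  # 2.3
```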

Training loop:

  1. Input sentence: “The cat sat on the”.
  2. Model predicts probabilities for the next word.
  3. True word is “mat” → loss is computed.
  4. Backpropagation updates all attention weights and embeddings.
  5. Repeat millions of times on different sentences.

Over time, the model shifts from random guesses to meaningful predictions.
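To make the loop tangible, here is a deliberately simplified stand-in: a count-based bigram model. A real transformer updates millions of weights by backpropagation rather than incrementing counts, but the predict-compare-update cycle over many sentences is the same in spirit. The corpus below is invented:

```python
from collections import defaultdict

# counts[prev][next] = how often `next` followed `prev` in the corpus.
counts = defaultdict(lambda: defaultdict(int))

corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat sat on the mat",
]

# "Training": each observed pair strengthens that prediction, loosely
# analogous to a gradient step nudging weights toward the true next word.
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def predict(prev):
    """Return the learned next-word distribution after the word `prev`."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

print(predict("the"))  # "cat" dominates; "mat" beats the rarer "sofa"
```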


Step 5: What the Model Learns

Attention mechanisms let the model capture both local grammar and long-range dependencies.

  • In “The cat that was sleeping finally woke up”, attention links "cat" with "woke", not just nearby words.
  • In “The animal didn’t cross the road because it was too tired”, attention helps resolve that "it" refers to "animal", not "road".

This ability to weigh relevance dynamically is what makes transformers so powerful.
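The dynamic weighting itself is just a softmax over relevance scores. The scores below are invented to show how "it" might attend to earlier words in the second example sentence; in a real transformer they come from query-key dot products:

```python
import math

def attention_weights(scores):
    """Softmax turns raw relevance scores into attention weights."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance of "it" to earlier words in
# "The animal didn't cross the road because it was too tired".
words = ["animal", "cross", "road", "because"]
scores = [3.0, 0.5, 1.0, 0.2]

for word, weight in zip(words, attention_weights(scores)):
    print(f"{word}: {weight:.2f}")  # "animal" receives most of the attention
```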


Step 6: Evaluation

To test our model, we measure perplexity, a metric of how well it predicts text (here the loss is measured in bits, i.e., with log base 2; with natural log, the base would be e):

\text{Perplexity} = 2^{\text{Cross-Entropy Loss}}

  • Low perplexity → model is confident and accurate.
  • High perplexity → model is uncertain or wrong.
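Intuitively, perplexity is the effective number of words the model is "choosing between" at each step. The loss values below are hypothetical:

```python
# Perplexity from average cross-entropy loss in bits (log base 2).
loss_good = 0.15  # hypothetical average loss of a strong model
loss_bad = 6.0    # hypothetical average loss of a weak model

print(2 ** loss_good)  # close to 1: the model is rarely surprised
print(2 ** loss_bad)   # 64.0: as uncertain as guessing among ~64 words
```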

A good model might generate “The cat sat on the mat, purring softly”, while a weaker model might produce “The cat sat on the banana, purring computers”.

The difference illustrates how well the network has captured language patterns.


Step 7: From Prediction to Conversation

At scale, this simple “predict the next word” task leads to remarkable abilities:

  • Sentence completion: Predict the rest of a phrase.
  • Summarization: Predict shortened versions of text.
  • Translation: Predict the next word in a different language.
  • Dialogue: Predict the next turn in a conversation.

ChatGPT is essentially this process repeated many times, with additional fine-tuning and safety layers to make responses conversational and useful.


Putting It All Together

This example ties together the core NLP journey:

  1. Tokens to vectors: Words become embeddings.
  2. Attention layers: Context decides which words matter most.
  3. Prediction: Softmax outputs probabilities for the next word.
  4. Learning: Backpropagation adjusts embeddings and attention weights.
  5. Evaluation: Metrics like perplexity measure success.

What starts as raw text becomes a system that can model language—not just matching words, but capturing relationships, syntax, and meaning.


Final Takeaways

Next-word prediction is deceptively simple but incredibly powerful. It shows how embeddings, attention, and transformers combine to produce language that feels natural. The same mechanism scales up to create chatbots, summarizers, translators, and more.

Most importantly, the model learns all of this automatically from data. Just as CNNs discover edges and shapes, language models discover grammar and meaning—transforming the basic task of predicting “what comes next” into the foundation of modern natural language processing.