Computers are fundamentally mathematical machines—they process numbers, not words. When you ask ChatGPT a question or search for something online, the AI system doesn't "read" your text the way humans do. Instead, it converts every word into numbers that neural networks can work with using the mathematical operations we've learned about.
This transformation from human language to numbers is the first step in natural language processing. Without it, even the most sophisticated AI would be useless—like a brilliant mathematician trying to solve equations written in a language they can't read.
The Challenge: Words Don't Have Natural Numbers
Human language creates several challenges for computers that don't exist with other types of data. Unlike images (pixels have brightness values) or audio (sounds have frequencies), words don't come with numerical values. How much is "cat" worth mathematically? What's the numerical difference between "happy" and "sad"?
Context Changes Everything:
- "Bank" means different things in "river bank" vs "money bank".
- "Lead" is pronounced differently in "lead the teamå" vs "lead pipes".
- The same letters can have completely different meanings depending on context.
Infinite Creativity: Language constantly evolves with new words like "selfie" or "ghosting", novel combinations, and countless ways to express the same ideas.
Complex Relationships: Words connect through:
- Synonyms: "happy" and "joyful".
- Categories: "cat" and "animal".
- Grammar: "run" and "running".
- Analogies: "king" to "queen" as "man" to "woman".
Simple numbers don't capture these rich relationships.
Step 1: Breaking Text into Pieces
Before assigning numbers, we decide what units of language to work with through tokenization—breaking continuous text into discrete tokens.
Word-Level Tokenization: The first step is dissecting a sentence into words: "The quick brown fox jumps" → ["The", "quick", "brown", "fox", "jumps"].
However, tokenizing like this already brings some challenges:
- Punctuation handling: Is "don't" one token or two?
- Capitalization: Are "Cat" and "cat" the same?
- Unknown words: What about words never seen before?
Modern systems break words into meaningful pieces using subword tokenization:
- "unhappiness" → .
- "impossible" → .
This handles unknown words better—even if we've never seen "unhappiness", we might know "un-", "happy", and "-ness" separately.
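The two levels of tokenization above can be sketched in a few lines. The toy subword list and the greedy longest-match rule here are illustrative stand-ins; real systems learn their subword inventory from data with algorithms like byte-pair encoding (BPE):

```python
# Toy subword inventory (illustrative; real inventories are learned from data).
TOY_SUBWORDS = {"un", "happy", "happi", "ness", "im", "possible", "cat", "s"}

def word_tokenize(text):
    """Naive word-level tokenization: split on whitespace."""
    return text.split()

def subword_tokenize(word, subwords=TOY_SUBWORDS):
    """Greedy longest-match split of one word into known subword pieces."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest possible piece starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece matches: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

print(word_tokenize("The quick brown fox jumps"))
# → ['The', 'quick', 'brown', 'fox', 'jumps']
print(subword_tokenize("unhappiness"))
# → ['un', 'happi', 'ness']
print(subword_tokenize("impossible"))
# → ['im', 'possible']
```

The greedy longest-match strategy is a simplification; learned tokenizers choose splits that best compress the training corpus.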
Step 2: Creating a Vocabulary
We create a vocabulary—a complete list of tokens our system recognizes, each with a unique ID number. Here's how to do that:
- Collect massive amounts of text from books, websites, articles.
- Tokenize everything using our chosen method.
- Count how often each token appears.
- Keep the most common tokens.
- Assign each token a unique number.
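The five steps above can be sketched as follows; the tiny corpus, the whitespace tokenization, and the vocabulary size cap are all illustrative:

```python
from collections import Counter

# Illustrative mini-corpus; real systems use billions of words.
corpus = [
    "the cat sat on the mat",
    "the dog and the cat played",
]

# Tokenize (naive whitespace split) and count token frequencies.
counts = Counter(tok for line in corpus for tok in line.split())

# Special tokens get reserved IDs first; then the most common tokens
# get the next IDs, so frequent tokens end up with low numbers.
vocab = {"<UNK>": 0, "<PAD>": 1}
max_size = 10
for token, _ in counts.most_common(max_size - len(vocab)):
    vocab[token] = len(vocab)

print(vocab)
# "the" (most frequent) gets the first non-special ID.
```

Tokens rarer than the size cap allows simply never enter the vocabulary and will later map to `<UNK>`.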
🗣️ Vocabulary Example: Take a look at what a subset of the resulting vocabulary might look like:
| Token | ID | Usage |
|---|---|---|
| <UNK> | 0 | unknown words |
| the | 1 | very common |
| and | 2 | very common |
| cat | 4 | common |
| elephant | 47 | rare |
Notice the special token <UNK>, used for unknown words not in the vocabulary. Other special tokens are:
- <PAD>: Pads sentences to the same length for processing.
- <START>/<END>: Mark sentence boundaries.
Common words get low numbers, rare words get high numbers or are excluded entirely.
Step 3: Converting Text to Numbers
Now we transform any text into number sequences neural networks can process. Let's convert the sentence "The cat is happy":
- Original: "The cat is happy".
- Tokenized: ["The", "cat", "is", "happy"].
- ID Numbers: [1, 4, 9, 12] ("the" and "cat" use the vocabulary above; the IDs for "is" and "happy" are illustrative).
If we were to handle unknown words not in the vocabulary:
- Original: "The zebra is happy".
- If "zebra" is unknown: ["The", "<UNK>", "is", "happy"] (zebra → <UNK>, which has ID 0).
By tokenizing, handling unknown words, and padding sequences, we prepare text data in a consistent numerical form ready for neural network training.
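The full pipeline—tokenize, map unknowns to <UNK>, pad to a fixed length—can be sketched with a toy vocabulary (the specific IDs below are illustrative):

```python
# Illustrative toy vocabulary; real vocabularies come from a large corpus.
vocab = {"<UNK>": 0, "<PAD>": 1, "the": 2, "cat": 3, "is": 4, "happy": 5}

def encode(text, vocab, max_len=6):
    """Lowercase, split on whitespace, map tokens to IDs, and pad to max_len."""
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in text.lower().split()]
    ids = ids[:max_len]                              # truncate long sentences
    ids += [vocab["<PAD>"]] * (max_len - len(ids))   # pad short ones
    return ids

print(encode("The cat is happy", vocab))    # → [2, 3, 4, 5, 1, 1]
print(encode("The zebra is happy", vocab))  # → [2, 0, 4, 5, 1, 1]  (zebra → <UNK>)
```

Padding every sequence to the same length lets the network process a whole batch of sentences as one rectangular array of numbers.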
What We've Achieved and What's Missing
This process converts any text into mathematical input that neural networks can process, handles large vocabularies systematically, and forms the foundation for all text-based AI systems.
However, these number assignments are arbitrary and meaningless:
- Number 4 ("cat") is mathematically as similar to 5 ("dog") as to 47.000 (rare technical term).
- No encoding that "happy" and "joyful" mean similar things.
- "Bank" gets same number in "river bank" and "savings bank".
- Rich concepts reduced to single arbitrary numbers.
This loses crucial information about relationships and context.
Real-World Scale
Keep in mind that these concepts expand drastically for actual AI models, so let's look at how they scale in real-world language models.
🤖 Modern Language Model Example: The scale of modern language models is huge and keeps increasing:
- GPT-3: ~50,000 token vocabulary.
- Training data: Hundreds of billions of words.
- Processing speed: Thousands of sentences per second.
However, we cannot keep growing vocabularies endlessly, because storage and compute costs rise with vocabulary size. This creates a vocabulary dilemma:
- Small vocabularies: Can't represent many words, lots of unknowns.
- Large vocabularies: Harder to learn, more memory needed.
- Solution: Subword tokenization as a compromise.
Balancing vocabulary size with efficient learning is a core challenge at the heart of large-scale natural language processing.
The Path Forward
Simple ID numbers point toward more sophisticated approaches. Instead of "cat" = 4, what if we used a list of numbers like [0.2, -0.5, 0.8, ...], where each number captures some aspect of "cat"-ness?
This is exactly what embeddings do—create dense, meaningful representations where similar words have similar numbers and relationships are preserved in mathematical structure. Embeddings open the door to richer language understanding by translating words into mathematical spaces where meaning and relationships naturally emerge.
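A tiny numerical sketch shows why dense vectors help. The three-dimensional vectors below are hand-made for illustration; real embeddings are learned and typically have hundreds of dimensions:

```python
import math

# Hand-made illustrative embeddings (real ones are learned during training).
embeddings = {
    "happy":  [0.90, 0.80, 0.10],
    "joyful": [0.85, 0.75, 0.20],
    "cat":    [0.10, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity: near 1.0 for similar directions, lower for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(embeddings["happy"], embeddings["joyful"]))  # high (near 1)
print(cosine(embeddings["happy"], embeddings["cat"]))     # much lower
```

Unlike arbitrary IDs, these vectors make "happy" and "joyful" measurably closer to each other than either is to "cat".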
Final Takeaways
Converting human language into mathematical representations requires systematic tokenization, vocabulary creation, and numerical encoding. While simple ID-based representations solve the basic problem of making text processable by neural networks, they create sparse, arbitrary representations that don't capture semantic meaning or relationships between words.
Understanding this process and its limitations motivates the need for more sophisticated approaches—embeddings—that can encode semantic similarity and relationships in their mathematical structure, making them much more powerful for language understanding tasks.