It is one of the most amusing and counterintuitive quirks of modern artificial intelligence. Today’s Large Language Models (LLMs) can write complex Python code, draft watertight legal contracts, translate obscure languages, and compose sonnets in the style of Shakespeare. Yet, if you ask an LLM a seemingly trivial question like, “How many ‘r’s are in the word strawberry?” or “How many words are in the previous sentence?”, it will frequently—and confidently—give you the wrong answer.
To a human, this failure seems inexplicable. How can a system so extraordinarily “smart” fail at a task a preschooler can master?
The answer lies in the fundamental architecture of artificial neural networks. LLMs are not built to be calculators, nor do they “read” text the way humans do. To understand why an AI cannot count, we must look under the hood at how these models perceive the world, process data, and generate language.
1. The Tokenization Bottleneck: Seeing the Forest, Missing the Trees
The primary culprit behind an LLM’s inability to count letters or words is a preprocessing step called tokenization.
When a human reads the word “strawberry,” they see ten distinct letters arranged in a specific sequence. They can visually scan the word and mentally tally each instance of the letter ‘r’.
LLMs, however, do not process raw text. Before a single word reaches the neural network, the text is fed into a tokenizer—usually based on an algorithm like Byte-Pair Encoding (BPE). The tokenizer chops words up into chunks called “tokens.” These tokens can be whole words, syllables, or arbitrary clusters of characters based on how frequently they appear in the model’s training data.
For example, the tokenizer might split “strawberry” into two distinct tokens: straw and berry. To the AI, these are essentially atomic units, represented as numerical IDs (e.g., Token 4912 and Token 813). The exact split and the IDs differ from one tokenizer to another; the values here are illustrative.
The consequence? The model has no direct view of the individual letters. Asking an LLM how many ‘r’s are in “strawberry” is akin to handing a human a sealed, opaque jar of jam and asking how many individual seeds are inside. The model knows the concept of a “strawberry,” and it knows how the word relates to “fruit” or “red,” but it cannot “see” the internal spelling of the token. Whatever it knows about spelling, it has picked up indirectly from training text, which is why its letter counts are unreliable rather than uniformly wrong.
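To make the straw / berry split concrete, here is a minimal sketch of a tokenizer. It uses a greedy longest-match lookup as a stand-in for BPE, and the vocabulary and IDs are invented to mirror the example above; `TOY_VOCAB` and `toy_tokenize` are hypothetical names, not part of any real library.

```python
# Toy illustration only: this vocabulary and its IDs are made up to mirror
# the example in the text. Real tokenizers hold tens of thousands of entries.
TOY_VOCAB = {
    "straw": 4912, "berry": 813,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match split against TOY_VOCAB (a simple stand-in for BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


tokens = toy_tokenize("strawberry")
ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)  # ['straw', 'berry']
print(ids)     # [4912, 813]
```

The network receives only the list of IDs. Nothing in the integers 4912 and 813 records that the underlying characters include three ‘r’s.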
2. Autoregressive Generation: Predicting vs. Computing
Even if we sidestep the letter problem (for instance, by asking the model to count the words in a sentence, where many common words map to a single token), LLMs still struggle. This brings us to the second core issue: the nature of autoregressive generation.
Conventional software executes explicit algorithms. If you write a Python script to count words, the computer uses a simple, deterministic procedure: it creates a variable (a counter) set to zero, loops through the text, and increments the counter by one for every item it finds.
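The counter-and-loop procedure just described can be sketched in a few lines. The function names are illustrative, and splitting on whitespace is an assumed definition of a “word.”

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words with an explicit counter."""
    counter = 0                  # the counter starts at zero
    for _word in text.split():   # loop through the items
        counter += 1             # increment once per item found
    return counter


def count_letter(word: str, letter: str) -> int:
    """Same pattern, applied to characters instead of words."""
    counter = 0
    for ch in word:
        if ch == letter:
            counter += 1
    return counter


print(count_words("How many words are in this sentence?"))  # 7
print(count_letter("strawberry", "r"))                      # 3
```

Run it a thousand times and the answers never change, because every step is an exact operation on explicit state.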
LLMs do not execute an explicit counting procedure; they compute probabilities. They are next-token predictors.
When an LLM is asked, “How many words are in this paragraph?”, it does not run a loop. Instead, it looks at the sequence of tokens in the prompt and calculates the probability of each possible next token. It relies on patterns learned from its vast training data to produce a number that is statistically plausible in that context.
Because counting is a strict, rule-based mathematical operation—and not a linguistic pattern—the LLM’s probabilistic engine often defaults to highly plausible-sounding but mathematically incorrect guesses. It is the equivalent of a human trying to guess the exact number of jellybeans in a jar by just looking at its shape, rather than counting them one by one.
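A deliberately crude caricature shows the difference between predicting and computing. The “training data” below is invented, and a real LLM conditions on far more context than three words, so treat this only as a sketch of the failure mode; `TRAINING_PAIRS`, `context_key`, and `predict` are hypothetical names.

```python
from collections import Counter, defaultdict

# Invented question/answer pairs standing in for training text.
TRAINING_PAIRS = [
    ("how many r in berry", "2"),
    ("how many r in cherry", "2"),
    ("how many r in carrot", "2"),
    ("how many r in strawberry", "3"),
    ("how many r in raspberry", "3"),
]


def context_key(prompt: str) -> str:
    """Crude context: only the first three words (an assumption of this toy)."""
    return " ".join(prompt.split()[:3])


# Record how often each answer followed each context.
next_token_counts = defaultdict(Counter)
for prompt, answer in TRAINING_PAIRS:
    next_token_counts[context_key(prompt)][answer] += 1


def predict(prompt: str) -> str:
    """Return the most frequent continuation; nothing is ever counted."""
    counts = next_token_counts[context_key(prompt)]
    return counts.most_common(1)[0][0]


print(predict("how many r in strawberry"))  # 2
```

The toy predictor answers “2” because that answer followed similar prompts most often, even though the correct answer for this exact word appears in its data.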
3. The Absence of Working Memory and State
Counting requires a specific cognitive function: working memory. When a human counts, they maintain a running tally in their head (“One, two, three…”). This is a sequential, stateful process.
Transformer architectures—the foundation of modern LLMs—process information in parallel, not sequentially. They analyze the entire context window at once using an “Attention Mechanism.” Because they process everything in a single, massive mathematical forward pass, they lack an internal “scratchpad” or state variable to hold an accumulating number.
Counting $n$ items one at a time takes $O(n)$ sequential steps. A standard Transformer, however, has a fixed number of layers, so the sequential computation available for each generated token is constant: effectively $O(1)$ with respect to $n$. If the model tries to jump directly to the final answer without externalizing the intermediate steps, it often fails to carry out the sequential logic that accurate counting requires.
4. The Inconsistency of Numbers
When it comes to counting actual numbers and doing arithmetic, tokenization strikes again. Human mathematics relies on a strict, positional base-10 system. But tokenizers treat numbers highly inconsistently.
For example:
- The number 123 might be one token.
- The number 1234 might be tokenized as 12 and 34.
- The number 12345 might be tokenized as 123 and 45.
Because the segmentation of numbers follows token frequency rather than place value, the same digit can land in a different position within a token from one number to the next, and the model struggles to align digits to perform arithmetic or count numerical values. It ends up learning math as a language translation task (translating “2 + 2” to “4”) rather than learning the underlying rules of arithmetic.
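The three splits listed above can be reproduced with a small BPE-style merge loop. The merge rules here are hypothetical: they were chosen to produce exactly those splits and are not taken from any real tokenizer.

```python
# Hypothetical merge rules, listed from highest to lowest priority. They are
# chosen to reproduce the splits in the text, not copied from a real tokenizer.
TOY_MERGES = [("1", "2"), ("4", "5"), ("3", "4"), ("12", "3")]
RANK = {pair: i for i, pair in enumerate(TOY_MERGES)}


def toy_bpe(text: str) -> list[str]:
    """Repeatedly merge the best-ranked adjacent pair until none applies."""
    tokens = list(text)  # start from single characters
    while True:
        pairs = [(RANK[p], p) for p in zip(tokens, tokens[1:]) if p in RANK]
        if not pairs:
            return tokens
        _, best = min(pairs)  # lowest rank number = highest priority
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged


for number in ["123", "1234", "12345"]:
    print(number, "->", toy_bpe(number))
# 123 -> ['123']
# 1234 -> ['12', '34']
# 12345 -> ['123', '45']
```

Note where the digit 3 ends up: at the end of a token in 123, at the start of one in 1234, and at the end again in 12345. A model that sees only token IDs has no stable notion of the hundreds or tens place.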
5. Bridging the Gap: How We Teach AI to Count
If LLMs are fundamentally unsuited for counting, how do we get around this limitation? Researchers and engineers have developed several clever workarounds:
- Chain of Thought (CoT) Prompting: We can force the model to simulate working memory by making it write out its thought process. If we tell the model, “Write out every letter in the word ‘strawberry’ separated by hyphens, and number the ‘r’s as you find them,” we are forcing it to generate a step-by-step external scratchpad. This transforms the task from a single leap of probability into a sequence of smaller, manageable linguistic steps.
- Tool Use (Code Interpreters): Modern LLMs are increasingly paired with external environments. When asked to count, a capable model can write a small Python script, send it to an interpreter, and return the exact output. It outsources the deterministic computation to a conventional program.
- Byte-Level Models: Researchers are experimenting with byte-level or character-level architectures that bypass the BPE tokenizer entirely, so the model natively “sees” every single character.
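The three workarounds above can each be sketched in a few lines of Python. The scratchpad format is one plausible rendering of the hyphen-separated prompt, and `spell_out_and_tally` is a hypothetical helper; the code imitates the text a model would write out and is not itself a model.

```python
def spell_out_and_tally(word: str, letter: str) -> tuple[str, int]:
    """Build a hyphen-separated scratchpad and the final tally."""
    steps = []
    tally = 0
    for ch in word:
        if ch == letter:
            tally += 1
            steps.append(f"{ch}({tally})")  # number each match as it is found
        else:
            steps.append(ch)
    return "-".join(steps), tally


# Chain of thought: the intermediate state is written out, not held internally.
scratchpad, total = spell_out_and_tally("strawberry", "r")
print(scratchpad)  # s-t-r(1)-a-w-b-e-r(2)-r(3)-y
print(total)       # 3

# Tool use: the kind of one-line script a model could hand to an interpreter.
print("strawberry".count("r"))  # 3

# Byte-level view: for ASCII text, every character becomes its own integer.
byte_values = list("strawberry".encode("utf-8"))
print(len(byte_values))  # 10
```

In the scratchpad version, each numbered ‘r’ is a small step that depends only on the step before it. In the byte-level version, the ten integers give the model one input per character, so the letters are no longer hidden inside a token.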
The Difference Between Language and Logic
The fact that large language models cannot reliably count is not a “bug” so much as a feature of their design. It highlights a profound difference between organic human cognition and artificial neural processing. LLMs are savants of pattern recognition and linguistic synthesis, but they are devoid of the rigid, step-by-step logical processors that define traditional computing.
Understanding why an AI cannot count a handful of letters teaches us a valuable lesson: we must use the right tool for the right job. For generating creative ideas, writing, and summarizing, LLMs are unparalleled. But for counting? You might be better off using a basic calculator—or just using your own fingers.
