
How many times does the letter “r” appear in the word “strawberry”? According to formidable AI products like GPT-4o and Claude, the answer is twice.
Large language models (LLMs) can write essays and solve equations in seconds. They can synthesize terabytes of data faster than humans can open up a book. Yet, these seemingly omniscient AIs sometimes fail so spectacularly that the mishap turns into a viral meme, and we all rejoice in relief that maybe there’s still time before we must bow down to our new AI overlords.
The failure of large language models to understand the concepts of letters and syllables is indicative of a larger truth that we often forget: These things don’t have brains. They do not think like we do. They are not human, nor even particularly humanlike.
Most LLMs are built on transformers, a kind of deep learning architecture. Transformer models break text into tokens, which can be full words, syllables, or letters, depending on the model.
“LLMs are based on this transformer architecture, which notably is not actually reading text. What happens when you input a prompt is that it’s translated into an encoding,” Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta, said. “When it sees the word ‘the,’ it has this one encoding of what ‘the’ means, but it does not know about ‘T,’ ‘H,’ ‘E.’”
This is because transformers are not able to take in or output actual text efficiently. Instead, the text is converted into numerical representations of itself, which are then contextualized to help the AI come up with a logical response. In other words, the AI might know that the tokens “straw” and “berry” make up “strawberry,” but it may not understand that “strawberry” is composed of the letters “s,” “t,” “r,” “a,” “w,” “b,” “e,” “r,” “r,” and “y,” in that specific order. Thus, it cannot tell you how many letters, let alone how many “r”s, appear in the word “strawberry.”
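To make that concrete, here is a minimal Python sketch of the idea. The vocabulary, the ID numbers, and the `encode` function are all hypothetical and invented for illustration; real tokenizers learn far larger vocabularies and use different algorithms. The point is only that what reaches the model is a list of integers, not a string of letters.

```python
# Hypothetical toy vocabulary: a few multi-letter tokens plus single letters
# as a fallback. Real vocabularies are learned from data and far larger.
TOY_VOCAB = {"straw": 0, "berry": 1, "the": 2}
for ch in "abcdefghijklmnopqrstuvwxyz":
    TOY_VOCAB.setdefault(ch, len(TOY_VOCAB))


def encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization: text in, integer IDs out."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i]!r}")
    return ids


token_ids = encode("strawberry")         # two integers, one per token
letter_count = "strawberry".count("r")   # counting needs the raw characters

print(token_ids)     # [0, 1]
print(letter_count)  # 3
```

Nothing in the list `[0, 1]` records that the underlying word contains three “r”s. In this sketch, that information is only available to code that still holds the original string.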
This isn’t an easy issue to fix, since it’s embedded into the very architecture that makes these LLMs work.
Kyle Wiggers dug into this problem last month and spoke to Sheridan Feucht, a PhD student at Northeastern University studying LLM interpretability.
“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Feucht said. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”
This problem becomes even more complex as an LLM learns more languages. For example, some tokenization methods might assume that a space in a sentence will always precede a new word, but many languages like Chinese, Japanese, Thai, Lao, Korean, Khmer and others do not use spaces to separate words. Google DeepMind AI researcher Yennie Jun found in a 2023 study that some languages need up to 10 times as many tokens as English to communicate the same meaning.
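A short Python sketch shows why the space assumption breaks down. The two sentences are illustrative examples chosen here, not data from the study, and `whitespace_tokens` is a hypothetical helper. The UTF-8 byte counts are included only as a rough hint at why byte-based vocabularies can spend more units on some scripts; bytes are not the same thing as tokens.

```python
def whitespace_tokens(sentence):
    """Naive tokenizer that assumes a space precedes every new word."""
    return sentence.split()


# Illustrative sentences with the same meaning ("I like strawberries").
english = "I like strawberries"
japanese = "私はいちごが好きです"

english_pieces = whitespace_tokens(english)
japanese_pieces = whitespace_tokens(japanese)

print(len(english_pieces))   # 3
print(len(japanese_pieces))  # 1: the whole sentence comes back as one "word"

# Rough proxy only: characters versus UTF-8 bytes for each sentence.
for text in (english, japanese):
    print(len(text), len(text.encode("utf-8")))
```

The naive splitter finds three words in the English sentence but treats the entire Japanese sentence as a single unit, so a tokenizer built on that assumption has to fall back on some other way of chunking the text.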
“It’s probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers,” Feucht said.
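One way to get a feel for that cost is a back-of-the-envelope calculation. The sketch below assumes standard self-attention, in which every position in the input is compared with every other position, so the work grows with the square of the sequence length. The two-token split of “strawberry” is the hypothetical one used earlier in this article, and `attention_pairs` is an illustrative helper, not a measurement of any real model.

```python
def attention_pairs(sequence_length):
    """Position pairs that standard self-attention compares in one layer."""
    return sequence_length ** 2


subword_len = 2                # hypothetical split: "straw" + "berry"
char_len = len("strawberry")   # 10 characters

subword_cost = attention_pairs(subword_len)
char_cost = attention_pairs(char_len)

print(subword_cost)              # 4
print(char_cost)                 # 100
print(char_cost / subword_cost)  # 25.0
```

Under these assumptions, feeding the word in letter by letter makes the sequence five times longer but the pairwise comparisons 25 times more numerous, and the gap widens as the text gets longer.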
Image generators like Midjourney and DALL-E don’t use the transformer architecture that lies beneath the hood of text generators like ChatGPT. Instead, image generators usually use diffusion models, which reconstruct an image from noise. Diffusion models are trained on large databases of images, and they’re incentivized to try to re-create something like what they learned from training data.
Asmelash Teka Hadgu, co-founder of Lesan and a fellow at the DAIR Institute, said, “Image generators tend to perform much better on artifacts like cars and people’s faces, and less so on smaller things like fingers and handwriting.”
This could be because these smaller details don’t often appear as prominently in training sets as concepts like how trees usually have green leaves. The problems with diffusion models might be easier to fix than the ones plaguing transformers, though. Some image generators have improved at representing hands, for example, by training on more images of real, human hands.
“Even just last year, all these models were really bad at fingers, and that’s exactly the same problem as text,” Guzdial explained. “They’re getting really good at it locally, so if you look at a hand with six or seven fingers on it, you could say, ‘Oh wow, that looks like a finger.’ Similarly, with the generated text, you could say, that looks like an ‘H,’ and that looks like a ‘P,’ but they’re really bad at structuring these whole things together.”
That’s why, if you ask an AI image generator to create a menu for a Mexican restaurant, you might get normal items like “Tacos,” but you’ll be more likely to find offerings like “Tamilos,” “Enchidaa” and “Burhiltos.”
As these memes about spelling “strawberry” spill across the internet, OpenAI is working on a new AI product code-named Strawberry, which is supposed to be even more adept at reasoning. The growth of LLMs has been limited by the fact that there simply isn’t enough training data in the world to make products like ChatGPT more accurate. But Strawberry can reportedly generate accurate synthetic data to make OpenAI’s LLMs even better. According to The Information, Strawberry can solve the New York Times’ Connections word puzzles, which require creative thinking and pattern recognition, and can also solve math equations that it hasn’t seen before.
Meanwhile, Google DeepMind recently unveiled AlphaProof and AlphaGeometry 2, AI systems designed for formal math reasoning. Google says these two systems solved four out of six problems from the International Math Olympiad, which would be a good enough performance to earn a silver medal at the prestigious competition.
It’s a bit of a troll that memes about AI being unable to spell “strawberry” are circulating at the same time as reports on OpenAI’s Strawberry. But OpenAI CEO Sam Altman jumped at the opportunity to show us that he’s got a pretty impressive berry yield in his garden.

