Newsletter · Jul 24, 2025

The (Real) Cost of AI Localization

Libor Safar

There is one thing we don't talk about nearly enough these days: AI. Or, ahem, one particular aspect of it: the cost of multilingual AI when we use it for language-related tasks.

AI may feel like air these days; it's everywhere. But unlike the air we breathe, it's not free. Any "free" or semi-free AI available today is essentially a proven user acquisition strategy.

There are real costs behind AI, with zillions pouring into AI chips, data centers, and electricity use. Not to mention the $100M signup packages AI experts may expect to join your company, which, remember, come on top of the free coffee and all the ping pong tables.

These are very real costs that will only continue to rise, and someone will need to pay for them (any volunteers here?).

What does it mean for AI in localization? Let's dive in.

By the same token: the basics

The cost of AI localization is typically measured in terms of the number of tokens processed by language models for any given language-related task. This is how commercial APIs charge users. A token is a unit of text, a word or a sub-word, generated through a tokenization process.

Tokens are how LLMs convert our natural-language inputs into numerical representations, which are then mapped to embedding vectors that serve as inputs to the neural network layers. LLMs are probability machines, after all, not modern-day replicas of the Library of Alexandria.

Tokenization can work in multiple ways, and providers of foundation LLMs constantly fine-tune existing tokenization methods or develop new ones to optimize the process (and reduce compute requirements).

Tokenization methods and languages

Beyond basic word-level or character-level tokenization, the dominant approach today, used by OpenAI's GPT models and Meta's Llama models, is Byte-Pair Encoding (BPE). (Google's BERT relies on the closely related WordPiece method.)

This sub-word method handles out-of-vocabulary words by breaking them into known subword components. It maintains manageable vocabulary sizes while preserving linguistic patterns.
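
To make the merge idea concrete, here is a toy sketch of BPE training in Python. It's my own illustrative code, not any provider's production tokenizer: it starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]   # start with character-level symbols
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs in the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # the most frequent pair wins
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol.
        merged_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus.append(out)
        corpus = merged_corpus
    return merges, corpus

merges, tokenized = train_bpe(["localization", "local", "locale", "lokalizace"], 6)
print(merges)      # e.g. [('l', 'o'), ('a', 'l'), ...] depending on pair frequencies
print(tokenized)   # words split into the learned sub-word symbols
```

Real tokenizers run this process over enormous corpora and learn tens or hundreds of thousands of merges, but the principle is the same: frequent patterns become single tokens.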

BPE tokenizers aren't generic or universal. They depend heavily on the training data used for a specific LLM. They're essentially compressed representations of the linguistic patterns in the training corpus. Their efficiency therefore depends on both the language composition of the training data and the languages we use with LLMs.

Word (language) | Splits into
localization (English) | 2 tokens: local-ization
Lokalisierung (German) | 3 tokens: Lok-alis-ierung
lokalizace (Czech) | 4 tokens: lok-al-iz-ace
ローカライゼーション (Japanese) | 5 tokens: ロ-ーカ-ライ-ゼ-ーション

These figures are based on GPT-4o, which features improved tokenization efficiency. The same words would require more tokens in earlier versions. For example, ローカライゼーション would consume 11 tokens in GPT-3.5.

You can experiment with OpenAI's Tokenizer tool here: https://platform.openai.com/tokenizer.
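
If you'd rather script it, a short sketch along these lines (assuming the tiktoken package is installed and a recent enough release that recognizes the GPT-4o encoding) prints the token counts and pieces for the words in the table above:

```python
# pip install tiktoken
import tiktoken

# Recent tiktoken releases map "gpt-4o" to its o200k_base encoding.
enc = tiktoken.encoding_for_model("gpt-4o")

words = {
    "English": "localization",
    "German": "Lokalisierung",
    "Czech": "lokalizace",
    "Japanese": "ローカライゼーション",
}

for language, word in words.items():
    token_ids = enc.encode(word)
    # Tokens are byte sequences, so individual pieces of non-Latin scripts
    # may not decode to clean standalone characters.
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{word} ({language}): {len(token_ids)} tokens -> {pieces}")
```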

Reverse engineering LLMs

Interestingly, BPE tokenizers can be used to reverse-engineer the composition of their training data. By analyzing the merge rules, researchers can estimate the languages, content types, and domains used to train specific models, as elegantly demonstrated in the recent study Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?.


Training data mixture predictions for several commercial tokenizers. Source: https://arxiv.org/pdf/2407.16607

Another valuable resource recently released in Germany is the AI Language Proficiency Monitor (https://huggingface.co/spaces/fair-forward/evals-for-every-language). This tool tracks how "multilingual" LLMs truly are by benchmarking both their comprehension of specific languages and their translation performance, both overall and for individual languages.


The AI Language Proficiency Monitor. Source: https://huggingface.co/spaces/fair-forward/evals-for-every-language

All languages created equal? You wish!

Costs are directly linked to how texts in different languages split into tokens, but this split varies across languages. Tokenizers trained primarily on English data work efficiently with English content, but fragment non-English texts into many more tokens, significantly increasing costs.

The more "distant" a language is linguistically from English, in terms of morphology or syntax, the worse a tokenizer performs.

Morphologically rich languages like Finnish, Turkish, and Hungarian often result in more tokens per word due to inflections and compound words.

This works both ways, which is why "sovereign" LLMs optimized for non-English languages frequently use language-specific tokenizers or tokenizer extensions better suited for these languages.

For Chinese-based language models, it makes sense to use character-level tokenization, where each Chinese character is treated as an individual token. There's also the newer sub-character tokenization method, which converts Chinese characters into sequences based on glyph (visual structure) or pronunciation information.

In India, OpenHathi, a Hindi large language model built on Llama, uses an extended version of Llama 2's tokenizer specifically optimized for Hindi.

For Japanese, Swallow LLM uses a similar approach. It deploys the base Llama 2 tokenizer but extends its original vocabulary to include Japanese characters and linguistic patterns, rather than training an entirely new tokenizer from scratch.


Made in Japan: Swallow LLMs are based on Llama but feature enhanced Japanese language capabilities

Tokenization efficiency is also related to word frequency, with common words typically being single tokens. This creates a penalty for words that appear infrequently in the training data (like those from less-resourced languages). Since tokenizers are learned statistically, not phonemically or syllabically, tokens have no fixed correspondence to linguistic units.

The ASCII advantage

For models trained primarily on English content, there's also a penalty for languages using non-Latin scripts, which stems from Unicode encoding. UTF-8's variable-length encoding assigns different numbers of bytes to different characters. English characters fit into a single byte in UTF-8, achieving maximum compression, thanks to ASCII.

Meanwhile, Latin-based languages with special characters may require two bytes per character. East Asian languages typically need about 3 bytes per character. Some special cases, including emojis and pictographic symbols, can consume up to 4 bytes per character (emojis are serious token-eaters, so avoid using them with LLMs).
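
You can see the byte-level difference directly by checking UTF-8 lengths. A quick illustrative snippet (the sample words are my own picks):

```python
samples = {
    "English": "localization",
    "German": "Übersetzung",          # Ü needs two bytes in UTF-8
    "Japanese": "ローカライゼーション",   # three bytes per character
    "Emoji": "🌍",                     # four bytes
}

for label, text in samples.items():
    encoded = text.encode("utf-8")
    # bytes per character shows how quickly non-ASCII text inflates
    print(f"{label}: {len(text)} chars -> {len(encoded)} UTF-8 bytes "
          f"({len(encoded) / len(text):.1f} bytes/char)")
```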

The extra baggage's gonna cost you

So, just the simple fact of working with languages other than English often means costs multiply. But don't worry, it gets worse. Unlike NMT, where processing usually involves a single pass with costs directly tied to word count, using LLMs for language tasks comes with a lot of "extra baggage."

First, we pay twice: for the input tokens (prompts) and for the output tokens (generated content). This interaction with LLMs is what the real experts call "LLM inference".

Output tokens often cost more because they require greater computational resources. While input tokens are processed in parallel, output tokens must be generated sequentially. This means GPU memory is occupied longer, resulting in higher costs.

Provider | Model | Input ($/1M tokens) | Output ($/1M tokens) | Notes
OpenAI | GPT-4o | $2.50 | $10.00 | Most popular premium model
OpenAI | GPT-4o Mini | $0.15 | $0.60 | Cost-effective option
Anthropic | Claude Opus 4 | $15.00 | $75.00 | Highest capability tier
Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | Balanced performance/cost
Anthropic | Claude Haiku 3.5 | $0.80 | $4.00 | Fastest, lowest cost
Google | Gemini 2.5 Pro | $1.25-$2.50 | $10.00-$15.00 | Context length dependent
Google | Gemini 2.5 Flash | $0.30 | $2.50 | Popular mid-tier option
Google | Gemini 2.0 Flash | $0.10 | $0.40 | Competitive pricing
Sample of current pricing of some popular LLMs (public rates, no volume discounts, not batch API prices, data as of July 24, 2025)

Second, when using LLMs for basic translation tasks, we're not just feeding in the source text. We also need to include instructions to guide the AI's translation approach, context information, content details, target audience, purpose, domain, style preferences, terminology, examples, and formatting guidelines for maintaining specific text structure.

This same pattern applies to every step in the typical multi-step AI localization workflow. Since workflows may require multiple agents, and each agent represents at least one prompt-response pair, tokens accumulate quickly.

On the output side, it's rarely just one perfect translation. LLMs often provide alternative translations and explanations for their translation choices. All these elements contribute to the total token count.
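
To get a feel for how these pieces add up, here is a rough, illustrative estimate for a single translation request, using the public GPT-4o list prices from the table above and made-up token counts for the prompt scaffolding:

```python
# Public GPT-4o list prices from the table above (USD per 1M tokens, July 2025).
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one LLM call: input and output tokens are billed separately."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical breakdown of a single translation request.
source_text = 400        # tokens of source content
instructions = 250       # system prompt: style, audience, formatting rules
terminology = 150        # glossary entries and examples
translation = 500        # generated target text (often longer for non-English)
explanations = 200       # alternatives and notes the model adds

cost = request_cost(
    input_tokens=source_text + instructions + terminology,
    output_tokens=translation + explanations,
)
print(f"One request: ${cost:.4f}")               # ~$0.009 per request
print(f"10,000 requests: ${cost * 10_000:.2f}")  # the totals add up quickly at scale
```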

The triple whammy for less-resourced languages

So LLM costing is more complex than NMT costing, even if LLMs can look cheaper at the basic per-unit level. NMT pricing is typically measured in dollars per million source characters, remains consistent across languages, and is largely predictable. LLMs, however, have a much more variable cost structure.

For less-resourced languages, which often command lower rates generally, this creates a triple disadvantage:

  • A typical commercial LLM won't be trained on much data in that language, so the generated output is likely to be suboptimal.

  • Any processing will consume many more tokens than English, making it significantly more costly.

  • Since many more tokens are needed for both input and output, there's a smaller context window available (though this is becoming less of an issue as context windows expand to millions of tokens with newer models).

Worse quality, higher costs. What's not to like about that?

Storage is also becoming an increasingly important cost consideration. This is because moving files is more expensive than processing in place... especially for storage-heavy content like video and audio.

It’s also worth noting that while more tokens mean higher costs, they also mean more processing time is needed. LLM latency goes up too.

How to manage LLM costs for localization

LLM pricing may look deceptively cheap on a per-unit basis, but as shown above, costs can multiply quickly, so effective cost management is still needed. Here are a few options, beyond optimizing the LLMs themselves:

  1. Language-specific prompts: developing prompts, where it makes sense, that minimize token usage while maintaining quality.

  2. English-first tokenization: limited or selective translation of prompts, allowing more tokens to process English text.

  3. Prompt compression: reducing the length of prompts by removing low-information, unnecessary tokens. The recent efforts to standardize the structure and content of style guides are another example.

  4. Content segmentation: breaking content into smaller segments that are still contextually coherent.

  5. Batch, summarize, and compress to reduce the number of tokens needed. This is particularly important for localization where the same prompt is often sent multiple times.

    Prompt caching - storing frequently used prompt content to avoid redundant processing - can significantly reduce costs. Cached input tokens are typically much cheaper (75% less with OpenAI) compared to standard input tokens.

    Additionally, you can minimize repetitive prompting by storing and referencing summarized or compressed context through embeddings or system prompts.

  6. Consider using different models for different content types (or use cases), because not all models cost the same. For example, use cheaper models for short, low-impact strings, and save the high-end models for more sensitive or nuanced content (see the sketch after this list).
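
As a simple illustration of point 6, a content router might look something like this. The model names are taken from the pricing table above, but the content types and thresholds are made-up placeholders, not a recommendation:

```python
# Hypothetical routing rules: cheap models for low-impact strings,
# premium models for sensitive or nuanced content.
ROUTING_RULES = {
    "ui_string": "gpt-4o-mini",        # short, low-impact, high-volume
    "support_article": "gpt-4o-mini",
    "marketing_copy": "gpt-4o",        # nuance and tone matter
    "legal": "claude-opus-4",          # highest-stakes content
}

def pick_model(content_type: str, length_in_tokens: int) -> str:
    """Choose a model based on content type and size (illustrative only)."""
    model = ROUTING_RULES.get(content_type, "gpt-4o")
    # Very short segments rarely justify a premium model, regardless of type.
    if length_in_tokens < 20 and content_type != "legal":
        model = "gpt-4o-mini"
    return model

print(pick_model("ui_string", 8))         # gpt-4o-mini
print(pick_model("marketing_copy", 300))  # gpt-4o
```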

While all of the above addresses mostly direct costs, the whole area of prompt engineering has a huge, if not decisive, impact. Operating NMT is straightforward, but prompt engineering is anything but. Done well, it can significantly reduce total costs. Done poorly... well...

The TL;DR

It may be trendy to say that NMT, loved and used for years, is just a form of AI, and so we intimately know how multilingual AI works. This is only partially true, of course, and the economics is one area where each technology works differently. AI costs may seem negligible. Just $10 for 1M tokens? Gimme two!

But costs can rise fast. And as LLMs continue to evolve, the cost dynamics of AI localization will inevitably shift. Today's triple whammy of cost, quality, and context limitations for non-English languages reflects current technological realities. We need to account for these hidden costs in our localization strategies while pushing AI providers to develop more linguistically democratic (aka multilingual) models.

PS Huge thanks to Erik Vogt and Martin Chrastek for reviewing this article and making it so much better and (hopefully) factually 100% correct.


What I'm reading

With AI, it's back to school, and I'm taking this literally. I'm currently reading the university textbook Artificial Intelligence: A Modern Approach, Global Edition.

I do occasionally get over-ambitious, as you can see, but this text provides a much wider perspective on AI, well beyond LLMs and their associated tips and tricks. It covers machine learning, deep learning, multi-agent systems, robotics, NLP, probabilistic programming, and more.

Some of the math is punishingly hard for me, and would have been even at my "math peak" during my engineering studies, so I largely ignore those parts. It's a steep learning curve and definitely not beach reading, but I'm enjoying it nonetheless.


© 2025 Libor Safar
