multilingual.ai
Global Marketing · Jun 02, 2026

The Languages of the Internet 2019-2026 and the Multilingual Opportunity

Libor Safar

What does AI actually read? Mostly the open web. And Common Crawl, the page-by-page archive that sits at the foundation of GPT, LLaMA, Gemini and most major language models, looks nothing like the world that speaks.

English: 41 % of pages, falling slowly. German, Japanese and Chinese: clustered near 5 %. Arabic: 0.7 %, despite 400 million speakers. Eight years of snapshots, read page by page. The training data that shapes how machines understand language is wildly uneven.


© 2026 Libor Safar
