What does AI actually read? Mostly the open web. And Common Crawl, the page-by-page archive that sits at the foundation of GPT, LLaMA, Gemini and most other major language models, looks nothing like the world that speaks.
English: 41 % of pages, falling slowly. German, Japanese and Chinese: clustered near 5 %. Arabic: 0.7 %, despite 400 million speakers. Across eight years of snapshots, read page by page, the training data that shapes how machines understand language is wildly uneven.