For the first time, more than half of newly indexed English-language web pages are predominantly machine-generated, according to a study released this week by researchers at the Allen Institute for AI.
The study examined a stratified sample of 4.2 million URLs first indexed by Common Crawl during the first quarter of 2026, finding that 56% scored above the institute's threshold for "predominantly machine-generated" content. The figure was 31% one year earlier and below 8% in early 2023.
The training-data problem
The implications for the next generation of foundation models are significant. Empirical work over the past year has shown that uncritical inclusion of synthetic data in pre-training corpora produces measurable distributional artefacts — a phenomenon researchers have begun calling "model collapse drift."
Most major labs now disclose at least some filtering of suspected synthetic content, but verification remains difficult, and the most aggressive filtering pipelines shrink the corpus enough to harm model quality in its own right. The trade-off is becoming a defining technical constraint for the field.