For the first time, more than half of newly indexed English-language web pages are predominantly machine-generated, according to a study released this week by researchers at the Allen Institute for AI.
The study examined a stratified sample of 4.2 million URLs first indexed by Common Crawl during the first quarter of 2026, finding that 56% scored above the institute's threshold for "predominantly machine-generated" content. The figure was 31% one year earlier and below 8% in early 2023.
The training-data problem
The implications for the next generation of foundation models are significant. Empirical work over the past year has shown that uncritical inclusion of synthetic data in pre-training corpora produces measurable distributional artefacts — a phenomenon researchers have begun calling "model collapse drift."
Most major labs now disclose at least some filtering of suspected synthetic content, but verification remains difficult, and the most aggressive filtering pipelines shrink the corpus enough to harm model quality in its own right. The trade-off is becoming a defining technical constraint for the field.