Preserving Pre-AI Content: The Digital Equivalent of Low-Background Steel
In the years after the first atomic bombs were detonated, scientists ran into a curious problem: steel produced after 1945 was contaminated. Atmospheric nuclear explosions had spread radionuclides around the globe, and because steelmaking drew in atmospheric air, much of the metal produced afterward was faintly radioactive and unsuitable for radiation-sensitive applications.
This contamination was a serious problem for anyone building radiation-sensitive instruments, such as Geiger counters and other precision detectors. The workaround was to salvage steel from battleships sunk before the first nuclear tests. Shielded by seawater, the metal in these wrecks had never been exposed to fallout, and it became known as low-background steel: prized for its rarity and purity, and a clean raw material for high-accuracy instruments.
Fast forward to the year 2025, and a parallel issue is emerging, not in the depths of the ocean but across the vast landscape of the internet.
Since ChatGPT launched in late 2022, AI-generated content has proliferated across blogs, search engines, and social media. The web is increasingly filled with text synthesized by models and chatbots rather than written by people. Like radioactive fallout, this content is pervasive and hard to filter out: the average reader struggles to tell genuine human writing from machine-generated text, and the change is reshaping the fabric of the digital landscape.
The implications are particularly worrying for AI researchers and developers. AI models are trained on vast datasets scraped from the internet, which historically consisted of human-generated content: messy, insightful, biased, and occasionally brilliant. But if today's systems are trained mostly on text produced by earlier models, which were themselves trained on still earlier AI output, researchers warn of "model collapse": a feedback loop in which originality and nuance are progressively diluted as models learn chiefly from imitations of themselves.
Put simply, AI models are meant to learn the patterns of human language and thought. Train them mainly on their own output, and the result resembles photocopying a photocopy: each iteration comes out blurrier, until distinctiveness and creativity are lost.
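The "photocopy of a photocopy" dynamic can be illustrated with a toy simulation. The sketch below is a deliberately simplified stand-in for real model training, not how LLMs actually work: each "generation" is built only by resampling the previous generation's output, and the number of distinct words that survive shrinks over time.

```python
import random

# Generation 0: a stand-in for diverse, human-written text.
VOCAB = [f"word{i}" for i in range(20)]

def next_generation(corpus):
    """Build the next 'training set' purely by sampling the previous
    generation's output with replacement -- a crude stand-in for a model
    trained only on an earlier model's text."""
    return [random.choice(corpus) for _ in range(len(corpus))]

random.seed(7)
corpus = list(VOCAB)
for gen in range(1, 101):
    corpus = next_generation(corpus)
    if gen % 20 == 0:
        print(f"gen {gen:3d}: {len(set(corpus))} distinct words remain")
```

Because each generation can only reuse words that survived the one before it, diversity never recovers once it is lost. Real model collapse is statistically subtler, but the one-way loss of rare material is the same basic mechanism.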
As Will Allen, a vice president at Cloudflare, one of the internet's largest networks, points out, human-generated content from before 2022 has surged in value. It anchors AI models, and society, in a shared reality that is becoming increasingly rare. That grounding matters most where precision and trust are non-negotiable, such as medicine, law, and finance. Allen says he would rather his doctor rely on research from real human trials than on AI-generated information.
“The data that has that connection to reality has always been critically important and will be even more crucial in the future,” Allen stated. “Without that foundational truth, navigating complexities becomes significantly more challenging.”
The problem is already showing up in everyday life. Venture capitalist Paul Graham recently described searching for information on how to set the temperature on a pizza oven. He found himself checking the publication dates of articles, hunting for content that wasn't merely "AI-generated SEO-bait," as he put it in a post on X (formerly Twitter).
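Graham's manual date-checking amounts to a simple filter: keep only pages published before generative AI flooded the web. A minimal sketch of that idea, assuming ChatGPT's public launch date as the cutoff; the article list here is hypothetical, invented for illustration:

```python
from datetime import date

# Assumed cutoff: ChatGPT's public launch, November 30, 2022.
PRE_AI_CUTOFF = date(2022, 11, 30)

# Hypothetical search results: (title, publication date).
articles = [
    ("Wood-fired pizza oven temperature basics", date(2019, 6, 2)),
    ("10 BEST pizza oven settings (2025 guide)", date(2025, 1, 14)),
    ("Neapolitan baking at home", date(2021, 3, 8)),
]

# Keep only articles published before the cutoff.
pre_ai = [(title, d) for title, d in articles if d < PRE_AI_CUTOFF]
for title, published in pre_ai:
    print(published.isoformat(), title)
```

In practice, of course, publication dates can themselves be faked or missing, which is part of why curated archives are attractive.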
In response, Malte Ubl, the CTO of AI startup Vercel and a former Google Search engineer, remarked that Graham was effectively filtering the internet for content that was "pre-AI-contamination." "The analogy I've been using is low background steel, which was made before the first nuclear tests," Ubl wrote.
Matt Rickard, another former Google engineer, echoed these sentiments in a blog post earlier this year, warning that the datasets available for AI training are becoming increasingly compromised. “AI models are trained on the internet. More and more of that content is being generated by AI models,” Rickard pointed out. “As a result, output from AI models is becoming relatively undetectable, making it increasingly difficult to find training data that is unmodified by AI.”
Some industry figures argue that the answer is to preserve the digital equivalent of low-background steel: human-generated data from before the AI boom. The idea is to keep a record of the internet in its authentic, human-authored form, free of AI-generated filler and SEO-optimized noise.
One such advocate is John Graham-Cumming, a Cloudflare board member and the company's former CTO. He runs a project called LowBackgroundSteel.ai, which catalogs datasets, websites, and media that existed prior to 2022, the year generative AI content took off. One entry is GitHub's Arctic Code Vault, an archive of open-source software stored in a decommissioned coal mine in Svalbard, Norway, captured in February 2020, just before AI-assisted coding became widespread.
Graham-Cumming's initiative aims to archive the web as it once was: written by humans, with intent and context. Another resource he highlights is wordfreq, a project by linguist Robyn Speer that tracked word frequencies across the internet until 2021. In a 2024 update, Speer announced she would no longer maintain it, writing that "Generative AI has polluted the data." Model-driven shifts in word usage skew the statistics, making post-2022 web text a less reliable window into human thought and expression. She noted, for example, that ChatGPT is unusually fond of the word "delve," inflating its frequency online well beyond genuine human usage.
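Speer's "delve" observation boils down to comparing relative word frequencies across corpora, the kind of statistic wordfreq tracked at scale. A minimal sketch of the idea, using two hypothetical snippets rather than real corpus data:

```python
from collections import Counter

def word_frequencies(text):
    """Normalized word frequencies, in the spirit of projects like wordfreq."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical snippets for illustration; not real corpus data.
human_text = "we looked into the logs and dug into the details of the bug"
ai_text = "let us delve into the logs and delve into the details of the bug"

human = word_frequencies(human_text)
ai = word_frequencies(ai_text)

marker = "delve"
print("human frequency of", marker, "=", human.get(marker, 0.0))
print("AI frequency of", marker, "=", ai.get(marker, 0.0))
```

Real stylometric comparisons use far larger corpora and control for topic, but the core measurement, a relative frequency that diverges between human and machine text, is this simple.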
As Cloudflare's Will Allen observed, while AI models trained on synthetic content can indeed boost productivity and eliminate tedium in creative tasks, there remains a critical need to stay grounded in some level of truth. He is an advocate for the responsible use of AI tools like ChatGPT, Google’s Gemini, and Claude, recognizing their potential benefits while also acknowledging the importance of preserving human-generated content.
The low-background steel analogy isn't perfect, though. Steelmakers eventually found ways to produce uncontaminated steel again, for instance by smelting with purified oxygen rather than atmospheric air, so innovation can route around contamination. Even so, Allen insists that maintaining a foundation grounded in reality is indispensable.
The stakes extend beyond the technical performance of AI models; they reach into our shared sense of reality. Just as scientists relied on low-background steel for precise measurements, we may come to depend on carefully curated pre-AI content to gauge the human experience: how we think, reason, and communicate in a world increasingly shaped by machines that imitate us. The era of the pure internet may be behind us, but the preservationists' work suggests that salvaging the past may be the best way to build a trustworthy future.