“Challenges a growing assumption that content created through substantial investment can be collected, stored, repurposed, and monetised simply because it is technically accessible.” — DCN CEO Jason Kint
The AI copyright debate has, until now, largely been framed as a dispute between content creators and the large AI companies whose products most visibly use their work. OpenAI, Anthropic, Google, Meta — these are the names that appear in lawsuits, in congressional testimony, and in the headlines of stories about who benefits from the internet’s content and who doesn’t.
This week, Digital Content Next, the trade body representing US digital publishers, directed its legal attention somewhere different: not toward the AI companies that use the data, but toward the organisation that collected it in the first place. The cease and desist sent to the Common Crawl Foundation targets the infrastructure layer of AI training — and makes a legal argument that, if it gains traction, would change the default assumption that has governed web crawling for nearly three decades.
What Common Crawl Is and Why It Matters So Much
Common Crawl is, in many ways, the invisible foundation of the modern AI industry. Founded in 2007, it has operated as a nonprofit organisation whose mission is to build and maintain an open, accessible archive of the web. Its crawler, CCBot, has been systematically collecting billions of web pages per month for nearly 20 years, and the resulting datasets are available for free download to anyone with the technical capability to use them.
The scale is difficult to overstate. The GPT-3 paper that OpenAI published in 2020 listed filtered Common Crawl data at a 60% weight in the model’s training mix. That figure has become a landmark reference point in the AI copyright discussion, but its implications are easy to underappreciate. Sixty percent of the training mix for one of the most commercially successful AI systems in history came from a single nonprofit archive, one built by crawling the web without asking permission from the sites being crawled.
Common Crawl’s position has generally been that this is both legally and ethically consistent with how the web works: the web is public, crawlers index it, and content that websites publish openly is available to read and archive. On this view, Common Crawl is doing what every search engine does, except that it makes its output available to a broader range of researchers and developers.
DCN’s cease and desist challenges that position at its root.
What DCN’s Letter Actually Demands
The letter makes two substantive demands and advances one key legal argument.
Demand 1: Stop Future Collection
DCN asks Common Crawl to cease scraping, retaining, or sharing copyrighted, paywalled, subscriber-only, or otherwise protected content from DCN member companies. This is the more conventional ask — it mirrors what many publishers already try to enforce through robots.txt and crawler blocks, but frames it as a legal obligation rather than a voluntary preference.
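For readers unfamiliar with the voluntary mechanism mentioned above, here is a minimal sketch in Python. It uses the standard library's `urllib.robotparser` to show how a robots.txt rule naming CCBot tells that crawler to stay away while leaving other crawlers unaffected. The robots.txt content, the `crawler_allowed` helper, and the `publisher.example` URL are illustrative assumptions, not taken from any DCN member's site. The check only models what a compliant crawler would decide; it enforces nothing on its own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher site.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://publisher.example/articles/some-story"
    print(crawler_allowed(ROBOTS_TXT, "CCBot", url))         # False
    print(crawler_allowed(ROBOTS_TXT, "SomeOtherBot", url))  # True
```

Note what the sketch leaves out: the rule expresses a preference about future fetches only, and it says nothing about pages collected before the rule was added.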
Demand 2: Remove Existing Content
This is where the letter becomes genuinely novel. DCN asks Common Crawl to remove member content already present in its archived datasets. This goes well beyond what robots.txt can accomplish — blocking a crawler stops future collection but does nothing about the terabytes of already-archived content that anyone can currently download.
The removal demand targets the historical archive, which is what AI companies have actually been training on. Stopping future crawls while leaving years of archived content accessible does relatively little to address the core concern.
The Legal Argument: Opt-In, Not Opt-Out
DCN’s central legal claim is captured in the phrase “copyright law is not an opt-out regime.” Under US copyright law, copyright protection is automatic: it attaches to an original creative work as soon as the work is fixed in tangible form, without registration, without notification, without any further action required from the creator.
The opt-out model that robots.txt represents assumes the opposite: that content is available unless specifically restricted. DCN argues this is legally backwards. Publishers shouldn’t have to actively opt out of having their copyrighted material collected and used. Common Crawl should need affirmative permission to collect and distribute it.
If this argument prevails in court — or influences regulatory thinking — it would invert the architecture of web crawling entirely. The default would shift from “accessible unless blocked” to “unavailable unless permitted.”
The Removal Credibility Problem
DCN’s letter doesn’t just demand removal — it raises serious questions about whether previous removal requests have been honoured. The lawyers are specifically examining whether Common Crawl’s statements to publishers about removal “may have been inaccurate or misleading.”
This refers to reporting by The Atlantic in November, which found content from The New York Times and Danish publishers still present in Common Crawl’s datasets after the organisation had reportedly agreed to remove it. Common Crawl maintains a public registry of sites that have requested not to be scraped, which includes entries for the Associated Press, the BBC, and a large News/Media Alliance submission covering hundreds of domains. But a site’s presence in the registry doesn’t appear to guarantee that its content is absent from the available datasets.
Common Crawl’s executive director Rich Skrenta has previously addressed this directly. His explanation is technical: the archive’s WARC file format can’t be edited after publication without breaking its structural integrity. Common Crawl therefore responds to removal requests by making the affected URLs inaccessible through its public tools and indices and by excluding them from future crawls. That is not the same as deleting them: the underlying files remain available in the raw archive.
The critical distinction: removing a URL from Common Crawl’s public index is not the same as removing it from the downloadable archive. Anyone who downloaded the dataset before the removal request was processed retains access to the content.
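The distinction can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not Common Crawl's software or the actual WARC format: `ToyArchive`, its methods, and the example URL are hypothetical names invented for illustration. The published records are held in an immutable tuple, standing in for files that cannot be edited after release, while the public index is an ordinary mutable lookup table.

```python
from typing import Optional


class ToyArchive:
    """Toy model: immutable published records plus a mutable public index."""

    def __init__(self, pages: dict[str, str]) -> None:
        # "Published" records: a tuple, so entries cannot be edited in place.
        self.records: tuple[tuple[str, str], ...] = tuple(pages.items())
        # Public index: URL -> position of the record in the tuple.
        self.index: dict[str, int] = {
            url: pos for pos, (url, _body) in enumerate(self.records)
        }

    def remove_from_index(self, url: str) -> None:
        """Honour a removal request by dropping the URL from the index only."""
        self.index.pop(url, None)

    def lookup(self, url: str) -> Optional[str]:
        """Fetch a page through the public index, as a typical user would."""
        pos = self.index.get(url)
        return None if pos is None else self.records[pos][1]

    def scan_raw(self, url: str) -> Optional[str]:
        """Bypass the index and read the raw records directly."""
        for record_url, body in self.records:
            if record_url == url:
                return body
        return None


if __name__ == "__main__":
    url = "https://publisher.example/story"
    archive = ToyArchive({url: "Full article text"})
    archive.remove_from_index(url)
    print(archive.lookup(url))    # None
    print(archive.scan_raw(url))  # Full article text
```

In this model, a removal request changes what the index returns but leaves the raw records untouched, which is the gap the publishers are pointing at.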
For AI companies that downloaded Common Crawl datasets months or years ago and used them for training, the removal process — whatever form it takes — has no effect on those existing models. The data is already inside the weights.
What Common Crawl Is Actually Saying in Response
Skrenta declined to comment on the DCN letter specifically when contacted by the press. But his public positions on similar claims are on record.
In a November blog post responding to The Atlantic, he denied that Common Crawl had lied to publishers or that it scrapes paywalled content. He said the archive is transparent about the complexity of removal and that “no one at Common Crawl has ever claimed this work was instantaneous or complete.”
In a recent forum post, Skrenta indicated that Common Crawl is contributing to open standards work on how websites express AI scraping preferences — essentially advocating for an improved opt-out mechanism rather than accepting the opt-in framework DCN proposes.
These positions are not unreasonable on their own terms. But they accept a framework — that opt-out is the appropriate model — that DCN is specifically challenging.
Why This Case Has Implications Beyond Common Crawl
The significance of DCN’s action extends beyond Common Crawl itself. Common Crawl is, in many ways, a convenient and relatively sympathetic target: it’s a nonprofit with a genuine open-access mission, it doesn’t directly profit from the AI applications built on its data, and it has at least attempted to create removal mechanisms. The for-profit AI companies that downloaded and used the data are significantly better resourced and harder to compel through a cease and desist alone.
But targeting Common Crawl allows DCN to establish the legal principle at the infrastructure level. If Common Crawl is found to require publisher permission before collecting and distributing their content, that finding creates pressure on every other crawler operating under similar assumptions — including the ones maintained by AI companies directly.
The UK’s Competition and Markets Authority (CMA) has already moved in a related direction, requiring Google to let publishers opt out of AI search features. The mechanism is different, but the underlying logic about publisher consent is similar. In the US, active copyright litigation involving AI companies is working through the courts in parallel.
Conclusion: The Battle for the Archive Is the Battle for the Future of AI
DCN’s cease and desist to Common Crawl is, on the surface, a legal letter from a trade group to a nonprofit. Underneath, it’s a direct challenge to the foundational assumption that made large-scale AI training possible at the pace and cost at which it occurred: that publicly accessible content is available content.
Whether the legal argument succeeds will depend on courts, regulators, and the speed of legislative response — none of which move quickly. But the framing is out there now, the case is being made, and the principle is being tested.
For publishers, the practical takeaway is twofold: blocking future crawls is necessary but insufficient, and the historical archive question is one that no individual publisher can resolve through robots.txt. The answer, if there is one, will come through collective legal and regulatory action — which is exactly what DCN is trying to build.