“The question isn’t which bots to block anymore. It’s which bots to let in.” — A fundamental reversal that’s been 30 years in the making.
The World Wide Web has operated on a simple, assumed-open architecture since the first web crawler appeared in 1993. Every bot gets access unless a publisher explicitly names it and tells it to stop.
Robots.txt was built on this model — it’s an opt-out system, where the default is openness and restriction requires deliberate action by the content owner.
That default is now being challenged directly by some of the most recognised names in news publishing. Reuters and Time have both moved to a “default-deny” model: every bot is blocked unless it has been specifically approved and added to an allowlist. The shift in logic is the opposite of robots.txt — it’s an opt-in system, where access must be earned rather than assumed.
The immediate cause is AI training and inference crawlers, which have proliferated at a pace that traditional blocklist approaches simply cannot keep up with. But the implications extend well beyond AI scraping and touch fundamental questions about how content creators relate to the systems that use their work.
Why Robots.txt Alone No Longer Works
To understand why major publishers are moving beyond robots.txt, you need to understand what robots.txt is — and what it isn’t.
Robots.txt is a publicly accessible text file that lives at the root of any web domain. It communicates instructions to crawlers about which parts of a site they’re permitted to access. It’s been the primary mechanism for content access control on the web for three decades. And it relies entirely on one thing: voluntary compliance.
The compliance assumption held reasonably well for most of the web’s history. Major search engines such as Google, Bing, and Yahoo have strong incentives to respect robots.txt directives because their entire business model depends on maintaining publisher trust and avoiding legal exposure. When Googlebot respects your robots.txt, it’s not because Google is legally obligated to; it’s because compliance is in Google’s interest.
AI training crawlers are in a different position. Many are operated by companies whose primary product is built on the training data collected — meaning the value of access is extremely high, the enforcement mechanisms are weak, and the reputational cost of non-compliance has been, until recently, relatively low.
The data makes the gap concrete. Research from Tollbit found that 30% of AI bot scrapes don’t comply with explicit robots.txt directives. In other words, nearly a third of AI bot scrapes simply ignore the instructions meant to govern them.
A blocklist — which names specific bots to exclude — faces a compounding version of this problem. You can only block bots you know about. As People Inc. discovered when it switched from a blocklist to an allowlist, the number of user agents it was blocking jumped from around 2,100 to more than 30,000.
No publisher maintains an up-to-date awareness of 30,000 distinct crawler identities. The blocklist model simply cannot scale to the current crawling environment.
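The scaling problem can be shown with a minimal Python sketch. The bot names and both lists here are hypothetical, chosen only for illustration; they are not any publisher's real configuration. The point is that a crawler nobody has seen before passes a blocklist by default but fails an allowlist by default.

```python
# Hypothetical lists for illustration only.
BLOCKLIST = {"badbot", "scraperx"}      # bots the publisher already knows about
ALLOWLIST = {"googlebot", "bingbot"}    # bots the publisher has approved


def product_token(user_agent: str) -> str:
    """Reduce 'Name/1.0' to a lowercase 'name' for comparison."""
    return user_agent.split("/")[0].strip().lower()


def blocklist_permits(user_agent: str) -> bool:
    # Opt-out logic: everything gets in unless it is named.
    return product_token(user_agent) not in BLOCKLIST


def allowlist_permits(user_agent: str) -> bool:
    # Opt-in logic: nothing gets in unless it is named.
    return product_token(user_agent) in ALLOWLIST


if __name__ == "__main__":
    for agent in ("Googlebot/2.1", "BadBot/1.0", "NewCrawler/1.0"):
        print(agent, blocklist_permits(agent), allowlist_permits(agent))
```

Under these assumptions, the unknown `NewCrawler/1.0` is admitted by the blocklist and refused by the allowlist, with no list maintenance needed in the second case.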
How Default-Deny Actually Works
The default-deny model inverts the architecture entirely. Instead of “every bot gets in unless blocked,” the starting position is “no bot gets in unless approved.”
For Reuters, the implementation is visible in its live robots.txt file. It explicitly lists approved crawlers (from Amazon, Google, Bing/Microsoft, Yahoo, and OpenAI) and then disallows other bots from most of the site. The allowlist isn’t secret, but a place on it is earned, not assumed.
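As a rough illustration of how an allowlist-style robots.txt is read by a compliant crawler, the sketch below uses Python's standard `urllib.robotparser`. The file contents are hypothetical and simplified: this is not Reuters' actual file, which according to the description above disallows other bots from most of the site rather than all of it. The `example.com` URL is likewise a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical default-deny robots.txt: named crawlers are allowed,
# and every other user agent falls through to the wildcard group.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str,
               url: str = "https://example.com/article") -> bool:
    """Return what a compliant crawler would conclude for this URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent in ("Googlebot", "Bingbot", "UnknownScraper"):
        print(agent, is_allowed(parser, agent))
```

This only models what a well-behaved crawler would decide for itself. A crawler that ignores robots.txt is unaffected, which is why the file on its own remains a request rather than a control.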
Josh London, head of Reuters Professional, described the approval criteria to Digiday as a “fair value exchange” covering four possible contributions: paying for content through licensing, sending traffic back to Reuters, keeping the site operational, or supporting monetisation. A crawler that offers none of these doesn’t get approved, and any crawler that isn’t approved gets blocked.
Time, People Inc., and The Atlantic have adopted similar approaches, all within the past year. The SPUR Coalition, which is building shared industry standards around licensing and content use, added 30 members in the most recent month, bringing its total to 36 organisations.
People Inc. went from blocking ~2,100 user agents under a blocklist model to blocking more than 30,000 under an allowlist model — without any change in its content or publishing practices. The difference is entirely in the architecture of the decision.
The Economics: Why Friction Is the Point
Reuters says the default-deny switch hasn’t cost it traffic — a crucial data point for publishers considering the same move. But the more interesting claim is about leverage.
When a crawler is blocked at the network level rather than through robots.txt, routing around the block requires technical workarounds that cost money. The executives at Reuters and Time are explicit that this cost is intentional. When scraping has a price, even an indirect one, the economics shift toward negotiation.
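To make the difference between a robots.txt request and an enforced block concrete, here is a minimal Python sketch of a server-side decision. Everything in it is an assumption for illustration: the approved tokens, the crude bot-detection hints, and the choice of a 403 response are hypothetical, and nothing here describes how Reuters or Time actually enforce their rules. In practice a rule like this would more likely live in a CDN or firewall than in application code.

```python
# Hypothetical values; not taken from any publisher's real configuration.
APPROVED_TOKENS = ("googlebot", "bingbot")
BOT_HINTS = ("bot", "crawler", "spider")  # crude heuristic, an assumption


def status_for(user_agent: str) -> int:
    """Return the HTTP status a default-deny rule would send back.

    Approved crawlers and traffic that does not identify as a bot get 200.
    Any other self-identified bot gets 403, whether or not it read robots.txt.
    """
    agent = user_agent.lower()
    if any(token in agent for token in APPROVED_TOKENS):
        return 200
    if any(hint in agent for hint in BOT_HINTS):
        return 403
    return 200


if __name__ == "__main__":
    for agent in (
        "Mozilla/5.0 (compatible; Googlebot/2.1)",
        "ExampleAIBot/1.0",
        "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0",
    ):
        print(status_for(agent), agent)
```

A User-Agent header is trivially spoofed, so a check like this is only a starting point. A real deployment would need stronger verification of a crawler's identity, and defeating that verification is the kind of workaround cost the paragraph above describes.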
Scrapers who previously assumed free access suddenly face a choice between paying for workarounds or coming to the table for a licensing conversation.
Reuters has existing licensing infrastructure from its newswire business, which means it has mechanisms for these conversations. The content doesn’t just go dark — it becomes a product with terms attached to it.
This is the theoretical power of collective action in the default-deny model. One publisher blocking bots is easy for a large AI company to absorb or ignore. Thirty-six publishers coordinating under shared standards creates a market-level signal that licensing is becoming the expected norm rather than an exceptional arrangement.
Who Does This Actually Work For?
The honest analysis of default-deny requires acknowledging the significant complications, particularly for publishers who don’t share Reuters’ market position.
The AI Visibility Cost
Blocking an AI crawler means losing whatever that crawler was sending back. For search-adjacent AI systems — like those that power AI Overviews or AI search engines — that might mean reduced visibility in AI-generated results. For a publisher that depends on referral traffic from AI-powered search features, that’s a real cost, and it arrives immediately.
Reuters can absorb that cost because its traffic model isn’t primarily dependent on AI referrals and because its brand recognition means users actively seek it out. A publisher whose traffic is substantially driven by AI-powered discovery doesn’t have the same cushion.
The Leverage Gap
Default-deny creates leverage only if the AI company values access enough to negotiate. For Reuters and The New York Times, that value is clear — these are globally recognised sources that AI systems benefit significantly from including.
For a mid-size vertical publisher with a strong but niche audience, the leverage is much weaker. A company training an AI model may simply deprioritise content from sources that require negotiation when adequate alternatives exist.
The Small Publisher Problem
The payment pools available through AI licensing deals are, by most accounts, small relative to traditional search advertising revenue and concentrated among the largest publishers. If the licensing economy only works for the top tier, default-deny becomes a tool that benefits the already-powerful while offering smaller publishers risk without meaningful reward.
DCN’s “Copyright Is Not an Opt-Out Regime” Argument
Alongside the technical and economic shifts, there’s a legal dimension that’s becoming increasingly prominent. Digital Content Next’s cease-and-desist letter to Common Crawl (covered separately in this week’s roundup) articulates a legal position that is relevant to default-deny adoption more broadly: copyright law requires permission before use, rather than merely offering owners an opt-out.
If that legal position gains traction (and there are active court cases in multiple jurisdictions that will shape whether it does), it would shift the burden fundamentally. Instead of publishers having to actively block crawlers, AI companies would need to affirmatively seek permission. Default-deny would essentially become the legal baseline rather than a technical workaround.
That outcome is far from certain. But it’s the direction that publisher coalitions and legal advocates are pushing toward.
What to Watch Next
- Whether smaller publishers adopt the default-deny model and whether they see the same absence of traffic cost that Reuters reports.
- How the SPUR Coalition’s shared licensing standards develop and whether they create a workable market for content access.
- Legal outcomes in AI training copyright cases in the US, UK, and EU, which will determine how much legal weight the “permission first” argument carries.
- Whether AI companies begin to publicly distinguish between publishers with allowlists and those without in their indexing and visibility decisions.
Conclusion: The Default Is Changing, And The Question Is Whether You Change With It
The robots.txt era assumed good faith and voluntary compliance. That era is ending — not dramatically, but methodically, as major publishers discover that the only reliable control is access control, not request control.
Default-deny is a meaningful and logical response to the current crawling environment. For large publishers with existing licensing infrastructure and strong brand leverage, it makes clear strategic sense. For smaller publishers, the calculation is more nuanced — the leverage is weaker and the AI visibility cost is more significant.
What’s not in question is that this is the direction the market is moving. The question for every publisher isn’t whether to engage with this issue, but when and how.