The Race to Block OpenAI’s Scraping Bots Is Slowing Down

OpenAI’s spree of licensing agreements is paying off already—at least in terms of getting publishers to lower their guard.
A photo illustration depicting a woman dressed in all black entering a newspaper.
Photo-Illustration: Darrell Jackson/Getty Images

It’s too soon to say how the spate of deals between AI companies and publishers will shake out. OpenAI has already scored one clear win, though: Its web crawlers aren’t getting blocked by top news outlets at the rate they once were.

The generative AI boom sparked a gold rush for data—and a subsequent data-protection rush (for most news websites, anyway) in which publishers sought to block AI crawlers and prevent their work from becoming training data without consent. When Apple debuted a new AI agent this summer, for example, a slew of top news outlets swiftly opted out of Apple’s web scraping using the Robots Exclusion Protocol, or robots.txt, the file that allows webmasters to control bots. There are so many new AI bots on the scene that it can feel like playing whack-a-mole to keep up.
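An opt-out like the one described above takes the form of a few lines in a site's robots.txt file. A minimal sketch, using the GPTBot and Applebot-Extended user agent tokens that OpenAI and Apple respectively document for their crawlers:

```txt
# Block OpenAI's training crawler site-wide
User-agent: GPTBot
Disallow: /

# Opt out of Apple's AI training crawler
User-agent: Applebot-Extended
Disallow: /
```

Each new AI crawler has its own user agent token, which is why keeping the file current can feel like whack-a-mole.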

OpenAI’s GPTBot has the most name recognition and is also more frequently blocked than competitors like Google AI. The number of high-ranking media websites using robots.txt to “disallow” OpenAI’s GPTBot dramatically increased from its August 2023 launch until that fall, then steadily (but more gradually) rose from November 2023 to April 2024, according to an analysis of 1,000 popular news outlets by Ontario-based AI detection startup Originality AI. At the peak, just over a third of the websites blocked the bot; that share has since dropped to closer to a quarter. Within a smaller pool of the most prominent news outlets, the block rate is still above 50 percent, but it’s down from heights earlier this year of almost 90 percent.

But last May, after Dotdash Meredith announced a licensing deal with OpenAI, that number dipped significantly. It then dipped again at the end of May when Vox announced its own arrangement—and again once more this August when WIRED’s parent company, Condé Nast, struck a deal. The trend toward increased blocking appears to be over, at least for now.

These dips make obvious sense. When companies enter into partnerships and give permission for their data to be used, they’re no longer incentivized to barricade it, so it follows that they would update their robots.txt files to permit crawling; make enough deals, and the overall percentage of sites blocking crawlers will almost certainly go down. Some outlets, like The Atlantic, unblocked OpenAI’s crawlers on the very same day they announced a deal. Others took a few days to a few weeks: Vox announced its partnership at the end of May but didn’t unblock GPTBot on its properties until toward the end of June.

Robots.txt is not legally binding, but it has long functioned as the standard that governs web crawler behavior. For most of the internet’s existence, people running webpages expected each other to abide by the file. When a WIRED investigation earlier this summer found that the AI startup Perplexity was likely choosing to ignore robots.txt commands, Amazon’s cloud division launched an investigation into whether Perplexity had violated its rules. It’s not a good look to ignore robots.txt, which likely explains why so many prominent AI companies—including OpenAI—explicitly state that they use it to determine what to crawl. Originality AI CEO Jon Gillham believes that this adds extra urgency to OpenAI’s push to make agreements. “It’s clear that OpenAI views being blocked as a threat to their future ambitions,” says Gillham.
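A crawler that honors the protocol checks a site's robots.txt before fetching any page. The check can be sketched with Python's standard-library parser, using a hypothetical robots.txt that singles out GPTBot (the file contents and URLs here are illustrative, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow OpenAI's GPTBot site-wide,
# leave every other crawler unrestricted.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant GPTBot would see it is blocked from every path...
print(parser.can_fetch("GPTBot", "https://example.com/article"))   # False
# ...while other user agents remain free to crawl.
print(parser.can_fetch("OtherBot", "https://example.com/article"))  # True
```

Nothing technically stops a crawler from fetching the page anyway, which is exactly the behavior WIRED's Perplexity investigation examined; compliance is a matter of policy, not enforcement.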

So far, OpenAI has struck deals with 12 publishers, and while most have updated their robots.txt files, there are a few exceptions. Time magazine, for example, continues to block GPTBot. (Time did not respond to WIRED’s request for comment on why it still had GPTBot blocked.) Once a deal is in place, though, continued blocking doesn’t matter, according to OpenAI spokesperson Kayla Wood, because OpenAI no longer accesses the partner’s data the way it crawls what it calls “publicly available” data. “We leverage direct feeds,” she says.

Meanwhile, there are a few notable media outlets that have unblocked OpenAI’s web crawler despite not making any sort of partnership announcement, as data journalist Ben Welsh pointed out to WIRED. (He tracks how news outlets block top AI bots using slightly different metrics, and he first noticed the slight decline in block rates a few weeks ago.) Alex Jones’ conspiracy-theory hub Infowars and the newly reinvigorated comedy mainstay The Onion both caught his attention.

Does this mean these sites have unannounced deals with OpenAI, or are attempting to negotiate with the company? “Fuck no,” says Onion CEO Ben Collins, who says the unblocking was likely connected to the outlet migrating its website to a new hosting service and content management system last month. “Obviously we are not doing any business with the Plagiarism Machine.”

Infowars did not respond to requests for comment. But OpenAI, for its part, has confirmed that it does not have any partnership with Infowars.

While the first rush to block OpenAI’s bots appears to have ended, it’s unclear whether this lull will last. Gillham suspects that there may be additional spikes in blocking in the future, if publishers begin to see it as a bargaining tactic. “Is step one in a negotiation with OpenAI to block them? Does that bring them to the table?” he says. Whatever happens, this is a revealing moment: While publishers had initially responded to the rise of AI scraping bots with the shared impulse to block them, OpenAI’s active pursuit of partnerships has cooled that industry-wide drive.