OpenAI’s offer to stop web crawler comes too late

Earlier this month, in response to mounting criticism around how OpenAI scoops up data to train ChatGPT, its groundbreaking chatbot, the company made it possible for websites to block it from scraping their content. A short piece of code would tell OpenAI to go away (and it would kindly obey).

Since then, hundreds of sites have shut the door. A Google search reveals many of them: Major online properties such as Amazon, Airbnb, Glassdoor and Quora have added the code to their “robots.txt” file, a kind of rules of engagement for the many bots — or spiders as they are also known — that scour the internet.
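
For readers curious what that "short piece of code" looks like, here is a minimal sketch. It assumes the crawler identifies itself as GPTBot, the user-agent token OpenAI documents for its crawler. The robots.txt text, the example.com URL and the helper name is_blocked are illustrative and not taken from any of the sites named above. The Python below uses the standard library's robots.txt parser to check whether a given bot is turned away.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: turn away GPTBot (assumed user-agent token for
# OpenAI's crawler) while leaving the site open to every other bot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt asks user_agent not to fetch url.

    is_blocked and the example.com URL are hypothetical names used only
    for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print("GPTBot blocked:", is_blocked(ROBOTS_TXT, "GPTBot"))
    print("Other bot blocked:", is_blocked(ROBOTS_TXT, "SomeOtherBot"))
```

Keep in mind that robots.txt is a request rather than a lock: it only works against crawlers that choose to honor it.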

When I got in touch with the companies, none were willing to discuss their reasoning, but it’s quite obvious: They want to put a stop to OpenAI taking content that doesn’t belong to it in order to train its artificial intelligence. Unfortunately, it’s going to take a lot more than a line of code to stop that from happening.

Other online resources with the kind of data that an AI system would love also have moved to block the crawler: Furniture store Ikea, jobs site Indeed.com, vehicle comparison resource Kelley Blue Book, and BAILII, the UK’s court records system, similar to the US’s PACER (which doesn’t appear to be blocking the bot).

Coding resources website Stack Overflow is blocking the crawler, but its rival GitHub is not, which is perhaps unsurprising given that GitHub’s owner, Microsoft, is a major investor in OpenAI. And, as major media companies begin negotiating with (or possibly suing) the likes of OpenAI over access to their archives, many have also taken the step of blocking the bot. Research reported by Business Insider suggested 70 of the top 1,000 websites globally have added the code. We can expect that number to grow.

Problem solved? Not likely. While it’s very generous of OpenAI to give sites the ability to prevent its robot from siphoning their content, the gesture rings hollow when you consider that OpenAI’s bot has already been out there gathering this data for some time. The AI horse has very much bolted: Adding the code at this stage is like shouting “And don’t come back, ya hear!” at a burglar as they disappear into the night with your belongings.

In fact, the move could serve to strengthen OpenAI’s early lead. By setting this precedent, it can argue newer competitors should do the same, pulling up the ladder and enjoying the benefits of being one of AI’s first movers. “What is certain is that OpenAI isn’t giving the data it collected back,” noted tech worker-turned-commentator Ben Thompson in a recent edition of his email newsletter.

Of course, web crawlers are just one way in which OpenAI and other AI companies collect data to train their systems. Recent legal battles between content owners and AI companies have centered on the fact that OpenAI, Meta, Google and others often use bulk datasets provided by third parties, such as “Books3,” a dataset containing around 200,000 books, compiled by an independent AI researcher. Several authors are suing over its use.

OpenAI declined to comment, including on the question of whether sites that blocked OpenAI’s web crawler could be confident OpenAI wouldn’t use their data if sourced via other means. It certainly won’t alter what has been scooped up already. We can take only small comfort from the fact OpenAI has acknowledged that consent is a factor in future scraping efforts. There are hundreds of other bots out there, unleashed by AI companies less well-known than OpenAI, that won’t provide any kind of option for sites to opt out.

Google, which has built a rival chat tool called Bard, wants to start a discussion on the best mechanism for administering consent on AI. But as the writer Stephen King put it recently, the data is already in the “digital blender” — and there seems to be very little anyone can do about it now.
