Major publishers are blocking the Internet Archive's access to their sites in a bid to thwart AI scrapers, which could otherwise use the archive as a workaround.
Several prominent publications, including The New York Times and Financial Times, have taken steps to selectively restrict how their content is cataloged by the Internet Archive. The move comes amid growing concerns that AI companies' bots are using the Internet Archive's vast collections of online content to train large language models without permission.
Publishers claim that this unauthorized access could lead to the theft of valuable intellectual property, including copyrighted materials and proprietary research data. According to Robert Hahn, head of business affairs and licensing for The Guardian, "a lot of these AI businesses are looking for readily available, structured databases of content," and the Internet Archive's API would have been an obvious target.
The Wall Street Journal, New York Post, The Atlantic, and other media outlets have also sued companies like Perplexity and Google for allegedly using the outlets' content libraries to train AI models without permission. These lawsuits aim to protect intellectual property rights in the face of growing threats from artificial intelligence.
However, some experts argue that financial deals with publishers might provide compensation without truly protecting writers' rights. Meanwhile, creators in other fields, such as fiction writers, visual artists, and musicians, are also fighting against AI tools over copyright and piracy concerns.
As AI continues to reshape the media landscape, the cat-and-mouse game between content providers and scrapers raises important questions about access, ownership, and the value of intellectual property in a rapidly changing digital world.