Wikipedia Is Making a Dataset for Training AI Because It’s Overwhelmed by Bots

It seems that AI developers have essentially blackmailed Wikipedia into offering up its data for training. On Wednesday, the Wikimedia Foundation announced it is partnering with Google-owned Kaggle—a popular data science community platform—to release a version of Wikipedia optimized for training AI models. Starting with English and French, the foundation will offer stripped-down versions of raw Wikipedia text, excluding any references or markdown code.
A non-profit, volunteer-led platform, Wikipedia is funded through donations and does not own the content it hosts, so anyone is free to use and remix material from the site. It is fine with other organizations using its vast corpus of knowledge for all sorts of use cases—Kiwix, for example, is an offline version of Wikipedia that has been used to smuggle information into North Korea.
But a flood of bots constantly trawling its website for AI training data has led to a surge in non-human traffic to Wikipedia, something the foundation has been keen to address as costs soared. Earlier this month, it said bandwidth consumption had increased 50% since January 2024. That is not great for an organization that does not directly monetize its website and instead relies on regular donation drives. Offering a standard, JSON-formatted version of Wikipedia articles should dissuade AI developers from bombarding its website.
“As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data,” Kaggle partnerships lead Brenda Flynn told The Verge. “Kaggle is excited to play a role in keeping this data accessible, available, and useful.”
It is no secret that tech companies fundamentally do not respect content creators and place little value on any individual’s creative work. There is a rising school of thought in the AI industry that all content should be free and that taking it from anywhere on the web to train an AI model constitutes fair use because the AI models ingest the text and transform it into something entirely new.
But someone has to create the content in the first place, which is not cheap, and AI startups have been all too willing to ignore previously accepted norms around respecting a site's wishes not to be crawled, such as honoring its robots.txt file. Language models that produce human-like text need to be trained on vast amounts of material, and training data has become something akin to oil in the AI boom. It is well known that the leading models are trained on copyrighted works, and several AI companies remain in litigation over the issue.
Some contributors to Wikipedia may dislike their content being made available for AI training. All writing on the website is licensed under the Creative Commons Attribution-ShareAlike license, which allows anyone to freely share, adapt, and build upon a work, even commercially, as long as they credit the original creator and license their derivative works under the same terms. It is unclear how Wikimedia would ensure AI companies respect these requirements, but Gizmodo has reached out for comment.