- Microsoft-backed OpenAI, Google and Anthropic ban the use of their content to train other AI models.
- However, these companies have been using other online content for their own model training.
- Can Big Tech have it both ways? Reddit and others are trying to stop this.
In the new age of generative AI, big tech companies are following a "do as I say, not as I do" strategy when it comes to the use of online content.
Microsoft-backed OpenAI, Google, and Google-backed Anthropic have for years been using online content created by other companies to train their generative AI models. This was done without asking for specific permission, and the practice is part of a brewing legal battle that will decide the future of the web and how copyright laws are applied in this new world.
The tech industry will likely argue that its approach is fair use. That has yet to be decided. However, these big tech companies won't let their own content be used to train other AI models. So why should they be allowed to do this to everyone else?
Take a look at the terms of service for Claude, Anthropic's AI assistant:
"You may not access or use the Services in the following ways, and if any of these restrictions are inconsistent with or ambiguous in relation to the Acceptable Use Policy, the Acceptable Use Policy controls: To develop any products or services that compete with our Services, including to develop or train any artificial intelligence or machine learning algorithms or models."
Here's an excerpt from the top of Google's generative AI terms of use:
"You may not use the Services to develop machine learning models or related technology."
And here's the relevant section from the terms of use of OpenAI, the company behind ChatGPT:
"You may not... use output from the Services to develop models that compete with OpenAI."
These companies are not dumb, but they are hypocritical
These companies are not dumb. They know that quality content is vital for training new AI models. So it makes sense that they won't allow their output to be used this way.
But why would any other website or company let their content be freely used by these giant tech companies to train their models?
Insider asked OpenAI, Google and Anthropic for comment on Friday. At the time of publication, they had not responded.
Reddit and other companies say enough is enough
Other companies are just beginning to realize what's been happening, and they are not happy. Reddit, whose content has been used for years in AI model training, plans to start charging for access to its data.
"The Reddit corpus of data is really valuable. But we don't need to give all of that value to some of the largest companies in the world for free," said Steve Huffman, CEO of Reddit.
In April, Elon Musk accused Microsoft, the main backer of OpenAI, of illegally using Twitter's data to train AI models. "Lawsuit time," he tweeted.
"There is so much wrong w/ this premise I don't even know where to start," a Microsoft spokesman wrote in an email to Insider when asked for comment.
OpenAI CEO Sam Altman is trying to be more thoughtful on this issue by working on new AI models that respect copyright. "We're trying to work on new models where if an AI system is using your content, or if it's using your style, you get paid for that," he said recently, according to Axios.
Publishers, including Insider, which produced this story, have a vested interest here. Some publishers, including News Corp., are already pushing tech companies to pay to use their content for training AI models.
The current way AI models are trained 'breaks' the web
One former Microsoft executive believes something is wrong here. Steven Sinofsky recently said the current way AI models are trained "breaks" the web.
"Crawling used to be allowed in exchange for clicks. But now the crawling simply trains a model and no value is ever delivered to the creator(s) / copyright holders," he tweeted. Insider asked him for comment, but he was traveling on Friday and couldn't respond.