When a generative artificial intelligence (AI) system outputs something strikingly similar to the data it was trained on, is it copyright infringement or a bug in the system? This is the question at the heart of The New York Times’ recent lawsuit against ChatGPT maker OpenAI.
The New York Times alleges that OpenAI used more content from the NYT website to train its AI models than from nearly any other proprietary source, with only Wikipedia and data sets containing United States patent documents ranking higher.
OpenAI says training on copyrighted data is “fair use” and The New York Times’ lawsuit is “without merit.”
We build AI to empower people, including journalists.
Our position on the @nytimes lawsuit:
• Training is fair use, but we provide an opt-out
• "Regurgitation" is a rare bug we're driving to zero
• The New York Times is not telling the full story
https://t.co/S6fSaDsfKb
OpenAI (@OpenAI), January 8, 2024
The suit could be settled out of court, or it could end with damages, dismissal or any number of other outcomes. But beyond financial relief or injunctions (which could be temporary, pending appeal, or triggered only if an appeal fails), the ramifications could reach U.S. society at large and potentially the rest of the world.
Were the courts to side with OpenAI and find that training AI systems on copyrighted material is fair use, the ruling could have a substantial impact on the U.S. legal system.
As King’s College London senior lecturer Mike Cook recently wrote in The Conversation:
“If you’ve used AI to answer emails or summarize work for you, you might see ChatGPT as an end justifying the means. However, it perhaps should worry us if the only way to achieve that is by exempting specific corporate entities from laws that apply to everyone else.”
The New York Times argues that such an exemption would represent a clear threat to its business model.
OpenAI has admitted that ChatGPT has a “bug” wherein it occasionally outputs passages of text bearing striking similarities to existing copyrighted works. According to The New York Times, this could let users bypass its paywall, deprive the company of advertising revenue, and undermine its ability to perform its primary functions.
Were OpenAI allowed to continue training on copyrighted material without restriction, the long-term impacts for The New York Times and any other journalism outlets whose work could be used to train AI systems could be catastrophic, according to the lawsuit.
The same could arguably be said for other fields where copyrighted material drives profits, including film, television, music, literature and other forms of print media.
On the other hand, in documents submitted to the United Kingdom’s House of Lords communications and digital committee, OpenAI said, “It would be impossible to train today’s leading AI models without using copyrighted materials.”
The AI firm added:
“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment but would not provide AI systems that meet the needs of today’s citizens.”
The black box
Complicating matters further is the fact that compromise could be hard to come by. OpenAI has taken steps to stop ChatGPT and other products from outputting copyrighted material, but there are no technological guarantees that it won’t continue to do so.
AI models such as ChatGPT are referred to as “black box” systems. This is because the developers who create them have no way of knowing exactly why the system generates its outputs.
Because of this opacity and the way large language models such as ChatGPT are trained, there’s no way to remove data belonging to The New York Times or any other copyright holder once a model has been trained.
Based on current technology and methods, there’s a significant chance that OpenAI would have to delete ChatGPT and start over from scratch if it were banned entirely from using copyrighted material. Ultimately, retraining from scratch may prove too expensive and inefficient to be worthwhile.
OpenAI hopes to deal with this by offering partnerships to news and media organizations alongside a promise to continue work to eliminate the regurgitation “bug.”
The worst-case scenario
The worst-case scenario for the field of artificial intelligence would be losing the ability to monetize models trained on copyrighted materials. While this wouldn’t necessarily affect, for example, endeavors related to self-driving cars or AI systems used to conduct supercomputer simulations, it could make generative products such as ChatGPT illegal to bring to market.
And, when it comes to copyright holders, the worst case would be a court declaration that copyrighted material can be freely used to train AI systems.
This, theoretically, could give AI companies free rein to redistribute slightly modified copyrighted materials while holding end users legally responsible for any instances where the modifications don’t meet the legal requirement for avoiding copyright infringement.