Generative AI: Regurgitators of Training Data or Creative Remixers?

The mimicking abilities of Generative AI have caused controversy among creators, giving rise to concerns about the validity of the fair use copyright defense.

American Pie remake worsens GenAI copyright dispute

🎵 Don McLean has always had to share “American Pie.” Since its release in 1971, the hit song has re-emerged in covers by Madonna, parodies by Weird Al Yankovic, serenades by 🇰🇷 South Korean presidents, subplots in Marvel movies, and even CIA torture techniques. But these days, McLean’s leading imitators aren’t even human. 🤖

You can interrogate the culprits for yourself. Just load OpenAI’s ChatGPT and prompt the text generator to “write the lyrics to a song about the day the music died.” Invariably, the tool will spit out lyrics or themes from “American Pie,” and sometimes the same chorus.

This regurgitation emerges despite the prompt making no mention of “American Pie” or the story that inspired it: the 1959 plane crash that killed rock and roll pioneers Buddy Holly, Ritchie Valens, and The Big Bopper.

The Remix Algorithm

It’s further evidence that ChatGPT can’t create anything truly original. Instead, the system is closer to a remix algorithm. The real creativity is in its training data, which is scraped from the web without consent. 😮

Dr Max Little, an AI expert at the University of Birmingham, describes the tool as an “infringement machine.” He scoffs at any suggestion that large language models (LLMs) are independently creative.

“This is not the case because they cannot produce anything at all without being trained on astronomical amounts of text,” Little tells TNW.

It’s an approach that’s ubiquitous in generative AI. Rigorous studies have shown that LLMs can regurgitate large chunks of their original training text, including verbatim paragraphs from books and poems. Just last week, a report found that 60% of OpenAI’s GPT-3.5 outputs contained plagiarism.
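
For readers who want a feel for how verbatim copying might be measured, here is a minimal sketch in Python. It is not the method used in the studies or the report mentioned above; it assumes a simple approach of counting how many five-word sequences in an output also appear word for word in a source text. The function names and the five-word window are illustrative choices.

```python
def word_ngrams(text, n):
    """Return the set of n-word sequences in the text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output, source, n=5):
    """Fraction of the output's word n-grams that also occur in the source.

    Illustrative assumption: 1.0 means every n-word run in the output
    appears verbatim in the source; 0.0 means none do.
    """
    output_grams = word_ngrams(output, n)
    if not output_grams:
        # The output is shorter than n words, so there is nothing to compare.
        return 0.0
    shared = output_grams & word_ngrams(source, n)
    return len(shared) / len(output_grams)


source = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "the quick brown fox jumps over the lazy dog"
fresh = "a slow green turtle walks under a busy bridge"

print(verbatim_overlap(copied, source))
print(verbatim_overlap(fresh, source))
```

A score near 1.0 would flag an output made almost entirely of runs lifted from the source, while a score near 0.0 suggests little word-for-word reuse. A check this crude misses paraphrase and shared themes, which is part of why regurgitation is hard to police.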

Nor does the issue solely apply to text generators. From Stable Diffusion’s images to Google Lyria’s music and GitHub Copilot’s code, GenAI tools across modalities can produce outputs of gobsmacking quality — and eerie familiarity.

The Threat to Creative Industries

Their mimicry poses an existential threat to creative industries. It also poses a threat to the GenAI industry.

Artists say that GenAI’s relentless march is trampling over their copyright conventions. Unsurprisingly, tech companies disagree. Their defenses typically invoke the “fair use” doctrine.

Details vary by jurisdiction, but a central tenet of “fair use” is that the outputs have a “transformative” purpose and character. Rather than merely copying or reproducing their training data, they add something new and significant. At least, that’s what the GenAI leaders are contending in court.

Stability AI, the UK-based startup behind the image-generator Stable Diffusion, made that argument last year to the US Copyright Office. OpenAI also cited the doctrine in a recent motion to dismiss two class-action lawsuits.

Several authors, including comedian Sarah Silverman and Canadian novelist Mona Awad, had sued the company for allegedly training LLMs on illegally acquired datasets.

Because their work was baked into ChatGPT, they said the tool itself was a “derivative work” covered by copyright.

OpenAI rebuffed the claim. According to the startup’s legal team, “the use of copyrighted materials by innovators in transformative ways does not violate copyright.” A judge also dismissed the allegation that every ChatGPT output is derivative.

But when the outputs are identical to their training data, the legal waters start to muddy. Reproduction is a dubious basis for transformation. It’s also a common phenomenon.

The Tipping Point

As well as American Pies, GenAI tools have regurgitated film scenes, cartoon characters, video games, product designs, and code.

They’ve also copied newspapers — which may lead to a tipping point.

In December, the New York Times sued OpenAI and its business partner Microsoft. The news outlet alleges that the unauthorized use of its articles in training data breaches intellectual property (IP) rights. Legal experts describe the suit as “the best case yet alleging that generative AI is copyright infringement.”

Lawyers for the NYT highlighted the “substantial similarity” between the outlet’s content and ChatGPT outputs. To substantiate the claim, they provided 100 examples of the bot reproducing the newspaper’s reporting.

Their suit also challenges another key aspect of “fair use”: the impact on the market for the original work.

According to the plaintiffs, OpenAI not only replicates NYT content but also exploits the content to compete in the same markets. At the same time, the company diverts traffic away from the newspaper’s website.

As evidence, they point to Browse with Bing, a premium feature powered by the same tech behind ChatGPT. The tool can summarize product recommendations made by NYT reviewers. By offering users this information, the lawyers said, OpenAI removes their incentive to visit the original article. This also means they don’t click on the product links that generate revenues for the publisher.

“There is nothing ‘transformative’ about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it,” the complaint declared.

Naturally, the GenAI giants disagree.

OpenAI responded to the lawsuit in a reproachful blog post. The company suspects that the NYT either “instructed the model to regurgitate” or “cherry-picked their examples from many attempts.”

Industry insiders have concurred. Daniel Jeffries, Stability AI’s chief intelligence officer, described the prompts in the lawsuit as “obviously manipulated.” He said the copies were “almost certainly” produced via a technique called retrieval augmented generation (RAG), which optimizes LLM outputs by accessing external sources of information.
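
To illustrate the general idea behind RAG, here is a toy sketch in Python. It is an assumption-laden simplification, not OpenAI’s system or the setup used in the lawsuit’s examples: the retriever ranks documents by shared words, and the hypothetical `build_prompt` function pastes the best match into the prompt that a model would receive. No model is actually called.

```python
def score(query, document):
    """Count the distinct query words that also appear in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return len(query_words & document_words)


def retrieve(query, documents, k=1):
    """Return the k documents that share the most words with the query."""
    ranked = sorted(documents, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Place the retrieved text ahead of the question, as a RAG pipeline would."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical external sources standing in for web pages or articles.
documents = [
    "The council approved the new bridge budget on Monday.",
    "Local bakery wins award for sourdough bread.",
]

print(build_prompt("What did the council decide about the bridge budget?", documents))
```

Because the retrieved passage sits inside the prompt, a model could copy from it directly rather than recalling it from training. That is the distinction Jeffries draws, although the sketch says nothing about what actually happened in the examples the NYT submitted.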

AI’s Contamination Issue

Whatever the method, OpenAI said regurgitation is a “rare bug” that the company was “working to drive to zero.” But critics question the powers of preventive mechanisms.

Little points to ChatGPT’s reproduction of “American Pie.”

“Sometimes direct verbatim copyright infringement… is detected by the algorithm and a warning is presented,” he says.

“Nonetheless, the algorithm can still easily be made to produce output which is clearly plagiarized from the training data, as in this case, the theme of the lyrics is always the Holly/Valens/Bopper 1959 crash event.”

Rare as it may be in ChatGPT, regurgitation is widespread in GenAI tools. When they demonstrably duplicate their training data and then compete in the same market, the foundations of fair use appear shaky.

Ben Maling, managing associate at intellectual property law firm EIP, is keeping a close eye on the instability. Outputs that are verbatim copies or derivatives of their training data threaten “another potential copyright infringement,” he warns. Either the system or the end-user could be liable for damages.

“Many of the big AI providers are so worried about the potential of this to scare away customers that they are offering [assurances] promising to defend them against infringement actions,” Maling told TNW via email.

The Dilemma of GenAI

That’s not the only evidence of worry at OpenAI. Last month, the GenAI flagbearer told the British Parliament that it’s “impossible” to create AI tools like ChatGPT without copyrighted material. Searching for legal protection, the company requested a special exemption for the practice.

The request elevated the fears around regurgitated training data.

If politicians exempted OpenAI, the startup “would be free to copy and remix any and all original text from anywhere and at any time,” Little says. As a result, they risk “destroying the creative industries which depend upon copyright to even exist.”

A Treatment for Regurgitation

GenAI’s regurgitation isn’t necessarily terminal. Analysts have prescribed numerous treatments for the awkward affliction.

One was created by Ed Newton-Rex, the former vice president of audio at Stability AI. During his stint at the startup, Newton-Rex developed Stable Audio, a music generator trained on licensed content. The 36-year-old wants other companies to follow his lead.

“You may slow down the AI industry a bit because they’d have to go and spend more time, more money, and more effort on licensing,” Newton-Rex tells TNW. “But in the process, frankly, you would save the creative industries. I think there’s an existential threat here.”

Artists who face this threat have applied a more extreme antidote: poison.

The most popular delivery method is a tool called Nightshade. This software “poisons” training data by applying invisible changes to images. When companies scrape the creations without consent, the altered images can disrupt the AI model’s outputs.

The method has proven popular. Within five days of going live, Nightshade surpassed 250,000 downloads.

Nonetheless, Little expects AI to continue regurgitating American Pies. He doubts that tools trained on scraped creative content can ever escape the plagiarism problem. “Because by design,” he says, “they are just algorithms which remix their training data.”

Q: Can GenAI tools create anything truly original?
A: According to the critics quoted above, no. They describe GenAI tools such as OpenAI’s ChatGPT as remix algorithms that rely heavily on training data scraped from the web without consent, and studies have shown that LLM outputs can reproduce large chunks of that training text.

Q: What is the “fair use” doctrine and how does it relate to GenAI’s use of copyrighted material?
A: The “fair use” doctrine is a legal concept that allows the use of copyrighted material under certain circumstances, such as for transformative purposes. GenAI companies argue that their use of copyrighted material is transformative and adds something new and significant.

Q: Is there evidence of GenAI tools reproducing copyrighted content?
A: Yes. GenAI tools have reproduced copyrighted content without authorization. For example, when the New York Times sued OpenAI and Microsoft over the unauthorized use of its articles in training data, its lawyers provided 100 examples of ChatGPT reproducing the newspaper’s reporting.

Q: How does GenAI’s reproduction of copyrighted content impact the original creators?
A: GenAI’s reproduction of copyrighted content can harm original creators by diverting traffic away from their work and competing in the same market. This can potentially result in financial losses for creators and a decreased incentive for audiences to visit the original content.

Q: How are GenAI companies responding to allegations of copyright infringement?
A: GenAI companies, like OpenAI, have defended their use of copyrighted material by arguing that it falls under fair use and that they are creating transformative products. However, there is ongoing debate and legal action surrounding these claims.

Q: What remedies or treatments are being proposed to address the issue of GenAI reproducing copyrighted material?
A: One proposed treatment is licensing the content used for training GenAI models. This would require AI companies to spend more time and money on licensing, but it could help protect the creative industries. Another approach is “poisoning” training data, which disrupts the AI models’ outputs when unauthorized use occurs.
