I’ve been loosely tracking the AI and copyright cases, most notably the Thaler litigation, where Thaler keeps losing the argument that work created solely by an AI should get copyright protection. To summarize: every court to rule on the question has said that only works involving human authorship can get copyright protection. As I said at the time, I think a good policy reason in support of this rule is that if purely AI-generated work could be copyrighted, an AI could produce millions of copyrighted images in almost zero time. That’s got nothing to do with incentivizing human creation. It was easy to miss amid the deluge of atrocious Supreme Court decisions, but last week a pair of district court judges ruled on a different (but not unrelated, in terms of markets) AI copyright question: whether scraping online text for training data is fair use. Both cases are in the Northern District of California, so we can expect the 9th Circuit to issue the first appellate decision on this topic.
By way of background: fair use is an affirmative defense against copyright infringement. That means that if you accuse me of infringement, I can defend myself as having engaged in “fair use,” which basically means “use that the copyright owner doesn’t like, but that we as a society think should be allowed for policy reasons.” It could also mean “use that everybody thinks is ok, but for which licensing would be so inefficient that a licensing market would never emerge.” Fair use is supposed to be decided case by case. It depends on four factors: the “purpose and character” of the (allegedly infringing) use, the nature of the copyrighted work, the amount of it used, and the market effects of the infringing use. The middle two factors tend not to matter much. The first factor is usually decided by asking whether the use in question is “transformative.” For example, consider parody: the Supreme Court ruled back in 1994 that a 2 Live Crew parody of Roy Orbison’s “Pretty Woman” was fair use. The most closely analogous case I know of to the training-data question is an appellate decision about Google thumbnails.
In it, Perfect 10, a porn site, sued Google over its use of thumbnail images in search results. Perfect 10 had firewalled its images, but Google was pulling images from pirate sites, and the thumbnails served as links back to those pirate sites. Citing the parody case, the court reasoned that:
“Although an image may have been created originally to serve an entertainment, aesthetic, or informative function, a search engine transforms the image into a pointer directing a user to a source of information. Just as a ‘parody has an obvious claim to transformative value’ because ‘it can provide social benefit, by shedding light on an earlier work, and, in the process, creating a new one,’ a search engine provides social benefit by incorporating an original work into a new work, namely, an electronic reference tool.” (Perfect 10, p. 25 in the linked pdf).
Thus the question for last week: is turning copyrighted text into LLM-feed transformative in the relevant sense? (I’ve thought for a while that it is, based on this prescient paper by Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark Lemley, and Percy Liang.) Both of last week’s decisions held that it is, though they differ in significant ways.
The first case, Bartz v. Anthropic, ruled on Anthropic’s Claude. Anthropic created a library of all the text it could find, including from pirate sites like LibGen. It then bought paper copies of many of these books and scanned them into digital copies. Claude trained on this corpus, a process that involved making a number of copies of the works. Judge Alsup got into the technical weeds, detailing that the works were copied to get into the dataset for the LLM, cleaned of page headings and the like, tokenized, and then compressed, such that “In essence, each LLM’s mapping of contingent relationships was so complete it mapped or indeed simply ‘memorized’ the works it trained upon almost verbatim. So, if each completed LLM had been asked to recite works it had trained upon, it could have done so” (7). Judge Alsup also emphasized that no such recitation had occurred – this case was only about the training process.
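To make the mechanics a bit more concrete, here is a toy sketch of what “cleaned of page headings and the like, tokenized” can look like in code. It is purely illustrative and assumes nothing about Anthropic’s actual pipeline: the clean_page heuristics and the word-level tokenize function are hypothetical stand-ins (production LLMs use subword tokenizers), but the clean-then-tokenize shape matches the process the opinion describes.

```python
import re

def clean_page(page_text: str) -> str:
    """Strip page numbers and running headers from one scanned page (toy heuristics)."""
    kept = []
    for line in page_text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"\d+", stripped):             # bare page number
            continue
        if re.fullmatch(r"[A-Z .,'-]{4,}", stripped):  # ALL-CAPS running header
            continue
        kept.append(line)
    return "\n".join(kept)

def tokenize(text: str) -> list[str]:
    """Hypothetical stand-in tokenizer; real LLMs use subword tokenizers such as BPE."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

# One fake scanned page: a running header, a line of text, and a page number.
pages = ["CHAPTER ONE\nIt was a dark and stormy night.\n17"]
corpus_tokens = [tok for page in pages for tok in tokenize(clean_page(page))]
print(corpus_tokens)  # ['it', 'was', 'a', 'dark', 'and', 'stormy', 'night', '.']
```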
Alsup summarizes his reasoning as follows:
“The use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. However, Anthropic had no entitlement to use pirated copies for its central library. Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic’s piracy” (9)
He thus ruled in Anthropic’s favor on summary judgment for fair use for training Claude – on any construal of the facts presented, training Claude in this way was fair use as a matter of law – but left open the question of Anthropic’s use of pirated copies, which he clearly thought was going to be found infringing. This may turn out to be a big deal – the AI companies clearly think everything on the Internet is theirs for the taking. As one commentator at the Copyright Alliance argues, “while its training analysis misses the mark, the order makes clear that using pirated copies of works to build a ‘central library’ is not fair use and could result in massive damages for willful infringement. Ultimately, the decision could have a significant impact on generative AI infringement litigation where most AI developer defendants have collected training material from the same piracy-laden dataset.”
Judge Alsup explains the transformativeness analysis:
“The purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative. Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use” (13-14).
The analogy is to the way a person learns to read and write, memorizing the books they study along the way:
“Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems” (12).
Judge Alsup continued the analogy when he got to the other important fair use factor – the effect on the market for the plaintiffs’ works. Since both parties conceded that Claude did not actually regurgitate any of the works at issue, the question was whether the LLM would use the authors’ works to create a bazillion competing works. Judge Alsup:
“Authors’ complaint is no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works. This is not the kind of competitive or creative displacement that concerns the Copyright Act. The Act seeks to advance original works of authorship, not to protect authors against competition” (28).
This sounds wrong when you read it, because of course even the smartest schoolchildren can’t produce competing works at the pace of Claude. This is clearly the weakest part of the decision, and it gets criticized in the next day’s decision by Judge Alsup’s colleague, Judge Vince Chhabria, in Kadrey v. Meta.
The Meta decision agrees on the key point – that using texts for training data was transformative. Judge Chhabria writes that “there is no serious question that Meta’s use of the plaintiffs’ books had a “further purpose” and “different character” than the books—that it was highly transformative. The purpose of Meta’s copying was to train its LLMs, which are innovative tools that can be used to generate diverse text and perform a wide range of functions” (16). Unlike Judge Alsup, he was unperturbed by the fact that Meta got its books from sites like LibGen and even used BitTorrent (the same protocol the students who got sued for downloading songs a decade ago were using).
What did bother Judge Chhabria was the market effects. He dismisses Judge Alsup’s analogy to schoolchildren:
“when it comes to market effects, using books to teach children to write is not remotely like using books to create a product that a single individual could employ to generate countless competing works with a miniscule fraction of the time and creativity it would otherwise take. This inapt analogy is not a basis for blowing off the most important factor in the fair use analysis” (3)
Unfortunately, the plaintiffs in this case made almost no effort at market analysis, and no effort to show that Llama could flood the market for their works. He thus says, somewhat despondently, that “Given the state of the record, the Court has no choice but to grant summary judgment to Meta on the plaintiffs’ claim that the company violated copyright law by training its models with their books” (5). On market effects, the plaintiffs argued that Llama could regurgitate their works (guardrails prevented enough of that from happening for it to be an issue) and that Meta’s use hurt the development of a licensing market (true, but there’s no right to a licensing market; Judge Alsup said this too).
This does not deter him from explaining what the plaintiffs should have said! The third way to argue market effects is that the “harm from this form of competition is the harm of market dilution” (28). Dilution won’t hurt really famous authors whose works people seek out by name, but:
“It’s easy to imagine that AI-generated books could successfully crowd out lesser-known works or works by up-and-coming authors. While AI-generated books probably wouldn’t have much of an effect on the market for the works of Agatha Christie, they could very well prevent the next Agatha Christie from getting noticed or selling enough books to keep writing” (29). Engaging the literature on the subject, he argues that “indirect substitution is still substitution: If someone bought a romance novel written by an LLM instead of a romance novel written by a human author, the LLM-generated novel is substituting for the human-written one. This is different from the (non-cognizable) harm caused by criticism or commentary, which can harm demand for an original work without serving as a replacement for it” (31).
Meta anticipated this line of attack and made some (he thinks mediocre) arguments to the effect that Llama hasn’t and won’t cause market harm. But when uncontested, even mediocre arguments win, so that’s that. Meta also said that a ruling against fair use would shut down the development of LLMs altogether, which Judge Chhabria called “nonsense” (38), noting that either licensing markets would develop or Meta would decide to scan only public domain works.
In sum:
“In cases involving uses like Meta’s, it seems like the plaintiffs will often win, at least where those cases have better-developed records on the market effects of the defendant’s use. No matter how transformative LLM training may be, it’s hard to imagine that it can be fair use to use copyrighted books to develop a tool to make billions or trillions of dollars while enabling the creation of a potentially endless stream of competing works that could significantly harm the market for those books. And some cases might present even stronger arguments against fair use. For instance, as discussed above, it seems that markets for certain types of works (like news articles) might be even more vulnerable to indirect competition from AI outputs. On the other hand, though, tweak some facts and defendants might win. For example, using copyrighted books to train an LLM for nonprofit purposes, like national security or medical research, might be fair use even in the face of some amount of market dilution. Or plaintiffs whose works are unlikely to face meaningful competition from AI-generated ones may be unable to defeat a fair use defense.” (39, internal citation omitted)
Next stop, 9th Circuit. For a critique of the transformative fair use analysis in the Anthropic case, see this column by Prof. Terry Hart; the argument would apply to the Meta case too. See also the piece by Kevin Madigan of the Copyright Alliance, linked above.