Law: Doe vs. GitHub Is a Non-Crisis
Despite worrisome headlines in the media, Doe v. GitHub, Inc. would protect licensed software code without blocking AI systems from using internet data for “learning.”
A headline at The Verge reads: “The lawsuit that could rewrite the rules of AI copyright.” Wired similarly declares: “This Copyright Lawsuit Could Shape the Future of Generative AI.” The subtitle warns: “Algorithms that create art, text, and code are spreading fast — but legal challenges could throw a wrench in the works.”
Indeed, two putative class action lawsuits were filed in the federal district court for the Northern District of California in November 2022 against GitHub, GitHub’s owner Microsoft, OpenAI and others. The lawsuits allege that two interrelated artificial intelligence (AI) software systems are continuously violating the 1998 Digital Millennium Copyright Act (DMCA) as well as breaching contracts, engaging in unfair competition, and violating California state privacy laws. Attorney and programmer Matthew Butterick is leading the cases, both of which are captioned Doe v. GitHub, Inc.
The alleged ongoing DMCA and open-source code license violations arise from GitHub’s product called Copilot and the OpenAI product called Codex. With Codex integrated to translate natural language into computer code, Copilot operates like a mammoth “autocomplete” in a word processor. The programmer describes the basics of a certain function in natural language, and Copilot supplies an AI-suggested block of code for the programmer to use immediately or modify to achieve the intended purpose.
How does Copilot decide what computer code to suggest? Both Codex and Copilot scan millions to billions of lines of computer code accessible on the Internet in “open source” communities. A myriad of software programs, large and small, are publicly available for programmers to view, copy, and modify. Widely used examples of open source programs include GNU/Linux, Mozilla Firefox, and VLC media player. Copilot and Codex ingested an enormous amount of existing online software code, called “training data,” and with human input and internal algorithms, they collected statistical patterns about how software instructions carry out certain processing functions. GitHub’s website proclaims: “Trained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of [computer] languages.”
Computer software commonly needs a function called a “random number generator.” A programmer using Copilot would begin keying in the concept of a random number function, and, voilà, Copilot would supply a random number generator subroutine that it statistically found was used often in other working programs. The subroutine delivered might not be identical to any one of the many that Copilot sifted through, but it would very strongly resemble some. Sounds like a really handy programming aid – so what’s the problem?
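To make the example concrete, here is a hypothetical Python sketch of the kind of short random number subroutine a programmer might be offered. It is not actual Copilot output. The function name `make_lcg` is invented for illustration, and the constants are the widely published linear congruential generator parameters from Numerical Recipes. The point is only that many working programs contain near-identical versions of small routines like this one.

```python
def make_lcg(seed: int):
    """Return a function that yields pseudo-random floats in [0, 1).

    Hypothetical illustration only: a linear congruential generator
    using the widely published Numerical Recipes constants.
    """
    modulus = 2 ** 32
    multiplier = 1664525
    increment = 1013904223
    state = seed % modulus

    def next_random() -> float:
        nonlocal state
        # Advance the internal state, then scale it into [0, 1).
        state = (multiplier * state + increment) % modulus
        return state / modulus

    return next_random


rng = make_lcg(seed=42)
samples = [rng() for _ in range(3)]
```

Because the routine is so short and so standard, two independently written versions can look almost the same, which is part of what makes questions of copying hard to sort out.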
The Authors’ Club Analogy
Suppose you and friends establish an authors’ group where all members share written works (books, articles, poems, clever expressions) with each other, free of charge. Suppose your group imposes only three conditions – when you use another member author’s work, you must include in the text or in footnotes: (1) the name of the original author (“Attribution”); (2) a Copyright Notice identifying the original author’s reserved rights; and (3) an Internet link to the full text of the License that governs your use of the other author’s material.
One day a non-member gets free access to the group members’ written materials, stores them in a fee-for-use library, and makes them available to other non-member writers without fulfilling any of the group’s three conditions. Has the non-member violated any copyright laws or other enforceable restrictions?
The first thing to realize: The group member authors’ written works are all protected by copyright law upon their creation. The non-member who makes copies of another author’s written works infringes on that author’s copyrights. (To sue a copyright infringer, the author would have to have obtained a copyright registration, but the author is still the legal owner beginning at creation, not registration.)
Under U.S. law, copying a creator’s written work product constitutes a violation of the creator’s copyrights (unless the creator gave permission or certain other exceptions apply). In our author group scenario, the non-member who obtains the group members’ written works and makes copies is an infringer who has violated the creators’ copyrights. The infringer’s liability for the violation and damages is increased because the infringer copied the material knowingly, distributed it widely to others, and did so for profit.
By direct analogy, GitHub’s Copilot system obtains computer code created by authors, makes copies of it, stores it for distribution to other people, and charges a fee for the service. On its face, Copilot is violating the copyright ownership rights of the computer code’s creators.
But Wait: The Computer Code is “Open Source”
Reportedly, Copilot searches through and catalogs computer code drawn chiefly from “open source” programs and libraries. An “open source” computer program, such as the huge GNU/Linux operating system, is a programming project where other programmers volunteer to add functions or fix bugs. Many specific computer functions are stored in open source libraries online. Open-source software typically is offered free to users. As attorney-programmer Butterick explains, however: “The vast majority of open-source software packages are released under licenses that grant users certain rights and impose certain obligations (e.g., preserving accurate attribution of the source code).”
Such open source materials are not usually in the “public domain,” which would allow indiscriminate copying and use. Programmers may use such materials if they abide by the open source license requirements, typically: Attribution, Copyright Notice, and License Link. Copilot reportedly provides the copied software code, sometimes verbatim, without abiding by any of those license requirements. In such a case, Copilot would be violating the terms of the open source licenses that apply.
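As a rough illustration, the three typical conditions can be pictured as a notice that travels with any reused code. The Python sketch below is hypothetical: the `LicenseNotice` class, the `missing_conditions` helper, and the example author and URL are invented for this article and do not reproduce the text or requirements of any particular open source license.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseNotice:
    """Hypothetical record of the three typical license conditions."""
    author: str        # Attribution
    year: int          # part of the Copyright Notice
    license_url: str   # License Link

    def as_comment(self) -> str:
        """Render the notice as comment lines to place above reused code."""
        return "\n".join([
            f"# Original author: {self.author}",
            f"# Copyright (c) {self.year} {self.author}",
            f"# License: {self.license_url}",
        ])


def missing_conditions(source_text: str, notice: LicenseNotice) -> list[str]:
    """List which of the three conditions are absent from source_text."""
    checks = {
        "Attribution": f"Original author: {notice.author}",
        "Copyright Notice": f"Copyright (c) {notice.year} {notice.author}",
        "License Link": notice.license_url,
    }
    return [name for name, needle in checks.items() if needle not in source_text]


# Invented example values, for illustration only.
notice = LicenseNotice("Jane Coder", 2021, "https://example.org/license.txt")
reused = notice.as_comment() + "\ndef add(a, b):\n    return a + b\n"
```

Here `missing_conditions(reused, notice)` returns an empty list, while the same function body delivered with no notice at all would be flagged on all three conditions.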
But Wait: Isn’t Copying Open Source Code Merely “Fair Use?”
Copyright law does allow people to copy limited portions of authors’ works under the doctrine of “fair use.” The U.S. Copyright Office explains:
Fair use is a legal doctrine that promotes freedom of expression by permitting the unlicensed use of copyright-protected works in certain circumstances. Section 107 of the Copyright Act provides the statutory framework for determining whether something is a fair use and identifies certain types of uses—such as criticism, comment, news reporting, teaching, scholarship, and research—as examples of activities that may qualify as fair use.
Deciding whether copying a segment of someone else’s software code amounts only to “fair use” involves examining a number of factors, including the quantity of code and whether the copy will be used in a revenue-producing project or otherwise adversely affects the market for the copied code. The Doe v. GitHub, Inc. lawsuit alleges Copilot does not limit its copying and doesn’t care how users will employ the copied code.
American courts could well declare that Copilot’s delivery of code segments is not protected as “fair use.” There is no simple answer, however, as the complicated decision rendered by the Supreme Court in Google LLC v. Oracle Am., Inc. (2021) exemplified, when it held that Google’s copying of 11,500 lines from the millions of lines of Oracle’s Java code did amount to “fair use” under its rather specialized circumstances. The “fair use” analysis is typically case-by-case and not always easily predicted.
What About Programs That Sample Internet Sources to Locate Usages and Trends?
Copilot finds and collects software code segments from Internet-resident sources using “AI learning” methods. Copilot’s “learning” involves scanning through a huge collection of data to detect statistically significant patterns in the data and then abstracting facts and trends from those patterns.
Suppose an AI system does no more than read and analyze data and text on the Internet, and generates reports about how frequently words, phrases and even paragraphs were used and composed, and perhaps describes how they occur in written language contexts. Would that AI system be violating copyright laws? Under existing copyright law, reading and analyzing other authors’ copyright-protected works is not an infringement of rights. And it is “fair use” to copy and/or publish selected portions for scholarly and research purposes, and for analysis and criticism.
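A minimal Python sketch of that kind of frequency analysis follows. It is a toy built on assumptions, not a description of how Copilot or Codex works: the function name `phrase_frequencies` and the sample sentence are invented here. What it produces is a report of counts about the text rather than a redistributed copy of the work.

```python
from collections import Counter
import re


def phrase_frequencies(text: str, phrase_length: int = 2) -> Counter:
    """Count how often each run of `phrase_length` consecutive words occurs."""
    # Lowercase the text and keep only runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    phrases = (
        " ".join(words[i:i + phrase_length])
        for i in range(len(words) - phrase_length + 1)
    )
    return Counter(phrases)


# Invented sample text, for illustration only.
sample = "the cat sat on the mat and the cat slept"
report = phrase_frequencies(sample, phrase_length=2)
most_common_phrase, count = report.most_common(1)[0]
```

In this sample the most frequent two-word phrase is “the cat,” which appears twice.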
Remember the Copyright Office’s Circular 33 advisory (emphasis added):
Copyright law expressly excludes copyright protection for “any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied.” … [C]opyright protection will extend only to the original expression in that work and not to the underlying idea, methods, or systems described or explained.
In the simplest terms relevant here: Copyright law doesn’t protect ideas; it protects how ideas are expressed in written form. You can write a program to carry out a chess-playing algorithm, for example, using ideas you glean from analyzing another program. But you may not just copy the other program’s computer code.
Does Doe v. GitHub, Inc. Endanger AI Systems’ Scanning Internet Data?
It’s probably too early to worry that the Doe v. GitHub, Inc. lawsuit could outlaw AI systems’ large-scale use of available online data for “learning.” The Doe complaint does not even ask the federal court for an injunction or other relief that would stop AI systems from reading and analyzing the open source software collections. The Doe complaint doesn’t ask the court to stop people from being inspired by other people’s software solutions, or by other people’s musical compositions or other creative work products. Although the 56-page lawsuit is complex in some ways, at bottom it is trying to stop infringers from copying open source code without abiding by the applicable licenses that require users to give Attribution, Copyright Notices, and License Links, and to stop infringers from profiting from taking other people’s creative work product without permission.
Despite worrisome headlines in the media, the Doe lawsuit appears to seek legitimate copyright protection for licensed work products. It doesn’t appear to interfere with people’s rights to read, use, and learn from online collections of computer code or other written expressions of ideas and information.