Generative AI: How it works, content ownership, and copyrights
Generative artificial intelligence (AI) has captured everyone’s attention as the next wave of technological advancement, promising to move us into a golden era of productivity and creativity. In a matter of minutes, with a well-thought-out prompt, anyone can create stunning images, videos, songs and text. Companies are outputting more code, generating marketing materials and automating certain business functions, such as customer service and personal assistants, with high fidelity. The future looks exciting, as science fiction turns into reality, with the ubiquity of autonomous vehicles and humanoid robots, and rapid drug discovery seemingly around the corner. With any new advancement, the law is usually the last to catch up: regulators struggle to understand the technology and its implications, existing legal frameworks are tested, usually broken, and questions remain unanswered. Regulators faced with rapidly changing technology, and tasked with promoting innovation and creativity while respecting intellectual property (IP) rights, have an uphill climb ahead.
While AI systems have existed for decades, recent systems such as ChatGPT, Stable Diffusion, Claude, Gemini and Grok have brought AI applications and their implications into public view. Legal decisions and laws passed today will lay the groundwork as we approach artificial general intelligence (AGI). Regulatory capture, deepfakes, algorithmic bias and black box decision making are just a few of the problems posed by AI today that are bubbling to the surface. Foundationally, questions remain as to the ownership of, and liability for, content created using AI.
This article explores the current state of the law in the US related to generative AI to address questions that are top of mind, such as (1) who owns content created by generative AI and (2) whether generative AI systems infringe copyrights.
1. How does Generative AI work?
Prior to addressing the questions and challenges presented by generative AI applications, we need to have a basic understanding of how these systems work. We’ll focus our conversation on the most widely used generative AI today, autoregressive language models such as ChatGPT, and large language models (LLMs) generally, as opposed to, for example, diffusion models. Broadly speaking, LLMs (1) utilize a transformer architecture1 to add semantic meaning to various types of data, capturing context and relationships within sequences (whether they are words in text, patches in images, frames in videos or samples in audio), and (2) are built on neural networks that model these relationships to generate outputs appropriate to the medium, such as text, images, video or songs.234 Let’s briefly break this down, starting with a neural network.
A. Learning patterns from data
Neural networks learn patterns from unstructured data.5 To understand the basics of how a neural network learns patterns, we need to introduce machine learning concepts through a simple example. Consider a system designed to determine whether a fruit is an apple or an orange using the fruit’s weight and color. Training data shown on a graph would contain a plot of several labeled data points representing different apples and oranges, each with its color (y-axis) and weight (x-axis). A simple model may be a line dividing oranges and apples. If a new fruit falls on one side of the line, the model predicts it is an apple and, on the other side of the line, an orange. Here the model has two parameters: the orientation of the line (slope), referred to as a weight, and the position (e.g. y-intercept), referred to as the bias. Each parameter of the model is adjusted to create a line that best fits the data. When predicting whether a fruit is an apple, this model may do well if the new fruit is a Red Delicious apple (very red, lighter weight, IMHO not delicious) or poorly if the fruit is a Honeycrisp apple (can be lighter in color, heavier weight, IMHO delicious). The model outputs a probability of whether the fruit is an apple or an orange. The parameters (in this case, the position and orientation of the line) can be adjusted to improve how well the model predicts whether a fruit is an apple or an orange.
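To make the line model above concrete, here is a minimal Python sketch. The feature scales (weight in grams, redness from 0 to 1), the parameter values, the SHARPNESS constant and the function name are all assumptions chosen for illustration; they do not come from any real dataset or trained model.

```python
import math

# Hypothetical parameters of the dividing line: redness = SLOPE * weight + INTERCEPT.
SLOPE = 0.001      # the "weight" parameter (orientation of the line)
INTERCEPT = 0.3    # the "bias" parameter (position of the line)
SHARPNESS = 10.0   # assumed scale that turns distance from the line into a probability


def probability_apple(weight_grams: float, redness: float) -> float:
    """Return the model's probability that a fruit is an apple.

    Fruits above the line (redder than the line predicts for their weight)
    score as apples; fruits below it score as oranges.
    """
    distance = redness - (SLOPE * weight_grams + INTERCEPT)
    return 1.0 / (1.0 + math.exp(-SHARPNESS * distance))


# Illustrative fruits: (weight in grams, redness from 0 to 1).
red_delicious = probability_apple(150, 0.9)
honeycrisp = probability_apple(250, 0.5)
orange = probability_apple(200, 0.3)

print(f"Red Delicious: {red_delicious:.2f}")
print(f"Honeycrisp:    {honeycrisp:.2f}")
print(f"Orange:        {orange:.2f}")
```

With these assumed values, the Red Delicious lands well on the apple side of the line, while the paler, heavier Honeycrisp falls just on the orange side. That is the kind of miss that adjusting the parameters is meant to correct.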
Now what if we wanted the model to consider a fruit’s shape, or to also predict whether a fruit is a banana? Obviously, a simple line is not suitable to model these relationships and make predictions. Instead of a single line, a complex boundary is needed to separate fruits accurately (3 types of fruits in this case); perhaps the shape of the boundary has several bends or curves trying to fit the training data. Before losing you to remembering the quadratic formula or the meaning of cubic, enter neural networks.
Neural networks adjust complex, multi-dimensional “lines” (or neurons) based on training data to accommodate more complex and non-linear relationships.6 In simple terms, multiple lines are added together to create curve-like boundaries to fit the training data. More weights and biases (parameters) are learned from the training data to achieve a shape that models it. Instead of the two parameters of the line model discussed earlier, there are many more. LLMs today train on terabytes of data and include billions of parameters. For example, Meta's LLaMa 2, an open source LLM, includes 70 billion parameters and is trained on 10 terabytes of text.7 These billions of parameters are adjusted iteratively to make better predictions. At a high level, this is done by comparing an output with the expected result to calculate an error, and then adjusting the weights and biases (or parameters) to minimize this error, using derivatives in a process called backpropagation. As you might imagine, training these models to improve outputs is computationally expensive.8 For example, it has been estimated that a single training run for GPT-3 cost between $500,000 and $4.6 million, depending on hardware assumptions (e.g. the number or type of GPUs, such as the Nvidia A100).9
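The compare, measure and adjust loop described above can be sketched in a few lines of Python. This is a deliberately tiny stand-in: a single neuron with three parameters rather than a network with billions, trained on four made-up data points. The data, learning rate and function names are assumptions for illustration only.

```python
import math

# Toy labeled training data: ((weight in grams / 100, redness 0..1), label),
# where label 1 means apple and 0 means orange. Values are made up for illustration.
TRAINING_DATA = [
    ((1.5, 0.9), 1),
    ((1.6, 0.8), 1),
    ((2.0, 0.3), 0),
    ((2.2, 0.2), 0),
]


def predict(params, features):
    """One neuron: a weighted sum of the inputs plus a bias, squashed to 0..1."""
    w_weight, w_red, bias = params
    z = w_weight * features[0] + w_red * features[1] + bias
    return 1.0 / (1.0 + math.exp(-z))


def average_error(params):
    """Cross-entropy error between predictions and expected labels."""
    total = 0.0
    for features, label in TRAINING_DATA:
        p = predict(params, features)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(TRAINING_DATA)


def training_step(params, learning_rate=0.1):
    """Nudge each parameter against the derivative of the error."""
    grads = [0.0, 0.0, 0.0]
    for features, label in TRAINING_DATA:
        diff = predict(params, features) - label  # derivative of the error w.r.t. z
        grads[0] += diff * features[0]
        grads[1] += diff * features[1]
        grads[2] += diff
    n = len(TRAINING_DATA)
    return [p - learning_rate * g / n for p, g in zip(params, grads)]


params = [0.0, 0.0, 0.0]
error_before = average_error(params)
for _ in range(500):
    params = training_step(params)
error_after = average_error(params)

print(f"error before training: {error_before:.3f}")
print(f"error after training:  {error_after:.3f}")
```

Each pass lowers the error a little. In a real network, backpropagation carries the same derivative calculation backward through many layers of neurons, which is a large part of why training is so computationally expensive.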
B. Generating an output
Okay, so if neural networks use complex shapes to model numerical data, does an LLM understand text and, if so, how?10 After all, language has many nuances; for example, there is a big difference between the sentences, “Let’s eat, Grandma,” and, “Let’s eat Grandma.” Consistent with our breakdown here, we’ll take a high-level, 50,000-foot view to understand basic concepts relevant for this discussion.
In general, two techniques from natural language processing (NLP) are used to convert text into numerical data: (1) tokenization and (2) word embeddings. Tokenization separates text data into smaller units called tokens.11 For example, the sentence, “The apple is red,” could be broken down into five tokens [The, apple, is, red, .]. Word embedding converts a word into a vector, or a list of numbers, that captures the word’s semantic and syntactic meaning within a given context.12 For example, “Apple” can be represented by a three-dimensional vector, Apple = [0.4, 0.5, 0.1]. Provided with enough training data (e.g. books, articles, posts on X), a vector containing hundreds of numbers or dimensions can be used to represent the meaning of a particular word.13 Vectors capture relationships between words. For example, a word vector for “Apple” may have high values for dimensions representing “fruit” or “red” and lower values for dimensions representing “hammer” or “black.”14
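Both steps can be illustrated with a short Python sketch. The tokenizer below simply splits on words and punctuation (production LLMs typically use subword tokenizers instead). The three-dimensional vectors are hand-picked, hypothetical values; a real model learns its embeddings from training data.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical 3-dimensional embeddings; real models learn hundreds of
# dimensions from training data rather than using hand-picked numbers.
EMBEDDINGS = {
    "apple": [0.4, 0.5, 0.1],
    "orange": [0.5, 0.4, 0.1],
    "hammer": [0.1, 0.0, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Measure how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


tokens = tokenize("The apple is red.")
print(tokens)

apple_orange = cosine_similarity(EMBEDDINGS["apple"], EMBEDDINGS["orange"])
apple_hammer = cosine_similarity(EMBEDDINGS["apple"], EMBEDDINGS["hammer"])
print(f"apple vs. orange: {apple_orange:.2f}")
print(f"apple vs. hammer: {apple_hammer:.2f}")
```

Cosine similarity is one common way to compare word vectors. With these assumed values, “apple” sits much closer to “orange” than to “hammer,” which is the sense in which vectors capture relationships between words.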
Understanding this context, neural networks are able to predict the next word in a sequence using probabilities. A turning point in the field was the introduction of the transformer architecture in 2017, which added an attention mechanism to a neural network, giving it the ability to focus on different portions of the input sequence.15 For example, in the sentence, “The dog walked into the house,” the relationship between “dog” and “walked” may be more important than the relationship between “the” and “house.” These considerations (broadly speaking, similar to a weighted sum) can be performed in parallel (where GPUs shine), which is significant because, instead of two classes (“apple” or “orange”), here we may have tens of thousands of classes representing words in a language. If the input is “The Apple is,” the LLM may predict the next word is “red,” “delicious” or “rotten” based on which of these words (or tokens) has the highest probability. The selected output tokens (with the highest probability) are converted from numerical representations into text, which can be a single word or sequence of words. With each new word generated by an LLM, the process is repeated to predict and generate the next word, until reaching a predefined limit.
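The predict, append and repeat loop described above can be sketched as follows. The score table is a hypothetical stand-in for the model: a real LLM computes scores over its entire vocabulary from the whole input sequence using attention, not from a lookup keyed on the last word. The function names and scores are assumptions for illustration.

```python
import math

# Hypothetical scores ("logits") a model might assign to candidate next tokens,
# keyed by the most recent token. A real LLM computes these from the whole
# input sequence using attention; this lookup table is a stand-in.
NEXT_TOKEN_SCORES = {
    "The": {"apple": 2.0, "dog": 1.0},
    "apple": {"is": 3.0, "fell": 0.5},
    "is": {"red": 2.0, "delicious": 1.5, "rotten": 0.5},
    "red": {".": 2.5, "and": 0.5},
}


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores.values())
    exps = {tok: math.exp(s - highest) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def generate(prompt: list[str], max_new_tokens: int = 5) -> list[str]:
    """Greedy autoregressive decoding: append the likeliest token, then repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if scores is None:  # no known continuation, so stop early
            break
        probabilities = softmax(scores)
        tokens.append(max(probabilities, key=probabilities.get))
    return tokens


probabilities = softmax(NEXT_TOKEN_SCORES["is"])
output = generate(["The", "apple", "is"])
print(probabilities)
print(" ".join(output))
```

This sketch always picks the single most probable token, matching the description above. Many deployed systems instead sample from the probability distribution, which is one reason the same prompt can produce different outputs.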
Hopefully you now have a basic understanding of how generative AI works in the context of LLMs. We certainly skipped over many topics (e.g. transformers, fine-tuning, transfer learning, reinforcement learning from human feedback (RLHF), retrieval augmented generation) and only broadly introduced the concepts necessary for this discussion. For more details, there are plenty of great resources available, including courses and videos from Andrej Karpathy, fast.ai, Hugging Face, and Andrew Ng.
2. Who owns content created by Generative AI?
Prompt a generative AI application with an input and who owns the output? A simple question, yet not quite a simple answer. For example, if a person prompts OpenAI’s Dall·E to “create an abstract illustration of Mars,” in a matter of seconds, Dall·E will produce a detailed and vivid image of Mars. Today, that person can readily download, copy and share this illustration. But who owns that illustration—the model owner (OpenAI), the person who prompted the model or someone else? What about the rights to any IP for the illustration? Obtaining answers to these questions may prove difficult, but we will start with the model owner’s terms of service.
A. Terms of service
While there are other important provisions in each terms of service (e.g. limitations of liability, privacy, data rights), we will focus on provisions related to the ownership of the output. A summary of these provisions from popular providers is provided below.
- OpenAI. The user is assigned rights to the output. “As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain all ownership rights in Input and (b) own all Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.”16
- Microsoft. The user retains ownership to any output. “Output Content is Customer Data. Microsoft does not own Customer's Output Content.”17
- GitHub. The user retains ownership of its code. “GitHub does not own Suggestions. You retain ownership of Your Code.”18
- Anthropic. The user is assigned rights to the output. “Customer Content. As between the parties and to the extent permitted by applicable law, Anthropic agrees that Customer owns all Outputs, and disclaims any rights it receives to the Customer Content under these Terms. Anthropic does not anticipate obtaining any rights in Customer Content under these Terms. Subject to Customer’s compliance with these Terms, Anthropic hereby assigns to Customer its right, title and interest (if any) in and to Outputs.”19
In short, a common theme found in these terms of service relevant to the discussion here is that the providers do not claim ownership of the output generated from their models.
B. Intellectual property
If the model owner claims no IP rights in the output, can a user of a model claim IP rights to the output? Given the medium of a model’s output (e.g. text, images, audio), copyright rights are likely to be at issue. And while patent and trade secret rights can also be at issue, depending on the context of the model’s use, we will limit our discussion to copyright law for the remainder of the article.
Copyright protection is extended to “original works of authorship fixed in any tangible medium of expression … from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.” 17 U.S.C.A. § 102(a). Works of authorship can include, for example, literary, musical and pictorial works, motion pictures and sound recordings. Id.
The US Copyright Office (USCO) has offered some guidance on whether an AI-generated work is eligible for copyright registration.20 Generally, consistent with precedent, it is the USCO’s view that “copyright can protect only material that is the product of human creativity.”21 While this issue is on appeal, the United States District Court for the District of Columbia has agreed with the USCO that “[h]uman authorship is a bedrock requirement of copyright.” Thaler v. Perlmutter, No. CV 22-1564 (BAH), 2023 WL 5333236, at *4 (D.D.C. Aug. 18, 2023). USCO’s current policy squarely addresses perhaps the most common use of LLMs today by stating that “when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the ‘traditional elements of authorship’ are determined and executed by the technology—not the human user.”22 For example, an LLM, when prompted to “‘write a poem about copyright law in the style of William Shakespeare,’” will choose “the rhyming pattern, the words in each line, and the structure of the text” and generate “text that is recognizable as a poem, mentions copyright and resembles Shakespeare’s style.”23 The USCO states that “[w]hen an AI technology determines the expressive elements of its output, the generated material is not the product of human authorship,” and if the “work lacks human authorship[,] the [USCO] will not register it.”24
That is not to say that outputs of generative AI can never be part of the creative process. The USCO clarifies that “a human may select or arrange AI-generated material in a sufficiently creative way that ‘the resulting work as a whole constitutes an original work of authorship’” or an “artist may modify material originally generated by AI technology to such a degree that the modifications meet the standard for copyright protection.”25 According to the USCO, “[i]n these cases, copyright will only protect the human-authored aspects of the work, which are ‘independent of’ and do ‘not affect’ the copyright status of the AI-generated material itself.”26 For example, “[i]n February 2023, the [USCO] concluded that a graphic novel comprised of human-authored text combined with images generated by the AI service Midjourney constituted a copyrightable work, but that the individual images themselves could not be protected by copyright.”27 The USCO concludes that “[i]n each case, what matters is the extent to which the human had creative control over the work’s expression and ‘actually formed’ the traditional elements of authorship.”28
3. Does Generative AI infringe copyrights?
Generative AI models are trained on vast amounts of data. But what if the training data contains copyrighted material such as source code, books, songs, images or video? A copyright holder has the exclusive rights to reproduce copyrighted content, create derivative works from it and distribute copies of it. 17 U.S.C.A. § 106. Presumably then, a developer would need permission or a license to copy a copyrighted work for a model’s training data—otherwise they may risk committing copyright infringement. But not all unauthorized copying is created equal.
As the Supreme Court has explained, from the very beginning of copyright protection, fulfilling its purpose “[t]o promote the Progress of Science and useful Arts” has required some opportunity for fair use of copyrighted materials. Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 575 (1994) (quoting U.S. Const., Art. I, § 8, cl. 8). This is called the fair use doctrine. But what about an output? If a user asks a model for illustrations related to Star Wars, is that fair use or an impermissible derivative work? Is a generative AI application in this context nothing more than a tool to create content, such that the onus or liability falls on the user?
Several ongoing disputes between content owners and generative AI platforms are still in the early innings. Currently, a single case, Thomson Reuters Enter. Ctr. GmbH v. Ross Intel. Inc., has considered the role of fair use in building a model and its outputs, but the court denied motions for summary judgment on fair use, leaving the issue to be resolved by a jury. No. 1:20-CV-613-SB, 2023 WL 6210901, at *7-11 (D. Del. Sept. 25, 2023). The court said it was placed in an “uncomfortable position” to resolve the issue when faced with “a hotly debated question: Is it in the public benefit to allow AI to be trained with copyrighted material?” Id. at *11.
In this section, we will explore fair use precedent and how it may be applied to generative AI. Fair use is a defense to an accusation of copyright infringement. 17 U.S.C. § 107. The fair use analysis considers four factors: “(1) the purpose and character of the use ...; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the work.” Id. These factors are weighed together. Andy Warhol Found. for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508, 513 (2023). There are two distinct phases where copyrighted material may be implicated: (1) the training phase, where the model is built, and (2) the inference phase, where a user prompts a model and receives an output. Let’s consider each fair use factor for these two phases.
A. Purpose and character of the use
“[T]he first fair use factor considers whether the use of a copyrighted work has a further purpose or different character, which is a matter of degree, and the degree of difference must be balanced against the commercial nature of the use.” Andy Warhol, 598 U.S. at 532. “A use that has a further purpose or different character is said to be transformative.” Id. at 529 (internal quotation and citations omitted). This factor “asks ‘whether and to what extent’ the use at issue has a purpose or character different from the original.” Id. at 510. “The more transformative the new work, the less will be the significance of ... commercialism.” Campbell, 510 U.S. at 569.
- Training phase. When building a model such as an LLM, any copyrighted work included in the training data is being used to learn patterns. After a model is trained, training data is no longer needed to make predictions. The model is stored in a file or set of files that includes the learned parameters, and it can be run within a machine learning framework or accessed through an API to make predictions on new data. There is no copy or reproduction of the copyrighted work within the model itself (the file or set of files with learned list of parameters). Instead, the work is converted into tokens and, in the context of language, words are converted into vectors with thousands of dimensions reflecting relationships between words. Billions of parameters are adjusted iteratively to fit training data. While a majority of the time a developer’s use of a copyrighted work is likely to be commercial, the process of tokenization and word embedding to generate a probability distribution over potential next tokens in the sequence may be found to be a transformational use of a copyrighted work.
- Inference phase. Given a user’s many options for prompting and re-prompting a model, the purpose and character of use will have to be evaluated on a case-by-case basis. An evaluation here would seemingly be no different than for an individual using any other medium to create an artistic expression. Armed with an LLM, does an individual strive to recreate someone’s copyrighted work? Is this secondary work for commercial, research or educational purposes? An output that resembles a copyrighted work used in a model’s training data may not be particularly transformative. A model’s capabilities may be considered as well. During inference, new content is generated and limited based on the model’s architecture, capacity and context window. These constraints may produce results that are not substantially similar to expressions of the copyrighted works found in the training data.
B. Nature of the copyrighted work
Under the second factor, the nature of the copyrighted work, “‘the more creative a work, the more protection it should be accorded from copying; correlatively, the more informational or functional the plaintiff’s work, the broader should be the scope of the fair use defense.’” Apple Inc. v. Corellium, Inc., No. 21-12835, 2023 WL 3295671, at *9 (11th Cir. May 8, 2023) (quoting Nimmer on Copyright § 13.05[A][2][a]). Notably, “[t]he second factor has rarely played a significant role in the determination of a fair use dispute.” Authors Guild v. Google, Inc., 804 F.3d 202, 220 (2d Cir. 2015).
- Training phase. The central question here is whether the underlying copyrighted work used in a model’s training data is creative or more informational. The more creative the copyrighted work, such as a play, book or image, the broader the legal protection it receives, and the less likely its use is to be fair use.
- Inference phase. The output would need to be assessed for its reliance on any copyrighted expression of any particular work. Facts and ideas themselves are not protectable under the Copyright Act. Is the output seemingly derivative of content that is informational (e.g. historical facts) or creative (e.g. mosaic of Chicago)?
C. Amount and substantiality of the portion used
Depending on the purpose and character of the use, the third factor asks whether “no more was taken than necessary.” Campbell, 510 U.S. at 586-587. “The ‘substantiality’ factor will generally weigh in favor of fair use where … the amount of copying was tethered to a valid, and transformative, purpose.” Google LLC v. Oracle Am., Inc., 593 U.S. 1, 34 (2021). For example, in Authors Guild v. Google, Inc., Google made unauthorized digital copies of entire books, and the court concluded that this copying “enable[d] the search functions to reveal limited, important information about books.” 804 F.3d 202, 221 (2d Cir. 2015).
- Training phase. A content owner would likely contend that the amount of copying by a model is substantial and captures the heart of their copyrighted expression. Complete copies of copyrighted works may be used in the training data for a model. But here a model provider would respond that the amount of copying is directly tethered to the purpose of the copying: training the model. Not only would it be time consuming, impractical and perhaps cost prohibitive to parse data along the fact/expression dichotomy of copyright law, but training on incomplete data may also have an adverse impact on a model’s performance. A model provider would argue that, similar to Google’s book search function, its copying is necessary to learn patterns and provide useful outputs.
- Inference phase. An individual output or series of outputs would need to be assessed for whether substantial portions of copyrighted works are recreated that capture the heart of the copyright’s expression. Where the output is text, a model’s context window may generalize or summarize a work as opposed to directly capturing a particular expression. Many service providers already have content policies in place that restrict a particular output based on the input.
D. Effect of the use upon the potential market
The fourth factor “looks to the secondary use’s effect on the potential market for or value of the copyrighted work.” Apple Inc. v. Corellium, Inc., No. 21-12835, 2023 WL 3295671, at *11 (11th Cir. May 8, 2023). “This factor ‘requires courts to consider not only the extent of market harm caused by the particular actions of the alleged infringer, but also whether unrestricted and widespread conduct of the sort engaged in by the defendant would result in a substantially adverse impact on the potential market for the original.’” Id. (quoting Campbell, 510 U.S. at 590 (cleaned up)).
- Training phase. A model provider will likely contend that the existence of any particular AI model will not harm a potential market for a copyrighted work. But a content owner would likely respond that a model’s outputs will act as substitutes. Content owners may further contend that absent a license, such activities would impair their ability to monetize their own content for other models. An owner of a portfolio of copyrighted works may argue this may impact a market for their own models. For example, a large animation studio may want to create their own model trained on their existing copyrighted works that may create similar illustrations, songs, scripts, stories or video.
- Inference phase. During inference, the central question is whether the content created by a model directly competes with and has an adverse impact on the original copyrighted work. A content creator or user of a model may argue that the output does not compete but rather expands the market for a particular copyrighted work. For example, an output that comments on, parodies or analyzes a copyrighted work may draw attention to the original work and increase its value.
Conclusion
In this article, we've explored the multifaceted world of generative AI, highlighting both the economic and computational demands of training large language models and the intricate legal challenges they pose. This conclusion, drafted by ChatGPT, illustrates the dual-edged nature of this technology—while it offers remarkable opportunities for innovation and efficiency, it also necessitates careful consideration of its broader implications. Through generative AI, we can achieve tremendous advancements, provided we navigate its challenges with foresight and responsibility.
1 https://arxiv.org/pdf/1706.03762.pdf.
2 https://www.youtube.com/watch?v=zjkBMFhNj_g&t=202s.
3 https://www.ibm.com/topics/large-language-models.
4 https://medium.com/data-science-at-microsoft/how-large-language-models-work-91c362f5b78f.
5 Id.
6 https://youtu.be/hBBOjCiFcuo?si=rJyn4wk7yjToCHxW&t=2600 (Fast.ai Lesson 3: Practical Deep Learning for Coders 2022).
7 https://www.youtube.com/watch?v=zjkBMFhNj_g&t=202s.
8 https://a16z.com/navigating-the-high-cost-of-ai-compute/.
9 Id.
10 Images are numerical inputs consisting of pixels (height, width and three channels: red, green, blue). https://medium.com/data-science-at-microsoft/how-large-language-models-work-91c362f5b78f.
11 https://saschametzger.com/blog/what-are-tokens-vectors-and-embeddings-how-do-you-create-them.
12 https://medium.com/data-science-at-microsoft/how-large-language-models-work-91c362f5b78f.
13 https://saschametzger.com/blog/what-are-tokens-vectors-and-embeddings-how-do-you-create-them.
14 Id.
15 https://arxiv.org/pdf/1706.03762.pdf.
16 https://openai.com/policies/business-terms (last visited May 7, 2024).
17 https://www.microsoft.com/licensing/terms/product/foronlineservices/all (last visited May 22, 2024).
18 https://github.com/customer-terms/github-copilot-product-specific-terms (last visited May 8, 2024).
19 https://www.anthropic.com/legal/commercial-terms (last visited May 8, 2024).
20 https://www.govinfo.gov/content/pkg/FR-2023-03-16/pdf/2023-05321.pdf (last visited May 8, 2024).
21 Id.
22 Id. The USCO solicited comments late last year on various AI and copyright issues and is working on a study: https://www.govinfo.gov/content/pkg/FR-2023-08-30/pdf/2023-18624.pdf.
23 Id.
24 Id.
25 Id.
26 Id.
27 Id.
28 Id.