California and artificial intelligence laws: Transparency in training
On September 29, 2024, California enacted another law relating to artificial intelligence. This one, AB 2013, focuses on transparency in the training of generative AI and will take effect on January 1, 2026.
Transparency in Training Generative AI (AB 2013)
For each generative artificial intelligence system or service made publicly available to Californians for use, this law will require the developer to post on its website documentation regarding the data used to train that system or service. The documentation must include, among other requirements, a high-level summary of the datasets used in the development of the system or service. This requirement also applies to each “substantial modification” of a generative artificial intelligence system or service that was originally released on or after January 1, 2022.
AB 2013 defines “developer” broadly. Its obligations reach any “person, partnership, state or local government agency, or corporation that designs, codes, produces, or substantially modifies an artificial intelligence system or service for use by members of the public.”
Key definitions in the law include:
“Generative artificial intelligence” means artificial intelligence that can generate derived synthetic content, such as text, images, video, and audio, that emulates the structure and characteristics of the artificial intelligence’s training data.
“Substantially modifies” or “substantial modification” means a new version, new release, or other update to a generative artificial intelligence system or service that materially changes its functionality or performance, including the results of retraining or fine tuning.
“Members of the public” excludes affiliates, as defined in subparagraph (A) of paragraph (1) of subdivision (c) of Section 1799.1a: “any entity that, directly or indirectly, through one or more intermediaries, controls, is controlled by, or is under common control with, another entity.” The definition also excludes hospitals’ medical staff members.
The law specifies what the high-level summary of the datasets used to train a generative artificial intelligence system or service must contain. The posted online documentation must include the following:
- Sources or owners of the datasets.
- A description of how the datasets further the intended purpose of the AI system or service.
- The number of data points included in the datasets.
- A description of the types of data points within the datasets.
- Whether the datasets include any data protected by copyright, trademark, or patent, or whether the datasets are entirely in the public domain.
- Whether the datasets were purchased or licensed by the developer.
- Whether the datasets include personal information.
- Whether the datasets include aggregate consumer information.
- Whether there was any cleaning, processing, or other modification to the datasets by the developer, including the intended purpose of those efforts in relation to the artificial intelligence system or service.
- The time period during which the data in the datasets were collected, including a notice if the data collection is ongoing.
- The dates the datasets were first used during the development of the artificial intelligence system or service.
- Whether the generative artificial intelligence system or service used or continuously uses synthetic data generation in its development.
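For developers preparing this documentation, the items above can be tracked as a simple structured record. The following is a minimal sketch in Python, not a format prescribed by AB 2013: the class name `DatasetSummary`, its field names, and the helper `missing_items` are all hypothetical, and the assumption that an end date is unnecessary when collection is ongoing is an illustrative reading rather than a rule stated in the law.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class DatasetSummary:
    """Hypothetical record mirroring the summary items listed above.

    Field names are illustrative only; AB 2013 does not prescribe a schema.
    A value of None means the item has not been filled in yet.
    """
    sources_or_owners: Optional[List[str]] = None
    purpose_description: Optional[str] = None
    data_point_count: Optional[int] = None
    data_point_types: Optional[str] = None
    includes_ip_protected_data: Optional[bool] = None  # copyright, trademark, or patent
    purchased_or_licensed: Optional[bool] = None
    includes_personal_information: Optional[bool] = None
    includes_aggregate_consumer_information: Optional[bool] = None
    processing_description: Optional[str] = None  # cleaning/processing and its purpose
    collection_start: Optional[date] = None
    collection_end: Optional[date] = None
    collection_ongoing: Optional[bool] = None
    first_used: Optional[date] = None
    uses_synthetic_data: Optional[bool] = None


def missing_items(summary: DatasetSummary) -> List[str]:
    """Return the names of fields that are still unset."""
    missing = [f.name for f in fields(summary) if getattr(summary, f.name) is None]
    # Assumption: when collection is ongoing, no end date is expected.
    if summary.collection_ongoing and "collection_end" in missing:
        missing.remove("collection_end")
    return missing


# Example: a partially completed draft for a made-up dataset.
draft = DatasetSummary(
    sources_or_owners=["Example Licensor (hypothetical)"],
    data_point_count=1_000_000,
    collection_start=date(2020, 1, 1),
    collection_ongoing=True,
)
print(missing_items(draft))
```

A record like this only helps confirm that each listed item has been addressed; whether a given answer satisfies the law is a separate legal question.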
The law provides some exceptions under which the training dataset documentation need not be posted. These include:
- A generative artificial intelligence system or service whose sole purpose is to help ensure security and integrity.
- A generative AI system or service whose sole purpose is the operation of aircraft in national airspace.
- A generative AI system or service developed for national security, military, or defense purposes that is made available only to a federal entity.
This law may also be interpreted to require developers facing lawsuits over the alleged misuse of training data to disclose the sources of that training data, rather than only the high-level summaries required for the posted documentation.