AI Training Data Licensing: What a Usable Agreement Looks Like

The case law on AI training data is genuinely unsettled, and counsel drafting in this space should be candid about that with their clients. I will lay out the structure of an agreement that I think holds up under most plausible outcomes, and flag the points where the law has not landed.

The cases that will set the rules for AI training data are still in motion. The Bartz v. Anthropic class action over books used in training, the parallel Kadrey v. Meta matter, the New York Times v. OpenAI litigation, the Doe v. GitHub line on code, the Andersen v. Stability AI line on images, and the multiple state-court and overseas matters running in parallel will produce a patchwork of outcomes over the next eighteen to thirty months. The fair-use question is the one most observers are watching, and the trial-court rulings to date have been inconsistent on key sub-issues, including transformative use, market substitution, and the relevance of opt-out mechanisms. I do not think any practitioner should be confident about how this resolves.

The drafting consequence is that an AI training-data license that depends on the fair-use question coming out one way is, in my view, an imprudent license. The agreement should be structured so that it holds up under either a strong fair-use outcome or a weak one. That means an explicit grant, an explicit warranty stack, an explicit indemnity, and an explicit allocation of the model-output question.

The grant clause

The grant should not be one sentence. The structural elements I would not draft without:

The warranty stack

The licensor's warranties are where the actual risk allocation happens. The minimum warranties I push for from the licensee side:

  1. That the licensor owns or has the rights to license the data.
  2. That the use of the data for training as specified in the agreement does not infringe a third party's copyright, trademark, or other intellectual property right.
  3. That the data was collected in compliance with applicable law (including privacy law, web-scraping access controls, and the contractual terms of the source).
  4. That no individual whose data is in the dataset has invoked an opt-out right that would prevent the licensed use.

The fourth warranty is the one most licensors are not equipped to give cleanly. The opt-out infrastructure for training data is still developing, and the warranty as drafted may not be supportable. The fallback I land on: the licensor warrants that, to its knowledge, no opt-out has been invoked, paired with a covenant that the licensor will use commercially reasonable efforts to honor opt-outs invoked after the license date. That is softer than I would like but reflects the reality of the data layer.

The indemnity

For a substantial training-data license, the indemnity is doing the bulk of the work. The structure I push for: the licensor indemnifies the licensee for any third-party claim alleging that the licensed data, as used in accordance with the agreement, infringes the third party's IP rights or was collected unlawfully. The indemnity should cover defense costs, settlement amounts, and judgments, with reasonable procedural controls.

The carve-outs the licensor will request:

The first carve-out is reasonable in principle but easily overdrafted. The same narrowing language I use on SaaS IP indemnity carve-outs applies here. The second carve-out is reasonable if the licensor has built an opt-out infrastructure; if not, it shifts risk to the licensee for compliance that the licensee cannot operate. The third carve-out is the place where licensors try to push output-infringement risk to the licensee, and the negotiation depends on whether the licensor or the licensee has more visibility into output behavior.

The data provenance file

The operational requirement I now add to any substantive training-data license: a data provenance file. The licensor agrees to deliver, at the time of license, a structured record of the data's origin (sources, dates of collection, methods of collection, applicable consents or licenses), the data's processing (deduplication, filtering, redaction), and the data's known limitations (including infringing or otherwise risky subsets that have been identified and removed). The provenance file is what counsel for the licensee will need when a claim arrives and the licensee has to trace the data's lineage in litigation.
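
For readers who want to see what "structured record" could mean in practice, here is a minimal sketch in Python. It is an illustration under my own assumptions, not a standard schema: the class names (SourceRecord, ProvenanceFile), every field name, and the sample values are hypothetical and should be adapted to the deal. The three groups of fields track the three categories above (origin, processing, known limitations), and the two output methods show one way to separate a summary view from the full record.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class SourceRecord:
    """One origin of licensed data. All field names are illustrative assumptions."""
    name: str
    collected_from: date
    collected_to: date
    collection_method: str    # e.g. "direct license", "crawl", "user upload"
    consent_or_license: str   # basis on which the licensor holds the data


@dataclass
class ProvenanceFile:
    """Hypothetical provenance record: origin, processing, known limitations."""
    dataset_id: str
    sources: list[SourceRecord]
    processing_steps: list[str] = field(default_factory=list)  # dedup, filtering, redaction
    removed_subsets: list[str] = field(default_factory=list)   # risky subsets taken out

    def __post_init__(self) -> None:
        # A provenance file with no sources cannot support lineage tracing.
        if not self.sources:
            raise ValueError("a provenance file needs at least one source")

    def summary(self) -> dict:
        """Summary view: aggregate facts only, no per-source detail."""
        return {
            "dataset_id": self.dataset_id,
            "source_count": len(self.sources),
            "collection_methods": sorted({s.collection_method for s in self.sources}),
            "earliest_collection": min(s.collected_from for s in self.sources).isoformat(),
            "latest_collection": max(s.collected_to for s in self.sources).isoformat(),
            "processing_steps": list(self.processing_steps),
            "removed_subset_count": len(self.removed_subsets),
        }

    def full_record_json(self) -> str:
        """Full record, including per-source detail, serialized as JSON."""
        return json.dumps(asdict(self), default=str, indent=2)


# Sample values below are invented for illustration only.
example = ProvenanceFile(
    dataset_id="example-corpus-v1",
    sources=[
        SourceRecord("Publisher A archive", date(2021, 1, 1), date(2022, 6, 30),
                     "direct license", "written license from publisher"),
        SourceRecord("Forum B export", date(2022, 3, 1), date(2023, 1, 15),
                     "user upload", "terms-of-service consent"),
    ],
    processing_steps=["deduplication", "PII redaction"],
    removed_subsets=["items subject to takedown notices"],
)

print(json.dumps(example.summary(), indent=2))
```

The design choice worth noting is the split between summary() and full_record_json(): the summary exposes counts, date ranges, and methods without naming individual sources, while the full record carries the per-source detail counsel would need to trace lineage. Which fields belong in which view is a negotiated point, not something the code decides.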

The licensor will resist the provenance-file requirement. The reasonable compromise is a summary provenance file at license date, with a fuller record made available on request in the event of a third-party claim. The full record is not, in most deals, something the licensor wants in a public discovery record, so the agreement should treat it as confidential.

What I would not assume

The fair-use question is genuinely open. The Bartz v. Anthropic record will affect how courts apply transformative-use analysis to wholesale ingestion of copyrighted books. The Doe v. GitHub line will affect code. The Andersen v. Stability AI line will affect images. Each of these matters has had inconsistent rulings at the trial-court level. The licensee that relies on fair use in place of a license is taking on litigation risk that an actual license, with warranties and indemnity, displaces.

The CCPA and CPRA inference rules, and the CPPA's evolving regulations on automated decisionmaking technology (ADMT), also bear on whether training on personal data carries separate state-law exposure. The CPPA's draft rules through 2024 and 2025 have indicated that automated decision-making systems trained on personal information are within the agency's scope, with specific notice and opt-out obligations. The drafting move: assume the regulatory layer will grow stricter, draft the agreement to support compliance with stricter rules, and put the burden of regulatory compliance on the party best positioned to handle it (usually the licensee, with supporting representations from the licensor).

AI training data deal on your desk?

If you are working a training-data license in either direction and want a written redline with warranty, indemnity, and provenance positions I would take, email owner@terms.law with the current draft.

Sergei Tokmakov, Esq., CA Bar #279869. This memo is attorney commentary on legal questions and is not legal advice. Reading it does not create an attorney-client relationship. Past matter outcomes depend on facts and the responding party; nothing here is a prediction of result.