AI Training Data Licensing: What a Usable Agreement Looks Like

The case law on AI training data is genuinely unsettled, and counsel drafting in this space should be candid about that with their clients. I will lay out the structure of an agreement that I think holds up under most plausible outcomes, and flag the points where the law has not landed.

The cases that will set the rules for AI training data are still in motion. The Bartz v. Anthropic class action over books used in training, the parallel Kadrey v. Meta matter, the New York Times v. OpenAI litigation, the Doe v. GitHub line on code, the Andersen v. Stability AI line on images, and the multiple state-court and overseas matters running in parallel will produce a patchwork of outcomes over the next eighteen to thirty months. The fair-use question is the one most observers are watching, and the trial court rulings to date have been inconsistent on key sub-issues including transformative use, market substitution, and the relevance of opt-out mechanisms. I do not think any practitioner should be confident about how this resolves.

The drafting consequence is that an AI training-data license that depends on the fair-use question coming out one way is, in my view, an imprudent license. The agreement should be structured so that it holds up under either a strong fair-use outcome or a weak one. That means an explicit grant, an explicit warranty stack, an explicit indemnity, and an explicit allocation of the model-output question.

The grant clause

The grant should not be one sentence. The structural elements I would not draft without:

Scope of permitted use. The agreement should specify that the licensee may use the licensed data to train, fine-tune, and evaluate machine learning models. It should specify whether the use is limited to a single named model or extends to derivative or successor models. For most enterprise data licensors, the answer is a single model with the right to extend for an additional fee.
Output rights. The agreement should specify what rights the licensee has in the outputs of models trained on the licensed data. The licensor's first draft typically says nothing about outputs, which leaves the licensee exposed if the output infringes on the licensed work. The licensee should push for a representation that outputs do not derive in a copyrightable sense from any individual licensed work, plus an indemnity if that representation fails. The licensor will resist this and the negotiation depends on the dataset.
Territory and duration. Training is typically a one-time event, but the model continues to use what it learned. A license that terminates after a defined period leaves the licensee with a model trained on data it can no longer represent it has rights to. The cleaner structure: a perpetual, worldwide license to train models and to use models trained on the licensed data, with a finite license for ongoing access to the dataset itself.
Sublicensing. If the licensee is going to provide a model to its own customers, the agreement needs to address whether the customers' use of the model is sublicensed. The conservative position is that the licensee's customers use the model under the licensee's general license terms and the licensee remains responsible to the licensor; no separate sublicense flows through to the customer.
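
The two-tier duration structure above is easier to see when written down as data. The following is a minimal Python sketch, not a drafting template: the class name, field names, model names, and dates are all hypothetical, and it assumes the grant names specific models and sets a single end date for dataset access.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TrainingDataGrant:
    """Hypothetical summary of a grant clause; all names are illustrative."""
    licensed_models: frozenset          # named models covered by the grant
    dataset_access_ends: date           # finite term for access to the dataset itself
    model_use_perpetual: bool = True    # trained models stay licensed after access ends
    customer_sublicense: bool = False   # conservative position: no pass-through sublicense

    def may_access_dataset(self, on: date) -> bool:
        # Ongoing access to the dataset runs only through the finite term.
        return on <= self.dataset_access_ends

    def may_use_trained_model(self, model: str, on: date) -> bool:
        # A model outside the named scope is never covered.
        if model not in self.licensed_models:
            return False
        # Use of a covered model survives expiry only if the grant is perpetual.
        return self.model_use_perpetual or on <= self.dataset_access_ends


grant = TrainingDataGrant(frozenset({"model-a"}), date(2027, 1, 1))
after_expiry = date(2028, 1, 1)
print(grant.may_access_dataset(after_expiry))                # dataset access has lapsed
print(grant.may_use_trained_model("model-a", after_expiry))  # trained model remains covered
print(grant.may_use_trained_model("model-b", after_expiry))  # successor model not in scope
```

The point of separating the two checks is that a lapse in dataset access does not, under this structure, disturb the licensee's rights in a model it has already trained.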

The warranty stack

The licensor's warranties are where the actual risk allocation happens. The minimum warranties I push for from the licensee side:

That the licensor owns or has the rights to license the data.
That the use of the data for training as specified in the agreement does not infringe a third party's copyright, trademark, or other intellectual property right.
That the data was collected in compliance with applicable law (including privacy law, web-scraping access controls, and the contractual terms of the source).
That no individual whose data is in the dataset has invoked an opt-out right that would prevent the licensed use.

The fourth warranty is the one most licensors are not equipped to give cleanly. The opt-out infrastructure for training data is still developing, and the warranty as drafted may not be supportable. The fallback I land on: the licensor warrants that, to its knowledge, no opt-out has been invoked, with a covenant that the licensor will use commercially reasonable efforts to honor opt-outs invoked after the license date. That is softer than I would like but reflects the reality of the data layer.
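
The fallback turns on when an opt-out was invoked relative to the license date, so an opt-out log is only useful if it can be split on that date. A minimal Python sketch of that triage follows. The `OptOut` record, the `triage_opt_outs` function, and the sample data are hypothetical, and the sketch assumes that opt-outs invoked on or before the license date bear on the knowledge-qualified warranty while later ones fall under the commercially-reasonable-efforts obligation.

```python
from datetime import date
from typing import Iterable, List, NamedTuple, Tuple


class OptOut(NamedTuple):
    record_id: str      # hypothetical identifier for the affected data
    invoked_on: date    # date the individual invoked the opt-out


def triage_opt_outs(
    opt_outs: Iterable[OptOut], license_date: date
) -> Tuple[List[OptOut], List[OptOut]]:
    """Split opt-outs into (pre-license, post-license) lists.

    Assumption for illustration: anything invoked on or before the license
    date is a potential warranty issue; anything later is an ongoing
    obligation to honor.
    """
    warranty_issues: List[OptOut] = []
    ongoing_obligations: List[OptOut] = []
    for opt_out in opt_outs:
        if opt_out.invoked_on <= license_date:
            warranty_issues.append(opt_out)
        else:
            ongoing_obligations.append(opt_out)
    return warranty_issues, ongoing_obligations


log = [
    OptOut("rec-001", date(2025, 3, 1)),
    OptOut("rec-002", date(2025, 9, 15)),
]
pre, post = triage_opt_outs(log, license_date=date(2025, 6, 30))
print([o.record_id for o in pre])   # invoked before the license date
print([o.record_id for o in post])  # invoked after the license date
```

Whether a pre-license opt-out actually breaches the warranty still depends on the licensor's knowledge at signing, which no log can settle on its own.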

The indemnity

For a substantial training-data license, the indemnity is doing the bulk of the work. The structure I push for: the licensor indemnifies the licensee for any third-party claim alleging that the licensed data, as used in accordance with the agreement, infringes the third party's IP rights or was collected unlawfully. The indemnity should cover defense costs, settlement amounts, and judgments, with reasonable procedural controls.

The carve-outs the licensor will request:

Claims arising from the licensee's combination of the licensed data with other data, where the infringement would not have occurred without the combination.
Claims arising from the licensee's failure to honor an opt-out after notice from the licensor.
Claims arising from the licensee's use of model outputs in a manner not contemplated by the agreement.

The first carve-out is reasonable in principle but easily overdrafted. The same narrowing language I use on SaaS IP indemnity carve-outs applies here. The second carve-out is reasonable if the licensor has built an opt-out infrastructure; if not, it shifts risk to the licensee for compliance that the licensee cannot operate. The third carve-out is the place where licensors try to push output-infringement risk to the licensee, and the negotiation depends on whether the licensor or the licensee has more visibility into output behavior.

The data provenance file

The operational requirement I now add to any substantive training-data license: a data provenance file. The licensor agrees to deliver, at the time of license, a structured record of the data's origin (sources, dates of collection, methods of collection, applicable consents or licenses), the data's processing (deduplication, filtering, redaction), and the data's known limitations (known infringing or otherwise risky subsets that have been removed). The provenance file is what counsel for the licensee will need when a claim arrives and the litigation requires the licensee to trace the data's lineage.

The licensor will resist the provenance-file requirement. The reasonable compromise is a summary provenance file at license date, with a fuller record made available on request in the event of a third-party claim. The full record is not, in most deals, something the licensor wants in a public discovery record, so the agreement should treat it as confidential.
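
To make "structured record" concrete, here is a minimal Python sketch of what a provenance file could look like, with a summary view for delivery at license date and a full record held back for production on request. Every class, field, and sample value is hypothetical. The agreement itself, not this sketch, would define the required fields.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class SourceRecord:
    source: str                 # where the data came from
    collected_from: date        # start of the collection window
    collected_to: date          # end of the collection window
    method: str                 # method of collection, e.g. "licensed feed"
    consent_or_license: str     # applicable consent or license basis


@dataclass
class ProvenanceFile:
    dataset_name: str
    sources: list = field(default_factory=list)
    processing_steps: list = field(default_factory=list)  # deduplication, filtering, redaction
    removed_subsets: list = field(default_factory=list)   # known risky subsets already removed

    def summary(self) -> dict:
        """Summary view for the license date; omits per-source detail."""
        return {
            "dataset_name": self.dataset_name,
            "source_count": len(self.sources),
            "collection_methods": sorted({s.method for s in self.sources}),
            "processing_steps": list(self.processing_steps),
            "removed_subset_count": len(self.removed_subsets),
        }

    def full_record_json(self) -> str:
        """Full record, produced on request and treated as confidential."""
        return json.dumps(asdict(self), default=str, indent=2)


provenance = ProvenanceFile(
    dataset_name="example-corpus",
    sources=[
        SourceRecord("Publisher A archive", date(2023, 1, 1), date(2023, 12, 31),
                     "licensed feed", "written license"),
    ],
    processing_steps=["deduplication", "redaction"],
    removed_subsets=["items subject to takedown notices"],
)
print(provenance.summary())
```

Keeping the summary as a derived view of the full record means the two cannot drift apart, which matters if the summary given at signing is later compared against the full record in a dispute.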

What I would not assume

The fair-use question is genuinely open. The Bartz v. Anthropic record will affect how courts apply transformative-use analysis to wholesale ingestion of copyrighted books. The Doe v. GitHub line will affect code. The Andersen v. Stability AI line will affect images. Each of these matters has had inconsistent rulings at the trial court level. The licensee that depends on fair use as its license is taking on litigation risk that an actual license, with warranties and indemnity, displaces.

The CCPA and CPRA inference rules, and the California Privacy Protection Agency's (CPPA's) evolving automated decision-making technology (ADMT) regulations, also bear on whether training on personal data creates separate state-law exposure. The CPPA's draft rules through 2024 and 2025 have indicated that automated decision-making systems trained on personal information are within the agency's scope, with specific notice and opt-out obligations. The drafting move: assume the regulatory layer will grow stricter, draft the agreement to support compliance with stricter rules, and put the burden of regulatory compliance on the party best positioned to handle it (usually the licensee, with supporting representations from the licensor).


Sergei Tokmakov, Esq., CA Bar #279869. This memo is attorney commentary on legal questions and is not legal advice. Reading it does not create an attorney-client relationship. Past matter outcomes depend on facts and the responding party; nothing here is a prediction of result.
