Case Study: Licensing Archive Imagery Into a Model Training Set

Facts

My client was a model developer building a computer-vision model in a specialized vertical. The counterparty was an imagery archive that held the rights, by license from underlying creators, to a corpus of approximately one million curated images in that vertical. The parties had agreed on a commercial price in the mid six figures for a training-use license. They had not agreed in writing on what training use meant, whether the resulting model was a derivative work, what would happen to the trained model if an underlying creator withdrew consent, or whether the licensee could use prompts that named the archive's catalog identifiers.

Each party had a different unwritten assumption. The archive assumed an opt-out mechanism for underlying creators that would require the model to be retrained or weighted around removed images. The model developer assumed a one-time license to the images as they existed at signing, with no post-signing removal obligation. Both assumptions were defensible in the absence of an executed license; neither was workable as a deal term.

What I did

I drafted the data licensing agreement from scratch in collaboration with the archive's counsel. The agreement addressed the core terms in writing: the scope of training use, including whether outputs that resemble the licensed imagery are permitted; the treatment of the trained model as a derivative work or as an independent work informed by the training data; a tiered removal-on-request mechanism that distinguished among de-identification, weighted suppression, and full retraining, with cost allocation tied to whichever mechanism applied; a representations-and-warranties section on the archive's chain of license from underlying creators; and a survivability clause that addressed what would happen to the trained model on termination of the license.

I also drafted a side letter that gave the archive a periodic audit right limited to logged training inputs, on a confidentiality undertaking.

Outcome

The agreement was executed at the negotiated commercial price with the tiered removal-on-request mechanism, the audit side letter, and the survivability terms my client and the archive's counsel had negotiated. The model developer trained against the licensed corpus, and a removal request later in the term was handled under the de-identification tier without triggering full retraining. The contract framework remained in place across two subsequent license renewals. Each matter turns on its facts; the outcome here does not predict the outcome of a similarly framed data licensing transaction.

Lesson

A data licensing transaction for AI training is not a software license, not a content license, and not a stock-image license. It requires its own contract structure that addresses removal, derivative output, and post-termination treatment of the trained model. Parties that treat it as a stretched version of a content license end up with terms that fail under any real-world removal request. The contract architecture is the deal; the price is the easy part.

Have an AI or data licensing matter that looks similar?

Send the deal context and any draft terms in writing. I read every inquiry myself.

See the AI and data practice page. Email owner@terms.law.

Disclaimer. This case study is an anonymized writeup of a matter I handled. Names, industries, geographies, dollar amounts, and identifying details have been changed. Past results are not a guarantee, prediction, or warranty of any future outcome. Each matter turns on its own facts and applicable law. Reading this page does not create an attorney-client relationship.