Protect datasets shared for AI model training with NDAs designed to address data provenance, derivative works, bias representations, and regulatory compliance.
Create or analyze training data confidentiality agreements
Create NDA: Create a customized NDA for sharing datasets with AI companies, research partners, or contractors training ML models.
Analyze NDA: Received an NDA from an AI company that wants your data? Analyze it for data-unfriendly terms and get negotiation guidance.
Customized protections for different types of training data
Corpora, documents, and text datasets for NLP and LLM training.
Photos, graphics, and visual datasets for computer vision models.
Speech, music, and audio datasets for ASR and audio models.
Critical provisions for protecting shared training data
Data provenance (Critical for Liability): Requires disclosure of original data sources, licensing chains, consent status, and known limitations or biases in the dataset; a machine-readable manifest, as sketched after this list, is one way to deliver these disclosures.
Permitted use (Define Boundaries): Specifies exactly what the data can be used for: initial training only, fine-tuning, evaluation, or ongoing model updates.
Model ownership and derivatives (Negotiate Carefully): Clarifies whether trained models are derivative works of the data and who owns models, embeddings, and representations.
Representations and warranties (Risk Allocation): Addresses representations about data quality, completeness, and known biases, and allocates liability for model outputs.
Data retention and deletion (GDPR/CCPA Critical): Specifies how long data can be retained, deletion requirements, and whether embeddings or model artifacts must also be destroyed.
Non-disclosure and use restrictions (Standard Provision): Prevents sharing data with third parties, sublicensing, or making data available through model outputs or APIs.
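To make the provenance disclosure concrete, here is a minimal sketch of what a machine-readable provenance manifest could look like. The schema, field names, and dataset name are illustrative assumptions, not a standard; the actual format would be whatever the NDA's provenance clause specifies.

```python
# A minimal sketch of a provenance manifest for a shared dataset.
# Field names and the dataset name are illustrative assumptions, not a
# standard schema; align them with the NDA's actual disclosure clause.
import json
from dataclasses import dataclass, asdict

@dataclass
class SourceRecord:
    source: str              # where the records originated
    license: str             # license or agreement covering this source
    consent_status: str      # e.g. "opt-in", "contractual", "unknown"
    known_limitations: str   # gaps or biases the NDA requires you to disclose

manifest = {
    "dataset": "clinical-notes-v2",   # hypothetical name
    "sources": [
        asdict(SourceRecord(
            source="partner-hospital-ehr-export",
            license="data-sharing agreement, 2023-04",
            consent_status="opt-in",
            known_limitations="skews toward inpatient records; English only",
        )),
    ],
}

print(json.dumps(manifest, indent=2))
```

Keeping a manifest like this alongside the delivered files gives both parties a shared record of exactly what was disclosed at signing.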
How training data NDAs apply in common AI situations
A hospital shares de-identified patient records with an AI company to train diagnostic models. The NDA must address HIPAA compliance, re-identification risks, and model output liability. Tip: Include specific HIPAA BAA terms and prohibit any re-identification attempts.
A publisher licenses their article corpus for fine-tuning an LLM. The NDA must prevent the model from reproducing copyrighted content verbatim. Tip: Include output filtering requirements and prohibit memorization extraction attacks; a simple verbatim-overlap check is sketched after this list.
A social platform provides images for facial recognition research. The NDA must address consent verification, biometric data laws, and prohibited surveillance uses. Tip: Require BIPA/GDPR compliance and explicitly prohibit law enforcement or surveillance applications.
Voice actors provide recordings to train a text-to-speech system. The NDA must address voice cloning rights, synthetic voice usage, and ongoing royalties. Tip: Clearly define whether synthetic voices can be created and what attribution and compensation are required.
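As a rough illustration of the output-filtering idea in the publishing scenario, the sketch below flags model outputs that share long verbatim word sequences with the licensed corpus. The 12-word window is an assumed threshold, not a legal standard; any real test would be defined in the NDA itself.

```python
# A rough sketch of a verbatim-reproduction check: flag model outputs that
# contain long word n-grams appearing verbatim in the licensed corpus.
# The 12-word window is an assumed threshold, not a legal standard.

def ngrams(text: str, n: int) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(output: str, corpus: list[str], n: int = 12) -> set[str]:
    corpus_grams: set[str] = set()
    for doc in corpus:
        corpus_grams |= ngrams(doc, n)
    return ngrams(output, n) & corpus_grams

# Toy usage: a non-empty result suggests the output may reproduce
# licensed text and should be filtered or escalated for review.
corpus = ["licensed article text would go here " * 3]
output = "a model generation to screen before release"
print(verbatim_overlap(output, corpus))
```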
What to push back on when sharing training data
Push back on broad language like "any purpose" or "all machine learning applications." Limit to specific use cases, model types, or time periods.
Require specific deletion timelines. Consider whether embeddings, model weights trained on your data, and cached copies must also be destroyed.
Insist on clear language about model ownership. If your data trains their model, do you have any rights to the resulting model or its outputs?
Request rights to audit data handling, verify deletion, and ensure compliance with permitted use restrictions; one way to verify deletion of raw copies is the hash-manifest approach sketched after this list.
Resist broad indemnification that makes you liable for all model outputs. Your liability should be limited to the accuracy of your data representations.
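For the deletion and audit points above, one concrete (assumed, not standard) mechanism is a hash manifest: the data owner fingerprints every delivered file before sharing, and at the deletion deadline an auditor rescans the recipient's storage for matching hashes. Note this covers only raw copies; embeddings and model weights derived from the data need explicit contractual treatment, as noted above.

```python
# A minimal sketch of deletion verification via a SHA-256 manifest.
# Directory layout and manifest format are illustrative assumptions.
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(delivery_dir: str) -> dict[str, str]:
    """Map each delivered file's relative path to its content hash."""
    root = Path(delivery_dir)
    return {
        str(p.relative_to(root)): file_sha256(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def surviving_copies(manifest: dict[str, str], scan_dir: str) -> list[str]:
    """Paths in scan_dir whose contents match a delivered file's hash."""
    delivered = set(manifest.values())
    return [
        str(p) for p in Path(scan_dir).rglob("*")
        if p.is_file() and file_sha256(p) in delivered
    ]
```

The owner keeps the manifest from delivery day; at the deadline, an empty result from the rescan supports the recipient's deletion attestation.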