📊 Training Data Protection

Training Data NDA Templates

Protect datasets shared for AI model training with NDAs that address data provenance, derivative works ownership, bias representations, and regulatory compliance.

Get Started

Create or analyze training data confidentiality agreements

📊

Generate Training Data NDA

Create a customized NDA for sharing datasets with AI companies, research partners, or contractors training ML models.

Create NDA
🔍

Review Received NDA

Received an NDA from an AI company that wants your data? Analyze it for provider-unfriendly terms and get negotiation guidance.

Analyze NDA

Training Data Categories

Customized protections for different types of training data

📝 Text & Language Data

Corpora, documents, and text datasets for NLP and LLM training.

  • Content licensing verification
  • PII scrubbing requirements (see the sketch below)
  • Copyright chain tracking
  • Output attribution rules
Generate Text Data NDA →
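
To make the PII scrubbing item concrete, here is a minimal Python sketch of the kind of redaction pass such a clause might mandate before text data changes hands. The patterns and the scrub_text helper are illustrative assumptions, not a standard; production pipelines typically layer NER-based detectors on top of regex rules like these.

```python
import re

# Hypothetical illustration of a PII-scrubbing pass an NDA clause might
# require before text data is shared. Real pipelines usually combine
# regex rules like these with NER-based detectors.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_text(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub_text("Contact Jane at jane.doe@example.com or 555-867-5309."))
# -> Contact Jane at [EMAIL] or [PHONE].
```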

🎨 Image & Visual Data

Photos, graphics, and visual datasets for computer vision models.

  • Consent verification (see the sketch below)
  • Face/biometric data rules
  • Copyright and licensing
  • Synthetic data generation
Generate Image Data NDA →
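
As a rough illustration of the consent-verification item above, here is a hypothetical gate that admits only manifest records with documented consent. Every field name (consent_status, contains_faces, biometric_consent) is an assumption for illustration, not a real schema.

```python
# Hypothetical sketch of the consent gate an image-data NDA might require:
# only records with documented, biometric-grade consent survive.
def filter_consented(records: list[dict]) -> list[dict]:
    approved = []
    for rec in records:
        if rec.get("consent_status") != "verified":
            continue  # no documented subject consent: exclude
        if rec.get("contains_faces") and not rec.get("biometric_consent"):
            continue  # faces need explicit biometric consent (e.g. BIPA)
        approved.append(rec)
    return approved

manifest = [
    {"image_id": "img-001", "consent_status": "verified",
     "contains_faces": True, "biometric_consent": True},
    {"image_id": "img-002", "consent_status": "verified",
     "contains_faces": True, "biometric_consent": False},
    {"image_id": "img-003", "consent_status": "unknown"},
]
print([r["image_id"] for r in filter_consented(manifest)])  # ['img-001']
```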

🎧 Audio & Voice Data

Speech, music, and audio datasets for ASR and audio models.

  • Voice consent requirements
  • Music licensing chains
  • Synthetic voice generation
  • Speaker privacy protection
Generate Audio Data NDA →

Essential Training Data NDA Clauses

Critical provisions for protecting shared training data

📋 Data Provenance Disclosure

Requires disclosure of original data sources, licensing chains, consent status, and known limitations or biases in the dataset (a sample disclosure record is sketched below).

Critical for Liability
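
A disclosure clause like this is easiest to satisfy when the provenance information travels with the dataset in machine-readable form. Below is a minimal, hypothetical Python sketch of such a record; the field names are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

# Hypothetical provenance record mirroring what a disclosure clause might
# require the data provider to deliver alongside the dataset.
@dataclass
class ProvenanceRecord:
    source: str                 # where the raw data originated
    license: str                # license or contract covering the source
    consent_status: str         # e.g. "opt-in", "implied", "unknown"
    collection_period: str      # when the data was gathered
    known_biases: list[str] = field(default_factory=list)
    known_limitations: list[str] = field(default_factory=list)

record = ProvenanceRecord(
    source="news-archive.example.com",
    license="licensed via 2021 syndication agreement",
    consent_status="opt-in",
    collection_period="2018-2021",
    known_biases=["English-only", "overrepresents business reporting"],
)
```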

⛔ Permitted Use Restrictions

Specifies exactly what the data can be used for: initial training only, fine-tuning, evaluation, or ongoing model updates (see the policy sketch below).

Define Boundaries
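
One way recipients operationalize a permitted-use clause is to encode it as a machine-checkable policy, so internal tooling rejects out-of-scope jobs before they run. A minimal sketch, assuming hypothetical use-case labels:

```python
# Hypothetical encoding of a permitted-use clause as a checkable policy.
# The labels below are illustrative; they would mirror the signed NDA.
PERMITTED_USES = {"initial_training", "evaluation"}
PROHIBITED_USES = {"fine_tuning", "ongoing_updates", "redistribution"}

def check_use(requested: str) -> None:
    if requested in PROHIBITED_USES or requested not in PERMITTED_USES:
        raise PermissionError(
            f"'{requested}' is outside the NDA's permitted uses")

check_use("evaluation")       # OK
check_use("ongoing_updates")  # raises PermissionError
```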

🔁 Derivative Works Ownership

Clarifies whether trained models are derivative works of the data and who owns models, embeddings, and representations.

Negotiate Carefully

📊 Bias & Quality Representations

Addresses representations about data quality, completeness, and known biases. Allocates liability for model outputs.

Risk Allocation

🗑 Data Retention & Deletion

Specifies how long data can be retained, deletion requirements, and whether embeddings or model artifacts must also be destroyed (a deletion-certification sketch follows).

GDPR/CCPA Critical
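
At contract end, a retention clause like this often pairs with a deletion certification. The sketch below is illustrative only: it assumes a hypothetical JSON manifest listing dataset file paths, verifies they are gone, and writes an attestation.

```python
import json
import pathlib
from datetime import datetime, timezone

# Hypothetical deletion-certification step a retention clause might
# require: confirm every manifest entry is gone, then record an
# auditable attestation. Manifest format and field names are assumed.
def certify_deletion(manifest_path: str, cert_path: str) -> None:
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    remaining = [e["path"] for e in manifest["files"]
                 if pathlib.Path(e["path"]).exists()]
    if remaining:
        raise RuntimeError(f"{len(remaining)} dataset files still present")
    cert = {
        "dataset_id": manifest["dataset_id"],
        "deleted_at": datetime.now(timezone.utc).isoformat(),
        "files_verified_absent": len(manifest["files"]),
        # NOTE: a real clause may also require destroying embeddings,
        # checkpoints, and caches derived from the data, not just files.
    }
    pathlib.Path(cert_path).write_text(json.dumps(cert, indent=2))
```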

🔒 Re-Distribution Prohibition

Prevents sharing the data with third parties, sublicensing it, or making it available through model outputs or APIs.

Standard Provision

Real-World Scenarios

How training data NDAs apply in common AI situations

🏥

Healthcare Dataset for Medical AI

A hospital shares de-identified patient records with an AI company to train diagnostic models. The NDA must address HIPAA compliance, re-identification risks, and model output liability.

Tip: Include specific HIPAA business associate agreement (BAA) terms and prohibit any re-identification attempts.
📊

Proprietary Corpus for LLM Fine-Tuning

A publisher licenses its article corpus for fine-tuning an LLM. The NDA must prevent the model from reproducing copyrighted content verbatim.

Tip: Include output filtering requirements (one simple verbatim-overlap filter is sketched below) and prohibit memorization extraction attacks.
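
One simple form such an output filter can take is an n-gram overlap check against the licensed corpus. The sketch below is an illustration under stated assumptions: the 13-token window is a heuristic borrowed from the deduplication and memorization literature, not a contractual standard.

```python
# Hypothetical output filter: flag generations that reproduce long
# verbatim spans of the licensed corpus. Window size is an assumption.
def build_ngram_index(corpus_texts: list[str], n: int = 13) -> set[tuple]:
    """Index every n-token window of the licensed corpus."""
    index = set()
    for text in corpus_texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index

def reproduces_corpus(output: str, index: set[tuple], n: int = 13) -> bool:
    """True if any n-token window of the output appears in the corpus."""
    tokens = output.split()
    return any(tuple(tokens[i:i + n]) in index
               for i in range(len(tokens) - n + 1))
```
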
🎨

Image Dataset with Faces

A social platform provides images for facial recognition research. The NDA must address consent verification, biometric data laws, and prohibited surveillance uses.

Tip: Require BIPA/GDPR compliance and explicitly prohibit law enforcement or surveillance applications.
🎧

Voice Data for TTS Model

Voice actors provide recordings to train a text-to-speech system. The NDA must address voice cloning rights, synthetic voice usage, and ongoing royalties.

Tip: Clearly define whether synthetic voices can be created and what attribution/compensation is required.

Negotiation Tips for Data Providers

What to push back on when sharing training data

Full Negotiation Playbook