AI Training on Your Content: Legal Rights & Opt-Out FAQ

Robots.txt enforcement, DMCA for AI, opt-out processes, and AI company data policies

Q: Can AI companies legally scrape my website content to train their models?

The legality of AI training data scraping is currently being litigated in multiple high-profile lawsuits, with no definitive resolution. AI companies typically argue that scraping publicly available content for training purposes constitutes fair use under 17 U.S.C. Section 107, citing the transformative nature of AI training.

However, content creators and publishers argue this constitutes mass copyright infringement. Key lawsuits include:

  • New York Times v. OpenAI
  • Getty Images v. Stability AI
  • Various author class actions against OpenAI and Meta

As of January 2025, courts have not yet issued final rulings on whether AI training constitutes fair use.

What we know: Scraping that violates website Terms of Service may constitute breach of contract or trespass to chattels. Scraping behind paywalls or login walls likely exceeds authorized access. The hiQ v. LinkedIn case established that scraping publicly available data is not necessarily a CFAA violation, but copyright claims are separate.

Legal Reference: 17 U.S.C. Section 107 (Fair Use); hiQ Labs v. LinkedIn, 938 F.3d 985 (9th Cir. 2019)

Q: Does robots.txt actually prevent AI crawlers from scraping my content?

Robots.txt is a voluntary protocol with no legal enforcement mechanism, but ignoring it may strengthen legal claims against scrapers. Major AI companies have created specific user-agents that can be blocked:

  • OpenAI: GPTBot
  • Anthropic: ClaudeBot and anthropic-ai
  • Google: Google-Extended
  • Common Crawl: CCBot

To block these in your robots.txt file, add entries like "User-agent: GPTBot" followed by "Disallow: /" for each bot you want to block.
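
The entries described above can be generated and sanity-checked with a short script. This is a minimal sketch, not an official tool: the helper names `build_robots_txt` and `is_blocked` and the `example.com` URL are hypothetical, and the check only shows how Python's standard-library parser reads the rules, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this FAQ; extend the list as vendors publish more.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended", "CCBot"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # Leave every other crawler (for example, ordinary search bots) unrestricted.
    groups.append("User-agent: *\nDisallow:")
    return "\n\n".join(groups) + "\n"


def is_blocked(robots_txt, agent, url="https://example.com/article"):
    """Report whether a compliant parser would refuse this URL for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


robots_txt = build_robots_txt(AI_CRAWLERS)
print(robots_txt)
for agent in AI_CRAWLERS + ["Googlebot"]:
    print(agent, "blocked" if is_blocked(robots_txt, agent) else "allowed")
```

Save the printed text as robots.txt at the root of your domain. The final wildcard group keeps regular search indexing unaffected.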

Important limitations:

  • Robots.txt blocking only works prospectively; it cannot remove content already scraped
  • Not all AI companies respect robots.txt or disclose their user-agents
  • Historical scrapes from Common Crawl may already include your content

Legal significance: While robots.txt is not legally binding, deliberately ignoring it may support claims of bad faith, trespass to chattels, or TOS violation. The New York Times lawsuit specifically alleges OpenAI ignored their robots.txt restrictions.
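
Because compliance is voluntary, your server access logs are the practical place to check whether a listed crawler kept requesting pages after you blocked it. The sketch below assumes the "combined" log format common to Apache and nginx; the function name, the sample log lines, and the simplified user-agent strings are illustrative, not taken from any vendor's documentation.

```python
import re

# Tokens taken from the list in this FAQ. Google-Extended is omitted because
# it is a robots.txt control token, not (to our knowledge) a string that
# appears in request headers.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "anthropic-ai", "CCBot")

# Assumption: logs use the "combined" format common to Apache and nginx.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def find_ai_crawler_hits(log_lines):
    """Return one record per request whose user-agent names a known AI crawler."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # Skip lines that are not in the assumed format.
        agent = match.group("agent").lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in agent:
                hits.append({
                    "crawler": token,
                    "time": match.group("time"),
                    "path": match.group("path"),
                    "status": int(match.group("status")),
                })
                break
    return hits


# Illustrative log lines; the user-agent strings are simplified examples.
sample_log = [
    '203.0.113.7 - - [10/Jan/2025:08:15:02 +0000] "GET /articles/1 HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.4 - - [10/Jan/2025:08:16:40 +0000] "GET /about HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for hit in find_ai_crawler_hits(sample_log):
    print(hit["time"], hit["crawler"], hit["path"], hit["status"])
```

Dated records of requests to disallowed paths are the kind of evidence that could support the bad-faith arguments described above. User-agent strings can be spoofed, so treat matches as a starting point rather than proof.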

Legal Reference: Robots Exclusion Protocol; New York Times v. OpenAI Complaint (December 2023)

Q: What are OpenAI and Anthropic's data collection and training policies?

OpenAI Data Policies:

  • Trained models on data from Common Crawl, WebText, books, and Wikipedia
  • Introduced GPTBot in August 2023, which can be blocked via robots.txt
  • States that it respects robots.txt for GPTBot, but historical training data was collected before this policy existed
  • Offers a web publisher opt-out process, but it only affects future crawling

Anthropic Data Policies:

  • More transparent about training data sources with published documentation
  • Uses ClaudeBot and anthropic-ai as user-agents that can be blocked
  • Emphasizes Constitutional AI training methods
  • Has been more proactive about content creator concerns

Both companies: Neither provides a way to remove content from already-trained models. Both offer enterprise customers data isolation and non-training guarantees. Both are defendants in ongoing copyright litigation.

Legal Reference: OpenAI Privacy Policy; Anthropic Privacy Policy (2024)

Q: Can I file a DMCA takedown notice against AI companies using my content?

The DMCA was designed for removing infringing content from websites, not for AI training data, creating significant legal uncertainty.

Standard DMCA Process:

  • The DMCA requires identifying specific infringing material and its location
  • AI training data is not hosted or displayed in a traditional sense
  • AI companies may argue they are not "hosting" your content in a DMCA-applicable way

Potential DMCA Applications:

  • If an AI generates outputs that substantially copy your content, DMCA may apply to those outputs
  • Some companies have DMCA processes for removing content from training data going forward
  • The Copyright Office is studying whether new DMCA-like mechanisms are needed for AI

Current Options: OpenAI has a content removal request process separate from DMCA. You can file DMCA notices against websites republishing AI-generated content that copies your work. Consider sending cease and desist letters directly to AI companies citing copyright infringement.

Legal Reference: 17 U.S.C. Section 512 (DMCA Safe Harbor); 17 U.S.C. Section 1202 (CMI)

Q: How can I opt out of having my content used for AI training?

Multiple opt-out mechanisms exist, but none are comprehensive or retroactive.

Technical Measures:

  • Update robots.txt to block known AI crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended)
  • Add "noai" and "noimageai" meta tags (emerging standard, not universally respected)
  • Implement technical access controls like CAPTCHAs, rate limiting, or login requirements
  • Use the TDMRep (Text and Data Mining Reservation Protocol) header
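
The meta tag and header signals above can be sketched as follows. Treat this as an illustration under stated assumptions: the function names and the policy URL are hypothetical, the `tdm-reservation` and `tdm-policy` names follow our reading of the TDMRep specification, and "noai" / "noimageai" are non-standard values that crawlers are free to ignore.

```python
def opt_out_headers(policy_url=None):
    """Build HTTP response headers that signal a reservation of TDM rights."""
    headers = {
        # TDMRep: "1" means text and data mining rights are reserved.
        "tdm-reservation": "1",
        # Non-standard directive; mirrors the "noai" meta tag at header level.
        "X-Robots-Tag": "noai, noimageai",
    }
    if policy_url:
        # Optional pointer to a machine-readable licensing policy.
        headers["tdm-policy"] = policy_url
    return headers


def opt_out_meta_tags():
    """Return HTML tags to place inside <head> on each page."""
    return [
        '<meta name="robots" content="noai, noimageai">',
        '<meta name="tdm-reservation" content="1">',
    ]


# Hypothetical policy URL, used only for illustration.
for name, value in opt_out_headers("https://example.com/tdm-policy.json").items():
    print(f"{name}: {value}")
print("\n".join(opt_out_meta_tags()))
```

Sending the header covers non-HTML files such as images and PDFs, which cannot carry meta tags.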

Platform-Specific Opt-Outs:

  • OpenAI: Submit opt-out request at their publisher portal
  • Anthropic: Contact their support with domain information
  • Google: Disallow the Google-Extended token in robots.txt (it is a robots.txt control, not a Search Console setting)
  • Art platforms: Check settings for AI training toggles

Legal Measures: Update your Terms of Service to explicitly prohibit AI training scraping. Add copyright notices specifying no machine learning use without license. Register copyrights to enable statutory damages claims.

Legal Reference: Robots Exclusion Protocol; EU Copyright Directive Article 4

Q: Does my website's Terms of Service protect against AI scraping?

Terms of Service can provide legal protection against AI scraping, but enforcement depends on several factors.

Enforceability Requirements:

  • TOS must be reasonably conspicuous (browse-wrap may not be enforceable)
  • Clickwrap agreements (requiring affirmative consent) are stronger
  • TOS should specifically prohibit automated scraping, data mining, and AI training
  • Include provisions about unauthorized access and use

Strengthening Your TOS:

  • Explicitly prohibit "scraping, crawling, or automated data collection for machine learning or AI training purposes"
  • Reserve all rights not expressly granted
  • Include liquidated damages provisions for violations
  • Specify jurisdiction and choice of law favorable to your position

Legal Precedents: hiQ v. LinkedIn found TOS violations alone may not constitute CFAA violations. However, TOS violations can support breach of contract claims. The New York Times lawsuit alleges OpenAI violated their TOS.

Legal Reference: hiQ Labs v. LinkedIn, 938 F.3d 985 (9th Cir. 2019); Restatement (Second) of Contracts

Q: What legal claims can content creators bring against AI companies?

Content creators have pursued multiple legal theories against AI companies, with outcomes still pending in most cases.

Copyright Infringement:

  • Direct infringement for unauthorized copying during training
  • Vicarious or contributory infringement when AI outputs copy protected works
  • Removal of copyright management information under 17 U.S.C. Section 1202

Contract Claims:

  • Breach of Terms of Service for prohibited scraping
  • Breach of license terms for content scraped from licensed databases
  • Tortious interference with contractual relations

Computer-Related Claims:

  • Computer Fraud and Abuse Act (CFAA) violations for exceeding authorized access
  • State computer crime statutes
  • Trespass to chattels for server burden from scraping

Other Claims: Unfair competition, right of publicity for AI mimicking individuals, and unjust enrichment.

Legal Reference: 17 U.S.C. Section 501; 18 U.S.C. Section 1030 (CFAA)

Q: How do EU and international laws differ on AI training data?

International approaches to AI training data vary significantly, creating a complex global landscape.

European Union:

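  • The EU Copyright Directive allows text and data mining (TDM) by research organizations for scientific research (Article 3) and for other purposes, including commercial ones, unless rights holders have reserved their rights (Article 4)
  • Commercial AI training must therefore either respect rights holder opt-outs or obtain permission where rights are reserved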
  • The EU AI Act imposes transparency requirements for training data
  • GDPR applies to personal data in training sets
  • Rights holders can reserve TDM rights, and AI companies must respect these reservations

United Kingdom: Post-Brexit, the UK proposed broad TDM exceptions for AI training. After backlash, the government paused this proposal. Current UK law requires permission for commercial AI training.

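Japan: Japan has relatively permissive rules for AI training. Its copyright law generally permits using works for machine learning, for both non-commercial and commercial purposes, with fewer restrictions than the EU or US, provided the use does not unreasonably prejudice the rights holder.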

Practical Implications: Content hosted in different jurisdictions may have different protections. AI companies may structure operations to take advantage of permissive jurisdictions.

Legal Reference: EU Copyright Directive 2019/790 Article 4; EU AI Act (2024); GDPR Article 6

Q: What is the current status of major AI training data lawsuits?

Several landmark cases are shaping AI training data law as of January 2025.

New York Times v. OpenAI and Microsoft (Filed December 2023):

  • Alleges copyright infringement and TOS violations
  • Claims ChatGPT can reproduce NYT articles nearly verbatim
  • Seeks billions in damages and destruction of training data
  • Status: Discovery phase, with significant motions pending

Getty Images v. Stability AI:

  • Alleges Stable Diffusion was trained on millions of Getty images
  • Claims include copyright infringement and trademark violations
  • Shows AI reproducing Getty watermarks as evidence
  • Filed in both US and UK courts

Authors Guild v. OpenAI (Class Action):

  • Represents thousands of authors
  • Alleges systematic infringement of book copyrights
  • Seeks damages and injunctive relief

Key Legal Questions Being Decided: Does AI training constitute fair use? What damages are appropriate? Can training data be ordered removed or destroyed?

Legal Reference: New York Times v. OpenAI (S.D.N.Y. 2023); Getty Images v. Stability AI (D. Del. 2023)

Q: Should I license my content to AI companies, and what terms should I negotiate?

Licensing to AI companies is an emerging market with developing standards and significant considerations.

Licensing Opportunities:

  • Major AI companies are entering licensing deals (OpenAI-AP News, OpenAI-Axel Springer)
  • Stock photo sites are licensing image databases to AI companies
  • Content licensing marketplaces specifically for AI training are emerging

Key Terms to Negotiate:

  • Compensation structure (upfront payment, per-use royalties, or hybrid)
  • Usage scope (training only, fine-tuning, RAG, or specific models)
  • Exclusivity requirements and competitive restrictions
  • Attribution and credit provisions
  • Audit rights to verify usage compliance
  • Term length and renewal conditions
  • Indemnification for downstream infringement claims

Potential Risks: Licensing may waive claims for past infringement. Exclusive deals may limit future opportunities. The market is evolving rapidly. Consider consulting an IP attorney before signing AI training licenses.

Legal Reference: 17 U.S.C. Section 101 (Copyright Licensing); Uniform Commercial Code Article 2
