Robots.txt enforcement, DMCA for AI, opt-out processes, and AI company data policies
The legality of AI training data scraping is currently being litigated in multiple high-profile lawsuits, with no definitive resolution. AI companies typically argue that scraping publicly available content for training purposes constitutes fair use under 17 U.S.C. Section 107, citing the transformative nature of AI training.
However, content creators and publishers argue this constitutes mass copyright infringement. Key lawsuits include New York Times v. OpenAI and Microsoft, Getty Images v. Stability AI, and Authors Guild v. OpenAI.
Courts have not yet issued final rulings on whether AI training constitutes fair use.
What we know: Scraping that violates website Terms of Service may constitute breach of contract or trespass to chattels. Scraping behind paywalls or login walls likely exceeds authorized access. The hiQ v. LinkedIn case established that scraping publicly available data is not necessarily a CFAA violation, but copyright claims are separate.
Robots.txt is a voluntary protocol with no legal enforcement mechanism, but ignoring it may strengthen legal claims against scrapers. Major AI companies publish specific crawler user-agents, such as OpenAI's GPTBot, that site owners can block.
To block these in your robots.txt file, add entries like "User-agent: GPTBot" followed by "Disallow: /" for each bot you want to block.
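As a minimal sketch of the entries described above, the following Python generates one blocking group per crawler and checks how a compliant parser would read the result. Only GPTBot is named in the text; the other user-agent tokens and the example.com URL are assumptions to verify against each operator's current documentation.

```python
from urllib.robotparser import RobotFileParser

# GPTBot is named in the text; the other tokens are assumptions to verify
# against each operator's current documentation before relying on them.
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]


def build_robots_txt(user_agents, disallow_path="/"):
    """Return robots.txt text with one blocking group per user-agent."""
    groups = [
        f"User-agent: {agent}\nDisallow: {disallow_path}\n"
        for agent in user_agents
    ]
    return "\n".join(groups)


def is_blocked(robots_txt, user_agent, url="https://example.com/article"):
    """Check how a compliant parser would interpret the generated rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(build_robots_txt(AI_USER_AGENTS))
```

The check only shows what a well-behaved crawler should conclude. Because the protocol is voluntary, it says nothing about whether a given bot actually honors the rules.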
Important limitations:
Legal significance: While robots.txt is not legally binding, deliberately ignoring it may support claims of bad faith, trespass to chattels, or TOS violation. The New York Times lawsuit specifically alleges OpenAI ignored its robots.txt restrictions.
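If ignoring robots.txt may matter later, it can help to keep a record of when a named crawler fetched a disallowed path. The sketch below assumes a combined-format access log (an assumption; adjust the pattern to your server) and a hypothetical helper name, `find_violations`. User-agent headers can be spoofed, so a match is a lead to investigate, not proof of who made the request.

```python
import re
from urllib.robotparser import RobotFileParser

# Matches the request path and user-agent in a combined-format access log line.
# The log format is an assumption; adjust the pattern to your server's format.
LOG_PATTERN = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def find_violations(robots_txt, log_lines, watched_agents):
    """Return (agent, path) pairs where a watched bot fetched a disallowed path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # Skip lines that do not look like request entries.
        agent_header = match.group("agent").lower()
        path = match.group("path")
        for agent in watched_agents:
            # Report only requests the robots.txt rules disallow for this bot.
            if agent.lower() in agent_header and not parser.can_fetch(agent, path):
                violations.append((agent, path))
    return violations
```

Running this periodically against archived logs, and keeping the dated robots.txt versions alongside them, gives a simple timeline of what was disallowed and what was requested anyway.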
OpenAI Data Policies:
Anthropic Data Policies:
Both companies: Neither provides a way to remove content from already-trained models. Both offer enterprise customers data isolation and non-training guarantees. Both are defendants in ongoing copyright litigation.
The DMCA was designed for removing infringing content from websites, not for AI training data, creating significant legal uncertainty.
Standard DMCA Process:
Potential DMCA Applications:
Current Options: OpenAI has a content removal request process separate from DMCA. You can file DMCA notices against websites republishing AI-generated content that copies your work. Consider sending cease and desist letters directly to AI companies citing copyright infringement.
Multiple opt-out mechanisms exist, but none are comprehensive or retroactive.
Technical Measures:
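One technical measure that does not rely on a crawler's cooperation is refusing matching requests at the server. The following is a minimal sketch, assuming a Python WSGI application; the function name `block_ai_crawlers` and the token list are hypothetical, and only GPTBot is named in the text. It filters on the self-reported user-agent, so it will not stop a crawler that disguises itself.

```python
# Substrings to refuse; GPTBot comes from the text, the rest are assumptions.
BLOCKED_AGENT_TOKENS = ("gptbot", "claudebot", "ccbot")


def block_ai_crawlers(app, blocked_tokens=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app so requests from listed user-agents get a 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked_tokens):
            body = b"Automated access for AI training is not permitted.\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else passes through to the wrapped application unchanged.
        return app(environ, start_response)
    return middleware
```

A wrapped application is used like any other WSGI callable, for example `application = block_ai_crawlers(application)`.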
Platform-Specific Opt-Outs:
Legal Measures: Update your Terms of Service to explicitly prohibit AI training scraping. Add copyright notices specifying no machine learning use without license. Register copyrights to enable statutory damages claims.
Terms of Service can provide legal protection against AI scraping, but enforcement depends on several factors.
Enforceability Requirements:
Strengthening Your TOS:
Legal Precedents: hiQ v. LinkedIn found that TOS violations alone may not constitute CFAA violations. However, TOS violations can support breach of contract claims. The New York Times lawsuit alleges OpenAI violated its TOS.
Content creators have pursued multiple legal theories against AI companies, with outcomes still pending in most cases.
Copyright Infringement:
Contract Claims:
Computer-Related Claims:
Other Claims: Unfair competition, right of publicity for AI mimicking individuals, and unjust enrichment.
International approaches to AI training data vary significantly, creating a complex global landscape.
European Union:
United Kingdom: Post-Brexit, the UK proposed a broad text and data mining (TDM) exception for AI training. After backlash, the government paused this proposal. Current UK law requires permission for commercial AI training.
Japan: Japan has relatively permissive rules for AI training. Article 30-4 of its Copyright Act generally permits using copyrighted works for data analysis such as machine learning, including for commercial purposes, with fewer restrictions than in the EU or US.
Practical Implications: Content hosted in different jurisdictions may have different protections. AI companies may structure operations to take advantage of permissive jurisdictions.
Several landmark cases are shaping AI training data law as of January 2025.
New York Times v. OpenAI and Microsoft (Filed December 2023):
Getty Images v. Stability AI:
Authors Guild v. OpenAI (Class Action): Represents thousands of authors, alleges systematic infringement of book copyrights, seeks damages and injunctive relief.
Key Legal Questions Being Decided: Does AI training constitute fair use? What damages are appropriate? Can training data be ordered removed or destroyed?
Licensing to AI companies is an emerging market with developing standards and significant considerations.
Licensing Opportunities:
Key Terms to Negotiate:
Potential Risks: Licensing may waive claims for past infringement. Exclusive deals may limit future opportunities. The market is evolving rapidly. Consider consulting an IP attorney before signing AI training licenses.