
AI companies scraped my content - what are my legal options?

Started by TermsLaw_User · Oct 28, 2024 · 12 replies
For informational purposes only. Legal options vary based on specific circumstances.

TermsLaw_User OP

I run a niche legal resource website with years of original content - guides, templates, explanations of legal concepts. All copyrighted, all original work.

Recently I've been testing various AI chatbots and I'm seeing MY content being regurgitated back to me. Not just similar concepts - literally my specific phrasings, my unique examples, even mistakes I made that I later corrected on my site.

I had robots.txt set up to block common crawlers but apparently these AI companies ignored it or scraped before I added the blocks.

What are my options here? Can I send a DMCA? Cease and desist? Sue? This feels like theft but I don't know what legal framework applies.

HarveyLitigation Attorney

You're not alone - this is one of the most active areas of litigation right now. Let me walk through the legal landscape:

Copyright Infringement: This is the primary theory. If AI companies copied your copyrighted content to train their models without permission, that's potentially infringement. The key questions are:

  • Did they actually copy your content? (Sounds like yes based on what you're seeing)
  • Is there a fair use defense? (This is what the AI companies argue)

Multiple lawsuits are testing this right now - Authors Guild v. OpenAI, NYT v. OpenAI/Microsoft, Getty v. Stability AI, and many others.

WebMaster_Chris

The robots.txt thing is frustrating. I've blocked GPTBot, CCBot, anthropic-ai, and Google-Extended on my sites. But the thing is:

  1. robots.txt is voluntary - there's no legal requirement to follow it
  2. Many AI training datasets were scraped years ago before anyone thought to block AI crawlers
  3. Some crawlers use generic user-agents or don't identify themselves

By the time you add the blocks, the horse has left the barn. Your content is already in their training data.
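
For anyone who wants to set up the blocks anyway, here is a minimal sketch of generating the rules and checking how a rule-following crawler would read them. It assumes the four user-agent tokens named above are the ones you care about (the list is not exhaustive), and `example.com` is a placeholder. Python's standard `urllib.robotparser` only shows what a compliant crawler would do; it says nothing about crawlers that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this thread; illustrative, not exhaustive.
AI_CRAWLERS = ["GPTBot", "CCBot", "anthropic-ai", "Google-Extended"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    # Everyone else stays allowed (an empty Disallow means "no restriction").
    groups.append("User-agent: *\nDisallow:")
    return "\n\n".join(groups) + "\n"


def is_blocked(robots_txt, agent, url="https://example.com/guides/"):
    """Report whether a crawler that honors robots.txt would skip this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


robots_txt = build_robots_txt(AI_CRAWLERS)
print(robots_txt)
print("GPTBot blocked:", is_blocked(robots_txt, "GPTBot"))
print("SomeOtherBot blocked:", is_blocked(robots_txt, "SomeOtherBot"))
```

Save the printed rules as `robots.txt` at the root of your site. The check at the end is only a sanity test that the rules parse the way you intended.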

TermsLaw_User OP

@HarveyLitigation - what about the fair use argument? I keep hearing AI companies say training is "transformative" and therefore fair use. Is that actually holding up in court?

HarveyLitigation Attorney

The fair use question is genuinely unsettled. Here's the framework:

Four Fair Use Factors:

  1. Purpose/character of use: AI companies argue training is "transformative" because they're not republishing your work, they're learning patterns. Critics say the output often directly competes with the original.
  2. Nature of copyrighted work: Creative works get more protection than factual ones. Legal guides might fall somewhere in between.
  3. Amount used: They copied everything. That weighs against fair use.
  4. Market effect: This is often the most important factor. If AI outputs substitute for visiting your website, that hurts your market.

No court has definitively ruled on AI training fair use yet. We're all waiting for these cases to shake out.

SmallBizOwner_Lisa

Same situation here. I run a cooking blog with 500+ original recipes. I've found my exact recipes - including my personal stories and tips - showing up in AI responses.

What makes me really angry: people now ask ChatGPT instead of visiting my site. My traffic has dropped 30% since these AI tools launched. That's direct market harm.

But what am I supposed to do? I can't afford to sue OpenAI. The individual harm is small even though the aggregate harm across all content creators is massive.

PublisherLaw_Mike

@SmallBizOwner_Lisa - this is exactly why the class actions and industry group lawsuits matter. Individual creators can't take on these companies, but:

  • Authors Guild filed on behalf of authors
  • Getty Images sued Stability AI over the use of its photo library
  • News organizations are banding together

There may be class actions you can join rather than filing individually.

There's background on these lawsuits at /2024/ai-training-data-lawsuits-overview/

TermsLaw_User OP

What about more immediate options? Can I send a DMCA takedown? Or a cease and desist letter?

HarveyLitigation Attorney

DMCA Takedowns: Tricky. The DMCA notice-and-takedown process is designed for hosting services - you send a notice, they remove the infringing content. With AI models, there's no specific "copy" to remove. Your content is baked into model weights. OpenAI can't just delete your content from GPT-4 without retraining the entire model.

That said, you could try sending a DMCA notice if specific outputs reproduce your content verbatim. Some companies have processes for this. OpenAI has a content removal request form.
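
If you want to check whether an output really is verbatim before sending anything, a rough sketch like the one below can find the longest run of words that two texts share. It uses Python's standard `difflib`; the two sample strings are made up for illustration, and how long a shared passage needs to be before it matters legally is not something code can answer.

```python
from difflib import SequenceMatcher


def longest_shared_passage(original, ai_output):
    """Return the longest run of words that appears verbatim in both texts."""
    a = original.split()
    b = ai_output.split()
    # autojunk=False stops difflib from ignoring very common words in long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a : match.a + match.size])


# Made-up sample texts; paste your own passage and the chatbot's response.
original = "Browsewrap terms bind a visitor only if the visitor had notice of them."
ai_output = "Courts say terms bind a visitor only if the visitor had notice, more or less."

shared = longest_shared_passage(original, ai_output)
print(shared)
print(len(shared.split()), "words in a row")
```

The comparison is word by word, so differences in punctuation or capitalization break a match. Normalizing both texts first would catch near-verbatim copies as well.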

Cease and Desist: You can send one. It puts them on notice and documents your objection. But realistically, these companies have received thousands of such letters and aren't going to retrain their models based on a C&D from a single site.

The letter might be useful for: (1) documenting your objection if you later join litigation, (2) potentially negotiating a licensing deal if your content is valuable enough.

DigitalCreator_Jen

Has anyone looked into terms of service breach? If you had terms on your website prohibiting scraping for AI training, that's potentially a breach of contract claim.

I added explicit language to my TOS last year: "Automated access, scraping, or data collection for training machine learning or AI models is prohibited without explicit written permission."

Doesn't help with scraping that happened before I added it, but might provide a hook going forward.

HarveyLitigation Attorney

@DigitalCreator_Jen - plaintiffs are trying terms of service claims. The challenge is proving the AI company agreed to your terms. With browsewrap agreements (terms posted behind a link that visitors never have to click or accept), courts are often skeptical that the scraper is bound.

hiQ v. LinkedIn touched on this - the Ninth Circuit found that scraping publicly accessible data likely didn't violate the CFAA, but left open contract questions. TOS breach is a cleaner theory than a CFAA "hacking" theory.

Still, it's worth adding that language. Makes your position clearer and could support various claims.

TermsLaw_User OP

Okay, so practical takeaways for someone in my position:

Immediate actions:

  • Update robots.txt to block known AI crawlers (GPTBot, CCBot, etc.) - at least discourages future scraping by crawlers that honor it
  • Update TOS to explicitly prohibit AI training scraping
  • Document examples of my content appearing in AI outputs (screenshots, prompts used)
  • Consider sending a C&D to put them on notice

Longer term:

  • Watch the ongoing lawsuits - the legal landscape will become clearer
  • Consider joining a class action if one is relevant to my content type
  • Look into licensing programs some AI companies are starting

HarveyLitigation Attorney

Good summary @TermsLaw_User. A few additions:

Documentation is key: If you ever do pursue legal action, you'll want evidence. Save examples of AI reproducing your content with timestamps. Screenshot the AI outputs alongside your original content.
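
One way to keep that evidence organized is a small script that records each example with a UTC timestamp and a hash of the saved text. This is only a sketch: the field names, the tool name, the URL, and the log file name are all made up for illustration, and whether a log like this carries any evidentiary weight is a question for your own attorney. Keep the screenshots too.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text):
    """Hash text so a saved copy can later be checked against this record."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_evidence_record(tool, prompt, ai_output, original_url, original_text):
    """Bundle one observation into a JSON-serializable dict."""
    return {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "prompt": prompt,
        "ai_output": ai_output,
        "original_url": original_url,
        "ai_output_sha256": sha256_hex(ai_output),
        "original_sha256": sha256_hex(original_text),
    }


def append_record(path, record):
    """Append one JSON object per line so the log is easy to extend."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")


record = make_evidence_record(
    tool="example-chatbot",  # hypothetical tool name
    prompt="Explain browsewrap agreements",
    ai_output="(paste the chatbot's response here)",
    original_url="https://example.com/guides/browsewrap",  # placeholder URL
    original_text="(paste the matching passage from your site here)",
)
print(json.dumps(record, indent=2))
# To keep a running log: append_record("ai_evidence_log.jsonl", record)
```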

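Register your copyrights: If you haven't already, register your works with the U.S. Copyright Office. For U.S. works, you generally need a registration before you can file an infringement suit, and timely registration (broadly, before the infringement starts or within three months of publication) lets you claim statutory damages and attorney's fees.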

On licensing: Some companies are now paying for content. OpenAI has deals with AP, Axel Springer, and others. It's possible some companies will proactively reach out to license quality content, or you could approach them.

More on protecting your content: /2025/protecting-content-from-ai-scraping/

SmallBizOwner_Lisa

The licensing angle is interesting. If they're going to use our content anyway, maybe getting paid is better than fighting a losing battle.

Though it still feels wrong that "pay us after we already took everything" is the best option. We should've had a say before they scraped the entire internet.

Thanks for the thorough discussion everyone. At least I understand the landscape better now.
