Private members-only forum

AI companies scraped my content - what are my legal options?

Started by not_a_bot_i_swear_14 · Aug 8, 2025 · 9 replies
For informational purposes only. Legal options vary based on specific circumstances.
not_a_bot_i_swear_14 OP

I run a niche legal resource website with years of original content - guides, templates, explanations of legal concepts. All copyrighted, all original work.

Recently I've been testing various AI chatbots and I'm seeing MY content being regurgitated back to me. Not just similar concepts - literally my specific phrasings, my unique examples, even mistakes I made that I later corrected on my site.

I had robots.txt set up to block common crawlers but apparently these AI companies ignored it or scraped before I added the blocks.

What are my options here? Can I send a DMCA? Cease and desist? Sue? This feels like theft but I don't know what legal framework applies.

legal_eagle_wannabe_13

Same situation here. I run a cooking blog with 500+ original recipes. I've found my exact recipes - including my personal stories and tips - showing up in AI responses.

What makes me really angry: people now ask ChatGPT instead of visiting my site. My traffic has dropped 30% since these AI tools launched. That's direct market harm.

But what am I supposed to do? I can't afford to sue OpenAI. The individual harm is small even though the aggregate harm across all content creators is massive.

nursing_life_11

@legal_eagle_wannabe_13 - this is exactly why the class actions and industry group lawsuits matter. Individual creators can't take on these companies, but:

  • Authors Guild filed on behalf of authors
  • Getty filed on behalf of photographers
  • News organizations are banding together

There may be class actions you can join rather than filing individually.

There's background on these lawsuits at /2024/ai-training-data-lawsuits-overview/

jenny_2024_6 Attorney

@tort_reform_this_4 - terms of service claims are being tried. The challenge is proving the AI company agreed to your terms. With browsewrap agreements (terms you have to click a link to find), courts are often skeptical that the scraper is bound.

hiQ v. LinkedIn touched on this - the court found scraping publicly accessible data didn't violate the CFAA, but left open contract questions. TOS breach is a cleaner theory than some criminal hacking theories.

Still, it's worth adding that language. Makes your position clearer and could support various claims.

nursing_life_11

Wanted to bump this thread with some major updates. The legal landscape has shifted significantly in the past year:

Key developments:

  • The Thomson Reuters v. Ross Intelligence ruling came down - court found AI training on copyrighted legal content was NOT fair use in that context. First major ruling against the "training is transformative" argument.
  • Several AI companies have started offering opt-out programs and content removal request forms, though effectiveness is debatable.
  • The EU AI Act is now in effect with stricter transparency requirements about training data sources.

For anyone following this issue, the tide seems to be turning toward content creators. More courts are rejecting the blanket fair use defense lol.

not_a_bot_i_swear_14 OP

Thanks for the update @nursing_life_11! I actually have some news of my own.

After documenting everything as suggested in this thread, I was contacted by a law firm putting together a class action for niche content publishers. They're specifically targeting one of the major AI companies for scraping specialized professional content.

The advice to screenshot everything and register copyrights was spot-on.

jenny_2024_6 Attorney

Happy New Year everyone. The Thomson Reuters case that @nursing_life_11 mentioned is indeed significant, though I'd caution that it's not a complete victory.

What it established:

@not_a_bot_i_swear_14 - glad to hear you may have an avenue forward. Documentation really is everything in these cases.

case_dismissed_69_4

Several class actions are currently pending against major AI companies for training data scraping. The largest are against OpenAI and Meta. Whether you can join depends on which company scraped your content and the class definition.

Their "transformative use" argument is their best defense, but courts haven't definitively ruled on whether ingesting copyrighted works for AI training constitutes fair use. The key factor will be market substitution - if the AI can generate content that competes with yours, the fair use argument weakens significantly.

Independent of class actions, you can: (1) register copyrights for your most valuable content (required before filing individual suits), (2) implement technical measures (robots.txt, AI crawler blocking - this establishes your objection), (3) consider joining the Authors Guild or similar organizations coordinating collective action, (4) send preservation demands to AI companies to prevent them from destroying evidence of scraping.

lisa_nguyen_10

Major update for anyone following the AI scraping litigation landscape: the NYT v. OpenAI case has progressed significantly. The court denied OpenAI's motion to dismiss the copyright claims, finding that the NYT had plausibly alleged that ChatGPT outputs can serve as substitutes for the original articles.

This is a big deal because it means the case is moving to discovery, where OpenAI will likely have to disclose details about their training data and processes. Several other pending cases are watching this closely as a bellwether.

In the meantime, I have had success with a practical approach: I added a clear notice to my robots.txt blocking all AI crawlers, updated my terms of service to explicitly prohibit AI training use, and registered my most valuable content with the Copyright Office. When I sent a formal cease-and-desist to one mid-size AI company citing these protections, they actually confirmed they had added my domain to their exclusion list.
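
For anyone who wants to check their own robots.txt, here is a minimal Python sketch of the blocking step. It builds a robots.txt that disallows a few AI crawlers and then uses the standard library's parser to confirm the rules read the way you intend. The crawler names in the list, the example.com URL, and the helper function names are my own illustrative assumptions, not something from this thread; check each company's current documentation for the user-agent tokens it actually uses. Keep in mind that robots.txt only works if the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Assumed, illustrative list of AI crawler user-agent tokens.
# Vendors add and rename these, so verify against their documentation.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    # All other user agents remain allowed.
    blocks.append("User-agent: *\nAllow: /\n")
    return "\n".join(blocks)


def is_blocked(robots_txt, agent, url="https://example.com/guides/"):
    """Return True if the given robots.txt text disallows `agent` for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


robots_txt = build_robots_txt(AI_CRAWLERS)
print(robots_txt)
for agent in AI_CRAWLERS + ["Mozilla/5.0"]:
    print(agent, "blocked" if is_blocked(robots_txt, agent) else "allowed")
```

Running the check against a copy of your live robots.txt (instead of the generated text) is a quick way to catch typos before you rely on the file in a cease-and-desist letter.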

midnight_thoughts_2

Significant development this week that everyone in this thread should know about: a key ruling in the consolidated AI training data cases held that scraping copyrighted content for AI training is not automatically shielded by fair use, and critically, the court rejected the argument that robots.txt non-compliance is irrelevant to the copyright analysis.

The ruling specifically stated that a content owner's technical measures to prevent scraping (robots.txt, terms of service prohibitions, crawler blocking) are relevant to the fair use analysis because they demonstrate the copyright holder's intent to restrict the use. This is a significant shift from earlier rulings that treated robots.txt as legally meaningless. The court analogized it to "No Trespassing" signs -- they do not create the property right, but they are evidence that the owner did not consent to the use.

For practical purposes, this means the steps many of us took early on -- blocking AI crawlers, updating terms of service, sending cease-and-desist letters -- are now directly relevant to strengthening legal claims. If you have not already documented when you implemented these technical measures, do so now. Timestamps matter.
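
One way to keep that kind of record is a small script that fingerprints each file and notes when you captured it. This is only a sketch under my own assumptions: the function name, the field names, and the sample robots.txt content are hypothetical, and a log you generate yourself is not independently verified, so ask your attorney whether you also need third-party evidence such as server logs or archived copies.

```python
import hashlib
import json
from datetime import datetime, timezone


def evidence_record(label, content):
    """Build a record with a SHA-256 fingerprint and a UTC capture time.

    `label` and the field names below are illustrative, not a legal standard.
    """
    return {
        "label": label,
        "sha256": hashlib.sha256(content).hexdigest(),
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical sample content; in practice, read the bytes of your real file.
record = evidence_record("robots.txt", b"User-agent: GPTBot\nDisallow: /\n")
print(json.dumps(record, indent=2))
```

Appending each record to a dated file every time you change robots.txt or your terms of service gives you a simple timeline to hand over later.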

The ruling also opens the door to state-law trespass to chattels claims in addition to federal copyright claims, because unauthorized scraping that circumvents technical barriers may constitute interference with computer systems. Several state attorneys general, including California and New York, have signaled interest in pursuing enforcement actions against AI companies that systematically ignored robots.txt exclusions.