⚙️ API Terms Abuse: When AI Bots Ignore Your robots.txt and Rate Limits

Published: December 5, 2025 • AI

LLM crawlers do not care that you spent months on your API docs or on the “please be nice” language in your ToS.

If you run a SaaS platform with public documentation, dev portals, and APIs, you’re now in the business of policing:

  • unannounced AI crawlers,
  • “headless browser” scrapers masquerading as normal users, and
  • data brokers quietly reselling your traffic to model builders.

This is a dev-facing guide to the legal + technical playbook:

  • what robots.txt and rate limits actually do (and don’t do);
  • how courts are treating scraping and API terms;
  • how to structure C&Ds and enforcement when an LLM crawler blows through your rules.

🤖 What robots.txt and rate limits really buy you

robots.txt is part of the Robots Exclusion Protocol: a standard way to tell crawlers what they should and shouldn’t fetch. It’s widely used, recently formalized in RFC 9309, and still entirely voluntary. (Wikipedia)

Recent measurement work confirms what devs already suspect: many AI-focused crawlers don’t even check robots.txt, and compliance falls as directives get stricter. (arXiv)

Rate limits and IP blocks are similar: technically powerful, legally ambiguous.
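If you want to audit how a crawler’s observed requests line up with your published directives, Python’s standard-library robotparser is enough for a quick check. A minimal sketch, assuming an illustrative robots.txt and made-up bot names and URLs (swap in your real file and traffic):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; substitute your real file or fetch it live.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /v1/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Compare a crawler's declared identity and requested path against the rules.
checks = [
    ("GPTBot", "https://example.com/docs/quickstart"),
    ("SomeOtherBot", "https://example.com/v1/resource"),
]
for agent, url in checks:
    verdict = "allowed" if rp.can_fetch(agent, url) else "disallowed"
    print(f"{agent} -> {url}: {verdict}")
```

This only tells you what a polite bot should do; nothing in the protocol stops a crawler from ignoring the answer.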

🧱 Table – Technical controls vs legal significance

| Control | What it does technically | How it plays in disputes |
| --- | --- | --- |
| robots.txt | Tells well-behaved bots where they may crawl and at what rate. | Not a contract by itself. Shows you attempted reasonable technical controls. Ignoring it can help your story on DMCA anti-circumvention, trespass, and “willful” behavior, but it’s not a silver bullet. (Wikipedia) |
| Rate limits | Throttle requests per key/IP to a defined quota. | Violations are strong evidence of technical abuse. If tied to API Terms, exceeding limits can be part of a contract breach or anti-circumvention narrative. (Stytch) |
| API keys / auth | Gated access; lets you bind users to explicit terms. | Courts are more willing to enforce anti-scraping clauses when the defendant actually used authenticated access or developer terms. (Privacy World) |
| IP blocking / CAPTCHAs | Cut off abusive IP ranges; distinguish bots from humans. | Repeated attempts to evade these controls after notice look like DMCA circumvention and support trespass / unfair competition theories. This is precisely what Reddit alleges Perplexity and data brokers did. (National Law Review) |

So: robots.txt + rate limits are necessary, but only part of an enforcement stack that has to be contractual, technical, and litigation-aware.
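On the rate-limit side, what matters legally is that the limit is documented and enforced consistently per key, so that exceeding it is measurable. Here is a minimal per-key token-bucket sketch; the numbers (roughly 60 requests/minute with a burst of 10) are illustrative, not anyone’s real policy:

```python
import time

class TokenBucket:
    """Per-key token bucket: refills `rate` tokens/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity              # start full
        self.updated = time.monotonic()

    def allow(self) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for one request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        return False, (1.0 - self.tokens) / self.rate

# Illustrative policy: ~60 requests/minute with a burst of 10 per API key.
buckets: dict[str, TokenBucket] = {}

def check_rate_limit(api_key: str) -> tuple[bool, float]:
    bucket = buckets.setdefault(api_key, TokenBucket(rate=1.0, capacity=10.0))
    allowed, retry_after = bucket.allow()
    # On a denial, your framework would return HTTP 429 with a Retry-After header.
    return allowed, retry_after
```

The 429 plus Retry-After matters later: a bot that keeps hammering through explicit throttling signals is much easier to characterize as abusive than one that merely crawled fast.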


⚖️ How courts are treating scraping of public APIs and docs

CFAA and “public” data – hiQ v LinkedIn

The hiQ v LinkedIn saga is the original public-scraping case. The Ninth Circuit held that scraping public LinkedIn profiles was unlikely to be “without authorization” under the Computer Fraud and Abuse Act (CFAA), even after LinkedIn sent a cease-and-desist and deployed technical measures. (Wikipedia)

Takeaway for devs:

  • CFAA is a weak tool for purely public docs and endpoints, especially in the Ninth Circuit. Don’t count on “CFAA threat” as your main club.

ToS alone won’t save you – X Corp v Bright Data

X sued Bright Data over scraping public tweets to sell as datasets, alleging breach of its ToS and related state-law claims. A federal judge in N.D. Cal. dismissed the case:

  • X’s ToS claims were preempted by the Copyright Act, because X doesn’t own user content and can’t use state contract law to create a private copyright regime. (Technology & Marketing Law Blog)
  • X also hadn’t plausibly shown Bright Data actually breached the agreement or caused the kind of system harm needed for tort claims. (courtlistener.com)

Takeaway:

“No scraping” in your ToS, standing alone, is not a guaranteed basis for a lawsuit—especially for public content you don’t own.

Logged-in vs logged-out – Meta v Bright Data

In Meta v Bright Data, Meta argued Bright Data violated its terms by scraping Facebook and Instagram. Judge Chen granted summary judgment for Bright Data on the breach-of-contract claim:

  • Meta’s terms provisions at issue were aimed at account-holders abusing logged-in access.
  • Bright Data scraped logged-out, public data using generic web access.
  • The court held Meta’s ToS didn’t bind Bright Data in that context; Bright Data stood in the same shoes as any non-logged-in visitor. (quinnemanuel.com)

Takeaway:

  • If the abuse you’re facing comes from non-authenticated scraping, ToS enforcement is weaker. You want tokens and contracts, not just browsewrap.

The AI era: Reddit v Anthropic, Reddit v Perplexity

Reddit is running the modern AI playbook:

  • Reddit v Anthropic (SF Superior Court) – focuses on breach of the Reddit User Agreement, unjust enrichment, interference with user contracts, and scraping to train Claude without a license, allegedly after Reddit blocked Anthropic’s bots and asked it to stop. (Reuters)
  • Reddit v Perplexity & data brokers (SDNY) – alleges “industrial-scale” scraping of Reddit content via Google SERPs, bypassing robots.txt and anti-scraping measures; layers copyright, DMCA anti-circumvention, unfair competition, and unjust enrichment claims on top of the ToS story. (National Law Review)

Reddit also publicly announced updates to its robots.txt and rate limiting as part of a broader strategy to block unauthorized AI crawlers and push them into paid licensing deals. (Reuters)

Takeaway:

The pattern is shifting from “you broke our ToS” to “you built a commercial AI product by circumventing technical controls, ignoring our terms, and exploiting our content.”

That’s exactly the framing you want if your target is a modern LLM crawler.


🧭 For SaaS platforms: designing your stack with AI scrapers in mind

Think of this as defensive architecture across three layers.

🧱 Contract layer: API & ToS language

For dev portals and APIs, you want developer-facing terms that:

  • Tie access to a key, account, or seat, not just “anyone who visits our docs.”
  • Explicitly restrict:
    • high-volume scraping or bulk export;
    • use of output or documentation for training or improving general-purpose models;
    • reselling or redistributing your data via data brokers/export tools.
  • Reserve rights to:
    • throttle, suspend, or revoke keys for abuse;
    • demand logs and cooperation in investigating suspected automated scraping;
    • pursue DMCA §1201 / anti-circumvention and other remedies if bots bypass your technical measures.

Courts have enforced anti-scraping terms against known account-holders who used fake profiles or abused private APIs, while being skeptical when sites try to bind random visitors. (Privacy World)

🧪 Technical layer: make abuse measurable

You want to be able to point to something more concrete than “our servers felt slow.”

For each API key and IP range, log:

  • request timestamps and paths;
  • user agent and any declared bot identity;
  • burst patterns (e.g., 10K requests in 5 minutes to /docs/ or /v1/resource);
  • correlation with your robots.txt rules (e.g., access to /disallow/ paths).

These fields become exhibits in C&Ds and, if needed, complaints.
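One way to capture those fields is a single structured log record per request. A sketch, assuming JSON logs and illustrative field names (adapt to your own pipeline):

```python
import json
import time

def log_api_request(api_key: str, ip: str, path: str, user_agent: str,
                    status: int, rate_limited: bool, robots_disallowed: bool) -> None:
    """Emit one machine-readable log record per request; field names are illustrative."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "api_key": api_key,                      # or a hashed key ID
        "ip": ip,
        "path": path,
        "user_agent": user_agent,                # declared bot identity, if any
        "status": status,                        # e.g. 200, 403, 429
        "rate_limited": rate_limited,            # True if this request drew a 429
        "robots_disallowed": robots_disallowed,  # path matches a Disallow rule
    }
    print(json.dumps(record))  # in practice: ship to your log pipeline
```

Machine-readable records make it trivial to aggregate per-key and per-IP volumes later, and excerpts can go straight into a C&D appendix without manual cleanup.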

🧬 Governance layer: content & ownership

AI cases keep reminding platforms of a basic but uncomfortable fact:

  • If users own the content and you only hold a non-exclusive license (like X, Reddit, many forums), there are limits on what you can do with ToS to police third-party use. (wtk-law.com)

That doesn’t mean do nothing, but it affects:

  • your copyright theories;
  • whether you lean more on contract, privacy, and data protection;
  • and what you promise users about how their contributions will (or won’t) be used in AI training.

🕵️ Detecting and documenting API terms abuse by LLM crawlers

As a dev or platform architect, your job is to make the abuse story legible in log form.

What “LLM crawler abuse” usually looks like

Patterns you’ll recognize:

  • A small number of IPs or ASNs hitting every endpoint and every doc page, often ignoring Retry-After headers.
  • User agents that either openly declare a *Bot-style identity or spoof a standard browser while behaving like a classic scraper.
  • Aggressive crawling of exactly the directories marked Disallow in robots.txt. (arXiv)
  • Usage coming from accounts or keys that never progress beyond read-only calls and explore the entire schema.

When you see that, start assembling a technical dossier:

| Evidence item | Why it matters |
| --- | --- |
| IP ranges and ASN ownership | Lets you tie traffic to data centers, cloud providers, or known scraping companies. |
| Key or account IDs | Supports contract theories; shows they agreed to your dev terms. |
| robots.txt vs access patterns | Helps tell the story: “They saw the instructions and decided to ignore them.” |
| Rate-limit headers vs traffic | Documents deliberate evasion of throttling and supports the “technical abuse” framing. |
| Any identification of the bot | Some AI crawlers self-identify (GPTBot, Claude-Web, PerplexityBot); screenshots of this matter, especially if they continue after you disallow them. (WIRED) |
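Turning raw logs into that dossier is usually a small offline pass over the records. A sketch that flags disallowed-path hits and burst behavior per IP, assuming records parsed from the structured logs above with 'ts' converted back into a datetime; the thresholds and path prefixes are illustrative:

```python
from collections import defaultdict
from datetime import timedelta

# Illustrative thresholds; tune to your traffic and documented limits.
BURST_WINDOW = timedelta(minutes=5)
BURST_THRESHOLD = 10_000
DISALLOWED_PREFIXES = ("/disallow/", "/v1/")   # mirror your robots.txt rules

def flag_suspect_ips(records: list[dict]) -> dict[str, dict]:
    """Group parsed log records by IP and flag scraping-like behavior.

    Each record is expected to carry 'ip', 'ts' (a datetime), and 'path'.
    """
    by_ip: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        by_ip[rec["ip"]].append(rec)

    report: dict[str, dict] = {}
    for ip, recs in by_ip.items():
        recs.sort(key=lambda r: r["ts"])
        disallowed_hits = sum(r["path"].startswith(DISALLOWED_PREFIXES) for r in recs)

        # Sliding window over sorted timestamps: any 5-minute span with
        # >= BURST_THRESHOLD requests counts as a burst.
        burst, lo = False, 0
        for hi in range(len(recs)):
            while recs[hi]["ts"] - recs[lo]["ts"] > BURST_WINDOW:
                lo += 1
            if hi - lo + 1 >= BURST_THRESHOLD:
                burst = True
                break

        if disallowed_hits or burst:
            report[ip] = {
                "total_requests": len(recs),
                "disallowed_hits": disallowed_hits,
                "burst_over_threshold": burst,
            }
    return report
```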

✉️ Structuring a C&D to an AI crawler that ignores your rules

Here’s the bones of a dev-savvy C&D aimed at LLM scrapers. You’d adapt this for your jurisdiction, but the structure is stable.

🧩 Table – Key building blocks of an AI-scraper C&D

| Section | What you do | Dev-facing details to include |
| --- | --- | --- |
| 1. Identify the bot and traffic | Tie the letter to concrete behavior. | IP ranges, user agents, API keys, time windows, and approximate request volume. Include sample log lines or charts as appendices. |
| 2. Point to your contractual terms | Show that this isn’t just vibes; there was an agreement. | Quote the relevant API Terms / ToS sections: anti-scraping, rate limits, “no AI training”, export / resale restrictions. Clarify how the bot (or its operator) agreed—e.g., developer signup, key issuance. (Privacy World) |
| 3. Describe technical abuse | Elevate it above “they made a lot of requests.” | Explain how their behavior exceeded documented rate limits, ignored robots.txt, hit blocked paths, or bypassed auth and CAPTCHAs. Connect directly to your docs: “Our published limit is X requests / minute; your bot has repeatedly hit >Y/minute.” (Stytch) |
| 4. Legal theories (short and targeted) | Layer contract + IP + computer misuse. | – Breach of API Terms (for key-based use) – Trespass to chattels / unfair competition for evasive scraping – Potential DMCA §1201 anti-circumvention if they bypass technical access controls, especially robots + IP blocking + CAPTCHAs. (National Law Review) |
| 5. Demands: keys, logging, and model use | This is where you get concrete. | Common demands: – Immediately cease scraping and training on your endpoints/content. – Revoke and delete API keys, and confirm no derived keys/proxies. – Provide network-level logs (IPs, timestamps, endpoints, volumes) for a defined lookback period. – Provide a sworn statement describing how harvested data was used (training, fine-tuning, RAG indices) and commit to ceasing use / deleting derived corpora where appropriate. |
| 6. Cure window and escalation | Give them a path out, but on your terms. | Short but reasonable timeline (e.g., 7–14 days) to comply and schedule a discussion about licensing if they want continued access. Make clear that failure will lead to key revocation, broader IP blocking, and potential litigation in your chosen forum. |

Your tone can stay professional and technical: the goal is to make it feel inevitable that, if they don’t cooperate, the next step is a complaint that looks a lot like Reddit’s filings.
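For section 3 in particular, hand your lawyer concrete numbers rather than raw logs. A sketch that compares observed per-minute request rates per key against a documented limit; the limit is illustrative, and records are assumed to use the same field names as the logging sketch earlier, with 'ts' parsed back into a datetime:

```python
from collections import Counter

DOCUMENTED_LIMIT_PER_MINUTE = 60   # illustrative published limit

def rate_limit_exhibit(records: list[dict]) -> list[str]:
    """Summarize per-key, per-minute request counts that exceed the documented limit.

    Produces plain-text lines suitable for a C&D appendix, e.g.:
      key=<key id> minute=<YYYY-MM-DDTHH:MM> observed=<count> documented_limit=60
    """
    per_minute: Counter = Counter()
    for rec in records:
        minute = rec["ts"].strftime("%Y-%m-%dT%H:%M")
        per_minute[(rec["api_key"], minute)] += 1

    return [
        f"key={key} minute={minute} observed={count} "
        f"documented_limit={DOCUMENTED_LIMIT_PER_MINUTE}"
        for (key, minute), count in sorted(per_minute.items())
        if count > DOCUMENTED_LIMIT_PER_MINUTE
    ]
```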


🚨 Enforcement: what you actually do after the C&D

If the LLM crawler keeps misbehaving after your letter, you have options.

Immediate platform actions

  • Revoke keys and sessions bound to the offending entity.
  • Extend IP and ASN blocking to cover the data centers and proxies they’re known to use.
  • Tighten rate limits globally or for suspect ranges; require CAPTCHAs or OAuth for previously anonymous endpoints.

Log these changes—they become evidence if you later argue circumvention (because the bot had to adapt to evade them). (redditinc.com)
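A lightweight way to log those changes is a dedicated enforcement audit trail, kept separate from request logs, so you can later show exactly when each control went up. A sketch with illustrative action names and an assumed local file path:

```python
import json
import time

def record_enforcement_action(action: str, target: str, reason: str) -> None:
    """Append one timestamped enforcement event to a local audit log.

    Example actions: "revoke_key", "block_asn", "tighten_rate_limit".
    The trail helps show that later evasion happened *after* specific
    controls (and notice) were in place.
    """
    event = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "action": action,
        "target": target,   # key ID, IP range, or ASN
        "reason": reason,   # e.g. "post-C&D scraping of /docs/"
    }
    with open("enforcement_audit.log", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
```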

Considering litigation

If the crawler:

  • is high-volume;
  • is clearly training or powering a commercial model on your data; and
  • ignores multiple notices and technical blocks,

you’re in the same fact pattern as Reddit v Anthropic / Perplexity.

The claims counsel are reaching for in these fact patterns include:

  • Breach of contract (for authenticated access / API users). (Courthouse News)
  • Copyright (for expressive docs / examples / knowledge base content). (ddg.fr)
  • DMCA §1201: bypassing robots + IP blocks + CAPTCHAs and using Google SERPs or proxies to sneak in anyway. (National Law Review)
  • Trespass to chattels / unfair competition / unjust enrichment under state law, especially in state court where judges are more comfortable treating scraping as interference with your business rather than a pure copyright issue. (Courthouse News)

The key is that your API Terms, robots.txt, and rate-limit policies have been turned into a narrative:

“We set clear rules, both contractual and technical. You built an LLM product by intentionally ignoring them and evading our controls.”


❓ FAQ for SaaS platforms and dev teams

Are we legally required to honor all crawlers that follow robots.txt?

No. robots.txt is a voluntary protocol, not a legal obligation. You can block or throttle even “polite” bots if they don’t fit your business model. (Wikipedia)

robots.txt is mainly useful to:

  • signal your preferences;
  • distinguish respectful actors from bad ones;
  • and support your story that you had reasonable technical measures before you went to court.

Can we ever rely on ToS alone to stop scraping?

Sometimes, but it’s risky:

  • For authenticated users with keys or logins, courts have enforced anti-scraping clauses and account-abuse provisions. (Privacy World)
  • For purely public content, cases like X v Bright Data and Meta v Bright Data show judges are reluctant to let platforms use state contract law to expand their rights beyond copyright and user licenses. (Technology & Marketing Law Blog)

So use ToS, but don’t stop there. Combine them with auth, logging, and targeted legal hooks.

Does ignoring robots.txt itself violate any law?

Not by itself. robots.txt is voluntary. But:

  • If a bot both ignores robots.txt and actively evades CAPTCHAs, IP blocks, or other controls after notice, that combination starts to look like circumvention of technical measures and can support DMCA §1201, trespass, and similar claims. (National Law Review)

The law isn’t there yet in a clean, bright-line way, but Reddit’s Perplexity suit is basically trying to make that argument stick.

Should we try to cut off all AI crawlers?

Probably not. A more sustainable approach:

  • Allow or license specific AI vendors whose use you’re comfortable with;
  • Block and pursue those who won’t sign a license and aggressively ignore your controls;
  • Make your robots.txt + docs reflect this distinction (e.g., whitelisting named crawlers and disallowing others). (Reuters)

This positions you as selective, not Luddite, which helps legally and reputationally.
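As a concrete version of that robots.txt distinction, here is a minimal sketch that renders an allow-list of licensed crawlers and disallows the rest; the bot names and paths are placeholders, not recommendations:

```python
LICENSED_CRAWLERS = ["Googlebot", "LicensedAIBot"]               # illustrative partners
BLOCKED_AI_CRAWLERS = ["GPTBot", "Claude-Web", "PerplexityBot"]  # illustrative blocks

def build_robots_txt() -> str:
    """Render a robots.txt that allows named partners and disallows named AI crawlers."""
    sections = []
    for agent in LICENSED_CRAWLERS:
        sections.append(f"User-agent: {agent}\nAllow: /\n")
    for agent in BLOCKED_AI_CRAWLERS:
        sections.append(f"User-agent: {agent}\nDisallow: /\n")
    # Default rule for everyone else: keep them out of API and sensitive doc paths.
    sections.append("User-agent: *\nDisallow: /v1/\nDisallow: /docs/private/\n")
    return "\n".join(sections)

if __name__ == "__main__":
    print(build_robots_txt())
```

Pair the file with server-side blocking for the named crawlers; the robots.txt itself is still only a request.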


If your SaaS platform is already seeing unexplained bursts of traffic on docs and APIs, the time to assemble your technical dossier and C&D templates is before you end up in Reddit’s position. The intersection of dev telemetry + clean contracts + targeted demand letters is where you can still steer AI crawlers toward a license instead of a lawsuit.

And once you have this system in place, it becomes a reusable pattern for every future model vendor that decides your docs are free training data.