Is cloudflare becoming a mob outfit? Because they are selling scraping counterme...

shadowfiend · 2026-03-11T00:20:11 1773188411

No: https://developers.cloudflare.com/browser-rendering/rest-api...

oefrha · 2026-03-11T03:28:27 1773199707

That's not the perfect defense you think it is. Plenty of robots.txts[1] technically allow scraping their main content pages as long as your user-agent isn't explicitly disallowed, but in practice they're behind Cloudflare so they still throw up Cloudflare bot check if you actually attempt to crawl.

And forget about crawling. If you have a less reputable IP (basically every IP in third world countries are less reputable, for instance), you can be CAPTCHA'ed to no end by Cloudflare even as a human user, on the default setting, so plenty of site owners with more reputable home/office IPs don't even know what they subject a subset of their users to.

[1] E.g. https://www.wired.com/robots.txt to pick an example high up on HN front page.

rendaw · 2026-03-11T06:17:29 1773209849

I think the simple explanation is that they weren't selling scraping countermeasures, they were selling web-based denial of service protection (which may be caused by scrapers).

dewey · 2026-03-11T13:33:31 1773236011

This was always also sold as bot protection and anti-scraping / crawling features like https://www.cloudflare.com/lp/pg-ai-crawl-control/

PeterStuer · 2026-03-11T07:47:21 1773215241

Ask yourself, why would a scraper ddos? Why would a ddos-protection vendor ddos?

wongarsu · 2026-03-11T10:18:25 1773224305

Because the scraper is either impatient, careless or indifferent; and if they scrape for training data they don't plan to come back. If they don't plan to come back they don't care if you tighten up crawling protections after they have moved on. In fact they are probably happy that they got their data and their competition won't

wiether · 2026-03-11T11:38:56 1773229136

> they don't plan to come back

To me the current behavior of those scrapers tells me that "they don't plan", period.

Looks like they hired a bunch of excavators and are digging 2 meters deep on whole fields, looking for nuggets of gold, and pilling the dirt on a huge mountain.

Once they realize the field was bereft of any gold but full of silver? Or that the gold was actually 2.5 meters deep?

They have to go through everything again.

junaru · 2026-03-11T10:48:22 1773226102

> Ask yourself, why would a scraper ddos?

Don't need to ask anything i can tell you exactly - because they have no regard for anything but their own profit.

Let me give you an example of this mom and pop shop known as anthropic.

You see they have this thing called claudebot and at least initially it scraped iterating through IP's.

Now you have these things called shared hosting servers, typically running 1000-10000 domains of actual low volume websites on 1-50 or so IPs.

Guess what happens when it is your networks time to bend over? Whole hosting company infrastructure going down as each server has hundreds of claudebots crawling hundreds of vhosts at the same time.

This happened for months. Its the reason they are banned in WAFs by half the hosting industry.

PeterStuer · 2026-03-11T17:19:32 1773249572

So how would you avoid this specific situation as a web-crawler that tries to be well behaved? You strictly adhere to robots.txt as specified by each domain. The problem is not with any of the sites but the density (1000-10000) by which the hoster packed them. If e.g. the crawler had a 1 sec between page governor even if robots.txt had no rate specified, which to be fair is very reasonable, this packing could still lead to high server load.

c0balt · 2026-03-11T10:40:33 1773225633

The number of git forges behind Anubis et al and the numerous public announcements should be enough.

Scrappers seem to be exceedingly careless in using public resources. The problem is often not even DDOS (as in overwhelming bandwidth usage) but rather DOS through excessive hits on expensive routes.

drcongo · 2026-03-11T12:11:38 1773231098

Ask yourself, why would everyone except you say that they do?

its-kostya · 2026-03-11T00:00:03 1773187203

Cloudflare has been trying to mediate publishers & AI companies. If publishers are behind Cloudflare and Cloudflare's bot detection stops scrapers at the request of publishers, the publishers can allow their data to be scraped (via this end point) for a price. It creates market scarcity. I don't believe the target audience is you and me. Unless you own a very popular blog that AI companies would pay you for.

PeterStuer · 2026-03-11T07:34:02 1773214442

Next step will be their default "free" anti-bot denying all but their own bot. They know full well nearly nobody changes the default.

pocksuppet · 2026-03-11T03:31:32 1773199892

Was it ever not one? They protect a lot of DDoS-for-hire sites from DDoS by their competitors. In return they increase the quantity of DDoS on the internet. They offer you a service for $150, then months later suddenly demand $150k in 24 hours or they shut down your business. If you use them as a DNS registrar they will hold your domain hostage.

azinman2 · 2026-03-11T04:40:07 1773204007

Where can I learn more about the 150k in 24h?

caffeinewriter · 2026-03-11T05:58:29 1773208709

I imagine it's referencing this story:

https://robindev.substack.com/p/cloudflare-took-down-our-web...

HN Discussion:

https://hackertimes.com/item?id=40481808

Sebguer · 2026-03-11T04:12:40 1773202360

yeah, GP completely fails to realize that Cloudflare has always played both sides. that is their entire business model, and it was transparent from the beginning that they would absolutely do the same here.

theamk · 2026-03-10T23:56:44 1773187004

no? it takes 10 seconds to check:

> The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".

You don't need any scraping countermeasures for crawlers like those.

Macha · 2026-03-11T01:10:29 1773191429

So what’s the user agent for their bot? They don’t seem to specify the default in the docs and it looks like it’s user configurable. So yet another opt out bot which you need your web server to match on special behaviour to block

flanksteak20 · 2026-03-11T05:58:59 1773208739

Isn't this covered here? https://developers.cloudflare.com/browser-rendering/referenc...

Macha · 2026-03-11T09:25:17 1773221117

No, hence all their examples using User-Agent: *

gruez · 2026-03-11T01:14:05 1773191645

>So yet another opt out bot which you need your web server to match on special behaviour to block

Given that malicious bots are allegedly spoofing real user agents, "another user agent you have to add to your list" seems like the least of your problems.

AdamN · 2026-03-11T09:45:38 1773222338

Not 'allegedly' - it's just a fact. Even if you're not malicious however it's still sometimes necessary because the server may have different sites for different browsers and check user agents for the experience they deliver. So then even for legitimate purposes you need to at least use the prefix of the user agent that the server expects.

Macha · 2026-03-11T14:05:11 1773237911

It is cloudflare who made the claim that they are well behaved unlike those other bots and that their behaviour can be controlled by robots.txt

If I need to treat cloudflare bots the same as malicious bots, that undermines their claim.

PeterStuer · 2026-03-11T07:50:10 1773215410

Like they explain in the docs, their crawler will respect the robots.txt dissalowed user-agents, right after the section hat explains how to change your user-agent.

subscribed · 2026-03-11T00:58:52 1773190732

I think there's some space being absolutely snuffed by the countless bots of everyone, ignoring everything, pulling from residential proxies, and this, supposedly slower, well behavior, smarter bot.

Like there's a difference between dozens of drunk teenagers thrashing the city streets in the illegal street race vs a taxi driver.

isodev · 2026-03-11T04:26:05 1773203165

They always have been.

They also use their dominant position to apply political pressure when they don’t like how a country chooses to run things.

So yeah, we’ve created another mega corp monster that will hurt for years to come.

andrepd · 2026-03-11T10:36:39 1773225399

Well this scraper honours robots.txt so I'm sure most AI crawlers will find it useless.

iso-logi · 2026-03-10T23:57:17 1773187037

Their free DNS is only a small piece of the pie.

The fact that 30%+ of the web relies on their caching services, routablility services and DDoS protection services is the main pull.

Their DNS is only really for data collection and to front as "good will"

jen729w · 2026-03-11T04:09:23 1773202163

> The fact that 30%+ of the web relies on their caching services

30% of the web might use their caching services. 'Relies on' implies that it wouldn't work without them, which I doubt is the case.

It might be the case for the biggest 1% of that 30%. But not the whole lot.

reddalo · 2026-03-11T08:21:39 1773217299

>'Relies on' implies that it wouldn't work without them

Last time Cloudflare went down, their dashboard was also unavailable, so you couldn't turn off their proxy service anyway.

rrr_oh_man · 2026-03-10T23:45:23 1773186323

[flagged]

stri8ted · 2026-03-10T23:58:49 1773187129

Do you have any evidence to support this view?

rolymath · 2026-03-11T00:55:48 1773190548

Read who and how it was founded. It's not a secret at all.

rrr_oh_man · 2026-03-11T18:52:43 1773255163

It’s funny how I got immediately downvoted and flagged

pocksuppet · 2026-03-11T03:32:04 1773199924

Who else would MITM 30% of the internet?

mtmail · 2026-03-10T23:58:47 1773187127

Any kind of source for the claim?

Retr0id · 2026-03-10T23:48:18 1773186498

For a long time cloudflare has proudly protected DDoS-as-a-service sites (but of course, they claim they don't "host" them)

Dylan16807 · 2026-03-11T03:31:06 1773199866

Are you using the word "claim" to call them wrong or for a more confusing reason?

Because I'm pretty sure they are not in fact wrong.

Retr0id · 2026-03-11T03:46:50 1773200810

The distinction between a caching proxy and an origin server is pretty meaningless when you're serving static content, if you ask me.

Dylan16807 · 2026-03-11T04:36:06 1773203766

There's a blurry line there, true.

On the other hand when a page is small and static enough that it's basically just a flyer, I also care a lot less about who hosts it.

giancarlostoro · 2026-03-11T00:27:49 1773188869

If they ever sell or the CEO shifts, yes. For the meantime, they have not given any strong indication that they're trying to bully anybody. I could see things changing drastically if the people in charge are swapped out.