Hacker Times

TBH most of the talk of "aggressive scraping" has been in the 100K pages/day range (which is ~1 page/s, i.e. negligible). In my mind, cloud providers' ridiculous egress rates are more to blame here.


I've caught Huawei and Tencent IPs scraping the same image over and over again, with different query parameters. Sure, the image was only 260KiB and I don't use Amazon or GCP or Azure so it didn't cost me anything, but it still spammed my logs and caused a constant drain on my servers' resources.

The bots keep coming back too, ignoring HTTP status codes, permanent redirects, and whatever else I can think of to tell them to fuck off. Robots.txt obviously doesn't help. Filtering traffic from data centers didn't help either, because soon after I did that, residential IPs started doing the same thing. I don't know if this is a Chinese ISP abusing its IP ranges or if China just has a massive botnet problem, but either way the traditional ways of getting rid of these bots haven't helped.

In the end, I'm now blocking all of China and Singapore. That stops the endless flow of bullshit requests for now, though I'm seeing some familiar user agents appear in other East Asian countries as well.


So make sure the image is only available at one canonical URL with proper caching headers? No, obviously the only solution is to install crapware that worsens the experience for regular users.
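For what it's worth, the canonical-URL-plus-caching-headers approach is only a few lines of nginx config. A sketch (the path and cache lifetime here are made up for illustration):

```nginx
# Redirect any query-string variant back to the bare canonical URL.
# nginx's "return" does not re-append the query string, so
# /images/photo.jpg?x=1 becomes a 301 to /images/photo.jpg.
location = /images/photo.jpg {
    if ($args) {
        return 301 /images/photo.jpg;
    }
    # Let clients and intermediate caches keep the image for a week.
    add_header Cache-Control "public, max-age=604800, immutable";
}
```

Of course, this only helps against clients that honor redirects and caching at all, which the bots in question apparently don't.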


Agreed. Website operators should have a hard look at why their unoptimized crap can't manage such low request rates before contributing to the enshittification of the web by deploying crapware like anubis or buttflare.


I've been blocking a few scrapers from my gitea service - not because it's overloaded, more just to see what happens. They're not getting good data from <repo>/commit/<every sha256 in the repo>/<every file path in the repo> anyway. If they actually wanted the data they could run "git clone".
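For anyone who does want to steer well-behaved crawlers away from those per-commit and per-file views, the textbook answer is a robots.txt along these lines (the URL patterns are illustrative, not Gitea's exact layout, and as noted elsewhere in the thread the aggressive scrapers ignore it anyway):

```text
# Keep crawlers out of per-commit and raw-file views.
# Note: "*" wildcards in Disallow and Crawl-delay are widely
# supported extensions, not part of the original robots standard.
User-agent: *
Disallow: /*/commit/
Disallow: /*/raw/
Crawl-delay: 10
```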

I just checked, since someone was talking about scraping in IRC earlier. Facebook is sending me about 3 requests per second. I blocked their user-agent. Someone with a Googlebot user-agent is doing the same stupid scraping pattern, and I'm not blocking it. Someone else is sending a request every 5 seconds with
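Blocking by user-agent like this is a one-liner in most servers. An nginx sketch, assuming the bot identifies itself honestly (Facebook's crawler sends "facebookexternalhit"; substitute whatever shows up in your logs):

```nginx
# Reject anything claiming to be Facebook's crawler.
# ~* makes the regex match case-insensitive.
if ($http_user_agent ~* "facebookexternalhit") {
    return 403;
}
```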

One thing that's interesting about the current web is that sites are expected to make themselves scrapeable. It's supposed to be my job to organize the site in such a way that scrapers don't try to scrape every combination of commit and file path.



