Flutterby™! : Crawler Abuse


Crawler Abuse

2024-07-26 17:43:56.179457+02 by Dan Lyke 2 comments

I occasionally think about finding ways to self-host video. It's not like my videos get a lot of watches, and I'd rather deliver the content myself than just let YouTube monetize it. Surely just putting it on an S3 host, or serving it via some sort of proxy from home, wouldn't be that onerous. But I've also hosted things from home before, including, ages ago, a friend's relatively low-volume forum that someone decided to spider with no rate limiting, DDoSing everything.

When that shit happens on Flutterby, I do a little ipfw deny ... and everything's fine (and I have some of that automated), but the fuckwits always find some new way through, and I'm getting tired.
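A minimal sketch of what automating that kind of block might look like, not the actual Flutterby setup: it counts hits per client address in a batch of access log lines and builds one ipfw deny rule for each address over a threshold. The threshold, the function names, and the assumption that the client address is the first field on each log line are all illustrative, and the commands are only generated, never run.

```python
import ipaddress
from collections import Counter

# Hypothetical threshold: hits per log batch before an address gets blocked.
MAX_REQUESTS = 1000


def abusive_ips(log_lines, max_requests=MAX_REQUESTS):
    """Return the client addresses with more than max_requests hits.

    Assumes the client address is the first whitespace-separated field,
    as in the common Apache/nginx access log layouts.
    """
    hits = Counter()
    for line in log_lines:
        fields = line.split(None, 1)
        if not fields:
            continue
        try:
            # Normalise and validate, so nothing odd ends up in a shell command.
            address = str(ipaddress.ip_address(fields[0]))
        except ValueError:
            continue
        hits[address] += 1
    return sorted(ip for ip, count in hits.items() if count > max_requests)


def deny_commands(ips):
    """Build (but do not execute) one ipfw deny rule per address."""
    return [f"ipfw add deny ip from {ip} to any" for ip in ips]
```

Feeding this the last few minutes of the access log from a cron job, and reviewing the output before piping it to a shell, would be one cautious way to use it.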

And, of course, I see stuff like this: Read The Docs: AI crawlers need to be more respectful:

One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day.

... with no bandwidth limiting or support for ETags or Last-Modified.
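For contrast, here is a rough sketch of the respectful version, using only the Python standard library: the fetcher remembers the ETag and Last-Modified values from its previous visit, sends them back as If-None-Match and If-Modified-Since, and keeps its cached copy when the server answers 304 Not Modified. The function names, the shape of the cache dictionary, and the one-second delay are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def conditional_headers(cached):
    """Build revalidation headers from a previously stored response."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def polite_fetch(url, cached, delay_seconds=1.0):
    """Fetch url unless the server says the cached copy is still current."""
    time.sleep(delay_seconds)  # crude rate limiting: at most one request per delay
    request = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(request) as response:
            return {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
                "body": response.read(),
            }
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing to download again
            return cached
        raise
```

On a first visit the cache is just an empty dictionary, so the request goes out with no conditional headers and the full body comes back.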

And Anthropic AI Scraper Hits iFixit’s Website a Million Times in a Day.

I think one of the huge problems we have is that either the crawler companies aren't hiring the best and the brightest (likely, because they're the ones sucked in by promises of "AI"), or there's no incentive to not fuck over the world in the mad dash.

Anyway, if I can find a way that I trust, I could see maybe doing some sort of actual user detection that hands out a temporarily signed S3 key to serve the video from... But... there's been a lot of discussion recently about the challenges of self-hosting blogs, and now Fediverse sites, and this is just more for the "why we can't have nice things" category.
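The general idea behind a temporarily signed link can be sketched with nothing but the standard library. This is a generic HMAC stand-in, not S3's own signing scheme (with S3 the link would normally come from the SDK's presigned URL support); the secret, the query parameter names, and the five-minute lifetime are all hypothetical.

```python
import hashlib
import hmac
import time

SECRET = b"replace-me"  # hypothetical server-side secret, never sent to clients


def sign_path(path, expires_at, secret=SECRET):
    """HMAC over the path and its expiry time, as a hex string."""
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def signed_url(path, ttl_seconds=300, now=None, secret=SECRET):
    """Return path plus an expiry timestamp and a signature covering both."""
    issued_at = int(time.time() if now is None else now)
    expires_at = issued_at + ttl_seconds
    return f"{path}?expires={expires_at}&sig={sign_path(path, expires_at, secret)}"


def is_valid(path, expires_at, sig, now=None, secret=SECRET):
    """Accept a link only if it is unexpired and its signature matches."""
    current = time.time() if now is None else now
    if current > expires_at:
        return False
    return hmac.compare_digest(sig, sign_path(path, expires_at, secret))
```

A crawler that harvests such a link can replay it only until it expires, and cannot mint links for other paths without the secret. Deciding who counts as an actual user in the first place is the hard part, and this does nothing about that.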

Via

[ related topics: Interactive Drama Weblogs broadband Invention and Design Community Artificial Intelligence Video ]

Comments in ascending chronological order:

#Comment Re: Crawler Abuse made: 2024-07-27 13:58:05.210978+02 by: DaveP

Thanks for these examples. I’ve pondered setting up my own hosting for a few of my websites, and after looking through this… nope. I’ll keep hosting them elsewhere and let someone else worry about the crawlers.

#Comment Re: Crawler Abuse made: 2024-07-30 20:37:29.376857+02 by: Dan Lyke

This server's on a fixed-bandwidth unlimited connection, and I don't often have issues. Occasionally some dipshit will decide to spider 30k entries in order, and I did lock down some of the "by topic" stuff, because spiders were just hammering the shit out of that and each page took some database querying to build.


Flutterby™ is a trademark claimed by Dan Lyke for the web publications at www.flutterby.com and www.flutterby.net.