| ▲ | quchen 2 days ago |
| Unless this concept becomes a mass phenomenon with many implementations, isn’t this pretty easy to filter out? And furthermore, since this antagonizes billion-dollar companies that can spin up teams doing nothing but browse GitHub and HN for software like this to keep it from polluting their data lakes, I wonder whether this is a very efficient approach. |
|
| ▲ | marcus0x62 2 days ago | parent | next [-] |
| Author of a similar tool here[0]. There are a few implementations of this sort of thing that I know of. Mine is different in that the primary purpose is to slightly alter content statically using a Markov generator, mainly to make it useless for content reposters, secondarily to make it useless to LLM crawlers that ignore my robots.txt file[1]. I assume the generated text is bad enough that the LLM crawlers just throw the result out. Other than the extremely poor quality of the text, my tool doesn't leave any fingerprints (like recursive nonsense links). In any case, it can be run on static sites with no server-side dependencies so long as you have a way to do content redirection based on User-Agent, IP, etc. My tool does have a second component - linkmaze - which generates a bunch of nonsense text with a Markov generator and serves infinite links (like Nepenthes does), but I generally only throw incorrigible bots at it (and, as others have noted in-thread, most crawlers already set some kind of limit on how many requests they'll send to a given site, especially a small site). I do use it for PHP-exploit crawlers as well, though I've seen no evidence those fall into the maze -- I think they mostly just look for some string indicating a successful exploit and move on if whatever they're looking for isn't present. But, for my use case, I don't really care if someone fingerprints content generated by my tool and avoids it. That's the point: I've set robots.txt to tell these people not to crawl my site. In addition to Quixotic (my tool) and Nepenthes, I know of: * https://github.com/Fingel/django-llm-poison * https://codeberg.org/MikeCoats/poison-the-wellms * https://codeberg.org/timmc/marko/ 0 - https://marcusb.org/hacks/quixotic.html 1 - I use the ai.robots.txt user agent list from https://github.com/ai-robots-txt/ai.robots.txt |
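For readers unfamiliar with the Markov generator mentioned above, here is a minimal sketch of the general technique. It is not taken from Quixotic or any of the listed tools; the function names `build_chain` and `generate`, the word-level order of 2, and the sample text are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, length=30, seed=None):
    """Walk the chain to emit `length` words of plausible-looking nonsense.

    Assumes the chain is non-empty, i.e. the source text had more than
    `order` words.
    """
    rng = random.Random(seed)
    states = list(chain.keys())
    state = rng.choice(states)
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end (the state only occurred at the end of the text):
            # restart from a random known state.
            state = rng.choice(states)
            out.extend(state)
            continue
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out[:length])


if __name__ == "__main__":
    sample = ("the quick brown fox jumps over the lazy dog "
              "and the quick brown cat sleeps")
    print(generate(build_chain(sample), length=20, seed=1))
```

The output only ever reuses words from the source text, which is why such text can look superficially on-topic while being meaningless.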
| |
| ▲ | tremon 8 hours ago | parent [-] | | poison-the-wellms I gotta give props for this project name. |
|
|
| ▲ | btilly 2 days ago | parent | prev | next [-] |
| It would be more efficient for them to spin up a team to study this robots.txt thing. They've ignored that low hanging fruit, so they won't do the more sophisticated thing any time soon. |
| |
| ▲ | tgv 2 days ago | parent [-] | | You can't make money out of studying robots.txt, but you can avoid costs by skipping bad websites. | | |
| ▲ | xeromal a day ago | parent [-] | | Sounds like a benefit for the site owner. lol. It accomplished what they wanted. |
|
|
|
| ▲ | iugtmkbdfil834 a day ago | parent | prev | next [-] |
| I forget which fiction book covered this phenomenon (Rainbows End?), but the moment it becomes the basic default install (à la adblockers in browsers), it does not matter what the bigger players want to do; they are not actively fighting against determined and possibly radicalized users. |
|
| ▲ | reedf1 2 days ago | parent | prev | next [-] |
| The idea is that you place this in parallel to the rest of your website routes, that way your entire server might get blacklisted by the bot. |
|
| ▲ | WD-42 2 days ago | parent | prev | next [-] |
| Does it need to be efficient if it’s easy? I wrote a similar tool except it’s not a performance tarpit. The goal is to slightly modify otherwise organic content so that it is wrong, but only for AI bots. If they catch on and stop crawling the site, nothing is lost. https://github.com/Fingel/django-llm-poison |
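Serving different content "only for AI bots" usually comes down to a User-Agent check. The sketch below is an assumption about how that gating could look, not code from django-llm-poison; the token list is a small illustrative subset (a real deployment would load a maintained list), and `is_listed_crawler` and `choose_body` are hypothetical names.

```python
# Illustrative subset of crawler User-Agent tokens; an assumption for this
# sketch, not a complete or authoritative list.
CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot")


def is_listed_crawler(user_agent, tokens=CRAWLER_TOKENS):
    """Return True if any listed token occurs in the User-Agent (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in tokens)


def choose_body(user_agent, original, altered):
    """Pick the altered body for listed crawlers and the original for everyone else."""
    return altered if is_listed_crawler(user_agent) else original


if __name__ == "__main__":
    print(choose_body("Mozilla/5.0 (compatible; GPTBot/1.0)", "real text", "altered text"))
    print(choose_body("Mozilla/5.0 (X11; Linux x86_64) Firefox/130.0", "real text", "altered text"))
```

A User-Agent header is trivially spoofed, so this only affects crawlers that identify themselves honestly.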
|
| ▲ | focusedone 2 days ago | parent | prev | next [-] |
| But it's fun, right? |
|
| ▲ | grajaganDev 2 days ago | parent | prev | next [-] |
| I am not sure. How would crawlers filter this? |
| |
| ▲ | marginalia_nu 2 days ago | parent | next [-] | | You limit the crawl time or number of requests per domain for all domains, and set the limit proportional to how important the domain is. There's a ton of these types of things online; you can't, e.g., exhaustively crawl every Wikipedia mirror someone's put online. | |
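The per-domain limit described above can be sketched as a request budget scaled by an importance score. This is only an illustration under stated assumptions: the `DomainBudget` class, the 0-to-1 importance scale, and the default values are hypothetical and not taken from any particular crawler.

```python
from collections import Counter
from urllib.parse import urlsplit


class DomainBudget:
    """Cap the number of fetches per domain, scaled by an importance score."""

    def __init__(self, importance, base_limit=100, default_importance=0.1):
        self.importance = importance              # domain -> score, assumed in 0..1
        self.base_limit = base_limit              # budget for a domain with score 1.0
        self.default_importance = default_importance
        self.fetched = Counter()

    def limit_for(self, domain):
        score = self.importance.get(domain, self.default_importance)
        return max(1, int(self.base_limit * score))

    def allow(self, url):
        """Return True and count the fetch if the domain still has budget left."""
        domain = urlsplit(url).hostname or ""
        if self.fetched[domain] >= self.limit_for(domain):
            return False
        self.fetched[domain] += 1
        return True


if __name__ == "__main__":
    budget = DomainBudget({"en.wikipedia.org": 1.0}, base_limit=10)
    maze = [f"https://tarpit.example/page/{i}" for i in range(50)]
    print(sum(budget.allow(u) for u in maze))  # fetches spent on the unknown domain
```

With a budget like this, an infinite link maze on an unimportant domain costs the crawler only a handful of requests, no matter how many links it serves.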
| ▲ | captainmuon 2 days ago | parent | prev [-] | | Check if the response time, the length of the "main text", or other indicators are in the lowest few percentiles -> send to the heap for manual review. Does the inferred "topic" of the domain match the topic of the individual pages? If not -> manual review. And there are many more indicators. Hire a bunch of student jobbers, have them search GitHub for tarpits, and let them write middleware to detect those. If you are doing broad crawling, you already need to do this kind of thing anyway. | | |
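The "lowest few percentiles -> manual review" check can be sketched as a simple cutoff over one indicator, such as main-text length per page. The function name `flag_low_outliers` and the 5% default are assumptions for illustration; a real pipeline would combine several indicators.

```python
def flag_low_outliers(values, percentile=5):
    """Return indices of values strictly below the given percentile cutoff.

    Uses a simple nearest-rank cutoff on the sorted values. Nothing is
    flagged when all values are equal or the sample is too small to have
    anything below the cutoff.
    """
    if not values:
        return []
    ordered = sorted(values)
    k = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    cutoff = ordered[k]
    return [i for i, v in enumerate(values) if v < cutoff]


if __name__ == "__main__":
    # Hypothetical main-text lengths (in characters) for crawled pages.
    lengths = [1200, 950, 40, 1100, 1300] * 20
    review_queue = flag_low_outliers(lengths, percentile=25)
    print(len(review_queue), "pages sent to manual review")
```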
| ▲ | dylan604 2 days ago | parent [-] | | > Hire a bunch of student jobbers, Do people still do this, or do they just off shore the task? |
|
|
|
| ▲ | pmarreck a day ago | parent | prev | next [-] |
| It's not. It's rather pointless and frankly, nearsighted. And we can DDoS sites like this just as offensively as well simply by making many requests to it since its own docs say its Markov generation is computationally expensive, but it is NOT expensive for even 1 person to make many requests to it. Just expensive to host. So feel free to use this bash function to defeat these: httpunch() {
local url=$1
local connections=${2:-${HTTPUNCH_CONNECTIONS:-100}}
local action=$1
local keepalive_time=${HTTPUNCH_KEEPALIVE:-60}
local silent_mode=false
# Check if "kill" was passed as the first argument
if [[ $action == "kill" ]]; then
echo "Killing all curl processes..."
pkill -f "curl --no-buffer"
return
fi
# Parse optional --silent argument
for arg in "$@"; do
if [[ $arg == "--silent" ]]; then
silent_mode=true
break
fi
done
# Ensure URL is provided if "kill" is not used
if [[ -z $url ]]; then
echo "Usage: httpunch [kill | <url>] [number_of_connections] [--silent]"
echo "Environment variables: HTTPUNCH_CONNECTIONS (default: 100), HTTPUNCH_KEEPALIVE (default: 60)."
return 1
fi
echo "Starting $connections connections to $url..."
for ((i = 1; i <= connections; i++)); do
if $silent_mode; then
curl --no-buffer --silent --output /dev/null --keepalive-time "$keepalive_time" "$url" &
else
curl --no-buffer --keepalive-time "$keepalive_time" "$url" &
fi
done
echo "$connections connections started with a keepalive time of $keepalive_time seconds."
echo "Use 'httpunch kill' to terminate them."
}
(Generated in a few seconds with the help of an LLM, of course.) Your free speech is also my free speech. LLMs are just a very useful tool, and Llama, for example, is open-source and also needs to be trained on data. And I <opinion> just can't stand knee-jerk anticorporate AI-doomers who decide to just create chaos instead of using that same energy to try to steer the progress </opinion>. |
| |
| ▲ | WD-42 a day ago | parent | next [-] | | You called the parent unintelligent yet need an LLM to show you how to run curl in a loop. Yikes. | | |
| ▲ | pmarreck 9 hours ago | parent | next [-] | | Your assumption that I couldn't have written this myself or that I didn't make corrections to it is telling. I've only been doing dev for 30+ years, lol. LLMs are an accelerant, like all previous tools... Not a replacement, although it seems most people still need to figure that out for themselves while I already have. | | |
| ▲ | dilDDoS 6 hours ago | parent [-] | | Sure, but in this case it's like driving your car 10 feet to your mailbox and then bragging about how it's an accelerant (in other words, the task wasn't remotely difficult to begin with and doesn't really warrant "accelerating"). I assume in this case your note about how it was written with an LLM was more just to spite the anti-LLM sentiment above though, which would make more sense. | | |
| ▲ | pmarreck an hour ago | parent [-] | | That's exactly what it was meant to do. You're right, this is a trivial use case. |
|
| |
| ▲ | thruway516 15 hours ago | parent | prev | next [-] | | The 21st century script kiddy | | | |
| ▲ | flir 19 hours ago | parent | prev [-] | | "I'm not lazy, I'm efficient" - Heinlein |
| |
| ▲ | scudsworth a day ago | parent | prev | next [-] | | "Ah, my favorite ADD tech nomad! adjusts monocle" - https://gist.github.com/pmarreck/970e5d040f9f91fd9bce8a4bcee... | |
| ▲ | SrslyJosh 9 hours ago | parent | prev [-] | | Shhh, the adults are talking. | | |
| ▲ | pmarreck 9 hours ago | parent [-] | | The only actual child is OP or anyone who actually believes their tarpit is going to be effective at stopping LLMs |
|
|
|
| ▲ | Blackthorn 2 days ago | parent | prev [-] |
| If it means it makes your own content safe when you deploy it on a corner of your website: mission accomplished! |
| |
| ▲ | gruez 2 days ago | parent | next [-] | | >If it means it makes your own content safe Not really? As mentioned by others, such tarpits are easily mitigated by using a priority queue. For instance, crawlers can prioritize external links over internal links, which means if your blog post makes it to HN, it'll get crawled ahead of the tarpit. If it's discoverable and readable by actual humans, AI bots will be able to scrape it. | |
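The priority-queue mitigation described above can be sketched as a crawl frontier that pops links pointing to a different host before same-site links. The `Frontier` class and its two-level priority are assumptions for illustration; real crawlers weigh many more signals.

```python
import heapq
import itertools
from urllib.parse import urlsplit

EXTERNAL, INTERNAL = 0, 1  # lower value is popped first


class Frontier:
    """Crawl queue that prefers cross-site links over same-site links."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO within a priority
        self._seen = set()

    def add(self, url, found_on):
        """Queue `url`, discovered on page `found_on`, unless already queued."""
        if url in self._seen:
            return
        self._seen.add(url)
        same_host = urlsplit(url).hostname == urlsplit(found_on).hostname
        priority = INTERNAL if same_host else EXTERNAL
        heapq.heappush(self._heap, (priority, next(self._order), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    frontier = Frontier()
    frontier.add("https://tarpit.example/a", "https://tarpit.example/")
    frontier.add("https://tarpit.example/b", "https://tarpit.example/a")
    frontier.add("https://blog.example/post", "https://news.example/item")
    while frontier:
        print(frontier.pop())
```

Because a tarpit's generated links all point back into the tarpit, they stay at the low-priority end of the queue while a post linked from another site is fetched first.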
| ▲ | TeMPOraL 2 days ago | parent | prev [-] | | [flagged] | | |
| ▲ | Blackthorn a day ago | parent [-] | | You've got to be seriously AI-drunk to equate letting your site be crawled by commercial scrapers with "contributing to humanity". Maybe you don't want your stuff to get thrown into the latest Silicon Valley commercial operation without getting paid for it. That seems like a valid position to take. Or maybe you just don't want Claude's ridiculously badly behaved scraper to chew through your entire budget. Regardless, scrapers that don't follow rules like robots.txt will pretty quickly discover why those rules exist in the first place as they receive increasing amounts of garbage. |
|
|