spiderfarmer 5 hours ago

A small part. On my server AI bots outnumber real visitors 300 to one.

davidsojevic 4 hours ago | parent | next [-]

I don't mean that users are following the links to `acme.com` and `demo.com` type domains in documentation; I mean that bots are likely finding and following many links to them because of their widespread use in documentation.

If you search Google for `site:github.com "acme.com"`, you'll find numerous instances of the domain used in contrived links in documentation, as an example of how URLs might be structured on an arbitrary domain, and in issues, to demonstrate a fully qualified URL without giving away the domain people were actually using.

This means that numerous links are pointing to non-existent paths on `acme.com` because of the nature of how people are using them in documentation and examples.

danaris 3 hours ago | parent [-]

That is very possible.

But it is not necessary to see the results that are being described.

If a site like my tiny little browser game, with roughly 120 weekly unique users, is getting absolutely hammered by the scraper-bots (it was, last year, until I put the wiki behind a login wall; I still get a significant amount of bot traffic, but it's no longer enough to actually crash the game), then sites that people actually know and consider important, like acme.com, are very likely getting massive deluges of traffic purely from first-order hits.

dylan604 4 hours ago | parent | prev | next [-]

That's such an absolutely ludicrous thing to hear, in a "wtf are these people doing" kind of way. I can't imagine a non-social-media site generating enough new content that these bots would need to scrape it essentially continuously. It's just gross to me that they're okay with that level of unsophisticated effort, doing the same thing over and over with zero gain.

kjok 4 hours ago | parent | prev | next [-]

How are you measuring this? Does your solution rely on user agent or device fingerprinting? Curious to know what tools are available today and how accurate they are.

spiderfarmer 2 hours ago | parent [-]

I'm popular in Europe; there's no reason for people from Singapore, Russia, Brazil, and literally every other country in the world to all start visiting very old articles and comment permalinks en masse.

Having honeypot links is the only thing that helps, but I'm running into massive IP block lists that slow things down.
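For anyone curious, the honeypot idea can be sketched in a few lines: hide a link that no human should ever follow, then flag every IP that requests it. This is only an illustrative sketch, not spiderfarmer's actual setup; the trap path, the log format (standard combined access logs), and the choice to emit iptables commands are all my assumptions.

```python
import re

# Hypothetical trap path: linked invisibly (e.g. hidden via CSS and
# disallowed in robots.txt), so only ill-behaved crawlers request it.
HONEYPOT_PATH = "/trap/do-not-follow"

# Combined log format: client IP is the first field, the request line is quoted.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

def honeypot_ips(log_lines):
    """Return the set of client IPs that requested the honeypot path."""
    hits = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(2).startswith(HONEYPOT_PATH):
            hits.add(m.group(1))
    return hits

def iptables_commands(ips):
    """Build one DROP rule per flagged IP (returned as strings, not executed)."""
    return [f"iptables -A INPUT -s {ip} -j DROP" for ip in sorted(ips)]
```

The downside is exactly what's described above: the block list only ever grows, and a long chain of per-IP rules is slow to match, which is why larger setups move the list into an ipset or a CDN-level rule instead.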

This is not what I want to be doing with my time, and I can't afford the expensive specialised tools. I'm just a solo entrepreneur on a shoestring budget. I just want to improve the website for my 3k real users and 10k real daily guests, not for bots.

Lerc 4 hours ago | parent | prev [-]

Where from? And, quite frankly, why? There are existing training data sets that are large enough for smaller models, and larger models have been focusing on data quality more than quantity. There's limited utility in further indiscriminate widespread scraping.

danaris 3 hours ago | parent [-]

Tell that to the idiots doing the scraping.

Small site operators like us know very well that the utility they can get by scraping us is marginal at best. Based on their patterns of behavior, though, my best guess is that they've simply configured their bots to scrape absolutely everything, all the time, forever, as aggressively as possible, and to treat any attempt to indicate "hey, this data isn't useful to you" as an adversarial signal that the site operator is trying to hide things from them that are their God-given right.