| ▲ | spiderfarmer 5 hours ago | |||||||
A small part. On my server AI bots outnumber real visitors 300 to one. | ||||||||
| ▲ | davidsojevic 4 hours ago | parent | next [-] | |||||||
I don't mean that users are following the links to `acme.com` and `demo.com` type domains in documentation; I mean that bots are likely finding and following many links to them because of their widespread use in documentation. If you search for `site:github.com "acme.com"` in Google, you'll find numerous instances of the domain being used in contrived links in documentation as an example of how URLs might be structured on an arbitrary domain and also in issues to demonstrate a fully qualified URL without giving away the actual domain people were using. This means that numerous links are pointing to non-existent paths on `acme.com` because of the nature of how people are using them in documentation and examples. | ||||||||
| ▲ | dylan604 4 hours ago | parent | prev | next [-] | |||||||
That's such an absolutely ludicrous thing to hear, in a "wtf are these people doing" kind of way. I can't imagine a non-social media site generating enough traffic that these bots need to be doing essentially continuous scraping. It's just gross to me that anyone is okay with that level of unsophisticated effort, where they just do the same thing over and over with zero gain. | ||||||||
| ▲ | kjok 4 hours ago | parent | prev | next [-] | |||||||
How are you measuring this? Does your solution rely on user agent or device fingerprinting? Curious to know what tools are available today and how accurate they are. | ||||||||
| ▲ | Lerc 4 hours ago | parent | prev [-] | |||||||
Where from? And quite frankly, why? There are existing training data sets that are large enough for smaller models. Larger models have been focusing on data quality more than quantity. There's limited utility to further indiscriminate widespread scraping. | ||||||||