xnx 3 days ago

It's an arms race. Websites have become stupidly/unnecessarily/hostilely complicated, but AI/LLMs have made it possible (though more expensive) to get whatever useful information exists out of them.

Soon, LLMs will be able to complete any CAPTCHA a human can within a reasonable time. When that happens, the "analog hole" may be open permanently. If you can point a camera and a microphone at it, the AI will be able to make better sense of it than a person.

Gigachad 3 days ago

The future will just be that every web session gets tied to a real ID, and if the service detects you as a bot, you get blocked by that ID.

wraptile 3 days ago

> The future will just be every web session gets tied to a real ID

This seems like an awful future. We already had this in the form of limited IPv4 addresses, where each IP is basically an identity. People started buying up IP addresses and selling them as proxies, so any other form of ID would suffer the same fate unless it were enforced at the government level.

Worst case, we have 10,000 people sitting in front of screens clicking page links, because hiring someone to use their "government ID" to mindlessly browse the web is the only way left to get data off the public web. That's not the future we should want.

xnx 3 days ago

I definitely agree logins will be required for many more sites, but how would a site distinguish humans from bots controlling the browser? CAPTCHAs are almost obsolete, and ARC-AGI-style puzzles are too cumbersome to solve on every visit.

Gigachad 3 days ago

Small-scale usage at the level of a normal person would probably fly under the radar, but if you try scraping, running multiple accounts, or posting more than a normal user would, it'll be picked up once they can link all actions to a real person.

If you're just asking Siri to load a page for you, that probably gets tolerated. Maybe very sensitive sites will go verified-mobile-platform-only, and Apple/Google will provide some kind of AI-free compute environment, like how they can block screen recording or custom ROMs today.

Yes, it's 100% the death of the free and open computing environment, but CAPTCHAs are no longer going to be sufficient. Blocking bots seems realistic if you're willing to fully lock everything down.

xnx 3 days ago

The next frontier is entire fake personas that log in and scrape sites ... which is why government/real-world verification will be required soon.

goku12 3 days ago

Please remember that an LLM accessing a website isn't the problem here. It's the scraping bots that saturate server bandwidth (a DoS attack of sorts) to collect data to train LLMs on. An LLM solving a CAPTCHA or an Anubis-style proof-of-work challenge isn't a big concern, because the worst it will do with the collected data is cache it for later analysis and reporting. Unlike the crawlers, LLMs have no incentive to suck up huge amounts of data like a giant vacuum cleaner.
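For reference, an Anubis-style proof of work makes the client burn CPU finding a nonce whose hash clears a difficulty target, while the server verifies with a single hash. Here's a minimal sketch in Python, assuming a SHA-256 leading-zero-bits scheme; the difficulty value and function names are illustrative, not Anubis's actual protocol:

    import hashlib
    import secrets

    DIFFICULTY = 20  # assumed: required number of leading zero bits

    def leading_zero_bits(digest: bytes) -> int:
        """Count the leading zero bits of a hash digest."""
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
            else:
                bits += 8 - byte.bit_length()
                break
        return bits

    def issue_challenge() -> str:
        """Server side: hand the client a random challenge string."""
        return secrets.token_hex(16)

    def solve(challenge: str) -> int:
        """Client side: brute-force a nonce until the hash meets the target."""
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
            if leading_zero_bits(digest) >= DIFFICULTY:
                return nonce
            nonce += 1

    def verify(challenge: str, nonce: int) -> bool:
        """Server side: one cheap hash vs. ~2**DIFFICULTY hashes to solve."""
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        return leading_zero_bits(digest) >= DIFFICULTY

    challenge = issue_challenge()
    nonce = solve(challenge)          # costly for the client
    assert verify(challenge, nonce)   # cheap for the server

The cost scales as roughly 2**DIFFICULTY hashes per page fetch, which is negligible for one human visitor but adds up fast for a crawler hammering millions of URLs.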

TeMPOraL 3 days ago

Scraping was a thing before LLMs; there's a whole separate arms race around it for ordinary competitive and "industrial espionage" reasons. I'm not really sure why model training would become a noticeable fraction of scraping activity - there are only a few players on the planet that can afford to train decent LLMs in the first place, and they're not going to re-scrape content they already have ad infinitum.

int_19h 3 days ago

> they're not going to re-scrape the content they already have

That's true for static content, but much of the web is forums and similar places where the main value is the constant stream of new content - which has to be re-scraped.

a96 2 days ago

If only sites agreed to put a machine-readable URL somewhere that lists all items by date. Like a site summary or a syndication stream. And maybe a "map" of a static site. It would be so easy to share their updates with other interested systems.
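(That machine-readable list exists, of course: RSS/Atom feeds and sitemap.xml.) A minimal sketch of polling such a feed with only the Python standard library; the feed URL is hypothetical, and real code would also handle Atom, timeouts, and conditional requests (ETag / If-Modified-Since) to spare the server:

    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://example-forum.invalid/threads.rss"  # hypothetical

    def fetch_new_items(url: str, seen: set[str]) -> list[tuple[str, str]]:
        """Return (title, pubDate) pairs for items not seen on earlier polls."""
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        new = []
        for item in root.iterfind("channel/item"):
            # The <guid> (or <link>) uniquely identifies an item, so a poller
            # can fetch only what's new instead of re-scraping every page.
            guid = item.findtext("guid") or item.findtext("link") or ""
            if guid in seen:
                continue
            seen.add(guid)
            new.append((item.findtext("title", ""), item.findtext("pubDate", "")))
        return new

    seen: set[str] = set()
    for title, date in fetch_new_items(FEED_URL, seen):
        print(date, title)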

int_19h a day ago

Why should they agree to make life even easier for people doing something they don't want them to do?