| ▲ | jchw a day ago |
| I'm rooting for Ladybird to gain traction in the future. Currently, it is using cURL proper for networking. That is probably going to have some challenges (I think cURL is still limited in some ways, e.g. I don't think it can do WebSockets over h2 yet), but on the other hand, having a rising browser engine might eventually remove this avenue for fingerprinting, since legitimate traffic will have the same fingerprint as stock cURL. |
|
| ▲ | rhdunn a day ago | parent | next [-] |
| It would be good to see Ladybird's cURL usage improve cURL itself, such as the WebSocket over h2 example you mention. It is also a good test of cURL, identifying what functionality it is missing w.r.t. real-world browser workflows. |
|
| ▲ | userbinator a day ago | parent | prev | next [-] |
| > but on the other hand, having a rising browser engine might eventually remove this avenue for fingerprinting
If what I've seen from CloudFlare et al. is any indication, it's the exact opposite --- the amount of fingerprinting and "exploitation" of implementation-defined behaviour has increased significantly in the past few months, likely in an attempt to kill off other browser engines; the incumbents do not like competition at all. The enemy has been trying to spin it as "AI bots DDoSing", but one wonders how much of that was their own doing... |
| |
| ▲ | SoftTalker 20 hours ago | parent | next [-] | | It's entirely deliberate. CloudFlare could certainly distinguish low-volume but legit web browsers from bots, as much as they can distinguish chrome/edge/safari/firefox from bots. That is if they cared to. | |
| ▲ | hansvm a day ago | parent | prev | next [-] | | Hold up, one of those things is not like the other. Are we really blaming webmasters for 100x increases in costs from a huge wave of poorly written and maliciously aggressive bots? | | |
| ▲ | refulgentis a day ago | parent | next [-] | | > Are we really blaming... No, they're discussing increased fingerprinting / browser profiling recently and how it affects low-market-share browsers. | | |
| ▲ | hansvm a day ago | parent [-] | | I saw that, but I'm still not sure how this fits in: > The enemy has been trying to spin it as "AI bots DDoSing" but one wonders how much of that was their own doing... I'm reading that as `enemy == fingerprinters`, `that == AI bots DDoSing`, and `their own == webmasters, hosting providers, and CDNs (i.e., the fingerprinters)`, which sounds pretty straightforwardly like the fingerprinters are responsible for the DDoSing they're receiving. That interpretation doesn't seem to match the rest of the post though. Do you happen to have a better one? | | |
| ▲ | userbinator 21 hours ago | parent [-] | | "their own" = CloudFlare and/or those who have vested interests in closing up the Internet. | | |
|
| |
| ▲ | jillyboel 8 hours ago | parent | prev [-] | | Your costs only went up 100x if you built your site poorly | | |
| ▲ | hansvm 3 hours ago | parent [-] | | I'll bite. How do you serve 100x the traffic without 100x the costs? It costs something like 1e-10 dollars to serve a recipe page with a few photos, for example. If you serve it 100x more times, how does that not scale up? | | |
| ▲ | jillyboel 2 hours ago | parent [-] | | It might scale up but if you're anywhere near efficient you're way overprovisioned to begin with. The compute cost should be miniscule due to caching and bandwidth is cheap if you're not with one of the big clouds. As an example, according to dang HN runs on a single server and yet many websites that get posted to HN, and thus receive a fraction of the traffic, go down due to the load. |
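For what "minuscule due to caching" can look like in practice, here is a minimal sketch: render once, then let ETag/Cache-Control turn repeat hits into empty 304s or CDN cache hits. The page content, port, and cache lifetime below are made up for illustration, not a recipe-site implementation.

    import hashlib
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = b"<html><body>A recipe page with a few photos</body></html>"  # rendered once
    ETAG = '"%s"' % hashlib.sha256(PAGE).hexdigest()[:16]

    class CachedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Clients that revalidate get an empty 304 instead of the full body.
            if self.headers.get("If-None-Match") == ETAG:
                self.send_response(304)
                self.send_header("ETag", ETAG)
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("ETag", ETAG)
            # Lets any CDN or shared cache in front absorb most repeat traffic.
            self.send_header("Cache-Control", "public, max-age=86400")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)

    if __name__ == "__main__":
        HTTPServer(("", 8000), CachedHandler).serve_forever()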
|
|
| |
| ▲ | cyanydeez 19 hours ago | parent | prev [-] | | I don't think they're doing this to kill off browser engines; they're trying to sift browsers into "user" and "AI slop", so they can prioritize users. This is entirely a web crawler 2.0 apocalypse. | |
| ▲ | nicman23 18 hours ago | parent | next [-] | | man i just want a bot to buy groceries for me | | |
| ▲ | baq 17 hours ago | parent [-] | | That’s one of the few reasons to leave the house. I’d like dishes and laundry bots first, please. | | |
| ▲ | dodslaser 15 hours ago | parent [-] | | You mean dishwashers and washing machines? | | |
| ▲ | baq 13 hours ago | parent [-] | | Yes, but no. I want a robot to load and unload those. | | |
| ▲ | dec0dedab0de 10 hours ago | parent [-] | | I have been paying my local laundromat to do my laundry for over a decade now, it's probably cheaper than you're imagining and sooo worth it. | |
| ▲ | baq 7 hours ago | parent [-] | | my household is 6 people, it isn't uncommon to run 3 washing machine loads in a day and days without at least one are rare. I can imagine the convenience, but at this scale it sounds a bit unreasonable. dishwasher runs at least once a day, at least 80% full, every day, unless we're traveling. |
|
|
|
|
| |
| ▲ | extraduder_ire 10 hours ago | parent | prev [-] | | I think "slop" only refers to the output of generative AI systems. "Bot", "crawler", "scraper", or "spider" would be a more apt term for software making (excessive) requests to collect data. |
|
|
|
| ▲ | nonrandomstring a day ago | parent | prev | next [-] |
When I spoke to these guys [0] we touched on those quirks and foibles
that make a signature (including TCP stack stuff beyond the control of any
userspace app).
I love curl, but I worry that if a component takes on the role of
deception in order to "keep up", it accumulates a legacy of hard-to-maintain
"compatibility" baggage. Ideally it should just say... "hey, I'm curl, let me in".
The problem of course lies with a server that is picky about dress
codes, and that problem in turn is caused by crooks sneaking in
disguise, so it's rather a circular chicken-and-egg thing.
[0] https://cybershow.uk/episodes.php?id=39 |
| |
| ▲ | thaumasiotes a day ago | parent | next [-] | | > Ideally it should just say... "hey I'm curl, let me in" What? Ideally it should just say "GET /path/to/page". Sending a user agent is a bad idea. That shouldn't be happening at all, from any source. | | |
| ▲ | Tor3 20 hours ago | parent | next [-] | | Since the first browser appeared I've always thought that sending a user agent id was a really bad idea. It breaks with the fundamental idea of the web protocol: that it's the server's responsibility to provide data and the client's responsibility to present it to the user. The server does not need to know anything about the client. Including the user agent in this whole thing was a huge mistake, as it allowed web site designers to code for specific quirks in browsers.
I can to some extent accept a capability list from the client, but I'm not so sure even that is necessary. | |
| ▲ | nonrandomstring 17 hours ago | parent | prev [-] | | Absolutely, yes! A protocol should not be tied to client
details. Where did "User Agent" strings even come from? | | |
| ▲ | darrenf 16 hours ago | parent [-] | | They're in the HTTP/1.0 spec. https://www.rfc-editor.org/rfc/rfc1945#section-10.15 10.15 User-Agent The User-Agent request-header field contains information about the
user agent originating the request. This is for statistical purposes,
the tracing of protocol violations, and automated recognition of user
agents for the sake of tailoring responses to avoid particular user
agent limitations.
|
|
| |
| ▲ | immibis a day ago | parent | prev [-] | | What should instead happen is that Chrome should stop sending as much of a fingerprint, so that sites won't be able to fingerprint. That won't happen, since it's against Google's interests. | | |
| ▲ | gruez a day ago | parent [-] | | This is a fundamental misunderstanding of how TLS fingerprinting works. The "fingerprint" isn't from chrome sending a "fingerprint: [random uuid]" attribute in every TLS negotiation. It's derived from various properties of the TLS stack, like what ciphers it can accept. You can't "stop sending as much of a fingerprint" without every browser agreeing on the same TLS stack. It's already minimal as it is, because there's basically no aspect of the TLS stack that users can configure, and chrome bundles its own, so you'd expect every chrome user to have the same TLS fingerprint. It's only really useful to distinguish "fake" chrome users (e.g. curl with a custom header set, or firefox users with a user agent spoofer) from "real" chrome users. | |
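To make "derived from various properties of the TLS stack" concrete, here is a minimal JA3-style sketch: the fingerprint is just a hash over parameters the client already advertises in its ClientHello. The numeric values below are made up for illustration; a real fingerprinter parses them off the wire.

    import hashlib

    def ja3_style_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
        """Hash of ClientHello parameters, JA3-style: fields joined by commas,
        values within a field joined by dashes, then MD5."""
        fields = [
            str(tls_version),
            "-".join(str(c) for c in ciphers),
            "-".join(str(e) for e in extensions),
            "-".join(str(c) for c in curves),
            "-".join(str(p) for p in point_formats),
        ]
        return hashlib.md5(",".join(fields).encode()).hexdigest()

    # The same ciphers offered in a different order produce a different hash,
    # which is why ordering alone can distinguish one TLS stack from another.
    print(ja3_style_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))
    print(ja3_style_fingerprint(771, [4866, 4865, 4867], [0, 23, 65281], [29, 23, 24], [0]))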
| ▲ | RKFADU_UOFCCLEL 9 hours ago | parent | next [-] | | What? Just fix the ciphers to a list of what's known to work + some safety margin. Each user needing some different specific cipher (like a cipher for horses, and one for dogs) is not a thing. | |
| ▲ | gruez 7 hours ago | parent [-] | | >Just fix the ciphers to a list of what's known to work + some safety margin. That's already the case. The trouble is that NSS (what firefox uses) doesn't support the same cipher suites as boringssl (what chrome uses?). |
| |
| ▲ | dochtman a day ago | parent | prev [-] | | Part of the fingerprint is stuff like the ordering of extensions, which Chrome could easily randomize but AFAIK doesn't. (AIUI Google's Play Store is one of the biggest TLS fingerprinting culprits.) | |
| ▲ | shiomiru a day ago | parent | next [-] | | Chrome has randomized its ClientHello extension order for two years now.[0] The companies to blame here are solely the ones employing these fingerprinting techniques, and those relying on services of these companies (which is a worryingly large chunk of the web). For example, after the Chrome change, Cloudflare just switched to a fingerprinter that doesn't check the order.[1] [0]: https://chromestatus.com/feature/5124606246518784 [1]: https://blog.cloudflare.com/ja4-signals/ | | |
| ▲ | nonrandomstring a day ago | parent | next [-] | | > blame here are solely the ones employing these fingerprinting techniques,
Sure. And it's a tragedy. But when you look at the bot situation and the sheer magnitude of resource abuse out there, you have to see it from the other side.
FWIW, in the conversation mentioned above we acknowledged that and moved on to talk about behavioural fingerprinting, and why it makes sense not to focus on the browser/agent alone but on what gets done with it. | |
| ▲ | NavinF a day ago | parent | next [-] | | Last time I saw someone complaining about scrapers, they were talking about 100 GiB/month. That's ~300 kbps. Less than $1/month in IP transit and ~$0 in compute. Personally I've never noticed bots show up on a resource graph. As long as you don't block them, they won't bother using more than a few IPs and they'll back off when they're throttled. | |
| ▲ | marcus0x62 a day ago | parent | next [-] | | For some sites, things are a lot worse. See, for example, Jonathan Corbet's report[0]. 0 - https://social.kernel.org/notice/AqJkUigsjad3gQc664 | |
| ▲ | lmz a day ago | parent | prev | next [-] | | How can you say it's $0 in compute without knowing if the data returned required any computation? | |
| ▲ | nonrandomstring 17 hours ago | parent | prev [-] | | Didn't rachelbythebay post recently that her blog was being swamped?
I've heard that from a few self-hosting bloggers now. And Wikipedia
has recently said more than half of its traffic is now bots. Are you
claiming this isn't a real problem? |
|
| |
| ▲ | fc417fc802 a day ago | parent | prev [-] | | > The companies to blame here are solely the ones employing these fingerprinting techniques, Let's not go blaming vulnerabilities on those exploiting them. Exploitation is also bad but being exploitable is a problem in and of itself. | | |
| ▲ | shiomiru 15 hours ago | parent | next [-] | | > Let's not go blaming vulnerabilities on those exploiting
them. Exploitation is also bad but being exploitable is a problem in and
of itself. There's "vulnerabilities" and there's "inherent properties of a complex
protocol that is used to transfer data securely". One of the latter is
that metadata may differ from client to client for various reasons,
inside the bounds accepted in the standard. If you discriminate based
on such metadata, you have effectively invented a new proprietary
protocol that certain existing browsers just so happen to implement. It's like the UA string, but instead of just copying a single HTTP
header, new browsers now have to reverse engineer the network stack of
existing ones to get an identical user experience. | | |
| ▲ | fc417fc802 15 hours ago | parent [-] | | I get that. I don't condone the behavior of those doing the fingerprinting. But what I'm saying is that the fact that it is possible to fingerprint should in pretty much all cases be viewed as a sort of vulnerability. It isn't necessarily a critical vulnerability. But it is a problem on some level nonetheless. To the extent possible you should not be leaking information that you did not intend to share. A protocol that can be fingerprinted is similar to a water pipe with a pinhole leak. It still works, it isn't (necessarily) catastrophic, but it definitely would be better if it wasn't leaking. |
| |
| ▲ | Jubijub 3 hours ago | parent | prev [-] | | I'm sorry, but your comment shows you never had to fight this problem at scale. The challenge is not small-time crawlers; the challenge is blocking large / dedicated actors. The problem is simple: if there is more than X volume of traffic per <aggregation criteria>, block it (a naive version is sketched below).
Problem: most aggregation criteria are trivially spoofable, or very cheap to change:
- IP : with IPv6 it is not an issue to rotate your IP often
- UA : changing this is scraping 101
- SSL fingerprint : easy to use the same one as everyone else
- IP stack fingerprint : also easy to use a common one
- request / session tokens : it's cheap to create a new session
You can force login, but then you have a spam account creation challenge, with the same issues as above, and depending on your infra this can become heavy. Add to this that the minute you use a signal for detection, you "burn" it, as adversaries will avoid using it, and you lose measurement and thus the ability to know whether you are fixing the problem at all. I worked on this kind of problem for a FAANG service; whoever claims it's easy clearly never had to deal with motivated adversaries |
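As a concrete illustration of the "more than X volume of traffic per <aggregation criteria>" rule and why it breaks down, a naive sliding-window limiter might look like the sketch below. The key, window, and threshold are made-up values; the point is that the key is cheap for an adversary to rotate.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 100  # the "X volume of traffic" per key, per window

    hits = defaultdict(deque)  # aggregation key -> recent request timestamps

    def should_block(key, now=None):
        """key is whatever aggregation criterion you chose: IP, UA,
        TLS fingerprint, session token... all of them spoofable."""
        now = time.time() if now is None else now
        window = hits[key]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        window.append(now)
        return len(window) > MAX_REQUESTS

    # An adversary who rotates the key (fresh IPv6 address, new UA string,
    # new session) starts from an empty window, defeating the limiter.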
|
| |
| ▲ | gruez a day ago | parent | prev [-] | | What's the advantage of randomizing the order, when all chrome users already have the same order? Practically speaking there are a bazillion ways to fingerprint Chrome besides TLS cipher ordering, so it's not worth adding random mitigations like this. |
|
|
|
|
|
| ▲ | johnisgood 14 hours ago | parent | prev | next [-] |
| I used to call it "cURL", but apparently officially it is curl, correct? |
| |
| ▲ | bdhcuidbebe 7 hours ago | parent | next [-] | | I'd guess Daniel pronounces it as "kurl", with a hard C like in "crust", since he's Swedish. |
| ▲ | cruffle_duffle 11 hours ago | parent | prev [-] | | As in “See-URL”? I’ve always called it curl but “see url” makes a hell of a lot of sense too! I’ve just never considered it and it’s one of those things you rarely say out loud. | | |
| ▲ | johnisgood 10 hours ago | parent [-] | | I prefer cURL as well, but according to official sources it is curl. :D Not sure how it is pronounced though, I pronounce it as "see-url" and/or "see-U-R-L". It might be pronounced as "curl" though. |
|
|
|
| ▲ | eesmith a day ago | parent | prev | next [-] |
| I'm hoping this means Ladybird might support ftp URLs. |
| |
|
| ▲ | devwastaken 9 hours ago | parent | prev [-] |
| ladybird does not have the resources to be a contender to current browsers. it's well marketed but has no benefits or reason to exist over chromium. it's also a major security risk as it is designed yet again in demonstrably unsafe c++. |