nonrandomstring a day ago

When I spoke to these guys [0] we touched on those quirks and foibles that make a signature (including TCP stack stuff beyond the control of any userspace app).

I love this curl, but I worry that if a component takes on the role of deception in order to "keep up", it accumulates a legacy of hard-to-maintain "compatibility" baggage.

Ideally it should just say... "hey I'm curl, let me in"

The problem of course lies with a server that is picky about dress codes, and that problem in turn is caused by crooks sneaking in disguised, so it's rather a circular chicken-and-egg thing.

[0] https://cybershow.uk/episodes.php?id=39

thaumasiotes a day ago | parent | next [-]

> Ideally it should just say... "hey I'm curl, let me in"

What? Ideally it should just say "GET /path/to/page".

Sending a user agent is a bad idea. That shouldn't be happening at all, from any source.

Tor3 20 hours ago | parent | next [-]

Since the first browser appeared I've always thought that sending a user agent ID was a really bad idea. It breaks with the fundamental idea of the web protocol: it's the server's responsibility to provide data and the client's responsibility to present it to the user. The server does not need to know anything about the client. Including the user agent in this whole thing was a huge mistake, as it allowed web site designers to code for specific quirks in browsers. I can to some extent accept a capability list from the client, but I'm not so sure even that is necessary.

nonrandomstring 17 hours ago | parent | prev [-]

Absolutely, yes! A protocol should not be tied to client details. Where did "User Agent" strings even come from?

darrenf 15 hours ago | parent [-]

They're in the HTTP/1.0 spec. https://www.rfc-editor.org/rfc/rfc1945#section-10.15

10.15 User-Agent

   The User-Agent request-header field contains information about the
   user agent originating the request. This is for statistical purposes,
   the tracing of protocol violations, and automated recognition of user
   agents for the sake of tailoring responses to avoid particular user
   agent limitations.
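
For illustration, a minimal sketch (Python standard library, not curl itself; the host, path, and User-Agent value are placeholders) of the same request sent bare and with the header RFC 1945 describes:

    import http.client

    conn = http.client.HTTPSConnection("example.com")

    # Bare request: http.client adds Host but no User-Agent of its own.
    conn.request("GET", "/path/to/page")
    resp = conn.getresponse()
    resp.read()  # drain the body so the connection can be reused
    print(resp.status)

    # Same request, but explicitly identifying the client.
    conn.request("GET", "/path/to/page", headers={"User-Agent": "curl/8.9.1"})
    resp = conn.getresponse()
    resp.read()
    print(resp.status)
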
immibis a day ago | parent | prev [-]

What should instead happen is that Chrome should stop sending as much of a fingerprint, so that sites won't be able to fingerprint it. That won't happen, since it's against Google's interests.

gruez a day ago | parent [-]

This is a fundamental misunderstanding of how TLS fingerprinting works. The "fingerprint" isn't from Chrome sending a "fingerprint: [random uuid]" attribute in every TLS negotiation. It's derived from various properties of the TLS stack, like which ciphers it can accept. You can't "stop sending as much of a fingerprint" without every browser agreeing on the same TLS stack. It's already minimal as it is, because there's basically no aspect of the TLS stack that users can configure, and Chrome bundles its own, so you'd expect every Chrome user to have the same TLS fingerprint. It's only really useful for distinguishing "fake" Chrome users (e.g. curl with a custom header set, or Firefox users with a user agent spoofer) from "real" Chrome users.
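
As a rough illustration, a JA3-style sketch (not the exact algorithm any particular vendor uses; the numeric IDs in the example are made up) of how a fingerprint falls out of the ClientHello itself rather than anything the client chooses to send:

    import hashlib

    def ja3_like_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
        # Each argument is the list of numeric IDs the client offered in
        # its ClientHello, in the client's own order.
        fields = [
            str(tls_version),
            "-".join(str(c) for c in ciphers),
            "-".join(str(e) for e in extensions),
            "-".join(str(c) for c in curves),
            "-".join(str(p) for p in point_formats),
        ]
        # The server hashes what it observed; the client never "sends" a
        # fingerprint, so there is nothing it can simply omit.
        return hashlib.md5(",".join(fields).encode()).hexdigest()

    # Illustrative values only; a real ClientHello lists far more entries.
    print(ja3_like_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23], [0]))

Clients built on the same TLS stack hash identically; anything else stands out.
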

RKFADU_UOFCCLEL 8 hours ago | parent | next [-]

What? Just fix the ciphers to a list of what's known to work plus some safety margin. Each user needing some different specific cipher (like a cipher for horses and one for dogs) is not a thing.

gruez 7 hours ago | parent [-]

>Just fix the ciphers to a list of what's known to work + some safety margin.

That's already the case. The trouble is that NSS (what Firefox uses) doesn't support the same cipher suites as BoringSSL (what Chrome uses?).

dochtman a day ago | parent | prev [-]

Part of the fingerprint is stuff like the ordering of extensions, which Chrome could easily randomize but AFAIK doesn't.

(AIUI Google’s Play Store is one of the biggest TLS fingerprinting culprits.)

shiomiru a day ago | parent | next [-]

Chrome has randomized its ClientHello extension order for two years now.[0]

The companies to blame here are solely the ones employing these fingerprinting techniques, and those relying on services of these companies (which is a worryingly large chunk of the web). For example, after the Chrome change, Cloudflare just switched to a fingerprinter that doesn't check the order.[1]

[0]: https://chromestatus.com/feature/5124606246518784

[1]: https://blog.cloudflare.com/ja4-signals/

nonrandomstring a day ago | parent | next [-]

> blame here are solely the ones employing these fingerprinting techniques,

Sure. And it's a tragedy. But when you look at the bot situation and the sheer magnitude of resource abuse out there, you have to see it from the other side.

FWIW, in the conversation mentioned above we acknowledged that, and moved on to talk about behavioural fingerprinting and why it makes sense not to focus on the browser/agent alone but on what gets done with it.

NavinF a day ago | parent [-]

Last time I saw someone complaining about scrapers, they were talking about 100 GiB/month. That's about 300 kbps: less than $1/month in IP transit and ~$0 in compute. Personally I've never noticed bots show up on a resource graph. As long as you don't block them, they won't bother using more than a few IPs, and they'll back off when they're throttled.
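
Back-of-the-envelope check of that number, assuming the 100 GiB is spread evenly over a 30-day month:

    gib = 100 * 2**30              # bytes in 100 GiB
    month = 30 * 24 * 3600         # seconds in ~30 days
    print(gib * 8 / month / 1e3)   # ~331 kbps average
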

marcus0x62 a day ago | parent | next [-]

For some sites, things are a lot worse. See, for example, Jonathan Corbet's report[0].

0 - https://social.kernel.org/notice/AqJkUigsjad3gQc664

lmz a day ago | parent | prev | next [-]

How can you say it's $0 in compute without knowing if the data returned required any computation?

nonrandomstring 17 hours ago | parent | prev [-]

Didn't rachelbythebay post recently that her blog was being swamped? I've heard that from a few self-hosting bloggers now. And Wikipedia has recently said more than half of its traffic is now bots. Are you claiming this isn't a real problem?

fc417fc802 a day ago | parent | prev [-]

> The companies to blame here are solely the ones employing these fingerprinting techniques,

Let's not go blaming vulnerabilities on those exploiting them. Exploitation is also bad but being exploitable is a problem in and of itself.

shiomiru 15 hours ago | parent | next [-]

> Let's not go blaming vulnerabilities on those exploiting them. Exploitation is also bad but being exploitable is a problem in and of itself.

There's "vulnerabilities" and there's "inherent properties of a complex protocol that is used to transfer data securely". One of the latter is that metadata may differ from client to client for various reasons, inside the bounds accepted in the standard. If you discriminate based on such metadata, you have effectively invented a new proprietary protocol that certain existing browsers just so happen to implement.

It's like the UA string, but instead of just copying a single HTTP header, new browsers now have to reverse engineer the network stack of existing ones to get an identical user experience.

fc417fc802 15 hours ago | parent [-]

I get that. I don't condone the behavior of those doing the fingerprinting. But what I'm saying is that the fact that it is possible to fingerprint should in pretty much all cases be viewed as a sort of vulnerability.

It isn't necessarily a critical vulnerability. But it is a problem on some level nonetheless. To the extent possible you should not be leaking information that you did not intend to share.

A protocol that can be fingerprinted is similar to a water pipe with a pinhole leak. It still works, it isn't (necessarily) catastrophic, but it definitely would be better if it wasn't leaking.

Jubijub 3 hours ago | parent | prev [-]

I'm sorry, but your comment shows you've never had to fight this problem at scale. The challenge is not small-time crawlers; the challenge is blocking large / dedicated actors. The approach is simple: if there is more than X volume of traffic per <aggregation criterion>, block it (sketched at the end of this comment). The problem: most aggregation criteria are trivially spoofable, or very cheap to change:

- IP: with IPv6 it's not an issue to rotate your IP often

- UA: changing this is scraping 101

- SSL fingerprint: easy to use the same one as everyone else

- IP stack fingerprint: also easy to use a common one

- request / session tokens: it's cheap to create a new session

You can force login, but then you have a spam account creation challenge, with the same issues as above, and depending on your infra this can become heavy.

Add to this that the minute you use a signal for detection, you "burn" it, as adversaries will avoid using it, and you lose measurement, and thus the ability to know whether you are fixing the problem at all.

I worked on this kind of problem for a FAANG service; whoever claims it's easy has clearly never had to deal with motivated adversaries.
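
A minimal sketch of the per-key threshold rule described above (fixed-window counting; the window, threshold, and choice of key are illustrative, not from any real system):

    import time
    from collections import defaultdict

    WINDOW = 60        # seconds per counting window
    THRESHOLD = 100    # max requests per key per window

    counts = defaultdict(int)
    window_start = time.monotonic()

    def allow(aggregation_key: str) -> bool:
        # Return True if this request stays under the per-key threshold.
        global window_start
        now = time.monotonic()
        if now - window_start > WINDOW:
            counts.clear()
            window_start = now
        counts[aggregation_key] += 1
        # The weakness described above: if the key is an IP, UA, TLS
        # fingerprint, or session token, the client can rotate it cheaply
        # and reset its own count.
        return counts[aggregation_key] <= THRESHOLD
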

gruez a day ago | parent | prev [-]

What's the advantage of randomizing the order, when all Chrome users already have the same order? Practically speaking there are a bazillion other ways to fingerprint Chrome besides TLS extension ordering, so it's not worth adding random mitigations like this.