It is incorrect to "normalize" // in HTTP URL paths


▲ It is incorrect to "normalize" // in HTTP URL paths(runxiyu.org)

42 points by pabs3 6 hours ago | 33 comments

▲ echoangle 2 hours ago | parent | next [-]

> Wait, are there any implementations that wrongly collapse double-slashes?

> nginx with merge_slashes

How can it be wrong if it is server-side? If the server wants to treat those paths equally, it can if it wants to.

It would only be wrong if a client does it and requests a different URL than the user entered, right?

▲ leni536 2 hours ago | parent | next [-]

It can't be. It's the same confusion as "email address normalization" being wrong (for example when gmail ignores dots when mapping an address to an inbox).

It matters where the normalization happens, and server-side behavior is out-of-scope of these identifier RFCs.

▲ cxr 27 minutes ago | parent | prev | next [-]

nginx is frequently used as a reverse proxy and not "the server" (or only to the extent that it's the client-facing server). Its defaults assume that it's fine to do a "normalization" pass to remove double slash, etc., even though that's potentially out of step with how the actual content/application server wishes to deal with those requests.

▲ echoangle 3 minutes ago | parent [-]

That’s purely a server side configuration issue and has nothing to do with web standards though. There’s nothing that says that the internal communication on the server needs to follow the standards for user agents. And at least according to this, the default setting is off so nginx actually is compliant unless you manually make it not be: https://www.oreilly.com/library/view/nginx-http-server/97817...

▲ OoooooooO an hour ago | parent | prev [-]

Yeah I would say that falls under the origin defining both paths as equivalent.

> Therefore, collapsing // to / in HTTP URL path segments is not correct normalization. It produces a different, non-equivalent identifier unless the origin explicitly defines those two paths as equivalent.

▲ MattJ100 4 hours ago | parent | prev | next [-]

URL parsing/normalisation/escaping/unescaping is a minefield. There are many edge cases where every implementation does things differently. This is a perfect example.

It gets worse if you are mapping URLs to a filesystem (e.g. for serving files). Even though they look similar, URL paths have different capabilities and rules than filesystems, and different filesystems also vary. This is also an example of that (I don't think most filesystems support empty directory names).
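To make the mismatch concrete, here is a minimal Python sketch, assuming a POSIX-style filesystem and using only the standard library. The URL is just an illustrative example. The URL path keeps its empty segment, while the filesystem helpers silently drop it:

```python
import posixpath
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# The URL path keeps the empty segment between the two slashes.
url_path = urlsplit("https://example.com/foo//bar").path
print(url_path)             # /foo//bar
print(url_path.split("/"))  # ['', 'foo', '', 'bar']

# POSIX path helpers treat repeated slashes as a single separator.
fs_path = posixpath.normpath(url_path)
fs_parts = PurePosixPath(url_path).parts
print(fs_path)              # /foo/bar
print(fs_parts)             # ('/', 'foo', 'bar')
```

So the collapse can happen simply because a path was handed to a filesystem API, without anyone deciding to normalize.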

▲ bryden_cruz an hour ago | parent | prev | next [-]

This exact ambiguity causes massive headaches when putting Nginx in front of a Spring Boot backend. Nginx defaults to merge_slashes on, so it silently 'fixes' the path. But Spring Security's strict firewall explicitly rejects URLs with // as a potential directory traversal vector and throws an error. It forces you to explicitly decide which layer in your infrastructure owns path normalization, because if Nginx passes it raw, the Java backend completely panics.

▲ jeroenhd 33 minutes ago | parent [-]

What I don't understand about this setup is why a double slash could ever be a directory traversal attack in Spring Boot. If you're proxying to another server that just assumes relative paths and doesn't do any kind of validation, I guess an extra / might cause reading files outside of the expected area? That'd be an extremely weird and awful setup that I don't think makes any sense in the context of Spring Boot.

▲ PunchyHamster 3 hours ago | parent | prev | next [-]

We cut those and a few others because historically there were exploits relying on them

Nothing on web is "correct", deal with it

▲ dale_glass 3 hours ago | parent | prev | next [-]

But maybe you should anyway.

Because maybe you use S3, which treats `foo/bar.txt` and `foo//bar.txt` as entirely separate things. Because to S3, directories don't exist and those are literally the exact names of the keys under which data is stored.

So you have script A concatenate "foo" + "/bar" and script B concatenate "foo/" + "/bar", and suddenly you have a weird problem.
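A minimal Python sketch of that failure mode, with a plain dict standing in for a flat key-value store (this is not the S3 API, and `make_key` is a hypothetical helper, not something from a real library):

```python
# A dict stands in for a flat key-value store where "/" has no special meaning.
store = {}

# Two scripts build "the same" key by string concatenation.
store["foo" + "/bar"] = b"written by script A"
store["foo/" + "/bar"] = b"written by script B"
print(sorted(store))  # ['foo//bar', 'foo/bar']


def make_key(*segments: str) -> str:
    """Hypothetical helper: callers pass bare segments, never pre-slashed strings.

    Exactly one "/" is placed between segments and nothing is stripped,
    so an empty or slash-containing segment is still preserved.
    """
    return "/".join(segments)


print(make_key("foo", "bar"))  # foo/bar
```

The point of the helper is only that every script agrees on who adds the separator. It does not collapse anything.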

I can't imagine a real use case where you'd think this is desirable.

▲ Mordisquitos an hour ago | parent | next [-]

> I can't imagine a real use case where you'd think this is desirable.

Not S3, but here's a literal real use case: the entry for the Iraqw word /ameeni (woman) in Wiktionary.

https://en.wiktionary.org/wiki//ameeni

If for whatever reason your S3 keys contained English words and their translations separated by a slash, you would have a real problem if one of your scripts were to concatenate woman, / and /ameeni as woman/ameeni instead of woman//ameeni in the English/Iraqw case.

▲ zarzavat 10 minutes ago | parent | next [-]

Sounds like a Unicode problem. U+002F is not a letter codepoint and it's not appropriate to use as a letter given its history of being used for path separation. Iraqw slash should have its own code point. Can they not just use a 3 like in Arabic?
▲ kstrauser an hour ago | parent | prev [-]

If you’re working with a use case where that’s even possible, you need to URL-encode it like `woman/%2Fameeni`.

Consider what happens if the language allowed trailing slashes. What would this path mean if ameeni/ happened to be a valid word? `ameeni//ameeni`

One of those would get the slash but it’s not clear which. W3C says:

> The slash ("/", ASCII 2F hex) character is reserved for the delimiting of substrings whose relationship is hierarchical.
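A minimal sketch of that encoding in Python, using `urllib.parse.quote` with `safe=""` so that a slash inside a segment is escaped. The helper names `encode_path` and `decode_path` are made up for illustration:

```python
from urllib.parse import quote, unquote


def encode_path(segments):
    """Percent-encode each segment, then join with the hierarchy delimiter."""
    return "/".join(quote(s, safe="") for s in segments)


def decode_path(path):
    """Split on the delimiter first, then decode each segment."""
    return [unquote(s) for s in path.split("/")]


print(encode_path(["woman", "/ameeni"]))   # woman/%2Fameeni
print(encode_path(["ameeni/", "ameeni"]))  # ameeni%2F/ameeni
print(encode_path(["ameeni", "/ameeni"]))  # ameeni/%2Fameeni
print(decode_path("woman/%2Fameeni"))      # ['woman', '/ameeni']
```

The order in `decode_path` matters: decoding before splitting would turn `%2F` back into a delimiter and lose the distinction.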

▲ realitylabs 26 minutes ago | parent | prev | next [-]

This exact issue has derailed our main document store for the past several years. We have written a couple supporting applications specifically to address the fallout from this issue.

▲ secondcoming 2 hours ago | parent | prev [-]

If a user of S3 knows that directories aren't real why would they expect directory-related normalisation to happen?

▲ leni536 2 hours ago | parent | prev | next [-]

I don't think it's incorrect for distinct paths to point to the same resource.

Of course you shouldn't assume that in a client. If you are implementing against an API don't deviate regarding // and trailing / from the API documentation.

▲ sfeng 3 hours ago | parent | prev | next [-]

What I’ve learned in doing this type of normalization is that whatever the specification says, you will always find some website that uses some insane URL tweak to decide what content it should show.

▲ domenicd an hour ago | parent | prev | next [-]

As some others have indirectly pointed out, this article conflates two things:

- URL parsing/normalization; and

- Mapping URLs to resources (e.g. file paths or database entries) to be served from the server, and whether you ever map two distinct URLs to the same resource (either via redirects or just serving the same content).

The former has a good spec these days: https://url.spec.whatwg.org/ tells you precisely how to turn a string (e.g., sent over the network via HTTP requests) into a normalized data structure [1] of (scheme, username, password, host, port, path, query, fragment). The article is correct insofar as the spec's path (which is a list of strings, for HTTP URLs) can contain empty string segments.

But the latter is much more wild-west, and I don't know of any attempt being made to standardize it. There are tons of possible choices you can make here:

- Should `https://example.com/foo//bar` serve the same resource as `https://example.com/foo/bar`? (What the article focuses on.)

- `https://example.com/foo/` vs. `https://example.com/foo`

- `https://example.com/foo/` vs. `https://example.com/FOO`

- `https://example.com/foo` vs. `https://example.com/fo%6f%` vs. `https://example.com/fo%6F%`

- `https://example.com/foo%2Fbar` vs. `https://example.com/foo/bar`

- `https://example.com/foo/` vs. `https://example.com/foo.html`
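One way to picture these choices is as explicit, origin-side policy switches. The sketch below is a hypothetical illustration, not any server's real behavior: `resource_key` and its flags are invented names, percent-encoding and the `.html` case are left out, and Python's `urllib.parse` is not a WHATWG URL parser.

```python
from urllib.parse import urlsplit


def resource_key(url, *, collapse_slashes=False,
                 ignore_trailing_slash=False, ignore_case=False):
    """Map a URL to the lookup key for a resource under an explicit policy."""
    # "/foo//bar" -> ["foo", "", "bar"]; "/foo/" -> ["foo", ""]
    segments = urlsplit(url).path.split("/")[1:]
    if collapse_slashes:
        # Drop empty segments, except a final one that marks a trailing slash.
        last = len(segments) - 1
        segments = [s for i, s in enumerate(segments) if s or i == last]
    if ignore_trailing_slash and segments and segments[-1] == "":
        segments = segments[:-1]
    if ignore_case:
        segments = [s.lower() for s in segments]
    return tuple(segments)


a = resource_key("https://example.com/foo//bar")
b = resource_key("https://example.com/foo/bar")
print(a, b)  # ('foo', '', 'bar') ('foo', 'bar')
```

With every flag off, all the pairs in the list above stay distinct; each equivalence only appears once the origin opts into it.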

Note that some things are normalized during parsing, e.g. `/foo\bar` -> `/foo/bar`, and `/foo/baz/../bar` -> `/foo/bar`. But for paths, very few.

Relatedly:

- For hosts, many more things are normalized during parsing. (This makes some sense, for security reasons.)

- For query, very little is normalized during parsing. But unlike for pathname, there is a standardized format and parser, application/x-www-form-urlencoded [2], that can be used to go further and canonicalize from the raw query string into a list of (name, value) string pairs.

Some discussions on the topic of path normalization, especially in terms of mapping the filesystem, in the URL Standard repo:

- https://github.com/whatwg/url/issues/552

- https://github.com/whatwg/url/issues/606

- https://github.com/whatwg/url/issues/565

- https://github.com/whatwg/url/issues/729

-----

[1]: https://url.spec.whatwg.org/#url-representation

[2]: https://url.spec.whatwg.org/#application/x-www-form-urlencod...

▲ mjs01 4 hours ago | parent | prev | next [-]

// is useful if the server needs to serve both static files in the filesystem, and embedded files like a webpage. // can be used for embedded files' URL because they will never conflict with filesystem paths.

▲ PunchyHamster 3 hours ago | parent [-]

....just serve it from other paths

▲ renewiltord 2 hours ago | parent | prev | next [-]

I’m going to keep doing it.

▲ janmarsal 2 hours ago | parent | prev | next [-]

i'm gonna do it anyway

▲ leni536 3 hours ago | parent | prev | next [-]

Wait until you try http:/example.com and http://////example.com in your browser.

▲ tremon 13 minutes ago | parent | next [-]

Your first example is a valid uri but not a valid http url, because it's missing a host part. Your second example is not a valid uri, as the spec requires that [scheme]:// is followed by a host indicator. Neither has much to do with / normalization, which applies to the path part of a valid uri.

▲ stanac 2 hours ago | parent | prev [-]

In both cases I get https://example.com/ in FF.

▲ WesolyKubeczek 4 hours ago | parent | prev [-]

It is probably “incorrect”, but given the established actual usage over the decades, it’s most likely what you need to do nevertheless.

Not doing it is like punishing people for not using Oxford commas, or entering an hour long debate each time someone writes “would of” instead of “would have”. It grinds my gears too, but I have different hills to die on.

▲ bazoom42 3 hours ago | parent | next [-]

If different clients do it differently, you have incompatibilities. This punishes everybody. Since normalizing // to / removes information which may be significant, the obviously correct choice is following the spec.

▲ PunchyHamster 3 hours ago | parent [-]

if it is significant, you coded your app wrong, plain and simple

▲ jeroenhd 3 hours ago | parent | next [-]

Of course not. It's an explicit feature of every specification. Plenty of websites rewrite paths like /a/b/c/d into a backend service call like /?w=a&x=b&y=c&z=d. In that scheme, /a//c/d would rewrite to /?w=a&x=&y=c&z=d, something entirely distinct from /a/c/d, which works out to /?w=a&x=c&y=d.

It's not the application's fault that the people attempting to configure web server URLs don't know how web server URLs work.
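A minimal Python sketch of such a positional rewrite, assuming the parameter names `w`, `x`, `y`, `z` from the example (the `rewrite` function itself is hypothetical):

```python
from urllib.parse import urlencode

# Assumed positional parameter names, taken from the example above.
PARAM_NAMES = ("w", "x", "y", "z")


def rewrite(path: str) -> str:
    """Map each path segment, by position, to a query parameter."""
    segments = path.split("/")[1:]  # "/a//c/d" -> ["a", "", "c", "d"]
    return "/?" + urlencode(list(zip(PARAM_NAMES, segments)))


print(rewrite("/a/b/c/d"))  # /?w=a&x=b&y=c&z=d
print(rewrite("/a//c/d"))   # /?w=a&x=&y=c&z=d
print(rewrite("/a/c/d"))    # /?w=a&x=c&y=d
```

Collapsing the double slash before this step shifts every later segment one position to the left, so the backend sees a different request.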
▲ bazoom42 3 hours ago | parent | prev [-]

Why?

▲ Etheryte 4 hours ago | parent | prev [-]

Not sure I agree. The correct thing is to not mess with the URL at all if you're unsure about what to be doing to it. Doing nothing is the easiest thing of them all, why not do that?

▲ j16sdiz 3 hours ago | parent [-]

Because you need some consistency or normalisation before applying ACLs or doing routing?

▲ jeroenhd 3 hours ago | parent [-]

URL normalization is defined and it doesn't include collapsing slashes. Note that you can include custom normalization rules (like collapsing slashes, tolower()ing the entire path, removing the query part of the URL), but that's not part of the standard. If you're doing anything extra, the risk of breaking stuff is on you.
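For reference, here is a sketch of the one path transformation that standard normalization does include, assuming the comment refers to RFC 3986's syntax-based normalization and its "remove dot segments" step (section 5.2.4). It is a straightforward Python transcription written for illustration, not a vetted implementation:

```python
def remove_dot_segments(path: str) -> str:
    """Sketch of RFC 3986 section 5.2.4: resolve "." and ".." segments only."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:end])
                path = path[end:]
    return "".join(output)


print(remove_dot_segments("/foo/baz/../bar"))  # /foo/bar
print(remove_dot_segments("/foo//bar"))        # /foo//bar
```

Dot segments are resolved, but the empty segment in `/foo//bar` passes through untouched.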