Most carrier/enterprise/hardware IPv4 routers, particularly those on the internet, will not actually perform IPv4 fragmentation on behalf of client traffic even though the IPv4 standard allows it. Typically fragmentation is left to boxes which already have another reason to care about it (such as needing to NAT or inspect the packets) or to the client endpoints themselves. I.e. the internet will (barring security middleboxes) let arbitrary IPv4 fragments through, but it won't typically turn an 8000-byte packet into 6 fragments to fit through a 1500-byte MTU limitation on behalf of the clients. E.g. if you send a 1500-byte IPv4 ping without DF set to a cellular modem, or to someone with a DSL modem using PPPoE, it'll almost always get dropped by the carrier rather than fragmented.
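For anyone who wants to check the "8000 bytes into 6 fragments" arithmetic, here is a minimal sketch. It assumes a 20-byte IPv4 header with no options, and the function name `ipv4_fragment_sizes` is made up for illustration. The 1492 in the second call is the usual PPPoE MTU.

```python
from typing import List


def ipv4_fragment_sizes(total_len: int, mtu: int, header_len: int = 20) -> List[int]:
    """Return the total length (header + payload) of each resulting fragment.

    Assumes a fixed header_len on every fragment (no IPv4 options).
    """
    if total_len <= mtu:
        return [total_len]  # fits as-is, no fragmentation needed

    # Every fragment except the last must carry a multiple of 8 payload bytes,
    # because the fragment offset field counts in 8-byte units.
    per_fragment = (mtu - header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any payload")

    remaining = total_len - header_len
    sizes = []
    while remaining > 0:
        payload = min(per_fragment, remaining)
        sizes.append(header_len + payload)
        remaining -= payload
    return sizes


print(ipv4_fragment_sizes(8000, 1500))  # five full-size fragments plus a short tail
print(ipv4_fragment_sizes(1500, 1492))  # the PPPoE case: one full fragment, one tiny one
```

The PPPoE case is the annoying one: a full 1500-byte packet would need a second fragment carrying only 8 bytes of payload.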

Of course nothing is stopping you from labbing it up at home. Firewalls and software routers can usually be made to do refragmentation.

ay 8 months ago | parent [-]

Of course, on the carrier boxes fragmentation is also not done inline, so its behavior will depend on how aggressive the CoPP configuration is, and it will be subject to the same pitfalls as ICMP Packet Too Big generation.

Thanks for keeping me straight here!

Based on the admittedly old study at [0], it seems that some carriers indeed just don’t bother to fragment - but far from all of them.

Firewalls might do virtual reassembly, so the trick with the initial fragment won’t fly there.

This MTU subject is interesting to me because I have a little work-in-progress experiment: https://gerrit.fd.io/r/c/vpp/+/41914/1/src/plugins/pvti/pvti... (the code itself is already in, but it still has a few crashy bugs and I need to make it not suck performance-wise). It is my attempt to revisit the issue of MTU for the tunnel use case. The thesis is that keeping the 5-tuple will make “chunking”/“de-chunking” at the tunnel endpoints much, much simpler.
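To make the “chunking”/“de-chunking” idea concrete, here is a toy sketch. It is not the PVTI wire format: the 8-byte header (packet id, chunk index, chunk count) and the names `chunk` and `Dechunker` are assumptions invented for illustration. The point is only that each chunk travels as an ordinary, complete datagram on the same 5-tuple, so nothing on the path ever sees an IP fragment.

```python
import struct
from typing import Dict, List, Optional

# Hypothetical chunk header: packet id (uint32), chunk index (uint16), chunk count (uint16).
HEADER = struct.Struct("!IHH")


def chunk(packet_id: int, packet: bytes, max_datagram: int) -> List[bytes]:
    """Split one inner packet into datagram payloads of at most max_datagram bytes."""
    room = max_datagram - HEADER.size
    if room <= 0:
        raise ValueError("max_datagram too small for the chunk header")
    pieces = [packet[i:i + room] for i in range(0, len(packet), room)] or [b""]
    return [HEADER.pack(packet_id, index, len(pieces)) + piece
            for index, piece in enumerate(pieces)]


class Dechunker:
    """Collect chunks (in any order) and return the inner packet once complete."""

    def __init__(self) -> None:
        self._pending: Dict[int, Dict[int, bytes]] = {}

    def feed(self, datagram: bytes) -> Optional[bytes]:
        packet_id, index, count = HEADER.unpack_from(datagram)
        parts = self._pending.setdefault(packet_id, {})
        parts[index] = datagram[HEADER.size:]
        if len(parts) < count:
            return None  # still waiting for more chunks of this packet
        del self._pending[packet_id]
        return b"".join(parts[i] for i in range(count))


inner = bytes(range(256)) * 10            # a 2560-byte inner packet
datagrams = chunk(7, inner, 1400)         # 1400 is an arbitrary outer payload budget
receiver = Dechunker()
results = [receiver.feed(d) for d in reversed(datagrams)]  # deliver out of order
print(len(datagrams), results[-1] == inner)
```

A real implementation would also need timeouts for incomplete packets and some bound on pending state, both of which this sketch skips.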

The source of inspiration was a very practical setup at [1], which, while looking horrible in theory (locally fragmented GRE over L2TP), actually gives decent performance with a 1500-byte end-to-end MTU over the tunnel.

The open question is which inner MTU will be sane, taking into account the increased probability of loss with a bigger inner MTU… Intuitively, something like ~2.5K should just double the loss probability (because each inner packet then takes 2 outer packets) and might be a workable compromise in 2025….
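A back-of-the-envelope check of that intuition, as a sketch. It assumes outer packets are lost independently with the same probability, which real networks with bursty loss will not honor exactly. The 1400-byte outer payload budget and the function names are made-up values for illustration.

```python
import math


def outer_packets_needed(inner_mtu: int, outer_payload: int) -> int:
    """Number of outer packets needed to carry one full-size inner packet."""
    return math.ceil(inner_mtu / outer_payload)


def inner_loss_probability(p_outer: float, inner_mtu: int, outer_payload: int) -> float:
    """Probability that an inner packet is lost, i.e. at least one of its chunks is lost.

    Assumes independent, identically distributed loss of outer packets.
    """
    n = outer_packets_needed(inner_mtu, outer_payload)
    return 1 - (1 - p_outer) ** n


for inner_mtu in (1400, 2500, 8000):
    n = outer_packets_needed(inner_mtu, 1400)
    loss = inner_loss_probability(0.01, inner_mtu, 1400)
    print(f"inner MTU {inner_mtu}: {n} outer packet(s), loss {loss:.4f}")
```

For small loss rates, 1 - (1 - p)^n is approximately n * p, so two outer packets per inner packet does come out at roughly double the loss.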

One could also do the same trick over QUIC, of course, but I wanted something tiny and easier to experiment with - and the ability to go over IPsec or WireGuard as well as a secured underlay.

[0] https://labs.ripe.net/author/emileaben/ripe-atlas-packet-siz...

[1] https://github.com/ayourtch/linode-ipv6-tunnel

zamadatix 8 months ago | parent [-]

Very interesting! It's like the best of both sides of the fragment-pre-encrypt (everything appears as single-packet 5-tuples to middleboxes) vs. fragment-post-encrypt (transported packet data remains untouched) debate seen in IPsec deployments.

Like you mention, you could do this under QUIC, but then you'd be hamstrung by some of the design mandates such as encryption. This is way better as it's just datagrams doing your one goal - hiding that you're transporting fragments.

ay 8 months ago | parent [-]

Yeah, that was precisely the set of trade offs :-)

OTOH, I heard folks calling to banish the “no messing with a flow within 5-tuple” principle, so my hack may not have an overly long shelf life.

zamadatix 8 months ago | parent [-]

Next up: Everything just ends up being QUIC because you can't fuck with what you can't see inside :).

ay 8 months ago | parent [-]

Potentially. However, anecdotally a lot of service providers subject UDP to stricter rate limiting than TCP because of its unauthenticated nature, so there is a back-pressure factor there. Also: RFC 9000 for QUIC is almost 50% longer than RFC 9293, the new one for TCP - so I would expect the implementation to be more complex as well? In the absence of that, everything will go over HTTP :-)