| ▲ | mrmattyboy 10 hours ago | |||||||||||||
I agree this doesn't seem too ambiguous - it's "you may do this.." and they said "or we may do the reverse". If I say you could prefix something.. the alternative isn't that you can suffix it. But also.. the programmers working on the software running one of the most important (end-user) DNS servers in the world: 1. Changes logic in how CNAME responses are formed 2. I assume some tests at least broke that meant they needed to be "fixed up" (y'know - "when a CNAME is queried, I expect this response") 3. No one saw these changes in test behavior and thought "I wonder if this order is important", or "We should research more into this", or "Are other DNS servers changing order", or "This should be flagged for a very gradual release". 4. Ends up in test environment for, what, a month.. nothing using getaddrinfo from glibc is being used to test this environment or anyone noticed that it was broken. Cloudflare seem to be getting into the swing of breaking things and then being transparent. But this really reads as a fun "did you know", not a "we broke things again - please still use us". There's no real RCA except to blame an RFC - but honestly, for a large-scale operation like theirs this seems very big to slip through the cracks. I would make a joke about South Park's oil "I'm sorry".. but they don't even seem to be | ||||||||||||||
| ▲ | bashook 3 hours ago | parent | next [-] | |||||||||||||
I was even more surprised to see that the RFC draft had original text from the author dating back to 2015. https://github.com/ableyjoe/draft-jabley-dnsop-ordered-answe... We used to say at work that the best way to get promoted was to be the programmer that introduced the bug into production and then fix it. Crazy if true here... | ||||||||||||||
| ▲ | jrochkind1 7 hours ago | parent | prev | next [-] | |||||||||||||
> I assume some tests at least broke that meant they needed to be "fixed up" OP said: "However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC." One could guess it's something like -- back when we wrote the tests, years ago, whoever did it missed that this was required, not helped by the fact that the spec preceded RFC 2119 standardizing the all-caps "MUST" "SHOULD" etc language, which would have helped us translate specs to tests more completely. | ||||||||||||||
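For what a test "asserting the behavior remains consistent" might look like, here is a minimal sketch. It assumes the answer section is available as a list of `(owner, type, rdata)` tuples; the function name `cname_chain_in_order` and the example hostnames are hypothetical and are not taken from Cloudflare's code or test suite.

```python
from typing import List, Tuple

# (owner name, record type, rdata). A simplified stand-in for a parsed answer section.
Record = Tuple[str, str, str]


def cname_chain_in_order(qname: str, answers: List[Record]) -> bool:
    """Return True if every record belongs to the name currently being
    followed, i.e. each CNAME appears before the records it points to."""
    expected = qname
    for owner, rtype, rdata in answers:
        if owner != expected:
            # A record showed up before the CNAME that leads to it.
            return False
        if rtype == "CNAME":
            # Follow the alias: later records must be owned by the target.
            expected = rdata
    return True


cname_first = [
    ("www.example.com", "CNAME", "cdn.example.net"),
    ("cdn.example.net", "A", "192.0.2.1"),
]
a_first = list(reversed(cname_first))
```

A regression test would then assert `cname_chain_in_order(qname, response)` for every CNAME fixture, so a change in record order fails loudly instead of being "fixed up".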
| ▲ | black3r 6 hours ago | parent | prev | next [-] | |||||||||||||
> 4. Ends up in test environment for, what, a month.. nothing using getaddrinfo from glibc is being used to test this environment or anyone noticed that it was broken "Testing environment" sounds to me like a real network that real user devices are used with (like the network used inside Cloudflare offices). That's what I would do if I was developing a DNS server anyway, other than unit tests (which obviously wouldn't catch this unless they were explicitly written for this case) and maybe integration/end-to-end tests, which might be running in Alpine Linux containers and as such using musl. If that's indeed the case, I can easily imagine how no one noticed anything was broken. First look at this line: > Most DNS clients don’t have this issue. For example, systemd-resolved first parses the records into an ordered set: Now think about what real end user devices are using: Windows/macOS/iOS obviously aren't using glibc and Android also has its own C library even though it's Linux-based, and they all probably fall under the "Most DNS clients don't have this issue.". That leaves GNU/Linux, where we could reasonably expect most software to use glibc for resolving queries, so presumably anyone using Linux on their laptop would catch this, right? Except most distributions started using systemd-resolved (most notable exception is Debian, but not many people use that on desktops/laptops), which is a local caching stub resolver, and as such acts as a middleman between glibc software and the network-configured DNS server, so it would resolve 1.1.1.1 queries correctly, and then return the results from its cache ordered by its own ordering algorithm. | ||||||||||||||
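To make the difference concrete, here is a rough sketch of the two client styles being contrasted. This is a simplified model only: it is not glibc's or systemd-resolved's actual code, and the function names, record tuples, and hostnames are all assumptions made for illustration. The first function walks the answer section once and only accepts records for the name it currently expects; the second indexes the records first and then follows the CNAME chain, so order no longer matters.

```python
from typing import List, Tuple

# (owner name, record type, rdata). A simplified stand-in for a parsed answer section.
Record = Tuple[str, str, str]


def resolve_sequential(qname: str, answers: List[Record]) -> List[str]:
    """Order-sensitive model: single pass, follows CNAMEs only as they appear."""
    expected = qname
    addrs: List[str] = []
    for owner, rtype, rdata in answers:
        if owner != expected:
            continue  # record for a name we are not (yet) looking for: skipped
        if rtype == "CNAME":
            expected = rdata
        elif rtype == "A":
            addrs.append(rdata)
    return addrs


def resolve_indexed(qname: str, answers: List[Record]) -> List[str]:
    """Order-insensitive model: index the CNAMEs, then follow the chain."""
    cnames = {owner: rdata for owner, rtype, rdata in answers if rtype == "CNAME"}
    name, seen = qname, set()
    while name in cnames and name not in seen:  # 'seen' guards against loops
        seen.add(name)
        name = cnames[name]
    return [rdata for owner, rtype, rdata in answers if rtype == "A" and owner == name]


cname_first = [
    ("www.example.com", "CNAME", "cdn.example.net"),
    ("cdn.example.net", "A", "192.0.2.1"),
]
a_first = list(reversed(cname_first))
```

Under this model the sequential client finds the address only when the CNAME comes first, while the indexed client finds it either way. A local cache in front of the sequential client that re-orders its answers would hide the problem in the same way.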
| ▲ | laixintao 3 hours ago | parent | prev | next [-] | |||||||||||||
Yes, at least they should test the glibc case. | ||||||||||||||
| ▲ | bpt3 9 hours ago | parent | prev [-] | |||||||||||||
> Ends up in test environment for, what, a month.. nothing using getaddrinfo from glibc is being used to test this environment or anyone noticed that it was broken This is the part that is shocking to me. How is getaddrinfo not called in any unit or system tests? | ||||||||||||||