| ▲ | A Higgs-Bugson in the Linux Kernel (blog.janestreet.com) |
| 217 points by Ne02ptzero 2 days ago | 52 comments |
| |
|
| ▲ | gnfargbl a day ago | parent | next [-] |
| Calling this a "Higgs-Bugson" doesn't make a lot of sense. There's nothing uncertain or difficult to reproduce about the Higgs. The reason that it took so long to find was that the cross-section of production is very low, the decay signatures are hard to separate from the background, the specific energy scale it existed at was not well-defined, and building the LHC was (to put it mildly) difficult and expensive. Roughly, if you'll forgive a bad analogy from a long-lapsed physicist, it was the equivalent of trying to find a very weak glow from a specific type of bug hiding at an unknown location in a huge field of corn. Except that your vision was very bad, so you had to invent a new type of previously-unimaginably excellent eyeglasses to see the thing. Also before you could even start looking you had to expend a painful amount of time and money building a flashlight so incredibly huge that it needed new types of cryogenic cooling inventing, just to stop it from melting when you switched it on. If you had a software bug that you were almost certain was there, but you needed half of the world's GPU clusters for three years to locate and prove it, then that would be a Higgs-Bugson. |
| |
| ▲ | lisper a day ago | parent [-] | | A bug that shows up in production but goes away when you try to debug it is usually called a Heisenbug. |
|
|
| ▲ | alexpotato a day ago | parent | prev | next [-] |
| Regarding NFS, I've always loved this quote from the CTO at a hedge fund I once worked at: "NFS is a lot like heroin: at first, it seems amazing. But then it ruins your life" (This is a place that did almost EVERYTHING via NFS including different applications communicating via shared files on NFS mounts. It also had the weird setup of using BOTH Linux AND Windows permissions on NFS mounts shared between user desktops [windows] and servers [linux]) |
| |
| ▲ | stavros a day ago | parent [-] | | The problem I have with reviews like these is that they're expressed in absolute terms. Yes, NFS might ruin my life, but if it ruins my life less than every other alternative, it's still a win. | | |
| ▲ | eqvinox a day ago | parent | next [-] | | I'd go as far as saying most networked concurrent file access will ruin your life one way or another, because it's just a hard problem, and it's trying to solve it at a very odd layer; a "classic" fs can't really take advantage of higher layer transactional or other known constraints in order to make things work better… | | |
| ▲ | fragmede 17 hours ago | parent [-] | | Google Docs solved the problem at the right layer then. | | |
| ▲ | eqvinox 14 hours ago | parent [-] | | No, Google Docs solved a different problem at the right layer. Their solution isn't transferable to other specific problems that may currently be approached using networked file systems, let alone the generic case. |
|
| |
| ▲ | burnt-resistor 18 hours ago | parent | prev | next [-] | | The devil is in the details, the devil you know is preferable, and there's yet no perfectly angelic systems or code (because of the widespread allergy to formal methods and job security).. which will lead to less evil, but still imperfect systems. | |
| ▲ | johncolanduoni a day ago | parent | prev [-] | | Continuing the analogy, many people eventually discover that they used NFS because they didn’t understand their underlying problem clearly. |
|
|
|
| ▲ | hglee a day ago | parent | prev | next [-] |
| https://lists.openwall.net/linux-kernel/2025/03/19/1374 |
| |
| ▲ | penguin_booze a day ago | parent [-] | | I wish developers--new and old alike--paid attention to the commit messages that go into the kernel. Granted, it takes a subject matter expert to really understand what's being said, but the general format and layout of commit messages is instructive. Commit messages help the reader/reviewer get their bearings; they also help to build the case from the bottom up. The fact that the development team is globally distributed both necessitates this kind of knowledge serialization and preserves it for posterity. It's completely different from tapping a colleague sitting next to you on the shoulder, and saying "psst, can you approve this quick? It's just a bunch of fixes". | | |
|
|
| ▲ | emmelaich a day ago | parent | prev | next [-] |
| Kerberos is 'fun'. I had to manage a system which used Kerberos to provide authentication between a Rubik's cube of various Windows flavours with various crypto standards, Linux machines of various versions and Java versions and apps of various maturity. It was an ever present source of weird behaviour and I had to bury myself in the innards of all these systems. I know, not directly related to the article. Just needed to vent bitterly. |
|
| ▲ | vrnvu a day ago | parent | prev | next [-] |
| I'd like to highlight this: >NFS with Kerberos secure, simple, battle tested. no crazy architecture works so well a bug showed up in the kernel :-) |
| |
| ▲ | eqvinox a day ago | parent | next [-] | | > works so well a bug showed up in the kernel :-) What exactly are you trying to highlight here? Most code has bugs. This one is someone forgetting to stick to actual behavior described in 1997, it's a mistake, mistakes happen. Which one of "secure", "simple", "battle tested" and "no crazy architecture" do you think this disproves? Or do you think CIFS or Ceph have no bugs? | | |
| ▲ | gyesxnuibh a day ago | parent [-] | | I think they're saying typically the kernel is one of the last places you'd expect the bug, so it shows that it is battle tested? I don't think they're being snarky. | | |
| ▲ | eqvinox a day ago | parent [-] | | I didn't really read it as snarky, I just straight up don't understand what they mean (and maybe why that smiley is there?) | | |
| ▲ | vrnvu 20 hours ago | parent [-] | | By "no crazy architecture" I meant it avoids the modern trend of building monstrous data platforms on top of data meshes, event buses, and layers of cloud abstractions. The kind I sometimes see, hence the smiley :-) |
|
|
| |
| ▲ | burnt-resistor 18 hours ago | parent | prev [-] | | The Linux kernel needs to adopt better testing methodologies because they're almost entirely reliant on meatcloud CI rather than provably-correct code with invariant contracts. |
|
|
| ▲ | protocolture a day ago | parent | prev | next [-] |
| I love the term "Higgs Bugson". It's much better than what I usually do which is just call a system haunted. |
| |
| ▲ | GTP a day ago | parent | next [-] | | I was used to the more common Heisenbug, but I find Higgs-Bugson more funny. | |
| ▲ | burnt-resistor 18 hours ago | parent | prev | next [-] | | There is no magic at any time. All behavior lives somewhere. | |
| ▲ | EarlKing a day ago | parent | prev [-] | | Haunted? Hell, it's positively possessed. |
|
|
| ▲ | mikeyg a day ago | parent | prev | next [-] |
| anyone else feel like the linux kernel release quality has come down a bit here in the 2020s? i feel like it hasn't been this bad since the mid 90s. anecdotally in the past couple years, i've experienced a data corruption bug in xfs, wonky wifi firmware/kernel regressions, graphics artifacts and hard crashes in amdgpu. my experience with mainline releases before 2020 has been that they're rock solid. i'd doubt myself before i doubted the kernel. i say all this with a deep appreciation for everyone and the work they're doing... my intuition says that the complexity of it all is reaching a tipping point that is finally overwhelming the ages old release engineering processes. |
|
| ▲ | anonymousiam a day ago | parent | prev | next [-] |
| "The normal timeout logic can take care of retransmission in the unlikely case that one is needed." NFS can be run over TCP or UDP. Does the retransmission occur when using UDP? |
| |
| ▲ | ninjha a day ago | parent [-] | | Yes! The retransmission logic in Linux NFS is independent of transport (see the `retrans` option in `mount.nfs`). Weirdly enough this also means that if you’re running with TCP you can have retransmits at the NFS/SunRPC level and at the TCP level depending on your configuration. |
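To make the RPC-level half of that concrete, here is a minimal Python sketch of the retry schedule as I read nfs(5): `timeo` is given in tenths of a second, and over TCP the client backs off linearly, adding `timeo` after each retransmission up to a 600-second cap. The function name and the `cap_seconds` parameter are made up for illustration. This only models the documented mount-option behaviour, not the kernel's SunRPC code, and the UDP path behaves differently and is not modelled.

```python
from typing import List


def rpc_wait_schedule(timeo_ds: int, retrans: int, cap_seconds: float = 600.0) -> List[float]:
    """Return the wait (in seconds) before each RPC-level retransmission.

    Hypothetical helper: assumes the linear backoff that nfs(5) describes
    for NFS over TCP, where the timeout grows by `timeo` after every
    retransmission and is capped at `cap_seconds`.
    """
    base = timeo_ds / 10.0  # the timeo mount option is in deciseconds
    return [min(base * (attempt + 1), cap_seconds) for attempt in range(retrans)]


if __name__ == "__main__":
    # timeo=600,retrans=2 is what I believe the TCP defaults to be.
    for i, wait in enumerate(rpc_wait_schedule(600, 2), start=1):
        print(f"retransmission {i}: after waiting {wait:.0f}s")
```

With those assumed values the RPC layer waits 60 s and then 120 s, while TCP's own retransmissions happen independently underneath each of those waits, which is the double-retransmit situation described above.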
|
|
| ▲ | nycerrrrrrrrrr a day ago | parent | prev | next [-] |
| Curious why they're using NFSv3 instead of v4? |
|
| ▲ | sedatk a day ago | parent | prev | next [-] |
| > A higgs-bugson is a bug that is reported in practice but difficult to reproduce This was the first time I heard of "higgs-bugson". The term sounded so forced that I had to know how it differed from Heisenbug. In short, it doesn't[1]. Then why does it even exist? The term somehow made it to the "Heisenbug" Wikipedia page[1], so I checked the sources. There were two and both end up at the same site: Jeff Atwood's blog post[2], which quotes some StackOverflow answers to a poll-like question ("what's a programming term you coined?") that was deleted because he wanted to remove lighthearted content from the site, as he thought it clashed with SO's mission of educating people and advancing their skills[3]. There was a proposal on Meta StackExchange about undeleting that question with the answers, but it was refused by Jeff Atwood again because it invited "made up stuff"[4] among other reasons. So Wikipedia, in the end, has this term on the Heisenbug page because someone just blurted out something in 2010, it was copy-pasted to a blog, and then got scooped up by some news outlet. There are no other sources. Kagi doesn't find any instances of the term before it was coined on StackOverflow in 2010. For all we know, "gingerbreadboy" from England invented it. The irony is that the term somehow made it to the literature -hence the blog post here- because someone was just having fun at StackOverflow. It evidently either sounded good or clicked well enough that others started using it. StackOverflow deleted the content that actually made a small part of computer science history because it wasn't "serious". In other words, StackOverflow cut off one of its history-making parts because it had an incomplete and simplistic view of useful. I think it might be possible to draw a line from their understanding of communities and societal dynamics to the downfall of StackOverflow after the emergence of AI[5].
[1] https://en.wikipedia.org/wiki/Heisenbug [2] https://blog.codinghorror.com/new-programming-jargon/ [3] https://stackoverflow.blog/2010/01/04/stack-overflow-where-w... [4] https://meta.stackexchange.com/questions/122164/can-we-un-de... [5] https://blog.pragmaticengineer.com/stack-overflow-is-almost-... |
| |
| ▲ | dh2022 a day ago | parent | next [-] | | I think Heisenbug refers to a bug that stops repro’ing during debugging (the act of observing the system changes the system behavior). This bug was different: it was very rare and debugging it didn’t make it go away. | |
| ▲ | zahlman a day ago | parent | prev | next [-] | | > because he wanted to remove lighthearted content from the site as he thought it clashed with SO's mission of educating people and advancing their skills[3]. No; he wanted to remove discussion and socialization, because it clashed with SO's mission of presenting useful information without parsing through others' discussion. https://meta.stackexchange.com/questions/2950 https://meta.stackexchange.com/questions/19665 https://meta.stackexchange.com/questions/92107 https://meta.stackexchange.com/questions/131009 > In other words, StackOverflow cut off one of its history-making parts because it had an incomplete and simplistic view of useful. How does this in any way demonstrate that the view of usefulness was "incomplete" or "simplistic"? How is the deleted content "useful"? > I think it might be possible to draw a line from their understanding of communities and societal dynamics to the downfall of StackOverflow after the emergence of AI[5]. What downfall? Before you point at any of the incoming-question-rate statistics: why should they be interpreted as representing a "downfall"? That is, why is it actually bad if fewer questions are asked? Before you answer that, keep in mind that Stack Overflow already has more than three times as many publicly visible questions about programming as Wikipedia has articles about literally anything notable. | | |
| ▲ | robertlagrant a day ago | parent [-] | | > why should they be interpreted as representing a "downfall"? I agree, but also SO has certainly gone through ups and downs. It does feel as though it's now in a terminal "down" having invested its limited resources in things lots of the dedicated members didn't seem to want, instead of basic improvements to moderation and to chat features. |
| |
| ▲ | bux93 a day ago | parent | prev | next [-] | | Like they say, "Stop trying to make fetch happen!" | |
| ▲ | chris_wot a day ago | parent | prev [-] | | Yeah, stack overflow is dying, we all know it. |
|
|
| ▲ | rurban 21 hours ago | parent | prev | next [-] |
| NFS is usually only used in mixed linux/windows environments. The easiest fix is to avoid NFS and esp. Windows. NFS alone is nightmare enough, Windows is just insanity. |
|
| ▲ | jwillp 11 hours ago | parent | prev | next [-] |
| The article is well-written and I even learned a few things. I'm glad for Nikhil's persistence troubleshooting it and fixing the bug upstream.
Thanks, Nikhil! |
|
| ▲ | alienbaby 21 hours ago | parent | prev | next [-] |
| I always thought this was rather called a 'Heisenbug'? |
| |
| ▲ | 12_throw_away 20 hours ago | parent [-] | | I think the distinction the author is making is that: - a Heisenbug is stochastic and potentially non-local, but - a Higgs-Bugson is a bug that is known to exist (in prod) but is extremely hard to observe in the lab (during dev/testing). I can see it being a useful distinction. A bug that I can't even reproduce at all sits in a different ring of hell from, say, a memory corruption bug that makes the program crash on random unrelated lines of code. |
|
|
| ▲ | Havoc a day ago | parent | prev | next [-] |
| Didn't know jane street did tech writeups |
| |
|
| ▲ | konsalexee a day ago | parent | prev | next [-] |
| TIL higgs-bugson and Heisenbug |
|
|
| ▲ | snvzz a day ago | parent | prev | next [-] |
| With millions of LoCs, it is no surprise there are bugs. Worse yet, the kernel runs in supervisor mode. This kernel design is bankrupt. There's much better available, such as seL4+Genode. |
| |
| ▲ | eqvinox a day ago | parent | next [-] | | Please try keeping your snide comments to issues they actually apply to. This is a logic bug, with the kernel missing a piece of abnormality handling. You can get the exact same bug in a microkernel (or, FWIW, a memory safe, e.g. Rust) implementation; neither of those concepts help here. | | |
| ▲ | snvzz 15 hours ago | parent [-] | | >You can get the exact same bug in a microkernel Absolutely. And yet, it is that much easier to keep a tiny codebase bug-free. And only that tiny codebase has to run with supervisor privileges. | | |
| ▲ | eqvinox 14 hours ago | parent [-] | | Of course a tiny microkernel code base won't have NFS bugs. It doesn't implement NFS. The bug will instead be in the NFS process/daemon/service/… which considering it's an fs service won't exactly be unprivileged either, even if only by returning maliciously corrupted contents. (e.g. a SUID root file that should not exist.) And, sure, a microkernel could have better security properties. However, (1) this has no connection at all to this specific bug, and (2) the Linux kernel seems to be doing reasonably well on security properties; or rather the industry seems to have decided it's sufficiently secure, even if not perfect. | | |
| ▲ | snvzz 6 hours ago | parent [-] | | Not only is the damage contained, but it is also much easier to protect an isolated NFS server. For instance, instead of being able to read/write/jump literally anywhere in memory, it would only have capabilities to the resources it needs. And these capabilities would be enforced strictly, by the bug-free microkernel. The likes of seL4 even have formal proof of correctness. | | |
| ▲ | eqvinox 2 hours ago | parent [-] | | And you are still making these arguments on the discussion of a bug that they have absolutely no bearing on. If Linux were written with the same exact development history, but as a microkernel, the exact same bug could (and likely would) exist in the NFS client component. The impact is spurious unavailability of service, and would be the same on a microkernel; it is not exploitable for memory corruption. And any file system service, by its function, will be in a position of relative privilege, even if less so on a microkernel. Your arguments are likely valid, with other bugs. Please take them there. Wedging this discussion in here just makes you look like a proselytizing zealot. |
|
|
|
| |
| ▲ | burnt-resistor 18 hours ago | parent | prev | next [-] | | seL4 exhibited great advances in software engineering processes and advances in correctness, zero-copy microkernel IPC performance, and capabilities-based security, but these need explanation, adaptation, and evangelism to real-world use-cases like Linux. Microkernels have severe limitations when it comes to transactional boundaries of calling multiple subsystems and rolling back on failure. Linux has too much inertia to reinvent itself instantly or completely into XYZ. What would add more value would be gradual conversion to Rust and adding formal verification to C and Rust like specifying invariants in comments/metadata like frama-c and/or flux. PS: Religious judgement opinion wars are rarely constructive. | |
| ▲ | eddythompson80 a day ago | parent | prev | next [-] | | seL4+Genode is equally as bankrupt. I run my code in the SMM anyway. | |
| ▲ | lotharcable a day ago | parent | prev [-] | | > This kernel design is bankrupt. There's much better available, such as seL4+Genode. I am sure that the tech community would love to read the details of your great success in deploying microkernels for a large variety of production workloads. |
|
|
| ▲ | jxjnskkzxxhx a day ago | parent | prev [-] |
| Content marketing for Jane street. |