A kernel bug froze my machine: Debugging an async-profiler deadlock (questdb.com)
111 points by bluestreak a day ago | 18 comments
SerCe 21 hours ago
Great article! Just yesterday I watched a Devoxx talk by Andrei Pangin [1], the creator of async-profiler, where I learned about the new heatmap support. To many folks it might not sound that exciting, until you realise that these heatmaps make it much easier to see patterns over time. If you’re interested, there’s a solid blog post [2] from Netflix that walks through the format and why it can be incredibly useful.
[1]: https://www.youtube.com/watch?v=u7-S-Hn-7Do
[2]: https://netflixtechblog.com/netflix-flamescope-a57ca19d47bb
ChuckMcM 21 hours ago
Question, isn't this a bug?

      static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
      {
    -     if (event->state != PERF_EVENT_STATE_ACTIVE)
    +     if (event->state != PERF_EVENT_STATE_ACTIVE ||
    +         event->hw.state & PERF_HES_STOPPED)
              return HRTIMER_NORESTART;

The bug being that the precedence of || is higher than the precedence of != ? Consider writing it

    if ((event->state != PERF_EVENT_STATE_ACTIVE) ||
        (event->hw.state & PERF_HES_STOPPED))

This coming from a person who has too many scars from not parenthesizing my expressions in conditionals to ensure they work the way I meant them to work.
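For reference, a minimal C sketch of how the two forms in the comment above compare. In standard C, `!=` binds tighter than `&`, which in turn binds tighter than `||`, so the unparenthesized condition groups as `(a != b) || (c & d)`. The names `STATE_ACTIVE`, `STATE_INACTIVE`, `HES_STOPPED` and all three functions are hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel constants; values are assumptions. */
enum { STATE_INACTIVE = 0, STATE_ACTIVE = 1 };
#define HES_STOPPED 0x01u

/* Same shape as the patched condition: no extra parentheses. */
static bool stop_unparenthesized(int state, unsigned hw_state)
{
    return state != STATE_ACTIVE || hw_state & HES_STOPPED;
}

/* Fully parenthesized form suggested in the comment. */
static bool stop_parenthesized(int state, unsigned hw_state)
{
    return (state != STATE_ACTIVE) || (hw_state & HES_STOPPED);
}

/* Asserts that both forms agree for every combination tried here. */
static void check_forms_agree(void)
{
    const int states[] = { STATE_INACTIVE, STATE_ACTIVE };
    for (unsigned i = 0; i < 2; i++) {
        for (unsigned hw = 0; hw < 4; hw++) {
            assert(stop_unparenthesized(states[i], hw) ==
                   stop_parenthesized(states[i], hw));
        }
    }
}
```

Calling `check_forms_agree()` from a test or a `main` exercises both forms. The extra parentheses are still a reasonable readability choice. The grouping that does bite in C is a comparison used as an operand of `&`, as in `x & MASK != 0`, which parses as `x & (MASK != 0)`.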
everlier 20 hours ago
I'm glad to hear I'm not alone. Due to the nature of what I do, I'm often accumulating ~800-900GB of Docker images and volumes on my machine, sometimes running 20-30 containers at once starting/stopping them concurrently. Somehow, very rarely, but still quite often (once every couple of weeks) - it leads to a complete deadlock somewhere inside of the kernel due to some crazy race condition that I'm absolutely in no way able to reliably reproduce.
Artoooooor 9 hours ago
Ah, this is the bug that froze the system when Minecraft was running with Spark profiler mod!
broken_broken_ 11 hours ago
Nice article, thank you. Did you also consider using bpftrace while debugging? I do not have much experience with it, but I think you can see the kernel call stack with it and I know you can also see the return value (in eax). That would be less effort than QEMU + gdb + disabling kernel ASLR, etc.
bluuewhale 17 hours ago
Great write-up. This kind of "debugging journey" post is gold.
jerrinot 21 hours ago
Author here. I've always been kernel-curious despite never having worked on one myself. Consider this either a collection of impractical party tricks or a hands-on way to get a feel for kernel internals.
snvzz 20 hours ago
Great debugging effort. Now, with the complexity (MLoCs!) of the Linux kernel, this is definitely not the only bug to be found in there. This is why Linux is just an interim kernel for those use cases in which we still cannot use seL4[0].