Show HN: Continuous Nvidia CUDA PC Sampling Profiler

Blog post about how we extended our open source profiler to include support for continuous production PC sampling.

saagarjha 2 hours ago

Honest question: I feel like kernels are usually short enough that you can fully understand their performance during development, before you even deploy them. If you get different results in production, that suggests to me that you didn’t spend enough time understanding what’s going on earlier. Are there things you genuinely can’t get from this workflow?

	SyzygyRhythm 26 minutes ago
		Sometimes you have to optimize other people's code. Also, sometimes code behaves unexpectedly depending on the data, say, once it crosses a certain size threshold. And sometimes it behaves differently on different hardware. You don't always find these things out until production.

killamdiaz 4 days ago

Very cool project.

Curious whether the biggest value has been performance debugging itself or helping developers understand system behavior they otherwise wouldn't have visibility into.

Sometimes the observability layer ends up being more valuable than the optimization layer.

	gnurizen 4 days ago
		Thanks! I think most performance debugging happens during development. What we're bringing to the table is exposure of system behavior in production, which often diverges from dev because the shape of the workloads changes; dev workloads are often simplistic and synthetic. So I'd say it's the combination of late-stage performance debugging and production observability that makes this useful. Stay tuned for a follow-up post where we show how we used this to optimize an FSST decompression kernel for vortex (https://github.com/vortex-data/vortex).