nickysielicki 14 days ago

I don’t mean to call you or your pseudocode out specifically, but I see this sort of thing all the time, and I just want to put it out there:

PSA: if you ever see code trying to measure timing and it’s not using the CUDA event APIs, it’s fundamentally wrong and is lying to you. The simplest way to be sure you’re not measuring noise is to just ban the usage of any other timing source. Definitely don’t add unnecessary syncs just so that you can add a timing tap.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART_...
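
A rough illustration of what that looks like from Python, assuming PyTorch (torch.cuda.Event is a thin wrapper over the cudaEvent API linked above); the shapes and loop count here are made-up placeholders:

    import torch

    # Minimal sketch, assuming PyTorch and a CUDA device.
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()                 # timestamp is taken on the GPU stream itself
    for _ in range(10):
        c = a @ b
    end.record()
    end.synchronize()              # wait only for the 'end' event, not the whole device

    print(f"{start.elapsed_time(end) / 10:.3f} ms per matmul")  # elapsed_time() is in ms

The point is that the timestamps are recorded on the stream, in order with the work being measured, so no extra host-side synchronization has to be injected into the hot path.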

bee_rider 14 days ago | parent | next [-]

If I have a mostly CPU code and I want to time the scenario: “I have just a couple subroutines that I am willing to offload to the GPU,” what’s wrong with sprinkling my code with normal old python timing calls?

If I don’t care which part of the CUDA ecosystem is taking the time (from my point of view it is a black box that does GEMMs), why not just measure “time until my normal code is running again”?
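
Something like this, assuming PyTorch stands in for the offload (names and sizes are just placeholders); all I want is the host-side round trip:

    import time
    import torch

    # Sketch of the "black-box" measurement: time until my normal code runs again.
    a = torch.randn(8192, 8192, device="cuda")
    b = torch.randn(8192, 8192, device="cuda")

    t0 = time.perf_counter()
    c = (a @ b).cpu()              # .cpu() copies the result back, so control does
                                   # not return to my code until the GPU work is done
    t1 = time.perf_counter()
    print(f"offload round trip: {(t1 - t0) * 1e3:.1f} ms")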

nickysielicki 14 days ago | parent [-]

If you care enough to time it, you should care enough to time it correctly.

bee_rider 14 days ago | parent | next [-]

I described the correct way to time it when using the card as a black-box accelerator.

nickysielicki 14 days ago | parent [-]

You can create metrics for whatever you want! Go ahead!

But CUDA is not a black-box math accelerator. You can stupidly treat it as such, but that doesn’t make it one. It’s an entire ecosystem with drivers and contexts and lifecycles. If everything you’re doing is synchronous, and/or you don’t mind your metrics including totally unrelated costs, then time.time() is fine, sure. But if that’s the case, you’ve got bigger problems.

bee_rider 14 days ago | parent | next [-]

Sure, it’s easy to say “there are bigger problems.” There are always bigger problems.

But, there are like 50 years worth of Fortran numerical codes out there, and lots of them just use RCIs (reverse communication interfaces)… if I want to try CUDA in some existing library, I guess I will need the vector back before I can go back into the RCI.

doctorpangloss 14 days ago | parent | prev [-]

You're arguing with people who have no idea what they're talking about on a forum that is a circular "increase in acceleration" of a personality trait that gets co-opted into arguing incorrectly about everything - a trait that everyone else knows is defective.

gavinray 14 days ago | parent | prev [-]

One of the wisest things I've read all week.

I authored one of the primary tools for GraphQL server benchmarks.

I learned about the Coordinated Omission problem and formats like HDR Histograms during the implementation.

My takeaway from that project is that not only is benchmarking anything correctly difficult, but every benchmark result ought to come with a disclaimer of:

"These are the results obtained on X machine, running at Y time, with Z resources."

jms55 14 days ago | parent | prev | next [-]

Never used CUDA, but I'm guessing these map to the same underlying stuff as timestamp queries in graphics APIs, yes?

saagarjha 14 days ago | parent | prev [-]

I mean, you can definitely get away with wall-clock timing in a pinch if you know what you’re doing. But yes, the event APIs are better.