| ▲ | wiredfool 14 days ago |
| Curious what the timing would be if it included the memory transfer time, e.g. (rough sketch, sizes arbitrary):

    import time
    import numpy as np
    import cupy as cp

    matrices = [np.random.rand(1000, 1000) for _ in range(100)]

    time_start = time.time()
    cp_matrices = [cp.array(m) for m in matrices]   # host-to-device transfers
    add_(cp_matrices)                               # the addition routine under test
    cp.cuda.get_current_stream().synchronize()      # wait for queued GPU work to finish
    time_end = time.time()
|
|
| ▲ | nickysielicki 14 days ago | parent | next [-] |
| I don’t mean to call you or your pseudocode out specifically, but I see this sort of thing all the time, and I just want to put it out there: PSA: if you ever see code trying to measure timing and it’s not using the CUDA event APIs, it’s fundamentally wrong and is lying to you. The simplest way to be sure you’re not measuring noise is to just ban the usage of any other timing source. Definitely don’t add unnecessary syncs just so that you can add a timing tap. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART_... |
| |
| ▲ | bee_rider 14 days ago | parent | next [-] | | If I have a mostly CPU code and I want to time the scenario: “I have just a couple subroutines that I am willing to offload to the GPU,” what’s wrong with sprinkling my code with normal old Python timing calls? I don’t care which part of the CUDA ecosystem is taking the time (from my point of view it is a black box that does GEMMs), so why not measure “time until my normal code is running again”? | | |
| ▲ | nickysielicki 14 days ago | parent [-] | | If you care enough to time it, you should care enough to time it correctly. | | |
| ▲ | bee_rider 14 days ago | parent | next [-] | | I described the correct way to time it when using the card as a black-box accelerator. | | |
| ▲ | nickysielicki 14 days ago | parent [-] | | You can create metrics for whatever you want! Go ahead! But cuda is not a black box math accelerator. You can stupidly treat it as such, but that doesn’t make it that. It’s an entire ecosystem with drivers and contexts and lifecycles. If everything you’re doing is synchronous and/or you don’t mind if your metrics include totally unrelated costs, then time.time() is fine, sure. But if that’s the case, you’ve got bigger problems. | | |
| ▲ | bee_rider 14 days ago | parent | next [-] | | Sure, it’s easy to say “there are bigger problems.” There are always bigger problems. But there are like 50 years’ worth of Fortran numerical codes out there, and lots of them just use reverse-communication interfaces (RCIs)… if I want to try CUDA in some existing library, I guess I will need the vector back before I can go back into the RCI. | |
| ▲ | doctorpangloss 14 days ago | parent | prev [-] | | You're arguing with people who have no idea what they're talking about on a forum that is a circular "increase in acceleration" of a personality trait that gets co-opted into arguing incorrectly about everything - a trait that everyone else knows is defective. |
|
| |
| ▲ | gavinray 14 days ago | parent | prev [-] | | One of the wisest things I've read all week. I authored one of the primary tools for GraphQL server benchmarks. I learned about the Coordinated Omission problem and formats like HDR Histograms during the implementation. My takeaway from that project is that not only is benchmarking anything correctly difficult, but benchmark results all ought to come with a disclaimer: "These are the results obtained on X machine, running at Y time, with Z resources." |
|
| |
| ▲ | jms55 14 days ago | parent | prev | next [-] | | Never used CUDA, but I'm guessing these map to the same underlying stuff as timestamp queries in graphics APIs, yes? | |
| ▲ | saagarjha 14 days ago | parent | prev [-] | | I mean you can definitely use it in a pinch if you know what you’re doing. But yes the event APIs are better. |
|
|
| ▲ | hnuser123456 14 days ago | parent | prev [-] |
| I think it does? (the comment is in the original source):

    print("Adding matrices using GPU...")
    start_time = time.time()
    gpu_result = add_matrices(gpu_matrices)
    cp.cuda.get_current_stream().synchronize()  # Not 100% sure what this does
    elapsed_time = time.time() - start_time
I was going to ask, any CUDA professionals who want to give a crash course on what us python guys will need to know? |
| |
| ▲ | apbytes 14 days ago | parent [-] | | When you call a CUDA method, it is launched asynchronously. That is, the function queues it up for execution on the GPU and returns immediately. So if you need to wait for an op to finish, you need to `synchronize` as shown above. It's `get_current_stream` because the queue mentioned above is called a stream in CUDA. If you want to run many independent ops concurrently, you can use several streams. Benchmarking is one use case for synchronize. Another would be if you, say, run two independent ops in different streams and need to combine their results. Btw, if you work with PyTorch, ops run on the GPU are also launched in the background. If you want to benchmark torch models on the GPU, it likewise provides a sync API. | | |
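A minimal CuPy sketch of the streams-and-synchronize behaviour described above (the arrays and ops are made up for illustration, not taken from the article):

    import cupy as cp

    s1 = cp.cuda.Stream()
    s2 = cp.cuda.Stream()

    with s1:
        a = cp.random.rand(2048, 2048)
        a_sum = a.sum()        # queued on s1, the call returns immediately
    with s2:
        b = cp.random.rand(2048, 2048)
        b_sum = b.sum()        # queued on s2, may overlap with s1's work

    s1.synchronize()
    s2.synchronize()
    total = float(a_sum + b_sum)   # safe to combine: both streams have finished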
| ▲ | claytonjy 14 days ago | parent | next [-] | | I’ve always thought it was weird that GPU stuff in Python doesn’t use asyncio, and mostly assumed it was because python-on-GPU predates asyncio. I was hoping a new lib like this might right that wrong, but it doesn’t. Maybe for interop reasons? Do other languages surface the asynchronous nature of GPUs in language-level async, avoiding silly stuff like synchronize? | |
| ▲ | ImprobableTruth 14 days ago | parent | next [-] | | The reason is that the usage is completely different from coroutine-based async. With GPUs you want to queue _as many async operations as possible_ and only then synchronize. That is, you would have a program like this (pseudocode):

    b = foo(a)
    c = bar(b)
    d = baz(c)
    synchronize()

With coroutines/async await, something like this

    b = await foo(a)
    c = await bar(b)
    d = await baz(c)

would synchronize after every step, being much more inefficient. | |
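A concrete CuPy version of the first pattern, with hypothetical stand-in kernels foo/bar/baz (not from the article): each call only enqueues work on the current stream, ordering between the dependent ops is guaranteed by the stream itself, and the host blocks exactly once at the end.

    import cupy as cp

    def foo(x): return x * 2.0       # stand-ins for arbitrary GPU kernels
    def bar(x): return x + 1.0
    def baz(x): return cp.sqrt(x)

    a = cp.random.rand(1_000_000)
    b = foo(a)                       # enqueued, returns immediately
    c = bar(b)
    d = baz(c)
    cp.cuda.get_current_stream().synchronize()   # block once, at the end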
| ▲ | hackernudes 14 days ago | parent | next [-] | | Pretty sure you want it to do it the first way in all cases (not just with GPUs)! | | |
| ▲ | halter73 14 days ago | parent [-] | | It really depends on if you're dealing with an async stream or a single async result as the input to the next function. If a is an access token needed to access resource b, you cannot access a and b at the same time. You have to serialize your operations. |
| |
| ▲ | alanfranz 14 days ago | parent | prev [-] | | Well, you can and should create multiple coroutines/tasks and then gather them. If you replace CUDA with network calls, it’s exactly the same problem. Nothing to do with asyncio. | |
| ▲ | ImprobableTruth 14 days ago | parent [-] | | No, that's a different scenario. In the one I gave there's explicitly a dependency between requests. If you use gather, the network requests would be executed in parallel. If you have dependencies they're sequential by nature because later ones depend on values of former ones. The 'trick' for CUDA is that you declare all this using buffers as inputs/outputs rather than values and that there's automatic ordering enforcement through CUDA's stream mechanism. Marrying that with the coroutine mechanism just doesn't really make sense. |
|
| |
| ▲ | apbytes 14 days ago | parent | prev [-] | | Might have to look at specific lib implementations, but I'd guess that mostly gpu calls from python are actually happening in c++ land. And internally a lib might be using synchronize calls where needed. |
| |
| ▲ | hnuser123456 14 days ago | parent | prev [-] | | Thank you kindly! |
|
|