▲ | hnuser123456 14 days ago |
I think it does? (The comment is in the original source.)
I was going to ask: any CUDA professionals who want to give a crash course on what us Python guys will need to know?
▲ | apbytes 14 days ago | parent [-]
When you call a CUDA method, it is launched asynchronously: the function queues the work up for execution on the GPU and returns immediately. So if you need to wait for an op to finish, you need to `synchronize` as shown above.

`get_current_stream` comes in because the queue mentioned above is actually called a stream in CUDA. If you want to run many independent ops concurrently, you can use several streams.

Benchmarking is one use case for `synchronize`. Another would be if you, say, run two independent ops in different streams and need to combine their results.

Btw, if you work with PyTorch: when ops run on the GPU, they are also launched in the background. If you want to benchmark torch models on the GPU, it also provides a sync API.
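To make that concrete, here's a rough sketch in CuPy terms (not the code from the article; array sizes and stream names are just made up for illustration). Ops queue on the current stream and return right away, independent streams can overlap, and you synchronize before combining results:

```python
import cupy as cp

# Two independent streams (queues) of GPU work
s1 = cp.cuda.Stream(non_blocking=True)
s2 = cp.cuda.Stream(non_blocking=True)

with s1:
    a = cp.random.random((4096, 4096))
    a = a @ a          # queued on s1, returns immediately
with s2:
    b = cp.random.random((4096, 4096))
    b = b @ b          # queued on s2, can run concurrently with s1's work

# Wait for both streams to finish before combining their results
s1.synchronize()
s2.synchronize()
c = a + b

print(cp.cuda.get_current_stream())  # the default queue ops go to otherwise
```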
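And for the PyTorch benchmarking point, a minimal sketch of why you need to sync before reading the clock (sizes are arbitrary):

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")

# Warm up, then sync so previously queued work doesn't pollute the timing
for _ in range(3):
    x @ x
torch.cuda.synchronize()

start = time.perf_counter()
y = x @ x                   # launched asynchronously, returns immediately
torch.cuda.synchronize()    # wait for the GPU to actually finish
print(f"matmul took {time.perf_counter() - start:.4f}s")
```

Without the second `torch.cuda.synchronize()`, you'd mostly be timing the kernel launch, not the matmul itself.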