| ▲ | diggan 14 days ago |
| I'm no GPU programmer, but seems easy to use even for someone like me. I pulled together a quick demo of using the GPU vs the CPU, based on what I could find (https://gist.github.com/victorb/452a55dbcf59b3cbf84efd8c3097...) which gave these results (after downloading 2.6GB of dependencies of course): Creating 100 random matrices of size 5000x5000 on CPU...
Adding matrices using CPU...
CPU matrix addition completed in 0.6541 seconds
CPU result matrix shape: (5000, 5000)
Creating 100 random matrices of size 5000x5000 on GPU...
Adding matrices using GPU...
GPU matrix addition completed in 0.1480 seconds
GPU result matrix shape: (5000, 5000)
Definitely worth digging into more, as the API is really simple to use, at least for basic things like these. CUDA programming seems like a big chore without something higher level like this. |
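A rough, self-contained sketch of that kind of comparison (not the gist itself, which is only linked above; the matrix count and size come from the printed output, and everything here uses standard NumPy/CuPy calls):
    import time
    import numpy as np
    import cupy as cp  # needs a CUDA-capable GPU and a matching CuPy wheel

    # 100 matrices of 5000x5000 float32 is roughly 10 GB; shrink for smaller machines
    matrices = [np.random.rand(5000, 5000).astype(np.float32) for _ in range(100)]

    # CPU: accumulate the sum with NumPy
    t0 = time.time()
    cpu_result = matrices[0].copy()
    for m in matrices[1:]:
        cpu_result += m
    print(f"CPU add: {time.time() - t0:.4f} s")

    # GPU: copy to the device, accumulate with CuPy, synchronize before stopping the clock
    gpu_matrices = [cp.asarray(m) for m in matrices]
    t0 = time.time()
    gpu_result = gpu_matrices[0].copy()
    for m in gpu_matrices[1:]:
        gpu_result += m
    cp.cuda.get_current_stream().synchronize()
    print(f"GPU add: {time.time() - t0:.4f} s")
Note that the host-to-device copies happen outside the timed region here, which is the point the timing discussion further down the thread picks up on.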
|
| ▲ | ashvardanian 14 days ago | parent | next [-] |
| CuPy has been available for years and has always worked great. The article is about the next wave of Python-oriented JIT toolchains that will allow writing actual GPU kernels in a Pythonic style, instead of calling an existing precompiled GEMM implementation in CuPy (like in that snippet) or JIT-ing CUDA C++ kernels from Python source, which has also been available for years: https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k... |
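For reference, the raw-kernel path that link describes looks roughly like this (the kernel is adapted from the CuPy docs; it is compiled on first use via NVRTC):
    import cupy as cp

    add_kernel = cp.RawKernel(r'''
    extern "C" __global__
    void my_add(const float* x1, const float* x2, float* y) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        y[tid] = x1[tid] + x2[tid];
    }
    ''', 'my_add')

    x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
    x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
    y = cp.zeros((5, 5), dtype=cp.float32)
    add_kernel((5,), (5,), (x1, x2, y))  # grid dims, block dims, kernel arguments
The kernel body is still CUDA C++; the new toolchains discussed in the article aim to let you write the kernel itself in Python.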
| |
| ▲ | almostgotcaught 14 days ago | parent [-] | | it's funny - people around here really do not have a clue about the GPU ecosystem even though everyone is always talking about AI: > The article is about the next wave of Python-oriented JIT toolchains the article is content marketing (for whatever) but the actual product literally has nothing to do with kernels or jitting or anything: https://github.com/NVIDIA/cuda-python - literally just cython bindings to the CUDA runtime and CUB. for once CUDA is aping ROCm: https://github.com/ROCm/hip-python | | |
| ▲ | dragonwriter 14 days ago | parent | next [-] | | The mistake you seem to be making is confusing the existing product (which has been available for many years) with the upcoming new features for that product just announced at GTC, which are not addressed at all on the page for the existing product, but are addressed in the article about the GTC announcement. | | |
| ▲ | almostgotcaught 14 days ago | parent [-] | | > The mistake you seem to be making is confusing the existing product i'm not making any such mistake - i'm just able to actually read and comprehend what i'm reading rather than perform hype: > Over the last year, NVIDIA made CUDA Core, which Jones said is a “Pythonic reimagining of the CUDA runtime to be naturally and natively Python.” so the article is about cuda-core, not whatever you think it's about - so i'm responding directly to what the article is about. > CUDA Core has the execution flow of Python, which is fully in process and leans heavily into JIT compilation. this is bullshit/hype about Python's new JIT which womp womp womp isn't all that great (yet). this has absolutely nothing to do with any other JIT e.g., the cutile kernel driver JIT (which also has absolutely nothing to do with what you think it does). | | |
| ▲ | dragonwriter 14 days ago | parent | next [-] | | > i'm just able to actually read and comprehend what i'm reading rather than perform hype: The evidence of that is lacking. > so the article is about cuda-core, not whatever you think it's about cuda.core (a relatively new, rapidly developing library whose entire API is experimental) is one of several things (NVMath is another) mentioned in the article, but the newer and as yet unreleased piece mentioned in the article and the GTC announcement, and a key part of the “Native Python” in the headline, is the CuTile model [0]: “The new programming model, called CuTile interface, is being developed first for Pythonic CUDA with an extension for C++ CUDA coming later.” > this is bullshit/hype about Python's new JIT No, as is fairly explicit in the next line after the one you quote, it is about the Nvidia CUDA Python toolchain using in-process compilation rather than relying on shelling out to out-of-process command-line compilers for CUDA code. [0] The article only has a fairly vague qualitative description of what CuTile is, but (without having to watch the whole talk from GTC), one could look at this tweet for a preview of what the Python code using the model is expected to look like when it is released: https://x.com/blelbach/status/1902113767066103949?t=uihk0M8V... | |
| ▲ | almostgotcaught 14 days ago | parent [-] | | > No, as is is fairly explicit in the next line after the one you quote, it is about the Nvidia CUDA Python toolchain using in-process compilation rather than relying on shelling out to out-of-process command-line compilers for CUDA code. my guy what i am able to read, which you are not, is the source and release notes. i do not need to read tweets and press releases because i know what these things actually are. here are the release notes > Support Python 3.13 > Add bindings for nvJitLink (requires nvJitLink from CUDA 12.3 or above) > Add optional dependencies on CUDA NVRTC and nvJitLink wheels https://nvidia.github.io/cuda-python/latest/release/12.8.0-n... do you understand what "bindings" and "optional dependencies on..." means? it means there's nothing happening in this library and these are... just bindings to existing libraries. specifically that means you cannot jit python using this thing (except via the python 3.13 jit interpreter) and can only do what you've always already been able to do with eg cupy (compile and run C/C++ CUDA code). EDIT: y'all realize that 1. calling a compiler for your entire source file 2. loading and running that compiled code is not at all a JIT? y'all understand that right? | | |
| ▲ | squeaky-clean 14 days ago | parent | next [-] | | > my guy what i am able to read, which you are not, is the source and release notes. i do not need to read tweets and press releases because i know what these things actually are. here are the release notes Those aren't the release notes for the native python thing being announced. CuTile has not been publicly released yet. Based on what the devs are saying on Twitter it probably won't be released before the SciPy 2025 conference in July. | |
| ▲ | musicale 13 days ago | parent | prev | next [-] | | JIT as an adjective means just-in-time, as opposed to AOT, ahead-of-time. What Nvidia discussed at GTC was a software stack that will enable you to generate new CUDA kernels dynamically at runtime using Python API calls. It is a just-in-time (runtime, dynamic) compiler system rather than an ahead-of-time (pre-runtime, static) compiler. | |
| ▲ | saagarjha 14 days ago | parent | prev | next [-] | | cuTile is basically Nvidia’s Triton (no, not that Triton, OpenAI’s Triton) competitor. It takes your Python code and generates kernels at runtime. CUTLASS has a new Python interface that does the same thing. | |
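cuTile's syntax is unreleased, so purely as a point of comparison, a minimal Triton vector-add looks roughly like this; the kernel is ordinary Python that Triton compiles to a GPU kernel at runtime (the array size and block size here are arbitrary):
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(1 << 20, device="cuda")
    y = torch.rand(1 << 20, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)  # compiled to a kernel on first call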
| ▲ | wahnfrieden 14 days ago | parent | prev [-] | | [flagged] |
|
| |
| ▲ | squeaky-clean 14 days ago | parent | prev [-] | | Isn't the main announcement of the article CuTile? Which has not been released yet. Also the cuda-core JIT stuff has nothing to do with Python's new JIT, it's referring to integrating nvJitLink with python, which you can see an example of in cuda_core/examples/jit_lto_fractal.py |
|
| |
| ▲ | ashvardanian 14 days ago | parent | prev | next [-] | | In case someone is looking for performance examples & testimonials: even on an RTX 3090 vs a 64-core AMD Epyc/Threadripper, a couple of years ago, CuPy was a blast. I have a couple of recorded sessions with roughly identical slides/numbers: - San Francisco Python meetup in 2023: https://youtu.be/L9ELuU3GeNc?si=TOp8lARr7rP4cYaw
- Yerevan PyData meetup in 2022: https://youtu.be/OxAKSVuW2Yk?si=5s_G0hm7FvFHXx0u
Of the more remarkable results: - 1000x sorting speedup switching from NumPy to CuPy.
- 50x performance improvements switching from Pandas to CuDF on the New York Taxi Rides queries.
- 20x GEMM speedup switching from NumPy to CuPy.
CuGraph is also definitely worth checking out. At that time, Intel wasn't in as bad of a position as they are now and was trying to push Modin, but the difference in performance and quality of implementation was mind-boggling. | |
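For a sense of how such numbers are typically measured (the array size and dtype here are arbitrary, and the exact speedup depends heavily on the hardware):
    import time
    import numpy as np
    import cupy as cp

    x_cpu = np.random.rand(100_000_000).astype(np.float32)
    x_gpu = cp.asarray(x_cpu)

    t0 = time.time()
    np.sort(x_cpu)
    print(f"NumPy sort: {time.time() - t0:.3f} s")

    t0 = time.time()
    cp.sort(x_gpu)
    cp.cuda.get_current_stream().synchronize()  # wait for the device before stopping the clock
    print(f"CuPy sort:  {time.time() - t0:.3f} s")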
| ▲ | ladberg 14 days ago | parent | prev | next [-] | | The main release highlighted by the article is cuTile which is certainly about jitting kernels from Python code | | |
| ▲ | almostgotcaught 14 days ago | parent [-] | | > main release there is no release of cutile (yet). so the only substantive thing that the article can be describing is cuda-core - which it does describe and is a recent/new addition to the ecosystem. man i can't fathom glazing a random blog this hard just because it's tangentially related to some other thing (NV GPUs) that clearly people only vaguely understand. | | |
| ▲ | throwaway314155 12 days ago | parent [-] | | christ man lighten the fuck up. there's zero need to be _so_ god damn patronizing and disrespectful. |
|
| |
| ▲ | yieldcrv 14 days ago | parent | prev [-] | | I just want to see benchmarks. Is this new one faster than CuPy or not? |
|
|
|
| ▲ | moffkalast 14 days ago | parent | prev | next [-] |
| Only a 4x speedup seems rather low for GPU acceleration; does numpy already use AVX2 or other SIMD? For comparison, doing something similar with torch on CPU vs torch on GPU will get you something like a 100x difference. |
| |
| ▲ | diggan 14 days ago | parent [-] | | It's a microbenchmark (if even that), so take it with a grain of salt. You'd probably see a bigger difference with larger or more complicated tasks. |
|
|
| ▲ | wiredfool 14 days ago | parent | prev | next [-] |
| Curious what the timing would be if it included the memory transfer time, e.g. matrices = [np.random.rand(5000, 5000) for _ in range(100)]
  time_start = time.time()
  cp_matrices = [cp.array(m) for m in matrices]  # host-to-device copies now inside the timed region
  add_matrices(cp_matrices)
  cp.cuda.get_current_stream().synchronize()
  time_end = time.time()
|
| |
| ▲ | nickysielicki 14 days ago | parent | next [-] | | I don’t mean to call you or your pseudocode out specifically, but I see this sort of thing all the time, and I just want to put it out there: PSA: if you ever see code trying to measure timing and it’s not using the CUDA event APIs, it’s fundamentally wrong and is lying to you. The simplest way to be sure you’re not measuring noise is to just ban the usage of any other timing source. Definitely don’t add unnecessary syncs just so that you can add a timing tap. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART_... | | |
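In CuPy, those event APIs are exposed as cupy.cuda.Event; a minimal sketch of event-based timing (add_matrices and gpu_matrices are the names from the snippet quoted elsewhere in this thread):
    import cupy as cp

    start, end = cp.cuda.Event(), cp.cuda.Event()

    start.record()                     # enqueued on the current stream
    gpu_result = add_matrices(gpu_matrices)
    end.record()
    end.synchronize()                  # wait for the recorded event, not the whole device

    print(f"{cp.cuda.get_elapsed_time(start, end):.3f} ms")  # device-side elapsed time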
| ▲ | bee_rider 14 days ago | parent | next [-] | | If I have mostly CPU code and I want to time the scenario “I have just a couple of subroutines that I am willing to offload to the GPU,” what's wrong with sprinkling my code with normal old python timing calls? If I don't care which part of the CUDA ecosystem is taking the time (from my point of view it is a black box that does GEMMs), why not just measure “time until my normal code is running again”? | |
| ▲ | nickysielicki 14 days ago | parent [-] | | If you care enough to time it, you should care enough to time it correctly. | | |
| ▲ | bee_rider 14 days ago | parent | next [-] | | I described the correct way to time it when using the card as a black-box accelerator. | | |
| ▲ | nickysielicki 14 days ago | parent [-] | | You can create metrics for whatever you want! Go ahead! But cuda is not a black box math accelerator. You can stupidly treat it as such, but that doesn’t make it that. It’s an entire ecosystem with drivers and contexts and lifecycles. If everything you’re doing is synchronous and/or you don’t mind if your metrics include totally unrelated costs, then time.time() is fine, sure. But if that’s the case, you’ve got bigger problems. | | |
| ▲ | bee_rider 14 days ago | parent | next [-] | | Sure, it’s easy to say “there are bigger problems.” There are always bigger problems. But, there are like 50 years worth of Fortran numerical codes out there, lots of them just use RCIs… if I want to try CUDA in some existing library, I guess I will need the vector back before I can go back into the RCI. | |
| ▲ | doctorpangloss 14 days ago | parent | prev [-] | | You're arguing with people who have no idea what they're talking about on a forum that is a circular "increase in acceleration" of a personality trait that gets co-opted into arguing incorrectly about everything - a trait that everyone else knows is defective. |
|
| |
| ▲ | gavinray 14 days ago | parent | prev [-] | | One of the wisest things I've read all week. I authored one of the primary tools for GraphQL server benchmarks, and I learned about the Coordinated Omission problem and formats like HDR Histograms during the implementation. My takeaway from that project is that not only is benchmarking anything correctly difficult, but all benchmark results ought to come with the disclaimer: "These are the results obtained on X machine, running at Y time, with Z resources." |
|
| |
| ▲ | jms55 14 days ago | parent | prev | next [-] | | Never used CUDA, but I'm guessing these map to the same underlying stuff as timestamp queries in graphics APIs, yes? | |
| ▲ | saagarjha 14 days ago | parent | prev [-] | | I mean you can definitely use it in a pinch if you know what you’re doing. But yes the event APIs are better. |
| |
| ▲ | hnuser123456 14 days ago | parent | prev [-] | | I think it does?: (the comment is in the original source) print("Adding matrices using GPU...")
start_time = time.time()
gpu_result = add_matrices(gpu_matrices)
cp.cuda.get_current_stream().synchronize() # Not 100% sure what this does
elapsed_time = time.time() - start_time
I was going to ask, any CUDA professionals who want to give a crash course on what us python guys will need to know? | | |
| ▲ | apbytes 14 days ago | parent [-] | | When you call a cuda method, it is launched asynchronously. That is, the function queues it up for execution on the gpu and returns. So if you need to wait for an op to finish, you need to `synchronize` as shown above. It's `get_current_stream` because the queue mentioned above is actually called a stream in cuda. If you want to run many independent ops concurrently, you can use several streams. Benchmarking is one use case for synchronize. Another would be if you, say, run two independent ops in different streams and need to combine their results. Btw, if you work with pytorch, when ops are run on the gpu, they are launched in the background. If you want to bench torch models on gpu, they also provide a sync api. | |
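A minimal sketch of the several-streams case with CuPy (the sizes are arbitrary, and whether the two ops actually overlap depends on the GPU and the workload):
    import cupy as cp

    s1, s2 = cp.cuda.Stream(), cp.cuda.Stream()

    with s1:
        a = cp.random.rand(4096, 4096) @ cp.random.rand(4096, 4096)  # queued on s1
    with s2:
        b = cp.random.rand(4096, 4096) @ cp.random.rand(4096, 4096)  # queued on s2

    s1.synchronize()
    s2.synchronize()
    c = a + b  # safe to combine once both streams have finished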
| ▲ | claytonjy 14 days ago | parent | next [-] | | I've always thought it was weird GPU stuff in python doesn't use asyncio, and mostly assumed it was because python-on-GPU predates asyncio. I was hoping a new lib like this might right that wrong, but it doesn't. Maybe for interop reasons? Do other languages surface the asynchronous nature of GPUs in language-level async, avoiding silly stuff like synchronize? | |
| ▲ | ImprobableTruth 14 days ago | parent | next [-] | | The reason is that the usage is completely different from coroutine based async. With GPUs you want to queue _as many async operations as possible_ and only then synchronize. That is, you would have a program like this (pseudocode): b = foo(a)
c = bar(b)
d = baz(c)
synchronize()
With coroutines/async await, something like this b = await foo(a)
c = await bar(b)
d = await baz(c)
would synchronize after every step, being much more inefficient. | | |
| ▲ | hackernudes 14 days ago | parent | next [-] | | Pretty sure you want it to do it the first way in all cases (not just with GPUs)! | | |
| ▲ | halter73 14 days ago | parent [-] | | It really depends on if you're dealing with an async stream or a single async result as the input to the next function. If a is an access token needed to access resource b, you cannot access a and b at the same time. You have to serialize your operations. |
| |
| ▲ | alanfranz 14 days ago | parent | prev [-] | | Well, you can and should create multiple coroutines/tasks and then gather them. If you replace cuda with network calls, it's exactly the same problem. Nothing to do with asyncio. | |
| ▲ | ImprobableTruth 14 days ago | parent [-] | | No, that's a different scenario. In the one I gave there's explicitly a dependency between requests. If you use gather, the network requests would be executed in parallel. If you have dependencies they're sequential by nature because later ones depend on values of former ones. The 'trick' for CUDA is that you declare all this using buffers as inputs/outputs rather than values and that there's automatic ordering enforcement through CUDA's stream mechanism. Marrying that with the coroutine mechanism just doesn't really make sense. |
|
| |
| ▲ | apbytes 14 days ago | parent | prev [-] | | Might have to look at specific lib implementations, but I'd guess that mostly gpu calls from python are actually happening in c++ land. And internally a lib might be using synchronize calls where needed. |
| |
| ▲ | hnuser123456 14 days ago | parent | prev [-] | | Thank you kindly! |
|
|
|
|
| ▲ | rahimnathwani 14 days ago | parent | prev [-] |
| Thank you. I scrolled up and down the article hoping they included a code sample. |
| |
| ▲ | diggan 14 days ago | parent | next [-] | | Yeah, I figured I wasn't alone in doing just that :) | |
| ▲ | rahimnathwani 14 days ago | parent | prev [-] | | EDIT: Just realized the code doesn't seem to be using the GPU for the addition. |
|