fulafel 9 hours ago

You're right in terms of fitting your program to memory, so that it can run in the first place.

But in performance work, the speed of RAM relative to computation has dropped so much that it's common wisdom to treat today's cache as the RAM of old (and today's RAM as the disk of old, and so on).

In software performance work it's been all about hitting the cache for a long time. LLMs aren't too amenable to caching though.

makapuf 8 hours ago | parent | next [-]

AFAIK, you can't explicitly allocate cache the way you allocate RAM, though. A bit as if you could only work on files and RAM were used as a cache. Maybe I'm mistaken?

KeplerBoy 3 hours ago | parent | next [-]

You can in CUDA. You can have shared memory which is basically L1 cache you have full control over. It's called shared memory because all threads within a block (which reside on a common SM) have fast access to it. The downside: you now have less regular L1 cache.

lou1306 7 hours ago | parent | prev [-]

You can't explicitly allocate cache, but you can lay things out in memory to minimize cache misses.

tux3 5 hours ago | parent | next [-]

A fun fact for people who like going down rabbit holes: there is an x86 technique called cache-as-RAM (CAR) that lets you dedicate a range of memory to live entirely in cache, avoiding DRAM entirely.

CAR is often used during early boot, before DRAM is initialized. It works because the x86 cache-disable bit actually only decouples the cache from memory; the CPU will still use the cache if you primed it with valid cache lines before setting the bit.

So the technique is to mark a particular range of memory as write-back cacheable, prime the cache with valid cache lines for the entire region, and then set the bit to decouple the cache from memory. Now every access to this memory region is a cache hit that doesn't write back to DRAM.

The one downside is that when CAR is on, any cache you don't allocate as memory is wasted. You could allocate only half the cache as RAM to a particular memory region, but the disable bit is global, so the other half would just sit idle.

crakenzak 6 hours ago | parent | prev [-]

Out of curiosity, why has there not been a slight paradigm shift in modern system programming languages to expose more control over the caches?

pjc50 4 hours ago | parent | next [-]

Same as the failure of Itanium VLIW instructions: you don't actually want to force the decision of what is in the cache back to compile time, when the relevant information is better available at runtime.

Also, additional information on instructions costs instruction bandwidth and I-cache.

david-gpu an hour ago | parent [-]

> you don't actually want to force the decision of what is in the cache back to compile time, when the relevant information is better available at runtime

That is very context-dependent. In high-performance code having explicit control over caches can be very beneficial. CUDA and similar give you that ability and it is used extensively.

Now, for general "I wrote some code and want the hardware to run it fast with little effort from my side", I agree that transparent caches are the way.

10000truths 34 minutes ago | parent [-]

x86 provides this control with non-temporal load/store instructions.

greybcg 5 hours ago | parent | prev | next [-]

There are cache-control instructions already. The reason it goes no further than prefetch/invalidate hints is probably that exposing a fuller cache-control API at the chip level would overcomplicate designs and wouldn't make for a backwards-compatible, stable API. Treating the cache as RAM would also require a controller, which then also needs to receive instructions, or the CPU would suddenly have to manage the cache itself.

I can understand why they just decide to bake the cache algorithms into hardware, validate them, and be done with it. I'd love it if a hardware engineer or more well-read fellow could chime in.

Someone 5 hours ago | parent [-]

Another reason for doing cache algorithms in hardware is that cache access (especially for level 1 caches) has to be low latency to be useful.

Tuna-Fish 5 hours ago | parent | prev | next [-]

Because programmers are in general worse at managing them than the basic LRU algorithm.

And because the abstraction is simple and easy enough to understand that when you do need close control, it's easy to achieve by just writing to the abstraction. Careful control of data layout and nontemporal instructions are almost always all you need.

zozbot234 3 hours ago | parent | prev | next [-]

This is not applicable to most programming scenarios, since the cache gets trashed unpredictably during context switches (including the user-level task switches involved in cooperative async patterns). It's not true scratchpad storage, and turning it into one would slow context switches down a lot, since the scratchpad would become processor state. Maybe this can be revisited once even low-end computers have so many hardware cores/threads that context switches become rare enough for the overhead not to matter. But we are very far from anything of the sort.

rwmj 5 hours ago | parent | prev [-]

There has! Intel has Cache Allocation Technology, and I was very peripherally involved in reviewing research projects at Boston University into this. One that I remember allowed the operating system to divide up cache and memory bandwidth for better prioritization.

https://www.intel.com/content/www/us/en/developer/articles/t...

seanmcdirmid 8 hours ago | parent | prev [-]

LLMs need memory bandwidth to stream lots of data through quickly, not so much caching. This is basically the same way a GPU uses memory.

zozbot234 3 hours ago | parent [-]

OTOH, LLM inference tends to have very predictable memory access patterns. So well-placed prefetch instructions that can execute predictable memory fetches in parallel with expensive compute might help CPU performance quite a bit. I assume that this is done already as part of optimized numerical primitives such as GEMM, since that's where most of the gain would be.