lsb 5 days ago
It’s wild that with a KV cache and compilation, the Mac CPU is faster than an A100 GPU.
ModelForge 5 days ago
Could be an artifact of the model being too small to take full advantage of the GPU. For example, for the slightly larger Qwen3 0.6B model, the A100 is faster (you can see it by scrolling to the bottom here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11...)
ladberg 5 days ago
Given that the compiled version is slower than the eager version on the A100, there's definitely something suboptimal happening there.
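One way to sanity-check a result like this is to time both variants with the warm-up iterations excluded, since a compiled function typically pays a one-time compilation cost on its first calls. A minimal sketch using only the standard library; `benchmark`, `warmup`, and `iters` are hypothetical names, not anything from the linked repo, and on a GPU you would also need to synchronize the device (e.g. `torch.cuda.synchronize()`) before reading the clock:

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=10):
    """Return the median wall-clock time of fn() in seconds, excluding warm-up calls."""
    # Warm-up calls absorb one-time costs (e.g. compilation, cache fills).
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        # For GPU work, synchronize the device here before stopping the clock.
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    # Hypothetical stand-in workload; swap in the eager and compiled generation functions.
    workload = lambda: sum(i * i for i in range(10_000))
    print(f"median: {benchmark(workload):.6f} s")
```

If the compiled variant is still slower than eager after warm-up, the gap is not just compilation time.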
punnerud 5 days ago
Is it because on the Mac the CPU and GPU share memory, while the A100 needs to transfer data to RAM/CPU for the parts that aren’t supported by the GPU? (My first guess)
Weryj 5 days ago
This would be because the GPU can’t fill its wavefronts and hide memory latency, no? I’m curious what the reason is.