dragonwriter | 6 days ago |
> My RTX 5090 is about 10x faster (measured by FP32 TFLOPS) and I still don't find it to be fast enough. I can't imagine using something so slow for AI/ML. Only 2.2 tokens/sec on an 8B parameter Llama model? That's slower than someone typing.

It's also orders of magnitude slower than what I normally see cited by people using 5090s; heck, it's even much slower than what I see on my own 3080 Ti laptop card for 8B models, though I usually won't use more than an 8bpw quant for a model that size.
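A quick sanity check backs this up: single-user LLM decoding is typically memory-bandwidth-bound, so the throughput ceiling is roughly memory bandwidth divided by the size of the weights. Here is a minimal back-of-the-envelope sketch, assuming roughly 1.8 TB/s of memory bandwidth for a 5090 and an 8B model quantized to 8 bits per weight (both figures are illustrative assumptions, not numbers from the thread):

    # Rough bandwidth-bound estimate for single-batch decoding:
    # each generated token must read every weight once, so
    # tokens/sec <= memory bandwidth / bytes of weights.

    bandwidth_gb_s = 1800   # assumed RTX 5090 memory bandwidth, GB/s
    model_gb = 8.0          # 8B params at 8 bits per weight ~= 8 GB

    upper_bound_tok_s = bandwidth_gb_s / model_gb
    print(f"Rough upper bound: {upper_bound_tok_s:.0f} tokens/sec")
    # ~225 tokens/sec -- so 2.2 tok/s is about 100x below the
    # bandwidth ceiling, consistent with something being misconfigured
    # (e.g. falling back to CPU or spilling weights out of VRAM).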
Sohcahtoa82 | 6 days ago | parent |
Yeah, I must be doing something wrong. Someone else pointed out that I should be getting much better performance. I'll be looking into it. |