wolttam 2 hours ago
To answer this and the sibling comment: it's DeepSeek V4 Flash at native FP4 quantization, on two Nvidia DGX Sparks. That's a fair bit of kit, but still paltry relative to the data centre. I get ~40 TPS generation and ~2000 TPS prompt processing, which makes it feel approximately as fast as typical APIs. I primarily use it with my own harness for coding. I'm not going to say it will compete with Opus in the most challenging domains, because it won't, but I will say there's a reasonable likelihood that Opus is being used for tasks that a model like Flash could comfortably handle at 1/100th the cost. So far I've only seen it struggle at tasks that I myself would struggle with. For tasks where I can describe the shape of the solution, it has a high success rate at implementing them. "Useful" is going to be different for everyone. I'm not working on the hardest problems, so I don't need the best models.