moezd 4 hours ago

Not yet. Without pure Apple game or decent GPUs, even with a lot of RAM and threads, all you get is about 30-50 tokens/second, and that's with thinking turned off. Without these optimizations your model will have a field day with your MCPs, skills and agent descriptions, and you will watch paint dry before seeing the first output token. Local model serving means you have to fight for every token in your context window, which is quite the opposite of what Claude/GPT/Copilot are pushing the industry towards.

amarshall 5 minutes ago | parent [-]

Thinking doesn’t change output speed. Anthropic’s models are ~40–60 t/s median output speed.