I have 128 GB of unified memory (M4 Max) and the user experience with local inference is still pretty bad. I'm so glad something like llama.cpp exists so I don't have to wrangle Python (which I hate), but OpenCode is entirely disrespectful of the KV cache, so I had to switch to Pi, which is actually going relatively well.

Even so, I can't really run at hundreds of tokens per second, which is practically table stakes for my work. Even if I did manage to run that fast, the model would probably be completely braindead and stomp all over the task.

Wish I could afford an M5 Max but I've been between jobs for months without even a single interview. Sucks to be a developer these days.

sschueller 2 hours ago | parent [-]

Try Kilocode with DeepSeek V4 (via the API directly from DeepSeek, which is much cheaper than going through Kilo).

I have had very good results, and compared to others it costs just pennies.

I use a setup similar to https://github.com/ScotterMonk/AgentAutoFlow and switch between DeepSeek V4 and Flash depending on the task.

	LoganDark 2 hours ago | parent [-]
		I do use DeepSeek, it's exceptionally cheap! Inference is slow though, and it's not particularly intelligent, but the experience is better than local inference.