LoganDark 3 hours ago
I have 128 GB of unified memory (M4 Max) and the user experience with local inference is still pretty bad. I'm so glad something like llama.cpp exists so I don't have to wrangle Python (which I hate), but OpenCode is entirely disrespectful of the KV cache, so I had to switch to Pi (which is actually going relatively well). Even so, I can't really run at hundreds of tokens per second, which is practically table stakes for my work. Even if I did manage to run that fast, the model would probably be completely braindead and stomp all over the task. Wish I could afford an M5 Max, but I've been between jobs for months without even a single interview. Sucks to be a developer these days.
sschueller 2 hours ago | parent
Try Kilocode with DeepSeek v4 (via the API directly to DeepSeek, which is much cheaper than going through Kilo). I have had very good results, and compared to the others it costs just pennies. I use a setup similar to https://github.com/ScotterMonk/AgentAutoFlow and switch between DeepSeek v4 and flash depending on the task.