jodoherty 4 hours ago

I use pi with an RTX Pro 6000 Blackwell to run Gemma 4 31b to do all my agentic coding.

I find it useful.

This side project highlights a similar approach to how I scope and tackle projects at work now:

https://git.theodohertyfamily.com/wg-wrap.git/tree/README.md

https://git.theodohertyfamily.com/wg-wrap.git/tree/CASE_STUD...

You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling the hard things early and wrapping them up in a simple, easy-to-use interface.
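To make that concrete, here is a minimal sketch of what "tests first, hard part behind a small interface" can look like. Everything in it is hypothetical and not taken from wg-wrap: the `parse_endpoint` function, its behavior, and the test cases are made up for illustration only.

```python
# Hypothetical example: the tests pin down the contract first, then one
# small function hides the fiddly parsing behind a simple interface.

def parse_endpoint(text: str) -> tuple[str, int]:
    """Split 'host:port' into (host, port), validating the port range."""
    host, sep, port = text.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected host:port, got {text!r}")
    port_num = int(port)  # raises ValueError for a non-numeric port
    if not 1 <= port_num <= 65535:
        raise ValueError(f"port out of range: {port_num}")
    return host, port_num


def test_parses_host_and_port():
    assert parse_endpoint("example.com:51820") == ("example.com", 51820)


def test_rejects_missing_port():
    try:
        parse_endpoint("example.com")
    except ValueError:
        return
    raise AssertionError("expected ValueError")


if __name__ == "__main__":
    test_parses_host_and_port()
    test_rejects_missing_port()
    print("ok")
```

The point is that the agent only has to make a handful of small, unambiguous tests pass, and the rest of the project only ever sees the one function.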

I find I can get some projects done 2-3 times faster than if I wrote them by hand. On mundane or broadly scoped projects it can cut the time by about 5-10x, because it helps me consolidate and try out ideas very quickly.

Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.

My current llama.cpp setup gets token generation rates between 60 and 150 t/s depending on MTP draft acceptance rates. Prefill is between 1500 and 4000 t/s depending on context length/depth.
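For a rough sense of what those numbers mean in practice, here is a back-of-the-envelope sketch. It is not a measurement of my setup. The first function uses the standard speculative-decoding estimate, which assumes each drafted token is accepted independently with the same probability (a simplification; real acceptance varies by content). The second just adds prefill and decode time. The function names, the draft length, and the prompt/output sizes are all assumptions for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability acceptance_rate, and the pass always yields one token
    from the main model: (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float, decode_tps: float) -> float:
    """Rough request latency: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


if __name__ == "__main__":
    # Hypothetical request: 30k-token prompt, 600-token reply.
    slow = estimate_seconds(30_000, 600, prefill_tps=1500, decode_tps=60)
    fast = estimate_seconds(30_000, 600, prefill_tps=4000, decode_tps=150)
    print(f"worst case ~{slow:.1f}s, best case ~{fast:.1f}s")
    print(f"tokens/step at 50% acceptance, 1 draft token: "
          f"{expected_tokens_per_step(0.5, 1):.2f}")
```

With the ends of the ranges above, the same hypothetical request lands somewhere between roughly 11.5 and 30 seconds, which is why draft acceptance and context depth matter so much for how the agent loop feels.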