eurekin 3 hours ago

GLM 4.7 is new and promising. MiniMax 2.1 is good for agents. And of course the Qwen3 family; the VL versions are spectacular. NVIDIA Nemotron Nano 3 excels at long context, and the Unsloth variant has been extended to 1M tokens.

I thought the last one was a toy until I tried it with a full 1.2 MB repomix project dump. It actually works quite well for general code comprehension across the whole codebase, CI scripts included.
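For reference, a minimal sketch of producing such a dump with repomix (flag names per its docs; the output file name is my own choice, and the model you feed it to is whatever long-context model you run):

    # pack the whole repo, CI scripts included, into one file
    npx repomix --style markdown -o project-dump.md
    # then attach project-dump.md to the long-context model's prompt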

GPT-OSS-120B is good too, although I have yet to try it for coding specifically.

magicalhippo an hour ago

Since I'm just a pleb with a 5090, I run GPT-OSS 20B a lot; it fits comfortably in VRAM with max context size. I find it quite decent for a lot of things, especially after setting reasoning effort to high, disabling top-k and top-p, and setting min-p to something like 0.05.
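A minimal llama-server sketch of those settings, assuming a local GGUF (adjust the model path and context size for your setup; the reasoning-effort kwarg is only in newer llama.cpp builds):

    llama-server -m gpt-oss-20b.gguf \
        -c 131072 -ngl 99 \
        --top-k 0 --top-p 1.0 --min-p 0.05 \
        --chat-template-kwargs '{"reasoning_effort": "high"}'

Here --top-k 0 and --top-p 1.0 effectively disable those samplers, leaving min-p to do the filtering.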

For the Qwen3-VL, I recently read that someone got significantly better results by using F16 or even F32 versions of the vision model part, while using a Q4 or similar for the text model part. In llama.cpp you can specify these separately[1]. Since the vision model part is usually quite small in comparison, this isn't as rough as it sounds. Haven't had a chance to test that yet though.

[1]: https://github.com/ggml-org/llama.cpp/blob/master/tools/serv... (using --mmproj AFAIK)
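For anyone wanting to try that split, a sketch assuming separately downloaded GGUFs (--mmproj is the llama.cpp flag from the link above; the file names are hypothetical):

    # text model quantized to Q4, vision projector kept at F16
    llama-server -m Qwen3-VL-8B-Instruct-Q4_K_M.gguf \
        --mmproj mmproj-Qwen3-VL-8B-Instruct-F16.gguf \
        -ngl 99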

nightski 2 hours ago

Does GLM 4.7 run well on the Spark? I thought I read that it didn't, but it wasn't clear.