vova_hn2 | 11 hours ago
I've always thought it's kind of weird that we spend exactly the same amount of compute to generate both "fork" tokens and "lock" tokens. With grammar-aware sampling / constrained decoding [0][1] it is possible to sometimes skip calling the model altogether: if the grammar allows only one token, just insert it. But I don't think any of the current, widely used combinations of models and harnesses does this, and it would only skip inference in rare edge cases anyway. I wonder if there is a more general solution that can make models spend more compute on the important choices, while making generation of the "obvious" tokens cheaper and faster.

[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...

[1] https://developers.redhat.com/articles/2025/06/03/structured...
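To make the shortcut concrete, here is a toy sketch of a constrained-decoding loop with a forced-token fast path. The `grammar_state` object and `model` function are stand-ins I made up, not llama.cpp's actual API; the point is just that a forward pass only happens when the grammar leaves a real choice.

```python
# Toy constrained decoding with a "forced token" shortcut.
# grammar_state (with .accepts / .advance) and model are assumed
# stand-ins, not llama.cpp's real interface.

def decode(model, grammar_state, vocab, max_tokens=256):
    out = []
    for _ in range(max_tokens):
        allowed = [t for t in vocab if grammar_state.accepts(t)]
        if not allowed:
            break  # grammar finished (or stuck)
        if len(allowed) == 1:
            # Forced by the grammar: insert it without calling the model.
            tok = allowed[0]
        else:
            # Real choice: pay for a forward pass, then mask to the grammar.
            logits = model(out)  # assumed to return {token: score}
            tok = max(allowed, key=lambda t: logits[t])
        out.append(tok)
        grammar_state = grammar_state.advance(tok)
    return out
```

One caveat: even forced tokens eventually have to pass through the model to keep the KV cache consistent. They can be batched prefill-style with the next real step, so the saving is mostly latency rather than total FLOPs.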
jameshart | 11 hours ago
Give coding agents access to intellisense and syntax highlighting. Making coding agents spit out syntactically correct code token by token is like asking a human to code on a whiteboard.
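One low-tech version of this: have the harness expose a syntax check as a tool and feed the diagnostics back, rather than hoping for perfect token-by-token output. A minimal sketch (the tool shape is made up, and a real setup would talk to a language server rather than Python's `ast`):

```python
import ast

def check_python_syntax(source: str) -> dict:
    """A tool a coding harness could expose: parse the code and
    return diagnostics instead of silently accepting it."""
    try:
        ast.parse(source)
        return {"ok": True, "errors": []}
    except SyntaxError as e:
        return {"ok": False,
                "errors": [{"line": e.lineno, "col": e.offset, "message": e.msg}]}

# An agent loop would call this after each edit and, on failure,
# return the diagnostics to the model as the next observation.
print(check_python_syntax("def f(:\n    pass"))
```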
olejorgenb | 5 hours ago
> I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.

I think speculative decoding counts as a (perhaps crude) way of implementing this?
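It does: the cheap draft model burns through the obvious tokens and the big model only gets to veto. A toy version of the greedy accept/reject loop, with both models stubbed as functions from a token prefix to a greedy next token (in a real system the verification step is a single batched forward pass over all k draft positions, which is where the speedup comes from):

```python
def speculative_decode(target, draft, prefix, k=4, max_tokens=64):
    """Greedy speculative decoding sketch. `target` and `draft` are
    stand-in callables: token prefix -> argmax next token."""
    out = list(prefix)
    while len(out) < max_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft_toks = []
        for _ in range(k):
            draft_toks.append(draft(out + draft_toks))
        # 2. The target model verifies each position (one batched
        #    forward pass in practice; a loop here for clarity).
        accepted = 0
        for i, tok in enumerate(draft_toks):
            if target(out + draft_toks[:i]) == tok:
                accepted += 1
            else:
                break
        out += draft_toks[:accepted]
        if accepted < k:
            # First disagreement: substitute the target's own token.
            out.append(target(out))
    return out
```

The "obvious" stretches get accepted in bulk, so the target model's cost per generated token concentrates exactly where the draft model gets it wrong, i.e. on the hard choices.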
quotemstr | 7 hours ago
> I wonder if there is a more general solution that can make models spend more compute on making important choices

There's a lot of work going on in various research streams toward making it possible to vary compute per token dynamically, e.g. universal transformers. Maybe one day it'll work well enough to beat conventional techniques.
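For anyone curious what that looks like, the universal-transformer flavor is ACT-style per-token halting: one shared layer applied repeatedly, with each token learning when to stop. A stripped-down PyTorch sketch (real ACT also weights outputs by halting probabilities and adds a ponder cost, which I've dropped; the sizes and threshold here are arbitrary):

```python
import torch
import torch.nn as nn

class AdaptiveDepthBlock(nn.Module):
    """One shared layer applied up to max_steps times; each token
    halts once its cumulative halting score crosses a threshold."""
    def __init__(self, d_model=64, max_steps=8, threshold=0.99):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.halt = nn.Linear(d_model, 1)  # per-token halting score
        self.max_steps = max_steps
        self.threshold = threshold

    def forward(self, x):  # x: (batch, seq, d_model)
        halted = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
        cum = torch.zeros(x.shape[:2], device=x.device)
        for _ in range(self.max_steps):
            new_x = self.layer(x)
            # Halted tokens keep their state; the rest take another step.
            x = torch.where(halted.unsqueeze(-1), x, new_x)
            cum = cum + torch.sigmoid(self.halt(x)).squeeze(-1) * ~halted
            halted = halted | (cum > self.threshold)
            if halted.all():
                break  # every token decided it was "done" early
        return x

print(AdaptiveDepthBlock()(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```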