vova_hn2 | 11 hours ago
I've always thought it's kind of weird that we spend exactly the same amount of compute to generate both "fork" tokens and "lock" tokens. With grammar-aware sampling / constrained decoding [0][1] it is possible to sometimes skip calling the model altogether: if the grammar allows only one token, just insert it. But I don't think any of the current, widely used combinations of models and harnesses does this, and it would only skip inference in rare edge cases anyway. I wonder if there is a more general solution that can make models spend more compute on the important choices, while making generation of the "obvious" tokens cheaper and faster.

[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...

[1] https://developers.redhat.com/articles/2025/06/03/structured...
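To make the shortcut concrete, here is a toy sketch of a constrained-decoding loop with a forced-token fast path. The `grammar_state` object and `model` function are stand-ins I made up, not llama.cpp's actual API; the point is just that a forward pass only happens when the grammar leaves a real choice.

```python
# Toy constrained decoding with a "forced token" shortcut.
# grammar_state (with .accepts / .advance) and model are assumed
# stand-ins, not llama.cpp's real interface.

def decode(model, grammar_state, vocab, max_tokens=256):
    out = []
    for _ in range(max_tokens):
        allowed = [t for t in vocab if grammar_state.accepts(t)]
        if not allowed:
            break  # grammar finished (or stuck)
        if len(allowed) == 1:
            # Forced by the grammar: insert it without calling the model.
            tok = allowed[0]
        else:
            # Real choice: pay for a forward pass, then mask to the grammar.
            logits = model(out)  # assumed to return {token: score}
            tok = max(allowed, key=lambda t: logits[t])
        out.append(tok)
        grammar_state = grammar_state.advance(tok)
    return out
```

One caveat: even forced tokens eventually have to pass through the model to keep the KV cache consistent. They can be batched prefill-style with the next real step, so the saving is mostly latency rather than total FLOPs.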
jameshart | 11 hours ago
Give coding agents access to intellisense and syntax highlighting. Making coding agents spit out syntactically correct code token by token is like asking a human to code on a whiteboard.
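One low-tech version of this: have the harness expose a syntax check as a tool and feed the diagnostics back, rather than hoping for perfect token-by-token output. A minimal sketch (the tool shape is made up, and a real setup would talk to a language server rather than Python's `ast`):

```python
import ast

def check_python_syntax(source: str) -> dict:
    """A tool a coding harness could expose: parse the code and
    return diagnostics instead of silently accepting it."""
    try:
        ast.parse(source)
        return {"ok": True, "errors": []}
    except SyntaxError as e:
        return {"ok": False,
                "errors": [{"line": e.lineno, "col": e.offset, "message": e.msg}]}

# An agent loop would call this after each edit and, on failure,
# return the diagnostics to the model as the next observation.
print(check_python_syntax("def f(:\n    pass"))
```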
olejorgenb | 5 hours ago
> I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.

I think speculative decoding counts as a (perhaps crude) way of implementing this?
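It does: the cheap draft model burns through the obvious tokens and the big model only gets to veto. A toy version of the greedy accept/reject loop, with both models stubbed as functions from a token prefix to a greedy next token (in a real system the verification step is a single batched forward pass over all k draft positions, which is where the speedup comes from):

```python
def speculative_decode(target, draft, prefix, k=4, max_tokens=64):
    """Greedy speculative decoding sketch. `target` and `draft` are
    stand-in callables: token prefix -> argmax next token."""
    out = list(prefix)
    while len(out) < max_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft_toks = []
        for _ in range(k):
            draft_toks.append(draft(out + draft_toks))
        # 2. The target model verifies each position (one batched
        #    forward pass in practice; a loop here for clarity).
        accepted = 0
        for i, tok in enumerate(draft_toks):
            if target(out + draft_toks[:i]) == tok:
                accepted += 1
            else:
                break
        out += draft_toks[:accepted]
        if accepted < k:
            # First disagreement: substitute the target's own token.
            out.append(target(out))
    return out
```

The "obvious" stretches get accepted in bulk, so the target model's cost per generated token concentrates exactly where the draft model gets it wrong, i.e. on the hard choices.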
quotemstr | 7 hours ago
> I wonder if there is a more general solution that can make models spend more compute on making important choices

There's a lot of work going on in various research streams toward making it possible to vary compute per token dynamically, e.g. universal transformers. Maybe one day it'll work well enough to beat conventional techniques.
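For anyone curious what that looks like, the universal-transformer flavor is ACT-style per-token halting: one shared layer applied repeatedly, with each token learning when to stop. A stripped-down PyTorch sketch (real ACT also weights outputs by halting probabilities and adds a ponder cost, which I've dropped; the sizes and threshold here are arbitrary):

```python
import torch
import torch.nn as nn

class AdaptiveDepthBlock(nn.Module):
    """One shared layer applied up to max_steps times; each token
    halts once its cumulative halting score crosses a threshold."""
    def __init__(self, d_model=64, max_steps=8, threshold=0.99):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.halt = nn.Linear(d_model, 1)  # per-token halting score
        self.max_steps = max_steps
        self.threshold = threshold

    def forward(self, x):  # x: (batch, seq, d_model)
        halted = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
        cum = torch.zeros(x.shape[:2], device=x.device)
        for _ in range(self.max_steps):
            new_x = self.layer(x)
            # Halted tokens keep their state; the rest take another step.
            x = torch.where(halted.unsqueeze(-1), x, new_x)
            cum = cum + torch.sigmoid(self.halt(x)).squeeze(-1) * ~halted
            halted = halted | (cum > self.threshold)
            if halted.all():
                break  # every token decided it was "done" early
        return x

print(AdaptiveDepthBlock()(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
```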