radarsat1 3 hours ago
> Is it speed?

> Is it that you can backprop through this computation? Do you do so?

With respect, I feel that you may not have read the article.

> Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.

and,

> By storing points across nested convex hulls, this yields a decoding cost of O(k+log n).

and,

> Regardless of their eventual capability ceiling, they already suggest a powerful systems primitive for speeding up larger models.

So yes, and yes.

> Where are the benchmarks?

It's not clear what they should benchmark it against. They do compare speed to a normal KV cache. As for performance, if it's actually executing a Sudoku solver with a 100% success rate, it seems pretty trivial to find a model with a success rate below 100%. Sure, it would be nice to see the data here; I agree with you there.

Personally I think it would be really interesting to see if this method can be combined with a normal model, MoE-style. It is likely possible: the router module should pick up quite quickly that it predicts the right tokens deterministically for some subset of problems.

I like the idea of embedding all sorts of general solvers directly into the model, like a Prolog solver for example. In fact it never would have occurred to me to just go straight for WASM; it's a pretty interesting choice to directly embed a VM. But it makes me wonder what "smaller" interpreters could be useful in this context.
mike_hearn an hour ago
I read the article and had the same question. It's written in such a way that it feels like it's answering these questions without actually doing so.

The right thing to benchmark against isn't a regular transformer, it's a transformer that writes programs that are then interpreted. They have a little visual demo where it looks faster, but only because they make Python absurdly slow, and it's clearly not meant to be a real benchmark.

I spent the whole article thinking, wow, cool, but also ... how is this better than an LLM steering a regular computer? The closest we get is a statement about the need to "internalize what computation is", which doesn't say anything to me. Fundamentally, running actual instructions on a real CPU is always going to be faster than running them via a neural network.

So the interesting part is where they say you can backprop through it. But, ok, backprop is for cases where we don't know how to encode a function using strict logic. Why would you try to backprop through a Sudoku solver? It's probably just that my imagination is limited, but I could have used more on that.
bonoboTP 23 minutes ago
Benchmark it against a fast Python interpreter optimized for AI tool calling, like Monty: https://github.com/pydantic/monty