MattRogish 4 hours ago

On the one hand, I agree.

The whole point of LLM-based code execution is, well, I can just type in any old language it understands and it ought to figure out what I mean!

A "skill" for searching a pdf could be :

* "You can search PDFs. The code is in /lib/pdf.py"

or it could be:

* "Here's a pile of libraries, figure out which you want to use for stuff"

or it could be:

* "Feel free to generate code (in any executable programming language) on the fly when you want to search a PDF."

or it could be:

* "Solve this problem <x>" and the LLM sees a pile of PDFs in the problem and decides to invent a parser.

or any other nearly infinite way of trying to get a non-deterministic LLM to do a thing you want it to do.
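To make the first option concrete, here's a minimal sketch of what a /lib/pdf.py helper might contain. It assumes the pypdf library; the function name and signature are purely illustrative, not part of any actual skill spec.

```python
# Hypothetical /lib/pdf.py that a skill prompt could point the model at.
# Assumes pypdf is installed (pip install pypdf).
from pypdf import PdfReader

def search_pdf(path: str, query: str) -> list[tuple[int, str]]:
    """Return (page_number, line) pairs whose text contains `query`."""
    hits = []
    reader = PdfReader(path)
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""  # extract_text() can come back empty
        for line in text.splitlines():
            if query.lower() in line.lower():
                hits.append((page_num, line.strip()))
    return hits
```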

At some level, this is all the same. At least, it rounds to the same thing in a kinda "Big O" order-of-magnitude comparison.

On the other hand, I also agree, but I can definitely see present value in trying to standardize it, because humans want to see what is going on (see: JSON - it's highly desirable for programmers to be able to look at a string representation of data rather than send opaque binary over the wire, even though to a computer the binary is gonna be a lot faster).
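For instance, the same tiny record as human-readable JSON versus packed binary (the record and field layout here are made up for illustration):

```python
# The same record as human-readable JSON vs. packed binary.
# Binary is smaller and faster to parse, but opaque on the wire.
import json
import struct

record = {"id": 42, "temp_c": 21.5}

as_json = json.dumps(record)              # '{"id": 42, "temp_c": 21.5}'
as_binary = struct.pack("<if", 42, 21.5)  # 8 bytes: little-endian int32 + float32

print(len(as_json), as_json)      # 26 chars, readable at a glance
print(len(as_binary), as_binary)  # 8 bytes, meaningless without the schema
```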

There is probably an argument, too, for optimization of context windows and tokens burned and all that kinda jazz. `O(n)` is the same as `O(10*n)` (where n is tokens burned, or $$$ per second, or context window size), and the constant factor doesn't matter in theory but certainly does in practice when you're the one paying the bill, or when you fill up the context window and get nonsense.
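To put numbers on that constant factor (price and token counts here are invented for the sketch, not any vendor's real rates):

```python
# Back-of-the-envelope cost of the constant in O(10*n).
# All numbers below are hypothetical assumptions.
price_per_token = 3.00 / 1_000_000  # assume $3 per million input tokens
lean_tokens = 20_000                # tokens burned by a terse skill description
chatty_tokens = 10 * lean_tokens    # same skill, 10x the tokens

print(f"lean:   ${lean_tokens * price_per_token:.2f} per task")    # $0.06
print(f"chatty: ${chatty_tokens * price_per_token:.2f} per task")  # $0.60
```

Same Big O, very different bill once you're running it at scale.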

So if this is a _thoughtful_ standard that takes that kinda stuff into account then, well, great! It gives a benchmark we can improve and iterate upon.

With some hypothetical super-LLM that has a nearly infinite context window, a cost per token of nearly zero, and throughput nearing infinity, you can just say "solve my problem" and it will (eventually) do it. But for now, I can squint and see how this might be helpful.