dataviz1000 8 hours ago

This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.

Agents are not deterministic; they are probabilistic. If the same agent is run repeatedly, it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.

I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.

A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of the time. Remove a sentence and it will solve the problem 70% of the time.

It is so friggen' easy to set up -- stealing the word from the AI sphere -- a TEST HARNESS.

Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn't pass/fail. It's whether the agent still solves the problem the same percentage of the time it consistently has.
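A minimal sketch of what that check could look like, assuming each run of the agent is recorded as a simple pass or fail and the runs are independent. The function names, the pooled two-proportion z-test, and the 0.05 threshold are illustrative choices, not something anyone in this thread specified.

```python
import math

def two_proportion_p_value(pass_a, n_a, pass_b, n_b):
    """Two-sided p-value for the difference between two pass rates (pooled z-test)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both sets all-pass or all-fail: no evidence of a difference
    z = (p_a - p_b) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal
    return math.erfc(abs(z) / math.sqrt(2))

def is_regression(base_passes, base_runs, new_passes, new_runs, alpha=0.05):
    """Flag a regression only if the new pass rate is lower AND the gap is significant."""
    base_rate = base_passes / base_runs
    new_rate = new_passes / new_runs
    p_value = two_proportion_p_value(base_passes, base_runs, new_passes, new_runs)
    return new_rate < base_rate and p_value < alpha

# Hypothetical counts: an 80% prompt versus an edited prompt at 70%.
print(is_regression(80, 100, 70, 100))      # 100 runs each
print(is_regression(800, 1000, 700, 1000))  # 1000 runs each
```

Under these assumptions the same 80% versus 70% gap is not flagged at 100 runs per prompt but is flagged at 1000, which is the reason to compare rates over many runs rather than single outcomes.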

arjie 7 hours ago | parent | next [-]

The word is not co-opted. A harness is just supportive scaffolding to run something. A test harness is scaffolding to run tests against software, a fuzz harness is scaffolding to run a fuzzer against the software, and so on. I've seen it being used in this manner many times over the past 15 years. It's the device that wraps your software so you can run it repeatedly with modifications of parameters, source code, or test condition.

dataviz1000 6 hours ago | parent [-]

> A harness is just supportive scaffolding to run something.

Thank you for the perfect explanation.

Last week I was confused about the word because Anthropic was using test, eval, and harness in the same sentence, so I thought Anthropic had made a test harness. I asked Google "in computer science what is a harness". It responded by discussing only test harnesses, which solidified my thinking that that is what it is.

I wish Google had responded as clearly as you did. In my defense, we don't know if we understand something unless we discuss it.

thesz 6 hours ago | parent | prev [-]

To have some confidence in the consistency of results (p-value), one has to start from a cohort of around 30, if I remember correctly. That is a 1.5 orders of magnitude increase in the computing power needed to detect consistent changes in an agent's behavior (or their absence).
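One way to see what a cohort of a given size buys is to put a confidence interval around the observed pass rate. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96); the choice of interval and the example counts are assumptions for illustration, not something the comment above prescribes.

```python
import math

def wilson_interval(passes, runs, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% confidence."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = passes / runs
    denom = 1 + z * z / runs
    center = (p_hat + z * z / (2 * runs)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)
    )
    return center - half_width, center + half_width

# Hypothetical cohorts with the same observed 80% pass rate.
for passes, runs in [(24, 30), (240, 300)]:
    low, high = wilson_interval(passes, runs)
    print(f"{passes}/{runs}: {low:.3f} to {high:.3f}")
```

With 24 passes out of 30 the interval spans roughly 0.63 to 0.90, so a cohort of 30 can separate large shifts in pass rate but not a move from 80% to 85%. Ten times as many runs narrows the interval considerably.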

dataviz1000 6 hours ago | parent [-]

I apologize for the potato quality of these links; however, I have been working tirelessly to wrap my head around how to reason about how agents and LLMs work. They are more than just a black box.

The first tries to answer what happens when I give the models harder and harder arithmetic problems, to the point that Sonnet will burn 200k tokens over 20 minutes. [0]

The other is a very deep dive into the math of a reasoning model in the only way I could think to approach it, with data visualizations, seeing the computation of the model in real time in relation to all the parts. [1]

I've learned two things. The first is that the behavior of an agent that will reverse engineer any website and the behavior of an agent that does arithmetic are the same. That means that, for a given agent and task, the probability of solving the intended task is stable -- it is a distribution. The other is that models have a blind spot, so creating a red team adversary bug hunter agent will not surface a bug if the same model originally wrote the code.

Understanding that, and knowing that I can verify at the end or use majority of votes (MoV), using agents to automate extremely complicated tasks can be very reliable, with some degree of certainty.
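A rough sketch of why majority voting helps, under two simplifying assumptions that are not from the linked posts: each run is either correct or incorrect, and runs are independent with the same per-run success probability. The blind spot mentioned above is exactly the case where independence breaks down, so treat the numbers as an upper bound on what voting can deliver.

```python
from math import comb

def majority_correct_probability(p, n):
    """Probability that more than half of n independent runs are correct.

    p is the per-run success probability; n must be odd so there are no ties.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))

# Hypothetical per-run success rate of 80%, voted over 1, 5, and 15 runs.
for n in (1, 5, 15):
    print(n, round(majority_correct_probability(0.8, n), 5))
```

Under these assumptions a per-run rate of 0.8 rises to about 0.942 with 5 votes. The same formula shows voting makes things worse when the per-run rate is below 0.5, which is why a final verification step still matters.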

[0] https://adamsohn.com/reliably-incorrect/

[1] https://adamsohn.com/grpo/