culopatin 18 hours ago

But are those tests relevant? I tried using LLMs to write tests at work, and whenever I review them I end up asking it, "Ok great, it passes the test, but is the test relevant? Does it test anything useful?" And I get an "Oh yeah, you're right, this test is pointless."

manmal 17 hours ago | parent | next [-]

Keep track of test coverage and ask it to delete tests without lowering coverage by more than, say, 0.01 percentage points. If you have a script that gives it only the test coverage, plus a file listing all tests with their line number ranges, it becomes a more or less dumb task it can work on for hours without actually reading the files (which would fill context too quickly).
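
A minimal sketch of such a script, assuming pytest and coverage.py; the file names (coverage.json, the tests/ directory) and the output format are assumptions to adapt to your setup:

```python
#!/usr/bin/env python3
"""Give the agent only a coverage summary plus test locations, not the sources.

Minimal sketch assuming pytest + coverage.py; coverage.json and the tests/
layout are assumptions, adjust to your project.
"""
import ast
import json
import pathlib
import subprocess

# Run the suite under coverage and emit a machine-readable report.
subprocess.run(["coverage", "run", "-m", "pytest", "-q"], check=False)
subprocess.run(["coverage", "json", "-o", "coverage.json"], check=True)

totals = json.loads(pathlib.Path("coverage.json").read_text())["totals"]
print(f"total coverage: {totals['percent_covered']:.2f}%")

# List each test function with its line range so the agent can name
# deletion candidates without reading the test files themselves.
for path in sorted(pathlib.Path("tests").rglob("test_*.py")):
    tree = ast.parse(path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            print(f"{path}:{node.lineno}-{node.end_lineno} {node.name}")
```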

gaigalas 16 hours ago | parent [-]

That does not work as advertised.

If you leave an agent for hours trying to raise the coverage percentage without further guiding instructions, you will end up with lots of garbage.

In order to achieve this, you need several distinct loops. One that creates tests (there will be garbage), one that consolidates redundant tests, one that parametrizes repetitive tests, and so on.

Agents create redundant tests for all sorts of reasons. Maybe they're trying to hit a hard-to-reach line and leave several attempts behind. Or maybe they "get creative" and guess at what is uncovered instead of actually following the coverage report, and so on.

Less capable models are actually better at this. They're faster, don't "get creative" with weird ideas mid-task, and cost less. Just make them work one test at a time: spawn, write one test that verifiably increases overall coverage, exit. Once you reach a threshold, start the consolidation loop: pick a redundant pair of tests, consolidate, exit. And so on.
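
A rough sketch of that spawn-one-task-and-exit loop. Here run_agent() is a hypothetical hook into whatever agent CLI or API you use, the coverage measurement assumes coverage.py's JSON report, and the 85% target and 0.01-point tolerance are illustrative, not recommendations:

```python
"""Orchestration sketch for the one-test-per-spawn approach described above.

run_agent() is a hypothetical placeholder; all thresholds are illustrative.
"""
import json
import pathlib
import subprocess

COVERAGE_TARGET = 85.0  # assumed threshold before switching to consolidation

def measure_coverage() -> float:
    subprocess.run(["coverage", "run", "-m", "pytest", "-q"], check=False)
    subprocess.run(["coverage", "json", "-o", "coverage.json"], check=True)
    report = json.loads(pathlib.Path("coverage.json").read_text())
    return report["totals"]["percent_covered"]

def run_agent(task: str) -> None:
    """Hypothetical: hand the agent one narrow task, wait for it to exit."""
    raise NotImplementedError("wire this up to your agent CLI or API")

def discard_changes() -> None:
    # Throw away the agent's edits if they didn't verifiably help.
    subprocess.run(["git", "checkout", "--", "tests"], check=True)

def main() -> None:
    coverage = measure_coverage()
    # Creation loop: one verifiable coverage gain per spawn, then exit.
    while coverage < COVERAGE_TARGET:
        run_agent("Add exactly one test that covers a currently uncovered line "
                  "from the coverage report. Do not touch existing tests.")
        new_coverage = measure_coverage()
        if new_coverage > coverage:
            coverage = new_coverage
        else:
            discard_changes()
    # Consolidation loop: merge one redundant pair per spawn, keep coverage flat.
    for _ in range(20):  # bounded number of passes
        run_agent("Merge one pair of redundant tests into a single parametrized "
                  "test without lowering coverage.")
        if measure_coverage() < coverage - 0.01:
            discard_changes()

if __name__ == "__main__":
    main()
```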

Of course, you can use a powerful model and babysit it as well; a few disambiguating questions and interruptions will guide it well. If you want it truly unattended, though, it's damn hard to get stable results.

tlarkworthy 10 hours ago | parent | prev | next [-]

We fixed this at work by instructing it to maximize coverage with minimal tests, which is closer to our coding style.

elbear 9 hours ago | parent | prev | next [-]

Those tests were written by people. That's why they were confident that what the LLM implemented was correct.

jackschultz 2 hours ago | parent [-]

A meta point about how important context is.

People see "LLMs" and "tons of tests written" in the same sentence and take it as proof that models love writing pointless tests, rather than realizing that the tests here are standard, human-written tests meant to show that the model's code is validated by a currently trusted source.

It shows that writing comments humans will read, with the right context, is _very_ similar to how we need to interact with LLMs. And if we fail to communicate with humans, clearly we're going to fail with models too.

wahnfrieden 18 hours ago | parent | prev [-]

Yes

Skill issue... And perhaps the wrong model + harness