Retr0id 12 hours ago

Super cool!

A related situation I ran into recently: I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, making it hard to tell whether a given commit was "good" or "bad" without repeated trials (in practice I just did repeats).

I could pick a threshold and use bayesect as described, but that involves throwing away information. How hard would it be to generalize this to let me plug in a raw benchmark score at each step?

hauntsaninja an hour ago | parent | next [-]

I don't yet know a better way to do this than using a threshold!

I think if you assume perf is normally distributed, you can still get some of the math to work out. But I will need to think more about this... if I ever choose this adventure, I'll post an update on https://github.com/hauntsaninja/git_bayesect/issues/25
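A minimal sketch of what that Gaussian variant could look like (this is not bayesect's actual API; `update_posterior`, `mu_good`, `mu_bad`, and `sigma` are hypothetical names, and the model assumes a single step change with known means and noise):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def update_posterior(posterior, commit, score, mu_good, mu_bad, sigma):
    """Bayes update of P(first bad commit = k) after observing a raw
    benchmark score at `commit`. Hypothesis k says commits before k
    have mean mu_good and commits from k onward have mean mu_bad."""
    weighted = []
    for k, p in enumerate(posterior):
        mu = mu_good if commit < k else mu_bad
        weighted.append(p * normal_pdf(score, mu, sigma))
    total = sum(weighted)
    return [w / total for w in weighted]
```

Each measurement reweights every hypothesis by how plausible the raw score is under it, so a borderline score nudges the posterior a little rather than forcing a hard good/bad call, which is exactly the information a threshold throws away.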

(I really enjoy how many generalisations there are of this problem :-) )

furyofantares 5 hours ago | parent | prev | next [-]

I have this same issue a lot.

I vibe up a lot of really simple casual games, which should have very minimal demands, and the LLM agent often introduces problems that don't show up right away. Either it takes several bad changes before anything is noticeable, or a change has no visible effect on a dev machine but is horrible on wasm+mobile builds, or I simply don't notice in time.

This is all really hard to track down: there's noise in the heuristics, and I don't know whether I'm looking for one really dumb thing or a bunch of small things that accumulated over time.

rs545837 an hour ago | parent [-]

This is a real pain point. One thing that helps: when an LLM agent makes changes across multiple commits, look at what it actually touched structurally. Often the agent adds a feature in commit 5 but subtly breaks something in commit 3 by changing a shared function it didn't fully understand.

ajb 9 hours ago | parent | prev [-]

At a guess, you can reuse the entropy part, but you'd need to plug in a new probability distribution.
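To make that concrete, here's a rough sketch of the expected-information-gain step under a Gaussian observation model (all names here are hypothetical, not bayesect's code; with continuous scores the exact expectation is an integral, so this approximates it by Monte Carlo):

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def update_posterior(posterior, commit, score, mu_good, mu_bad, sigma):
    # Reweight P(first bad commit = k) by the likelihood of the score.
    weighted = [p * normal_pdf(score, mu_good if commit < k else mu_bad, sigma)
                for k, p in enumerate(posterior)]
    total = sum(weighted)
    return [w / total for w in weighted]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def expected_entropy_after(posterior, commit, mu_good, mu_bad, sigma,
                           trials=2000, seed=0):
    """Monte Carlo estimate of the posterior entropy left after
    benchmarking `commit`: sample a hypothesis from the posterior,
    sample a score from that hypothesis, average the entropy of the
    updated posterior."""
    rng = random.Random(seed)
    ks = range(len(posterior))
    total = 0.0
    for _ in range(trials):
        k = rng.choices(ks, weights=posterior)[0]
        score = rng.gauss(mu_good if commit < k else mu_bad, sigma)
        total += entropy(update_posterior(posterior, commit, score,
                                          mu_good, mu_bad, sigma))
    return total / trials

def pick_next_commit(posterior, mu_good, mu_bad, sigma):
    # Greedily test the commit expected to leave the least uncertainty.
    return min(range(len(posterior)),
               key=lambda c: expected_entropy_after(posterior, c,
                                                    mu_good, mu_bad, sigma))
```

The entropy machinery is unchanged; only the observation model plugged into the update is new. With a uniform prior and low noise this recovers ordinary bisection (it picks a commit that splits the hypotheses evenly), and as the noise grows it naturally starts favoring re-measuring informative commits.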