Retr0id 12 hours ago
Super cool! A related situation I was in recently: I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, making it hard to tell whether I was looking at a "good" or "bad" commit without repeated trials (in practice I just did repeats). I could pick a threshold and use bayesect as described, but that throws away information. How hard would it be to generalize this to let me plug in a raw benchmark score at each step?
hauntsaninja an hour ago
I don't yet know a better way to do this than using a threshold! I think if you assume perf is normally distributed, you can still get some of the math to work out. But I will need to think more about this... if I ever choose this adventure, I'll post an update on https://github.com/hauntsaninja/git_bayesect/issues/25 (I really enjoy how many generalisations there are of this problem :-) ) | ||||||||
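One rough sketch of how the normal-distribution idea could work out: keep a posterior over which commit introduced the regression, and reweight it by a Gaussian likelihood for each raw benchmark score instead of forcing a good/bad call. Everything here is illustrative, not how git_bayesect actually works, and the parameters `mu_good`, `mu_bad`, and `sigma` are assumed known:

```python
import math

def gauss_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def update_posterior(posterior, commit, score, mu_good, mu_bad, sigma):
    """posterior[c] = P(the first bad commit is c).

    A benchmark run at `commit` is drawn from the "bad" distribution
    iff the changepoint c is at or before that commit.
    """
    weighted = [
        p * gauss_pdf(score, mu_bad if c <= commit else mu_good, sigma)
        for c, p in enumerate(posterior)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]
```

The nice property is that repeated noisy runs at the same commit just apply more updates, so each raw score contributes exactly as much evidence as it carries, with no thresholding step.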
furyofantares 5 hours ago
I have this same issue a lot. I vibe up a lot of really simple casual games, which should have very minimal demands, yet the LLM agent often introduces problems that don't present right away. Either it takes multiple bad things before I notice, or it doesn't really affect anything on a dev machine but is horrible on wasm+mobile builds, or I just don't notice right away. All of this is really hard to track down: the heuristics are noisy, and I don't know whether I'm looking for one really dumb thing or a bunch of small things that accumulated over time.
ajb 9 hours ago
At a guess, you can reuse the entropy part, but you'd need to plug in a new probability distribution. | ||||||||
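A sketch of what that separation could look like: the entropy-driven choice of which commit to measure next stays untouched, and only the observation model changes to a Gaussian over raw scores. All of the names and parameters below are hypothetical, not bayesect's actual internals:

```python
import math
import random

def entropy(post):
    return -sum(p * math.log(p) for p in post if p > 0)

def gauss_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def update(post, commit, score, mu_good, mu_bad, sigma):
    # Reweight each candidate changepoint c by the likelihood of the score.
    w = [p * gauss_pdf(score, mu_bad if c <= commit else mu_good, sigma)
         for c, p in enumerate(post)]
    total = sum(w)
    return [x / total for x in w]

def expected_entropy_after(post, commit, mu_good, mu_bad, sigma, n=2000, seed=0):
    """Monte Carlo estimate of the posterior entropy left after one run at
    `commit`: sample a changepoint from the current posterior, simulate a
    benchmark score, update, and average the resulting entropies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        c = rng.choices(range(len(post)), weights=post)[0]
        y = rng.gauss(mu_bad if c <= commit else mu_good, sigma)
        total += entropy(update(post, commit, y, mu_good, mu_bad, sigma))
    return total / n

def best_probe(post, mu_good, mu_bad, sigma):
    """Commit whose measurement is expected to shrink the posterior most."""
    return min(range(len(post)),
               key=lambda i: expected_entropy_after(post, i, mu_good, mu_bad, sigma))
```

With well-separated means this recovers ordinary bisection (measure the posterior midpoint); as the noise grows, the expected-entropy calculation automatically accounts for how little a single noisy run tells you.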