| ▲ | piskov 5 days ago |
| Not “may be”: just look at how SWE-bench scores drop to single digits once it’s in C#: https://arxiv.org/html/2506.12286v3
|
| ▲ | fine_tune 5 days ago | parent | next [-] |
| I was going to argue "LLMs need code samples to do well on languages, and if we're honest, C# is a language mostly held in private repos", but GitHub's 2024 report[0] says it's the 5th most used language (I'm too lazy to check whether this report includes private repos, but I'll assume it doesn't). So kinda neat to see this paper! [0] https://github.blog/news-insights/octoverse/octoverse-2024/#... |
| |
| ▲ | CuriouslyC 5 days ago | parent | next [-] | | The big labs are almost certainly using compiler/REPL output on generated code as an oracle for RL. I doubt they have C# in the mix. | | |
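For concreteness, a compiler oracle in this sense can be as simple as compiling each sampled program and handing back a binary reward. A minimal Python sketch, assuming the .NET SDK's `dotnet build` is on PATH; the function name and project layout are illustrative, not anyone's actual pipeline:

    import os
    import subprocess
    import tempfile
    import textwrap

    def compile_reward(csharp_source: str) -> float:
        """Binary RL reward: 1.0 if the sampled C# program compiles, else 0.0."""
        with tempfile.TemporaryDirectory() as proj:
            # Bare-bones project file so `dotnet build` has something to build.
            with open(os.path.join(proj, "Gen.csproj"), "w") as f:
                f.write(textwrap.dedent("""\
                    <Project Sdk="Microsoft.NET.Sdk">
                      <PropertyGroup>
                        <OutputType>Exe</OutputType>
                        <TargetFramework>net8.0</TargetFramework>
                      </PropertyGroup>
                    </Project>
                    """))
            with open(os.path.join(proj, "Program.cs"), "w") as f:
                f.write(csharp_source)
            # Compile; exit code 0 means the sample passed the oracle.
            result = subprocess.run(
                ["dotnet", "build", "--nologo", "-v", "q"],
                cwd=proj, capture_output=True,
            )
            return 1.0 if result.returncode == 0 else 0.0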
| ▲ | tomjakubowski 5 days ago | parent [-] | | Why do you doubt that? It's a widely used language. And there is even an open source C# REPL. | | |
| ▲ | CuriouslyC 5 days ago | parent | prev [-] | | Because RL time is expensive, and I don't think the languages more popular than C# are performing so well that it's worth shrinking their batches to make room for C#. | | |
| ▲ | stingraycharles 5 days ago | parent [-] | | But C# is a typical enterprise language, and enterprises are willing to pay a lot of money for AI. We’re just guessing; the fact of the matter is that we don’t know what inputs they use for their models. |
|
|
| |
| ▲ | yieldcrv 5 days ago | parent | prev [-] | | 5th most used language, based on private repos that the group making the report has exclusive direct access to. I don't see that contradicting your assumption. | | |
| ▲ | BoorishBears 5 days ago | parent [-] | | "In this year’s Octoverse report, we study how public and open source activity on GitHub..." |
|
|
|
| ▲ | stefan_ 5 days ago | parent | prev | next [-] |
| So the "Verified" part of "SWE Bench Verified" means.. not "Verified" at all. I don't get it, who is so opposed to doing the bare minimum of manual work and check what these models are doing? At least back in the day grad students doing an easy meta-paper understood it meant doing some repetitive manual work. Now we got benchmarks by hype vendors who think they can use the thing they are benchmarking to .. mark the bench. |
| |
| ▲ | yorwba 5 days ago | parent | next [-] | | The "Verified" part of "SWE-Bench Verified" means that there was plain "SWE-Bench" before it, which had actually not been verified at all and included a lot of tasks that didn't really make sense for use as a benchmark: https://openai.com/index/introducing-swe-bench-verified/#ada... Data contamination stemming from the fact that it's based on already-solved problems in public repositories is a different issue that cannot be addressed by verifying the benchmark questions harder, but only by putting stricter limits on the model under test. | | | |
| ▲ | jsheard 5 days ago | parent | prev | next [-] | | > So the "Verified" part of "SWE Bench Verified" means.. not "Verified" at all. Seems on-brand for an LLM-related thing to claim that it has verified something without actually checking. | | |
| ▲ | geekymartian 5 days ago | parent | next [-] | | that was my exact thought. how fitting | |
| ▲ | hhh 4 days ago | parent | prev [-] | | Verified has a completely different meaning here: it means the questions have verified valid solutions. |
| |
| ▲ | lieret 5 days ago | parent | prev | next [-] | | [On the SWE-bench team] As someone pointed out, SWE-bench Verified is a subset of tasks that were reviewed to be solvable (i.e., have enough context in the task description) and are scored with unit tests that aren't so overly specific that they rule out valid solutions. We've all read & analyzed a large number of agent trajectories. This loophole seems to be something that popped up with the more recent models, and we simply weren't aware of it. As discussed in the GitHub issue, there's a fix in the new version of the SWE-bench containers (currently being rolled out) that makes sure the relevant commits aren't available. Part of what makes SWE-bench a very interesting benchmark is the enormous action space available to the agents that compete on it. However, that also means there are unexpected things happening as models get better. We're currently working on making all agent runs easily browsable on a website (rather than having to download our AWS buckets) to get even more eyes on the trajectories. Thanks to everyone who uncovered this loophole. | |
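The fix lieret mentions (making the relevant commits unavailable inside the container) presumably amounts to rewriting each task repo so that nothing past the base commit is reachable, so an agent can't just `git log` its way to the real patch. A rough Python sketch of that kind of sanitization, assuming a local clone with an `origin` remote; the helper name is made up and the actual SWE-bench container fix may differ:

    import subprocess

    def strip_future_history(repo_dir: str, base_commit: str) -> None:
        def run(*cmd, capture=False):
            return subprocess.run(cmd, cwd=repo_dir, check=True,
                                  capture_output=capture, text=True)

        # Re-point a single branch at the task's base commit.
        run("git", "checkout", "-B", "main", base_commit)

        # Delete every other branch (they may contain the real fix).
        for branch in run("git", "for-each-ref", "--format=%(refname:short)",
                          "refs/heads", capture=True).stdout.split():
            if branch != "main":
                run("git", "branch", "-D", branch)

        # Delete all tags; releases usually point past the base commit.
        for tag in run("git", "tag", "-l", capture=True).stdout.split():
            run("git", "tag", "-d", tag)

        # Cut the remote (this also drops remote-tracking refs), expire the
        # reflogs, and prune the now-unreachable "future" objects so they
        # can't be resurrected by hash.
        run("git", "remote", "remove", "origin")
        run("git", "reflog", "expire", "--expire=now", "--all")
        run("git", "gc", "--prune=now", "--aggressive")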
| ▲ | sebzim4500 5 days ago | parent | prev | next [-] | | The "Verified" refers to the fact that the benchmark problems were verified by human experts to be reasonable. It says nothing about data contamination, which would depend on the model and would not be the fault of the benchmark. |
| ▲ | blibble 5 days ago | parent | prev | next [-] | | > I don't get it, who is so opposed to doing the bare minimum of manual work and check what these models are doing? I doubt any of the AI company employees are encouraged to go looking for cheating | |
|
|
| ▲ | teaearlgraycold 5 days ago | parent | prev [-] |
| Personally I don't look at or respect LLM benchmarks at all. I've seen SOTA models fail in incredibly shocking ways even recently. Those moments immediately bring me out of the delusion that LLMs have thinking capacity or an understanding of code. |
| |
| ▲ | phatskat 5 days ago | parent [-] | | > the delusion that LLMs have thinking capacity It’s such a strange delusion too, because it’s easy to get caught up in for a moment, and then just as easy to remember “oh no, this thing is as smart as a bag of bricks”. What strikes me more is how these companies sell their AI offerings: we watched an OpenAI presentation about spec-driven development recently, and the presenter was, idk, fine enough, if maybe a bit grandiose. But what really nagged me was the way he ended his presentation with something along the lines of “we’re excited to see AGI continue to grow”, and it’s honestly A) depressing and B) downright fraud: there is no current AGI to speak of, it’s all just guessing the string of words that sound best together, and this OpenAI rep _knows this_. They know that no amount of up-front spec writing will prevent bugs. They know that their LLM doesn’t “know” anything in an actually meaningful way. They know that calling what they have “AGI” is aspirational at best and lying at worst. |
|