gpt5 | 3 days ago
Slightly tangential question: they said they have protected the public test set with a strong copyleft license to prevent training private models on it. Does that actually work? Hasn't AI training so far simply ignored all license and copyright restrictions?
candiddevmike | 3 days ago
Sir, we've already ingested 503,377 copyleft-licensed codebases, I don't think the training set can take any more!
joefkelley | 3 days ago
I happen to have worked on exactly this at Google. No: to the best of our ability, we don't train on restrictively-licensed code.
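For illustration, here's a minimal sketch of what license-based filtering can look like in a data pipeline. The metadata field name and the exact set of SPDX identifiers are assumptions for the example, not any vendor's actual pipeline:

    # Hypothetical sketch: drop restrictively-licensed repos from a
    # training corpus by detected SPDX license. The "spdx_license" field
    # and the identifier set below are illustrative assumptions.
    RESTRICTIVE = {"GPL-2.0-only", "GPL-3.0-or-later", "AGPL-3.0-only", "SSPL-1.0"}

    def keep_for_training(repo: dict) -> bool:
        """Keep a repo only if its detected license is not restrictive."""
        license_id = repo.get("spdx_license")  # e.g. output of a license scanner
        if license_id is None:
            return False  # unknown license: exclude, to be safe
        return license_id not in RESTRICTIVE

    repos = [
        {"name": "permissive-lib", "spdx_license": "MIT"},
        {"name": "copyleft-tool", "spdx_license": "AGPL-3.0-only"},
    ]
    training_set = [r for r in repos if keep_for_training(r)]  # keeps only the MIT repo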
BoorishBears | 3 days ago
I feel like public datasets are something we're holding onto with LLM benchmarks for historical reasons, but we need to move on from them. Older, non-instruction-tuned models needed post-training on public datasets to even reliably produce meaningful answers. Now we're testing tasks so complex that the LLM should reasonably be expected to answer without additional post-training.

Once you have a public dataset, even feeding those examples to an LLM and producing synthetic variations is enough to let you game the benchmark (see the sketch below). And the worst part is you don't need to be unethical to do this: some people would say it's just a good way to expand your training data, even though it incidentally lets you overfit on the task without overfitting on the public dataset. So everyone's doing stuff like that, and we're getting models that are increasingly overfit to a few narrow tasks.

The alternative is just giving detailed plain-English descriptions of the tasks in question. Those can be used to generate synthetic tasks, but won't result in matching the benchmark's "shape" perfectly (as long as the questions stay hidden), and that alone is enough to ensure some level of generalization takes place.
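To make that mechanism concrete, here's a toy sketch of the synthetic-variations pipeline, assuming the OpenAI Python SDK; the model name, prompt, and public_test_set contents are all illustrative placeholders:

    # Toy sketch of the "synthetic variations" loophole described above.
    # Assumes the OpenAI Python SDK; model name and prompt are illustrative.
    from openai import OpenAI

    client = OpenAI()
    public_test_set = ["...benchmark task text..."]  # placeholder examples

    def paraphrase_task(example: str) -> str:
        """Restate a benchmark task with new names/values but the same shape."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": "Rewrite this coding task with different "
                           "identifiers, values, and wording, keeping the "
                           "same structure:\n\n" + example,
            }],
        )
        return resp.choices[0].message.content

    # Each public example seeds several near-duplicates: they dodge
    # exact-match decontamination but still overfit the benchmark's shape.
    variants = [paraphrase_task(ex) for ex in public_test_set for _ in range(5)]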
ej88 | 3 days ago
https://scale.com/leaderboard/swe_bench_pro_commercial

I definitely trust the totally private dataset more.
kenstler | 3 days ago
One of the authors here. We should clarify that the strong copyleft license is a best-effort attempt at decontamination for the public set. It's part of the tradeoff of having an open-source set: true decontamination is only possible with the private commercial set, but we can't release those problems, and if we did, they'd be immediately susceptible to future contamination.
stephendause | 3 days ago
This is a key question, in my opinion. It's one of the things that make benchmarking the SWE capabilities of LLMs difficult. It's usually impossible to know whether the LLM has seen a problem before, and coming up with new, representative problem sets is time-consuming.
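For reference, the usual partial answer is an n-gram overlap check between benchmark problems and the training corpus. A minimal sketch below, with the benchmark list and shard path as placeholders; note it only catches verbatim overlap, not paraphrased variants:

    # Minimal sketch of n-gram-overlap contamination checking, a common
    # (and imperfect) way to test whether a problem appeared in training
    # data. Catches verbatim reuse only, not paraphrases.
    def ngrams(text: str, n: int = 13) -> set[str]:
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def is_contaminated(problem: str, corpus_ngrams: set[str]) -> bool:
        """Flag a problem if any of its n-grams appears verbatim in the corpus."""
        return not ngrams(problem).isdisjoint(corpus_ngrams)

    # Placeholders: one training shard and a list of benchmark problem texts.
    corpus_ngrams = ngrams(open("training_shard.txt").read())
    benchmark = ["...problem statement..."]
    flagged = [p for p in benchmark if is_contaminated(p, corpus_ngrams)]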
heavyset_go | 3 days ago
If courts find that training models and running inference on a dataset is fair use, licenses mean nothing. It looks like one court already did, in a case that doesn't set binding precedent, but I might be remembering incorrectly.
stri8ed | 3 days ago
Not a chance. Even if American companies did abide by it, there's no reason Chinese companies would. And good luck definitively proving that a model was trained on it.