vb-8448 3 hours ago
> how many other people have encountered a problem close enough to yours and solved it somewhere on the open internet

I'm 100% sure that all our web, cc, codex, or whatever sessions are used for training, RL, or both. This makes the universe the models know about at least one order of magnitude bigger than the open internet.
beepbooptheory 3 hours ago
I get how this is a truism now, but I never really understood why it would be useful to scrape cc/codex sessions for training. The relative amount of human input in those sessions is so low (isn't that why they are so loved and used?), so how could it actually be useful to them? Wouldn't you wanna focus on people not using it?
nathan_compton 3 hours ago
I think this is a rosy estimate. The vast majority of what people do with these models is just the same old shit. I would be surprised if even 1% of it were genuinely novel stuff worth folding back into the training data.