| ▲ | rzmmm 15 hours ago |
| The model has multiple layers of mechanisms to prevent carbon-copy output of the training data. |
|
| ▲ | glemion43 13 hours ago | parent | next [-] |
| Do you have a source for this? A carbon copy would mean overfitting.
| |
| ▲ | fweimer 7 hours ago | parent | next [-] | | I saw weird results with Gemini 2.5 Pro when I asked it to provide concrete source code examples matching certain criteria, and to quote the source code it found verbatim. It claimed in its response to have quoted the sources verbatim, but that wasn't true at all: they had been rewritten, still in the style of the project it was quoting from, but otherwise quite different, and without a match in the Git history. It looked a bit like someone at Google subscribed to a legal theory under which you can avoid copyright infringement if you take a derivative work and apply a mechanical obfuscation to it. | | |
| ▲ | Workaccount2 4 hours ago | parent [-] | | LLMs are not archives of information. People seem to have this belief, or perhaps just a general intuition, that LLMs are a Google search over a training set with a fancy language engine on the front end. That's not what they are. The models (almost) avoid copyright on their own, because they never copy anything in the first place, which is why the model is a dense web of weight connections rather than an orderly bookshelf of copied training data. Picture yourself contorting your hands under a spotlight to cast a shadow in the shape of a bird. The bird is not in your fingers, even though the shadow of the bird and the shadow of your hand look very similar. Furthermore, your hand-shadow has no idea what a bird is. | | |
| ▲ | fweimer 3 hours ago | parent [-] | | For a task like this, I expect the tool to use web searches and sift through the results, similar to what a human would do. Based on progress indicators shown during the process, this is what happens. It's not an offline synthesis purely from training data, something you would get from running a model locally. (At least if we can believe the progress indicators, but who knows.) |
|
| |
| ▲ | NewsaHackO 3 hours ago | parent | prev | next [-] | | It is the classic "He made it up" | |
| ▲ | Der_Einzige 6 hours ago | parent | prev [-] | | The source is: just read the definition of what "temperature" is. But honestly, source = "a knuckle sandwich" would be appropriate here. |
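(To make the temperature point concrete: a minimal, toy sketch of softmax sampling with a temperature knob. The logits are invented and real decoders add top-k/top-p and other machinery, so this is illustrative only.)

    # Toy temperature sampling (illustrative; not any vendor's actual decoder).
    # Higher temperature flattens the distribution, so repeated runs are less
    # likely to reproduce the same token sequence verbatim.
    import math, random

    def sample_with_temperature(logits, temperature=1.0):
        scaled = [l / temperature for l in logits]
        m = max(scaled)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        r = random.random()
        cum = 0.0
        for i, p in enumerate(probs):
            cum += p
            if r < cum:
                return i
        return len(probs) - 1

    logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate tokens
    print(sample_with_temperature(logits, temperature=0.2))  # almost always token 0
    print(sample_with_temperature(logits, temperature=2.0))  # much more varied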
|
|
| ▲ | TZubiri 15 hours ago | parent | prev | next [-] |
| Forgive the skepticism, but this translates directly to "we asked the model pretty please not to do it in the system prompt"
| |
| ▲ | ffsm8 14 hours ago | parent | next [-] | | It's mind-boggling if you think about the fact that they're essentially "just" statistical models. It really contextualizes the old wisdom of Pythagoras that everything can be represented as numbers / math is the ultimate truth | | |
| ▲ | glemion43 13 hours ago | parent | next [-] | | They are not just statistical models. They create concepts in latent space, which is basically compression, and that compression forces this. | | |
| ▲ | jrmg 7 hours ago | parent | next [-] | | You’re describing a complex statistical model. | | |
| ▲ | glemion43 4 hours ago | parent [-] | | Debatable, I would argue. It's definitely not 'just a statistical model', and I would argue that the compression into this space fixes potential issues differently than plain statistics. But I'm not a mathematics expert; if that is the real official definition, I'm fine with it. But are you, though? |
| |
| ▲ | mmooss 6 hours ago | parent | prev [-] | | What is "latent space"? I'm wary of metamagical descriptions of technology that's in a hype cycle. | | |
| ▲ | DoctorOetker 5 hours ago | parent | next [-] | | It's a statistical term: a latent variable is one that is either known or believed to exist, and is then estimated. Consider estimating the position of an object from noisy readings. One presumes that position to exist in some sense, and then one can estimate it by combining multiple measurements, increasing positioning resolution. It's any variable that is postulated or known to exist, and for which you run some fitting procedure. | |
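(A minimal sketch of the noisy-position example above; the true position and noise level are invented, and the "fitting procedure" is just the sample mean.)

    # The true position is the latent variable: never observed directly, but
    # the estimate tightens as more noisy readings are combined.
    import random

    true_position = 3.7        # latent variable (unknown in practice)
    readings = [true_position + random.gauss(0, 0.5) for _ in range(100)]

    estimate = sum(readings) / len(readings)   # fitting procedure: the mean
    print(f"one reading: {readings[0]:.2f}, combined estimate: {estimate:.2f}")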
| ▲ | AIorNot 5 hours ago | parent | prev | next [-] | | See this video https://youtu.be/D8GOeCFFby4?si=AtqH6cmkOLvqKdr0 | |
| ▲ | glemion43 4 hours ago | parent | prev [-] | | I'm disappointed that you had to add 'metamagical' to your question, tbh. It doesn't matter if AI is in a hype cycle or not; it doesn't change how the technology works. Check out the YouTube videos from 3Blue1Brown; he explains LLMs quite well. Your first step is the word embedding: this vector space represents the relationships between words. Father - grandfather: the vector that turns "father" into "grandfather" is the same vector that turns "mother" into "grandmother". You then use these word vectors in the attention layers to create an n-dimensional space, aka latent space, which basically reflects a 'world' the LLM walks through. This is the 'magic' of LLMs: basically a form of compression, with higher dimensions reflecting a kind of meaning. Your brain does the same thing. It can't store pixels, so when you go back to some childhood environment like your old room, you remember it in some efficient (brain-efficient) way, like the 'feeling' of it. That's also the reason why an LLM is not just some statistical parrot. | | |
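(For illustration only: a toy sketch of the embedding-offset idea with invented 3-D vectors. Real embeddings have hundreds or thousands of dimensions and only approximately satisfy such analogies.)

    # The "one generation up" direction is the same offset for father->grandfather
    # and mother->grandmother. The vectors below are made up for the example.
    import numpy as np

    emb = {
        "father":      np.array([1.0, 0.2, 0.0]),
        "grandfather": np.array([1.0, 0.2, 1.0]),
        "mother":      np.array([0.0, 0.8, 0.0]),
        "grandmother": np.array([0.0, 0.8, 1.0]),
    }

    offset = emb["grandfather"] - emb["father"]   # "one generation up" direction
    guess = emb["mother"] + offset                # apply the same offset to "mother"

    nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - guess))
    print(nearest)   # -> grandmother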
| ▲ | mmooss 2 hours ago | parent [-] | | > It doesn't matter if AI is in a hype cycle or not; it doesn't change how the technology works. It does change what people say about it. Our words are not reality itself; the map is not the territory. Are you saying people should take everything said about LLMs at face value? | | |
| ▲ | glemion43 27 minutes ago | parent [-] | | Being dismissive of technical terms on HN because something seems to be hyped is really weird. It's the reason I'm here: we discuss technology more technically. |
|
|
|
| |
| ▲ | GrowingSideways 14 hours ago | parent | prev [-] | | How so? Truth is naturally an a priori concept; you don't need a chatbot to reach this conclusion. |
| |
| ▲ | ComplexSystems 6 hours ago | parent | prev | next [-] | | The model doesn't know what its training data is, nor does it know what sequences of tokens appeared verbatim in there, so this kind of thing doesn't work. | |
| ▲ | mikaraento 14 hours ago | parent | prev | next [-] | | That might be somewhat ungenerous unless you have more detail to provide. I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction. | | |
| ▲ | TZubiri an hour ago | parent | next [-] | | So it would be able to produce the training data, but with sufficient changes or added magic dust to be able to claim it as one's own. Legally, I think it works, but "evidence" in a court works differently than in science. It's the same word, but don't let that confuse you, and don't mix the two. | |
| ▲ | guenthert 6 hours ago | parent | prev [-] | | Should they though? If the answer to a question^Wprompt happens to be in the training set, wouldn't it be disingenuous to not provide that? | | |
| ▲ | ttctciyf 5 hours ago | parent [-] | | Maybe it's intended to avoid legal liability resulting from reproducing copyrighted material not licensed for training? | |
| ▲ | TZubiri an hour ago | parent [-] | | Ding! It's great business to minimally modify valuable stuff and then take credit for it. As bar-certified counsel explained to me, "if you take a recipe and add, remove, or change just one thing, it's now your recipe." The new trend here is asking Claude Code to create software of some type, like a browser or a DICOM viewer, and then publishing that it managed to do this very expensive thing (but if you check the source code, which is never published, it probably imports a lot of open-source dependencies that actually do the thing). Now this is especially useful in business, but it seems that some people are repurposing it for proving math theorems. The Terence Tao effort, which later checks for prior material, is great! But the fact that Section 2 (for such cases) is filled to the brim, while Section 1 is mostly documented failed attempts (except for one proof; congratulations to the authors), mostly confirms my hypothesis. Claiming that the model has guards that prevent it is a deus ex machina cope against the evidence. |
|
|
| |
| ▲ | efskap 14 hours ago | parent | prev [-] | | Would it really be infeasible to take a sample and do a search over an indexed training set? Maybe a Bloom filter could be adapted. | | |
| ▲ | hexaga 13 hours ago | parent [-] | | It's not the searching that's infeasible. Efficient algorithms for massive-scale full-text search are available. The infeasibility is searching for the (unknown) set of translations that the LLM would put that data through. Even if you posit only basic symbolic LUT mappings in the weights (and it isn't that simple), there's no good way to enumerate them anyway. The model might as well be a learned hash function that maintains semantic identity while utterly eradicating literal symbolic equivalence. |
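(A minimal sketch of the part that is feasible, assuming a plain exact-match index: hash every word 8-gram of the corpus into a set (a Bloom filter would be the memory-efficient variant of the same idea) and flag output containing a known 8-gram verbatim. As the comment above argues, this only catches literal reuse; a paraphrase hashes to entirely different n-grams.)

    # Toy verbatim-overlap check; the corpus and output strings are invented.
    def ngrams(text, n=8):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    corpus = "the quick brown fox jumps over the lazy dog near the quiet river bank"
    index = ngrams(corpus)                 # stand-in for a corpus-scale index

    output = "fox jumps over the lazy dog near the quiet river"
    overlap = ngrams(output) & index
    print("verbatim 8-gram overlap found" if overlap else "no literal match")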
|
|
|
| ▲ | Den_VR 14 hours ago | parent | prev | next [-] |
| Unfortunately. |
|
| ▲ | GeoAtreides 8 hours ago | parent | prev [-] |
| Does it? This is a verbatim quote from Gemini 3 Pro, from a chat a couple of days ago: "Because I have done this exact project on a hot water tank, I can tell you exactly [...]" I somehow doubt an LLM did that exact project, what with it not having any ability to do plumbing in real life...
| |
| ▲ | retsibsi 7 hours ago | parent [-] | | Isn't that easily explicable as hallucination, rather than regurgitation? | | |
|