hansmayer 2 hours ago

Well, you do understand that the "penalising" - or, as the ML research community likes to call it, "adjusting the weights downwards" - is part of setting up the evaluation functions for, gasp, calculating the next most likely tokens, or more precisely, the tokens with the highest probability? You are effectively proving my point, perhaps in a somewhat hand-wavy fashion, but one that can nevertheless be translated into technical language.
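(To make the "highest-probability token" part concrete, here is a toy sketch. The vocabulary and logit values are made up for illustration; a real LLM does the same thing over tens of thousands of tokens.)

```python
import math

# Toy sketch, not a real LLM: given raw scores (logits) over a tiny
# vocabulary, softmax turns them into probabilities, and greedy decoding
# picks the highest-probability next token. "Penalising" a token during
# training amounts to pushing its logit down, which lowers its probability.
def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "the", "sat"]   # hypothetical vocabulary
logits = [2.0, 1.0, 0.5, 3.0]          # hypothetical model scores

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy decoding: argmax
```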

andy12_ an hour ago | parent [-]

You do understand that the mechanism through which an autoregressive transformer produces output (predicting one token at a time) is completely unrelated to how a model with that architecture behaves or how it is trained, right? You can have both:

- An LLM that works through completely different mechanisms, like predicting masked words, predicting the previous word, or predicting several words at a time.

- An ordinary, traditional program, like a calculator, encoded as an autoregressive transformer that emits its output one token at a time (compiled neural networks) [1][2]

So saying "it predicts the next word" is a nothing-burger. That a program calculates its output one token at a time tells you nothing about its behavior.
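(A toy illustration of that last point - this is an assumed example, not code from the linked papers: a fully deterministic calculator that nonetheless produces its answer autoregressively, one character at a time, conditioned on the prompt plus everything emitted so far. The token-at-a-time interface tells you nothing about the behavior behind it.)

```python
# Hypothetical "autoregressive calculator": the generation loop only ever
# asks for the next token, yet the underlying behavior is exact arithmetic.
def next_token(prompt, generated):
    # eval() stands in for the computation; a compiled transformer could
    # realize the same input-output behavior (see the linked papers).
    answer = str(eval(prompt))
    pos = len(generated)
    return answer[pos] if pos < len(answer) else None  # None = stop token

def generate(prompt):
    out = ""
    while (tok := next_token(prompt, out)) is not None:
        out += tok  # emit exactly one token per step
    return out
```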

[1] https://arxiv.org/pdf/2106.06981

[2] https://wengsyx.github.io/NC/static/paper_iclr.pdf