Remix clone Hacker News

	▲	swingboy 2 hours ago
		I’ve always assumed any LLM output that was some type of rating or score was bullshit. Unless the LLM writes a Python script to calculate the score (and even then…) then the score it outputs is just the next most likely token, taking into account temperature and what not. You see a lot of frameworks for things like spec-driven development make use of scoring how good the spec/design/plan is and it’s like, uhhh…
	▲	joelthelion 2 hours ago
		> is just the next most likely token, taking into account temperature and what not.
		This doesn't mean anything. All LLM output is like that. That said, I agree that LLMs are terrible at grading stuff, except perhaps if you give them a very detailed evaluation grid.