Remix clone Hacker News


	▲	iLoveOncall 4 days ago
		Unless they asked the same question multiple times and verified that the AI always gets the right answer, this is a very faulty result. Even looking at the reasoning, in the majority of cases you cannot prove that the LLM got it right because it actually found the right pattern rather than by a fluke.

		Here's an example of reasoning that reached the right answer but is not specific enough, and could therefore apply to literally any answer (model is Bing Copilot, picked randomly):

		> Option D : A shape resembling a clock. The clock shows the time 9:00. The pattern involves shifting times across rows and columns in a logical progression. Observing the sequence in the third row, where the first two clocks show times moving forward in increments, the next logical step is a clock displaying 9:00 to fit the established rhythm. This ensures symmetry and continuity within the overall grid.

		For comparison, here's "OpenAI o4 mini high", which gives a very specific answer and shows it got the logic of the puzzle right:

		> D
		> Each row adds +1:30, then +3:00.
		> - Row 1: 12:00 → 1:30 (+1:30), 1:30 → 4:30 (+3:00)
		> - Row 2: 3:00 → 4:30 (+1:30), 4:30 → 7:30 (+3:00)
		> - Row 3: 4:30 → 6:00 (+1:30), so 6:00 → 9:00 (+3:00)
		> (Down each column it’s +3:00 then +1:30, which also fits.)
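The arithmetic in the second quoted answer can be checked mechanically. Below is a minimal sketch, assuming the grid contains exactly the times listed in that answer and that the clocks are 12-hour dials; the helper names (`to_min`, `to_str`, `predict_missing`, `fits_pattern`) are made up for illustration and are not from the puzzle or the benchmark.

```python
CLOCK = 12 * 60  # minutes on a 12-hour dial (assumption: times wrap at 12:00)

def to_min(t):
    """Convert 'H:MM' to minutes past 12:00 on a 12-hour dial."""
    h, m = map(int, t.split(":"))
    return (h * 60 + m) % CLOCK

def to_str(mins):
    """Convert minutes past 12:00 back to 'H:MM'."""
    h, m = divmod(mins % CLOCK, 60)
    return f"{h or 12}:{m:02d}"

# Grid as described in the quoted answer; None marks the missing cell.
grid = [
    ["12:00", "1:30", "4:30"],
    ["3:00", "4:30", "7:30"],
    ["4:30", "6:00", None],
]

ROW_STEPS = (90, 180)   # +1:30, then +3:00 along each row
COL_STEPS = (180, 90)   # +3:00, then +1:30 down each column

def predict_missing(grid):
    """Apply the second row step to the last known cell of row 3."""
    return to_str(to_min(grid[2][1]) + ROW_STEPS[1])

def fits_pattern(full_grid):
    """Check every row and every column against the claimed steps."""
    mins = [[to_min(t) for t in row] for row in full_grid]
    cols = [list(col) for col in zip(*mins)]
    def ok(line, steps):
        return all((line[i + 1] - line[i]) % CLOCK == steps[i] for i in range(2))
    return all(ok(r, ROW_STEPS) for r in mins) and all(ok(c, COL_STEPS) for c in cols)

answer = predict_missing(grid)
filled = [row[:] for row in grid]
filled[2][2] = answer
print(answer, fits_pattern(filled))  # 9:00 True
```

A vague explanation like the first quote cannot be checked this way, which is the point: only the specific one commits to steps that either fit the grid or do not.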
	▲	gus_massa 4 days ago | parent
		That applies to humans too. If each question has 6 options, you can assume that everyone will get about 16.7% (1 in 6) for free by guessing, and compensate for that in the grading criteria.
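One common way to do that compensation is to rescale the raw score so that pure guessing maps to 0 and a perfect score maps to 1. The sketch below assumes 6 equally likely options per question and independent repeated attempts; both are simplifying assumptions, and the function names are hypothetical rather than part of any benchmark's actual grading.

```python
def chance_corrected(raw_accuracy, options):
    """Rescale accuracy so random guessing scores 0 and perfect scores 1."""
    chance = 1 / options
    return (raw_accuracy - chance) / (1 - chance)

def prob_all_correct_by_luck(options, repeats):
    """Chance of guessing the same question right on every one of `repeats`
    independent attempts (assumes uniform, independent guesses)."""
    return (1 / options) ** repeats

# A guesser's expected raw score on 6-option questions is 1/6.
print(chance_corrected(1 / 6, 6))
# A raw score of 50% is worth less once the free 1/6 is removed.
print(chance_corrected(0.5, 6))
# Asking the same question 3 times makes a lucky streak much less likely.
print(prob_all_correct_by_luck(6, 3))
```

This also ties back to the parent comment: repeating a question several times shrinks the probability that a run of correct answers is a fluke, under the independence assumption stated above.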