Remix clone Hacker News


	▲	jameshart 5 days ago
		I did see that. But since that really focused on how Claude handled that particular prompt format, it’s not clear whether the LLMs that scored low here were just failing to produce valid input, struggling to handle that specific prompt/output structure, or doing fine at basically operating the text adventure but struggling to build a world model and solve problems.
	▲	kqr 5 days ago
		Ah, I see what you mean. Yeah, there was too much output from too many models at once (combined with not enough spare time) to really perform useful qualitative analysis on all the models' performance.