Remix clone Hacker News

	▲	vibe42 2 days ago
		One thing worth benchmarking is whether LLMs solve complex problems better when the problem is described in one natural language rather than another. There's SWE-bench Multilingual, for example, which spans multiple programming languages, but translating a problem into multiple natural languages before passing it to the LLM has not been benchmarked afaik. If some residue of the input language is still present when the middle layers execute, that would in part support the Sapir-Whorf hypothesis.
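
		A rough sketch of what such a comparison might look like. `translate` and `solve` are hypothetical callables, not a real library API: they stand in for a translation step and an LLM call. Scoring is by exact match on the answer, which is a simplification next to the test-based scoring SWE-bench uses.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# (problem statement in English, expected answer); a toy format assumed here
Problem = Tuple[str, str]


def solve_rate_by_language(
    problems: Iterable[Problem],
    languages: List[str],
    translate: Callable[[str, str], str],  # hypothetical: (text, target_lang) -> text
    solve: Callable[[str], str],           # hypothetical: prompt -> model answer
) -> Dict[str, float]:
    """Return the fraction of problems solved for each natural language."""
    problems = list(problems)
    if not problems:
        raise ValueError("need at least one problem")

    correct = defaultdict(int)
    for statement, expected in problems:
        for lang in languages:
            # English is the source language, so it is passed through untranslated.
            prompt = statement if lang == "en" else translate(statement, lang)
            if solve(prompt).strip() == expected:
                correct[lang] += 1

    return {lang: correct[lang] / len(problems) for lang in languages}
```

		A gap in solve rate between languages would not separate the two possible causes on its own: translation quality and the model's uneven training data per language would both need to be controlled for before reading anything into the middle layers.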