Remix clone Hacker News

	▲	eob 2 hours ago
		Do you have any suspicion about what is different between the backends? That's an absolutely bonkers statistic: it would mean spurious differences in hosting container overwhelm the performance differences between models.
	▲	zambelli an hour ago
		I genuinely don't, sadly. I'm a mathematician originally, evolved organically into ML then AI - but I was never really a SWE. I feel like there's some backend decoding or chat template thing going on at a much lower level than what I'm best at. Maybe it's injecting headers or something that eventually compounds into model confusion? I really have no idea. I really hope folks better than me at backend stuff take a look and dive into it, though, because it's definitely under-reported and super consistent across model families and across backends: ollama, llama.cpp native, prompt, llamafile, and even vLLM, which I didn't formally benchmark in the repo.