Remix clone Hacker News

	Workaccount2 7 months ago
		This is a pretty interesting benchmark because it seems to break the common ordering we see with all the other benchmarks.
	_peregrine_ 7 months ago
		Yeah, I mean, SQL is pretty nuanced. One of the things we want to improve in the benchmark is how we measure "success": multiple correct SQL queries can look structurally dissimilar while semantically answering the same prompt. There are some interesting takeaways we learned after the first round: https://www.tinybird.co/blog-posts/we-graded-19-llms-on-sql-...
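		To illustrate the "structurally dissimilar but semantically equivalent" point, here is a minimal sketch of one common way to grade generated SQL: run both queries and compare the rows they return instead of comparing the query text. This is not necessarily how the benchmark in the linked post grades answers. The `events` table, its data, and the `results_match` helper are hypothetical, and SQLite stands in for whatever engine the benchmark actually uses.

```python
import sqlite3
from collections import Counter


def results_match(conn, sql_a, sql_b, ordered=False):
    """Hypothetical grader: compare two queries by the rows they return, not their text."""
    rows_a = conn.execute(sql_a).fetchall()
    rows_b = conn.execute(sql_b).fetchall()
    if ordered:
        # Use when the prompt asks for a specific ordering (e.g. ORDER BY)
        return rows_a == rows_b
    # Otherwise treat results as multisets so row order does not matter
    return Counter(rows_a) == Counter(rows_b)


# Hypothetical sample data in an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, kind TEXT);
INSERT INTO events VALUES (1, 'click'), (1, 'view'), (2, 'click'), (3, 'view'), (2, 'click');
""")

# Two structurally different queries answering "which users clicked?"
q1 = "SELECT DISTINCT user_id FROM events WHERE kind = 'click'"
q2 = "SELECT user_id FROM events GROUP BY user_id HAVING SUM(kind = 'click') > 0"

print(q1 == q2)                       # False: the query text differs
print(results_match(conn, q1, q2))    # True: both return users 1 and 2
```

		Matching results on a single dataset does not prove two queries are equivalent in general, since they may diverge on other data. That is one reason grading "success" for SQL stays tricky.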