maherbeg 7 hours ago

This is so sick. I'm really curious to see what focused effort on optimizing a single open source model can look like over many months. Not only on the inference serving side, but also on the harness optimization side, and in building custom workflows to narrow the gap between what frontier models can infer and deduce and what open source models natively lack due to size, training, etc.

dakolli 6 hours ago | parent [-]

There will always be a huge gap between frontier models and open source models (unless you're very rich). This whole industry makes no sense; everyone is ignoring the unit economics. It costs $20k a month to run Kimi 2.6 at decent tok/s; to sell those tokens at a profit you'd need your hardware costs to be less than $1k a month.
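
Rough math behind that (the $20k figure is from above; the throughput, utilization, and single-stream framing are assumptions for illustration):

    # Back-of-envelope serving economics. The hardware cost is the figure above;
    # throughput, utilization, and the single-request-stream framing are assumptions.
    HARDWARE_COST_PER_MONTH = 20_000   # USD
    DECODE_TOK_PER_S = 30              # assumed sustained output rate for one stream
    UTILIZATION = 0.5                  # assumed fraction of the month spent serving

    seconds_per_month = 30 * 24 * 3600
    tokens_per_month = DECODE_TOK_PER_S * UTILIZATION * seconds_per_month

    cost_per_m_tokens = HARDWARE_COST_PER_MONTH / (tokens_per_month / 1e6)
    print(f"~${cost_per_m_tokens:.0f} per 1M output tokens")  # ~$514/M with these numbers
    # Caveat: real servers batch many concurrent requests, which multiplies
    # tokens_per_month and divides the per-token cost accordingly.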

Everyone who's betting their competency on the generosity of billionaires selling tokens at 1/10th-1/20th of cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.

bensyverson 6 hours ago | parent | next [-]

If you look at a graph of GPU power in consumer hardware and model capability per billion parameters over time, it seems inevitable that in the next few years a "good enough" model will run on entry-level hardware.

Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.

physicsguy 6 hours ago | parent | next [-]

It also massively changes the value economics of the frontier models. In a lot of cases, you really don't need a general-purpose intelligence model anyway.

bensyverson 5 hours ago | parent [-]

Exactly… as HN readers, we sometimes forget that a lot of people are using these tools to search for the best sunscreen or rewrite an email.

dakolli 5 hours ago | parent | prev [-]

No offense, this is a crazy delusional statement.

afro88 5 hours ago | parent | prev [-]

No offense, this is a crazy worthless contribution to the discussion.

Why?

dakolli 4 hours ago | parent [-]

Because everyone in these replies is in complete denial about the physical limits of memory and scaling in general. Y'all are literally living in an alternate reality where model capability increases with a decrease in size; it's simply not the case. There will be small focused models that perform well on very narrow tasks, yes, but you will not have "agents" capable of "building most things" running on consumer hardware until more capable (and affordable) consumer hardware exists.

bensyverson 4 hours ago | parent [-]

Ah, you haven't realized that consumer hardware gets more capable over time.

adrian_b 2 hours ago | parent [-]

Not this year, when many vendors either offer lower memory capacities or demand higher prices for their devices.

liuliu 6 hours ago | parent | prev | next [-]

I am not sure where this comment is coming from (possibly without looking at this project?). This project runs a quasi-frontier model at reasonable tps (~30) with reasonable prefill performance (~500 tps) on a high-end laptop. People are simply projecting what they see from this project onto what you can optimistically expect.

You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.
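
To put those numbers in perspective, here's roughly what they mean per request (the prompt and output sizes below are illustrative assumptions, not measurements from the project):

    # What ~500 tok/s prefill and ~30 tok/s decode feel like per request.
    PREFILL_TOK_PER_S = 500
    DECODE_TOK_PER_S = 30

    def turn_latency(prompt_tokens: int, output_tokens: int) -> float:
        """Seconds for one turn: ingest the prompt, then generate the reply."""
        return prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S

    print(f"{turn_latency(2_000, 500):.0f} s")     # short chat turn: ~21 s
    print(f"{turn_latency(16_000, 1_000):.0f} s")  # agent-sized context: ~65 s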

maherbeg an hour ago | parent | prev | next [-]

There will always be a gap, but what's interesting is that because new models are constantly coming out, we as an industry never spend any time extracting the maximal value out of an existing model. What if there are techniques and harness workflows that could be optimized for a singular model end to end? How far could that push the state of the art?

An example is https://blog.can.ac/2026/02/12/the-harness-problem/ for just improving edits.

Or, if we could really steer these open source models using well-structured plans, could we spend more time planning in a specific way and kick off the build overnight (à la the night shift: https://jamon.dev/night-shift)?
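
A minimal sketch of what that overnight loop might look like, assuming a local OpenAI-compatible server; the endpoint, model name, and plan file format are hypothetical, not taken from the linked posts:

    # Hypothetical "night shift" runner: walk a pre-written plan step by step
    # against a local model overnight. Endpoint, model name, and plan format
    # are assumptions for illustration.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

    with open("plan.json") as f:   # e.g. [{"step": "Write failing tests for X"}, ...]
        plan = json.load(f)

    history = [{"role": "system",
                "content": "Follow the plan one step at a time. Reply with unified diffs."}]
    for item in plan:
        history.append({"role": "user", "content": item["step"]})
        reply = client.chat.completions.create(model="local-model", messages=history)
        text = reply.choices[0].message.content
        history.append({"role": "assistant", "content": text})
        print(f"--- {item['step']} ---\n{text}\n")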

amunozo 5 hours ago | parent | prev | next [-]

Most tasks do not require frontier models, so as long as open models cover 95-99 percent of tasks, closed frontier models can be left for the niche and specialized cases that are harder.

dakolli 4 hours ago | parent [-]

Frontier models can hardly do the tasks I want them to; I simply cannot buy into this notion.

drob518 3 hours ago | parent [-]

For instance?

otabdeveloper4 6 hours ago | parent | prev [-]

> a delusional future where capable OS models fit on consumer grade hardware

48 GB is enough for a capable LLM.

Doing that on consumer-grade hardware is entirely possible. The bottleneck is CUDA and other intellectual property moats.
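
The weight-memory arithmetic behind that claim looks roughly like this (parameter counts and quantization levels are illustrative; KV cache and activations need extra headroom on top):

    # Does a model's weights fit in 48 GB? Weights only; KV cache and
    # activations add on top of this.
    def weight_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for params, bits in [(70, 4), (70, 8), (120, 4)]:
        print(f"{params}B @ {bits}-bit ≈ {weight_gb(params, bits):.0f} GB")
    # 70B @ 4-bit ≈ 35 GB, 70B @ 8-bit ≈ 70 GB, 120B @ 4-bit ≈ 60 GB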