At a certain rate we will be able to move towards continuous / real-time inference systems. The discrete, turn-based solutions are quite confining in how they must be trained. Continuous and real-time would fundamentally alter the domain.

From an information theory perspective we are still in dial-up territory with regard to the actual information rate. 750 tokens per second would be a really bad dial-up connection. Imagine 10 million tokens per second.
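
A rough way to put numbers on the dial-up comparison is to convert a token rate into an upper bound on bits per second. This is only a sketch: it assumes every token is drawn uniformly from the vocabulary (real text carries fewer bits per token), and the `vocab_size` of 100,000 is an assumed round number, not taken from any particular model.

```python
import math


def token_bitrate(tokens_per_second: float, vocab_size: int = 100_000) -> float:
    """Upper bound on the information rate in bits per second.

    Assumes each token carries log2(vocab_size) bits, i.e. a uniform
    distribution over the vocabulary. vocab_size is an assumed value.
    """
    if tokens_per_second < 0 or vocab_size < 2:
        raise ValueError("need a non-negative rate and a vocabulary of at least 2")
    return tokens_per_second * math.log2(vocab_size)


def as_kbps(bits_per_second: float) -> float:
    """Convert bits per second to kilobits per second (1 kbps = 1000 bps)."""
    return bits_per_second / 1000.0
```

Under these assumptions `as_kbps(token_bitrate(750))` comes out at about 12.5 kbps, below a 14.4k modem, while 10 million tokens per second lands in the hundreds of Mbps.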

nyrikki a day ago | parent | next [-]

We still have the problem that autoregressive decoders are memory-bound.

The new Blackwell hardware combined with TensorRT-LLM and speculative decoding can consistently hit the 1,000 TPS/user barrier, compared to closer to ~250 TPS/user (out of 10k+ TPS on the server).

Is there something I missed? This looks more like a 14.4k to 56k modem story on a 64 kbps backing channel. I have no doubt that there are still massive gains to be found, but they seem to be using existing constraints more efficiently, not that FiOS is coming.

I don’t have the budget to work at foundation-model scale, but with a draft model 10x–20x faster than the target and a 60-80% acceptance rate I can see how they could promise 750 TPS (with a lot of other hard work). I would appreciate pointers on where I should look to figure out what I am missing.
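
The back-of-the-envelope reasoning above can be sketched with the usual simplified model of speculative decoding. This is an illustration, not a description of any vendor's stack: it assumes each drafted token is accepted independently with probability `alpha`, that a draft forward pass costs a fixed fraction `cost_ratio` of a target pass, and it ignores batching and memory-bandwidth effects. All parameter names are made up for this sketch.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.

    alpha:      per-token acceptance probability (assumed i.i.d.)
    gamma:      number of tokens the draft model proposes per step
    cost_ratio: draft forward time / target forward time (e.g. 0.05 for 20x faster)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1 or cost_ratio < 0.0:
        raise ValueError("gamma must be >= 1 and cost_ratio >= 0")
    # Expected tokens emitted per target pass: 1 + alpha + ... + alpha**gamma.
    tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Each step costs gamma draft passes plus one target pass.
    cost_per_step = gamma * cost_ratio + 1.0
    return tokens_per_step / cost_per_step


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma that maximizes the modeled speedup."""
    return max(range(1, max_gamma + 1),
               key=lambda g: expected_speedup(alpha, g, cost_ratio))
```

With an 80% acceptance rate, a draft model 20x faster than the target (`cost_ratio=0.05`) and `gamma=5`, this model gives roughly a 2.95x speedup, which would take a 250 TPS/user baseline to about 740 TPS/user.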

	rsalus a day ago | parent | next [-]
		agree, from my POV the constraints are still there but we've optimized now. still haven't solved the core problems.
	kolinko 18 hours ago | parent | prev [-]
		1000TPS - what model size?

mikepurvis a day ago | parent | prev | next [-]

Is there anyone exploring or writing about this in public? I've felt for a while that the turn-based model was not quite right, but also felt too stupid and ill-informed to have much of an opinion about what else it could be.

	chorsestudios a day ago | parent | next [-]
		Thinking Machines, the startup founded by former OpenAI CTO Mira Murati. The interaction models demoed in their videos imo break the awkward turn-based barrier. Returning responses quickly reaches a threshold where it starts to feel like a natural conversation. Their approach to solving this problem is rather clever.
	b112 a day ago | parent | prev [-]
		I have an active 'sleep' mode, where when the user is AFK the LLM goes into a loop with a sleep 10 between turns, and determines (via tool use) if something should be done. That's still a 'turn' in a way, but it's all the LLM just sort of sitting around like a human would, pondering what to do next. But I could imagine, after each space (e.g., word), having a 27b model on a nice rig, with thinking off, doing a quick look at the sentence and determining if it should interrupt and start a real turn with thinking on. Which kind of is non-turn-based in a way. If you're typing fast, it might hit that run every 3 or 4 words, but that's sort of how a human might be when a person is talking to them. That is, waiting for enough info to interrupt, if needed. There might be a way to process chunks of a sentence using commas as break points, e.g. for comma-delimited phrases in sentences, so the whole sentence doesn't need to be re-processed at each "should I break in" assessment at a word break. Could be fascinating. Could actually do some of this right now. I don't think this is what the parent poster was thinking, but the idea even at this level seems fun.
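
A minimal sketch of the word-break idea in the comment above, under stated assumptions: `should_interrupt` is a hypothetical stand-in for the small "thinking off" model, and finished comma-delimited phrases are simply set aside in a list so a real implementation could cache whatever it computed for them instead of re-reading the whole sentence.

```python
from typing import Callable, List, Tuple

# Hypothetical signature for the cheap check: it receives the already
# finished comma-delimited phrases and the phrase still being typed.
InterruptCheck = Callable[[List[str], str], bool]


def watch_typing(text: str, should_interrupt: InterruptCheck) -> Tuple[int, int]:
    """Replay `text` one character at a time, checking at each word break.

    Returns (index of the character where the check fired, or -1 if it
    never fired, number of checks performed).
    """
    closed_phrases: List[str] = []  # phrases ended by a comma; not re-read
    current = ""                    # phrase still being typed
    checks = 0
    for i, ch in enumerate(text):
        if ch == ",":
            # A comma closes the phrase; later checks only see it as history.
            closed_phrases.append(current.strip())
            current = ""
            continue
        current += ch
        if ch == " " and current.strip():
            checks += 1
            if should_interrupt(closed_phrases, current.strip()):
                return i, checks
    return -1, checks


def keyword_check(closed: List[str], current: str) -> bool:
    """Toy stand-in for the small model: fire when the user types 'stop'."""
    return "stop" in current.split()
```

Running `watch_typing("deploy to staging, then wait stop the rollout now", keyword_check)` fires at the space right after "stop". A real version would swap `keyword_check` for a model call and start a full turn when it returns true.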

dennisy a day ago | parent | prev | next [-]

That would be interesting.

Do you feel most of the speed upgrade will come from the software or hardware side?

dyauspitr a day ago | parent | prev | next [-]

And more importantly those 10 million tokens/s should cost fractions of a penny. Tokens need to be dirt cheap so I hope they build out massive solar+battery powered data centers asap.

	pylotlight a day ago | parent [-]
		No, anything but wasteful, weak, expensive, environmentally harmful solar. Nuclear is the only path forward for superior energy production, at least until we figure out fusion.

b112 a day ago | parent | prev | next [-]

Your comment made me think of another kind of real time: real-time, dynamic code/APIs.

Imagine a world where there is no code, just things mildly handshaking and then creating data APIs on the fly. Where communication is fuzzy and locked in on an individual basis. No years of RFCs, no RFCs at all, just... data.

Just data, man.

An API arbitration aberratically assigned at authorized access, abridged and annotated, analytically assuring absolute assurance.

	kevindamm a day ago | parent | next [-]
		Why remove the code and binary artifacts, though? Don't you want to verify that the business logic is accurate and the processing is deterministic? In some circumstances there is no substitute for something that you know will produce the same answer for a given input, consistently. And that's before even considering the watts per response.
	Gareth321 a day ago | parent | prev | next [-]
		It's very easy to see how world changing this technology will be. In a few years these AIs are going to be negotiating how they communicate with each other. Humans won't necessarily be included in that negotiation unless we have some kind of specific reason to. So many communication layers are going to be opaque to humans. We just have to trust our AIs are communicating efficiently and safely.
	rdedev a day ago | parent | prev | next [-]
		I'm pretty sure the LLM will get fed up and start writing an RPC. Also: "An API arbitration aberratically assigned at authorized access, abridged and annotated, analytically assuring absolute assurance." Cool that you wrote all the words starting with "a", but I don't understand what you mean.
	pjmorris a day ago | parent | prev | next [-]
		What this made me think of is life before computers, where people mildly handshake and create agreements on the fly. "Where communication is fuzzy and locked in on an individual basis." TBH, to me, this imagined future looks a lot like it'd have all the problems we already have.
	alehlopeh a day ago | parent | prev | next [-]
		I made this https://github.com/alehlopeh/hallu
	rubslopes a day ago | parent | prev | next [-]
		Wow. Sci-fi stuff!
	dyauspitr a day ago | parent | prev [-]
		I’ve thought about this before. No flaky config files, no updating endpoints, no status monitors. Just fuzzy everything that works almost all of the time.

ai_fry_ur_brain a day ago | parent | prev [-]

Ahh yes slop at the speed of light, how useful!

	zbaby a day ago | parent [-]
		AI is improving and seems to be reaching the point of not being slop (I am talking about flagship models).