skybrian 6 hours ago

Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.

macNchz 5 hours ago | parent | next [-]

This is something I've been thinking about for a while...the current state of things really does feel like the dial-up era, and it makes me wonder what the "broadband" era could look like. Watching tokens stream in is reminiscent of watching a JPEG load a few rows of pixels at a time, or the various loading and connecting animations that applications implemented before things got fast enough to make them irrelevant.

Some of the work in that direction, like what Cerebras and Taalas have been doing, is an interesting glimpse of what might be possible. In the meantime, it's a fun thought experiment to wonder what might be possible if even current state-of-the-art models were available at, say, a million tokens per second at very low cost.

gavmor 3 hours ago | parent [-]

Take a look at https://chatjimmy.ai/ -- it's running against Taalas' "hardcore" silicon model, i.e. a dedicated, ASIC-like chip.

garciasn 5 hours ago | parent | prev | next [-]

You're right that it's reminiscent of the dial-up era, but I don't believe it's 300 baud to 1200; it's more like 4800:

Modem vs. Claude (per Claude), 2368 characters each:

300 baud - 1m 19s

1200 baud - 19.7s

2400 baud - 9.9s

14.4K - 1.6s

33.6K - 705 ms

56K - 447 ms

Claude - 7.9s
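The numbers above check out with the classic serial-line rule of thumb of 10 bits per character (8 data bits plus start and stop bits); a quick sketch, assuming that framing (the 56K figure in the list lines up with the ~53 kbps real-world cap on 56K modems rather than the nominal rate):

```python
# Rough sanity check of the modem-vs-Claude numbers above.
# Assumption: 10 bits per character (8 data + start + stop bits),
# the classic serial-line rule of thumb.
BITS_PER_CHAR = 10
CHARS = 2368

def transfer_seconds(bps: int, chars: int = CHARS) -> float:
    """Seconds to send `chars` characters at a given line rate in bits/sec."""
    return chars * BITS_PER_CHAR / bps

for bps in (300, 1200, 2400, 14_400, 33_600, 56_000):
    print(f"{bps:>6} bps: {transfer_seconds(bps):6.2f}s")

# Claude's effective rate, back-calculated from the observed 7.9s:
print(f"Claude: ~{CHARS / 7.9:.0f} chars/sec "
      f"(~{CHARS * BITS_PER_CHAR / 7.9:.0f} bps)")
```

Back-calculating from 7.9s gives roughly 300 chars/sec, or an effective ~3000 bps, which is why it lands between the 2400 and 4800 baud tiers.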

jeffhuys 5 hours ago | parent | prev | next [-]

Check chatjimmy.ai

lelandbatey 3 hours ago | parent [-]

https://chatjimmy.ai is a demo of the "burn the model into an ASIC" approach sold by Taalas[0], which they use to run Llama 3.1 8B at ~17000 tokens per second.
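To put ~17000 tokens per second in the thread's modem terms, here's a back-of-envelope conversion. The per-token character count and bit framing are assumptions (roughly 4 characters per token for English text, 10 bits per character on a serial line), not figures from Taalas:

```python
# Back-of-envelope: express ~17,000 tokens/sec as an effective modem rate.
# Assumptions (mine, not Taalas'): ~4 characters per token for English
# text, 10 bits per character as on a serial line.
TOKENS_PER_SEC = 17_000
CHARS_PER_TOKEN = 4
BITS_PER_CHAR = 10

chars_per_sec = TOKENS_PER_SEC * CHARS_PER_TOKEN   # 68,000 chars/sec
effective_bps = chars_per_sec * BITS_PER_CHAR      # 680,000 bps

print(f"~{chars_per_sec:,} chars/sec, ~{effective_bps / 1000:.0f} kbps")
print(f"2368 chars in ~{2368 / chars_per_sec * 1000:.0f} ms")
```

Under those assumptions that's on the order of 680 kbps, i.e. the 2368-character reply from the comparison upthread would land in about 35 ms.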

[0] - https://taalas.com/products/

MagicMoonlight 5 hours ago | parent | prev [-]

There was a startup posted here which built custom hardware that let the AI respond instantly. Thousands of tokens per second.

tln 3 hours ago | parent | next [-]

Taalas. A sibling comment of yours posted the chat demo URL -

https://chatjimmy.ai/

2ndorderthought 3 hours ago | parent [-]

Woah. How is this working? It's stupid fast.

Grosvenor 4 hours ago | parent | prev | next [-]

Cerebras.

They built a wafer-scale ASIC: the entire wafer is one huge active chip. It takes a lot of clever engineering (and cooling) to make it work, and is very cool.

zargon 5 hours ago | parent | prev [-]

Groq.

beavisringdin 4 hours ago | parent [-]

No, it was a custom ASIC with the weights baked in for a single model. I do envision a future where we return to cartridges: local AI becomes the default, and massively optimised chips are built to be plug-and-play, each running a single SoTA model.

SJMG 4 hours ago | parent | next [-]

Likely https://taalas.com

2 hours ago | parent | prev [-]
[deleted]