2001zhaozhao 3 hours ago

Cerebras is serving GLM 4.6 at 1000 tokens/s right now. They're likely to upgrade to this model.

I really wonder if GLM 4.7 or models a few generations from now will be able to function effectively in simulated software dev org environments, especially whether they self-correct their errors well enough to build up useful code over time in such a simulated org, as opposed to ever-growing piles of technical debt. Possibly they would be managed by "bosses", which are agents running on the latest frontier models like Opus 4.5 or Gemini 3. I'm thinking in the direction of this article: https://www.anthropic.com/engineering/effective-harnesses-fo...

If the open source models get good enough, then the ability to run them at 1k tokens per second on Cerebras would be a massive advantage over any other models for running such a SWE org quickly.

allovertheworld 25 minutes ago | parent | next [-]

How cheap is GLM at Cerebras? I can't imagine why they can't tune the tokens/s to be lower but drastically reduce the power, and thus the cost of the API.

Zetaphor 9 minutes ago | parent [-]

They're running on custom ASICs as far as I understand, so it may not be possible to run them effectively at lower clock speeds. That, and/or the market for it doesn't exist in the volume required to be profitable. OpenAI has been aggressively slashing its token costs, not to mention all the free inference offerings you can take advantage of.

chrisfrantz 3 hours ago | parent | prev [-]

This is where I believe we are headed as well. Frontier models "curate" and provide guardrails, while very fast and competent agents do the work at incredibly high throughput. Once frontier cracks the "taste" barrier and context is wide enough, even this level of delivery + intelligence will be sufficient to implement the work.