marksully 5 hours ago

Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.

tatef 3 hours ago

I'm referencing it as being possible; however, I didn't share benchmarks because, candidly, the performance would be so slow that it would only be useful for very specific tasks over long time horizons. The more practical use cases are less flashy but capable of achieving multiple tokens/sec (i.e., smaller MoE models where not all experts need to be loaded in memory simultaneously).

causal 5 hours ago

Yeah, the title comes from nowhere in the link. No doubt it's possible, but all that matters is speed, and we learn nothing about that here...