andrewdea 3 hours ago:
The hardwired model is Llama 3.1 8B, a lightweight model from two years ago. Unlike newer models, it doesn't use "reasoning": the time between question and answer is spent purely predicting the next tokens. It doesn't run faster because it spends less time "thinking"; it runs faster because its weights are hardwired into the chip rather than loaded from memory. A larger model on a larger hardwired chip would run about as fast and give far more accurate results. That's what this proof of concept shows.
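The memory-bandwidth argument above can be made concrete with a rough back-of-envelope sketch. The numbers below (fp16 weights, ~1 TB/s of HBM bandwidth) are illustrative assumptions, not figures from the thread:

```python
# Back-of-envelope: decode speed for a memory-bandwidth-bound LLM.
# In conventional inference, each generated token requires streaming
# all the weights from memory once, so the throughput ceiling is
# roughly bandwidth / model size.

def tokens_per_second(param_count: float, bytes_per_param: float,
                      mem_bandwidth_gbps: float) -> float:
    """Upper bound on decode tokens/s when weight traffic dominates."""
    model_bytes = param_count * bytes_per_param
    return mem_bandwidth_gbps * 1e9 / model_bytes

# Assumed example: Llama 3.1 8B in fp16 on ~1 TB/s of HBM.
ceiling = tokens_per_second(8e9, 2, 1000)
print(round(ceiling))  # ~62 tokens/s ceiling from bandwidth alone
```

Hardwiring the weights into the silicon removes that per-token weight traffic entirely, so this ceiling no longer applies; the chip isn't "thinking less," it has simply eliminated the dominant bottleneck.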
Etheryte 3 hours ago (parent):
I see, that's very cool. That's the context I was missing, thanks a lot for explaining.