cubefox 4 days ago
This raises the question of whether LLMs with an SSM architecture (e.g. Mamba) would perform differently from the Transformer models they tested, since SSMs do not use attention layers. Model architecture is already known to affect some tasks: in particular, SSMs are worse than Transformers at retrieving specific information from the context window [1], which reduces their performance on, e.g., multiple-choice benchmarks. That is a performance difference not reflected in their language modeling ability (perplexity).
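To make the retrieval point concrete, here is a toy numpy sketch (not Mamba itself, just the general shape of the computation; all names and constants are hypothetical): attention keeps every past token directly addressable, while an SSM-style recurrence folds the whole history into one fixed-size state, so exact lookup of a specific earlier token is inherently lossy.

```python
# Toy contrast between attention lookup and a fixed-size recurrent state.
# Hypothetical values; not an implementation of any specific model.
import numpy as np

rng = np.random.default_rng(0)
T, d = 512, 64                      # sequence length, hidden size
x = rng.standard_normal((T, d))     # token representations

# Attention: the query at the last position can score every earlier
# token directly; the whole history remains addressable.
q = x[-1]
scores = x @ q / np.sqrt(d)         # similarity to each past token
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attn_out = weights @ x              # can concentrate on ~one exact token

# SSM-style recurrence: the history is folded into ONE d-dimensional
# state, so information about any single early token gets compressed
# and overwritten as T grows.
decay = 0.95                        # toy scalar decay (real SSMs learn this)
state = np.zeros(d)
for t in range(T):
    state = decay * state + (1 - decay) * x[t]

# The state is a lossy summary: recovering token x[k] exactly from it is
# generally impossible, which is the intuition behind weaker in-context
# retrieval despite similar perplexity.
print(attn_out.shape, state.shape)  # both (64,), but very different content
```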