▲ | bigdict 2 days ago | |
Sure, you can get better model performance by throwing more compute at the problem in different places. But does it improve perf on an isoFLOP basis? | ||
▲ | Reubend 2 days ago | parent | next [-] | |
It's a valid criticism that this method would increase compute requirements, but sometimes an improvement in the end result justifies the compute needed. For things like code generation in large datasets, many people would be willing to "pay" with more compute if the results were better. And this doesn't seem to require more memory bandwidth, so it could be particularly good for local models. | ||
▲ | fabmilo 2 days ago | parent | prev | next [-] | |
I read the paper and the results don't really convince me that's the case. But the problem still remains of how to use information from different parts of the model without squishing it into a single value with the softmax. | ||
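(To make the "squishing" concrete: in standard attention, the softmax turns the scores for all keys into one set of convex weights, and each value vector's contribution survives only through that single weighted average. A minimal sketch, with made-up scores and values:)

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: weights are positive and sum to 1.
    e = np.exp(x - x.max())
    return e / e.sum()

# Scores of one query against 4 keys (hypothetical numbers).
scores = np.array([2.0, 1.0, 0.5, -1.0])
weights = softmax(scores)

# Per-key value vectors.
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0],
                   [2.0, 2.0]])

# All per-key information collapses into a single convex combination;
# downstream layers only ever see this one averaged vector.
output = weights @ values
```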
▲ | eightysixfour 2 days ago | parent | prev | next [-] | |
That's... not always a given for SOTA-sized models. When the ROI on more training stops, it is nice to have alternatives, whether that is RL-tuned reasoning models or alternative architectures that improve specific areas of weakness. | ||
▲ | jwilber 2 days ago | parent | prev [-] | |
There’s no one-size-fits-all answer here, but in my experience, for long contexts, conv-based methods outperform strictly attention-based methods. See Evo2: “With the current implementation of Evo2, we do not have the heavily optimized kernels in place for convolution operators like we do for attention layers in a model like llama2. Even with this shortcoming, we see that the benefit from including more convolutional layers makes up for the earlier stage of optimization at around the 64k context length. Beyond that point we see an improvement in performance even compared to a highly optimized transformer model.” https://docs.nvidia.com/bionemo-framework/latest/models/evo2... |