canyon289 | 6 days ago
Not rude at all, and I'll again share what I can. We ran a bunch of experimental architectures at this size to get a sense of performance, in particular how well each one was able to adapt to datasets across some loss measures. The embedding size comes from a mix of "hard technical" data, like the loss measures I mentioned above, and, for this model, community considerations such as adaptability across input tokens and consistency with the Gemma ecosystem. At this size you're right that it's a bit funny the embedding is so large.

For more details, read the Gemma 3 technical report: https://arxiv.org/pdf/2503.19786. It doesn't cover the 270M model, since it was written for the 1B to 27B Gemma 3 release, but it'll answer some of your questions. As for 270M, we may share more information in the future; up until now we were just focused on getting the model out there.
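To make the "embedding is so large" point concrete, here's a rough back-of-the-envelope sketch; the ~256k vocabulary, 640 hidden dim, and ~100M non-embedding parameter figures are assumptions pulled from the published Gemma 3 configs, not official numbers for this model:

    # Rough parameter-count sketch for a small Gemma-style model.
    # Assumed values (not official): ~256k-token vocabulary, 640 hidden dim,
    # and ~100M parameters in the transformer blocks themselves.
    vocab_size = 262_144          # assumed shared Gemma tokenizer vocab (~256k)
    hidden_dim = 640              # assumed embedding width for the ~270M model
    transformer_params = 100e6    # assumed non-embedding parameter budget

    embedding_params = vocab_size * hidden_dim   # one vector per vocab token
    total_params = embedding_params + transformer_params

    print(f"embedding params:   {embedding_params / 1e6:.0f}M")   # ~168M
    print(f"transformer params: {transformer_params / 1e6:.0f}M") # ~100M
    print(f"embedding share:    {embedding_params / total_params:.0%}")  # ~63%

Under those assumptions, well over half of a ~270M-parameter model is just the embedding table, which is the trade-off being described: it looks oversized for a model this small, but it keeps tokenization consistent with the rest of the Gemma family.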