| ▲ | thesz 3 hours ago | |
You took the simple path: embedding a smaller model into a larger one. What if you need to reduce the number of layers and/or the width of the hidden layers? How will you embed a larger model into a smaller one? As for the "addition of same layers": would the process of selecting which layers to add be considered training? What if you still have to obtain the best result possible for a given coefficient/tokenization budget? I think that my comment expresses the general case, while yours provides some exceptions. | ||