dTal · 7 hours ago
None of that is true, at least in theory. You can trivially increase layer width by adding extra columns initialized to zero, effectively embedding your smaller network in a larger one. You can add layers in a similar way, and in fact LLMs are surprisingly robust to having layers added and removed; you can sometimes actually improve performance simply by duplicating some middle layers [0]. Tokenization is probably the hardest part, but all the layers between the first and last just operate on embeddings; it's probably not impossible to retrain the first and last while preserving the middle parts.

[0] https://news.ycombinator.com/item?id=47431671, https://news.ycombinator.com/item?id=47322887
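A minimal NumPy sketch of the zero-initialized widening described above, using a toy two-layer MLP as a stand-in for a real network (all names and dimensions here are illustrative): new hidden units get zero incoming and outgoing weights, so with a ReLU they contribute nothing and the widened network computes exactly the same function.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # Two-layer MLP with a ReLU hidden layer.
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

rng = np.random.default_rng(0)
d, h_old, h_new, o = 4, 3, 6, 2  # input, old hidden, widened hidden, output dims
W1, b1 = rng.normal(size=(h_old, d)), rng.normal(size=h_old)
W2, b2 = rng.normal(size=(o, h_old)), rng.normal(size=o)

# Widen the hidden layer: extra rows of W1 (and entries of b1) are zero,
# so the new units always output relu(0) = 0; extra columns of W2 are zero,
# so those units contribute nothing downstream.
W1_big = np.vstack([W1, np.zeros((h_new - h_old, d))])
b1_big = np.concatenate([b1, np.zeros(h_new - h_old)])
W2_big = np.hstack([W2, np.zeros((o, h_new - h_old))])

x = rng.normal(size=d)
assert np.allclose(forward(x, W1, b1, W2, b2),
                   forward(x, W1_big, b1_big, W2_big, b2))
```

The same trick applies per weight matrix in a transformer block; the widened model is then a strict superset of the original and can be trained further from that point.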
thesz · 4 hours ago
You took the simple path, embedding a smaller network into a larger one. What if you need to reduce the number of layers and/or the width of the hidden layers? How would you embed the larger network into the smaller one? As for duplicating layers: wouldn't the process of selecting which layers to add itself count as training? And what if you still have to obtain the best possible result for a given coefficient/tokenization budget? I think my comment expresses the general case, while yours provides some exceptions.
andriy_koval · 4 hours ago
There is evidence it is useful in some cases, but obviously no evidence it is enough if you are trying to beat SOTA.