cubefox | 3 days ago
Apart from backpropagation, the biggest improvements were probably changes in network architecture. Standard feed-forward MLPs are fairly inefficient; then there were architectures like CNNs, LSTMs, and Transformers. There were also improvements in activation functions and in the gradient descent method (AdamW), but I'm not sure whether these had an impact as substantial as that of CNNs or Transformers. Another factor was training on GPUs.
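
As an aside, the AdamW part is easy to make concrete: what distinguishes it from plain Adam is decoupled weight decay (per Loshchilov & Hutter). A rough sketch of one update step, with variable names of my own choosing:

    import numpy as np

    def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=1e-2):
        """One AdamW parameter update (t starts at 1)."""
        m = beta1 * m + (1 - beta1) * grad      # first moment (mean of grads)
        v = beta2 * v + (1 - beta2) * grad**2   # second moment (uncentered variance)
        m_hat = m / (1 - beta1**t)              # bias correction
        v_hat = v / (1 - beta2**t)
        # Weight decay is applied directly to the parameters ("decoupled"),
        # not folded into the gradient that feeds the moment estimates.
        theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
        return theta, m, v

The key difference is that weight_decay * theta sits outside the adaptive scaling; in Adam-with-L2-regularization it would be added to grad instead, which changes how strongly large-gradient parameters get decayed.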