nsagent 2 days ago

You might be interested in reading about the minimum description length (MDL) principle [1]. Despite all the dissenters to your argument, what you're positing is quite similar to MDL. It's how you can fairly compare models (I did some research in this area for LLMs during my PhD).

Simply put, to compare models, you describe both the model and the training data using a code (usually reported as a number of bits). The trained model that represents the data in the fewest bits is the more powerful model.
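As a toy illustration of the two-part code idea (not the method from the paper below): total description length = bits to describe the model + bits to encode the data given the model. Here two Bernoulli models are compared on coin-flip data; the 10-bit parameter cost for the fitted model is an arbitrary assumption chosen for the sketch.

```python
import math

def data_bits(data, p):
    # Codelength of binary data under Bernoulli(p): negative log2 likelihood.
    return sum(-math.log2(p if x else 1 - p) for x in data)

def two_part_mdl(data, p, param_bits):
    # Two-part MDL: bits to transmit the model + bits for data given the model.
    return param_bits + data_bits(data, p)

data = [1] * 70 + [0] * 30  # 100 coin flips, 70 heads

# Model A: a fixed fair coin -- nothing to transmit, so 0 model bits.
mdl_fair = two_part_mdl(data, 0.5, param_bits=0)

# Model B: a fitted coin, with its parameter encoded at ~10 bits of
# precision (an assumed, arbitrary quantization for this sketch).
p_hat = sum(data) / len(data)
mdl_fit = two_part_mdl(data, p_hat, param_bits=10)

print(f"fair coin : {mdl_fair:.1f} bits")  # 100.0 bits
print(f"fitted    : {mdl_fit:.1f} bits")   # ~98.1 bits
```

The fitted model wins only because the ~12 bits it saves on the data outweigh the 10 bits spent describing its parameter; that trade-off is the core of MDL-based model comparison.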

This paper [2] from ICML 2021 shows a practical approach for attempting to estimate MDL for NLP models applied to text datasets.

[1]: http://www.modelselection.org/mdl/

[2]: https://proceedings.mlr.press/v139/perez21a.html