lumost 3 days ago

I am extremely skeptical of a 27M-parameter model being trained “from scratch” on 1000 datapoints. I am likewise incredulous at the lack of comparison with any other model trained “from scratch” using their data preparation. Instead they compare strictly with third-party LLMs, which are massively more general-purpose and may not have any of those 1000 examples in their training set.

This smells like some kind of overfit to me.

cs702 3 days ago | parent [-]

Yeah, the results look incredible indeed. That's why I and many others here have decided to download, review, and test the code published by the authors.[a] If their code doesn't live up to their claims, we will all ignore their work and move on. If their code lives up to their claims, no one can argue with it. In my experience, when authors publish working code, it's usually a good sign.

---

[a] https://github.com/sapientinc/HRM

lumost 3 days ago | parent [-]

Did it work? :)

The architecture is very similar to offset LSTMs, which have been studied extensively. The main difference is the handover of the hidden state, which my naive mind would assume makes optimization substantially more difficult.
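To make the "handover" idea concrete, here is a toy sketch of a two-timescale recurrence in which a fast low-level module runs several steps and then hands its final hidden state to a slow high-level module. All names, dimensions, and update rules here are illustrative assumptions for the sake of the sketch, not the authors' actual architecture.

```python
# Toy two-timescale recurrence with hidden-state handover (assumed
# structure, not the HRM authors' implementation).
import numpy as np

rng = np.random.default_rng(0)
D = 8        # hidden size (assumed)
T_LOW = 4    # low-level steps per high-level step (assumed)
N_HIGH = 3   # number of high-level steps (assumed)

# Fixed random weights for two simple tanh recurrent cells.
W_low = rng.standard_normal((D, D)) * 0.1
W_high = rng.standard_normal((D, D)) * 0.1
W_in = rng.standard_normal((D, D)) * 0.1

def cell(W, h, x):
    """One tanh recurrence step: h' = tanh(W @ h + x)."""
    return np.tanh(W @ h + x)

def run(x):
    z_high = np.zeros(D)
    z_low = np.zeros(D)
    for _ in range(N_HIGH):
        # Fast module iterates several steps, conditioned on the
        # input and the current slow (high-level) state.
        for _ in range(T_LOW):
            z_low = cell(W_low, z_low, W_in @ x + z_high)
        # Handover: the slow module updates once, from the final
        # fast-module state.
        z_high = cell(W_high, z_high, z_low)
    return z_high

out = run(rng.standard_normal(D))
print(out.shape)  # (8,)
```

The intuition behind the optimization worry: gradients for the high-level weights only flow through the handover points, so the effective path length between updates is much longer than in a plain stacked RNN.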

cs702 3 days ago | parent [-]

I haven't had a chance to read the preprint carefully or play with the code yet. The best place to follow what's happening is the GitHub repo, specifically the open and closed issues and pull requests.

lumost 2 days ago | parent [-]

I'll wait until some more benchmarks are run in this case. Unlike traditional software, vetting that a model architecture works better than the alternatives is a time- and compute-intensive process. You really can't just download it and "try it out" outside of general-purpose models (which this is not).