sigmoid10 4 days ago

Skepticism is an understatement. There are tons of issues with this paper. Why are they comparing results of their expert model, trained from scratch on a single task, to general-purpose reasoning models? It is well established in the literature that you can still beat general-purpose LLMs on narrow-domain tasks with specially trained, small models. The only comparison that would have made sense is one against vanilla transformers with the same number of parameters, trained on the same input-output dataset. But the paper shows no such comparison. In fact, I would be surprised if it were significantly better, because such architectural improvements are usually very modest or not generally applicable. And insinuating that this is some significant development towards general-purpose AI by throwing in ARC is just straight-up dishonest. I could probably cook up a neural net in PyTorch in a few minutes that solves some hand-crafted single task that o3 can't crack in an hour. That doesn't mean I made any progress towards AGI.

bubblyworld 4 days ago | parent [-]

Have you spent much time with the ARC-1 challenge? Their results on that are extremely compelling: close to the initial competition's SOTA (as of closing, anyway) with a tiny model and none of the hacks (data augmentation, pretraining, etc.) that all of the winning approaches leaned on heavily.

Your criticism makes sense for the maze solving and sudoku sets, of course, but I think it kinda misses the point (there are traditional algos that solve those just fine - it's more about the ability of neural nets to figure them out during training, and known issues with existing recurrent architectures).

Assuming this isn't fake news lol.

smokel 4 days ago | parent | next [-]

Looking at the code, there is a lot of data augmentation going on there. For the Sudoku and ARC data sets, they augment every example by a factor of 1,000.

https://github.com/sapientinc/HRM/blob/main/dataset/build_ar...
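Roughly the kind of thing going on, as a minimal sketch of dihedral-plus-colour-permutation augmentation for ARC-style grids (the names here are illustrative, not the repo's actual functions):

```python
# Sketch of ARC-style augmentation (illustrative only, not the repo's code):
# the same dihedral transform and colour permutation is applied to both grids
# of an example, multiplying each original example roughly 1,000-fold.
import numpy as np

def augment_pair(inp: np.ndarray, out: np.ndarray, rng: np.random.Generator):
    k = int(rng.integers(4))                 # random rotation (0-3 quarter turns)
    flip = rng.random() < 0.5                # optional reflection
    perm = rng.permutation(10)               # relabel the 10 ARC colours

    def transform(grid):
        g = np.rot90(grid, k=k)
        if flip:
            g = np.fliplr(g)
        return perm[g]

    return transform(inp), transform(out)

rng = np.random.default_rng(0)
inp = np.array([[0, 1], [2, 3]])
out = np.array([[3, 2], [1, 0]])
augmented = [augment_pair(inp, out, rng) for _ in range(1000)]
```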

bubblyworld 4 days ago | parent [-]

That's fair, they are relabelling colours and rotating the boards. I meant more like mass generation of novel puzzles to try and train specific patterns. But you are right that technically there is some augmentation going on here, my bad.

smokel 4 days ago | parent | next [-]

Hm, I'm not so sure it's fair play for the Sudoku puzzle. Suggesting that the AI will understand the rules of the game with only 1,000 examples, and then adding 1,000,000 derived examples does not feel fair to me. Those extra examples leak a lot of information about the rules of the game.
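For a sense of what "derived examples" can mean here, a sketch of validity-preserving Sudoku transforms (digit relabelling, shuffling rows within and across bands, transposing); this is illustrative, not necessarily the paper's exact code:

```python
# Sketch of validity-preserving Sudoku augmentations (not necessarily the
# paper's exact transforms). 0 marks an empty cell; in practice the same
# transform must be applied to a puzzle and to its solution.
import numpy as np

def augment_sudoku(grid: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    relabel = np.concatenate(([0], 1 + rng.permutation(9)))  # permute digits 1-9, keep 0
    out = relabel[grid]
    band_order = rng.permutation(3)                          # shuffle the three row-bands
    row_order = np.concatenate([b * 3 + rng.permutation(3) for b in band_order])
    out = out[row_order]                                     # shuffle rows within/across bands
    if rng.random() < 0.5:
        out = out.T                                          # rows <-> columns is also valid
    return out
```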

I'm not too familiar with the ARC data set, so I can't comment on that.

bubblyworld 4 days ago | parent [-]

True, it leaks information about all the symmetries of the puzzle, but that's about it. I guess someone needs to test how much that actually helps - if I get the model running I'll give it a try!

westurner 3 days ago | parent | prev [-]

> That's fair, they are relabelling colours and rotating the boards.

Photometric augmentation, Geometric augmentation

> I meant more like mass generation of novel puzzles to try and train specific patterns.

What is the difference between Synthetic Data Generation and Self Play (like AlphaZero)? Don't self-play simulations generate synthetic training data, as opposed to real observations?

bubblyworld 3 days ago | parent [-]

I don't know the jargon, but for me the main thing is the distinction between humans injecting additional bits of information into the training set vs the algorithm itself discovering those bits of information. So self-play is very interesting (it's automated as part of the algorithm) but stuff like generating tons of novel sudoku puzzles and adding them to the training set is less interesting (the information is being fed into the training set "out-of-band", so to speak).
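A toy way to put that distinction in code (placeholder names, not any real system's API): the generator in the first function is written by a human before training starts, while in the second the model produces its own data inside the loop.

```python
# Toy sketch of the distinction (placeholder objects, not any real API).
import random

def augmented_training(model, data, generate_extra_puzzles, steps):
    # "Out-of-band": a human-written generator injects extra information
    # into the dataset before training ever starts.
    data = list(data) + generate_extra_puzzles(n=1_000_000)
    for _ in range(steps):
        model.train_step(random.sample(data, k=32))

def self_play_training(model, environment, steps):
    # Self-play: the model produces its own training data inside the loop,
    # so any extra bits of information are discovered by the algorithm itself.
    for _ in range(steps):
        games = [environment.play(model, model) for _ in range(64)]
        model.train_step(games)
```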

In this case I was wrong, the authors are clearly adding bits of information themselves by augmenting the dataset with symmetries (I propose "symmetry augmentation" as a much more sensible phrase for this =P). Since symmetries share a lot of mutual information with each other, I don't think this is nearly as much of a crutch as adding novel data points into the mix before training, but ideally no augmentation would be needed.

I guess you could argue that in some sense it's fair play - when humans are told the rules of sudoku the symmetry is implicit, but here the AI is only really "aware" of the gradient.

westurner 3 days ago | parent [-]

Symmetry augmentation sounds good for software.

Traditional CV (computer vision) research in ML has perhaps been supplanted by multimodal LLMs trained on image-analysis annotations. (CLIP, DALL-E, and the Brownian-motion-based Latent Diffusion were all published in 2021. More recent research: Brownian bridges, SDEs, Lévy processes. What are the foundational papers in video genAI?)

TOPS are now necessary.

I suspect that existing CV feature-extraction algos would also be useful for training LLMs. OpenCV, for example, has open algorithms like ORB (Oriented FAST and Rotated BRIEF), KAZE and AKAZE, and, since its patent expired in 2020, SIFT. SIFT "is highly robust to rotation, scale, and illumination changes".

But do existing CV feature extraction and transform algos produce useful training data for LLMs as-is?
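For reference, extracting those features takes only a few lines with OpenCV's Python bindings (the image path below is just a placeholder); whether the resulting descriptors make good LLM training data as-is is exactly the open question:

```python
# Minimal ORB feature extraction with OpenCV (pip install opencv-python).
# "example.png" is a placeholder; any image works.
import cv2

img = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(img, None)

# Each keypoint has a position, scale, and orientation; ORB descriptors are
# 32-byte binary vectors, so `descriptors` has shape (n_keypoints, 32).
print(len(keypoints), None if descriptors is None else descriptors.shape)
```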

Similarly, pairing code and tests with a feature transform at training time probably yields better solutions to SWE-bench.

Self-play algos are given the rules of the sim. Are self-play simulations already used as synthetic training data for LLMs and SLMs?

There are effectively rules for generating synthetic training data.

The orbits of the planets might be a good example of where synthetic training data is limited, and where we should perhaps rely on real observations at different scales, given the cost of experimentation and confirmations of scale invariance.

Extrapolations from orbital observations and classical mechanics failed to predict the perihelion precession of Mercury (the first confirmation of GR, general relativity).

Generating synthetic training data from orbital observations in which Mercury's 43-arcsecond-per-century deviation from Newtonian mechanics was disregarded as an outlier would yield a model overweighted by the existing biases in the real observations.

Tests of general relativity > Perihelion precession of Mercury https://en.wikipedia.org/wiki/Tests_of_general_relativity#Pe...

bubblyworld 3 days ago | parent [-]

Okay, haha, I'm not sure what we're doing here.

westurner 3 days ago | parent [-]

I have a list of questions for an AI or an expert, IDK which

sigmoid10 4 days ago | parent | prev [-]

As the other commenter already pointed out, I'll believe it when I see it on the leaderboard. But even then it has already lost twice over against the winner of last year's competition, because that winner was a general-purpose LLM that could also do other things.

bubblyworld 4 days ago | parent [-]

Let's not move the goalposts here =) I don't think it's really fair to compare them directly like that. But I agree, this is triggering my "too good to be true" reflex very hard.

sigmoid10 4 days ago | parent [-]

If anything, they moved the goalpost closer to the starting line. I'm merely putting it back where it belongs.