smokel 4 days ago

Looking at the code, there is a lot of data augmentation going on there. For the Sudoku and ARC data sets, they augment every example by a factor of 1,000.

https://github.com/sapientinc/HRM/blob/main/dataset/build_ar...
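
Roughly what a 1,000x blow-up per example looks like, as a minimal sketch (illustrative only, not the repo's actual code; the exact transforms and counts in HRM may differ):

    import random
    import numpy as np

    def augment_arc_grid(grid, n_aug=1000):
        """Expand one ARC grid into ~n_aug derived examples:
        8 dihedral transforms x random colour permutations."""
        dihedral = [
            lambda g: g, np.rot90,
            lambda g: np.rot90(g, 2), lambda g: np.rot90(g, 3),
            np.fliplr, np.flipud,
            lambda g: np.rot90(np.fliplr(g)), lambda g: np.rot90(np.flipud(g)),
        ]
        out = []
        while len(out) < n_aug:
            t = random.choice(dihedral)
            perm = np.random.permutation(10)   # relabel the 10 ARC colours
            out.append(perm[t(grid)])
        return out

(For Sudoku the analogous moves are digit relabelling plus band/row/column permutations and transposition.)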

bubblyworld 4 days ago | parent [-]

That's fair, they are relabelling colours and rotating the boards. I meant more like mass generation of novel puzzles to try and train specific patterns. But you are right that technically there is some augmentation going on here, my bad.

smokel 4 days ago | parent | next [-]

Hm, I'm not so sure it's fair play for the Sudoku puzzle. Claiming that the AI understands the rules of the game from only 1,000 examples, while actually training on 1,000,000 derived examples, does not feel fair to me. Those extra examples leak a lot of information about the rules of the game.

I'm not too familiar with the ARC data set, so I can't comment on that.

bubblyworld 4 days ago | parent [-]

True, it leaks information about all the symmetries of the puzzle, but that's about it. I guess someone needs to test how much that actually helps - if I get the model running I'll give it a try!

westurner 3 days ago | parent | prev [-]

> That's fair, they are relabelling colours and rotating the boards.

Relabelling colours is photometric augmentation; rotating the boards is geometric augmentation.
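
A toy sketch of the distinction for images (numpy only, names illustrative):

    import numpy as np

    def photometric_aug(img):
        # change pixel values (brightness/contrast/colour), not positions
        scale, shift = np.random.uniform(0.8, 1.2), np.random.uniform(-10, 10)
        return np.clip(img.astype(np.float32) * scale + shift, 0, 255).astype(np.uint8)

    def geometric_aug(img):
        # change positions (rotation/flip), not pixel values
        img = np.rot90(img, np.random.randint(4))
        if np.random.rand() < 0.5:
            img = np.fliplr(img)
        return img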

> I meant more like mass generation of novel puzzles to try and train specific patterns.

What is the difference between synthetic data generation and self-play (like AlphaZero)? Don't self-play simulations generate synthetic training data, as opposed to real observations?
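
A toy illustration of the second question, using a made-up "race to 10" game where the only input is the rules and the episodes themselves become the synthetic training data:

    import random

    def self_play_episode(policy):
        """One self-play game; returns (state, action, outcome) training tuples."""
        state, history, player = 0, [], 0
        while state < 10:
            action = policy(state)            # add 1 or 2
            history.append((state, action, player))
            state += action
            player ^= 1
        winner = player ^ 1                   # the player who reached 10
        return [(s, a, 1 if p == winner else -1) for s, a, p in history]

    random_policy = lambda s: random.choice([1, 2])
    dataset = [ex for _ in range(1000) for ex in self_play_episode(random_policy)]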

bubblyworld 3 days ago | parent [-]

I don't know the jargon, but for me the main thing is the distinction between humans injecting additional bits of information into the training set vs the algorithm itself discovering those bits of information. So self-play is very interesting (it's automated as part of the algorithm) but stuff like generating tons of novel sudoku puzzles and adding them to the training set is less interesting (the information is being fed into the training set "out-of-band", so to speak).

In this case I was wrong: the authors are clearly adding bits of information themselves by augmenting the dataset with symmetries (I propose "symmetry augmentation" as a much more sensible phrase for this =P). Since the symmetric copies share a lot of mutual information with each other, I don't think this is nearly as much of a crutch as adding novel data points into the mix before training, but ideally no augmentation would be needed.

I guess you could argue that in some sense it's fair play - when humans are told the rules of sudoku the symmetry is implicit, but here the AI is only really "aware" of the gradient.

westurner 3 days ago | parent [-]

Symmetry augmentation sounds good for software.

Traditional ML computer vision (CV) research has perhaps been supplanted by multimodal LLMs trained on image analysis annotations. (CLIP, DALL-E, and Latent Diffusion were all published in 2021; the diffusion-based models build on Brownian-motion-style noise processes. More recent research: Brownian bridges, SDEs, Lévy processes. What are the foundational papers in video genAI?)

TOPS (tera-operations per second of compute) are now necessary.

I suspect that existing CV algos for feature extraction would also be useful for training LLMs. OpenCV, for example, has open algorithms like ORB (Oriented FAST and Rotated BRIEF), KAZE and AKAZE, and SIFT (patent-free since 2020). SIFT "is highly robust to rotation, scale, and illumination changes".
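
For example (assumes OpenCV >= 4.4, where SIFT lives in the main package; the image path is a placeholder):

    import cv2

    img = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

    orb = cv2.ORB_create(nfeatures=500)
    kp_orb, desc_orb = orb.detectAndCompute(img, None)       # desc_orb: (N, 32) uint8

    sift = cv2.SIFT_create()
    kp_sift, desc_sift = sift.detectAndCompute(img, None)    # desc_sift: (N, 128) float32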

But do existing CV feature extraction and transform algos produce useful training data for LLMs as-is?

Similarly, pairing code and tests with a feature transform at training time probably yields better solutions to SWE-bench.

Self-play algos are given the rules of the sim. Are self-play simulations already used as synthetic training data for LLMs and SLMs?

There are effectively rules for generating synthetic training data.

The orbits of the planets might be a good example of where synthetic training data is limited: perhaps we should rely on real observations at different scales, given the cost of experimentation and the need to confirm scale invariance.

Extrapolations from orbital observations and classical mechanics failed to predict the perihelion precession of Mercury (the first confirmation of general relativity).

Generating synthetic training data from orbital observations in which Mercury's 43 arcsecond-per-century deviation from Newtonian mechanics was disregarded as an outlier would result in a model overweighted by the existing biases in the real observations.

Tests of general relativity > Perihelion precession of Mercury https://en.wikipedia.org/wiki/Tests_of_general_relativity#Pe...
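
A back-of-the-envelope check of that 43 arcsec/century figure from the standard GR perihelion-advance formula (constants from memory, treat as approximate):

    import math

    # GR advance per orbit: dphi = 6*pi*G*M / (c^2 * a * (1 - e^2))
    GM_sun = 1.32712e20       # m^3/s^2
    c = 2.99792458e8          # m/s
    a = 5.7909e10             # Mercury semi-major axis, m
    e = 0.2056                # eccentricity
    period_days = 87.969

    dphi = 6 * math.pi * GM_sun / (c**2 * a * (1 - e**2))    # radians per orbit
    orbits_per_century = 36525 / period_days
    arcsec = dphi * orbits_per_century * (180 / math.pi) * 3600
    print(f"{arcsec:.1f} arcsec/century")                     # ~43, the Newtonian shortfall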

bubblyworld 3 days ago | parent [-]

Okay, haha, I'm not sure what we're doing here.

westurner 3 days ago | parent [-]

I have a list of questions for an AI or an expert, IDK which.