Havoc 3 days ago

> When you’re looking at a pre-training dataset in the frontier lab and you look at a random internet document, it’s total garbage. I don't even know how this works at all. It’s [stuff] like stock tickers, symbols, it's a huge amount of slop and garbage from like all the corners of the internet

Seems like there would be low-hanging fruit in heavier pre-processing then? Something deterministic like a reading-level score, or even a tiny model trained for the task to pick out good data?
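For illustration, a minimal sketch of the deterministic option: a hand-rolled Flesch reading-ease filter. The syllable heuristic is crude and the thresholds are invented, not tuned on any real pre-training corpus:

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Approximate Flesch reading-ease score; higher means easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    # Rough syllable count: runs of vowels per word, at least one per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

def keep_document(text: str, min_words: int = 50, band: tuple = (30.0, 90.0)) -> bool:
    """Drop tiny fragments and texts whose score falls outside a 'written prose' band."""
    if len(text.split()) < min_words:
        return False
    return band[0] <= flesch_reading_ease(text) <= band[1]

# Ticker-style noise scores far outside the band and gets dropped; ordinary prose is kept.
print(keep_document("AAPL 172.3 +0.8% MSFT 411.2 -0.2% " * 20))          # False
print(keep_document(("The history of physics is, among other things, a "
                     "history of how people learned to describe motion "
                     "precisely. ") * 10))                               # True
```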

qrios 3 days ago | parent | next [-]

"low hanging" is relative. At least from my perspective. A significant part of my work involves cleaning up structured and unstructured data.

An example: more than ten years ago, a friend of mine was fascinated by the German edition of the book "A Cultural History of Physics" by Károly Simonyi. He scanned the book (600+ pages) and created a PDF with (nearly) the same layout.

Against my advice, he used Adobe tools for it instead of creating an EPUB or something like DocBook.

The PDF looks great, but the text inside is impossible to use as training data for a small LLM. The lines from the two columns are mixed together and a lot of spaces are randomly placed, which makes it particularly difficult because mathematical formulas often appear in the running text.

After many attempts (with regexes and LLMs), I gave up, rendered each page as an image, and had a large LLM extract the text.
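For anyone stuck with a similar PDF, a rough sketch of that render-then-extract fallback, assuming PyMuPDF for rasterisation; `extract_text_with_llm` is a placeholder for whichever vision-capable model you actually call:

```python
import pathlib
import fitz  # PyMuPDF (pip install pymupdf)

def extract_text_with_llm(png_path: pathlib.Path) -> str:
    # Placeholder: send the page image to a vision LLM with a prompt like
    # "Transcribe this two-column page as plain text, preserving the formulas."
    raise NotImplementedError

def pdf_to_text(pdf_path: str, out_dir: str = "pages") -> list[str]:
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    pages = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=200)           # rasterise the page
            png = out / f"page_{i:04d}.png"
            pix.save(str(png))
            pages.append(extract_text_with_llm(png))
    return pages
```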

azath92 3 days ago | parent | prev | next [-]

For small models this is for sure the way forward; there are some great small datasets out there (check out the TinyStories dataset, which limits vocabulary to what a young child would know but keeps the core reasoning inherent in even simple language: https://huggingface.co/datasets/roneneldan/TinyStories https://arxiv.org/abs/2305.07759).

I have fewer concrete examples, but my understanding is that dataset curation is where many of the gains come from at any model size. Unless you are building a frontier model, you can use a better model to help curate or generate that dataset. TinyStories was generated with GPT-4, for example.
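As a pointer for anyone wanting to poke at it, TinyStories is easy to pull down with the Hugging Face `datasets` library; streaming just avoids downloading the whole dataset before you can look at it:

```python
from datasets import load_dataset  # pip install datasets

# Stream the training split and peek at a few stories.
stories = load_dataset("roneneldan/TinyStories", split="train", streaming=True)
for i, example in enumerate(stories):
    print(example["text"][:200])  # each record is one short, simple-vocabulary story
    if i == 2:
        break
```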

gpjt 3 days ago | parent [-]

OP here: one thing that surprised me in this experiment was that the model trained on the more curated FineWeb-Edu dataset was worse than the one trained on FineWeb. That is very counterintuitive to me.

embedding-shape 3 days ago | parent | prev | next [-]

Makes me wonder what kind of model we could get if we just trained on Wikidata and similar datasets, but pre-processed to be natural language rather than just triplets of data.
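A template-based verbaliser would be the crude first pass for that. A toy sketch below; the Wikidata property IDs are real, but the templates and example triples are illustrative, not pulled from an actual dump:

```python
# Turn (subject, property, object) triples into plain sentences via templates.
TEMPLATES = {
    "P106": "{s} worked as a {o}.",   # occupation
    "P50":  "{o} wrote {s}.",         # author
    "P19":  "{s} was born in {o}.",   # place of birth
}

def verbalise(subject: str, prop: str, obj: str) -> str:
    template = TEMPLATES.get(prop, "{s} has property {p} with value {o}.")
    return template.format(s=subject, p=prop, o=obj)

triples = [
    ("Károly Simonyi", "P106", "physicist"),
    ("A Cultural History of Physics", "P50", "Károly Simonyi"),
]
print(" ".join(verbalise(*t) for t in triples))
# -> Károly Simonyi worked as a physicist. Károly Simonyi wrote A Cultural History of Physics.
```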

haolez 3 days ago | parent | prev | next [-]

If you can create this filtering model, you have created Skynet and solved AGI :D

ACCount37 3 days ago | parent | prev [-]

Data filtering. Dataset curation. Curriculum learning. All already in use.

It's not sexy, it's not a breakthrough, but it does help.

Havoc 3 days ago | parent | next [-]

> All already in use.

At the big labs that makes sense. I'm a bit more puzzled about why it isn't used in the toy projects. Certainly more complexity, but it seems like it would make a big difference.

famouswaffles 2 days ago | parent | prev [-]

Curriculum learning is not really a thing for these large SOTA LLM training runs (specifically pre-training). We know it would help, but ordering trillions of tokens of data in this way would be a herculean task.

ACCount37 2 days ago | parent [-]

I've heard things about pre-training optimization, "soft start" and such. So I struggle to believe that curriculum learning isn't a thing in any frontier run.

Sure, it's a lot of data to sift through, and the time and cost to do so can be substantial. But if you are already planning on funneling all of that through a 1T LLM? You might as well pass the fragments through a small classifier before you do that.
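For concreteness, a toy sketch of that "small classifier in front of the big run" idea, here a TF-IDF + logistic-regression scorer. The labelled examples are invented; a real pipeline trains on far more data, e.g. quality labels produced by a stronger LLM, which is roughly how the FineWeb-Edu filter was built:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: 1 = worth pre-training on, 0 = slop.
good = [
    "Energy is conserved in an isolated system, a fact first stated clearly in the 1840s.",
    "The function returns the index of the first element greater than the target.",
]
bad = [
    "AAPL 172.3 +0.8% MSFT 411.2 -0.2% TSLA 244.1 +1.4%",
    "click here click here free free free $$$ win now",
]

scorer = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
scorer.fit(good + bad, [1] * len(good) + [0] * len(bad))

def keep(fragment: str, threshold: float = 0.5) -> bool:
    """Cheap gate run over every fragment before it ever reaches the big model."""
    return scorer.predict_proba([fragment])[0][1] >= threshold

# With a training set this small the scores are only illustrative.
for frag in ["Momentum is the product of mass and velocity.",
             "DOGE +12% BUY BUY BUY free $$$ win"]:
    print(frag[:45], "->", round(scorer.predict_proba([frag])[0][1], 2))
```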