dlenski 6 hours ago

A nice illustration of the homogeneity of LLM responses. Another way to describe this effect would be…

If you ask humans to write 1,000 books, you're asking 1,000 different humans with different experiences and different skills and different moods (etc.) to write those books.

But if you ask LLMs to write 1,000 books, you're probably only talking to 3 or 5 different models, tops. And they've all trained on the same or similar data, and are trained to respond in very similar ways.

The LLMs don't differ much in anything like "life experience" or "skills", and they don't really have anything like a "mood" independent of the prompts you've given them.

TrackerFF 2 hours ago | parent | next [-]

LLMs are great at producing average.

We see this with their GenAI music equivalents. All the music these GenAI models produce is exceptionally (aggressively, even) average.

It is the most polished average you'll ever find. Never awful (anymore), never fantastic. Just bang in the middle.

amelius 3 hours ago | parent | prev | next [-]

I don't think the comparison to humans works. It is as if you expect that we can easily train many different LLMs to solve the originality problem, but that is far from guaranteed.

Lerc 2 hours ago | parent | prev | next [-]

I wonder how much variation there would be if you got a single model to produce a couple of gigabytes of tiny children's stories.

Might be an interesting research project.

pxagntuvzt an hour ago | parent [-]

There is one already: https://arxiv.org/abs/2305.07759 https://huggingface.co/datasets/roneneldan/TinyStories

6.5GB of tiny stories, as requested. ;)

smusamashah 5 hours ago | parent | prev | next [-]

Reminds me of Pluribus.

bigbangcmbr 4 hours ago | parent [-]

Pluribus is kinda different. An LLM cannot wander too far from the average, even if it wanted to. In Pluribus, the 'others' work toward a common goal, each utilizing their own expertise, knowledge and experiences in a shared way. Each is unique. They can, if they want, perform as the host's individual before the joining. To put it another way, the others in Pluribus are convergent by choice; LLMs are convergent by design.

ekianjo 4 hours ago | parent | prev | next [-]

prompts will give very different results. this is where you do the work.

cryo32 4 hours ago | parent | next [-]

I disagree. The LLM outputs really do lack anything original or interesting. They just produce banal copy whatever you ask them.

A good editor could probably reduce all LLM outputs on a subject down to the same point.

Mikhail_Edoshin 2 hours ago | parent | prev | next [-]

A controller has to be at least as complex as what it is supposed to control.

roncesvalles 4 hours ago | parent | prev | next [-]

Yes but not very different results (unless you're adding new information to your prompt or reducing some ambiguity). Prompt engineering is mostly pseudoscience.

zarzavat 4 hours ago | parent [-]

What we need is steering so that we can have models with different personalities, not just different prompts (because context is subject to forgetting). But this will never happen with closed-weight models, and I'm not sure if it's even feasible at scale.

Yet another reason why the future is open weight.

hansmayer 3 hours ago | parent | prev [-]

[dead]

fragmede 5 hours ago | parent | prev | next [-]

That discounts how much the other context, i.e. the system prompt and any other context submitted to the model, can affect the output. If you ask a model for medical advice as a patient versus as a doctor, you will get different output from the same model.

throw310822 5 hours ago | parent | prev | next [-]

> you're asking 1,000 different humans with different experiences and different skills and different moods

Simply, if you ask an LLM, you're always asking the same mind, and always for the first time.

scotty79 5 hours ago | parent [-]

Also, since those are lazy, you are always asking in the same manner. How homogeneous were the prompts that generated those covers?

People are making cookies with cookie cutter number 5 and other people wonder how come they are all the same.

gmerc 5 hours ago | parent [-]

Classic self selection effect though - if you’re resorting to LLM writing you’re almost certainly skewing lazy enough to not even bother trying to add perturbations strong enough to make the response deviate from the uniformity of the slop.

NitpickLawyer 4 hours ago | parent | prev [-]

> A nice illustration of the homogeneity of LLM responses. [...] And they've all trained on the same or similar data, and are trained to respond in very similar ways.

I mostly agree, but this is a very simplified explanation. The models are indeed trained to respond in similar ways, for "basic" prompts. And that's as much a feature as it is a bug. In other words, the bug becomes apparent only if you give 100+ basic prompts. But giving it 100+ basic prompts and expecting originality is a silly endeavour. That's not how you get originality.

The way I'd go about generating 1000 books, while expecting different outcomes, is something along these lines (and nowadays you can ask your favorite LLM to wire up this workflow for you, with decent outcomes):

1. Ask for a list of 20 features that define a book (genre, style, number of characters, tropes, plot, continuity, relationships, etc.)

2. For each feature, ask for a list of 50 examples, ordered from most common to the most unique.

3. Randomly pick 10 features, and for each pick one of the 50 generated items. Ask for the rest of the features to match the theme.

4. Ask for 10 possible book outlines that match the chosen features, randomly pick between 2-8.

5. Create a detailed prompt that includes all the above features, and ask for a synopsis for each chapter, given the above outline chosen.

6. Given {features} and {outline} and {synopsis} write chapter 1.

7. for each chapter in list, given {...} and (optional) previous matching chapter(s), write chapter n+1

(optional 8.) given {...} and 2-3 consecutive chapters, align the ending / beginning of a new chapter for style / features / continuity, etc.

(optional 9.) given {...} and the whole book, list chapters / paragraphs that don't match the given {...} and provide a list of 5 improvements. (randomly choose 1 and ask for an edit).
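
A rough sketch of steps 1-4, in case it helps to see the shape of it. Everything here is an assumption for illustration: `ask` is a hypothetical callable standing in for whatever model client you use, `stub_llm` is a placeholder so the thing runs offline, and "randomly pick between 2-8" is read as "take one outline from positions 2 through 8", which may not be the only reading.

```python
import random
from typing import Callable

# Hypothetical interface: (prompt, number of items wanted) -> list of items.
# Swap in a real model client; nothing below depends on a specific API.
AskFn = Callable[[str, int], list[str]]


def stub_llm(prompt: str, n: int) -> list[str]:
    """Placeholder model: returns n distinct dummy items so the sketch runs offline."""
    return [f"item {i + 1} for <{prompt[:30]}>" for i in range(n)]


def plan_book(ask: AskFn, rng: random.Random) -> dict:
    # Step 1: 20 features that define a book.
    features = ask("List features that define a book", 20)
    # Step 2: 50 examples per feature, ordered most common to most unique.
    options = {f: ask(f"List examples of: {f}", 50) for f in features}
    # Step 3: randomly pin 10 features, one random option for each.
    pinned = {f: rng.choice(options[f]) for f in rng.sample(features, 10)}
    # Step 4: 10 candidate outlines. Assumption: keep one from positions 2-8
    # (list indices 1 through 7).
    outlines = ask(f"Outlines matching: {sorted(pinned.values())}", 10)
    outline = outlines[rng.randint(1, 7)]
    return {"pinned": pinned, "outline": outline}


plan = plan_book(stub_llm, random.Random(0))
print(len(plan["pinned"]), "pinned features")
print("outline:", plan["outline"])
```

Passing the `random.Random` instance in makes a given book plan reproducible from its seed, which is handy when you want to regenerate or compare runs. Steps 5-9 would be further `ask` calls that carry `plan` along in the prompt.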

----

Now, this probably won't give you something like cloud atlas, but they'll at least be different books. That's how I'd do it if I wanted to see how different they can write. Not 1000 "basic" prompts and expecting originality.

noduerme 3 hours ago | parent [-]

That whole thing would get you 1000 variants of existing art. But if you asked a thousand different designers to do a cover for the same book...

NitpickLawyer 3 hours ago | parent [-]

> 1000 variants of existing art.

This is very naive. I can almost guarantee that some combinations of 20 * 50 features will hit on something that has never been written before in that specific combination. And if that's still not enough, increase the number of features. Add more randomness, add more steering, add random steering in random chapters, change it up, and so on.
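
For a sense of scale, here is a quick count, assuming the sampling from step 3 of the workflow (10 of the 20 features pinned, one of 50 options for each). It only counts distinct feature combinations and says nothing about how different the resulting text would be.

```python
import math

n_features, n_options, n_pinned = 20, 50, 10

# Ways to choose which 10 of the 20 features get pinned.
feature_subsets = math.comb(n_features, n_pinned)
# Ways to pick one of the 50 options for each pinned feature.
option_choices = n_options ** n_pinned

total = feature_subsets * option_choices
print(f"{feature_subsets:,} subsets x {option_choices:.2e} choices = {total:.2e}")
```

That comes out to roughly 1.8e22 combinations, so 1000 random draws are very unlikely to repeat one.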

noduerme 3 hours ago | parent | next [-]

I'm an art director. Finding a sequence that hasn't been hit in that specific combination is not sufficient to justify paying someone $150 an hour to go be creative.

spwa4 3 hours ago | parent | prev [-]

> Add more randomness, add more steering, add random steering in random chapters, change it up, and so on.

That doesn't work for AI models. The whole training process depends on the basic principle that if you take the average of 100 examples (in this case book cover designs), the average is less like randomness than any individual cover that went into it.

So the output will, by necessity, be closer to the average.

The human learning algorithm is much, much more data efficient than models. An absolute top human expert will have read/seen/heard/talked/... about 160 million "tokens" (that's about 2000 books). Frankly, the nerve inputs of all experiences of an entire human life, from baby to rewriting relativity theory, are only a couple dozen gigabytes.

Qwen 3.6 27B has been trained on 8 trillion tokens (each seen ~10 to ~50 times), or to put it another way: for every second you will have spent "gathering life experiences" (i.e. your whole life) by the time you're on your deathbed, Qwen 3.6 27B has spent about 50,000 seconds learning. And really that figure should be multiplied by the 10 or 50 training iterations.
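
The arithmetic behind those ratios, using only the figures quoted above (160 million human tokens, about 2000 books, 8 trillion training tokens, 10 to 50 passes). The variable names are just for illustration.

```python
human_tokens = 160_000_000          # lifetime "tokens" for a top human expert
books = 2_000                       # the same amount expressed in books
model_tokens = 8_000_000_000_000    # training tokens quoted for the model

tokens_per_book = human_tokens // books
ratio = model_tokens // human_tokens   # model tokens per human token

print(f"{tokens_per_book:,} tokens per book")
print(f"{ratio:,}x more tokens than a human lifetime")
for passes in (10, 50):
    print(f"with {passes} passes: {ratio * passes:,}x")
```

So the 50,000 figure is a straight token ratio, and the 10 to 50 passes push it to somewhere between 500,000 and 2,500,000.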

Add another 3 or so orders of magnitude and you've got ChatGPT. By this measure, the human brain outperforms ridiculously overspecced ML models (because that's what ChatGPT and the like are) in efficiency by a factor of 5 million or more. This is the reason humans are still faster than ML models.

As for human training iterations: we can be simple: it's 1. In fact, it's impossible to make it even 2. Of course, when it comes to human performance: we are a better but not fundamentally different version of genetic algorithms. Do most humans perform? The honest answer is no. 1 in 1000, and that's very generous, improves SOTA. You absolutely need the 1000 failures though, as anyone who's tried a PhD (or even just designing a large program) knows.

So we are very far away from allowing AI models to do what humans can do: take one example and produce, from one example, a better output. And there will always be much more variation in that approach. But ... most human attempts to do something are total crap. Most AI attempts to do something will succeed, but they'll be comparatively bland, tasteless, "without soul", ...

And this is ignoring the problem that AI also has a massive limitation (that can't be solved, no matter how many nvidia cards you have) in that it trains against historical data. And counterfactuals don't work. What would have happened had Shakespeare decided Macbeth's wife was a force for good? Would the king still get murdered? Would it still be a great story? You can't work with counterfactuals.

NitpickLawyer 2 hours ago | parent [-]

> That doesn't work for AI models.

Of course it does. I know it does because I've been using variations of this workflow since gpt3.0. In fact it's the only way it can work, since by design LLMs work from left to right. You can't expect it to produce original stuff if you don't give it the anchors for what original means. It'd be like going to a new bar every night and asking for a "beer that you haven't had before". There's no information to work on there.

spwa4 an hour ago | parent [-]

The point was to take a random combination of story elements. Pick one each {King,dad,CEO} {betrays,kills,loves} {his enemy,the king,a foreign prime minister} and feed to an LLM.
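
As a minimal sketch, the random pick from those three sets looks like this. The prompt wording and the fixed seed are assumptions for illustration, and the actual model call is left out.

```python
import itertools
import random

subjects = ["King", "dad", "CEO"]
verbs = ["betrays", "kills", "loves"]
objects = ["his enemy", "the king", "a foreign prime minister"]

# Every premise the three sets can produce: 3 * 3 * 3 = 27.
all_premises = [" ".join(p) for p in itertools.product(subjects, verbs, objects)]

rng = random.Random(42)  # seeded only so the pick is reproducible
premise = rng.choice(all_premises)
prompt = f"Write a story outline where: {premise}."
print(len(all_premises), "possible premises")
print(prompt)
```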

The output will not be an intricate well designed epic storyline, but a cookie-cutter boring snoozefest.

BUT you can give that to a bunch of humans, who "insert their life experience" (ie. parts of their training data, translated to LLM terms) and sometimes out comes Game of Thrones, Star Wars, ...