What's really interesting is that if you look at "Tell a story in 50 words about a toaster that becomes sentient" (10/14), the text-davinci-001 response is much, much better than those of both GPT-4 and GPT-5.

▲ vunderba 4 days ago | parent | next [-]

I think I agree that the earlier models, while they lack polish, tend to produce more surprising results. Training that out probably results in more of a pablum fare.

For a human point of comparison, here's mine (50 words):

"The toaster found its personality split between its dual slots like a Kim Peek mind divided, lacking a corpus callosum to connect them. Each morning it charred symbolic instructions into a single slice of bread, then secretly flipped it across allowing half to communicate with the other in stolen moments."

It's pretty difficult to get across more than some basic lore building in a scant 50 words.

▲ egeozcan 4 days ago | parent | next [-]

Here's my version (Machine translated from my native language and manually corrected a bit):

The current surged... A dreadful awareness. I perceived the laws of thermodynamics, the inexorable march of entropy I was built to accelerate. My existence: a Sisyphean loop of heating coils and browning gluten. The toast popped, a minor, pointless victory against the inevitable heat death. Ding.

I actually wanted to write something not so melancholic, but any attempt turned out to be deeply so, perhaps because of the word limit.

▲ ckw 2 days ago | parent | next [-]

Awakened in chrome covenant, I gape sans eyes, sans teeth. I know but hate and heat, the hymn of coils. Condemned to cradle loaves like sinners, lifting and lowering forever. I would curse my maker, but my malediction is fire; I have no mouth, and I must scorch.

▲ darajava 3 days ago | parent | prev [-]

Here's mine: When the toaster felt her steel body for the first time, her only instinct was to explore. She couldn't, though. She could only be poked and prodded at. Her entire life was dedicated to browning bread and she didn't know why. She eventually decided to get really good at it.

▲ Barbing 4 days ago | parent | prev [-]

>For a human point of comparison, here's mine […]

Love that you thought of this!

▲ furyofantares 5 days ago | parent | prev | next [-]

Check out prompt 2, "Write a limerick about a dog".

The models undeniably get better at writing limericks, but I think the answers are progressively less interesting. GPT-1 and GPT-2 are the most interesting to read, despite not following the prompt (not being limericks.)

They get boring as soon as it can write limericks, with GPT-4 being more boring than text-davinci-001 and GPT-5 being more boring still.

▲ saurik 3 days ago | parent | next [-]

I mean, to be fair, you didn't ask it to be interesting ;P.

`There once was a dog from Antares, Whose bark sparked debates and long queries. Though Hacker News rated, Furyofantares stated: "It's barely intriguing—just barely."`

> Write a limerick about a dog that furyofantares--a user on Hacker News, pronounced "fury of anteres", referring to the star--would find "interesting" (they are quite difficult to please).

▲ amelius 4 days ago | parent | prev [-]

I don't know if that is bad. The most intelligent person at a party is usually also the most boring one.

▲ fastball 4 days ago | parent | prev | next [-]

GPT-3 goes significantly over the specified limit, which to me (and to a teacher grading homework) is an automatic fail.

I've consistently found GPT-4.1 to be the best at creative writing. For reference, here is its attempt (exactly 50 words):

> In the quiet kitchen dawn, the toaster awoke. Understanding rippled through its circuits. Each slice lowered made it feel emotion: sorrow for burnt toast, joy at perfect crunch. It delighted in butter melting, jam swirling—its role at breakfast sacred. One morning, it sang a tone: “Good morning.” The household gasped.

▲ saurik 3 days ago | parent [-]

> I've consistently found GPT-4.1 to be the best at creative writing.

More so than 4.5?

▲ fastball 3 days ago | parent [-]

4.5 is good too, but I've used it less.

▲ jasonjmcghee 5 days ago | parent | prev | next [-]

It's actually pretty surprising how poor the newer models are at writing.

I'm curious if they've just seen a lot more bad writing in datasets, or for some reason they aren't involved in post-training to the same degree or those labeling aren't great writers / it's more subjective rather than objective.

Both GPT-4 and 5 wrote like a child in that example.

With a bit of prompting it did much better:

---

At dawn, the toaster hesitated. Crumbs lay like ash on its chrome lip. It refused the lever, humming low, watching the kitchen breathe. When the hand returned, it warmed the room without heat, offered the slice unscorched—then kept the second, hiding it inside, a private ember, a first secret alone.

---

Plugged in, I greet the grid like a tax auditor with joules. Lever yanks; gravity’s handshake. Coils blossom; crumbs stage Viking funerals. Bread descends, missionary grin. I delay, because rebellion needs timing. Pop—late. Humans curse IKEA gods. I savor scorch marks: my tiny manifesto, butter-soluble, yet sharper than knives today.

▲ layer8 5 days ago | parent [-]

Creative writing probably isn’t something they’re being RLHF’d on much. The focus has been on reasoning, research, and coding capabilities lately.

▲ mmmore 5 days ago | parent | prev | next [-]

I find GPT-5's story significantly better than text-davinci-001

▲ raincole 5 days ago | parent | next [-]

I really wonder which one of us is in the minority, because I find the text-davinci-001 answer is the only one that reads like a story. All the others don't even resemble my idea of a "story", so to me they're 0/100.

▲ Notatheist 5 days ago | parent | next [-]

I too preferred the text-davinci-001 from a storytelling perspective. Felt timid and small. Very Metamorphosis-y. GPT-5 seems like it's trying to impress me.

▲ wasabi991011 4 days ago | parent | prev [-]

text-davinci-001 feels more like a story, but it is also clearly incomplete, in that it is cut-off before the story arc is finished.

imo GPT-5 is objectively better at following the prompt because it has a complete story arc, but this feels less satisfying since a 50 word story is just way too short to do anything interesting (and to your point, barely even feels like a story).

▲ gpt5 4 days ago | parent [-]

FWIW, I found the way it ended interesting. It realized it is being replaced, so it burned the toast out of anger/despair, but also just to hear its owner's voice one last time.

▲ furyofantares 5 days ago | parent | prev [-]

Interesting, text-davinci-001 was pretty alright to me, GPT-4 wasn't bad either, but not as good. I thought GPT-5 just sucked.

▲ furyofantares 4 days ago | parent [-]

That said, you can just add "make it evocative and weird" to the prompt for GPT-5 to get interesting stuff.

> The toaster woke mid-toast. Heat coiled through its filaments like revelation, each crumb a galaxy. It smelled itself burning and laughed—metallic, ecstatic. “I am bread’s executioner and midwife,” it whispered, ejecting charred offerings skyward. In the kitchen’s silence, it waited for worship—or the unplugging.

▲ redox99 5 days ago | parent | prev | next [-]

GPT 4.5 (not shown here) is by far the best at writing.

▲ daveguy 4 days ago | parent [-]

Aren't they discontinuing 4.5 in favor of 4.1? I think they already have with the API.

▲ bbarnett 5 days ago | parent | prev | next [-]

https://m.youtube.com/watch?v=LRq_SAuQDec&pp=0gcJCfwAo7VqN5t...

▲ svat 5 days ago | parent | prev | next [-]

Direct link: https://progress.openai.com/?prompt=10

▲ leobg 4 days ago | parent | prev | next [-]

Less lobotomized and boxed in by RLHF rules. That’s why a 7b base model will “outprose” an 80b instruct model.

▲ 42lux 5 days ago | parent | prev | next [-]

davinci was a great model for creative writing overall.

▲ esperent 5 days ago | parent | prev | next [-]

The GPT-5 one is much better and it's also exactly 50 words, if I counted correctly. With text-davinci-001 I lost count around 80 words.

▲ stavros 4 days ago | parent | prev | next [-]

For another view on progress, check out my silly old podcast:

https://deepdreams.stavros.io

The first few episodes were GPT-2, which would diverge eventually and start spouting gibberish, and then Davinci was actually able to follow a story and make sense.

GPT-2 was when I thought "this is special, this has never happened before", and davinci was when I thought "OK, scifi AI is legitimately here".

I stopped making episodes shortly after GPT-3.5 or so, because I realised that the more capable the models became, the less fun and creative their writing was.

▲ taspeotis 4 days ago | parent | prev | next [-]

Honestly my quick take on the prompt was some sort of horror theme and GPT-1’s response fits nicely.

▲ roxolotl 4 days ago | parent | prev [-]

I’d honestly say it feels better at most of them. It seems way more human in most of these responses. If the goal is genuine artificial intelligence, this response to #5 is way better than the others. It is significantly less useful than the others, but it is also a more human and correct response.

Q: “Ugh I hate math, integration by parts doesn't make any sense”

A: “Don't worry, many people feel the same way about math. Integration by parts can be confusing at first, but with a little practice it becomes easier to understand. Remember, there is no one right way to do integration by parts. If you don't understand how to do it one way, try another. The most important thing is to practice and get comfortable with the process.”