cj 2 days ago

Not sure what your system prompt is, but the exact same prompt, word for word, gets me a response talking about "Zealandia, a continent that is 93% submerged underwater."

The 2nd example isn't all that impressive since you're asking it to provide you something very specific. It succeeded in not hallucinating. It didn't succeed at saying "I'm not sure" in the face of ambiguity.

I want the LLM to respond more like a librarian: When they know something for sure, they tell you definitively, otherwise they say "I'm not entirely sure, but I can point you to where you need to look to get the information you need."

simonw 2 days ago | parent [-]

I'm using regular GPT-5, no custom instructions and memory turned off.

Can you link to your shared Zealandia result?

I think that mural result was spectacularly impressive, given that it started with a photo I took of the mural with almost no additional context.

cj 2 days ago | parent [-]

I can't link since it's in an enterprise account.

Interestingly, I tried the same question in a separate ChatGPT account and it gave a response similar to the one you got. Maybe it was pulling context from the (separate) chat thread where it was talking about Zealandia. Which raises another question: once it gets something wrong, will it just keep reinforcing the inaccuracy in future chats? That could lead to some very suboptimal behavior.

Getting back on topic, I strongly dislike the argument that this is all "user error". These models are on track to be worth a trillion dollars at some point in the future. Let's raise our expectations of them. Fix the models, not the users.

simonw 2 days ago | parent [-]

I wonder if you're stuck on an older model like GPT-4o?

EDIT: I think that's likely what is happening here. I tried the prompt against GPT-4o and got this: https://chatgpt.com/share/68b8683b-09b0-8006-8f66-a316bfebda...

My consistent position on this stuff is that it's actually way harder to use than most people (and the companies marketing it) let on.

I'm not sure if it's getting easier to use over time either. The models are getting "better" but that partly means their error cases are harder to reason about, especially as they become less common.