faizshah 5 days ago

o3 was also an anomaly in terms of speed vs. response quality and price vs. performance. It was one of the fastest ways to get an answer to the kind of basic web searches you would otherwise have done yourself; if you used o3-pro instead, it would take 5x longer for a response that wasn't much better.

So far I haven't been impressed with GPT-5 Thinking, but I can't say concretely why yet. I'm thinking of running the same prompts side by side against o3 and GPT-5 Thinking.

Also, just from my first few hours with GPT-5 Thinking, I feel it's not as good at short prompts as o3. With o3, instead of writing a big XML or JSON prompt, I would just type the shortest possible phrase for the task, e.g. "best gpu for home LLM inference vs cloud api."
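For the side-by-side test, something like this minimal harness is what I have in mind, using the OpenAI Python SDK to hit both models with the same short prompt and timing them. (Treat "o3" and "gpt-5" as placeholder model IDs; use whatever IDs your account actually exposes.)

    import time

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # The deliberately short, ambiguous prompt from above.
    PROMPT = "best gpu for home LLM inference vs cloud api."

    for model in ("o3", "gpt-5"):  # model IDs are an assumption
        start = time.time()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
        )
        elapsed = time.time() - start
        answer = resp.choices[0].message.content
        print(f"--- {model}: {elapsed:.1f}s, {len(answer)} chars ---")
        print(answer)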

jjani 5 days ago

My chats so far have been similar to yours: across the board worse than o3, never better. I've had cases where it completely misinterpreted what I was asking for, a very strange experience which I'd never had with the other frontier models (o3, Sonnet, Gemini Pro). Those would of course get things wrong and make mistakes, but they'd never completely misunderstand what I was asking. I tried the same prompt on Sonnet and Gemini and both understood correctly.

It was related to software architecture, so supposedly something it should be good at. But for some reason it interpreted me as asking from an end-user perspective instead of a developer of the service, even though it was plenty clear to any human - and other models - that I meant the latter.

faizshah 5 days ago

> I've had cases where it completely misinterpreted what I was asking for, a very strange experience which I'd never had with the other frontier models (o3, Sonnet, Gemini Pro).

Yes! This exactly. With o3 you could ask your question imprecisely, or word it badly or ambiguously, and it would figure out what you meant; with GPT-5 I've had several cases in just the last few hours where it misunderstands the question and requires refinement.

> It was related to software architecture, so supposedly something it should be good at. But for some reason it interpreted me as asking from an end-user perspective instead of a developer of the service, even though it was plenty clear to any human - and other models - that I meant the latter.

For me, it showed up in daily-life use. Yesterday, for example, we were playing a board game and I wanted GPT-5 Thinking to clarify a rule. I gave it a deliberately ambiguous prompt: a picture of a card's "draw 1 card" power, plus the question "Is this from the deck or both?" (from the deck, or from the board). It responded by telling me the card I'd photographed was from the game Wingspan's deck, instead of clarifying the actual power on the card (o3 would never).

I'm not looking forward to how much time this will waste on my coding projects this weekend.

jjani 5 days ago

It appears to be overtuned toward extremely strict instruction following, interpreting things in a very inhuman way, which may be a benefit for agentic tasks at the cost of everything else.

My limited API testing with gpt-5 also showed this. As an example, the instruction "don't use academic language" caused it to omit basically half of what it would have output without that instruction. The other frontier models, and even open-source Chinese ones like Kimi and DeepSeek, understand perfectly well what we mean by it.
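Roughly what my test looked like, as a sketch (the test question here is made up for illustration, and "gpt-5" as a Chat Completions model ID is an assumption):

    from openai import OpenAI

    client = OpenAI()

    QUESTION = "Explain eventual consistency and when it's acceptable."  # hypothetical test prompt
    INSTRUCTION = "Don't use academic language."

    # Run the same question with and without the style instruction,
    # then compare how much of the answer survives.
    for label, system in (("baseline", None), ("with instruction", INSTRUCTION)):
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": QUESTION})
        resp = client.chat.completions.create(model="gpt-5", messages=messages)
        answer = resp.choices[0].message.content
        print(f"--- {label}: {len(answer)} chars ---")
        print(answer)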

int_19h 5 days ago

It's not great at agentic tasks either, not least because it seems very timid about doing things on its own and demands (not asks, demands) that the user confirm every tiny step.

SomewhatLikely 5 days ago

The default outputs are considerably shorter, even in thinking mode. Something that helped me get thinking mode back to an acceptable state was to switch to the Nerd personality and, in the traits customization setting, tell it to be complete and add extra relevant details. With those additions it compares favorably to o3 on my recent chat history and even improved some cases. I'd rather scan a longer output than have the LLM guess what to omit. But I know many people have complained about verbosity, so I can understand why they may have moved toward less verbiage.

energy123 5 days ago

Through the chat subscription, reasoning effort for gpt-5 is probably set to "low" or "medium", and verbosity is probably "medium".
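Via the API you can set both knobs yourself. A minimal sketch using the Responses API, with parameter shapes as documented around the GPT-5 launch (treat the exact names and defaults as assumptions):

    from openai import OpenAI

    client = OpenAI()

    # Turn both knobs up to approximate o3-style thoroughness;
    # the chat subscription presumably defaults to lower settings.
    resp = client.responses.create(
        model="gpt-5",
        input="best gpu for home LLM inference vs cloud api.",
        reasoning={"effort": "high"},
        text={"verbosity": "high"},
    )
    print(resp.output_text)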