jug 16 hours ago

This is my feeling too, across the board. Nowadays, benchmark wins seem to come from tuning that then causes losses in other areas. o3 and o4-mini also hallucinate more than o1 on SimpleQA and PersonQA. Synthetic data seems to cause higher hallucination rates. Reasoning models are at even higher risk, since a hallucination at any reasoning step can throw the model off track.

LLMs, in a general-purpose sense, have been done since earlier this year. OpenAI discovered this when they had to cancel GPT-5 and later released the “too costly for gains” GPT-4.5, which will be sunset soon.

I’m not sure the stock market has factored all this in yet. There needs to be a breakthrough to get us past this place.