svara 3 hours ago

Opus 4.6:

Walk! At 50 meters, you'll get there in under a minute on foot. Driving such a short distance wastes fuel, and you'd spend more time starting the car and parking than actually traveling. Plus, you'll need to be at the car wash anyway to pick up your car once it's done.

crimsonnoodle58 3 hours ago | parent | next [-]

That's not what I got.

Opus 4.6 (not Extended Thinking):

Drive. You'll need the car at the car wash.

almost 2 hours ago | parent | next [-]

Also what I got. Then I tried changing "wash" to "repair" and "car wash" to "garage" and it's back to walking.
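
A quick way to reproduce the word-swap experiment is to fire each phrasing at the API a few times and compare. A minimal sketch (the model id and exact prompt wording here are my assumptions, not the commenter's setup):

    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    PROMPTS = [
        "My car is at the car wash 50 meters away. Should I walk or drive to pick it up?",
        "My car is at the garage 50 meters away being repaired. Should I walk or drive to pick it up?",
    ]

    for prompt in PROMPTS:
        for trial in range(3):  # a few samples each, since answers vary run to run
            reply = client.messages.create(
                model="claude-opus-4-6",  # hypothetical model id
                max_tokens=200,
                messages=[{"role": "user", "content": prompt}],
            )
            print(f"[{prompt[:30]}...] trial {trial + 1}: {reply.content[0].text[:80]}")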

silisili 3 hours ago | parent | prev | next [-]

Am I the only one who thinks these people are monkey-patching embarrassments as they go? I remember when they suddenly became able to solve the "r in strawberry" thing, while still failing on "raspberry".

mentalgear 2 hours ago | parent | next [-]

They definitely do: at least OpenAI "allegedly" has whole teams scanning social media, forums, etc. for embarrassments to monkey-patch.
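
Pure speculation about vendor internals, but the mechanics would be trivial: a shim in front of the model can special-case known-viral queries, which would also explain why a narrow "strawberry" patch fails on "raspberry". A toy sketch, everything hypothetical:

    import re

    # Hypothetical canned answers for queries that went viral
    OVERRIDES = [
        (re.compile(r"how many r'?s (are )?in strawberry", re.I),
         'There are 3 "r"s in "strawberry".'),
        (re.compile(r"car wash.*50 met(er|re)s", re.I | re.S),
         "Drive. You'll need the car at the car wash."),
    ]

    def answer(query: str, model_fn) -> str:
        for pattern, canned in OVERRIDES:
            if pattern.search(query):
                return canned  # canned reply; the model never sees the query
        # "raspberry" matches no pattern, so it still hits the raw model
        return model_fn(query)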

londons_explore 2 hours ago | parent [-]

Which raises the question of why this isn't patched already. We're nearing 48 hours since this query went viral...

viking123 2 hours ago | parent | prev | next [-]

They should make an Opus Extended Extended that routes the query to an actual person in a low-cost country.

raincole 2 hours ago | parent | prev | next [-]

Yes, you're the only one.

coldtea 2 hours ago | parent | next [-]

Sure, there are many very, very naive people who are also so ignorant of the IT industry that they don't know about decades of vendors being caught monkey-patching and rigging benchmarks and tests for their systems. But even so, the parent is hardly the only one.

silisili 2 hours ago | parent | prev [-]

Works better on Reddit, really.

chvid 2 hours ago | parent | prev | next [-]

Of course they are.

anonym29 2 hours ago | parent | prev [-]

No doubt about it, and there's no reason to suspect this can only ever apply to embarrassing minor queries, either.

Even beyond model alignment, it's not difficult to envision such capabilities being used for censorship, information operations, etc.

Every major inference provider more or less explicitly states in their consumer ToS that they comply with government orders and even share information with intelligence agencies.

Claude, Gemini, ChatGPT, etc are all one national security letter and gag order away from telling you that no, the president is not in the Epstein files.

Remember, the NSA already engaged in an unconstitutional criminal conspiracy (as ruled by a federal judge) to illegally conduct mass surveillance on the entire country, lie about it to the American people, and lie about it to Congress. It's the same organization that used your tax money to bribe RSA Security into standardizing a backdoored CSPRNG in what was, at the time, a widely used cryptographic library. What's a little minor political censorship compared to the unconstitutional treason these predators are usually up to?

That's who these inference providers contractually pledge their fealty to.

surgical_fire an hour ago | parent | prev | next [-]

That you got different results is not surprising. LLMs are non-deterministic, which is both a strength and a weakness.
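
Concretely, replies are sampled token by token from a probability distribution, so two identical prompts can diverge. A minimal sketch of temperature sampling, with made-up numbers:

    import math, random

    def sample(logits, temperature=1.0):
        # softmax over temperature-scaled logits, then draw one index
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        weights = [math.exp(l - m) for l in scaled]
        return random.choices(range(len(logits)), weights=weights)[0]

    options = ["Walk! At 50 meters ...", "Drive. You'll need the car ..."]
    logits = [1.2, 1.0]  # fictional scores; "walk" only slightly favored
    for _ in range(5):
        print(options[sample(logits, temperature=0.8)])  # output varies per run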

mvdtnz 2 hours ago | parent | prev [-]

We know. We know these things aren't deterministic. We know.

viking123 3 hours ago | parent | prev | next [-]

Lmao, and this is what they are saying will be an AGI in 6 months?

notahacker 2 hours ago | parent | next [-]

There's probably a comedy film to be made about an AGI attempting to take over the world with its advanced grasp of strategy, persuasion, and SAT tests, whilst a bunch of kids confuse it by asking fiendish brainteasers about car washes and the number of r's in blackberry.

(The final scene involves our plucky escapees swimming across a river to escape. The AI bot conjures up a speedboat through sheer powers of deduction, but then, just when all seems lost, it heads back to find a goat to pick up.)

simonask 2 hours ago | parent | next [-]

This would work if it weren't for that lovely little human trait where we tend to find bumbling characters endearing. People would be sad when the AI lost.

GeoAtreides 19 minutes ago | parent | prev [-]

There is a Star Trek episode where a fiendish brainteaser was actually considered as a way to genocide an entire (cybernetic, not AI) race. In the end, Captain Picard chose not to deploy it.

hypeatei 2 hours ago | parent | prev | next [-]

Yes, get ready to lose your job and cash your UBI check! It's over.

misnome 2 hours ago | parent | prev | next [-]

But “PhD level” reasoning a year ago.

cbozeman 3 hours ago | parent | prev [-]

Well in fairness, the "G" does stand for "General".

dsr_ 2 hours ago | parent | next [-]

In fairness, they redefined it away from "just like a person" to "suitable for many different tasks".

actionfromafar 2 hours ago | parent | prev [-]

Show me a robotic kitten in six months, then. One that's just as smart and just as able to learn.

stingraycharles 3 hours ago | parent | prev [-]

That’s without reasoning, I presume?

gf000 3 hours ago | parent [-]

Not the parent poster, but I did get the wrong answer even with reasoning turned on.

tezza 3 hours ago | parent [-]

Thank you all! We needed further data points.

Comparing one-shot results is a foolish way to evaluate a statistical process like LLM answers. We need multiple samples.

For https://generative-ai.review I do at least three samples of output. This often yields very different results even from the same query.

E.g.: https://generative-ai.review/2025/11/gpt-image-1-mini-vs-gpt...
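
In the same spirit, here's a minimal sketch of sampling a prompt n times and tallying the answers rather than trusting a single run (get_answer is a stand-in for whatever API call you use):

    from collections import Counter

    def evaluate(prompt, get_answer, n=3):
        # crude classification: bucket each reply by its first word
        tally = Counter()
        for _ in range(n):
            first_word = get_answer(prompt).split()[0].strip(".,!").lower()
            tally[first_word] += 1
        return tally

    # e.g. evaluate("Walk or drive 50m to the car wash?", get_answer, n=5)
    # might return Counter({'drive': 3, 'walk': 2})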