There's also this seven-week-old example [0] (linked in the essay) of ChatGPT very confidently recommending an asinine course of action because it was unable to understand what the hell it was being told.

Listening to the audio is not required, as there's a reasonably accurate on-screen transcript, but it is valuable to hear just how very hard they've worked to make this tool sound both confident and capable, even in situations where it's soul-crushingly incorrect. Those of us who have worked in Blasted Corporate Hellscapes may recognize how this manner of speaking can be very, very compelling to a certain sort of person (who, as it turns out, is frequently found in a management position).

[0] <https://www.instagram.com/reel/DUylL79kvub/>

simianwords 2 hours ago

This is a classic case of not using the proper version. Use the thinking version, gpt5.4 (text), and tell me if it bullshits.

Surely you must be able to find at least one example, no?

simoncion an hour ago

To be clear, is your assertion that aphyr was also not using the proper version? If that is your assertion, do tell me how you've come by that information.

(You did notice that the author of the essay and the author of the video I linked to are not the same person, and that neither of them shares a nym with me, yes?)

simianwords an hour ago

Hi, my position on the issue is that LLMs are powerful but may make mistakes in long-context problems like coding (which the harness solves through feedback), but they make close to no (undergrad-level) mistakes on questions that fit in 2-3 pages. For you personally: do you agree with me on this specific point about 2-3 pages?

I don't know what aphyr did, and tbh his whole screed on LLMs makes me feel he didn't use them properly, or at least that he's coming from a bad-faith angle.

That's why I'm asking you (and others). Please come up with a text prompt spanning < 4 pages and let's see if it bullshits.

Surely the implication of such a screed is that it should be super simple to find at least one example of it clearly bullshitting within my constraint, no? Or am I interpreting the post in a bad-faith way?

simoncion an hour ago

Neat.

So, despite the fact that it looks like you have to pay for ChatGPT Voice mode with video [0], it doesn't count as an "example of it bullshitting on ChatGPT (paid version)"?

That is, father_phi's use of what seems to be a paid version of ChatGPT to have a bullshit-filled conversation that definitely spans less than four pages doesn't count?

[0] The page at [1] declares that the video feature is "Available in ChatGPT Plus, Pro, Business, Enterprise, and Edu on mobile"

[1] <https://chatgpt.com/features/voice-with-video/>

simianwords an hour ago

Let's stick to my challenge, please: thinking version, find bullshit. If you can't, that's OK. Then do you accept that, under these constraints, the thinking version doesn't produce bullshit?

simoncion 39 minutes ago

Given aphyr's vocation (and how very lucrative it is), and how years and years of his writing indicates that he's very devoted to getting a correct and complete answer when investigating a question, I find it hard to believe that he's not using a paid version of the LLMs. If I knew him, I'd ask and verify, but I don't, so I won't.

> Lets stick to my challenge please...

I did. Your challenge was literally:

  If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)? Lets take any example of a text prompt fitting a few pages - it may be a question in science or math or any domain. Can you get it to bullshit?

father_phi's two-sentence question about whether one can use a cup that's closed at the top and open at the bottom definitely counts. Given what I've mentioned about aphyr above, I expect he has already run your challenge on the fanciest available version and reported on the results in the essay under discussion.

simianwords 35 minutes ago

> Use the thinking version gpt5.4 (text) and tell me if it bullshits

This was what I said. Text! Despite me specifically asking for text, you've shown a voice example. Not sure why?

So I take it you and I agree that GPT 5.4 thinking, on text that fits in < 4 pages, never bullshits? Then we are good!

If we agree on this, I don't think the post captures it in spirit.

simoncion 30 minutes ago

> This was what I said. Text!

No, that's what you said after I provided an example of paid ChatGPT emitting complete bullshit from a two-sentence prompt.

The challenge you issued is at [0].

[0] <https://news.ycombinator.com/item?id=47692592>

simianwords 27 minutes ago

> If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)? Lets take any example of a text prompt fitting a few pages - it may be a question in science or math or any domain. Can you get it to bullshit?

I clearly wrote "text prompt" here, and I repeated it a few times. It’s not my fault you didn’t read it. You are coming across as a bit of a bad-faith arguer.

In any case, do you agree that under these constraints bullshitting doesn’t exist?

simoncion 21 minutes ago

> I have clearly written text prompt here.

How do you think the "voice" interface works? It runs speech-to-text on the input and turns the input into text. The LLMs don't decode voice, they work on text.

You can see this process in action on many of father_phi's videos.

Regardless, I expect that aphyr's reported results are on the very latest publicly-available ChatGPT models.

simianwords 17 minutes ago

Very bad-faith arguments. I clearly said text, you disregarded it multiple times, and you are still arguing.

You've still not given me a single example of 5.4 thinking bullshitting in text. It says a lot that you have ignored this multiple times. Unfortunate!

simoncion 6 minutes ago

I'm not sure why you're ignoring aphyr's reports. I'm also unsure why you're ignoring my original statement that having the text of the conversation that led ChatGPT to bullshit is entirely irrelevant, as being unable to repro the report is even worse for ChatGPT than being able to repro it would be.

shrug

simianwords 3 minutes ago

I specified text precisely to exclude the voice one, because it uses 4o-mini underneath. And it's kinda stupid to keep ignoring that and try to save face now; reconsider this approach. I believe this is the 5th time I'm asking this: you are not able to produce a _single_ counterexample for my challenge? After all this, surely I can get a direct acknowledgement here.