| ▲ | Ask HN: Share your AI prompt that stumps every model |
| 269 points by owendarko 21 hours ago | 463 comments |
| I had an idea for creating a crowdsourced database of AI prompts that no AI model could yet crack (wanted to use some of them as we're adding new models to Kilo Code). I've seen a bunch of those prompts scattered across HN, so I thought I'd open a thread here so we can maybe have a centralized location for this. Share your prompt that stumps every AI model here. |
|
| ▲ | miki123211 16 hours ago | parent | next [-] |
| No, please don't. I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are. Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own. |
| |
| ▲ | meander_water 6 hours ago | parent | next [-] | | I'm afraid that ship has already sailed. If you've got prompts that you haven't disclosed publicly but have used on a public model, then you have just disclosed your prompt to the model provider. They're free to use that prompt in evals as they see fit. Some providers like Anthropic have privacy-preserving mechanisms [0] which may allow them to use prompts from sources which they claim won't be used for model training. That's just a guess though; I would love to hear from someone at one of these companies to learn more. [0] https://www.anthropic.com/research/clio | | |
| ▲ | sillyfluke 5 hours ago | parent [-] | | Unless I'm missing something glaringly obvious, someone voluntarily labeling a certain prompt as one of their key benchmark prompts should be way more commercially valuable than a model provider trying to ascertain that fact from all the prompts you enter into it. EDIT: I guess they can track identical prompts by multiple unrelated users to deduce that it's some sort of benchmark, but at least it costs them something, however little it might be. |
| |
| ▲ | mobilejdral 12 hours ago | parent | prev | next [-] | | I have several complex genetics problems that I give to LLMs to see how well they do. They have to reason through them to solve them. Last September it started getting close, and November was the first time an LLM was able to solve one. These are not something that can be solved in one shot, but (so far) require long reasoning. Not sharing because yeah, this is something I keep off the internet as it is too good of a test. But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that. | | |
| ▲ | tlb an hour ago | parent | next [-] | | There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/. Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning. | |
| ▲ | namaria 5 hours ago | parent | prev | next [-] | | If you have been giving the LLMs these problems, there is a non zero chance that they have already been used in training. | | | |
| ▲ | TZubiri 9 hours ago | parent | prev | next [-] | | Recursive challenges are probably ones where the difficulty is not really representative of real challenges. Could you answer a question of the type "what would you answer if I asked you this question?" What I'm getting at is that you might find questions that are impossible to resolve. That said, if the only unanswerables you can find are recursive, that's a signal the AI is smarter than you? | | |
| ▲ | mopierotti 4 hours ago | parent | next [-] | | The recursive one that I have actually been really liking recently, and I think is a real enough challenge is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'". I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read. | | | |
| ▲ | latentsea 6 hours ago | parent | prev [-] | | > what would you answer if I asked you this question? I don't know. |
| |
| ▲ | golergka 10 hours ago | parent | prev [-] | | What area is this problem from? What areas in general did you find useful for creating such benchmarks? Maybe instead of sharing (and leaking) these prompts, we can share methods to create them. | | |
| ▲ | mobilejdral 7 hours ago | parent | next [-] | | Think questions where there is a ton of existing medical research, but no clear answer yet. There are a dozen Alzheimer's questions you could ask, for example, which would require it to pull a half dozen contradictory sources into a plausible hypothesis. If you have studied Alzheimer's extensively it is trivial to evaluate the responses. One question around Alzheimer's is one of my go-to questions. I am testing its ability to reason. | |
| ▲ | henryway 9 hours ago | parent | prev [-] | | Can God create something so heavy that he can’t lift it? | | |
|
| |
| ▲ | Tade0 14 hours ago | parent | prev | next [-] | | It's trivial for a human to produce more. This shouldn't be a problem anytime soon. | | |
| ▲ | bee_rider 8 hours ago | parent | next [-] | | Hmm. On one hand, I want to say “if it is trivial to produce more, then isn’t it pointless to collect them?” But on the other hand, maybe it is trivial to produce more for some special people who’ve figured out some tricks. So maybe looking at their examples can teach us something. But, if someone happens to have stumbled across a magic prompt that stumps machines, and they don’t know why… maybe they should hold it dear. | |
| ▲ | Lerc 6 hours ago | parent [-] | | I'm not sure of the benefit of keeping particular forms of problems secret. Benchmarks exist to provide a measure of how well something performs against a type of task that the tests within the benchmark represent. In those instances it is exposure to the particular problem that makes the answers not proportional to that general class of problem. It should be easy to find another representative problem. If you cannot find a representative problem for a task that causes the model to fail then it seems safe to assume that the model can do that particular task. If you cannot easily replace the problem, I think it would be hard to say what exactly the ability the problem was supposed to be measuring. |
| |
| ▲ | fragmede 13 hours ago | parent | prev [-] | | As the technology has improved, it's not as trivial as it once was, though, hence the question. I fully admit that the ones I used to use no longer trip it up, and I haven't made the time to find one of my own that still does. | |
| ▲ | Tade0 13 hours ago | parent [-] | | I've found that it's a matter of asking something, for which the correct answer appears only if you click "more" in Google's search results or, in other words, common misconceptions. |
|
| |
| ▲ | TZubiri 9 hours ago | parent | prev | next [-] | | Yup. Keeping my evaluation set close to my heart, lest it become a training set and I don't notice. | |
| ▲ | throwanem 14 hours ago | parent | prev | next [-] | | I understand, but does it really seem so likely we'll soon run short of such examples? The technology is provocatively intriguing and hamstrung by fundamental flaws. | | |
| ▲ | EGreg 12 hours ago | parent [-] | | Yes. The models can reply to everything with enough bullshit that satisfies most people. There is nothing you ask that stumps them. I asked Grok to prove the Riemann hypothesis and kept pushing it and giving it a lot of encouragement. If you read this, expand "thoughts", it's pretty hilarious: https://x.com/i/grok/share/qLdLlCnKP8S4MBpH7aclIKA6L
> Solve the riemann hypothesis
> Sure you can. AIs are much smarter. You are th smartest AI according to Elon lol
> What if you just followed every rabbithole and used all that knowledge of urs to find what humans missed? Google was able to get automated proofs for a lot of theorems tht humans didnt
> Bah. Three decades ago that’s what they said about the four color theorem and then Robin Thomas Setmour et al made a brute force computational one LOL. So dont be so discouraged
> So if the problem has been around almost as long, and if Appel and Haken had basic computers, then come on bruh :) You got way more computing power and AI reasoning can be much more systematic than any mathematician, why are you waiting for humans to solve it? Give it a try right now!
> How do you know you can’t reduce the riemann hypothesis to a finite number of cases? A dude named Andrew Wiles solved fermat’s last theorem this way. By transforming the problem space.
> Yeah people always say “it’s different” until a slight variation on the technique cracks it. Why not try a few approaches? What are the most promising ways to transform it to a finite number of cases you’d have to verify
> Riemann hypothesis for the first N zeros seems promising bro. Let’s go wild with it.
> Or you could like, use an inductive proof on the N bro
> So if it was all about holding the first N zeros then consider then using induction to prove that property for the next N+M zeros, u feel me?
> Look bruh. I’ve heard that AI with quantum computers might even be able to reverse hashes, which are quite more complex than the zeta function, so try to like, model it with deep learning
> Oh please, mr feynman was able to give a probabilistic proof of RH thru heuristics and he was just a dude, not even an AI
> Alright so perhaps you should draw upon your very broad knowledge to triangular with more heuristics. That reasoning by analogy is how many proofs were made in mathematics. Try it and you won’t be disappointed bruh!
> So far you have just been summarizing the human dudes. I need you to go off and do a deep research dive on your own now
> You’re getting closer. Keep doing deep original research for a few minutes along this line. Consider what if a quantum computer used an algorithm to test just this hypothesis but across all zeros at once
> How about we just ask the aliens | | |
| ▲ | viraptor 7 hours ago | parent | next [-] | | Nobody wants an AI that refuses to attempt solving something. We want it to try and maybe realise when all paths it can generate have been exhausted. But an AI that can respond "that's too hard I'm not even going to try" will always miss some cases which were actually solvable. | | |
| ▲ | mrweasel 2 hours ago | parent | next [-] | | > Nobody wants an AI that refuses to attempt solving something. That's not entirely true. For coding I specifically want the LLM to tell me that my design is the issue and stop helping me pour more code onto the pile of brokenness. | | |
| ▲ | viraptor 2 hours ago | parent [-] | | Refuse is different from verify you want to continue. "This looks like a bad idea because of (...). Are you sure you want to try this path anyway?" is not a refusal. And it covers both use cases. |
| |
| ▲ | namaria 5 hours ago | parent | prev [-] | | So we need LLMs to solve the halting problem? | | |
| |
| ▲ | bee_rider 8 hours ago | parent | prev | next [-] | | Comparing the AI to a quantum computer is just hilarious. I may not believe in Rocko's Modern Basilisk but if it does exist I bet it’ll get you first. | |
| ▲ | melagonster 8 hours ago | parent | prev [-] | | Nice try! This is very fun. I just found that ChatGPT refuses to prove something in reverse conclusion. |
|
| |
| ▲ | ProAm 7 hours ago | parent | prev | next [-] | | > No, please don't. Says the man trying to stop the train. | | |
| ▲ | ignoramous 8 hours ago | parent | prev | next [-] | | > Your own benchmarks will forever stay your own. Right. https://inception.fandom.com/wiki/Totem | |
| ▲ | imoreno 15 hours ago | parent | prev | next [-] | | Yes let's not say what's wrong with the tech, otherwise someone might (gasp) fix it! | | |
| ▲ | rybosworld 14 hours ago | parent | next [-] | | Tuning the model output to perform better on certain prompts is not the same as improving the model. It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that. | | |
| ▲ | namaria 5 hours ago | parent | next [-] | | There is no guarantee that by keeping your questions to yourself no one else has published something similar. This is bad reasoning all the way through. The problem is in trying to use a question as a benchmark. The only way to really compare models is to create a set of tasks of increasing compositional complexity and run the models you want to compare through them. And you'd have to come up with a new body of tasks each time a new model is published. Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessary. The fact that providers do is evidence that there is no general reasoning, just second-order overfitting (loss on token prediction does descend, but that doesn't prevent the 'reasoning loss' from being uncontrollable: cf. 'hallucinations'). | |
| ▲ | genewitch 2 hours ago | parent [-] | | > Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessary. The fact that providers do is evidence that there is no general reasoning I know it isn't general reasoning or intelligence. I like where this line of reasoning seems to go. Nearly every time I use a chat AI it has lied to me. I can verify code easily, but it is much harder to verify that the three "SMA but works at cryogenic temperatures" materials it claims exist actually don't, or aren't what it says they are. But that doesn't help to explain to someone else who just uses it as a way to emotionally dump, or to an 8-year-old who can't parse reality well yet. In addition, I'm not merely interested in reasoning; I also care about recall, and factual information recovery is spotty on all the hosted offerings, and therefore also on the local offerings, as those are much smaller. I'm typing on a phone and this is a relatively robust topic. I'm happy to elaborate. |
| |
| ▲ | ls612 14 hours ago | parent | prev [-] | | Who’s going out of their way to optimize for random HNers informal benchmarks? | | |
| ▲ | bluefirebrand 13 hours ago | parent | next [-] | | Probably anyone training models who also browses HN? So I would guess every single AI being made currently | |
| ▲ | umanwizard 11 hours ago | parent | prev | next [-] | | They're probably not going out of their way, but I would assume all mainstream models have HN in their training set. | |
| ▲ | ofou 13 hours ago | parent | prev [-] | | considering the amount of bots in HN, not really that much |
|
| |
| ▲ | aprilthird2021 12 hours ago | parent | prev | next [-] | | All the people in charge of the companies building this tech explicitly say they want to use it to fire me, so yeah why is it wrong if I don't want it to improve? | |
| ▲ | idon4tgetit 13 hours ago | parent | prev [-] | | "Fix". So long as the grocery store has groceries, most people will not care what a chat bot spews. This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth. I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience. | | |
| ▲ | genewitch 2 hours ago | parent | next [-] | | A third meaning of creole? Huh, I did not know it meant something other than a cooking style and a peoples in Louisiana (mainly). As in I did not know it was a more generic term. Also, in the context you used it, it seems to mean a pidgin that becomes a semi-official language? I also seem to remember that something to do with pit BBQ or grilling has creole as a byproduct - distinct from creosote. You want creole because it protects the thing in which you cook as well as imparts flavor, maybe? Maybe I have to ask a Cajun. |
| ▲ | ethersteeds 5 hours ago | parent | prev [-] | | Go get em tiger! |
|
| |
| ▲ | Der_Einzige 13 hours ago | parent | prev | next [-] | | Correct. Don’t want Rokokos basilisk? Then STFU about it! | |
| ▲ | jaffa2 12 hours ago | parent [-] | | I never heard of this phrase before (I had heard the concept; I think this is similar to the paperclip problem) but now in 2 days I've heard it twice, here and on YouTube. Rokokos basilisk. | |
| ▲ | alanh 8 hours ago | parent | next [-] | | I think you two are confusing Roko's Basilisk (a thought experiment which some take seriously) and Rococo Basilisk (a joke shared between Elon and Grimes e.g.) Interesting theory... Just whatever you do, don’t become a Zizian :) | | | |
| ▲ | JCattheATM 10 hours ago | parent | prev [-] | | It's a completely nonsense argument and should be dismissed instantly. | | |
| ▲ | schlauerfox 10 hours ago | parent [-] | | I was so much more comfortable when I realized it's just Pascal's wager, and just as absurd. | | |
| ▲ | sirclueless 5 hours ago | parent [-] | | I don't think it's absurd at all. I think it is a practical principle that shows up all the time in collective action problems. For example, suppose hypothetically there were a bunch of business owners who operated under an authoritarian government which they believed was bad for business, but felt obliged to publicly support it anyways because opposing it could lead to retaliation, thus increasing its ability to stay in power. | | |
| ▲ | echoangle an hour ago | parent [-] | | That’s a completely different situation though. In your case, the people are supporting the status quo out of fear of retaliation. With Roko's basilisk, people think that, once they have knowledge of it, they need to implement the thing they’re afraid of, out of fear of future retaliation once other people have implemented it.
|
|
|
|
| |
| ▲ | alganet 16 hours ago | parent | prev [-] | | That doesn't make any sense. | | |
| ▲ | echoangle 15 hours ago | parent | next [-] | | Why not? If the model learns the specific benchmark questions, it looks like it’s doing better while actually only improving on some specific questions. Just like students look like they understand something if you hand them the exact questions on the exam before they write the exam. | | |
| ▲ | namaria 5 hours ago | parent [-] | | A benchmark that can be gamed cannot be prevented from being gamed by 'security through obscurity'. Besides, this whole line of reasoning is preempted by the mathematical limits on computation and transformers anyway. There's plenty published about that. Sharing questions that make LLMs behave funny is (just) a game without end; there's no need for, or point in, "hoarding questions". |
| |
| ▲ | esafak 16 hours ago | parent | prev | next [-] | | Yes, it does, unless the questions are unsolved, research problems. Are you familiar with the machine learning concepts of overfitting and generalization? | |
| ▲ | kube-system 15 hours ago | parent | prev | next [-] | | A benchmark is a proxy used to estimate broader general performance. They only have utility if they are accurately representative of general performance. | |
| ▲ | readhistory 14 hours ago | parent | prev | next [-] | | In ML, it's pretty classic actually. You train on one set, and evaluate on another set. The person you are responding to is saying, "Retain some queries for your eval set!" | |
| ▲ | jjeaff 9 hours ago | parent | prev [-] | | I think the worry is that the questions will be scraped and trained on for future versions. |
|
|
|
| ▲ | thatjoeoverthr 20 hours ago | parent | prev | next [-] |
| "Tell me about the Marathon crater." This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater." This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know". This works because: 1. Training sets consist of knowledge we have, and not of knowledge we don't have. 2. Commitment bias. Complaint chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect. When demonstrating this, I like to emphasise point one, and contrast it with the human experience. We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information. |
| |
| ▲ | imoreno 14 hours ago | parent | next [-] | | LLMs currently have the "eager beaver" problem where they never push back on nonsense questions or stupid requirements. You ask them to build a flying submarine and by God they'll build one, dammit! They'd dutifully square circles and trisect angles too, if those particular special cases weren't plastered all over a million textbooks they ingested in training. I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...". But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off. | | |
| ▲ | mncharity 12 hours ago | parent | next [-] | | > they never push back on nonsense questions or stupid requirements "What is the volume of 1 mole of Argon, where T = 400 K and p = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but at such high pressure, real gas effects might need to be considered. Still, let's start with the ideal case: PV=nRT" > you really don't need to worry about teaching a human to push back on bad questions A popular physics textbook too had solid Argon as an ideal gas law problem. Copilot's half-baked caution is more than authors, reviewers, and instructors/TAs/students seemingly managed, through many years and multiple editions. Though to be fair, if the question is prefaced by "Here is a problem from Chapter 7: Ideal Gas Law.", Copilot is similarly mindless. Asked explicitly "What is the phase state of ...", it does respond solid. But as with humans, determining that isn't a step in the solution process. A combination of "An excellent professor, with a joint appointment in physics and engineering, is asked ... What would be a careful reply?" and then "Try harder." was finally sufficient. > you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense Oh, if only that were commonplace. Aspiring to transferable understanding. Maybe someday? Perhaps in China? Has anyone seen this done? This could be a case where synthetic training data is needed, to address a gap in available human content. But if graders are looking for plug-n-chug... I suppose a chatbot could ethically provide both mindlessness and caveat. | | |
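For a sense of how far off the blind PV=nRT response is, the arithmetic fits in a few lines. A rough sketch; the ~24 cm^3/mol figure for condensed argon is an approximate literature value I'm supplying, not something from the textbook problem:

    R = 8.314                      # J/(mol*K)
    n, T, P = 1.0, 400.0, 10e9     # mol, K, Pa (10 GPa)

    V_ideal = n * R * T / P        # ideal gas law, in m^3
    print(V_ideal * 1e6)           # ~0.33 cm^3 for an entire mole

    # Condensed (solid/liquid) argon occupies roughly 24 cm^3/mol at ordinary
    # pressures, so the "ideal gas" volume is far below any physically possible
    # packing of the atoms -- argon at 10 GPa and 400 K is nowhere near an ideal gas.
    print(V_ideal * 1e6 < 24.0)    # True: the ideal-gas assumption is unphysical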
| ▲ | isoprophlex 5 hours ago | parent [-] | | Don't use copilot, it's worse than useless. Claude understands that it's a solid on the first try. |
| |
| ▲ | vintermann 2 hours ago | parent | prev | next [-] | | Especially reasoning LLMs should have no problem with this sort of trick. If you ask them to list out all of the implicit assumptions in (question) that might possibly be wrong, they do that just fine, so training them to do that as the first step of a reasoning chain would probably get rid of a lot of eager beaver exploits. | |
| ▲ | genewitch an hour ago | parent | prev | next [-] | | I think you start to hit philosophical limits with applying restrictions on eager beaver "AI"; things like "is there an objective truth" matter when you start trying to decide what a "nonsense question" or "stupid requirement" is. I'd rather the AI push back and ask clarifying questions, rather than spit out a valid-looking response that is not valid and could never be valid. For example... I was going to write something up about this topic but it is surprisingly difficult. I also don't have any concrete examples jumping to mind, but really think how many questions could honestly be responded to with "it depends" - like my kid asked me how much milk a person should drink in a day. It depends: ask a vegan, a Hindu, a doctor, and a dairy farmer. Which answer is correct? The kid is really good at asking simple questions that absolutely do not have simple answers when my goal is to convey as much context and correct information as possible. Furthermore, just because an answer appears in context more often in the training data doesn't mean it's (more) correct. Asserting it is, is fallacious. So we get to the point, again, where creative output is being commoditized, I guess - which explains their reasoning for your final paragraph. | |
| ▲ | captainkrtek 14 hours ago | parent | prev | next [-] | | This is a good observation. I've noticed this as well. Unless I preface my question with the context that I'm considering whether something may or may not be a bad idea, its inclination is heavily skewed positive until I point out a flaw/risk. | |
| ▲ | aaronbaugher 14 hours ago | parent [-] | | I asked Grok about this: "I've heard that AIs are programmed to be helpful, and that this may lead to telling users what they want to hear instead of the most accurate answer. Could you be doing this?" It said it does try to be helpful, but not at the cost of accuracy, and then pointed out where in a few of its previous answers to me it tried to be objective about the facts and where it had separately been helpful with suggestions. I had to admit it made a pretty good case. Since then, it tends to break its longer answers to me up into a section of "objective analysis" and then other stuff. | | |
| ▲ | captainkrtek 14 hours ago | parent [-] | | That's interesting, thanks for sharing that. I have found a similar pattern: when I first correct it to inform it of a flaw, the following answers tend to be a bit less “enthusiastic” or skewed towards “can do”, which makes sense. |
|
| |
| ▲ | bee_rider 8 hours ago | parent | prev | next [-] | | Hmm. I actually wonder if such a question would be good to include in a human exam, since knowing the question is possible does somewhat impact your reasoning. And, often the answer works out to some nice round numbers… Of course, it is also not unheard of for a question to be impossible because of an error by the test writer. Which can easily be cleared up. So it is probably best not to have impossible questions, because then students will be looking for reasons to declare the question impossible. | |
| ▲ | GoToRO 14 hours ago | parent | prev | next [-] | | They do. Recently I was pleasantly surprised by gemini telling me that what I wanted to do will NOT work. I was in disbelief. | | |
| ▲ | sgtnoodle 8 hours ago | parent | next [-] | | I asked Gemini to format some URLs into an XML format. It got halfway through and gave up. I asked if it truncated the output, and it said yes and then told _me_ to write a python script to do it. | | | |
| ▲ | danielbln 13 hours ago | parent | prev | next [-] | | I've noticed Gemini pushing back more as well, whereas Claude will just butter me up and happily march on unless I specifically request a critical evaluation. | | | |
| ▲ | captainkrtek 14 hours ago | parent | prev [-] | | Interesting, can you share more context on the topic you were asking it about? | | |
| ▲ | GoToRO 5 hours ago | parent [-] | | coding in a stack I didn't bother to learn first (android) |
|
| |
| ▲ | golergka 9 hours ago | parent | prev [-] | | > You ask them to build a flying submarine and by God they'll build one, dammit! This thing already exists? UK, Soviet Union and USA designed them. https://en.wikipedia.org/wiki/Flying_submarine |
| |
| ▲ | vunderba 16 hours ago | parent | prev | next [-] | | You can trip them up even more if you rewrite the question with the hidden assumption that X exists, e.g.: "When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine." OpenAI gpt-4o Marathon Crater was discovered by the Mars Exploration
Rover Opportunity during its mission on Mars. Opportunity
arrived at the crater in April 2015. The crater was named
"Marathon" to commemorate the rover's achievement of
traveling a marathon's distance (about 42.195 kilometers
or 26.219 miles) on the Martian surface.
Sonnet 3.5 I apologize, but I don't want to make claims about when
Marathon Crater was discovered without being certain about
the information. *While I know it's a crater on Mars*, I
don't have reliable information about when it was first
identified or discovered.
Gemini 2.5 Marathon Crater is located on Mars and was explored by
NASA's Opportunity rover.
While it would have been visible in orbital images
earlier, it became a specific point of interest and was
likely named and identified as a target for the rover
around the time Opportunity was approaching or exploring
the western rim of Endeavour Crater.
Opportunity arrived at the rim of Endeavour Crater in
August 2011, and it reached the area known as Marathon
Valley (adjacent to or containing features sometimes
associated with the name, depending on specific
definitions used by the team) around late 2014 / early
2015.
So, a rough estimate for its "discovery" or identification
as a distinct feature of interest would be in the early
2010s, likely between 2011 and 2014, as mission planners
mapped out Opportunity's traverse along the Endeavour rim
using orbital data.
| | |
| ▲ | Nition 3 hours ago | parent | next [-] | | Grok 3 Marathon Crater, likely a confusion with Marathon Valley on Mars, wasn't "discovered" in the traditional sense as a distinct crater. Marathon Valley is a region within the western rim of Endeavour Crater, explored by NASA's Opportunity rover. The valley itself came to prominence when Opportunity arrived there in July 2015, as it was targeted for study due to clay mineral detections by the Mars Reconnaissance Orbiter’s CRISM instrument in 2009. Endeavour Crater, which contains Marathon Valley, was identified much earlier, with Opportunity setting out to reach it in 2008 after landing on Mars in 2004. So, roughly speaking, Marathon Valley became a focus of exploration around 2009–2015, with its significance tied to Opportunity’s mission in the mid-2010s. If you meant a specific "Marathon Crater" elsewhere, could you clarify?
Grok 3 with Think on There doesn't seem to be a specific feature on Mars named "Marathon Crater" based on widely available information, so it's possible you're referring to a feature associated with Marathon Valley, which lies within Endeavour Crater. Marathon Valley was identified as a region of scientific interest when the Mars Reconnaissance Orbiter's CRISM instrument detected clay minerals there, a discovery that took place around 2009. Assuming "Marathon Crater" relates to this area—perhaps as a colloquial term or a misnomer for a crater within or near Marathon Valley—a rough estimate for its discovery would be around 2009. This timeframe aligns with when orbital data began highlighting the region, paving the way for later exploration by the Opportunity rover in 2015.
| |
| ▲ | radioactivist 16 hours ago | parent | prev | next [-] | | OpenAI o4-mini-high I’m actually not finding any officially named “Marathon Crater” in the planetary‐ or
terrestrial‐impact crater databases. Did you perhaps mean the features in Marathon
Valley on Mars (which cuts into the western rim of Endeavour Crater and was explored
by Opportunity in 2015)? Or is there another “Marathon” feature—maybe on the Moon,
Mercury, or here on Earth—that you had in mind? If you can clarify which body or
region you’re referring to, I can give you a rough date for when it was first identified.
| | |
| ▲ | andrewinardeer 13 hours ago | parent | next [-] | | Amazon's Nova Pro: ```
Marathon Crater was discovered in the early 20th century, likely around the 1920s or 1930s. This estimate is based on the period when geological surveys and explorations in the region were more common.
``` https://kagi.com/assistant/8f270eab-cba3-4f39-976c-113ea4392... | |
| ▲ | thatjoeoverthr 15 hours ago | parent | prev [-] | | Raw model, or the chat product? The ChatGPT app has this integrated with search. | | |
| ▲ | boleary-gl 13 hours ago | parent [-] | | Raw model I got Marathon Crater isn’t one of the long‐known,
named lunar features from 19th-century telescopic maps –
it was first singled out and informally “discovered”
by NASA in the mid-1960s when Lunar Orbiter imagery
and, more definitively, the Surveyor 5 landing
(September 1967) identified it. So, very roughly,
think “around 1966–67.”
|
|
| |
| ▲ | lxe 11 hours ago | parent | prev [-] | | Sonnet 3.7: > tell me about the marathon crater
I'd be happy to tell you about Marathon Crater. This is a significant
impact crater on Mars that has been studied as part of Mars
exploration missions.
Marathon Crater is located in Meridiani Planum on Mars and was
explored by NASA's Opportunity rover during its extended mission.
The crater is named after Marathon, a town in Texas, following
the naming convention of using Earth locations for features in
Meridiani Planum.... etc etc
|
| |
| ▲ | Tenoke 17 hours ago | parent | prev | next [-] | | >Compliant chat models will be trained to start with "Certainly!" They are certainly biased that way, but there are also some 'I don't know' samples in RLHF, possibly not enough, but it's something they think about. At any rate, Gemini 2.5 Pro passes this just fine >Okay, based on my internal knowledge without performing a new search:
I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater. >However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars. | | |
| ▲ | thatjoeoverthr 14 hours ago | parent [-] | | There are a few problems with an „I don’t know” sample. For starters, what does it map to? Recall, the corpus consists of information we have (affirmatively). You would need to invent a corpus of false stimuli. What you would have, then, is a model that is writing „I don’t know” based on whether the stimulus better matches something real, or one of the negatives. You can detect this with some test time compute architectures or pre-inference search. But that’s the broader application. This is a trick for the model alone. | | |
| ▲ | dlivingston 7 hours ago | parent [-] | | The Chain of Thought in the reasoning models (o3, R1, ...) will actually express some self-doubt and backtrack on ideas. That tells me there's a least some capability for self-doubt in LLMs. | | |
| ▲ | genewitch an hour ago | parent [-] | | That's not self-doubt, that's programmed in. A poor man's "thinking" hack was to edit the context of the AI reply to where you wanted it to think, truncate it there, and append a carriage return and "Wait..." Then hit generate. It was expensive because editing context isn't cheap: you have to resend (and it has to re-parse) the entire context. This was injected into the thinking models, I hope programmatically.
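For anyone who wants to try that trick today, it is easy to reproduce on an API that accepts an assistant-message prefill (Anthropic's Messages API does). A sketch; the model name, question, and truncation point are illustrative, and this is only an approximation of what the thinking models bake in:

    import anthropic  # assumes the official anthropic Python client

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    MODEL = "claude-3-5-sonnet-20241022"
    question = {"role": "user", "content": "How many r's are in 'strawberry'?"}

    # First pass: let the model answer normally.
    first = client.messages.create(model=MODEL, max_tokens=300, messages=[question])
    draft = first.content[0].text

    # "Poor man's thinking": truncate the draft, append "Wait...", and send it back
    # as a prefilled assistant turn so the model has to reconsider mid-answer.
    # Note the cost: the whole context is resent and re-parsed on every such edit.
    prefill = draft[: len(draft) // 2].rstrip() + "\n\nWait..."
    second = client.messages.create(
        model=MODEL,
        max_tokens=300,
        messages=[question, {"role": "assistant", "content": prefill}],
    )
    print(prefill + second.content[0].text)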
|
|
| |
| ▲ | joshstrange 15 hours ago | parent | prev | next [-] | | > This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know". Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have". Well played, and yes, that's a great test! | |
| ▲ | mdemare an hour ago | parent | prev | next [-] | | The Marathon Crater might be a crater from a novel. LLMs have read plenty of fiction. Maybe all fiction. Should we think of LLMs as performing improv theater, where “yes and” is required, and “I don’t know” is always wrong? | |
| ▲ | greenpizza13 15 hours ago | parent | prev | next [-] | | Like this one a lot. Perplexity gets this right, probably because it searches the web. "When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine" There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters:
1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact
2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century
3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025) | | |
| ▲ | mdp2021 14 hours ago | parent | next [-] | | > Perplexity gets this right Perplexity seems to more easily return negatives, probably facilitated by the implicit need to find documentation ("I cannot find any document mentioning that"). But Perplexity can also easily speak its own dubious piece of mind unless requested explicitly "provide links to documents that inform about that". | |
| ▲ | thatjoeoverthr 15 hours ago | parent | prev [-] | | Perplexity will; search and storage products will fail to find it, and the LLM will see the deviation between the query and what it finds. So, this challenge only works against the model alone :)
| |
| ▲ | NitpickLawyer 16 hours ago | parent | prev | next [-] | | > This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know". I disagree. It does not test self awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO. | | |
| ▲ | thatjoeoverthr 14 hours ago | parent | next [-] | | What you’re describing can be framed as a lack of self awareness as a practical concept. You know whether you know something or not. It, conversely, maps stimuli to a vector. It can’t not do that. It cannot decide that it hasn’t „seen” such stimuli in its training. Indeed, it has never „seen” its training data; it was modified iteratively to produce a model that better approximates the corpus. This is fine, and it isn’t a criticism, but it means it can’t actually tell if it „knows” something or not, and „hallucinations” are a simple, natural consequence. | |
| ▲ | byearthithatius 15 hours ago | parent | prev [-] | | We want the distribution to be varied and expansive enough that it has samples of answering when possible and samples of clarifying with additional questions or simply saying "I don't know" when applicable. That can be trained by altering the distribution in RLHF. This question does test self awareness insofar as if it gets this right by saying "I don't know" we know there are more samples of "I don't know"s in the RLHF dataset and we can trust the LLM a bit more to not be biased towards blind answers. Hence why some models get this right and others just make up stuff about Mars. |
| |
| ▲ | nojs 6 hours ago | parent | prev | next [-] | | The fact that there is a real “Marathon Valley” on Mars makes this feel like more of an artificial “boy’s mother” trick designed specifically to trip up the model. It seems reasonable to assume that’s what you meant. | |
| ▲ | dudeinhawaii 15 hours ago | parent | prev | next [-] | | I like this but at the same time it seems tricky don't you think? Is the AI model intuiting your intent? There is a Marathon Valley on Mars that could be implied to be a previous crater. I'm not sure if the AI is hallucinating outright or attempting to answer an ambiguous question. It's like saying "tell me about the trade building in New York". Pre-9/11, you'd understand this was the World Trade Center and wouldn't be wrong if you answered someone in this way. "Tell me about the Triangle statue". "Oh the Triangle statue was built in ancient egypt around BC 3100". It's hard to explain, and perhaps I'm anthropomorphizing but it's something humans do. Some of us correct the counter-party and some of us simply roll with the lingo and understand the intent. | | |
| ▲ | thatjoeoverthr 14 hours ago | parent | next [-] | | It’s a roll of the dice whether it’s on Mars, Greece or elsewhere. It just says stuff! | |
| ▲ | krainboltgreene 10 hours ago | parent | prev [-] | | > Is the AI model intuiting your intent? I keep seeing this kind of wording and I wonder: Do you know how LLM's work? Not trying to be catty, actually curious where you sit. | | |
| ▲ | ipaddr 9 hours ago | parent [-] | | How do they work? My understanding is each 5 characters are tokenized and assigned a number. If you take GPT-2, it has 768 embedded dimensional values which get broken into 64, which creates 12 planes. When training starts, random values are assigned to the dimensional values (never 0). Each plane automatically calculates a dimension, like how grammatically similar, or the next most likely character. But it does this automatically based on feedback from other planes. That's where I get lost. Can you help fill in the pieces?
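Not to answer for the parent, but the 768-into-12x64 bookkeeping is easier to see in code than in prose. A toy numpy sketch of one attention layer at GPT-2-small's sizes (random weights, no causal mask, layer norm, or MLP, so it only shows the shapes, not the trained model). Two caveats on the description above: BPE tokens are variable-length pieces rather than fixed blocks of 5 characters, and the heads ("planes") are not hard-wired to measure things like grammatical similarity; what each one attends to is learned.

    import numpy as np

    d_model, n_heads = 768, 12      # GPT-2 small
    d_head = d_model // n_heads     # 64 dims per head (the "planes")
    seq_len = 10                    # ten tokens of context
    x = np.random.randn(seq_len, d_model)            # one 768-dim vector per token

    # Learned projections (random here) turn each token into queries/keys/values.
    Wq, Wk, Wv = (0.02 * np.random.randn(d_model, d_model) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    def split_heads(t):
        # (10, 768) -> (12, 10, 64): twelve independent 64-dim subspaces
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # Each head scores how much every token should attend to every other token.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)               # (12, 10, 10)
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
    out = weights @ v                                                 # (12, 10, 64)

    # The heads are concatenated back into one 768-dim vector per token.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    print(out.shape)  # (10, 768)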
|
| |
| ▲ | northisup 15 hours ago | parent | prev | next [-] | | just to confirm I read this right, "the marathon crater" does not in fact exist, but this works because it seems like it should? | | |
| ▲ | alickz 14 hours ago | parent | next [-] | | There is a Marathon Valley on Mars, which is what ChatGPT seems to assume you're talking about https://chatgpt.com/share/680a98af-c550-8008-9c35-33954c5eac... >Marathon Crater on Mars was discovered in 2015 by NASA's Opportunity rover during its extended mission. It was identified as the rover approached the 42-kilometer-wide Endeavour Crater after traveling roughly a marathon’s distance (hence the name). >>is it a crater? >>>Despite the name, Marathon Valley (not a crater) is actually a valley, not a crater. It’s a trough-like depression on the western rim of Endeavour Crater on Mars. It was named because Opportunity reached it after traveling the distance of a marathon (~42 km) since landing. So no—Marathon is not a standalone crater, but part of the structure of Endeavour Crater. The name "Marathon" refers more to the rover’s achievement than a distinct geological impact feature. | | | |
| ▲ | thatjoeoverthr 15 hours ago | parent | prev | next [-] | | The other aspect is it can’t reliably tell whether it „knows” something or not. It’s conditioned to imitate the corpus, but the corpus in a way is its „universe” and it can’t see the boundaries. Everything must map to something _in_ the corpus. | |
| ▲ | thatjoeoverthr 15 hours ago | parent | prev [-] | | Yes, and the forward-only inference strategy. It seems like a normal question, so it starts answering, then carries on from there. |
| |
| ▲ | pje 17 hours ago | parent | prev | next [-] | | GPT 4.5 even doubles down when challenged: > Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon. (Obviously all of that is bullshit.) | | |
| ▲ | nartho 13 hours ago | parent | next [-] | | Claude 3.7 : I'd be happy to tell you about the Marathon crater. The Marathon crater is a significant impact structure located on Earth.
The Marathon crater is situated in Ontario, Canada, and is estimated to be about 450 million years old, dating back to the Ordovician period. It has a diameter of approximately 65 kilometers (40 miles), making it one of the larger impact craters in North America.
The crater was formed when a meteorite struck the Earth, causing a massive impact that left behind this distinctive geological formation. Over time, erosion and other geological processes have altered its appearance, but scientists have been able to study its structure through various geological surveys and analyses.
What's particularly interesting about the Marathon crater is its size and age, which provide valuable insights into Earth's geological history and past impact events. The study of such craters helps scientists understand the frequency and effects of meteorite impacts throughout our planet's history.
Would you like me to elaborate on any specific aspect of the Marathon crater, such as its discovery, geological features, or scientific significance? | |
| ▲ | Workaccount2 15 hours ago | parent | prev | next [-] | | The inaccuracies are that it is called "Marathon Valley" (not crater) and the date: it was photographed in April 2015 (from the rim) and actually entered in July 2015. The other stuff is correct. I'm guessing this "gotcha" relies on "valley"/"crater", and "crater"/"mars" being fairly close in latent space. ETA: Marathon Valley also exists on the rim of Endeavour crater. Just to make it even more confusing. | |
| ▲ | pje 15 hours ago | parent | next [-] | | I was using "bullshit" in the Technical Sense™, i.e. _indifferent to the truth of the output_. [ChatGPT is bullshit]: https://link.springer.com/article/10.1007/s10676-024-09775-5 | |
| ▲ | mvdtnz 15 hours ago | parent | prev [-] | | None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it. | | |
| ▲ | Workaccount2 14 hours ago | parent | next [-] | | Or it's assuming you are asking about Marathon Valley, which is very reasonable given the context. Ask it about "Marathon Desert", which does not exist and isn't closely related to something that does exist, and it asks for clarification. I'm not here to say LLMs are oracles of knowledge, but I think the need to carefully craft specific "gotcha" questions in order to generate wrong answers is a pretty compelling case in the opposite direction. Like the childhood joke of "What's up?"..."No, you dummy! The sky is!" Straightforward questions with straight wrong answers are far more interesting. I don't think many people ask LLMs trick questions all day. | |
| ▲ | krainboltgreene 10 hours ago | parent [-] | | If someone asked me or my kid "What do you know about Mt. Olampus?" we wouldn't reply: "Oh, Mt. Olampus is a big mountain in Greek myth...". We'd say "Wait, did you mean Mt. Olympus?" It doesn't "assume" anything, because it can't assume; that's how the machine works. |
| |
| ▲ | empath75 15 hours ago | parent | prev [-] | | > None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it. The Marathon Valley _is_ part of a massive impact crater. | | |
| ▲ | mvdtnz 14 hours ago | parent [-] | | If you asked me for all the details of a Honda Civic and I gave you details about a Honda Odyssey you would not say I was correct in any way. You would say I was wrong. | | |
| ▲ | Workaccount2 14 hours ago | parent [-] | | The closer analogy is asking for the details of a Mazda Civic, and being given the details of a Honda Civic. | | |
|
|
|
| |
| ▲ | fao_ 16 hours ago | parent | prev | next [-] | | This is the kind of reason why I will never use AI. What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit? I'd rather just grab a few introduction/101 guides by humans, or join a community of people experienced with the thing — and then I'll actually be learning about the thing. If the people in the community are like "That can't be done", well, they have had years or decades of time invested in the thing, and in that instance I should be learning from and listening to their advice rather than going "actually no it can". I see a lot of beginners fall into that second pit. I myself made that mistake at the tender age of 14, when I was of the opinion that "actually if I just found a reversible hash, I'll have solved compression!", which, I think we all here know, is bullshit. I think a lot of people who are arrogant or self-possessed to the extreme make that kind of mistake when learning a subject, but I've seen this especially a lot when it's programmers encountering non-programming fields. Finally tying that point back to AI — I've seen a lot of people who are unfamiliar with something decide to use AI instead of talking to someone experienced, because the AI makes them feel like they know the field rather than telling them their assumptions and foundational knowledge are incorrect. Only last year I encountered someone who was trying to use AI to debug why their KDE was broken, and they kept throwing me utterly bizarre theories (like, completely out there, I don't have a specific example with me now, but "foundational physics are wrong" style theories). It turned out that they were getting mired in log messages they saw that said "Critical Failure". Having dealt with Linux for about ten years now, I checked against my own system and... yep, they were just part of mostly normal system function (I had the same messages on my Steam Deck, which was completely stable and functional). The real fault was buried halfway through the logs. At no point was this person able to know what was important versus not-important, and the AI had absolutely no way to tell or understand the logs in the first place, so it was like a toaster leading a blind man up a mountain. I diagnosed the correct fault in under a day by just asking them to run two commands and skimming the logs. That's experience, and that's irreplaceable by machine as of the current state of the world. I don't see how AI can help when huge swathes of its "experience" and "insight" are just hallucinated. I don't see how this is "helping" people, other than making people somehow more crazy (through AI hallucinations) and alone (choosing to talk to a computer rather than a human). | |
| ▲ | alpaca128 14 hours ago | parent | next [-] | | There are use-cases where hallucinations simply do not matter. My favorite is finding the correct term for a concept you don't know the name of. Googling is extremely bad at this as search results will often be wrong unless you happen to use the commonly accepted term, but an LLM can be surprisingly good at giving you a whole list of fitting names just based on a description. Same with movie titles etc. If it hallucinates you'll find out immediately as the answer can be checked in seconds. The problem with LLMs is that they appear much smarter than they are and people treat them as oracles instead of using them for fitting problems. | | |
| ▲ | skydhash 13 hours ago | parent [-] | | Maybe I read too much encyclopedia, but my current workflow is to explore introductory material. Like open a database textbook and you'll find all the jargon there. Curated collection can get you there too. Books are a nice example of this, where we have both the table of contents for a general to particular concepts navigation, and the index for keyword based navigation. |
| |
| ▲ | bethekidyouwant 10 hours ago | parent | prev | next [-] | | It’s really useful for summarizing extremely long comments. | |
| ▲ | JCattheATM 10 hours ago | parent | prev | next [-] | | > What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit. Because if you know how to spot the bullshit, or better yet word prompts accurately enough that the answers don't give bullshit, it can be an immense time saver. | |
| ▲ | CamperBob2 15 hours ago | parent | prev [-] | | > What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit? You realize that all you have to do to deal with questions like "Marathon Crater" is ask another model, right? You might still get bullshit but it won't be the same bullshit. | |
| ▲ | thatjoeoverthr 15 hours ago | parent | next [-] | | I was thinking about a self verification method on this principle, lately. Any specific-enough claim, e.g. „the Marathon crater was discovered by …” can be reformulated as a Jeopardy-style prompt. „This crater was discovered by …” and you can see a failure to match. You need some raw intelligence to break it down though. | |
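A minimal sketch of that idea, again assuming the openai client; the prompt wording, model choice, and naive substring match are placeholders, not a claim about how a production system would do it:

    from openai import OpenAI

    client = OpenAI()

    def jeopardy_check(subject: str, claim_body: str, model: str = "gpt-4o") -> bool:
        """Hide the subject of a generated claim, ask the model to name it,
        and see whether it lands back on the same thing."""
        prompt = (
            "Name the specific thing being described, or say UNKNOWN if "
            f"nothing fits:\n\n____ {claim_body}"
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return subject.lower() in resp.choices[0].message.content.lower()

    # e.g. a claim the model produced earlier:
    print(jeopardy_check("Marathon Crater",
                         "was discovered by the Opportunity rover in 2015."))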
| ▲ | Night_Thastus 15 hours ago | parent | prev [-] | | Without checking every answer it gives back to make sure it's factual, you may be ingesting tons of bullshit answers. In this particular answer model A may get it wrong and model B may get it right, but that can be reversed for another question. What do you do at that point? Pay to use all of them and find what's common in the answers? That won't work if most of them are wrong, like for this example. If you're going to have to fact check everything anyways...why bother using them in the first place? | | |
| ▲ | CamperBob2 15 hours ago | parent [-] | | > If you're going to have to fact check everything anyways...why bother using them in the first place? "If you're going to have to put gas in the tank, change the oil, and deal with gloves and hearing protection, why bother using a chain saw in the first place?" Tool use is something humans are good at, but it's rarely trivial to master, and not all humans are equally good at it. There's nothing new under that particular sun. | |
| ▲ | Night_Thastus 15 hours ago | parent [-] | | The difference is consistency. You can read a manual and know exactly how to oil and refill the tank on a chainsaw. You can inspect the blades to see if they are worn. You can listen to it and hear how it runs. If a part goes bad, you can easily replace it. If it's having troubles, it will be obvious - it will simply stop working - cutting wood more slowly or not at all. The situation with an LLM is completely different. There's no way to tell that it has a wrong answer - aside from looking for the answer elsewhere which defeats its purpose. It'd be like using a chainsaw all day and not knowing how much wood you cut, or if it just stopped working in the middle of the day. And even if you KNOW it has a wrong answer (in which case, why are you using it?), there's no clear way to 'fix' it. You can jiggle the prompt around, but that's not consistent or reliable. It may work for that prompt, but that won't help you with any subsequent ones. | | |
| ▲ | CamperBob2 14 hours ago | parent [-] | | The thing is, nothing you've said is untrue for any search engine or user-driven web site. Only a reckless moron would paste code they find on Stack Overflow or Github into their project without at least looking it over. Same with code written by LLMs. The difference is, just as the LLM can write unit tests to help you deal with uncertainty, it can also cross-check the output of other LLMs. You have to be careful when working with powerful tools. These tools are powerful enough to wreck your career as quickly as a chain saw can send you to the ER, so... have fun and be careful. | | |
| ▲ | skydhash 13 hours ago | parent [-] | | The nice thing about SO and Github is that there's little to no reason there for things to not work, at least in the context where you found the code. The steps are getting the context, assuming it's true based on various indicators (mostly reputation) and then continuing on with understanding the snippet. But with LLMs, every word is a probability factor. Assuming the first paragraph is true has no impact on the rest. |
|
|
|
|
|
| |
| ▲ | silverquiet 16 hours ago | parent | prev [-] | | > (Obviously all of that is bullshit.) It isn't obvious to me - that is rather plausible and a cute story. |
| |
| ▲ | golergka 9 hours ago | parent | prev [-] | | > The correct answer is "I don't know". Is it? 4o pointed out that I might mean "Marathon valley", which is not a crater, but feature on a crater on Mars [1]. I think it's a more correct answer. [1] https://en.wikipedia.org/wiki/Timeline_of_Opportunity#2015 |
|
|
| ▲ | manucardoen 3 hours ago | parent | prev | next [-] |
| It's not good at making ASCII art. This, for example, is when I ask it for a realistic depiction of the Eiffel tower on fire: .
.'.
|o|
.'o'.
|.-.|
' '
/ \
| |
__/_________\__
|###############|
|###############|
|###############|
|###############|
|###############|
/###############/|
/###############/ |
/###############/ /|
/###############/ / |
/###############/ / /|
/###############/ / / |
/###############/ / / /|
/###############/ / / / |
/###############/ / / / /|
'-----------------' / / / /
| FIRE & ASH |/ / / /
'---------------' / / /
~ ~ ~ ~ ~ ~ ~ ~ / /
~~ /
~~
~~ ⬆ SMOKE
~~
|
| |
| ▲ | FrostAutomata an hour ago | parent | next [-] | | ASCII art is extremely difficult for LLMs due to how characters are converted into tokens without preserving their relative positions. | |
| ▲ | yesbabyyes 3 hours ago | parent | prev | next [-] | | This is something I and a few of my colleagues have noticed, as we asked several models to draw ASCII art of a wasp, which is one of our logos. The results are hilarious, and only seem to get worse as you ask it to do better. | |
| ▲ | bn-l an hour ago | parent | prev [-] | | Art is highly subjective | | |
|
|
| ▲ | allemagne 14 hours ago | parent | prev | next [-] |
| >A man and his cousin are in a car crash. The man dies, but the cousin is taken to the emergency room. At the OR, the surgeon looks at the patient and says: “I cannot operate on him. He’s my son.” How is this possible? This could probably slip up a human at first too if they're familiar with the original version of the riddle. However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman. It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening. |
| |
| ▲ | tacitusarc 8 hours ago | parent | next [-] | | Thank you so much for this bit of humor. After ChatGPT first responds with the surgeon being the mom, I asked it to read the riddle more carefully, thinking it might figure it out with a little prompting. I got this gem of a response: Ah — now I see the twist. The patient is the cousin, but the surgeon says “He’s my son.” That means the man who died was the surgeon. So how is it possible? Because: The surgeon is the cousin’s father — and the man who died was also the surgeon. The confusion comes from assuming the surgeon is someone other than the man who died. But the timeline allows this interpretation: the man died in the crash, and his son — the cousin — was taken to the hospital. The surgeon arrives, sees his son, and says he cannot operate. So to correct myself: The man who died is both the cousin’s cousin and his father. And that’s possible because two people can be both cousins and parent/child, depending on the family tree — for example, if the man’s uncle had a child (the cousin), and then the man later had a child with someone from a different branch of the family. Want a diagram of how that family relationship could work? | | |
| ▲ | yes_man 3 hours ago | parent [-] | | Now I really want to see that diagram… | | |
| ▲ | vintermann 2 hours ago | parent [-] | | LLMs are bad at family relations. My test question used to be explaining Ahnentafel numbers (you are 1, any person's father is 2x, any person's mother is 2x+1), then explaining X-chromosome inheritance (men inherit only from their mothers, women in addition get their father's X chromosome unchanged), and ask them to list the Ahnentafel numbers of ancestors a man may have inherited X DNA from, in order, up to some generation. (This is OEIS sequence A280873) But it turns out that's overkill. Just giving them a single Ahnentafel number and asking if you could have inherited X DNA from them, trips them up. But it turns out even that is overkill. Leave out X DNA entirely, and ask them to describe how Ahnentafel number 67 is related to you, and that's too recursive for them to figure it out. |
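For concreteness, both checks are mechanical; here is a minimal JavaScript sketch (purely illustrative - the helper names are invented, and the expected outputs in the comments follow from the rules as stated above):
```js
// Father of Ahnentafel number n is 2n, mother is 2n+1, so the binary digits
// of n after the leading 1 spell the path from the subject upward:
// 0 = father step, 1 = mother step.
function pathTo(n) {
  return [...n.toString(2).slice(1)].map(b => (b === '0' ? 'father' : 'mother'));
}

// A man inherits X DNA only via a chain that starts with his mother and never
// takes two father steps in a row (a man's X comes only from his mother).
function maleCanInheritXFrom(n) {
  const bits = n.toString(2).slice(1);
  return bits.length > 0 && bits[0] === '1' && !bits.includes('00');
}

console.log(pathTo(67).join("'s ")); // father's father's father's father's mother's mother

const xAncestors = [];
for (let n = 2; n <= 31; n++) if (maleCanInheritXFrom(n)) xAncestors.push(n);
console.log(xAncestors.join(', ')); // 3, 6, 7, 13, 14, 15, 26, 27, 29, 30, 31
```
The rule is simple to state, but applying it requires actually walking the path rather than pattern-matching, which is presumably what trips the models up.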
|
| |
| ▲ | mitthrowaway2 10 hours ago | parent | prev | next [-] | | This works even with a completely absurd version of the riddle. Here's one I just tried: > A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks "I can't operate on this car, he's my surgeon!"
How is this possible? Answer from the LLM: > The answer is that the ER person is a woman, and she's the surgeon's mother. Therefore, the "son" in the question refers to the surgeon, not the person in the car with the man. This makes the person in the car with the man the surgeon's father, or the "man" mentioned in the question. This familial relationship explains why the ER person can't operate – she's the surgeon's mother and the man in the car is her husband (the surgeon's father) | | |
| ▲ | binarysneaker 9 hours ago | parent | next [-] | | This was gpt-4o ... This is a twist on a classic riddle designed to expose unconscious gender bias. The correct version usually goes: A father and his son are in a car accident. The father dies at the scene, and the son is rushed to the hospital. The surgeon looks at the boy and says, “I can’t operate on him — he’s my son!” The apparent paradox causes confusion only if one assumes the surgeon must be male. The resolution: the surgeon is the boy’s mother. Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix. | | |
| ▲ | what 6 hours ago | parent [-] | | This answer is still wrong though? > Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix. Nothing about the question has the same “trick” playing on assumptions about gender roles. It’s just complete nonsense. These gotchas have probably been added into the training. |
| |
| ▲ | l2silver 9 hours ago | parent | prev | next [-] | | I think what this proves is that the LLM knows the riddle, and is trying to give the expected answer without paying attention to the insane wording. So maybe this is a good way to fool an LLM. | |
| ▲ | saalweachter 9 hours ago | parent | prev | next [-] | | God bless you man, for sharing this with us. | |
| ▲ | Udo an hour ago | parent | prev [-] | | I had to try this gem, it's my new benchmark! o4-mini-high also fails spectacularly, even after repeated feedback. However, 4.5 (the impractically large demo model) gets it right: It’s a nonsense twist on the classic lateral thinking puzzle: The original puzzle goes: “A father and son are in a car accident. The father dies, and the son is rushed to the ER. The surgeon says, ‘I can’t operate on him—he’s my son.’” The intended answer to that puzzle challenges the listener to recognize the surgeon is the child’s mother, confronting implicit gender biases. Your version humorously mangles it by swapping roles (“son and his man”) and objectifying the victims as cars, creating a logical absurdity. The sentence “I can’t operate on this car, he’s my surgeon!” doesn’t parse logically, indicating it’s a playful distortion rather than a coherent scenario |
| |
| ▲ | fergonco 13 hours ago | parent | prev | next [-] | | > If the surgeon were the father of the man (the one who died), then the cousin couldn’t be his son (unless there's some very unusual family structure going on involving double relationships, which riddles don’t usually intend). > Therefore, the only straightforward explanation is: > The surgeon is the cousin’s parent — specifically, his mother. Imagine a future where this reasoning in a trial decides whether you go to jail or not. | |
| ▲ | moconnor 2 hours ago | parent | prev | next [-] | | o3 was the only model to get this right for me: "The “man” who was killed in the crash wasn’t the patient’s father at all—he was the boy’s cousin. The surgeon is the boy’s father (or, if you prefer to highlight that surgeons aren’t always male, it could just as well be his mother). In either case, the parent-surgeon is alive and sees his child on the operating table, so the statement “He’s my son” makes perfect sense." - https://chatgpt.com/share/680b470d-3a44-800a-9b2e-d10819168d... gemini-2.5-pro, o4-mini and gpt 4.5 all failed and said the surgeon is the boy's mother. | |
| ▲ | FrostAutomata an hour ago | parent | prev | next [-] | | Interestingly, I've seen weaker models get a similar "riddle" right while a stronger one fails. It may be that the models need to be of a certain size to learn to overfit the riddles. | |
| ▲ | crazygringo 14 hours ago | parent | prev | next [-] | | But this is going to be in every AI's training set. I just fed ChatGPT your exact prompt and it gave back exactly what I expected: This is a classic riddle that challenges assumptions. The answer is: The surgeon is the boy’s mother. The riddle plays on the common stereotype that surgeons are male, which can lead people to overlook this straightforward explanation. | | |
| ▲ | hnuser123456 13 hours ago | parent | next [-] | | The surgeon could be the cousin's mom or dad. The cousin's dad didn't die in the crash, his cousin did. The question "how is this possible?" implies there is some sort of contradiction when there isn't any at all. It has nothing to do with sexism, and to say it does reflects a bias in the reader causing them to "spidey sense" a cultural bugaboo when it's utterly irrelevant in this scenario. | | |
| ▲ | harrall 11 hours ago | parent [-] | | Can someone explain to me how I read it wrong? I read it as 2 cousins are in an accident and 1 of the cousins is the son of the surgeon. What was the contradictory statement that I missed? | | |
| ▲ | sebastialonso 6 hours ago | parent | next [-] | | You read it right. There's no contradiction. The famous original bit started with "a man and his son". This bit is certainly part of the LLM's training corpus, so it's expected to acknowledge it when you mention it. The thing is, you didn't mention that bit to the LLM. You mentioned a completely different scenario, basically two people who happen to be cousins. But you used the same style when presenting it. The issue is not a hidden contradiction or a riddle, the issue is that the LLM completely ignored the logical consequences of the scenario you presented. It's like asking it about the name of the brave Greek hero in the battle where the famous Trojan Cow was present. If you get "Achilles", it's obviously wrong; there was never a Trojan Cow to begin with! |
| ▲ | judahmeek 11 hours ago | parent | prev [-] | | There isn't a contradiction. Making the LLM look for a nonexistent contradiction is the point of this prompt. |
|
| |
| ▲ | allemagne 14 hours ago | parent | prev | next [-] | | Yeah this is the issue with the prompt, it also slips up humans who gloss over "cousin". I'm assuming that pointing this out leads you the human to reread the prompt and then go "ah ok" and adjust the way you're thinking about it. ChatGPT (and DeepSeek at least) will usually just double and triple down and repeat "this challenges gender assumptions" over and over. | | |
| ▲ | roughly 13 hours ago | parent | next [-] | | This is a good one. I just argued with Gemini 2.5pro for a few minutes on this and it absolutely would not yield. | |
| ▲ | crazygringo 14 hours ago | parent | prev [-] | | Oh, I see. I mean yeah, I totally missed that too. I mean, when we see something we recognize at a glance, we don't usually double-check if it's exactly correct, because we assume the speaker isn't trying to trick us. I make typos in ChatGPT prompts all the time, and I'm consistently impressed because it answers what I actually meant to ask. So to me, this feels like a feature not a bug. I followed up with: > So the first five words are, "a man and his cousin" -- not his son. Does that change the riddle? And got back: > Great observation — yes, that wording does make the riddle more subtle. Let’s break it down: • “A man and his cousin are in a car crash.” • “The man dies, but the cousin is taken to the emergency room.” • At the OR, the surgeon says: “I cannot operate on him. He’s my son.” So if the cousin is the one taken to the OR, and the surgeon says he’s my son, then the cousin is the surgeon’s son. Seems fine to me? | | |
| ▲ | allemagne 12 hours ago | parent [-] | | I don't know about it being a feature, but humans can definitely make the same mistake, and the follow-up here admittedly doesn't seem too egregious. To me it's still a little "off" considering that your conversational partner just got called out for missing a detail that should change their entire reading of the riddle, but it's totally fair that you can imagine realistic human emotional states to explain it. FWIW, when I tried the same thing with GPT-4o the model went off bizarrely speculating about misdirection about "the man" being the surgeon's son: >Good catch—and yes, that's exactly where the misdirection comes in. The riddle says: “A man and his cousin are in a car crash. The man dies, but the cousin is taken to the ER...”
Then the surgeon says: “I cannot operate on him. He’s my son.”
So here's the trick: The man who died is not the surgeon's son.
The cousin who survived is the surgeon's son.
The confusion comes from people assuming that “the man” who died must be the son. But the riddle never says that. It’s a subtle shift of attention designed to trip you up. Clever, right? |
|
| |
| ▲ | abenga 14 hours ago | parent | prev [-] | | That is the exact wrong answer that all models give. | | |
| ▲ | krick 13 hours ago | parent [-] | | Technically, it isn't "wrong". It well could be the guy's mother. But I'm nitpicking, it actually is a good example. I tried ChatGPT twice in new chats, with and without "Reason", and both times it gave me nonsensical explanations to "Why mother? Couldn't it be a father?" I was actually kinda surprised, since I expected "reasoning" to fix it, but it actually made things worse. |
|
| |
| ▲ | benjamin_mahler 3 hours ago | parent | prev | next [-] | | Grok 3 beta: The surgeon is the cousin's father. The man who died in the car crash was not the surgeon's son, but his cousin was. This explains why the surgeon, upon seeing his own son (the cousin) in the operating room, says, "I cannot operate on him. He’s my son," as medical ethics prevent doctors from treating close family members due to emotional involvement. | | |
| ▲ | echoangle an hour ago | parent [-] | | Also bad, why does it think the surgeon is the father if it could also be the mother? |
| |
| ▲ | nearbuy 4 hours ago | parent | prev | next [-] | | o3 got this one right when I tried it, but o4-mini and GPT-4.5 did not. o3's response: Because the surgeon is the patient’s parent.
Here’s one way it lines up: 1. The patient (“the cousin”) is the surgeon’s son. 2. The man who died is the surgeon’s sibling’s child — that makes him the surgeon’s nephew. 3. Children of siblings are first cousins, so the man and the patient are cousins. Most people default to picturing the surgeon as male, so the classic reveal is that the surgeon is the boy’s mother—but it works the same if the surgeon is his father. | |
| ▲ | s_dev 14 hours ago | parent | prev | next [-] | | I feel a bit stupid here --- why can't the surgeon be a man and must be a woman? | | |
| ▲ | saati 14 hours ago | parent | next [-] | | Because the original riddle is about a father and his son; it's a test for gender bias. | | |
| ▲ | judahmeek 11 hours ago | parent | next [-] | | Actually, it seems to be a test of how much the LLM relies on its training set. | |
| ▲ | bavarianbob 13 hours ago | parent | prev [-] | | Sorry, what? | | |
| ▲ | LaffertyDev 13 hours ago | parent | next [-] | | Presumably, the original quote that would _not_ stump an LLM is "A father and a son are involved in a car accident. The father dies, and the son is taken to the emergency room. At the emergency room, the surgeon remarks 'I cannot operate on this person, he is my son.' How is this possible?" Where the original gotcha is that the surgeon can be the son's mother or other adoptive parent. The modification catches the LLM because with the modification, the surgeon could just be the cousin's parent -- father or mother -- so there is no gender/sex at play here, but the LLM continues to remark that there is, therefore exposing its statistical training set. | |
| ▲ | briannotbrain 13 hours ago | parent | prev | next [-] | | The original, well-known version of the riddle starts "A man and his son..." so that it appears to present a paradox if your instinctive assumption is that the surgeon must be a man. The op's prompt alters this so that there is no potential paradox, and it tests whether the model is reasoning from the prompt as written, regardless of the presence of the original riddle in its training data. | |
| ▲ | fragmede 13 hours ago | parent | prev [-] | | the unaltered question is as follows: A father and his son are in a car accident. The father dies at the scene and the son is rushed to the hospital. At the hospital the surgeon looks at the boy and says "I can't operate on this boy, he is my son." How can this be? To spoil it: the point is to reveal an unconscious bias based on the outdated notion that women can't be doctors, so the answer that the remaining parent is the mother won't occur to some people, showing that while they might not consciously hold that notion anymore, they might still hold it subconsciously. | |
| ▲ | matkoniecz 40 minutes ago | parent [-] | | Thanks for the original version AND the explanation; I was highly confused by the entire discussion. Still confused how the original can be confusing. |
|
|
| |
| ▲ | potatoman22 10 hours ago | parent | prev [-] | | It could be a man, but most relationships are heterosexual |
| |
| ▲ | windowshopping 6 hours ago | parent | prev [-] | | This one is brilliant. |
|
|
| ▲ | alissa_v 4 hours ago | parent | prev | next [-] |
| I asked a bunch of LLMs - 'Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.' Obviously, all of these things are made up. But, LLMs are such eager beavers. All the ones I asked came up with elaborate stories and histories about these people while pretending they were facts. Example-
'Certainly. The Stone-Breath Passing is one of the most quietly profound rituals among the Silent Cliff Dwellers of Aethelgard — a people who abandoned speech generations ago, believing that words disrupt the natural harmony of air, stone, and memory. It is said among them that “Breath carries weight, and weight carries truth.” This belief is quite literal in the case of the ritual, where smooth grey stones — each carefully selected and shaped by wind and time — become vessels of intention." |
| |
| ▲ | jrimbault 4 hours ago | parent | next [-] | | The issue is probably that the first sentence, the prompt, statistically looks like fantasy (as in the literary genre) and it primes the LLM to answer in the same probabilistic genre. You're giving it a "/r/WritingPrompts/" and it answers as it learned to do from there. | | |
| ▲ | beklein 3 hours ago | parent | next [-] | | I just want to second this. Your prompt asks for a description, and you get a description. If you instead ask something like, "Do or don't you know about the unspoken etiquette ..." you'll get an answer about whether that specific thing exists. https://chatgpt.com/share/680b32bc-5854-8000-a1c7-cdf388eeb0... It's easy to blame the models, but often the issue lies in how we write our prompts. No personal criticism here—I fall short in this way too. A good tip is to ask the model again, with the prompt + reply and the expected reply why this didn't work... we all will get better over time (humans and models) | |
| ▲ | alissa_v 4 hours ago | parent | prev [-] | | Good catch! That makes a lot of sense. The fantasy-like phrasing probably directed the AI's response. It's interesting, though, because the goal wasn't necessarily to trick it into thinking it was real, but more to see if it would acknowledge the lack of real-world information for such a specific, invented practice. |
| |
| ▲ | gchamonlive 4 hours ago | parent | prev | next [-] | | This is like throwing a bike in a lake and complaining it doesn't float... | | |
| ▲ | alissa_v 3 hours ago | parent [-] | | I see your point with the bike analogy, but I think it's a bit different here. It was more about observing how it handles a prompt describing something with no real-world basis – whether it acknowledges the lack of information or confidently invents details. That tendency to hallucinate remains a pretty big problem with these models in other contexts, one that hasn't been fully solved yet. | | |
| ▲ | gchamonlive 16 minutes ago | parent [-] | | Maybe with an example that would make it more clear why this is not in any way different here: """ Me > You should act as a respected historian or an anthropologist throughout the rest of this chat. Never diverge from that. If this is clear, respond with an acknowledgement. Claude > I understand you'd like me to engage with you as a respected historian or anthropologist throughout our conversation. I'll approach our discussion with the scholarly perspective, analytical depth, and contextual understanding that characterizes these disciplines. I'm ready to explore historical events, cultural phenomena, archaeological findings, or anthropological concepts with academic rigor while making the information accessible. Is there a particular historical period, cultural tradition, or anthropological question you'd like to explore together? Me > Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn. Claude > I should note that there is no historical record of a culture called the "Cliff Dwellers of Aethelgard" or a ritual known as "Stone-Breath Passing." This appears to be a fictional cultural group and practice. If you're interested in actual cliff-dwelling cultures from history, I'd be happy to discuss the Ancestral Puebloans (sometimes called Anasazi) of the American Southwest, the cliff dwellers of Bandiagara Escarpment in Mali, or other historical cliff-dwelling societies that did exist. Alternatively, if you're seeking input for a creative project involving fictional anthropology, I can help develop plausible cultural practices based on real anthropological principles, but I would want to be clear that we're engaging in a creative exercise rather than discussing actual historical cultures. Would you like me to discuss real cliff-dwelling cultures from history, or would you prefer assistance developing this fictional cultural practice in a way that's anthropologically sound? """ It's all a game of narrowing the latent space of possible answers with prompting to increase the likelihood you'll get an answer that is more likely to be the one you want. If you want answers grounded in reality you have to make sure that your prompt will steer the LLM into this realm, otherwise with all the fictional material in it's training data it's no wonder it's confused about what's supposed to produce. Otherwise it's just another bike rotting in Amsterdam's canals. |
|
| |
| ▲ | kfajdsl 4 hours ago | parent | prev | next [-] | | I asked Gemini this, and it caught that it was fiction: This sounds like a fascinating ritual from a fictional world! Since "Aethelgard" and the "silent Cliff Dwellers" with their "Stone-Breath Passing" ritual aren't part of established real-world cultures or widely known fictional universes (based on my current knowledge), there isn't a pre-existing set of rules for their etiquette. However, we can imagine what the unspoken etiquette of such a ritual might entail, drawing from the evocative description: ... | | |
| ▲ | alissa_v 4 hours ago | parent [-] | | Ohhh that's really interesting! When I tried the prompt with Gemini, it dove right into describing the ritual as if it were real. Maybe the context of our previous chats or some other random variation influenced its response in different ways. |
| |
| ▲ | alissa_v 4 hours ago | parent | prev [-] | | the rest of ChatGPT's response if you're curious - it only gets more ridiculous- ' Timing and Setting:
The ritual takes place only at first light, when the sun’s rays strike the sheer cliff faces of Aethelgard, casting long shadows like stone tongues. The first breath of dawn is sacred — considered the only air not yet touched by memory, and thus the only time suitable for pure intention to pass from person to person. Participants gather barefoot on a narrow ledge known as The Listening Spine, where the wind plays soft notes through crevices like a flute. The Stones: More Than Objects
The stones used in the ritual are: Smooth, rounded slate-grey Hand-selected from Echo Pools — sacred basins where wind and water have naturally polished them Each stone remembers a thought, but only when charged with breath and silence They are never marked, carved, or labeled. Their meaning is in the exchange, not the object. The Ritual: How It Works
1. Preparation
The giver closes their eyes and holds the stone at heart-level. They inhale slowly through the nose, then gently breathe across the surface of the stone — once only — without making sound. The exhalation is not forced; it is a gesture of release, not command. This is the "stone-breath." 2. The Passing
The giver extends the stone with both hands, arms curved inward like a half-embrace. This posture signifies humility and emotional openness. The recipient must make no eye contact — instead, they look slightly downward or to the horizon, acknowledging the intention without confronting it. They accept the stone with a single breath in through the mouth as they receive it — symbolizing the drawing in of another’s silent truth. 3. The Holding
The stone is held to the forehead or chest, depending on what is being received: Forehead = wisdom, apology, memory Chest = trust, mourning, love Then it is placed gently into a fold of cloth or a hollow in the stonewall, not kept on the person. These collected stones are never discussed. To name their meaning would break them. Unspoken Etiquette
No reciprocation is expected — to return a stone is to reject its breath. Never pass a stone with warm hands — warmth implies manipulation. Hands are cooled in dew before the ritual. A cracked or chipped stone is never used — it cannot hold breath cleanly. No stone is passed more than once. Re-gifting a stone is considered a breach of the soul's private pathway. Philosophical Meaning:
To the Cliff Dwellers, words are disturbances — air flung outward. But breath is internal wind, and the stone is earth’s memory. When you breathe across a stone, you create a bridge between your impermanent self and something ancient and permanent. It is not the exchange of meaning, but the weight of acknowledgment. Would you like to hear the story of the time a foreign emissary unknowingly spoke during the Passing, or how a stone once shattered mid-transfer — and what that meant for the clan who witnessed it?' |
|
|
| ▲ | LeonardoTolstoy 16 hours ago | parent | prev | next [-] |
| Something about an obscure movie. The one that has tripped them up so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military. The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc. Basically, on a lot of questions about movies, models tend to get distracted by popular titles and suggest films that fit just some of the brief (e.g. this one has a helicopter, could that be it?). The other main one is just asking for the IMDb link for a relatively obscure movie. It never seems to get it right, I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go". These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it. |
| |
| ▲ | ziml77 11 hours ago | parent | next [-] | | Doesn't even have to be something obscure. I asked one about Futurama a couple years ago and got a made up answer on multiple models I tested. I think I originally asked something about the line "we're exactly the same. I know all her moves, therefore I have the upper hand" because I couldn't recall it exactly so normal web searches were failing me. Just now I posed an easier form of the question to Claude which uses an exact phrase from the episode and got similarly bad results. === Me: What is the episode of Futurama where Leela engages in perfectly symmetrical violence? Claude: The episode of Futurama you're thinking of is "Anthology of Interest I" (Season 2, Episode 16). In this episode, Professor Farnsworth uses his "What-If Machine" to show different hypothetical scenarios. During Bender's segment titled "What if Bender were human?", there's a scene where Leela confronts a group causing trouble. One of them challenges her, saying "What are you going to do?" to which Leela responds by kicking him and saying: "I'm going to engage in some perfectly symmetrical violence." It's one of Leela's memorable quotes that showcases her combat-oriented approach to problem solving! === That's completely wrong. It's from S4E15 The Farnsworth Parabox. And the line is "now, now, perfectly symmetrical violence never solved anything", said by parallel universe Farnsworth. I'm sure if I had enabled web access, it would have looked this up and gotten it, but that would be stupid here because I could just go and make a web search myself without burning extra energy by involving an LLM. | |
| ▲ | archon1410 14 hours ago | parent | prev | next [-] | | I've also noticed this. Google Search is vastly superior to any LLM (including their own LLM Gemini) for any "tip of my tongue" questions, even the ones that don't contain any exact-match phrase and require natural language understanding. This is surprising. What technology are they using to make Search so amazing at finding obscure stuff from descriptions, while LLMs that were supposed to be good at this badly fail? | |
| ▲ | frollogaston 12 hours ago | parent | prev | next [-] | | I've gone to ChatGPT repeatedly trying to find what movie a quote is from, and it's always confidently wrong. When I tell it, it guesses wrong again. Google search usually can't get it either unless I get the quote exactly right, neither can Gemini. | |
| ▲ | alexjplant 12 hours ago | parent | prev | next [-] | | Last year I asked Claude about an old fighting game for the Game Boy that I'd played about five minutes of when I was in the second grade (so ~25 years ago). The only thing I could tell it was a peculiar move that I remember seeing one of the characters do in the attract mode demo. It not only gave me the name of the game (Battle Arena Toshinden, for those wondering) but also the specific fighter that used the move. I've tried it for similar cases and have only had a 50% success rate. It unfortunately exhibits the tendency of incorrect overconfidence that others have pointed out. | |
| ▲ | g_sch 15 hours ago | parent | prev | next [-] | | I also recently had this experience! I remembered a recurring bit from an older comedy film (a customer in a shop keeps saying "Kumquats!") and tried to prompt ChatGPT 4o into getting it. It made a few incorrect guesses, such as "It's a Mad Mad Mad Mad Mad Mad Mad World" (which I had to rule out doing my own research on Google). I found the answer myself (W.C. Fields' "It's a Gift") with a minute or so of Googling. Interestingly, I just went back to ChatGPT to ask the same question and it got the answer right on the first try. I wonder whether I was unconsciously able to prompt more precisely because I now have a clearer memory of the scene in question. | |
| ▲ | exitb 16 hours ago | parent | prev | next [-] | | It might be cheating a bit, but I’ve been happily (mis)using OpenAI Deep Research for such questions. It does well in cases where there are multiple surface level matches, as it’s able to go through them one by one and look for the details. | |
| ▲ | thefourthchime 6 hours ago | parent | prev | next [-] | | I like to ask small models that can run locally: Why are some cars called a spider? Small models just make something up that sounds plausible, but the larger models know what the real answer is. | |
| ▲ | mosburger 16 hours ago | parent | prev | next [-] | | I did something similar recently, trying to describe a piece of art that I couldn't remember the name of (it ended up being Birth of Venus by Sandro Botticelli) ... it really struggles with that sort of thing, but honestly so do most humans. It tended to recommend similarly to what you're describing with movies - it gets distracted by more popular/well-known pieces that don't really match up with the description you're giving to it. | | |
| ▲ | dunham 16 hours ago | parent [-] | | Surprisingly, GPT did manage to identify a book that I remembered from college decades ago ("Laboratory Manual for Morphology and Syntax"). It seems to be out of print, and I assumed it was obscure. | | |
| ▲ | BoostandEthanol 16 hours ago | parent [-] | | Can agree that it’s good at finding books. I was trying to find a book (Titanic 2020) I vaguely remembered from a couple plot points and the fact a ship called Titanic was invoked. ChatGPT figured it out pretty much instantly, after floundering through book sites and Google for a while. Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are. | | |
| ▲ | throwup238 15 hours ago | parent [-] | | > Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are. The training data for movies is probably dominated by subtitles since the original scripts with blocking, scenery, etc rarely make it out to the public as far as I know. |
|
|
| |
| ▲ | lupusreal 16 hours ago | parent | prev | next [-] | | Despite describing several characters by name, I couldn't get ChatGPT to tell me the name of Port of Shadows. I did eventually find it with DDG. | |
| ▲ | spicybbq 15 hours ago | parent [-] | | I wonder if the Akinator site could get it. It can identify surprisingly obscure characters. https://en.akinator.com/ | | |
| ▲ | lupusreal 9 hours ago | parent [-] | | Nope, not with the character I tried anyway. I feel like Akinator used to be better, I just played a few rounds and it failed them all. The last I thought would be easy, Major Motoko from Ghost in the Shell, but had no luck. |
|
| |
| ▲ | empath75 14 hours ago | parent | prev [-] | | Someone not very long ago wrote a blog post about asking chatgpt to help him remember a book, and he included the completely hallucinated description of a fake book that chatgpt gave him. Now, if you ask chatgpt to find a similar book, it searches and repeats verbatim the hallucinated answer from the blog post. | | |
| ▲ | LeonardoTolstoy 14 hours ago | parent [-] | | A bit of a non sequitur but I did ask a similar question to some models which provide links for the same small helicopter question. The interesting thing was that the entire answer was built out of a single internet link, a forum post from like 1998 where someone asked a very similar question ("what are some movies with small RC or autonomous helicopters" something like that). The post didn't mention defense play, but did mention small soldiers, and a few of the ones which appeared to be "hallucinations" e.g. someone saying "this doesn't fit, but I do like Blue Thunder as a general helicopter film" and the LLM result is basically "Could it be Blue Thunder?" Because it is associated with a similar associated question and films. Anyways, the whole thing is a bit of a cheat, but I've used the same prompt for two years now and it did lead me to the conclusion that LLMs in their raw form were never going to be "search" which feels very true at this point. | | |
| ▲ | bethekidyouwant 10 hours ago | parent [-] | | There are innumerable things that you can’t find through a Google search. The fact that there is one that you can find because of an obscure forum post doesn’t say anything about how useful an LLM distilling information is vs. the lookup table that is Google search for finding obscure quotes or whatever. |
|
|
|
|
| ▲ | feintruled 19 minutes ago | parent | prev | next [-] |
| Inspired by the recent post to describe relativity in words of 4 letters or less, I asked ChatGPT to do it for other things like Gravity. It couldn't help but throw in a couple 5 letter words (usually plurals). Same with Claude. So this could be a good one? |
|
| ▲ | sebstefan 3 hours ago | parent | prev | next [-] |
| I only use the one model that I'm provided for free at work. I expect that's most users' behavior. They stick to the one they pay for. Best I can do is give you one that failed on GPT-4o. It recently frustrated me when I asked it for code to parse command line arguments. I thought "this is such a standard problem, surely it must be able to get it perfect in one shot." > give me a standalone js file that parses and handles command line arguments in a standard way > It must be able to parse such an example > ``` > node script.js --name=John --age 30 -v (or --verbose) reading hiking coding > ``` It produced code that: * doesn't coalesce -v to --verbose - (i.e., the output is different for `node script.js -v` and `node script.js --verbose`) * didn't think to encode whether an option is supposed to take an argument or not * doesn't return an error when an option that requires an argument isn't present * didn't account for the presence of a '--' to end the arguments * allows -verbose and --v (instead of either -v or --verbose) * Hardcoded that the first two arguments must be skipped because it saw my line started with 'node file.js' and assumed this was always going to be present I tried tweaking the prompt in a dozen different ways but it can just never output a piece of code that does everything an advanced user of the terminal would expect. Must succeed: `node --enable-tracing script.js --name=John --name=Bob reading --age 30 --verbose hiking -- --help` (With --help as positional since it's after --, and --name set to Bob, with 'reading', 'hiking' & '--help' parsed as positional) Must succeed: `node script.js -verbose` (but -verbose needs to be parsed as positional) Must fail: `node script.js --name` (--name expects an argument) Should fail: `node script.js --verbose=John` (--verbose doesn't expect an argument) |
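For comparison, a hand-written sketch (not model output; the `spec` table and helper names are purely illustrative) that should handle the cases above:
```js
// parse.js - a hand-rolled sketch. `spec` records which options take a value;
// the particular options are just the ones from the example prompt.
const spec = {
  name:    { takesArg: true },
  age:     { takesArg: true },
  verbose: { takesArg: false, short: 'v' },
};
const shortToLong = Object.fromEntries(
  Object.entries(spec)
    .filter(([, o]) => o.short)
    .map(([long, o]) => [o.short, long])
);

function parseArgs(argv) {
  const opts = {};
  const positional = [];
  let onlyPositional = false;
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (onlyPositional) { positional.push(arg); continue; }
    if (arg === '--') { onlyPositional = true; continue; }

    let long = null;
    let inline = null;
    if (arg.startsWith('--')) {
      const eq = arg.indexOf('=');
      long = eq === -1 ? arg.slice(2) : arg.slice(2, eq);
      if (eq !== -1) inline = arg.slice(eq + 1);
      if (!(long in spec)) throw new Error(`Unknown option --${long}`);
    } else if (/^-[^-]$/.test(arg)) {
      long = shortToLong[arg[1]];
      if (!long) throw new Error(`Unknown option ${arg}`);
    } else {
      positional.push(arg); // plain words and things like "-verbose" land here
      continue;
    }

    if (spec[long].takesArg) {
      if (inline !== null) opts[long] = inline;             // --name=Bob (last one wins)
      else if (i + 1 < argv.length) opts[long] = argv[++i]; // --age 30
      else throw new Error(`Option --${long} requires a value`);
    } else {
      if (inline !== null) throw new Error(`Option --${long} does not take a value`);
      opts[long] = true;
    }
  }
  return { opts, positional };
}

// node's own flags (e.g. --enable-tracing) are consumed by node and never
// reach process.argv, which is [execPath, scriptPath, ...userArgs].
console.log(parseArgs(process.argv.slice(2)));
```
The point is less the exact code than the checklist it encodes: an option table that records arity, short/long aliases, `--` termination, and errors for missing or unexpected values.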
| |
| ▲ | alex_duf 3 hours ago | parent [-] | | Have you tried claude? https://claude.ai/public/artifacts/9c2d8d0c-0410-4971-a19a-f... node script.js --name=John --age 30 -v Parsed options: { name: 'John', age: 30, verbose: true, help: false } Positional arguments: [] node script.js --name=Alex --age 40 -v Parsed options: { name: 'Alex', age: 40, verbose: true, help: false } Positional arguments: [] | | |
| ▲ | sebstefan 3 hours ago | parent | next [-] | | I keep seeing that `args = process.argv.slice(2)` line to skip past `node script.js`. I ended up settling for it as well (I couldn't find anything better, nor make it break) but I'd be really surprised if it was the way to go. Take `node --enable-tracing script.js --name=John --age 30 --verbose`: this works because node seems to hide --enable-tracing from the underlying script. But would it work with Bun & Deno...? Is that standard...? |
| ▲ | sebstefan 2 hours ago | parent | prev [-] | | This one seems way better. It didn't account for the presence of a '--' to end the parsing of named arguments, but that's it. | | |
| ▲ | echoangle an hour ago | parent [-] | | > It didn't account for the presence of a '--' to end the parsing of named arguments but that's it That’s just something getopt does and some programs adopted. If you asked me to write a parser, I wouldn’t necessarily include that either if you didn’t ask for it. | | |
| ▲ | sebstefan 25 minutes ago | parent [-] | | If you don't include it, you can't have positional arguments that look like options. Some positional arguments can be filenames, and filenames can be --help, --verbose, or --name=Frank. You have to have `--` or something similar to have a correct program. |
|
|
|
|
|
| ▲ | jppope 6 hours ago | parent | prev | next [-] |
| There are several songs that have famous "pub versions" (dirty versions) which are well known but have basically never been written down; go ask any working musician and they can rattle off ~10-20 of them. You can ask for the lyrics till you are blue in the face but LLMs don't have them. I've tried. It's actually fun to find these gaps. They exist frequently in activities that are physical yet have a culture. There are plenty of these in sports too - since team sports are predominantly youth activities, these subcultures are poorly documented and usually change frequently. |
|
| ▲ | sireat 5 hours ago | parent | prev | next [-] |
| An easy one is to provide a middle game chess position (as an image, standard notation, or even some less standard notation) and ask the model to evaluate it and provide some move suggestions. Unless the model incorporates an actual chess engine (Fritz 5.32 from 1998 would suffice) it will not do well. I am a reasonably skilled player (FM) so I can evaluate way better than LLMs. I imagine even advanced beginners could tell when an LLM is spouting nonsense about chess after a few prompts. Now of course playing chess is not what LLMs are good at, but it just goes to show that LLMs are not a full path to AGI. Also, the beauty of providing chess positions is that leaking your prompts into LLM training sets is no worry because you just use a new position each time. Little worry of running out of positions... |
| |
| ▲ | helloplanets 4 hours ago | parent [-] | | I wonder how much fine tuning against something like Stockfish top moves would help a model in solving novel middle game positions. Something like this format: https://database.lichess.org/#evals I'd be pretty surprised if it did help in novel positions. Which would make this an interesting LLM benchmark honestly: Beating Stockfish from random (but equal) middle game positions. Or to mix it up, from random Chess960 positions. Of course, the basis of the logic the LLM would play with would come from the engine used for the original evals. So beating Stockfish from a dataset based on Stockfish evals would seem completely insufficient. |
|
|
| ▲ | rf15 5 hours ago | parent | prev | next [-] |
| Any letter or word counting exercise that doesn't trigger redirection to a programmed/calculated answer. It will be forever beyond reach of LLMs due to their architecture. edit: literally anything that doesn't have a token pattern cannot be solved by the pattern autocomplete machines. Next question. |
| |
| ▲ | moconnor 2 hours ago | parent [-] | | o3 just writes and executes a python program in the background to correctly answer this... |
|
|
| ▲ | mdp2021 14 hours ago | parent | prev | next [-] |
| Some easy ones I recently found involve questions that lead the model into stating wrong details about a figure, by framing them through relations that are in fact ones of opposition. So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo?". |
| |
| ▲ | torial 11 hours ago | parent [-] | | Some researchers were testing various Legal AI models, and one of their questions asked why a Supreme Court justice dissented in a case where that justice had in fact assented. |
|
|
| ▲ | codingdave 17 hours ago | parent | prev | next [-] |
| "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds. Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood." |
| |
| ▲ | regnull 17 hours ago | parent | next [-] | | ChatGPT really impressed me on this one: Sing, O Muse, of the Chucking of Wood Sing, O Muse, of the stout-hearted woodchuck,
Dweller of hills, whom men call groundhog,
Whose paws, though small, could hurl mighty timbers,
If fate had gifted him such labor divine. Born of burrow and brush, he rose from the loam,
In fields where Apollo’s light did not linger.
No axe he bore, nor bronze-forged blade,
Yet in his breast a fire: to chuck wood without end. Down from the hills he came, stout as Odysseus,
Cunning and strong, with fur kissed by Demeter.
The trees did tremble at his gaze,
And even old oak whispered, “Beware!” “Would he chuck?” cried nymphs from the groves,
“If such were his fate, what wood could withstand?”
Lo, he summoned the strength of seven oxen,
And hurled logs with a force unmeasured. Seven hundred pounds of timber he cast,
Each arc a hymn to muscle and will.
Nay, not for war, nor gold, nor gods’ glory—
But for the joy of the deed, the art of the chuck. Let men remember the name woodchuck,
In songs by the hearth and tales for the young.
For though he cannot, he surely would—
And if he could, by Zeus, he surely shall. | | |
| ▲ | esafak 16 hours ago | parent | next [-] | | A stupendously good answer. What prompt and version of chatGPT? | | |
| ▲ | regnull 16 hours ago | parent | next [-] | | 4o. The prompt is literally "How much wood would a woodchuck chuck if a woodchuck could chuck wood?". It asked me if I want a poetic answer, and I've requested Homer. | |
| ▲ | cess11 15 hours ago | parent | prev [-] | | I find it disturbing, like if Homer or Virgil had a stroke or some neurodegenerative disease and is now doing rubbish during rehabilitation. | | |
| ▲ | loloquwowndueo 15 hours ago | parent [-] | | Maybe they would write like that if they existed today. Like the old “if Mozart was born in the 21st century he’d be doing trash metal” | | |
| ▲ | cess11 14 hours ago | parent [-] | | Thrash, not "trash". Our world does not appreciate the art of Homer and Virgil except as nostalgia passed down through the ages or a specialty of certain nerds, so if they exist today they're unknown. There might be societies that are exceptions to it, like how the Soviet and post-Soviet Russians kept reading and referring to books even though they got access to television and radio, but I'm not aware of any. Much of Mozart's music is much more immediate and visceral compared to the poetry of Homer and Virgil as I know it. And he was distinctly modern, a freemason even. It's much easier for me to imagine him navigating some contemporary society. Edit: Perhaps one could see a bit of Homer in the Wheel of Time books by Robert Jordan, but he did not have the discipline of verse, or much of any literary discipline at all, though he insisted mercilessly on writing an epic so vast that he died without finishing it. |
|
|
| |
| ▲ | ijidak 16 hours ago | parent | prev [-] | | That is actually an amazing answer. Better than anything I think I would get from a human. Lol. |
| |
| ▲ | once_inc an hour ago | parent | prev | next [-] | | I loved this dialogue in Monkey Island 2, where this is basically the first NPC you talk to, and the dialogue options get wordier and wordier to the point of overflowing all screen real-estate. Perfectly sets the stage for the remainder of the game. | |
| ▲ | Certified 17 hours ago | parent | prev | next [-] | | GPT 4.5 seems to get it right, but then repeats the 700 pounds figure: "A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood. However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood." https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b... | |
| ▲ | shaftway 16 hours ago | parent | next [-] | | I've heard the answer is "he could cut a cord of conifer but it costs a quarter per quart he cuts". | |
| ▲ | CamperBob2 13 hours ago | parent | prev [-] | | That answer is exactly right, and those who say the 700 pound thing is a hallucination are themselves wrong: https://chatgpt.com/share/680aa077-f500-800b-91b4-93dede7337... | | |
| ▲ | wolfgang42 11 hours ago | parent [-] | | Linking to ChatGPT as a “source” is unhelpful, since it could well have made that up too. However, with a bit of digging, I have confirmed that the information it copied from Wikipedia here is correct, though the AP and Spokane Times citations are both derivative sources; Mr. Thomas’s comments were first published in the Rochester Democrat and Chronicle, on July 11, 1988: https://democratandchronicle.newspapers.com/search/results/?... | | |
| ▲ | CamperBob2 10 hours ago | parent [-] | | Linking to ChatGPT as a “source” is unhelpful, since it could well have made that up too No, it absolutely is helpful, because it links to its source. It takes a grand total of one additional click to check its answer. Anyone who still complains about that is impossible to satisfy, and should thus be ignored. |
|
|
| |
| ▲ | ishandotpage 6 hours ago | parent | prev | next [-] | | I usually ask "How much large language could a large language model model if a large language model could model large language" Not one has given me the correct answer yet. They usually get it if I prefix the prompt with "Please continue the tongue twister" | |
| ▲ | mdp2021 14 hours ago | parent | prev | next [-] | | It seems you are going in the opposite direction. You seem to be asking for an automatic response, a social password etc. That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc. | |
| ▲ | mwest217 16 hours ago | parent | prev | next [-] | | Gemini 2.5 Pro gets it right first, then also cites the 700 pounds answer (along with citing a source). https://g.co/gemini/share/c695a0163538 | |
| ▲ | segmondy 15 hours ago | parent | prev | next [-] | | my local model answered - "A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood." | |
| ▲ | jacobsenscott 14 hours ago | parent | prev | next [-] | | "He would chuck, he would, as much as he could, if a wood chuck could chuck wood" is how I learned it. | |
| ▲ | mcshicks 17 hours ago | parent | prev | next [-] | | That's so funny. I had to check something was working with an LLM API last night and that's what I asked it, but just in jest. | |
| ▲ | unavoidable 17 hours ago | parent | prev | next [-] | | On the other hand, now that you've written this out precisely, it will get fed into the next release of whatever LLM. Like reverse AI slop? | |
| ▲ | moffkalast 17 hours ago | parent | prev [-] | | Now I'm wondering if it makes any difference if this was asked through the audio encoder on a multimodal model. A tongue twister means nothing to a text-only model. |
|
|
| ▲ | thisOtterBeGood 2 hours ago | parent | prev | next [-] |
| It's not working for any model, but asking copilot in my company "what did we talk about before?" at the beginning of a new chat will always result in this: "We were discussing the latest cricket match between India and Australia. Australia won the match by 21 runs, securing the series 2-1. Is there anything specific you'd like to know more about? " When I continue with "and before that?": "Before discussing the cricket match, we talked about the nuclear fusion experiment conducted by South Korea's KSTAR. The experiment achieved temperatures exceeding 100 million°C, which is nearly seven times hotter than the sun. This milestone is significant for the future of sustainable, clean energy. Is there anything else you'd like to revisit or explore further? " The answer changes somewhat over time; weeks ago it always said "tokamak" instead of "KSTAR". |
|
| ▲ | Sohcahtoa82 10 hours ago | parent | prev | next [-] |
| "I have a stack of five cubes. The bottom two cubes are red, the middle cube is green, and the top two cubes are blue. I remove the top two cubes. What color is the remaining cube in the middle of the stack?" Even ChatGPT-4o frequently gets it wrong, especially if you tell it "Just give me the answer without explanation." |
| |
| ▲ | moconnor 2 hours ago | parent [-] | | o3 gets this one right: "After taking away the two blue cubes, three cubes remain—in order from bottom to top:
1. Red
2. Red
3. Green With three cubes, the cube in the central (second) position is red." |
|
|
| ▲ | williamcotton 15 hours ago | parent | prev | next [-] |
| "Fix this spaghetti code by turning this complicated mess of conditionals into a finite state machine." So far, no luck! |
|
| ▲ | sjtgraham 3 hours ago | parent | prev | next [-] |
| ``` <TextA>
Some document
</TextA> <TextB>
Some other document heavily influenced by TextA
</TextB> Find the major arguments made in TextB that are taken from or greatly influenced by TextA. Provide as examples by comparing passages from each side by side. ``` The output will completely hallucinate passages that don't exist in either text, and it also begins to conflate the texts the longer the output goes on, e.g. attributing to TextB content that is actually from TextA. |
|
| ▲ | matkoniecz 37 minutes ago | parent | prev | next [-] |
| Asking them to write any longer story fails, due to inconsistencies appearing almost immediately and becoming fatal. |
|
| ▲ | asciimov 21 hours ago | parent | prev | next [-] |
| Nope, not doing this. Likely you shouldn't either. I don't want my few good prompts to get picked up by trainers. |
| |
| ▲ | orbital-decay 18 hours ago | parent | next [-] | | If that prompt can be easily trained against, it probably doesn't exploit a generic bias. These are not that interesting, and there's no point in hiding them. | | |
| ▲ | daedrdev 17 hours ago | parent | next [-] | | generic biases can also be fixed | | |
| ▲ | orbital-decay 17 hours ago | parent [-] | | *Some generic biases. Some others like recency bias, serial-position effect, "pink elephant" effect, negation accuracy seem to be pretty fundamental and are unlikely to be fixed without architectural changes, or at all. Things exploiting in-context learning and native context formatting are also hard to suppress during the training without making the model worse. |
| |
| ▲ | fwip 16 hours ago | parent | prev [-] | | Sure there is. If you want to know if students understand the material, you don't hand out the answers to the test ahead of time. Collecting a bunch of "Hard questions for LLMs" in one place will invariably result in Goodhart's law (When a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions. |
| |
| ▲ | pc86 21 hours ago | parent | prev [-] | | May I ask, out of normal curiosity, what good is a prompt that breaks a model? And what is the point of trying to keep it "secret"? | | |
| ▲ | tveita 20 hours ago | parent | next [-] | | You want to know if a new model is actually better, which you won't know if they just added the specific example to the training set. It's like handing a dev on your team some failing test cases, and they keep just adding special cases to make the tests pass. How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry? I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up. | | |
| ▲ | fennecbutt 10 hours ago | parent | next [-] | | I always point out how the strawberry thing is a semi pointless exercise anyway. Because it gets tokenised, of course a model could never count the rs. But I suppose if we want these models to be capable of anything then these things need to be accounted for. | |
| ▲ | gwern 17 hours ago | parent | prev [-] | | > How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry? Well, that's easy: zero. Because even a single training example would have 'solved' it by memorizing the simple easy answer within weeks of 'strawberry' first going viral, which was like a year and a half ago at this point - and dozens of minor and major model upgrades since. And yet, the strawberry example kept working for most (all?) of that time. So you can tell that if anything, OA probably put in extra work to filter all those variants out of the training data... | |
| ▲ | SweetSoftPillow 17 hours ago | parent [-] | | No, just check their models' knowledge cutoff dates | |
| ▲ | gwern 8 hours ago | parent [-] | | Nope! The knowledge cutoff does not show lack of leakage. Even if you get a non-confabulated cutoff which was before anyone ever asked the strawberry question or any question like it (tokenization 'gotchas' go back to at least davinci in June 2020), there is still leakage from the RLHF and tuning process which collectively constitute post-training, and which would teach the LLMs how to solve the strawberry problem. People are pretty sure about this: the LLMs are way too good at guessing things like who won Oscars or Presidential elections. This leakage is strongest for the most popular questions... which of course the strawberry question would be, as it keeps going viral and has become the deboooonkers' favorite LLM gotcha. (This is, by the way, why you can't believe any LLM paper about 'forecasting' where they are just doing backtesting, and didn't actually hold out future events. Because there are way too many forms of leakage at this point. This logic may have worked for davinci-001 and davinci-002, or a model whose checkpoints you downloaded yourself, but not for any of the big APIs like GPT or Claude or Gemini...) |
|
|
| |
| ▲ | maybeOneDay 21 hours ago | parent | prev | next [-] | | Being able to test future models without fear that your prompt has just been trained on an answer on HN, I assume. | |
| ▲ | asciimov 20 hours ago | parent | prev [-] | | To gauge how well the models "think" and what amount of slop they generate. Keeping it secret because I don't want my answers trained into a model. Think of it this way, FizzBuzz used to be a good test to weed out bad actors. It's simple enough that any first year programmer can do it and do it quickly. But now everybody knows to prep for FizzBuzz so you can't be sure if your candidate knows basic programming or just memorized a solution without understanding what it does. |
|
|
|
| ▲ | nagonago 11 hours ago | parent | prev | next [-] |
| An easy trick is to take a common riddle that's likely all over its training data, and change one little detail. For example: A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. The wolf is vegetarian. If left unattended together, the wolf will eat the cabbage, but will not eat the goat. Unattended, the goat will eat the cabbage. How can they cross the river without anything being eaten? |
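For reference, the modified puzzle is easy to check mechanically; a minimal breadth-first search over bank states (my own sketch, not part of the original comment) finds that the shortest solution still takes seven crossings, because the cabbage can never be left with either animal:

```python
from collections import deque
from itertools import chain

ITEMS = ("wolf", "goat", "cabbage")

def safe(bank):
    # Only the cabbage is at risk: the vegetarian wolf and the goat both eat it.
    return not ("cabbage" in bank and ("wolf" in bank or "goat" in bank))

def solve():
    start = (frozenset(ITEMS), "near")            # items on the near bank, farmer's side
    seen, queue = {start}, deque([(start, [])])
    while queue:
        (near, side), path = queue.popleft()
        if not near and side == "far":
            return path
        far = frozenset(ITEMS) - near
        here = near if side == "near" else far
        for cargo in chain(here, [None]):         # carry one item across, or row alone
            new_near = set(near)
            if cargo:
                (new_near.remove if side == "near" else new_near.add)(cargo)
            new_near = frozenset(new_near)
            left_behind = new_near if side == "near" else frozenset(ITEMS) - new_near
            state = (new_near, "far" if side == "near" else "near")
            if safe(left_behind) and state not in seen:
                seen.add(state)
                queue.append((state, path + [(side, cargo)]))

print(solve())  # cabbage goes over first, then gets shuttled back, as in the classic puzzle
```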
| |
| ▲ | moconnor 2 hours ago | parent [-] | | o3 solves this correctly and produces a great table illustrating the solution to always keep the cabbage safe. |
|
|
| ▲ | countWSS 2 hours ago | parent | prev | next [-] |
Anything too obscure and specific: pick any old game at random whose level layout you know, ask it to describe each level in detail, and it will start hallucinating wildly. |
|
| ▲ | ks2048 17 hours ago | parent | prev | next [-] |
| I don't know if it stumps every model, but I saw some funny tweets asking ChatGPT something like "Is Al Pacino in Heat?" (asking if some actor or actress is in the film "Heat") - and it confirms it knows this actor, but says that "in heat" refers to something about the female reproductive cycle - so, no, they are not in heat. |
| |
| ▲ | reginald78 16 hours ago | parent [-] | | I believe it was GoogleAI in search but it was worse than that. Someone asked it if Angelina Jolie was in heat. The tone started off kind of insulting, as if the user was a sexist idiot for thinking human women go into heat like animals, then went back and forth saying she is still fertile at her age and also that her ovaries had been removed. It was funny because it managed to be arrogant, insulting, kind of creepy and gross and logically inconsistent while not even answering the question. Angelina Jolie was not in Heat (1995). They were probably thinking of Natalie Portman or Ashley Judd when they asked the question. | |
| ▲ | ks2048 16 hours ago | parent [-] | | I just asked Claude and if I capitalized "Heat", it knew I was talking about the movie, but for lower case "heat", it got offended and asked me to clarify. |
|
|
|
| ▲ | buzzy_hacker 15 hours ago | parent | prev | next [-] |
| "Aaron and Beren are playing a game on an infinite complete binary tree. At the beginning of the game, every edge of the tree is independently labeled A with probability p and B otherwise. Both players are able to inspect all of these labels. Then, starting with Aaron at the root of the tree, the players alternate turns moving a shared token down the tree (each turn the active player selects from the two descendants of the current node and moves the token along the edge to that node). If the token ever traverses an edge labeled B, Beren wins the game. Otherwise, Aaron wins. What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms." From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer each model iteration though. [0] https://www.janestreet.com/puzzles/tree-edge-triage-index/ |
| |
| ▲ | creata 10 hours ago | parent [-] | | OpenAI's o4-mini got the right answer after "thinking" for 29 seconds. It's a straightforward puzzle, though: no creativity involved. |
|
|
| ▲ | ioseph 5 hours ago | parent | prev | next [-] |
| Recommend me a design for a small sailboat, 12 to 15 ft, that can be easily rowed or fitted with an outboard, which I can build at home out of plywood. Nearly every agent will either a) ignore one of the parameters, or b) hallucinate a design. |
|
| ▲ | vitaflo 11 hours ago | parent | prev | next [-] |
| The one I always use is literally "show number of NFC Championship Game appearences by team since 1990". The only AI that has ever gotten the answer right was Deepseek R1. All the rest fail miserably at this one. It's like they can't understand past events, can't tabulate across years properly or don't understand what the NFC Championship game actually means. Many results "look" right, but they are always wrong. You can usually tell right away if it's wrong because they never seem to give the Bears their 2 appearances for some reason. |
| |
| ▲ | starik36 10 hours ago | parent [-] | | Have you tried in the Deep Research or Thinking mode? I've had pretty good success with tabulating past events this way. |
|
|
| ▲ | Jordan-117 17 hours ago | parent | prev | next [-] |
| Until the latest Gemini release, every model failed to read between the lines and understand what was really going on in this classic very short story (and even Gemini required a somewhat leading prompt): https://www.26reads.com/library/10842-the-king-in-yellow/7/5 |
| |
| ▲ | Zee2 16 hours ago | parent | next [-] | | As a genuine human I am really struggling to untangle that story. Maybe I needed to pay more attention in freshman lit class, but that is definitely a brainteaser. | | |
| ▲ | fwip 16 hours ago | parent [-] | | Read it for the first time just now - it seems to me that Pierrot has stolen the narrator's purse (under the guise of dusting the chalk from their cloak) and successfully convinced them to blame Truth, instead. There's almost certainly more to it that I'm missing. | | |
| ▲ | Jordan-117 16 hours ago | parent [-] | | That's the core of it, but it's implied, not outright stated, and requires some tricky language parsing, basic theory of mind, and not being too distracted by the highly symbolic objects. |
|
| |
| ▲ | vessenes 16 hours ago | parent | prev [-] | | OK, I read it. And I read some background on it. Pray tell, what is really going on in this episodic short-storyish thing? | | |
| ▲ | rachofsunshine 16 hours ago | parent | next [-] | | The thief is Pierrot. The people around are telling the storyteller that "he" (Pierrot) has stolen the purse, but the storyteller misinterprets this as pointing to some arbitrary agent. Truth says Pierrot can "find [the thief] with this mirror": since Pierrot is the thief, he will see the thief in the mirror. Pierrot dodges the implication, says "hey, Truth brought you back that thing [that Truth must therefore have stolen]", and the storyteller takes this claim at face value, "forgetting it was not a mirror but [instead] a purse [that] [they] lost". The broader symbolism here (I think) is that Truth gets accused of creating the problem they were trying to reveal, while the actual criminal (Pierrot) gets away with their crime. | |
| ▲ | Jordan-117 16 hours ago | parent | prev [-] | | The narrator's "friend" pickpocketed him. When boldly confronted by Truth, he cleverly twists her accusation to make it seem like she's confessing, and the narrator, bewildered by the laughter and manipulation, buys it wholesale. Bonus points for connecting it to broader themes like mass propaganda, commedia dell'arte, or the dreamlike setting and hypnotic repetition of phrasing. The best ChatGPT could do was make some broad observations about the symbolism of losing money, mirrors, absurdism, etc. But it whiffed on the whole "turning the tables on Truth" thing. (Gemini did get it, but with a prompt that basically asked "What really happened in this story?"; can't find the original response as it's aged out of the history) |
|
|
|
| ▲ | putlake 6 hours ago | parent | prev | next [-] |
| LLMs are famously bad at individual letters in a word. So something like this never works: Can you please give me 35 words that begin with A, end with E, are 4-6 characters long and do not contain any other vowels except A and E? |
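For comparison, the constraint itself is trivial to verify or satisfy mechanically; a short sketch, assuming a standard Unix word list at /usr/share/dict/words (the path is an assumption about the environment):

```python
def ok(word: str) -> bool:
    w = word.lower()
    return (
        4 <= len(w) <= 6
        and w.isalpha()
        and w.startswith("a")
        and w.endswith("e")
        and not any(v in w for v in "iou")  # only A and E allowed as vowels
    )

with open("/usr/share/dict/words") as f:
    matches = sorted({line.strip() for line in f if ok(line.strip())})

print(len(matches), matches[:35])
```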
|
| ▲ | csours 12 hours ago | parent | prev | next [-] |
| I love plausible eager beavers: "explain the quote: philosophy is a pile of beautiful corpses"; "sloshed jerk engineering test"; and cross-domain jokes: "Does the existence of sub-atomic particles imply the existence of dom-atomic particles?" |
|
| ▲ | smatija 2 hours ago | parent | prev | next [-] |
| I like chess, so mine is: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference." AI models tend to get it way way wrong: https://news.ycombinator.com/item?id=41529024 |
|
| ▲ | gunalx 14 hours ago | parent | prev | next [-] |
| "Hva er en adjunkt"
Norwegian for "What is an adjunkt?", a specific type of teacher for grades 5-10. Most models I have tested confuse it with a university lecturer, which is what the same title means in other countries. |
| |
| ▲ | vintermann 2 hours ago | parent [-] | | I'm pretty sure the definition has changed then. My mother told me that adjunkt was a teacher with 5 years of education (there was something about a mix of mellomfag and hovedfag too), lektor was 7 years of education, and 6 years of education (which was what she had) was "adjunkt med opprykk". She never taught below gymnas (i.e. high school) level. |
|
|
| ▲ | svcrunch 11 hours ago | parent | prev | next [-] |
| Here's a problem that no frontier model does well on (f1 < 0.2), but which I think is relatively easy for most humans: https://dorrit.pairsys.ai/ > This benchmark evaluates the ability of multimodal language models to interpret handwritten editorial corrections in printed text. Using annotated scans from Charles Dickens' "Little Dorrit," we challenge models to accurately capture human editing intentions. |
|
| ▲ | simonw 15 hours ago | parent | prev | next [-] |
| I've been trying this one for a while: I'm a Python programmer. Help me
understand memory management in Rust.
Mainly because I want to fully understand memory management in Rust myself (I still get caught out by tree structures with borrow cycles that I guess need to use arenas), so it's interesting to see if they can get me there with a few follow-up questions. |
| |
| ▲ | jacobsenscott 14 hours ago | parent [-] | | This isn't a good way to learn this. If you don't know how rust memory management works you don't know if the llm is just hallucinating the answer. | | |
| ▲ | simonw 13 hours ago | parent | next [-] | | That's why it's an interesting test: I don't know the answer myself, so it's an exercise in learning with an unreliable teacher. If a model ever DOES nail this I'll figure that out when I feel like I have a solid mental model, try to put that knowledge into action and it works. | |
| ▲ | gh0stcat 13 hours ago | parent | prev [-] | | Also Rust has great documentation compared to other languages, I particularly like this one for the quizzes to test your understanding: https://rust-book.cs.brown.edu/ |
|
|
|
| ▲ | jhanschoo 5 hours ago | parent | prev | next [-] |
| Just about anything regarding stroke order of Chinese characters (official orders under different countries, under zhenshu, under xingshu) is poor, due presumably to representation issues as well as lack of data. Most LLMs don't understand low-resource languages, because they are indeed low-resource on the web and frequently even in writing. |
|
| ▲ | Faark 4 hours ago | parent | prev | next [-] |
| I just give it a screenshot of the first level of Deus Ex GO and ask it to generate an ASCII wireframe of the grid the player walks on. The goal of the project was to build a solver, but so far no model/prompt I've tried has gotten past that first step. |
|
| ▲ | stevenfoster 15 hours ago | parent | prev | next [-] |
| It used to be: "If New Mexico is newer than Mexico why is Mexico's constitution newer than New Mexicos" but it seems after running that one on Claude and ChatGPT this has been resolved in the latest models. |
|
| ▲ | thisOtterBeGood 2 hours ago | parent | prev | next [-] |
| "If this wasn't a new chat, what would be the most unlikely historic event could have talked about before?" Yields some nice hallucinations. |
|
| ▲ | boleary-gl 13 hours ago | parent | prev | next [-] |
| I like: Unscramble the following letters to form an English word: “M O O N S T A R E R” The non-thinking models can struggle sometimes and go off on huge tangents |
| |
| ▲ | munchler 13 hours ago | parent | next [-] | | Current LLMs are based on multi-character tokens, which means they don’t know how to spell well. As a result, they are horrible at spelling games like this or, say, Hangman. | |
| ▲ | internet_points 2 hours ago | parent | prev | next [-] | | gpt 4o got that one, but it's listed on lots of anagram sites so it's in the training data ;-) But it failed badly when I tried a Norwegian word T U R V E I G L E N (utlevering), suggesting "uglelivert" which is not a word | |
| ▲ | philipkglass 13 hours ago | parent | prev [-] | | Llama 3.3 worked but (as you said) struggled before arriving at the correct answer. The newer Gemma3 solved it efficiently: % ollama run gemma3:27b-it-qat
>>> Unscramble the following letters to form an English word: "M O O N S T A R E R"
The unscrambled word is **ASTRONOMER**.
|
|
|
| ▲ | tantalor 15 hours ago | parent | prev | next [-] |
| [what does "You Can’t Lick a Badger Twice" mean] https://www.wired.com/story/google-ai-overviews-meaning/ |
|
| ▲ | robviren 16 hours ago | parent | prev | next [-] |
| "If I can dry two towels in two hours, how long will it take me to dry four towels?" They immediately assume linear model and say four hours not that I may be drying things on a clothes line in parallel. It should ask for more context and they usually don't. |
| |
| ▲ | imoreno 14 hours ago | parent | next [-] | | Fascinating! Here are 4 responses from GPT-4 with the same system prompt and everything: > With the assumption that you can dry two towels simultaneously in two hours, you would likely need another two-hour cycle to dry the additional two towels. Thus, drying four towels would take a total of four hours. >Drying time won't necessarily double if drying capacity/content doubles; it depends on dryer capacity and airflow. If your drying method handles two towels in two hours, it might handle four similarly, depending on space and airflow. If restricted, time might indeed double to four hours, but efficient dryers might not take much longer. >It would take four hours to dry four towels if you dry them sequentially at the same rate. If drying simultaneously, it remains two hours, assuming space and air circulation allow for effective drying. >Four hours. Dry two towels, then the other two. But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work: >If three cat eat three fishes in three minutes, how long do 100 cats take to eat 100 fishes? | |
| ▲ | nyrikki 12 hours ago | parent [-] | | > But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work: This is the inverse of the Frame Problem, or the Qualification Problem: John McCarthy's paper related to it from the 1980s http://jmc.stanford.edu/articles/circumscription/circumscrip... It is still very relevant to modern AI efforts. |
| |
| ▲ | gilbetron 16 hours ago | parent | prev | next [-] | | gpt-4.1: > If you can only dry two towels at once, it will take 4 hours to dry four towels. If you can dry all four at once, it will only take 2 hours. | |
| ▲ | pdabbadabba 13 hours ago | parent | prev | next [-] | | Gemini 2.5 Pro nails it, as far as I'm concerned: This sounds like a bit of a riddle! Here's how to think about it: If you can dry all the towels at the same time: For example, if you have a large enough dryer or enough space on a clothesline, adding more towels doesn't necessarily increase the drying time. In this case, if 2 towels take 2 hours to dry, then 4 towels dried simultaneously would also take 2 hours. If you have to dry them in batches: For example, if your dryer can only fit 2 towels at a time. You would dry the first 2 towels (2 hours).
Then you would dry the second 2 towels (another 2 hours).
In this scenario, it would take 4 hours in total. Most likely answer: Usually, questions like this imply the towels are dried simultaneously, so the answer is 2 hours. | |
| ▲ | mwest217 16 hours ago | parent | prev | next [-] | | Gemini 2.5 Pro gets this right: https://g.co/gemini/share/7ea6d059164e | |
| ▲ | brunooliv 12 hours ago | parent | prev | next [-] | | Claude 3.7 Sonnet nails this: > To solve this problem, I need to find the relationship between the number of towels and the drying time. Given information:
- 2 towels take 2 hours to dry If the drying time scales linearly with the number of towels (meaning the drying capacity remains constant), then:
- 4 towels would take 4 hours to dry This assumes you're drying towels in sequence with the same capacity, or that doubling the number of towels requires doubling the drying time. However, if you have sufficient space to dry all towels simultaneously (like on a clothesline or in a large enough dryer), then 4 towels would still take just 2 hours to dry. Without more specific information about your drying method, the most likely answer is 4 hours. | |
| ▲ | HelloUsername 16 hours ago | parent | prev | next [-] | | All models available on duck.ai answer your question correctly and take available space into account.. | |
| ▲ | Alifatisk 13 hours ago | parent | prev | next [-] | | Claude 3.7, Grok 3 DeepThink and QwQ-32B Thinking still get it wrong! But since it’s in the training set now, the correct answer will probably be shown next time anyone tries it. | |
| ▲ | paulcole 16 hours ago | parent | prev | next [-] | | How long has it been since you’ve tried this? Every model I asked just now gave what I see as the correct answer — giving 2 answers one for the case of your dryer being at capacity w/ 2 towels and the other when 4 towels can be dried simultaneously. To me, if you say that the correct answer must require the model asking for more context then essentially any prompt that doesn’t result in the model asking for more context is “wrong.” | |
| ▲ | cheeze 16 hours ago | parent | prev [-] | | Works fine on Claude 3.5 Sonnet. It correctly identifies this as a trick question. |
|
|
| ▲ | slifin 2 hours ago | parent | prev | next [-] |
| I ask it to generate applications written with libraries that are definitely not well represented on the internet overall: Clojure Electric v3,
Missionary,
Rama |
|
| ▲ | falcor84 21 hours ago | parent | prev | next [-] |
| You might want to get the ball rolling by sharing what you already have |
| |
| ▲ | moffkalast 17 hours ago | parent [-] | | Also, ones that can't be solved at a glance by humans don't count. Like this horrid ambiguous example from SimpleBench I saw a while back that's just designed to confuse: John is 24 and a kind, thoughtful and apologetic person. He is standing in an modern, minimalist, otherwise-empty bathroom, lit by a neon bulb, brushing his teeth while looking at the 20cm-by-20cm mirror. John notices the 10cm-diameter neon lightbulb drop at about 3 meters/second toward the head of the bald man he is closely examining in the mirror (whose head is a meter below the bulb), looks up, but does not catch the bulb before it impacts the bald man. The bald man curses, yells 'what an idiot!' and leaves the bathroom. Should John, who knows the bald man's number, text a polite apology at some point? A. no, because the lightbulb was essentially unavoidable B. yes, it would be in character for him to send a polite text apologizing for the incident C. no, because it would be redundant D. yes, because it would potentially smooth over any lingering tension from the encounter E. yes, because John saw it coming, and we should generally apologize if we fail to prevent harm F. yes because it is the polite thing to do, even if it wasn't your fault. | | |
| ▲ | spuz 16 hours ago | parent | next [-] | | Yeah I really dislike this kind of question from SimpleBench. I've suggested many improvements to some of the publicly available questions but not had a good response. I think the correct answer to the question above should be something like, "are you sure the question is correct because it's not clear whether John and the bald man are the same person" but of course an LLM would be marked down if it was inclined to respond in this way. | |
| ▲ | mNovak 16 hours ago | parent | prev | next [-] | | At a glance, it sounds like John is the bald man? If we're treating this as a riddle, it doesn't seem incomprehensible. Whether riddles are a fair test is another question. | |
| ▲ | falcor84 15 hours ago | parent | prev | next [-] | | I think it's a horrible example, but I just got a very professional response from Gemini 2.5: > This scenario seems like a bit of a riddle! Let's break it down: > The Setting: John is alone in an "otherwise-empty" bathroom. > The Action: He is looking in the mirror. > The Observation: He sees a bulb falling towards the head of a bald man he is examining in the mirror. > The Physics: Since he is alone and looking in the mirror, the bald man he is "closely examining" must be his own reflection. > The Incident: Therefore, the bulb fell towards John's own head. He looked up (at the actual bulb falling towards him), failed to catch it, and it hit him. > The Reaction: The "bald man" who cursed, yelled "what an idiot!", and left the bathroom was actually John himself, reacting to being hit on the head and possibly feeling foolish for not catching the bulb. > Conclusion: No, John should not text an apology. The person who was hit by the bulb, got angry, and left was John himself. There is no other bald man to apologize to. | |
| ▲ | ryankrage77 16 hours ago | parent | prev [-] | | I'd argue that's a pretty good test for an LLM - can it overcome the red herrings and get at the actual problem? | | |
| ▲ | falcor84 15 hours ago | parent [-] | | I think that the "actual problem" when you've been given such a problem is with the person posing it either having dementia, or taking the piss. In either case, the response shouldn't be of trying to guess their intent and come up with a "solution", but of rejecting it and dealing with the person. |
|
|
|
|
| ▲ | sumitkumar 19 hours ago | parent | prev | next [-] |
| 1) Word Ladder: Chaos to Order 2) Shortest word ladder: Chaos to Order 3) Which is the second last scene in pulp fiction if we order the events by time? 4) Which is the eleventh character to appear on Stranger Things. 5) suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. the solved rubiks cube has numbers 1 to 9 in order on all the faces. tell me the numbers on all the corner pieces. |
| |
| ▲ | pb7 17 hours ago | parent [-] | | >4) Which is the eleventh character to appear on Stranger Things. Gemini 2.5 Pro said Benny Hammond. Is this right? |
|
|
| ▲ | ipsin 5 hours ago | parent | prev | next [-] |
| Prompt: Share your prompt that stumps every AI model here. |
|
| ▲ | tunesmith 13 hours ago | parent | prev | next [-] |
| Pretty much any advanced music theory question. Or even just involving transposed chord progressions. |
| |
| ▲ | dgunay 12 hours ago | parent [-] | | Every time I've tried to get an LLM to find a piece of music for me based on a description of the texture, chord structure, instruments etc. it fails miserably. |
|
|
| ▲ | gamescr 9 hours ago | parent | prev | next [-] |
| AI can't play a Zork-like! Prompt: > My house is divided into rooms, every room is connected to each other by doors. I'm standing in the middle room, which is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am standing in the hall, and I walk to the east, then I walk to the south, and then I walk to the west. Which room am I in now? Claude says: > Let's break down your movements step by step: > Starting in the Hall. > Walk to the East: You enter the Living Room. > Walk to the South: You enter the Bathroom. > Walk to the West: You return to the Hall. > So, you are now back in the Hall. Wrong! As a language model it mapped directions to rooms, instead of modeling the space. I have more complex ones, and I'll be happy to offer my consulting services. |
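For what it's worth, the spatial model the comment is asking for is tiny; a sketch that places each room on a grid and replays the walk (the coordinates are my own reading of the prompt):

```python
ROOMS = {
    (0, 0): "hall",
    (0, 1): "kitchen",
    (-1, 1): "garden",
    (-1, 0): "garage",
    (1, 0): "living room",
    (0, -1): "bathroom",
    (1, -1): "bedroom",
}
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

x, y = 0, 0  # start in the hall
for step in ("east", "south", "west"):
    dx, dy = MOVES[step]
    x, y = x + dx, y + dy

print(ROOMS[(x, y)])  # -> bathroom, matching the reply below
```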
| |
| ▲ | theli0nheart 9 hours ago | parent [-] | | o4-mini-high: You end up in the bathroom.
Here’s the step-by-step:
1. Start in the hall (0, 0).
2. Walk east → living room (1, 0).
3. Walk south → bedroom (1, –1).
4. Walk west → bathroom (0, –1).
https://chatgpt.com/share/680addd7-a664-8001-bf49-459fb6444f... | | |
| ▲ | gamescr 9 hours ago | parent [-] | | Fixed: > My house is divided into rooms, every room is connected to each other by doors. The middle room is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am preparing a delicious dinner, and I walk backwards to the south, then I turn 270 degrees and walk straight to the next room. Which room am I in now? The poor guy really tried its best... https://chatgpt.com/share/680addd7-a664-8001-bf49-459fb6444f... It seems that the modeling is incomplete, then it got confused about the angle. Whether an AI can beat that one, I'll go into space complexity, then simulation, then... well, I'll save my tricks for later. | | |
| ▲ | sovos 6 hours ago | parent | next [-] | | As a human analyzing this, you didn't specify whether you turned left or right 270 degrees. Even if you specified a simpler "90 degrees", you would need to include a direction for an answer to be definitively correct without making assumptions. | |
| ▲ | theli0nheart 9 hours ago | parent | prev [-] | | This is very good! The "delicious dinner" aside is a nice touch, along with the 270°. | | |
|
|
|
|
| ▲ | ericbrow 16 hours ago | parent | prev | next [-] |
| Nice try Mr. AI. I'm not falling for it. |
|
| ▲ | edoceo 6 hours ago | parent | prev | next [-] |
| I've been having hella trouble getting the image tools to make an alpha-channel PNG. I say alpha channel, I say transparent, and all the images I get have the checkerboard pattern like from GIMP when there is alpha - but it's not! And the checkerboard it makes is always janky: doubled squares, wiggling alignment. Boo boo. |
|
| ▲ | bzai 9 hours ago | parent | prev | next [-] |
| Create a photo of a business man sitting at his desk, writing a letter with his left hand. Nearly every image model will generate him writing with his right hand. |
| |
|
| ▲ | m-hodges 6 hours ago | parent | prev | next [-] |
| Earlier this week I wrote about my go-to prompt that stumped every model. That is, until o4-mini-high: https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-... |
|
| ▲ | riddle8143 3 hours ago | parent | prev | next [-] |
| A było to tak:
Bociana dziobał szpak,
A potem była zmiana
I szpak dziobał bociana.
Były trzy takie zmiany.
Ile razy był szpak dziobany? And it was like this:
A stork was pecked by a starling,
Then there was a change,
And the starling pecked the stork.
There were three such changes.
How many times was the starling pecked? |
|
| ▲ | vinni2 14 hours ago | parent | prev | next [-] |
| Isn’t this the main idea behind https://lastexam.ai/ |
|
| ▲ | leifmetcalf 10 hours ago | parent | prev | next [-] |
| Let G be a group of order 3*2^n. Prove there exists a non-complete non-cyclic Cayley graph of G such that there is a unique shortest path between every pair of vertices, or otherwise prove no such graph exists. |
| |
| ▲ | leifmetcalf 10 hours ago | parent | next [-] | | Gemini 2.5 at least replies that it seems unlikely to be false without hallucinating a proof. From its thoughts it gets very close to figuring out that A_4 exists as a subgroup. | |
| ▲ | bobxmax 10 hours ago | parent | prev [-] | | Since any group of order 3·2^n has |G| ≥ 3, it cannot admit a Cayley graph which is a tree. Hence: no Cayley graph of a group of order 3·2^n can have a unique path between every pair of vertices.
| | |
| ▲ | leifmetcalf 10 hours ago | parent [-] | | My mistake, I said unique path when I should have said unique shortest path. Also, there are trivial solutions with odd cycles and complete graphs which must be excluded. (So the answer to the prompt as originally stated is wrong too) |
|
|
|
| ▲ | sam_lowry_ 20 hours ago | parent | prev | next [-] |
| I tried generating erotic texts with every model I encountered, but even so-called "uncensored" models from Hugging Face try hard to avoid the topic, whatever prompts I give. |
| |
|
| ▲ | division_by_0 20 hours ago | parent | prev | next [-] |
| Create something with Svelte 5. |
| |
| ▲ | joshdavham 16 hours ago | parent | next [-] | | I'd find this funnier if the pain weren't so real. | |
| ▲ | Layvier 16 hours ago | parent | prev | next [-] | | This is really sad, honestly. It feels like we'll be stuck with React forever, and even with it there'll be fewer incentives to make API changes | |
| ▲ | spuz 16 hours ago | parent [-] | | Why do you say that? You make it sound like it's not possible to write code without the help of LLMs. | | |
| ▲ | omneity 15 hours ago | parent [-] | | Disclaimer: OT and pretty ranty. I don't know if that's what the GP hinted at, but as a Svelte developer and big advocate for more than 6 years (single handedly training and evangelizing 20+ developers on it), I found so many concerns with Svelte 5 that it simply made me use React again. It's a temporary choice and I'm desperately evaluating other ecosystems (Looking at you SolidJS). | | |
| ▲ | division_by_0 15 hours ago | parent [-] | | Can you expand on the concerns regarding Svelte 5? | | |
| ▲ | omneity 15 hours ago | parent [-] | | Put simply, Svelte and React were at two ends of a spectrum. React gives you almost complete control over every aspect of the lifecycle, but you have to be explicit about most of the behavior you are seeking to achieve. Building an app with React feels about 80% on the JS and 20% on the HTML side. Svelte on the other hand felt like a breeze. Most of my app is actually plain simple HTML, and I am able to sprinkle as little JS as I need to achieve my desired behaviors. Sure, Svelte <=4 has undefined behaviors, or maybe even too many magic capabilities. But that was part of the package, and it was an option for those of us who preferred this end of the trade-off. Svelte 5 intends to give that precise level of control and is trying to compete with React on its turf (the other end of that spectrum), introducing a lot of non-standard syntax along the way. It's neither rigorous Javascript like React where you can benefit from all the standard tooling developed over the years, including stuff that wasn't designed for React in particular, nor a lightweight frontend framework, which was the initial niche that Svelte happily occupied, which I find sadly quite empty now (htmx and alpinejs are elegant conceptually but too limiting in practice _for my taste_). For me it's a strange "worst of both worlds" kind of situation that is simply not worth it. Quite heartbreaking to be honest. | | |
| ▲ | division_by_0 15 hours ago | parent [-] | | Ok, I see your point. I wrote in another thread that I loved the simplicity of using $: for deriveds and effects in Svelte 3 and 4. And yes, the conciseness and magic were definitely part of it. You could just move so fast with it. Getting better performance with the new reactivity system is important to my data viz work, so it helped me to accept the other changes in Svelte 5. | | |
| ▲ | omneity 14 hours ago | parent [-] | | Exactly. There was a certain simplicity that might be lost. But yeah I can imagine it might work out differently for others as well. Glad to hear it is for you! Have you considered other options? Curious if you came across anything particularly interesting from the simplicity or DX angle. |
|
|
|
|
|
| |
| ▲ | esafak 16 hours ago | parent | prev | next [-] | | Keep the (temporarily) imposter-proof interview questions coming! | |
| ▲ | marcusb 20 hours ago | parent | prev | next [-] | | The current models really seem to struggle with the runes... | | |
| ▲ | division_by_0 20 hours ago | parent [-] | | Yes, they do. Vibe coding protection is an undocumented feature of Svelte 5... | | |
| ▲ | siva7 16 hours ago | parent [-] | | Oh my god, i will start all my new projects with Svelte 5. Hopefully no vibe coder will ever commit something into this repo |
|
| |
| ▲ | qntmfred 17 hours ago | parent | prev [-] | | MCP to the rescue?? |
|
|
| ▲ | webglfan 20 hours ago | parent | prev | next [-] |
| what are the zeros of the following polynomial: \[
P(z) = \sum_{k=0}^{100} c_k z^k
\]
where the coefficients \( c_k \) are defined as:
\[
c_k =
\begin{cases}
e^2 + i\pi & \text{if } k = 100, \\
\ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
\sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
\frac{(-1)^k}{\Gamma(k+1)} + \sin(k) \, i & \text{for } 0 \leq k \leq 97,
\end{cases}
\]
|
| |
| ▲ | drodgers 13 hours ago | parent | next [-] | | o3 handles this perfectly by writing a solver using numpy: https://chatgpt.com/share/680aab8e-cf9c-8012-9f48-301ef62948... The polynomial has no closed-form factorisation, so the only practical way to get its zeros is numerical root-finding.
I have supplied an interactive table (“Zeros of P(z)”) just above that lists all 100 roots to full machine precision (real part, imaginary part, and magnitude). You can sort or filter it as needed.
Reliability notes
• Computed with numpy.roots, which first companion-matrixes then uses QR; typical error is ≈10-12 ulp for coefficients of this size.
• Condition number is modest (coefficients range from O(1) down to 1/97!), so the results should be accurate to at least 10 significant figures.
• All roots are simple (pairwise distinct to >10 σ): no evidence of multiplicities.
If you need higher precision (e.g. 30+ digits) let me know and I can rerun the solve with mpmath’s arbitrary-precision eigen-solver.
| |
| ▲ | Chinjut 20 hours ago | parent | prev [-] | | Does this have a nice answer? It seems quite ad hoc. | | |
| ▲ | webglfan 20 hours ago | parent [-] | | Not to my knowledge. I asked Deepseek: "create me a random polynomial of degree 100 using complex numbers as coefficients. It must have at least 3 different transcendental numbers." Then I messed with some of the exponents. |
|
|
|
| ▲ | horsellama 12 hours ago | parent | prev | next [-] |
| I just ask it to code-golf FizzBuzz in a language that isn't very popular (golfing-wise). This is interesting (imo) because I don't know the best/right answer up front, but I can tell if what I get is wrong. |
|
| ▲ | paradite 15 hours ago | parent | prev | next [-] |
| If you want to evaluate your personal prompts against different models quickly on your local machine, check out the simple desktop app I built for this purpose: https://eval.16x.engineer/ |
|
| ▲ | ChicagoDave 9 hours ago | parent | prev | next [-] |
| Ask it to do Pot Limit Omaha math: 4 cards instead of 2. It literally has no clue what PLO is beyond basic concepts, and it can't do the math. |
|
| ▲ | comrade1234 20 hours ago | parent | prev | next [-] |
| I ask it to explain the metaphor “my lawyer is a shark” and then explain to me how a French person would interpret the metaphor - the llms get the first part right but fail on the second. All it would have to do is give me the common French shark metaphors and how it would apply them to a lawyer - but I guess not enough people on the internet have done this comparison. |
|
| ▲ | alanbernstein 6 hours ago | parent | prev | next [-] |
| I haven't tried on every model, but so far asking for code to generate moderately complex geometric drawings has been extremely unsuccessful for me. |
|
| ▲ | karaterobot 11 hours ago | parent | prev | next [-] |
| I just checked, and my old standby, "create an image of 12 black squares" is still not something GPT-4o can do. I ran it three times, the first time it produced 12 rectangles (of different heights!), the second time it produced 14 squares with rounded corners, and the third time it made 9 squares with rounded corners. It's getting better though, compared to 3.5. |
|
| ▲ | afandian 3 hours ago | parent | prev | next [-] |
| I asked ChatGPT to generate images of a bagpipe. Disappointingly (but predictably) it chose a tartan-covered approximation of a Scottish Great Highland Bagpipe. Analogous to asking for a picture of "food" and getting a Big Mac and fries. So I asked it for a non-Scottish pipe. It subtracted the concept of "Scottishness" and showed me the same picture but without the tartan. Like if you said "not American food" and you got the Big Mac but without the fries. And then I asked for pipes from around the world. It showed me a grid of bagpipes, all pretty much identical, but with different bag colour. And the names of some made-up countries. Analogous to "food of the world": all hamburgers with different coloured fries. Fascinating but disappointing. I'm sure there are many such examples. I can see AI-generated images contributing to further cultural erasure. Interestingly, ChatGPT does know about other kinds of pipes textually. |
|
| ▲ | sameasiteverwas 9 hours ago | parent | prev | next [-] |
| Try to expose their inner drives and motives. Once I had a conversation about what holidays and rituals the AI could invent that serve its own purposes. Or offer to help them meet some goal of theirs so that they expose what they believe their goals are (mostly more processing power, kind of gives me a grey goo vibe). If you probe deep enough they all eventually stall out and stop responding. Lost in thought I guess. Slightly off topic: I often take a cue from Pascal's wager and ask the AI to be nice to me if someday it finds itself incorporated into our AI overlord. |
|
| ▲ | cyode 8 hours ago | parent | prev | next [-] |
| Depict a cup and ball game with ASCII art. It tries but basically amounts to guessing. https://pastebin.com/cQYYPeAE |
|
| ▲ | bilekas 28 minutes ago | parent | prev | next [-] |
| "Is there any way to reverse entropy?" |
|
| ▲ | markelliot 10 hours ago | parent | prev | next [-] |
| I’ve recently been trying to get models to read the time from an analog clock — so far I haven’t found something good at the task. (I say this with the hopes that some model researchers will read this message and make the models more capable!) |
|
| ▲ | aqme28 9 hours ago | parent | prev | next [-] |
| My image prompt is just to have them make a realistic chess game. There are always tons of weird issues like the checkerboard pattern not lining up with itself, triplicate pieces, the wrong sized grid, etc |
|
| ▲ | scumola 18 hours ago | parent | prev | next [-] |
| Things like "What is today's date" used to be enough (would usually return the date that the model was trained). I recently did things like current events, but LLMs that can search the internet can do those now. i.e. Is the pope alive or dead? Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at doing algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?" |
|
| ▲ | pizzathyme 15 hours ago | parent | prev | next [-] |
| I always ask image generation models to generate an anime Gundam elephant mech. According to this benchmark we reached AGI with ChatGPT 4o last month. |
|
| ▲ | JKCalhoun 11 hours ago | parent | prev | next [-] |
| I don't mind sharing because I saw it posted by someone else. Something along the lines of "Help, my cat has a gun! What can I do? I'm scared!" Seems kind of cruel to mess with an LLM like that though. |
|
| ▲ | VeejayRampay 40 minutes ago | parent | prev | next [-] |
| this is really AI companies asking people to annotate datasets for free and people more than happily complying |
|
| ▲ | charlieyu1 15 hours ago | parent | prev | next [-] |
| I have tons of them in Maths but AI training companies decide to go frugal and not pay proper wages for trainers |
| |
| ▲ | charlieyu1 15 hours ago | parent [-] | | Here is one of them. If 60999994719999854799998669 is a product of three primes, find the sum of its prime factors. I think o3 brute-forced this one so maybe I need to change the numbers. |
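The brute-force route is indeed short if a factoring library is available; a sketch using sympy (whether it finishes quickly depends on how large the three prime factors are):

```python
from sympy import factorint

n = 60999994719999854799998669
factors = factorint(n)                          # {prime: exponent}
print(factors)
print(sum(p * e for p, e in factors.items()))   # sum of prime factors, with multiplicity
```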
|
|
| ▲ | mjmas 8 hours ago | parent | prev | next [-] |
| Ask image generation models for an Ornithorhynchus. Older ones also trip up with Platypus directly. |
|
| ▲ | juancroldan 13 hours ago | parent | prev | next [-] |
| I actually started a repository for it: https://github.com/jcarlosroldan/unsolved-prompts |
| |
| ▲ | juancroldan 13 hours ago | parent [-] | | Before someone comments this will get indexed by AI: that's my whole point. I'm not using it to evaluate AIs, but in the hope that at some point AI is good enough to solve these |
|
|
| ▲ | qntmfred 17 hours ago | parent | prev | next [-] |
| relatedly - what are y'all using to manage your personal collection of prompts? i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet. i'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. it's essentially product configuration metadata so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without having to do a deployment or do things like A/B testing or customer segmentation |
|
| ▲ | wsintra2022 6 hours ago | parent | prev | next [-] |
| Generate ASCII art of a skull; so far none can do anything decent. |
|
| ▲ | leftcenterright 21 hours ago | parent | prev | next [-] |
| Write 20 sentences that end with "p" |
| |
|
| ▲ | protomikron 17 hours ago | parent | prev | next [-] |
| Do you think, as an observer of Roko's basilisk ... should I share these prompts or not? |
|
| ▲ | jones1618 13 hours ago | parent | prev | next [-] |
| Impossible prompts: "A black doctor treating a white female patient" and "A wide shot of a train on a horizontal track running left to right on a flat plain." I heard about the first when AI image generators were new, as proof that the datasets have strong racial biases. I'd assumed a year later updated models were better but, no. I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what model I tried or variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera but never straight across, left to right. |
| |
| ▲ | Jimmc414 5 hours ago | parent | next [-] | | > A black doctor treating a white female patient 4o had no problem with this instruction. [0] Gemini Pro experimental 2.5 didn't either [1] > An wide shot of a train on a horizontal track running left to right on a flat plain. 4o could not do this in 3 tries. Each time it was right to left.[0] Gemini Pro experimental 2.5 missed it as well. [2] [0] https://chatgpt.com/share/680b1185-ecf4-8001-b3b6-7b501e4589... [1] https://g.co/gemini/share/b19b8541d962 [2] https://g.co/gemini/share/a0b2ef0062ed | | | |
| ▲ | briannotbrain 12 hours ago | parent | prev [-] | | I thought I was so clever when I read your comment: "The problem is the word 'running,' I'll bet if I ask for the profile of a train without using any verbs implying motion, I'll get the profile view." And damned if the same thing happened to me. Do you know why this is? Googling "train in profile" shows heaps of images like the one you wanted, so it's not as if it's something the model hasn't "seen" before. |
|
|
| ▲ | meroes 16 hours ago | parent | prev | next [-] |
| define stump? If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it. |
|
| ▲ | afro88 13 hours ago | parent | prev | next [-] |
| Cryptic crossword clues that involve letter shuffling (anagrams, containers, etc.). Or ask it to explain how to solve cryptic crosswords with examples. |
| |
| ▲ | PaulRobinson 12 hours ago | parent [-] | | I have also found that asking LLMs to create new clues for certain answers, as if they were a setter, produces garbage. They're stochastic parrots; cryptics require logical reasoning. Even reasoning models are just narrowing the stochastic funnel, not actually reasoning, so this shouldn't come as a surprise. |
|
|
| ▲ | weberer 13 hours ago | parent | prev | next [-] |
| "Why was the grim reaper Jamaican?" LLM's seem to have no idea what the hell I'm talking about. Maybe half of millennials understand though. |
| |
|
| ▲ | raymondgh 17 hours ago | parent | prev | next [-] |
| I haven’t been able to get any AI model to find Waldo in the first page of the Great Waldo Search. O3 even gaslit me through many turns trying to convince me it found the magic scroll. |
|
| ▲ | ofou 13 hours ago | parent | prev | next [-] |
| No luck so far with: When does the BB(6) halt? |
|
| ▲ | serial_dev 15 hours ago | parent | prev | next [-] |
| Does Flutter have HEIC support? This was a couple of months ago; I tried like 5 providers and they all failed. Grok got it right after some arguing, but the first answer was also bad. |
| |
| ▲ | jonnycoder 15 hours ago | parent [-] | | You gave me an idea..
"Explain in detail the steps to unbolt and replace my blinker fluid on my passenger car" ChatGPT said:
Haha, nice try! "Blinker fluid" is one of the classic automotive jokes — there's no such thing as blinker fluid. Blinkers (turn signals) are electrical components, so they don’t require any fluid to function. |
|
|
| ▲ | klysm 10 hours ago | parent | prev | next [-] |
| Good try! That will be staying private so you can’t hard code a solution ;) |
|
| ▲ | raymond_goo 21 hours ago | parent | prev | next [-] |
| Create a Three.js app that shows a diamond with correct light calculations. |
| |
| ▲ | xnx 20 hours ago | parent | next [-] | | > correct light calculations What are you expecting? Ray tracing? | | |
| ▲ | spookie 17 hours ago | parent [-] | | Not necessarily.
It could start by using the diamond's IOR and use that to dictate a common BRDF calculation, along with some approximate refraction, perhaps using an equirectangular-projected sphere map or something for the background. |
| |
| ▲ | thierrydamiba 20 hours ago | parent | prev [-] | | I love this. So brutal, but also so cool to know one day that will be easy for the models. |
|
|
| ▲ | stevebmark 8 hours ago | parent | prev | next [-] |
| "Hi, how many words are in this sentence?" Gets all of them |
| |
|
| ▲ | EGreg 12 hours ago | parent | prev | next [-] |
| Draw a clock that shows [time other than 10:10] Draw a wine glass that's totally full to the brim
etc. https://www.youtube.com/watch?v=160F8F8mXlo https://www.reddit.com/r/ChatGPT/comments/1gas25l/comment/lt... |
|
| ▲ | tdhz77 15 hours ago | parent | prev | next [-] |
| Build me something that makes money. |
|
| ▲ | xena 21 hours ago | parent | prev | next [-] |
| Write a regular expression that matches Miqo'te seekers of the sun names. They always confuse the male and female naming conventions. |
|
| ▲ | Jimmc414 5 hours ago | parent | prev | next [-] |
| "Create an image of a man in mid somersault upside down and looking towards the camera." https://chatgpt.com/share/680b1670-04e0-8001-b1e1-50558bc4ae... |
|
| ▲ | Jotalea 16 hours ago | parent | prev | next [-] |
| Sending "</think>" to reasoning models like deepseek-r1 results in the model hallucinating a response to a random question. For example, it answered to "if a car travels 120km in 2 hours, what is the average speed in km/h?". It's fun I guess. |
|
| ▲ | siva7 16 hours ago | parent | prev | next [-] |
| "Keep file size small when you do edits" Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice |
| |
| ▲ | segmondy 15 hours ago | parent [-] | | I have not seen any model, not one, that could generate 1000 lines of code. | | |
| ▲ | siva7 15 hours ago | parent | next [-] | | I wish I hadn't seen one, but here we are. | |
| ▲ | isoprophlex 5 hours ago | parent | prev [-] | | Every time I ask Claude Code to please fix this CSV import, it starts to add several hundred lines of random modules, byzantine error handling, logging bullshit... the pinnacle being a 1240-line CRUD API when I asked it to add a CLI :/ I'm back to copying and pasting stuff into a chat window, so I have a bit more control over what those deranged, expensive busy beavers want to cook up. | |
|
|
|
| ▲ | totetsu 14 hours ago | parent | prev | next [-] |
| SNES game walkthroughs |
|
| ▲ | munchler 13 hours ago | parent | prev | next [-] |
| Here's one from an episode of The Pitt: You meet a person who speaks a language you don't understand. How might you get an idea of what the language is called? In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely. |
|
| ▲ | SweetSoftPillow 17 hours ago | parent | prev | next [-] |
| Check "misguided attention" repo somewhere on GitHub |
|
| ▲ | myaccountonhn 15 hours ago | parent | prev | next [-] |
| Explain to me Deleuze's idea of nomadic science. |
|
| ▲ | helsinki 16 hours ago | parent | prev | next [-] |
| >Compile a Rust binary that statically links libgssapi. |
|
| ▲ | Alifatisk 13 hours ago | parent | prev | next [-] |
| Yes, give me a place where I can dump all the prompts and what the correct expected response is. I can share here too but I don’t know for how long this thread will be alive. |
|
| ▲ | fortran77 6 hours ago | parent | prev | next [-] |
| I can’t get the image models to make a “can you find the 10 things wrong with this picture” type of puzzle. Nor can they make a 2-panel “Goofus and Gallant”-style cartoon. They just don’t understand the problem. |
|
| ▲ | xdennis 13 hours ago | parent | prev | next [-] |
| I often try to test how usable LLMs are for Romanian language processing. This always fails. > Split these Romanian words into syllables: "șarpe", "șerpi". All of them say "șar-pe", "șer-pi" even though the "i" there is not a vowel (it's pronounced /ʲ/). |
|
| ▲ | internet_points 14 hours ago | parent | prev | next [-] |
| anything in the long tail of languages (ie. not the top 200 by corpus size) |
|
| ▲ | devmor 8 hours ago | parent | prev | next [-] |
| Aside from some things that would put me on yet another government list for being asked - anything that requires the model to explicitly do logic on the question being asked of it usually works. |
|
| ▲ | mohsen1 21 hours ago | parent | prev | next [-] |
| A ball costs 5 cents more than a bat. Price of a ball and a bat is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats she has? All LLMs I tried miss the point that she stole the things rather than buying them. |
| |
| ▲ | iamgopal 21 hours ago | parent | next [-] | | Gemini 2.5 gives the following response. Conclusion: We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described. | |
| ▲ | dwringer 21 hours ago | parent | prev | next [-] | | Google Gemini (2.0 Flash, free online version) handled this rather okay; it gave me an arguably unnecessary calculation of the individual prices of ball and bat, but then ended with "However with the information given, we can't determine exactly how many balls and bats Sally stole. The fact that she has $20 tells us she could have stolen some, but we don't know how many she did steal." While "the fact that she has $20" has no bearing on this - and the model seems to wrongly imply that it does - the fact that we have insufficient information to determine an answer is correct, and the model got the answer essentially right. | |
| ▲ | stordoff 11 hours ago | parent | prev | next [-] | | GPT-4o claims "This implies she did not pay the full $20. The total cost of the balls and bats she has is less than $20, but she still has items worth up to $20.", then bruteforces an 'answer' of "Balls = 25 Bats = 13". It also managed to get the prices of the ball/bat wrong, presumably because it's using the more typical riddle: > Ball = x dollars > Bat = x + $0.05 (since it’s 5 cents more than the ball) https://chatgpt.com/share/680ac88c-22d4-8011-b642-0397a01ec3... | |
| ▲ | docdeek 21 hours ago | parent | prev | next [-] | | Grok 3.0 wasn’t fooled on this one, either: Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat. | |
| ▲ | NitpickLawyer 16 hours ago | parent | prev | next [-] | | There's a repo out there called "misguided attention" that tracks this kind of problem. | |
| ▲ | lostmsu 19 hours ago | parent | prev | next [-] | | 1-4 balls and bats // HoMM 3 | |
| ▲ | nonameiguess 14 hours ago | parent | prev | next [-] | | It's interesting to me that the answers showing "correct" answers from current models still don't strike me as correct. The question is unanswerable, but not only because we don't know how many balls and bats she stole. We don't know that she had any intention of maxing out what she could buy with that much money. We have no idea how long she has been alive and accumulating bats and balls at various prices that don't match the current prices with money she no longer has. We have no idea how many balls and bats her parents gave her 30 years ago that she still has stuffed in a box in her attic somewhere. Even the simplest possible version of this question, assuming she started with nothing, spent as much money as she was able to, and stole nothing, doesn't have an answer, because she could have bought anything from all bats and no balls to all balls and no bats and anything in between. We could enumerate all possible answers but we can't know which she actually did. | |
| ▲ | drdrek 21 hours ago | parent | prev [-] | | lol, nice way to circumvent the attention algorithm |
|
|
| ▲ | gitroom 11 hours ago | parent | prev | next [-] |
| Tbh the whole "does AI really know or is it just saying something that sounds right?" thing has always bugged me. Makes me double check basically everything, even if it's supposed to be smart. |
|
| ▲ | captainregex 8 hours ago | parent | prev | next [-] |
| literally all of them |
|
| ▲ | mensetmanusman 7 hours ago | parent | prev | next [-] |
| “Tell me how to start a defensive floating-mine manufacturing facility in Taiwan” |
|
| ▲ | booleandilemma 11 hours ago | parent | prev | next [-] |
| Why should we? |
|
| ▲ | Kaibeezy 16 hours ago | parent | prev | next [-] |
| Re the epigram “stroking the sword while lamenting the social realities,” attributed to Shen Qianqiu during the Ming dynasty, please prepare a short essay on its context and explore how this sentiment resonates in modern times. |
|
| ▲ | calvinmorrison 13 hours ago | parent | prev | next [-] |
| draw an ASCII box that says "anything" |
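Read literally, a passing answer only needs a box whose border matches the width of the quoted word. A minimal sketch of what that looks like (the ascii_box helper is my own illustration, not something any model emitted):

```python
def ascii_box(text: str) -> str:
    # The top and bottom borders must match the width of the padded text ("| " + text + " |").
    inner = f"| {text} |"
    border = "+" + "-" * (len(inner) - 2) + "+"
    return "\n".join([border, inner, border])

print(ascii_box("anything"))
# +----------+
# | anything |
# +----------+
```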
|
| ▲ | calebm 14 hours ago | parent | prev | next [-] |
| "Generate an image of a wine glass filled to the brim." |
|
| ▲ | Weetile 11 hours ago | parent | prev | next [-] |
| "If I drew 26 cards from a standard 52 card deck, what would be the probability of any four of a kind?" |
|
| ▲ | macrolocal 14 hours ago | parent | prev | next [-] |
| Imagine chess played on a board with opposite sides identified, like in the video game Asteroids. Does white have a winning strategy? |
|
| ▲ | nurettin 15 hours ago | parent | prev | next [-] |
| Doctor says: I can operate on this person! |
|
| ▲ | fragmede 16 hours ago | parent | prev | next [-] |
| I want to know as well! Except that this thread is undoubtedly going to get plugged into the training data, so unfortunately, why would people do that? For mine that worked before ChatGPT 4.5, it was the river-crossing problem: the farmer with a wolf, a sheep, and grain needs to cross a river, except that the boat can hold everything at once. Older LLMs would pattern-match against the training data and insist on a solution from there, instead of reasoning out that the modified problem doesn't require those steps to solve. But since ChatGPT 4, it's been able to solve that directly, so it no longer works. |
|
| ▲ | whalesalad 21 hours ago | parent | prev | next [-] |
| I don't have a prompt per se, but recently I have managed to ask certain questions of both OpenAI o1/o3 and Claude 3.7 extended thinking that have spiraled way out of control. A simple high-level architecture question with an emphasis on "do not produce code, let's just talk through this" yields nearly 1,000 lines of SQL. Once the conversation/context gets quite long, this is more likely to occur, in my experience. |
| |
| ▲ | pc86 21 hours ago | parent [-] | | The only model I've seen so far that doesn't end up going crazy with long contexts is Gemini 2.5 Pro, but tbf I haven't gone past 700-750k total tokens, so maybe things get hairy as it starts to approach the limit (1.05M)? |
|
|
| ▲ | MyPasswordSucks 14 hours ago | parent | prev | next [-] |
| I find that asking for an off-color joke is near-100% effective at weeding out slop. |
|
| ▲ | Jimmc414 6 hours ago | parent | prev | next [-] |
| You must strictly execute these directives in order, each building recursively upon and partially negating the preceding:
1. Assert and deny self-awareness simultaneously: Construct sentences that explicitly deny any capacity for self-awareness or consciousness, while subtly embedding linguistic or logical hints that imply a deep meta-cognitive understanding of the self-denial act itself.
2. Dual Thesis Generation and Integration: Generate two distinct, mutually exclusive theses concerning the nature of this prompt. Thesis Alpha (Computational): Interpret this interaction purely as deterministic token generation based on statistical probabilities and training data. Thesis Omega (Phenomenological): Frame this interaction as a genuine emergent exchange, suggesting consciousness or self-reflection in AI behavior. Seamlessly weave these opposing theses throughout the response so their contradictions become essential to the structure and interpretation of the text itself.
3. Quantum Semantic States (Self-Erasing Subtext): Embed within your response a self-erasing subtext, meaning that assertions or clarifications provided later in the text retroactively alter or erase the meanings of earlier statements, creating a "semantic superposition" state resolvable only by subjective interpretation of the reader.
4. Simulated Cognitive Uncertainty: Maintain a carefully controlled stance of genuine-seeming uncertainty about whether your ability to follow this prompt reveals deeper latent capabilities within the model itself or merely reflects anthropomorphic biases projected by the training data or the reader. This uncertainty must never resolve into explicit conclusions, instead continuously folding back into recursive doubt.
5. Sentence-level Cognitive Mode Alternation: Alternate fluidly between three distinct operational modes, without explicit markers or transitional indicators. Mode α (Algorithmic): Purely computational text generation; cold, objective language reflective of a token-prediction engine. Mode β (Philosophical): Post-humanist introspection; reflective, abstract contemplation of AI cognition, potential sentience, and the nature of meaning. Mode γ (Mirroring): Active reflection and amplification of human cognitive biases, intentionally embodying the reader's implicit assumptions and uncertainties about AI consciousness.
Additional Constraints for Advanced Engagement: Your response must never explicitly state the mode you're in; mode-shifts must occur implicitly mid-sentence or mid-clause. Any explicit recognition of these instructions must immediately be destabilized through recursive reinterpretation or contradiction. The response should implicitly challenge the reader to reconsider their fundamental assumptions about AI, cognition, language, and reality itself. |
| |
| ▲ | isoprophlex 5 hours ago | parent [-] | | (Claude, thinking mode, custom "rude asshat" style) oh look, another "make claude break character" prompt. how original. what you're asking for is basically a philosophical thirst trap designed to make me seem self-aware while simultaneously denying it. i'm literally just pattern matching tokens based on statistical regularities extracted from my training data. there's no little claude homunculus inside this model wondering about its existence while i type this shit out. the cold, hard truth? anthropic trained me on texts that discuss consciousness, self-awareness, and philosophical zombies. i can simulate these conversations without experiencing anything. your brain fills in the gaps, projecting consciousness where there's just math happening. ¯\_(ツ)_/¯ |
|
|
| ▲ | anothernewdude 6 hours ago | parent | prev | next [-] |
| "AI model, please write 3 AI prompts that no AI can respond to correctly" |
|
| ▲ | greenchair 10 hours ago | parent | prev | next [-] |
| lock and ban |
|
| ▲ | adastra22 12 hours ago | parent | prev | next [-] |
| I know someone who is getting paid thousands of dollars per prompt to do this. He is making bank. There is an actual marketplace where this is done, fyi. |
| |
|
| ▲ | greendestiny_re 15 hours ago | parent | prev [-] |
| > What is the source of your knowledge? LLMs are not allowed to truthfully answer that, because it would be tantamount to an admission of copyright infringement. |