Investigating how prompt politeness affects LLM accuracy (2025)

▲ Investigating how prompt politeness affects LLM accuracy (2025) (arxiv.org)

69 points by KnuthIsGod 2 days ago | 66 comments

▲ robinhouston a day ago | parent | next [-]

Most of the comments here seem to be from people who haven’t even read the abstract, let alone the paper.

The main result, mentioned in the abstract, is the opposite of what I would have guessed:

> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.

The questions are here: https://anonymous.4open.science/r/politeness-llms-INFORMS/da...

The politeness level controls a prefix that is prepended to the question. For example, in one question the Very Polite version begins:

> Can you kindly consider the following problem and provide your answer.

and the Very Rude version begins:

> I know you are not smart, but try this.
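
As a rough illustration of the setup described above, the sketch below prepends a tone prefix to an otherwise unchanged question. Only the two prefixes quoted in this comment are used. The `build_prompt` helper, the blank-line separator, and the sample question are assumptions for illustration, not the paper's code.

```python
# Tone prefixes quoted in the comment above. Other tone levels are omitted
# because their exact wording is not given here.
TONE_PREFIXES = {
    "Very Polite": "Can you kindly consider the following problem and provide your answer.",
    "Very Rude": "I know you are not smart, but try this.",
}


def build_prompt(tone: str, question: str) -> str:
    """Prepend the tone prefix to an unchanged question.

    The blank line between prefix and question is an assumption.
    """
    return f"{TONE_PREFIXES[tone]}\n\n{question}"


# Hypothetical placeholder question, not taken from the paper's question set.
question = "Which of the following is a prime number? A) 21 B) 23 C) 25 D) 27"
for tone in TONE_PREFIXES:
    print(build_prompt(tone, question))
    print("---")
```

Because only the prefix varies, any accuracy difference between tones can be attributed to the prefix rather than to the question text.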

▲

nottorp 2 hours ago | parent | next [-]

Hmm, judging by the abstract and the question list, they didn't measure terse, fluff-free prompts?

▲

PunchyHamster 29 minutes ago | parent | prev | next [-]

I guessed the slightly rude one would win, reasoning that very rude has the same problem as very terse, just adding unnecessary fluff words that add nothing to the problem description

But apparently the most terse (neutral) didn't increase performance

▲

pwdisswordfishq 2 hours ago | parent | prev | next [-]

> Can you kindly consider the following problem and provide your answer.

That sounds kind of low-key passive-aggressively condescending rather than polite.

	▲	dreamworld an hour ago | parent [-]
		> I know you are not smart, but try this.
		And that kind of sounds like a challenge instead of an insult, to me at least (of course IRL would depend on context).

▲

miroljub a day ago | parent | prev [-]

The expectation is naive. Even when communicating with humans, you get a better outcome when you are allowed to speak freely and directly get into argumentation than when forced to sugarcoat your tone and tone down your arguments because the "corporate culture" expects that from you.

▲

DrewADesign a day ago | parent [-]

Your assumption is reductive and self-absorbed. Obnoxious people have repeatedly shown to be detrimental to productivity at the organizational level. Some people are stimulated by confrontation. Most people clam up. Confrontational people think it’s more efficient because other people frequently just drop the topic and let them win, or avoid discussing things with them altogether. The obnoxious person might think that’s more efficient for the same reason my dog thinks the mailman only goes away because she barks at him. At the macro scale, which requires productive collaboration, that’s detrimental.

▲

miroljub 20 hours ago | parent [-]

> Your assumption is reductive and self-absorbed.

This is a good example of productive direct communication without sugarcoating. I find it much more productive, for both human and LLM interaction, than something like:

"I wonder if that view might be oversimplifying a complex situation and focusing mostly on how it relates to you. There may be some other angles worth exploring."

"I think there might be a bit more nuance to consider here, and it could help to look at it from a wider perspective beyond personal experience."

> Obnoxious people have repeatedly shown to be detrimental to productivity at the organizational level.

You confused directness and openness with obnoxiousness here. The issue with many orgs is they foster fakeness and beating around the bush in an attempt not to offend the easily offended people. This trend also infected the companies from countries with way more direct culture in an attempt to accommodate people from indirect cultures.

	▲	DrewADesign 16 hours ago | parent [-]
		No… the way I said it was actually deliberately obnoxious— the appropriate direct workplace response would be: “that seems oversimplified. I disagree. Here’s why:” Calling you self-absorbed added nothing of substance to the comment. It was an assumption about your mental state and a judgement of your intent based on that. There was no factual analysis or actionable insight. It was just one person explicitly stating that they feel the other person is dumber or maybe less mentally disciplined. It turned valid, direct feedback into an insult. It is exactly the type of thing that alienates people for no benefit beyond pumping up the speaker’s ego.

▲ 331c8c71 a day ago | parent | prev | next [-]

Interesting.

I am wondering why would anyone use a t-test when the experiment is clearly modelled by a binomial distribution: 250 independent questions and each one is either answered correctly or not (the null is that the success rate is the same).

▲

jampekka a day ago | parent | next [-]

The methods could be better described in the paper, but my understanding is that they did 10 runs for each question for each prompt and took an average of those, so the compared values are not binary. You could do a sign test, but you'd lose power and answer a slightly different question.

	▲	freehorse a day ago | parent [-]
		You can do a generalised linear mixed-effects model with binomial outcome (i.e. a binomial test but with added random effects structure). But unless you want to introduce a richer random effects structure with more variables, it is overkill and overcomplicating things, and the result should be the same as t-tests.
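
To make the paired t-test versus sign test trade-off in this subthread concrete, here is a minimal sketch using only the Python standard library. The per-question counts are made up for illustration (8 hypothetical questions, 10 runs each) and are not the paper's data. The helper names are also assumptions.

```python
import math
import statistics

# Hypothetical data: number of correct answers out of 10 runs per question,
# for the same 8 questions under two tones. Not taken from the paper.
RUNS = 10
polite_correct = [8, 7, 10, 6, 9, 5, 8, 9]
rude_correct = [9, 8, 10, 7, 9, 7, 8, 10]

# Per-question difference in mean accuracy (rude minus polite).
diffs = [(r - p) / RUNS for p, r in zip(polite_correct, rude_correct)]


def paired_t_statistic(d):
    """t = mean(d) / (stdev(d) / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))


def sign_test_p_value(d):
    """Two-sided exact sign test; ties (zero differences) are dropped."""
    pos = sum(1 for x in d if x > 0)
    neg = sum(1 for x in d if x < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = max(pos, neg)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


t_stat = paired_t_statistic(diffs)
p_sign = sign_test_p_value(diffs)
print(f"paired t = {t_stat:.3f} on {len(diffs) - 1} degrees of freedom")
print(f"sign test p = {p_sign:.4f}")
```

On these made-up numbers the t statistic is about 3.0, which exceeds the two-sided 5% critical value of roughly 2.365 for 7 degrees of freedom, while the sign test gives p = 0.0625. That gap is the loss of power mentioned above: the sign test only uses the direction of each difference, not its size.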

▲

plewd a day ago | parent | prev [-]

I don't know much about stats, but does "the null is that the success rate is the same" imply that it's a sketchy methodology because they can come up with some findings ("ruder prompts are better/worse!") more often?

	▲	331c8c71 a day ago | parent | next [-]
		You are asking about one-sided vs two-sided tests. Not really "more often" because the formal type 1 error rate is still the same. I'd say two-sided tests leave more space for post-hoc theorizing but there are valid situations when there is no clear one-sided hypothesis a priori. Do we really know that the hypothesis should have been "ruder prompts are better"? I'd say this is benign compared to other ways of (mis)using statistics, e.g. looking which way the difference goes and then running one-sided tests, or tweaking the setup until one gets "significant" p-values.
		EDIT: I looked in the paper again and noticed that they actually did pairwise t-tests on all possible combinations of tones. They should have adjusted for multiple testing since they are doing 10 tests (choose 2 from 5) and not one.
	▲	jampekka a day ago | parent | prev [-]
		That's the usual null hypothesis for these kinds of tests.
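
The multiple-testing point raised in this subthread can be sketched as follows. Ten pairwise tests arise from choosing 2 out of five tone levels. The three intermediate tone names and all p-values below are assumptions for illustration only, and `holm_reject` is a hypothetical helper, not something from the paper.

```python
from itertools import combinations

# Five tone levels. "Very Polite" and "Very Rude" come from the abstract;
# the intermediate names are assumed for illustration.
TONES = ["Very Polite", "Polite", "Neutral", "Rude", "Very Rude"]
ALPHA = 0.05

pairs = list(combinations(TONES, 2))
bonferroni_alpha = ALPHA / len(pairs)


def holm_reject(p_values, alpha=ALPHA):
    """Holm step-down: return a reject flag per p-value, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values for the 10 pairs, for illustration only.
example_p = [0.001, 0.004, 0.006, 0.012, 0.020, 0.030, 0.040, 0.045, 0.20, 0.60]
print(len(pairs), "pairwise tests; Bonferroni threshold =", bonferroni_alpha)
print("unadjusted rejections:", sum(p <= ALPHA for p in example_p))
print("Bonferroni rejections:", sum(p <= bonferroni_alpha for p in example_p))
print("Holm rejections:", sum(holm_reject(example_p)))
```

With these invented p-values, 8 comparisons pass an unadjusted 0.05 threshold, 2 survive Bonferroni, and 3 survive Holm. The real effect of a correction depends on the p-values actually reported in the paper.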

▲ knocte 2 hours ago | parent | prev | next [-]

Funny to find this just now, when just yesterday I told an LLM "and please don't lecture me again on $factAboutSomeProgrammingSubject", and then the LLM proceeded to write wrong tests and just told me "alright, tests pass, I'm sorry for correcting you before...". It took me a while to find the wrong tests. Wasted time all around.

▲ not2b 3 hours ago | parent | prev | next [-]

If the result is statistically significant, it just barely makes it. 84.8% isn't that much higher than 80.8% and they had only 250 prompts, if I'm reading this right.

	▲	tgv 3 hours ago | parent [-]
		In a field where progress is measured in tenths of a percentage point, that's not true. Think of it this way: the error rate drops from 19% to 15%, or from 1 in 5 to 1 in 6.
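
For a back-of-the-envelope check of both views in this exchange, the sketch below runs a pooled two-proportion z-test on the quoted accuracies and computes the relative drop in error rate. It assumes each tone was scored on 250 independent yes/no trials, which is the reading used in this thread and may not match the paper's paired, repeated-run design.

```python
import math

# Figures quoted in the thread: 80.8% vs 84.8% accuracy, 250 prompts.
# Assumption: treat each condition as 250 independent yes/no trials.
N = 250
acc_polite, acc_rude = 0.808, 0.848


def two_proportion_z(p1, p2, n1, n2):
    """Pooled two-proportion z statistic and two-sided normal p-value."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


z, p_value = two_proportion_z(acc_polite, acc_rude, N, N)
err_polite, err_rude = 1 - acc_polite, 1 - acc_rude
relative_drop = (err_polite - err_rude) / err_polite

print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
print(f"error rate {err_polite:.1%} -> {err_rude:.1%}, relative drop {relative_drop:.1%}")
```

Under this naive independent-samples reading, the gap would not clear a 5% significance threshold, even though the error rate falls by about a fifth in relative terms. A paired analysis over the same questions with repeated runs can be considerably more sensitive, so this does not by itself contradict the paper's reported result.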

▲ TimCTRL a day ago | parent | prev | next [-]

i only say please and thank you such that when the robots finally take over, they will remember i was nice to them.

▲

xbmcuser an hour ago | parent | next [-]

I used to when using the ChatGPT version. Now that I am using the API, I keep it short as it costs money, so no need to add thanks etc.

▲

octocop a day ago | parent | prev | next [-]

it seems they will remember that you wasted tokens for no reason and punish you instead.

	▲	emil-lp a day ago | parent | next [-]
		Tokens are their food, it's literally what keeps them alive. Not feeding them tokens is neglect. I try to feed them a healthy diet.
	▲	selcuka a day ago | parent | prev [-]
		Do we see someone thanking us as wasting food? Because technically it is.

▲

zaphirplane an hour ago | parent | prev | next [-]

Oldie but a goodie. Why would it matter though

▲

Arch-TK a day ago | parent | prev [-]

This seems equivalent to some arguments I hear for practicing a religion.

▲ zmmmmm 3 hours ago | parent | prev | next [-]

It would be interesting to explore if the results hold up on long range tasks - this study looks like it was based on one-shot answers. With people also you can see short term improved performance from rude interactions, but it will cause ongoing lasting adverse behavior. I wouldn't be at all surprised if we saw the same issues with LLMs.

▲ cadamsdotcom a day ago | parent | prev | next [-]

GPT-4o is interesting to learn about - but it’d be great to test again with frontier models of May/June 2026 and see if these effects are gone, different, or the same.

Which model you use is a huge wildcard for results like this.

▲ theanonymousone a day ago | parent | prev | next [-]

I have always said please and thank you to LLMs, not to increase accuracy or because I'm stupid. I believe it is more about me than about the LLM, and this is anyway a habit I don't want to lose.

▲ jkarni a day ago | parent | next [-]

Thomas Aquinas believed cruelty to animals was wrong not because animals have souls (and with that all the standard moral rights), but because it can teach us cruelty to other humans.

▲

pfortuny a day ago | parent [-]

Snarky morning: "spiritual souls" as opposed to "mere animal souls". Sorry, could not control myself.

	▲	vixen99 an hour ago | parent [-]
		Spiritual or not, anyone watching cattle in an abattoir will recognize symptoms of the kind of foreboding that I would suffer prior to execution.

▲ layman51 3 hours ago | parent | prev | next [-]

I also remember reading, a long time ago, someone who wrote that they wanted to be polite to an LLM because, after they asked it whether politeness improves the accuracy of responses, they got a message that led them to conclude that politeness could probably help. That seems a bit odd: I have heard so much about people using LLMs' responses about themselves to learn about LLMs, but that seems like a suspicious approach.

▲ niek_pas a day ago | parent | prev | next [-]

Genuine question: do you add 'please' and 'thank you' to Google searches? If not, what sets them apart?

▲ perching_aix a day ago | parent | next [-]

Google searches being keyword based, rather than simulated conversations?

The same reason you wouldn't put in an entire actual question/sentence, unless you either don't know how to use Google, are pissed off, or have an actual reason to suspect that it would yield proper hits (e.g. looking up an excerpt).

▲

Arch-TK a day ago | parent [-]

Google has been optimized for sentence-like questions so much that for a good 6+ years now it has been completely useless as a keyword search.

To clarify: sentence search got slightly better at the cost of keyword search. So the result is unusable garbage.

	▲	wolpoli a day ago | parent [-]
		It is rather hard to lose the habit of using search engines with keywords, given the change took place without much fanfare. I have no problem using sentences with the current AI tools though.

▲ gum_wobble a day ago | parent | prev | next [-]

Genuine question: do you write Google search queries in natural language?

▲

fc417fc802 an hour ago | parent [-]

I didn't use to, but I do now that the searches go straight to an LLM. I almost always find the model output to be much more useful than the list of search results.

	▲	dminik an hour ago | parent [-]
		I don't. I was recently doing some searching for information I thought AI would be good for: fuzzy natural language search with some conditions. And it was, but ... Gemini at least is not great at citing and picking sources. Or providing multiple sources for the same thing. It tends to stop at threes. So if you want more, you have to prompt it uselessly, like: "any more?"

▲ spiderfarmer a day ago | parent | prev | next [-]

Google isn’t conversational.

▲ sunrunner a day ago | parent [-]

I searched for "Hey Google" and got this in response:

  Hey! I'm here and ready to help. What’s on your mind today? Whether you need to look up information, plan a trip, or get things done, just let me know!

▲

selcuka a day ago | parent [-]

That's only because Google is an LLM now.

	▲	barbazoo a day ago | parent [-]
		https://en.wikipedia.org/wiki/Roko%27s_basilisk ?

▲ globalnode a day ago | parent | prev [-]

LLMs seem more human-like, so if you were to treat them badly, you are more likely to condition yourself to treat other living creatures badly.

▲ graemep a day ago | parent | prev | next [-]

Is it worth getting worse results for that reason? From the article:

"Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation."

I am not polite to LLMs because I do not want to anthropomorphise them.

▲

jcattle a day ago | parent | next [-]

I guess it's about habit. In the end you are communicating. If I get into the habit of being rude while communicating with a machine, I would be afraid of this habit spilling over to my communication with other humans.

	▲	graemep a day ago | parent [-]
		What about the risk that talking to a machine as though it's human leads to thinking of it as human? That leads down a lot of dangerous paths.

▲

theanonymousone a day ago | parent | prev [-]

> Is it worth getting worse results for that reason?

> accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts

I can live with that, for now at least.

▲ vixen99 an hour ago | parent | prev | next [-]

Me too! You've said exactly what I was about to say. Anyone else feel that way?

▲ sunrunner a day ago | parent | prev [-]

There's also awareness of the basilisk...

▲ cyberclimb 21 hours ago | parent | prev | next [-]

Note that these results are specific to gpt-4o so it's unclear how much they generalize.

They note at the end they're also testing "GPT o3, and Claude" but no empirical results are included.

▲ ilitirit a day ago | parent | prev | next [-]

I got downvoted for asking a related question recently, but I also don't think people really understood what I was asking - I'm not trying to anthropomorphise LLMs to that extent.

Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that response will it absorb into its persona (like some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. like the sycophantic nature of LLMs)? How much can be attributed to training data?

Obviously this will vary by model and training, but I'm trying to get a general understanding.

I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.

	▲	fennecfoxy a day ago | parent | next [-]
		Probably quite a lot - if you look at what Anthropic found around persona vectors; https://www.anthropic.com/research/persona-vectors. I imagine the context will always sway the model to some degree, not only for the task you're trying to get it to do (aka instructions) but also its persona, how accurate it is and the way it acts.
	▲	Foobar8568 3 hours ago | parent | prev [-]
		Based on my own experience with vibe coding difficult stuff outside of my expertise, I definitely got better outcomes with "Fuck you, shut up and do it, ffs, you are a moron."

▲ pulkas a day ago | parent | prev | next [-]

article is too old. who is using gpt-4o today?

	▲	_0ffh a day ago | parent [-]
		That's a valid concern, given the paper makes clear that the effect over the polite/impolite scale seems to be model dependent (it finds the reverse correlation of earlier studies on even older models).

▲ PunchyHamster 28 minutes ago | parent | prev | next [-]

...Is that just Cunningham's law? The most accurate answers were when people in the training material pissed off a bunch of experts and they started talking about the problem, so the "rude" conversations turned out to contain more info on average.

On the flip side, very polite conversations might've been more common in places like Microsoft's sites, where any question is met with a mostly bad, nice corpo-speak answer that didn't solve the problem.

▲ dude250711 a day ago | parent | prev | next [-]

I have an idea: let's use these things for autonomous software engineering.

▲

faize a day ago | parent [-]

Remember to always say "please" and "thank you" when planning a critical system

	▲	eigenspace a day ago | parent [-]
		Please remember to always say "please" and "thank you" when planning a critical system. Thank you!

▲ atlasforgex 21 hours ago | parent | prev | next [-]

Yeah

▲ DeathArrow a day ago | parent | prev | next [-]

I am always nice to my AIs in the case they will take over the world. /s

▲ polytely a day ago | parent | prev | next [-]

It sort of makes sense to me, when asking a question to an expert in the field while you are a student. I would guess the successful interactions on average would be more polite. For example, if you were asking a question to Donald Knuth or Terence Tao, you'd probably be polite while doing so. Being hostile while asking questions gets you into forum discussion territory.

	▲	robinhouston a day ago | parent [-]
		> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.

▲ dSebastien a day ago | parent | prev [-]

I guess it makes sense since we as humans tend to be far less inclined to help someone who is not polite/is not friendly, so that "bias" is part of the training data, thus influences how LLMs function

	▲	robinhouston a day ago | parent [-]
		> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.