larsiusprime 4 days ago

I find ChatGPT to be great at research too-but there are pathological failure modes where it is biased to shallow answers that are subtly wrong, even when definitive primary sources are readily available online:

https://www.fortressofdoors.com/researchers-beware-of-chatgp...

ants_everywhere 4 days ago | parent | next [-]

This isn't really what you described. You have an opinion that conflicts with the research literature. You published a blog about that opinion, and you want ChatGPT to accept your view.

You're grinding a political axe, and I don't think you're in a position to objectively assess whether ChatGPT failed in this case.

larsiusprime 4 days ago | parent | next [-]

What are you talking about? There are verifiable primary sources that ChatGPT was not citing. There are direct primary historical sources that lay out the full budget of the historical German colony in extreme detail and directly contradict assertions made in the Silagi paper. That's not a matter of opinion; that's a matter of verifiable fact.

Also, what “axe” am I grinding? The findings are specifically inconvenient for my political beliefs, not confirming my priors! My priors would be flattered if Silagi were correct about everything, but the primary sources definitively prove he's exaggerating.

> You published a blog about that opinion, and you want ChatGPT to accept your view.

False, and I address this multiple times in the piece. I don’t want ChatGPT to mindlessly agree with me, I want it to discover the primary source documents.

ants_everywhere 4 days ago | parent [-]

From your blog you appear to be a Georgist or inspired by Georgist socialism. And given that you appear to have a business and blog related to these subjects, you give the impression that you're a sort of activist for Georgism. I.e., not just researching it but trying to advance it.

So just zooming out, that's not the right sort of setup for being an impartial researcher. And in your blog post, your disagreements come off to me as wanting a sort of purity with respect to Georgism that I wouldn't expect to be reflected in the literature.

I like Kant, but it would be a bit like me saying ChatGPT was fundamentally wrong for considering John Rawls a Kantian, because I can point to this or that paper where he diverges from Kant. I could even write a blog post describing this and pointing to primary sources. But Rawls is considered a Kantian, and for good reason, and it would (in my opinion) be misleading for me to call it a major failure mode that ChatGPT didn't take my view on my pet subject as seriously as I wanted.

larsiusprime 4 days ago | parent [-]

You misunderstand. I'm indeed a Georgist, and I discovered that a popular Georgist narrative was exaggerated! The findings of the historically verifiable primary source documents contradicted a prevailing narrative based on the Silagi paper. The Silagi paper is pro-Georgist! But it's exaggerated!

The literature — the primary source documents — do not in fact support a maximalist Georgist case! This is what I have been trying to say!!!

You are accusing me of the exact opposite thing I’m arguing for!!! The historical case the primary sources show is inconvenient for my political movement!

The failure of ChatGPT is not that it disagrees with any opinion of mine, but that it does not surface primary source documents. That's the issue.

It's baffling to be accused of confirmation bias when I point out research findings that go against what would be maximally convenient for my own cause.

ants_everywhere 4 days ago | parent [-]

To clarify, I am not accusing you of that. I am saying you are treating certain distinctions as more important than the rest of the literature does, and concluding that the literature is erroneous. For example, whether a given policy is Georgist.

But often people who believe in a given doctrine will see differences as more important than they objectively are. For example, just to continue with socialism, it's common for socialist believers to argue that this or that country is or isn't socialist in a way that disagrees with mainstream historians.

I'm sure there are other examples, such as people disagreeing about which bands are punk or hardcore. A music historian would likely cast a wide net; fans who don't listen to many other types of music might cast a very narrow one.

larsiusprime 4 days ago | parent [-]

Okay, so let me break it down for you:

The Silagi paper makes a factual claim. The Silagi paper claims that there was only one significant tax in the German colony of Kiatschou, a single tax on land.

The direct primary sources reveal that this is not the case. There were multiple taxes, most significantly large tariffs. Additionally there were two taxes on land, not one -- a conventional land value tax, and a "land increment" or capital gains tax.

These are not minor distinctions. These are not matters of subjective opinions. These are clear, verifiable, questions of fact. The Silagi paper does not acknowledge them.

ChatGPT, in the early trials I graded, does not even acknowledge the German primary sources. You keep saying that I am upset it doesn't agree with me.

I am saying the chief issue is that ChatGPT does not even discover the relevant primary sources. That is far more important than whether it agrees with me.

> For example, just to continue with socialism, it's common for socialist believers to argue that this or that country is or isn't socialist in a way that disagrees with mainstream historians.

Notice you said "historians." Plural. I expect a proper researcher to cite more than ONE paper, especially if the other papers disagree, and even if it has a preferred narrative, to at least surface to me that there is in fact disagreement in the literature, rather than to just summarize one finding.

Also, if the claims are being made about a piece of German history, I expect it to cite at least one source in German, rather than to rely entirely on one single English-language source.

The chief issue is that ChatGPT over-cites one single paper and does not discover primary source documents. That is the issue. That is the only issue.

> I am saying you are seeing distinctions as more important than the rest of the literature and concluding that the literature is erroneous.

And I am saying that ChatGPT did not in fact read the "rest of the literature." It is literally citing ONE article, and other pieces that merely summarize that same article, rather than all of the primary source documents. It is not in fact giving me anything like an accurate summary of the literature.

I am not saying "The literature is wrong because it disagrees with me." I am saying "one paper, the only one ChatGPT meaningfully cites, is directly contradicted by the REST of the literature, which ChatGPT does not cite."

A truly "research grade" or "PhD grade" intelligence would at the very least be able to discover that.

ants_everywhere 4 days ago | parent | next [-]

I think we’re talking past each other a bit. My concern is that your personal assessment of whether the tax is "significant" is being treated as settled fact. That’s the same kind of issue I flagged earlier. Reasonable people can disagree here without that disagreement implying a "pathological failure."

I hear you that this is about finding sources, but even perfect coverage of primary sources wouldn't remove the need for judgment. We'd still have to define what counts as “Georgist,” “inspired by George,” and “significant” as a tax. Those are contestable choices. What you have is a thesis about the evidence (potentially a strong one), but it isn't an indisputable fact.

On sourcing: I’m aware ChatGPT won’t surface every primary source, and I’m not sure that should be the default goal. In many fields (e.g., cancer research), the right starting point is literature reviews and meta-analyses, not raw studies. History may differ, but many primary sources live offline in archives, and the digitized subset may not be representative. Over-weighting primary materials in that context can mislead. Primary sources also demand more expertise to interpret than secondary syntheses; Wikipedia itself cautions about this: https://en.wikipedia.org/wiki/Wikipedia:Identifying_and_using...

To be clear, I’m not saying you’re wrong about the tax or that Silagi is right. I’m saying that framing this as a “pathological failure” overstates the situation. What I see is a legitimate disagreement among competent researchers.

jazzyjackson 4 days ago | parent | prev | next [-]

Reminds me of my pet peeve with algorithmic playlists: if I ask Siri or Alexa for bossa nova, all I get is different covers of "Girl from Ipanema", since that's the most-played song on every bossa nova album.

eru 4 days ago | parent | prev | next [-]

Hmm, I suspect that if ChatGPT paid more attention to the German sources, it would perhaps find the supposedly right answer?

I wonder if asking ChatGPT in German would make a difference.
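
That would be easy to test. A minimal sketch, assuming the standard openai Python client with an API key in the environment; the question wording is just my example and the model name is a placeholder:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Ask the same question in English and in German, then compare which
    # sources each answer cites.
    questions = [
        "Which taxes were levied in the German colony of Kiatschou?",
        "Welche Steuern wurden in der deutschen Kolonie Kiatschou erhoben?",
    ]

    for q in questions:
        resp = client.chat.completions.create(
            model="gpt-5",  # placeholder; substitute whichever model you use
            messages=[{"role": "user", "content": q}],
        )
        print(resp.choices[0].message.content)
        print("---")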

typpilol 4 days ago | parent | prev [-]

Yeah, this isn't really a ChatGPT problem so much as a source credibility problem, no?

larsiusprime 4 days ago | parent [-]

It’s mostly that it was not citing verifiable primary source documents, available online, the way I would expect an actual researcher investigating this question to. This is relevant when it is billed as "Research Grade" or "PhD" level intelligence. I expect a PhD-level researcher to find the German-language primary sources.

eru 4 days ago | parent [-]

Especially since ChatGPT speaks fluent German.

jbm 4 days ago | parent | prev | next [-]

Yes, this is very much my experience too.

Switching to GPT-5 Thinking helps a little, but it often misses things that it wouldn't have when I was using o3 or o1.

As an example, I asked it if there were any incidents involving Botchan in an onsen. This is a text that is readily available and must have been in its training data. In the book, Botchan goes swimming in the onsen, and is then humiliated when, the next time he comes back, there is a sign saying "No swimming in the onsen".

GPT-5 gives me this, which is subtly wrong:

> In the novel, when Botchan goes to Dōgo Onsen, he notes the posted rules of the bath. One of them forbids things like:
> “No swimming in the bath.” (泳ぐべからず)
> “No roughhousing / rowdy behavior.” (無闇に騒ぐべからず)
> Botchan finds these signs funny because he’s exactly the sort of hot-headed, restless character who might be tempted to splash around or make noise. He jokes in his narration that it seems as though the rules were written specifically to keep people like him out.

Incidentally, Dogo Onsen still has the "No swimming" sign, or it did when I went 10 years ago.

black_knight 4 days ago | parent [-]

I feel like the value of my Plus subscription went down when they released GPT-5; it feels like a downgrade from o3. But of course, OpenAI not being open, there is no way for me to know.

jbm 3 days ago | parent [-]

Likewise.

I'll play devil's advocate and say that I think the Codex-cli included with the Plus subscription is pretty good, quality-wise. However, after using it, it suddenly told me I couldn't use it for a week, without warning. Claude is a bit more reasonable there.

simianwords 4 days ago | parent | prev | next [-]

I found your article interesting and it is relevant to the discussion. To be honest, while I think GPT could have performed better here, I think there is something to be said about this:

There is value in pruning the search tree, because the deeper nodes are usually not reputable. I know you have cause to believe that "Wilhelm Matzat" is reputable, but I don't think that can be assumed in general. If you were to force GPT to blindly accept counterpoints from everyone, the debate would never end. There has to be a pruning point at which GPT accepts this tradeoff: a less reputable or less well-known source may occasionally have a correct point that gets missed, in exchange for being misled less often by incorrect analyses from obscure sources.

You could go infinitely deep into any analysis and you will always have seemingly correct points on both sides. I think it is valid for GPT to prune the search at a point where it converges to what society at large believes. I'm okay with this tradeoff.
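
To make that tradeoff concrete, here's a toy sketch of this kind of pruning; the scoring fields and thresholds are entirely made up, not anything ChatGPT actually does:

    from dataclasses import dataclass

    @dataclass
    class Source:
        title: str
        reputability: float  # 0.0-1.0, however the system scores it
        depth: int           # hops from the initial query

    def prune(sources, min_reputability=0.6, max_depth=2):
        # Keep only sources that are both reputable enough and shallow enough.
        # The tradeoff: a strict cutoff occasionally drops an
        # obscure-but-correct source (say, a Matzat paper) in exchange
        # for rejecting unreliable analyses far more often.
        return [s for s in sources
                if s.reputability >= min_reputability and s.depth <= max_depth]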

larsiusprime 4 days ago | parent [-]

My contention is that if it's going to just give me a Wikipedia summary, I can do that myself. I just have greater expectations of "PhD" level intelligence.

If we're going to claim it is PhD level, it should be able to do "deep" research AND think critically about source credibility, just as a PhD would. If it can't do that, they shouldn't brand it that way.

Also it’s not like I’m taking Matzat’s word for anything. I can read the primary source documents myself! He’s also hardly an obscure source, he’s just not listed on Wikipedia.

simonw 4 days ago | parent [-]

I suggest ignoring the "PhD level intelligence" marketing hype.

magicalist 4 days ago | parent [-]

A couple of times, when I've gotten an answer sourced basically only from Wikipedia and Stack Overflow, I've thrown in a comment about its "PhD level intelligence" when telling it to dig deeper, and it's taken it pretty well ("fair jab :)"), which is amusing. I guess that marketing term has been around long enough to be in GPT-5's training data.

Helmut10001 4 days ago | parent | prev | next [-]

More recently, I find ChatGPT increasingly unreliable. It makes things up in almost every second answer, forgets context, or is just downright wrong. Maybe it's that these days I'm more and more used to dumping huge texts into the prompt for context, as AI Studio lets me do, and ChatGPT isn't as good with that much information. Gemini/AI Studio will stay on track even with 300k tokens consumed; it just needs a little nudge here and there.

herewegohawks 4 days ago | parent [-]

FWIW, I found things improved greatly once I turned off the memory feature of ChatGPT. My guess is that a lot of tokens were going towards trying to follow instructions from past conversations.

kmijyiyxfbklao 4 days ago | parent | prev [-]

This doesn't tell us much. I don't know why you would expect ChatGPT to do original PhD research. It's a general product that will trust already-published research. That doesn't mean that GPT-5 can't do PhD-level research when given the right sources.