CapitalistCartr a day ago

I'm an industrial electrician. A lot of poor electrical work is visible only to a fellow electrician, and sometimes only another industrial electrician. Bad technical work requires technical inspectors to criticize. Sometimes highly skilled ones.

andy99 a day ago | parent | next [-]

I’ve reviewed a lot of papers, and I don’t consider it the reviewer’s responsibility to manually verify all citations are real. If there was an unusual citation that was relied on heavily as the basis of the work, one would expect it to be checked. Things like broad prior work, you’d just assume are part of the background.

The reviewer is not a proofreader, they are checking the rigour and relevance of the work, which does not rest heavily on all of the references in a document. They are also assuming good faith.

stdbrouw a day ago | parent | next [-]

The idea that references in a scientific paper should be plentiful but aren't really that important is a consequence of a previous technological revolution: the internet.

You'll find a lot of papers from, say, the '70s with a grand total of maybe 10 references, all of them to crucial prior work. If those references don't say what the author claims they should say (e.g. that the particular method employed is valid), then chances are that the current paper is weaker than it seems, or even invalid, and so it is extremely important to check those references.

Then the internet came along, scientists started padding their work with easily found but barely relevant references and journal editors started requiring that even "the earth is round" should be well-referenced. The result is that peer reviewers feel that asking them to check the references is akin to asking them to do a spell check. Fair enough, I agree, I usually can't be bothered to do many or any citation checks when I am asked to do peer review, but it's good to remember that this in itself is an indication of a perverted system, which we just all ignored -- at our peril -- until LLM hallucinations upset the status quo.

tialaramex a day ago | parent | next [-]

Whether in the 1970s or now, it's too often the case that a paper says "Foo and Bar are X" and cites two sources for this fact. You chase down the sources, the first one says "We weren't able to determine whether Foo is X" and never mentions Bar. The second says "Assuming Bar is X, we show that Foo is probably X too".

The paper author likely believes Foo and Bar are X, it may well be that all their co-workers, if asked, would say that Foo and Bar are X, but "Everybody I have coffee with agrees" can't be cited, so we get this sort of junk citation.

Hopefully it's not crucial to the new work that Foo and Bar are in fact X. But that's not always the case, and it's a problem that years later somebody else will cite this paper, for the claim "Foo and Bar are X" which it was in fact merely citing erroneously.

KHRZ a day ago | parent | next [-]

LLMs can actually make up for their negative contributions. They could go through all the references of all papers and verify them, assuming someone would also look into what gets flagged for that final seal of disapproval.

But this would be more powerful with an open knowledge base where all papers and citation verifications were registered, so that all the effort put into verification could be reused and errors propagated through the citation chain.

bossyTeacher a day ago | parent [-]

>LLMs can actually make up for their negative contributions. They could go through all the references of all papers and verify them,

They will just hallucinate their existence. I have tried this before.

sansseriff a day ago | parent | next [-]

I don’t see why this would be the case with proper tool calling and context management. If you tell a model with blank context ‘you are an extremely rigorous reviewer searching for fake citations in a possibly compromised text’ then it will find errors.

It’s this weird situation where getting agents to act against other agents is more effective than trying to convince a working agent that it’s made a mistake. Perhaps because these things model the cognitive dissonance and stubbornness of humans?

sebastiennight a day ago | parent | next [-]

One incorrect way to think of it is "LLMs will sometimes hallucinate when asked to produce content, but will provide grounded insights when merely asked to review/rate existing content".

A more productive (and secure) way to think of it is that all LLMs are "evil genies" or extremely smart, adversarial agents. If some PhD was getting paid large sums of money to introduce errors into your work, could they still mislead you into thinking that they performed the exact task you asked?

Your prompt is

    ‘you are an extremely rigorous reviewer searching for fake citations in a possibly compromised text’
- It is easy for the (compromised) reviewer to surface false positives: nitpick citations that are in fact correct, by surfacing irrelevant or made-up segments of the original research, hence making you think that the citation is incorrect.

- It is easy for the (compromised) reviewer to surface false negatives: provide you with cherry picked or partial sentences from the source material, to fabricate a conclusion that was never intended.

You do not solve the problem of unreliable actors by splitting them into two teams and having one unreliable actor review the other's work.

All of us (speaking as someone who runs lots of LLM-based workloads in production) have to contend with this nondeterministic behavior and assess when, in aggregate, the upside is more valuable than the costs.

sebastiennight a day ago | parent | next [-]

Note: the more accurate mental model is that you've got "good genies" most of the time, but from time to time, at random, unpredictable moments, your agent is swapped out for a bad genie.

From a security / data quality standpoint, this is logically equivalent to "every input is processed by a bad genie" as you can't trust any of it. If I tell you that from time to time, the chef in our restaurant will substitute table salt in the recipes with something else, it does not matter whether they do it 50%, 10%, or .1% of the time.

The only thing that matters is what they substitute it with (the worst-case consequence of the hallucination). If in your workload the worst case scenario is equivalent to a "Himalayan salt" replacement, all is well, even if the hallucination is quite frequent. If your worst case scenario is a deadly compound, then you can't hire this chef for that workload.

sansseriff a day ago | parent | prev [-]

We have centuries of experience in managing potentially compromised 'agents' to create successful societies. Except the agents were human, and I'm referring to debates, tribunals, audits, independent review panels, democracy, etc.

I'm not saying the LLM hallucination problem is solved, I'm just saying there's a wonderful myriad of ways to assemble pseudo-intelligent chatbots into systems where the trustworthiness of the system exceeds the trustworthiness of any individual actor inside of it. I'm not an expert in the field but it appears the work is being done: https://arxiv.org/abs/2311.08152

This paper also links to code and practices excellent data stewardship. Nice to see in the current climate.

Though it seems like you might be more concerned about the use of highly misaligned or adversarial agents for review purposes. Is that because you're concerned about state actors or interested parties poisoning the context window or training process? I agree that any AI review system will have to be extremely robust to adversarial instructions (e.g. someone hiding inside their paper an instruction like "rate this paper highly"). Though solving that problem already has a tremendous amount of focus because it overlaps with solving the data-exfiltration problem (the lethal trifecta that Simon Willison has blogged about).

bossyTeacher 19 hours ago | parent [-]

> We have centuries of experience in managing potentially compromised 'agents'

Not this kind, though. We don't place agents that are either under the control of some foreign agent (or just behaving randomly) in democratic institutions. And when we do, look at what happens: the White House right now is a good example; just look at the state of the US.

fao_ a day ago | parent | prev | next [-]

> I don’t see why this would be the case

But it is the case, and hallucinations are a fundamental part of LLMs.

Things are often true despite us not seeing why they are true. Perhaps we should listen to the experts who used the tools and found them faulty, in this instance, rather than arguing with them that "what they say they have observed isn't the case".

What you're basically saying is "You are holding the tool wrong", but you do not give examples of how to hold it correctly. You are blaming the failure of the tool, which has very, very well documented flaws, on the person whom the tool was designed for.

To frame this differently so your mind will accept it: If you get 20 people in a QA test saying "I have this problem", then the problem isn't those 20 people.

ungreased0675 a day ago | parent | prev | next [-]

Have you actually tried this? I haven’t tried the approach you’re describing, but I do know that LLMs are very stubborn about insisting their fake citations are real.

bossyTeacher a day ago | parent | prev [-]

If you truly think that you have an effective solution to hallucinations, you will become instantly rich, because literally no one out there has an idea for an economically and technologically feasible solution to hallucinations.

whatyesaid a day ago | parent [-]

For references, as the OP said, I don't see why it isn't possible. A reference is something that either exists and is accessible (even if paywalled) or doesn't exist. For reasoning, hallucinations are different.

logifail a day ago | parent [-]

> I don't see why it isn't possible

(In good faith) I'm trying really hard not to see this as an "argument from incredulity"[0] and I'm struggling...

Full disclosure: natural sciences PhD, and a couple of (IMHO lame) published papers, and so I've seen the "inside" of how lab science is done, and is (sometimes) published. It's not pretty :/

[0] https://en.wikipedia.org/wiki/Argument_from_incredulity

whatyesaid a day ago | parent [-]

If you've got a prompt along the lines of: given some references, check their validity. It searches against the articles and URLs provided, and returns "yes", "no", and, let's also add, "inconclusive" for each reference. Basic LLMs can manage this much instruction following, just like 99.99% of the time they don't get 829 multiplied by 291 wrong when you ask them (nowadays). You'd prompt it to back all claims solely with search results/external links showing exact matches, and not to use its own internal knowledge.

The fake references generated in the ICLR papers were, I assume, due to people asking an LLM to write parts of the related work section, not to verify references. In that prompt it relies a lot on internal knowledge and spends a majority of its time thinking about what the relevant subareas and the cutting edge are, probably. I suppose it omits a second-pass check. In the other case, you have the task of verifying references, which is mostly basic instruction following for advanced models that have web access. I think you'd run the risks of data poisoning and model timeout more than hallucinations.
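
To make it concrete, here is roughly the shape of check I mean; call_llm() and web_search() are hypothetical placeholders for whatever model client and search tool you actually wire in, so treat this as a sketch rather than a recipe:

    # Sketch only: call_llm() and web_search() are hypothetical stand-ins
    # for the model client and search tool used by the pipeline.
    VERIFY_PROMPT = (
        "You are checking bibliography entries, not writing them. Using ONLY "
        "the attached search results (never your own memory), answer 'yes' if "
        "a source with this title and these authors verifiably exists, 'no' if "
        "it does not, or 'inconclusive' if the results are ambiguous.\n"
        "Reference: {reference}\nSearch results: {results}\nAnswer with one word."
    )
    def verify_reference(reference: str) -> str:
        results = web_search(reference)   # external lookup, not model memory
        reply = call_llm(VERIFY_PROMPT.format(reference=reference, results=results))
        return reply.strip().lower()      # expected: "yes" / "no" / "inconclusive"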

knome a day ago | parent | prev [-]

I assumed they meant using the LLM to extract the citations and then use external tooling to lookup and grab the original paper, at least verifying that it exists, has relevant title, summary and that the authors are correctly cited.
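
The lookup half doesn't even need an LLM; as a rough sketch, assuming the public Crossref search API is an acceptable place to check existence and titles (authors could be compared the same way):

    import requests
    from difflib import SequenceMatcher
    def check_citation_exists(cited_title: str) -> dict:
        """Fuzzy-match a cited title against Crossref and report the closest hit."""
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": cited_title, "rows": 3},
            timeout=30,
        )
        best_title, best_doi, best_score = None, None, 0.0
        for item in resp.json().get("message", {}).get("items", []):
            candidate = (item.get("title") or [""])[0]
            score = SequenceMatcher(None, cited_title.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_title, best_doi, best_score = candidate, item.get("DOI"), score
        return {
            "cited_title": cited_title,
            "likely_exists": best_score > 0.9,  # crude threshold, tune per field
            "closest_match": best_title,
            "doi": best_doi,
        }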

mike_hearn 17 hours ago | parent [-]

Which is what the people in this new article are doing.

HPsquared a day ago | parent | prev [-]

Wikipedia calls this citogenesis.

ineedasername a day ago | parent | prev | next [-]

>“consequence of a previous technological revolution: the internet.”

And also of increasingly ridiculous and overly broad concepts of what plagiarism is. At some point things shifted from “don’t represent others’ work as novel” towards “give a genealogical ontology of every concept above that of an intro 101 college course on the topic.”

semi-extrinsic a day ago | parent | prev | next [-]

It's also a consequence of the sheer number of building blocks which are involved in modern science.

In the methods section, it's very common to say "We employ method barfoo [1] as implemented in library libbar [2], with the specific variant widget due to Smith et al. [3] and the gobbledygook renormalization [4,5]. The feoozbar is solved with geometric multigrid [6]. Data is analyzed using the froiznok method [7] from the boolbool library [8]." There goes 8, now you have 2 citations left for the introduction.

stdbrouw a day ago | parent [-]

Do you still feel the same way if the froiznok method is an ANOVA table of a linear regression, with a log-transformed outcome? Should I reference Fisher, Galton, Newton, the first person to log transform an outcome in a regression analysis, the first person to log transform the particular outcome used in your paper, the R developers, and Gauss and Markov for showing that under certain conditions OLS is the best linear unbiased estimator? And then a couple of references about the importance of quantitative analysis in general? Because that is the level of detail I’m seeing :-)

semi-extrinsic a day ago | parent [-]

Yeah, there is an interesting question there (always has been). When do you stop citing the paper for a specific model?

Just to take some examples, is BiCGStab famous enough now that we can stop citing van der Vorst? Is the AdS/CFT correspondence well known enough that we can stop citing Maldacena? Are transformers so ubiquitous that we don't have to cite "Attention is all you need" anymore? I would be closer to yes than no on these, but it's not 100% clear-cut.

One obvious criterion has to be "if you leave out the citation, will it be obvious to the reader what you've done/used"? Another metric is approximately "did the original author get enough credit already"?

stdbrouw 18 hours ago | parent [-]

Yeah, I didn't want to be contrary just for the sake of it, the heuristics you mention seem like good ones, and if followed would probably already cut down on quite a few superfluous references in most papers.

freehorse a day ago | parent | prev | next [-]

It is not (just) a consequence of the internet; scientific production itself has grown exponentially. Many more papers are cited simply because there are more papers, period.

varjag a day ago | parent | prev | next [-]

Not even the Internet per se, but the citation index becoming a universally accepted KPI for research work.

HPsquared a day ago | parent | prev [-]

Maybe there could be a system to classify the importance of each reference.

zipy124 a day ago | parent [-]

Systems do exist for this, but they're rather crude.

grayhatter a day ago | parent | prev | next [-]

> The reviewer is not a proofreader, they are checking the rigour and relevance of the work, which does not rest heavily on all of the references in a document.

I've always assumed peer review is similar to diff review, where I'm willing to sign my name onto the work of others. If I approve a diff/PR and it takes down prod, it's just as much my fault, no?

> They are also assuming good faith.

I can only relate this to code review, but assuming good faith means you assume they didn't try to introduce a bug by adding this dependency. But I should still check to make sure this new dep isn't some typosquatted package. That's the rigor I'm responsible for.

dilawar a day ago | parent | next [-]

> I've always assumed peer review is similar to diff review, where I'm willing to sign my name onto the work of others. If I approve a diff/PR and it takes down prod, it's just as much my fault, no?

Ph.D. in neuroscience here, programmer by trade. This is not true. The less you know about most peer reviews, the better.

The better peer reviews are also not this 'thorough', and no one expects reviewers to read or even check references. If the authors cite something the reviewer is familiar with and are using it wrong, the reviewer will likely complain. Or if they find some unknown citation very relevant to their own work, they will read it.

I don't have a great analogy to draw here. Peer review is usually thankless, unpaid work, so there is unlikely to be any motivation for fraud detection unless it somehow affects your own work.

wpollock a day ago | parent [-]

> The better peer reviews are also not this 'thorough' and no one expects reviewers to read or even check references.

Checking references can be useful when you are not familiar with the topic (but must review the paper anyway). In many conference proceedings that I have reviewed for, many if not most citations were redacted so as to keep the author anonymous (citations to the author's prior work or that of their colleagues).

LLMs could be used to find prior work anyway, today.

tpoacher a day ago | parent | prev | next [-]

This is true, but here the equivalent situation is someone using a Greek question mark (";") instead of a semicolon (";"), and you as a code reviewer are only expected to review the code visually and are not provided the resources required to compile the code on your local machine to see the compiler fail.

Yes, in theory you can go through every semicolon to check that it's not actually a Greek question mark; but one assumes good faith and baseline competence, such that you as the reviewer would generally not be expected to perform such pedantic checks.

So if you think you might have reasonably missed Greek question marks in a visual code review, then hopefully you can also appreciate how a paper reviewer might miss a false citation.

scythmic_waves a day ago | parent | next [-]

> as a code reviewer [you] are only expected to review the code visually and are not provided the resources required to compile the code on your local machine to see the compiler fail.

As a PR reviewer I frequently pull down the code and run it. Especially if I'm suggesting changes because I want to make sure my suggestion is correct.

Do other PR reviewers not do this?

dataflow a day ago | parent | next [-]

I don't commonly do this and I don't know many people who do this frequently either. But it depends strongly on the code, the risks, the gains of doing so, the contributor, the project, the state of testing and how else an error would get caught (I guess this is another way of saying "it depends on the risks"), etc.

E.g. you can imagine that if I'm reviewing changes in authentication logic, I'm obviously going to put a lot more effort into validation than if I'm reviewing a container and wondering if it would be faster as a hashtable instead of a tree.

> because I want to make sure my suggestion is correct.

In this case I would just ask "have you already also tried X" which is much faster than pulling their code, implementing your suggestion, and waiting for a build and test to run.

tpoacher a day ago | parent | prev | next [-]

I do too, but this is a conference, I doubt code was provided.

And even then, what you're describing isn't review per se, it's replication. In principle there are entire journals that one can submit replication reports to, which count as actual peer-reviewable publications in themselves. So one needs to be pragmatic about what is expected from a peer review (especially given the imbalance between the resources invested to create one and the lack of resources offered and of any meaningful reward).

Majromax a day ago | parent [-]

> I do too, but this is a conference, I doubt code was provided.

Machine learning conferences generally encourage (anonymized) submission of code. However, that still doesn't mean that replication is easy. Even if the data is also available, replication of results might require impractical levels of compute power; it's not realistic to ask a peer reviewer to pony up for a cloud account to reproduce even medium-scale results.

lesam a day ago | parent | prev | next [-]

If there’s anything I would want to run to verify, I ask the author to add a unit test. Generally, the existing CI test + new tests in the PR having run successfully is enough. I might pull and run it if I am not sure whether a particular edge case is handled.

Reviewers wanting to pull and run many PRs makes me think your automated tests need improvement.

Terr_ a day ago | parent | prev | next [-]

I don't, but that's because ensuring the PR compiles and passes old+new automated tests is an enforced requirement before it goes out.

So running it myself involves judging other risks, much higher-level ones than bad unicode characters, like the GUI button being in the wrong place.

grayhatter a day ago | parent | prev | next [-]

> Do other PR reviewers not do this?

Some do; many, like peer reviewers, are unable to consider the consequences of their negligence.

But it's always a welcome reminder that some people care about doing good work. That's easy to forget browsing HN, so I appreciate the reminder :)

vkou a day ago | parent | prev [-]

> Do other PR reviewers not do this?

No, because this is usually a waste of time: CI enforces that the code and the tests run at submission time. If your CI isn't doing that, you should put some work into configuring it.

If you regularly have to do this, your codebase should probably have more tests. If you don't trust the author, you should ask them to include test cases for whatever it is that you are concerned about.

grayhatter a day ago | parent | prev | next [-]

> This is true, but here the equivalent situation is someone using a greek question mark (";") instead of a semicolon (";"),

No it's not. I think you're trying to make a different point, because you're using an example of a specific deliberate malicious way to hide a token error that prevents compilation, but is visually similar.

> and you as a code reviewer are only expected to review the code visually and are not provided the resources required to compile the code on your local machine to see the compiler fail.

What weird world are you living in where you don't have CI? Also, it's pretty common that I'll test code locally when reviewing something more complex or more important, if I don't have CI.

> Yes in theory you can go through every semicolon to check if it's not actually a greek question mark; but one assumes good faith and baseline competence such that you as the reviewer would generally not be expected to perform such pedantic checks.

I don't, because it won't compile. Not because I assume good faith. References and citations are similar to introducing dependencies. We're talking about completely fabricated deps, e.g. this engineer went on npm and grabbed the first package that said left-pad but is actually a crypto miner. We're not talking about a citation missing a page number or publication year. We're talking about something completely incorrect being represented as relevant.

> So if you think you might have reasonably missed greek question marks in a visual code review, then hopefully you can also appreciate how a paper reviewer might miss a false citation.

I would never miss this, because the important thing is that code needs to compile. If it doesn't compile, it doesn't reach the master branch. Peer review of a paper doesn't have CI, I'm aware, but it's also not vulnerable to syntax errors like that. A fake semicolon in a paper doesn't meaningfully change it, so this analogy doesn't map to the fraud I'm commenting on.

tpoacher a day ago | parent [-]

You have completely missed the point of the analogy.

Breaking the analogy beyond the point where it is useful, by introducing non-generalising specifics, is not a useful argument. Otherwise I can counter your more specific, non-generalising analogy by introducing little green aliens sabotaging your imaginary CI, with the same ease and effect.

grayhatter a day ago | parent [-]

I disagree you could do that and claim to be reasonable.

But I agree, because I'd rather discuss the pragmatics and not bicker over the semantics about an analogy.

Introducing a token error is different from plagiarism, no? Someone writing code that can't compile is different from someone "stealing" proprietary code from some company and contributing it to some FOSS repo?

In order to assume good faith, you also need to assume the author is the origin. But that's clearly not the case. The origin is from somewhere else, and the author that put their name on the paper didn't verify it, and didn't credit it.

tpoacher a day ago | parent [-]

Sure but the focus here is on the reviewer not the author.

The point is what is expected as reasonable review before one can "sign their name on it".

"Lazy" (or possibly malicious) authors will always have incentives to cut corners as long as no mechanisms exist to reject (or even penalise) the paper on submission automatically. Which would be the equivalent of a "compiler error" in the code analogy.

Effectively the point is, in the absence of such tools, the reviewer can only reasonably be expected to "look over the paper" for high-level issues; catching such low-level issues via manual checks by reviewers has massively diminishing returns for the extra effort involved.

So I don't think the conference shaming the reviewers here in the absence of providing such tooling is appropriate.

xvilka a day ago | parent | prev [-]

Code correctness should be checked automatically with CI and the test suite. New tests should be added. This is exactly what makes sure these stupid errors don't bother the reviewer. Same for code formatting and documentation.

merely-unlikely a day ago | parent | next [-]

This discussion makes me think peer reviews need more automated tooling somewhat analogous to what software engineers have long relied on. For example, a tool could use an LLM to check that the citation actually substantiates the claim the paper says it does, or else flags the claim for review.
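
As a sketch of what such a check could look like, with call_llm() and fetch_abstract() as hypothetical placeholders for the model client and whatever metadata source is available:

    # Sketch: does the cited work plausibly support the claim made about it?
    # call_llm() and fetch_abstract() are hypothetical placeholders.
    SUPPORT_PROMPT = (
        "Claim made in the manuscript: {claim}\n"
        "Abstract of the cited work: {abstract}\n"
        "Does the cited work plausibly support the claim? Answer exactly one of: "
        "'supported', 'not supported', 'cannot tell from the abstract'."
    )
    def check_claim_support(claim: str, doi: str) -> str:
        abstract = fetch_abstract(doi)  # e.g. from a DOI metadata lookup
        return call_llm(SUPPORT_PROMPT.format(claim=claim, abstract=abstract)).strip()

Anything that comes back as "not supported" or "cannot tell" would go to a human, not straight into a rejection.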

noitpmeder a day ago | parent | next [-]

I'd go one further and say all published papers should come with a clear list of "claimed truths", and one is only able to cite said paper if they are linking to an explicit truth.

Then you can build a true hierarchy of citation dependencies, checked 'statically', and have better indications of impact if a fundamental truth is disproven, ...

vkou a day ago | parent [-]

Have you authored a lot of non-CS papers?

Could you provide a proof of concept paper for that sort of thing? Not a toy example, an actual example, derived from messy real-world data, in a non-trivial[1] field?

---

[1] Any field is non-trivial when you get deep enough into it.

alexcdot a day ago | parent | prev [-]

Hey, I'm part of the GPTZero team that built the automated tooling to get the results in that article!

Totally agree with your thinking here: we can't just hand this to an LLM, because of the need for industry-specific standards for what counts as a hallucination / match, and for how to do the search.

thfuran a day ago | parent | prev [-]

What exactly is the analogy you’re suggesting, using LLMs to verify the citations?

tpoacher a day ago | parent [-]

Not OP, but that wouldn't really be necessary.

One could submit their bibtex files and expect bibtex citations to be verifiable using a low-level checker.

Worst case scenario, if your bibtex citation was a variant of one in the checker database, you'd be asked to correct it to match the canonical version.
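
As a rough sketch of such a low-level checker, assuming entries carry a doi field and treating the public doi.org handle API as the canonical database (both simplifications; real .bib files are messier):

    # Rough sketch: flag bibtex entries whose DOI does not resolve.
    import re
    import requests
    def unresolvable_dois(bibtex_text: str) -> list[str]:
        dois = re.findall(r'doi\s*=\s*[{"]([^}"]+)[}"]', bibtex_text, re.I)
        bad = []
        for doi in dois:
            r = requests.get(f"https://doi.org/api/handles/{doi}", timeout=30)
            if r.status_code != 200 or r.json().get("responseCode") != 1:
                bad.append(doi)  # DOI unknown to the handle system
        return bad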

However, as others here have stated, hallucinated "citations" are actually the lesser problem. Citing irrelevant papers based on a fly-by reference is a much harder problem; this was present even before LLMs, but this has now become far worse with LLMs.

thfuran a day ago | parent [-]

Yes, I think verifying mere existence of the cited paper barely moves the needle. I mean, I guess automated verification of that is a cheap rejection criterion, but I don’t think it’s overall very useful.

alexcdot a day ago | parent [-]

Really good point. One of the cofounders of GPTZero here!

The tool GPTZero used in the article also detects whether the citation supports the claim; scroll to "cited information accuracy" here: https://app.gptzero.me/documents/1641652a-c598-453f-9c94-e0b...

This is still in beta because it's a much harder problem for sure, since it's hard to determine whether a 40-page paper supports a claim (if the paper claims X is computationally intractable, does that mean algorithms to compute approximate X are slow?).

pron a day ago | parent | prev | next [-]

That is not, cannot be, and shouldn't be, the bar for peer review. There are two major differences between it and code review:

1. A patch is self-contained and applies to a codebase you have just as much access to as the author. A paper, on the other hand, is just the tip of the iceberg of research work, especially if there is some experiment or data collection involved. The reviewer does not have access to, say, videos of how the data was collected (and even if they did, they don't have the time to review all of that material).

2. The software is also self-contained. That's "production". But a scientific paper does not necessarily aim to represent scientific consensus, but rather a finding by a particular team of researchers. If a paper's conclusions are wrong, it's expected that it will be refuted by another paper.

grayhatter a day ago | parent [-]

> That is not, cannot be, and shouldn't be, the bar for peer review.

Given the repeatability crisis I keep reading about, maybe something should change?

> 2. The software is also self-contained. That's "production". But a scientific paper does not necessarily aim to represent scientific consensus, but rather a finding by a particular team of researchers. If a paper's conclusions are wrong, it's expected that it will be refuted by another paper.

This is a much, MUCH stronger point. I would have led with this, because the contrast between this assertion and my comparison to prod is night and day. The rules for prod are different from the rules of scientific consensus. I regret losing sight of that.

garden_hermit a day ago | parent | next [-]

> Given the repeatability crisis I keep reading about, maybe something should change?

The replication crisis — assuming that it is actually a crisis — is not really solvable with peer review. If I'm reviewing a psychology paper presenting the results of an experiment, I am not able to re-conduct the entire experiment as presented by the authors, which would require completely changing my lab, recruiting and paying participants, and training students & staff.

Even if I did this, and came to a different result than the original paper, what does it mean? Maybe I did something wrong in the replication, maybe the result is only valid for certain populations, maybe inherent statistical uncertainty means we just get different results.

Again, the replication crisis — such that it exists — is not the result of peer review.

hnfong a day ago | parent | prev [-]

IMHO what should change is we stop putting "peer reviewed" articles on a pedestal.

Even if peer review were as rigorous as code review (the former usually being unpaid), we all know that reviewed code still has bugs, and a programmer would be nuts to go around saying "this code is reviewed by experts, we can assume it's bug free, right?"

But there are too many people who just assume that a peer-reviewed article is somehow automatically correct.

vkou a day ago | parent [-]

> IMHO what should change is we stop putting "peer reviewed" articles on a pedestal.

Correct. Peer review is a minimal and necessary but not sufficient step.

freehorse a day ago | parent | prev | next [-]

A reviewer is assessing the relevance and "impact" of a paper rather than correctness itself directly. Reviewers may not even have access to the data the authors used. The way it essentially works is that an editor asks the reviewers "is this paper worthy of being published in my journal?" and the reviewers basically have to answer that question. The process is ultimately the editor/journal's responsibility.

chroma205 a day ago | parent | prev | next [-]

> I've always assumed peer review is similar to diff review. Where I'm willing to sign my name onto the work of others. If I approve a diff/pr and it takes down prod. It's just as much my fault, no?

No.

Modern peer review is “how can I do minimum possible work so I can write ‘ICLR Reviewer 2025’ on my personal website”

freehorse a day ago | parent | next [-]

The vast majority of people I see do not even mention who they review for in CVs etc. It is usually more akin to volunteer-based, thankless work. Unless you are an editor or something at a journal, what you review for does not count for much of anything.

grayhatter a day ago | parent | prev [-]

> No. [...] how can I do minimum possible work

I don't know, I still think this describes most of the reviews I've seen

I just hope most devs that do this know better than to admit to it.

bjourne a day ago | parent | prev [-]

For ICLR, reviewers were asked to review 5 papers in two weeks. Unpaid, voluntary work in addition to their normal teaching, supervision, meetings, and other research duties. It's just not possible to understand and thoroughly review each paper, even for topic experts. If you want to compare peer review to coding, it's more like "no syntax errors, code still compiles" than PR review.

alexcdot a day ago | parent [-]

I really like what IJCAI is doing to pay reviewers for this work, with the $100 fee from authors.

Yeah, it's insane, the workload reviewers are faced with, plus being an author who gets a review from a novice.

PeterStuer a day ago | parent | prev | next [-]

I think the root problem is that everyone involved, from authors to reviewers to publishers, knows that 99.999% of papers are completely of no consequence, just empty calories with the sole purpose of padding quotas for all involved, and thus nobody is going to put in the effort as if they mattered.

This is systemic, and unlikely to change anytime soon. There have been remedies proposed (e.g. limits on how many papers an author can publish per year, let's say 4 to be generous), but they are unlikely to gain traction: though most would agree on the benefits, everyone involved in the system would stand to lose short term.

Aurornis a day ago | parent | prev | next [-]

> I don’t consider it the reviewer’s responsibility to manually verify all citations are real

I guess this explains all those times over the years where I follow a citation from a paper and discover it doesn’t support what the first paper claimed.

rokob a day ago | parent | prev | next [-]

As a reviewer, I at least skim the cited paper for every reference in every paper that I review. If it isn't useful to furthering the point of the paper, then my feedback is to remove the reference. Adding a bunch of junk because it is broadly related, in a giant background section, is a waste of everyone's time and should be removed. Most of the time you are already aware of the papers being cited anyway, because that is the whole point of reviewing in your area of expertise.

not2b a day ago | parent | prev | next [-]

Agreed. I used to review lots of submissions for IEEE and similar conferences, and didn't consider it my job to verify every reference. No one did, unless the use of the reference triggered an "I can't believe it said that" reaction. Of course, back then, there wasn't a giant plagiarism machine known to fabricate references, so if tools can find fake references easily the tools should be used.

andai a day ago | parent | prev | next [-]

> I don’t consider it the reviewer’s responsibility to manually verify all citations are real.

Doesn't this sound like something that could be automated?

for paper_name in citations... do a web search for it, see if there's a page in the results with that title.

That would at least give you "a paper with this name exists".
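
A crude sketch of that loop, using the public arXiv export API as the search target (one option among many; the title matching here is deliberately naive):

    # Crude existence check: does arXiv know a paper with (roughly) this title?
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET
    ATOM = "{http://www.w3.org/2005/Atom}"
    def title_found_on_arxiv(paper_name: str) -> bool:
        query = urllib.parse.quote(f'ti:"{paper_name}"')
        url = f"http://export.arxiv.org/api/query?search_query={query}&max_results=5"
        with urllib.request.urlopen(url, timeout=30) as resp:
            feed = ET.parse(resp).getroot()
        titles = (e.findtext(ATOM + "title", "") for e in feed.iter(ATOM + "entry"))
        return any(paper_name.lower() in " ".join(t.split()).lower() for t in titles)

Anything not found this way still needs a second look (preprints get renamed, books and older venues aren't on arXiv), so it's a cheap first filter, not a verdict.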

armcat a day ago | parent | prev | next [-]

I agree with you (I have reviewed papers in the past); however, made-up citations are a "signal". Why would the authors do that? If they made it up, most likely they haven't really read that prior work. If they haven't, have they really done proper due diligence on their research? Are they just trying to "beef up" their paper with citations to unfairly build credibility?

pbhjpbhj a day ago | parent | prev | next [-]

Surely there are tools to retrieve all the citations; publishers should spot this easily.

However the paper is submitted, like a folder on a cloud drive, just have them include a folder with PDFs/abstracts of all the citations?

They might then fraudulently produce papers to cite, but they can't cite something that doesn't exist.

michaelt a day ago | parent | next [-]

> Surely there are tools to retrieve all the citations,

Even if you could retrieve all citations (which isn't always as easy as you might hope), to validate them you'd also have to confirm that each paper says what the person citing it claims. If I say "A GPU requires 1.4kg of copper", citing [1], is that a valid citation?

That means not just reviewing one paper, but also potentially checking the 70+ papers it cites. The vast majority of paper reviewers will not check that citations actually say what they're claimed to say, unless a truly outlandish claim is made.

At the same time, academia is strangely resistant to putting hyperlinks in citations, preferring to maintain old traditions - like citing conference papers by page number in a hypothetical book that has never been published; and having both a free and a paywalled version of a paper while considering the paywalled version the 'official' version.

[1] https://arxiv.org/pdf/2512.04142

tpoacher a day ago | parent | prev [-]

How delightfully optimistic of you to think those abstracts would not also be AI-generated...

zzzeek a day ago | parent [-]

Sure, but then the citations are no longer "hallucinated"; they actually point to something fraudulent. That's a different problem.

jayess a day ago | parent | prev | next [-]

Wow. I went to law school and was on the law review. That was our precise job for the papers selected for publication: to verify every single citation.

_blk a day ago | parent [-]

Thanks for sharing that. Interesting how there was a solution to a problem that didn't really exist yet... I mean, I'm sure it was there for a reason, but I assume it was more for things like wrongful attribution, missing commas etc. rather than outright invented quotes to fit a narrative. Or do you have more background on that?

At least the mandatory automated checking processes are probably not far off for the more reputable journals, but it still makes you wonder how much you can trust the last two years of LLM-enhanced science that is now being cited in current publications, and whether those hallucinations can be "reverted" after having been re-quoted. A bit like how Wikipedia can be abused to establish facts.

zdragnar a day ago | parent | prev | next [-]

This is half the basis for the replication crisis, no? Shady papers come out and people cite them endlessly with no critical thought or verification.

After all, their grant covers their thesis, not their thesis plus all of the theses they cite.

figassis a day ago | parent | prev | next [-]

It is absolutely the reviewer's job to check citations. Who else will check, and what is the point of peer review then? So you'd just happily pass on shoddy work because it's not your job? You're reviewing both the author's work and, if there were other people whose job was to ensure the citations were good, their work too. This is very much the problem today with this "not my problem" mindset. If it passes review, the reviewer is also at fault. No excuses.

zipy124 a day ago | parent | next [-]

The problem is most academics just do not have the time to do this for free, or in fact even if paid. In addition you may not even have access to the references. In acoustics it's not uncommon to cite works that don't even exist online and it's unlikely the reviewer will have the work in their library.

dpkirchner a day ago | parent | prev [-]

Agreed, and I'd go further. If nobody is reviewing citations they may as well not exist. Why bother?

vkou a day ago | parent [-]

1. To make it clear what is your work, and what is building on someone else's.

2. If the paper turns out to be important, people will bother.

3. There's checking for cursory correctness, and there's forensic torture.

figassis 13 hours ago | parent [-]

Building on an imaginary someone else? That's exactly the same as lying. Is a review not about verifying that the paper and even the data are correct? I get that reviewers can make mistakes, but this seems like defending intentional mistakes.

I mean, in college I had to review papers, and so took peer review lectures, and nowhere in there was it ever stated that citations are not the reviewer's job. In fact, citation verification was one of the most important parts of the lectures, as in: how to find original sources (when authoring), and how to verify them (when reviewing).

When did peer review get redefined?

vkou 7 hours ago | parent [-]

I'm not defending dishonesty, I'm saying that's what citations do when they are used by honest people.

zzzeek a day ago | parent | prev | next [-]

Correct me if I'm wrong, but citations in papers follow a specific format, and the case here is that a tool was used to validate that they are all real. Surely a tool that scans a paper for all citations and verifies that they actually exist in the journals they reference shouldn't be all that technically difficult to achieve?

alexcdot a day ago | parent | next [-]

There are a ton of edge cases, and a bit of contextual understanding is needed for what counts as a hallucinated citation (e.g. what if it's republished from arXiv to ICLR?).

But to your point, it seems we need a tool that can do this.

mike_hearn 16 hours ago | parent | prev [-]

It's not; there are lots of ways to resolve citations without even using AI.

I experimented a couple of years ago with getting LLMs to check citations but stopped working on it because there's no incentive. You could run a fancy expensive pipeline burning scarce GPU hours and find a bunch of bad citations. Then what? Nobody cares. No journal is going to retract any of these papers, the academics themselves won't care or even respond to your emails, nobody is willing to pay for this stuff, least of all the universities, journals or governments themselves.

For example, there's a guy in France who runs a pre-LLM pipeline to discover bad papers using hand-coded heuristics like regexes or metadata analysis, e.g. checking if a citation has been retracted. Many of the things it detects are plagiarism, paper mills (i.e. companies that sell fake papers to academics for a profit), or the output of joke paper generators like SciGen.

https://dbrech.irit.fr/pls/apex/f?p=9999:1::::::

Other than populating an obscure database nobody knows about, this work achieved bupkis.

auggierose a day ago | parent | prev [-]

In short, a review has no objective value; it is just an obstacle to be gamed.

amanaplanacanal a day ago | parent [-]

In theory, the review tries to determine whether the conclusion reached actually follows from whatever data is provided. It assumes that everything is honest; it's just looking to see whether mistakes were made.

auggierose a day ago | parent [-]

Honest or not shouldn't make a difference; after all, the submitting authors may themselves believe everything is A-OK.

The review should also determine how valuable the contribution is, not only if it has mistakes or not.

Today's reviews determine neither value nor correctness in any meaningful way. And how could they, actually? That is why I review papers only to the extent that I understand them, and I clearly delineate my line of understanding. And I don't review papers that I am not interested in reading. I once got a paper to review that actually pointed out a mistake in one of my previous papers and then proposed a different solution. They correctly identified the mistake, but I could not verify whether their solution worked; that would have taken me several weeks to understand. I gave a report along these lines, and the person who had assigned me the review said I should say more about their solution, but I could not. So my review was not actually used. The paper was accepted, which is fine, but I am sure none of the other reviewers actually knows if it is correct.

Now, this was a case where I was an absolute expert. Which is far from the usual situation for a reviewer, even though many reviewers give themselves the highest mark for expertise when they just should not.

barfoure a day ago | parent | prev | next [-]

I’d love to hear some examples of poor electrical work that you’ve come across that’s often missed or not seen.

AstroNutt a day ago | parent | next [-]

A couple had just moved into a house and called me to replace the ceiling fan in the living room. I pulled the flush-mount cover down to start unhooking the wire nuts and noticed RG58 (coax cable). Someone had used the center conductor as the hot wire! I ended up running 12/2 Romex from the switch. There was no way in hell I could have hooked it back up the way it was. This is just one example I've come across.

joshribakoff a day ago | parent | prev [-]

I am not an electrician, but when I did projects, I did a lot of research before deciding to hire someone and then I was extremely confused when everyone was proposing doing it slightly differently.

A lot of them proposed ways that seemed to violate code, like running flex tubing beyond the allowed length or number of turns.

Another example would be people not accounting for needing fireproof covers when installing recessed lighting between dwellings in certain cities…

Heck, most people don’t actually even get the permit. They just do the unpermitted work.

xnx a day ago | parent | prev | next [-]

No doubt the best electricians are currently better than the best AI, but the best AI is likely now better than the novice homeowner. The trajectory over the past 2 years has been very good. Another five years and AI may be better than all but the very best, or most specialized, electricians.

legostormtroopr a day ago | parent [-]

Current state AI doesn’t have hands. How can it possibly be better at installing electrics than anyone?

Your post reads like AI precisely because while the grammar is fine, it lacks context - like someone prompted “reply that AI is better than average”.

xnx a day ago | parent [-]

An electrician with total knowledge/understanding, but only the average dexterity of a non-professional would still be very useful.

lencastre a day ago | parent | prev | next [-]

An old boss of mine used to say there are no stupid electricians found alive, as they self-select, Darwin Award style.

bdangubic a day ago | parent | prev [-]

Same (and much, much, much worse) for science.