Good start on a hard question — how or whether to use #AI tools in #PeerReview.
https://www.researchsquare.com/article/rs-2587766/v1

"For the moment, we recommend that if #LLMs are used to write scholarly reviews, reviewers should disclose their use and accept full responsibility for their reports’ accuracy, tone, reasoning and originality."

PS: "For the moment" these tools can help reviewers string words together, not judge quality. We have good reasons to seek evaluative comments from human experts.

Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other Large Language Models in scholarly peer review

Background: The emergence of systems based on large language models (LLMs) such as OpenAI’s ChatGPT has created a range of discussions in scholarly circles. Since LLMs generate grammatically correct and mostly relevant (yet sometimes outright wrong, irrelevant or biased...

Update. I acknowledge that there's no bright line between using these tools to polish one's language and using them to shape one's judgments of quality. I also ack that these tools are steadily getting better at "knowing the field". That's why this is a hard problem.

One way to ensure that reviewers take #responsibility for their judgments is #attribution.

#PeerReview #OpenPeerReview

Update. I'm pulling a few other comments into this thread, in preparation for extending it later.

1. I have mixed feelings on #attribution in peer review. I see the benefits, but I also see the benefits of #anonymity.
https://twitter.com/petersuber/status/1412455826397204487

2. For #AI today, good #reviews are a harder problem than good #summaries.
https://fediscience.org/@petersuber/109954904433171308

3. Truth detection is a deep, hard problem. Automating it is even harder.
https://fediscience.org/@petersuber/109921214854932516

#PeerReview #OpenPeerReview

Peter Suber (@[email protected]) on X

Mixed feelings about open peer review. Yes to all the benefits cited by others. But anonymity has benefits too. 1. It protects women, minority, & young/unknown authors from referee bias. 2. It protects the same scholars as referees when criticizing weak work by senior scholars.

X (formerly Twitter)

Update. I'm pulling in two of my Twitter threads on using #AI or #PredictionMarkets to estimate quality-surrogates (not quality itself). I should have kept them together in one thread, but it's too late now.

https://twitter.com/petersuber/status/1259521012196167681

https://twitter.com/petersuber/status/1196908657717342210

Peter Suber (@[email protected]) on Twitter

“If a successful replication boosts the credibility [of] a research article, then does a prediction of a successful replication, from an honest prediction market, do the same, even to a small degree? https://t.co/fBtZ32mq6J”

Twitter

Update. I'm sure this has occurred to #AI / #LLM tool builders. Determining whether an assertion is #true is a hard problem, and we don't expect an adequate software solution any time soon, if ever. But determining whether a #citation points to a real publication, and whether it's #relevant to the passage citing it, is comparatively easy. (Just comparatively.)

Some tools already cite sources. But when will tools promise that their citations are real and relevant — and deliver on that promise?

Update. I've been playing with #Elicit, one of the new #AI #search engines. Apart from answering your questions in full sentences, it cites peer-reviewed sources. When you click on one, Elicit helps you evaluate it. Quoting from a real example:

"Can I trust this paper?
• No mention found of study type
• No mention found of funding source
• No mention found of participant count
• No mention found of multiple comparisons
• No mention found of intent to treat
• No mention found of preregistration"

Update. Found in the wild: A peer-reviewer used #AI to write comments on a manuscript. The AI tool recommended that the author review certain sources, but nearly all of the recommended works were fake.
https://www.linkedin.com/feed/update/urn:li:share:7046083155149103105/

#Misconduct #NotHypothetical

Robin Bauwens on LinkedIn: A reviewer rejected my paper, and instead suggested me to familiarize…

A reviewer rejected my paper, and instead suggested me to familiarize myself with the following readings. I could not find them anywhere. After a control in…

Update. The US #NIH and Australian Research Council (#ARC) have banned the use of #AI tools for the #PeerReview of grant proposals. The #NSF is studying the question.
https://www.science.org/content/article/science-funding-agencies-say-no-using-ai-peer-review
(#paywalled)

Apart from #quality, one concern is #confidentiality. If grant proposals become part of a tool's training data, there's no telling (in the NIH's words) “where data are being sent, saved, viewed, or used in the future.”

#Funders

Update. If you *want* to use #AI for #PeerReview:

"Several publishers…have barred researchers from uploading manuscripts…[to] #AI platforms to produce #PeerReview reports, over fears that the work might be fed back into an #LLM’s training data set [&] breach contractual terms to keep work confidential…[But with] privately hosted [and #OpenSource] LLMs…one can be confident that data are not fed back to the firms that host LLMs in the cloud."
https://www.nature.com/articles/d41586-023-03144-w

How ChatGPT and other AI tools could disrupt scientific publishing

A world of AI-assisted writing and reviewing might transform the nature of the scientific paper.

Update. "Avg scores from multiple ChatGPT-4 rounds seems more effective than individual scores…If my weakest articles are removed… correlation with avg scores…falls below statistical significance, suggesting that [it] struggles to make fine-grained evaluations…Overall, ChatGPT [should not] be trusted for…formal or informal research quality evaluation…This is the first pub'd attempt at post-publication expert review accuracy testing for ChatGPT."
https://arxiv.org/abs/2402.05519

#AI #PeerReview

Can ChatGPT evaluate research quality?

Purpose: Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task. Design/methodology/approach: Test the extent to which ChatGPT-4 can assess the quality of journal articles using a case study of the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements. Findings: ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author's significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations. Research limitations: The data is self-evaluations of a convenience sample of articles from one academic in one field. Practical implications: Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use. Originality/value: This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.

arXiv.org
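
A minimal sketch of the single-round vs. averaged-rounds comparison the abstract describes, using made-up scores (only numpy and scipy assumed; this is not the study's code):

```python
# Illustrative sketch only (made-up data): compare one round of LLM quality
# scores against the average of many rounds, correlating each with an
# expert's own scores, mirroring the study's 51-article / 15-round setup.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_articles, n_rounds = 51, 15

expert = rng.integers(1, 5, size=n_articles)              # expert's own 1-4 scores
# Simulated LLM scores: weak signal plus heavy round-to-round noise.
llm_rounds = expert + rng.normal(0, 2.5, size=(n_rounds, n_articles))

r_single, _ = pearsonr(llm_rounds[0], expert)             # one round vs. expert
r_mean, _ = pearsonr(llm_rounds.mean(axis=0), expert)     # averaged rounds vs. expert

print(f"single round r = {r_single:.3f}, averaged rounds r = {r_mean:.3f}")
# Averaging cancels some per-round noise, so r_mean is typically higher than
# r_single -- the pattern (0.281 vs. 0.509) reported in the abstract.
```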

Update. 𝘓𝘢𝘯𝘤𝘦𝘵 𝘐𝘯𝘧𝘦𝘤𝘵𝘪𝘰𝘶𝘴 𝘋𝘪𝘴𝘦𝘢𝘴𝘦𝘴 on why it does not permit #AI in #PeerReview:
https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(24)00160-9/fulltext

1. In an experimental peer review report, #ChatGPT "made up statistical feedback and non-existent references."

2. "Peer review is confidential, and privacy and proprietary rights cannot be guaranteed if reviewers upload parts of an article or their report to an #LLM."

Update. "Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these [#CS] conferences could have been substantially modified by #LLMs, i.e. beyond spell-checking or minor writing updates."
https://arxiv.org/abs/2403.07183

#AI #PeerReview

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.

arXiv.org
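
A minimal sketch of the corpus-level maximum-likelihood idea in the abstract: treat the corpus's word frequencies as a mixture of a human-written and an AI-generated reference distribution, and pick the mixing weight that maximizes the log-likelihood. The token probabilities below are invented placeholders, not figures from the paper:

```python
# Illustrative sketch only: estimate the fraction (alpha) of LLM-modified text
# in a corpus as the mixing weight of a two-component word-frequency mixture,
# chosen by maximum likelihood. The reference distributions would come from
# human-written and AI-generated reference reviews; all numbers here are toy.
import numpy as np

def estimate_llm_fraction(corpus_counts, p_human, p_ai):
    tokens = list(corpus_counts)
    counts = np.array([corpus_counts[t] for t in tokens], dtype=float)
    # Normalize the reference frequencies over the chosen vocabulary.
    ph = np.array([p_human.get(t, 1e-12) for t in tokens])
    pa = np.array([p_ai.get(t, 1e-12) for t in tokens])
    ph, pa = ph / ph.sum(), pa / pa.sum()

    best_alpha, best_ll = 0.0, -np.inf
    for alpha in np.linspace(0.0, 1.0, 1001):        # simple grid search
        mix = (1.0 - alpha) * ph + alpha * pa        # mixture distribution
        ll = float(np.sum(counts * np.log(mix)))     # corpus log-likelihood
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha

# Toy usage: two "marker" words that AI-generated reviews tend to overuse.
p_human = {"notable": 0.0020, "commendable": 0.0005}
p_ai    = {"notable": 0.0040, "commendable": 0.0060}
corpus  = {"notable": 300, "commendable": 260}
print(estimate_llm_fraction(corpus, p_human, p_ai))   # ~0.66 for this toy data
```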

Update. "We demonstrate how increased availability and access to #AI technologies through recent emergence of chatbots may be misused to write or conceal plagiarized peer-reviews."
https://link.springer.com/article/10.1007/s11192-024-04960-1

#PeerReview

Emerging plagiarism in peer-review evaluation reports: a tip of the iceberg? - Scientometrics

The phenomenon of plagiarism in peer-review evaluation reports remained surprisingly unrecognized, despite a notable rise of such cases in recent years. This study reports multiple cases of peer-review plagiarism recently detected in 50 different scientific articles published in 19 journals. Their in-depth analysis reveals that such reviews tend to be nonsensical, vague and unrelated to the actual manuscript. The analysis is followed by a discussion of the roots of such plagiarism, its consequences and measures that could counteract its further spreading. In addition, we demonstrate how increased availability and access to AI technologies through recent emergence of chatbots may be misused to write or conceal plagiarized peer-reviews. Plagiarizing reviews is a severe misconduct that requires urgent attention and action from all affected parties.

SpringerLink

Update. "Researchers should not be using tools like #ChatGPT to automatically peer review papers, warned organizers of top #AI conferences and academic publishers…Some researchers, however, might argue that AI should automate peer reviews since it performs quite well and can make academics more productive."
https://www.semafor.com/article/05/08/2024/researchers-warned-against-using-ai-to-peer-review-academic-papers

#PeerReview

Researchers warned against using AI to peer review academic papers | Semafor

Top AI conferences and academic publishers worry about intellectual integrity as more researchers use tools like ChatGPT

Update. The @CenterforOpenScience (#COS) and partners are starting a new project (Scaling Machine Assessments of Research Trustworthiness, #SMART) in which researchers voluntarily submit papers to both human and #AI reviewers, and then give feedback on the reviews. The project is now calling for volunteers.
https://www.cos.io/smart-prototyping

#PeerReview

SMART Prototyping

The Center for Open Science (COS), along with its collaborators, is building on the work completed during the DARPA-funded SCORE program, which demonstrated the potential of using algorithms to efficiently evaluate research claims at scale. The SCORE program supplements existing research evaluation methods, including human judgment, evidence aggregation, and systematic replication.

Update. These researchers built an #AI system to predict #REF #assessment scores from a range of data points, including #citation rates. For individual works, the system was not very accurate. But for total institutional scores, it was 99.8% accurate. "Despite this, we are not recommending this solution because in our judgement, its benefits are marginally outweighed by the perverse incentive it would generate for institutions to overvalue journal impact factors."
https://blogs.lse.ac.uk/impactofsocialsciences/2023/01/16/can-artificial-intelligence-assess-the-quality-of-academic-journal-articles-in-the-next-ref/

#DORA #JIFs #Metrics

Can artificial intelligence assess the quality of academic journal articles in the next REF?

In this blog post Mike Thelwall, Kayvan Kousha, Paul Wilson, Mahshid Abdoli, Meiko Makita, Emma Stuart and Jonathan Levitt discuss the results of a recent project for UKRI that made recommendations…

Impact of Social Sciences

Update. This editorial sketches a fantasy of #AI-assisted #PeerReview, then argues that it's "not far-fetched".
https://www.nature.com/articles/s41551-024-01228-0

PS: I call it far-fetched. You?

The advent of human-assisted peer review by AI - Nature Biomedical Engineering

The Internet didn’t disrupt academic publishing. Audiovisual generative AI might do.

Nature

Update. New study: "The majority of human reviewers’ comments (78.5 %) lacked equivalents in #ChatGPT's comments."
https://www.sciencedirect.com/science/article/abs/pii/S0169260724003067

#AI #LLM #PeerReview

Update. #AI researchers are among those pissed when #PeerReview of their work is outsourced to AI.
https://www.chronicle.com/article/ai-scientists-have-a-problem-ai-bots-are-reviewing-their-work
(#paywalled)

One complained, “If I wanted to know what #ChatGPT thought of our paper, I could have asked myself.”

Update. Instead of arguing for or against the use of #AI in #PeerReview, @roohighosh asks what parts or aspects of peer review could take advantage of AI strengths and what parts need human attention. I appreciate this approach more than I want to quibble over details.
https://scholarlykitchen.sspnet.org/2024/09/12/strengths-weaknesses-opportunities-and-threats-a-comprehensive-swot-analysis-of-ai-and-human-expertise-in-peer-review/
Strengths, Weaknesses, Opportunities, and Threats: A Comprehensive SWOT Analysis of AI and Human Expertise in Peer Review - The Scholarly Kitchen

AI-generated content has been discovered in prominent journals. Should peer reviewers be expected to find AI text in manuscripts? Where in the publication workflow should these checks be done?

The Scholarly Kitchen

Update. The European Association of Science Editors (#EASE, @EASE) is calling for comments on the use of #AI in scholarly communication (#ScholComm), including #PeerReview.
https://docs.google.com/forms/d/e/1FAIpQLSe9dVTje9YZ0HQJzsPYaMUzfct08Oq9f4o6tm78cE9N-3N1bQ/viewform?pli=1

Sorry I didn't see it earlier. Comments are due by September 15.
https://fediscience.org/@[email protected]ience/113130332348315664

Update. Slight shift in perspective: Not using #AI to do #PeerReview but to predict the outcome of human peer review.
https://link.springer.com/article/10.1007/s00799-023-00359-0

See my old (pre-Mastodon) Twitter thread on the same and similar ideas.
https://x.com/petersuber/status/1259521012196167681

#Prediction #PredictionMarkets

Towards automated meta-review generation via an NLP/ML pipeline in different stages of the scholarly peer review process - International Journal on Digital Libraries

With the ever-increasing number of submissions in top-tier conferences and journals, finding good reviewers and meta-reviewers is becoming increasingly difficult. Writing a meta-review is not straightforward as it involves a series of sub-tasks, including making a decision on the paper based on the reviewer’s recommendation and their confidence in the recommendation, mitigating disagreements among the reviewers, and other such similar tasks. In this work, we develop a novel approach to automatically generate meta-reviews that are decision-aware and which also take into account a set of relevant sub-tasks in the peer-review process. More specifically, we first predict the recommendation scores and confidence scores for the reviews, using which we then predict the decision on a particular manuscript. Finally, we utilize the decision signals for generating the meta-reviews using a transformer-based seq2seq architecture. Our proposed pipelined approach for automatic decision-aware meta-review generation achieves significant performance improvement over the standard summarization baselines as well as relevant prior works on this problem. We make our codes available at https://github.com/saprativa/seq-to-seq-decision-aware-mrg .

SpringerLink

Update. From @lizziegadd: “Maybe” #AI can support the process of research #assessment. But a “lot of the arguments and worries we’re having about AI, we had about #bibliometrics.”
https://www.nature.com/articles/d41586-024-02989-z

#JIF #Metrics

Can AI be used to assess research quality?

Chatbots and other tools are increasingly being considered, but people power is still seen as a safer option.

Update. This study compared #ChatGPT assessments to human #REF reviews. "Although other explanations are possible, esp because REF score profiles are public, the results suggest that #LLMs can provide reasonable research quality estimates in most areas of science and particularly the physical and health sciences…even before citation data is available…[Note that] the ChatGPT scores are only based on titles and abstracts, so cannot be research evaluations."
https://arxiv.org/abs/2409.16695
In which fields can ChatGPT detect journal article quality? An evaluation of REF2021 results

Time spent by academics on research quality assessment might be reduced if automated approaches can help. Whilst citation-based indicators have been extensively developed and evaluated for this, they have substantial limitations and Large Language Models (LLMs) like ChatGPT provide an alternative approach. This article assesses whether ChatGPT 4o-mini can be used to estimate the quality of journal articles across academia. It samples up to 200 articles from all 34 Units of Assessment (UoAs) in the UK's Research Excellence Framework (REF) 2021, comparing ChatGPT scores with departmental average scores. There was an almost universally positive Spearman correlation between ChatGPT scores and departmental averages, varying between 0.08 (Philosophy) and 0.78 (Psychology, Psychiatry and Neuroscience), except for Clinical Medicine (rho=-0.12). Although other explanations are possible, especially because REF score profiles are public, the results suggest that LLMs can provide reasonable research quality estimates in most areas of science, and particularly the physical and health sciences and engineering, even before citation data is available. Nevertheless, ChatGPT assessments seem to be more positive for most health and physical sciences than for other fields, a concern for multidisciplinary assessments, and the ChatGPT scores are only based on titles and abstracts, so cannot be research evaluations.

arXiv.org

Update. "Is it a good idea to include #LLMs…in the #PeerReview process? Would doing so also give researchers back some of the estimated 15 thousand person years per year spent on peer review so they can do more actual research?…#AI can help improve the content and structure of papers *before* they even reach the peer review stage. This is the stage where the use of AI becomes more controversial, and I address it below."
https://scholarlykitchen.sspnet.org/2024/09/24/guest-post-is-ai-the-answer-to-peer-review-problems-or-the-problem-itself/
Guest Post - Is AI the Answer to Peer Review Problems, or the Problem Itself? - The Scholarly Kitchen

Are there ways to use AI in the research workflow to speed up the peer review process -- and, while we're at it, to address some of the other problems around bias and quality?

The Scholarly Kitchen

Update. Researchers asked #ChatGPT and human experts to #PeerReview the same article, and then compared their reviews.
https://doi.org/10.59249/SKDH9286

They find "compelling evidence [for] ChatGPT’s performance…[Its] critical analyses aligned with those of human reviewers…[It] exhibited commendable capability in identifying methodological flaws, articulating insightful feedback on theoretical frameworks, and gauging the overall contribution of the articles to their…fields."

#AI #LLMs

ChatGPT and the Future of Journal Reviews: A Feasibility Study

The increasing volume of research submissions to academic journals poses a significant challenge for traditional peer-review processes. To address this issue, this study explores the potential of employing ChatGPT, an advanced large language model ...

PubMed Central (PMC)

Update. New study: "At major computer-science publication venues, up to 17% of the peer reviews are now written by artificial intelligence. We need guidelines before things get out of hand."
https://www.nature.com/articles/d41586-024-03588-8

h/t @bachtasaar

#AI #LLMs #PeerReview

ChatGPT is transforming peer review — how can we use it responsibly?

At major computer-science publication venues, up to 17% of the peer reviews are now written by artificial intelligence. We need guidelines before things get out of hand.

Update. "78 medical journals (78%) provided guidance on use of #AI in #PeerReview. Of these…46 journals (59%) explicitly prohibit using AI, while 32 allow its use if confidentiality is maintained and authorship rights were respected…The main reason for prohibiting or limited use of AI is confidentiality concerns (75 journals [96%])…Wiley and Springer Nature favored limited use of AI, while Elsevier and Cell Press prohibited AI use."
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2827333

Update. AI-assisted peer review shows "a strong and consistent institutional-prestige bias: identical papers attributed to low-prestige affiliations face a significantly higher risk of rejection, despite only modest differences in LLM-assessed quality."
https://arxiv.org/abs/2509.15122

#AI #Bias #PeerReview #Prestige #ScholComm

Prestige over merit: An adapted audit of LLM bias in peer review

Large language models (LLMs) are playing an increasingly integral, though largely informal, role in scholarly peer review. Yet it remains unclear whether LLMs reproduce the biases observed in human decision-making. We adapt a resume-style audit to scientific publishing, developing a multi-role LLM simulation (editor/reviewer) that evaluates a representative set of high-quality manuscripts across the physical, biological, and social sciences under randomized author identities (institutional prestige, gender, race). The audit reveals a strong and consistent institutional-prestige bias: identical papers attributed to low-prestige affiliations face a significantly higher risk of rejection, despite only modest differences in LLM-assessed quality. To probe mechanisms, we generate synthetic CVs for the same author profiles; these encode large prestige-linked disparities and an inverted prestige-tenure gradient relative to national benchmarks. The results suggest that both domain norms and prestige-linked priors embedded in training data shape paper-level outcomes once identity is visible, converting affiliation into a decisive status cue.

arXiv.org

@petersuber If 17% of reviews are being shat out by AIs, things ARE out of hand.
@petersuber @bachtasaar did peer reviews ever really have guidelines to begin with?
@petersuber Neat idea but very strange approach. Why didn't they just upload the PDF? Why just one paper? Why just one model? Gemini has a big enough context window to handle books. Also, ChatGPT can read images. I don't get why they spent all that time writing up such a little amount of information.

@williamgunn @petersuber

Using a medical case report as the reviewed paper is not representative of a typical science paper. Case reports are mainly descriptions in context of the existing literature.

Usually, science papers also include logical operations such as making inferences of the sort "we observe X, therefore we conclude gene y regulates gene z."
Checking if conclusions are based on sound data and sound logic is at the core of peer review, but was not really tested here.

@petersuber Maybe I missed something, but I didn't find in the article a discussion of the results mentioned in the abstract, concerning the "alignment" of ChatGPT's and the 3 reviewers' reports (admittedly very diverse in scope and content, and very superficial, at least for 2 of them). They simply reproduce ChatGPT's output and the reviewers' reports, leaving the comparison to the reader. One wonders what the authors mean by rating the reports, and thus "inter-rater agreement".
@petersuber @lizziegadd … and we were right to have those worries about bibliometrics.

@petersuber @roohighosh If any journal ever uses "AI" to review one of my submissions, I will instantly withdraw that submission, and never ever return to that journal.

Same thing if they hire a grade-5 kid to do the peer-review.

@petersuber (insert quip about leopards and faces)
@petersuber Can AI replace human peer review? - this study indicates ‘no/not yet’
@petersuber if someone can make money doing it, it's not far fetched. As I understand from being on Mastodon, peer reviewed journals are a morass of unethical and infective practice capitalizing on novelty over meaning. What can AI do for the industry except amp it up?

@petersuber

Far fetched to say the least.

Reviewing a paper without reading it!?! I guess I can also submit a paper without writing it, just letting the AI agents do the job. And then, there's no reason to review it. And why would I read a paper when I can get an AI-generated summary? At this point, why have journals to publish papers that nobody writes, reviews, or reads?

And there we go, AI will have achieved one useful thing, getting rid of the broken publication system.

@petersuber

But wait, silly me.

The AI agent would only be available to the ultra-privileged; the other 99.99% of us would still need to do the actual science job.

@lgatto @petersuber Given the secretive nature of peer-review, an LLM would need access to an existing corpus of "reviews", which are currently not publicly available. Is Nature planning to give access to its library of reviews to OpenAI (or the like)? Otherwise, the reviews aren't even going to sound like "reviews".
@RonBeavis @lgatto @petersuber Fine tuning is not so resource intensive as training, so journals could generate their own specific models without needing involvement of external companies. Not saying that would be a good idea, just that it's technically very feasible (also, there are plenty of open source models around)
@nicolaromano @lgatto @petersuber But Nature is a private sector publishing company. If they think that they can get a substantial licensing fee for providing a corpus of reviews to 3rd parties, they probably will do just that.

@petersuber Far-fetched, no. Insane, yes, absolutely.

None of the current LLMs actually have a PhD. Sure, it can summarize, but then the reviewer/editor is reviewing a review of the paper at hand, not the paper itself.

Is it really too much to ask for an unassisted human to review a paper?

Really, absurd. There is a much simpler way to increase the quality of journal articles. It's called #openscience

@petersuber Utter nonsense. A statistical model can't actually reason, at best it can flag spelling and (minor) grammatical errors.

But, and unfortunately: it's only a matter of time before some moneygrubbing publisher tries this because they think they can get richer.

@petersuber I think "far-fetched" is a deperately kind way of describing it. It's an idea that could only be seriously proposed by someone who either has no idea how LLMs work or how peer-review works.
@petersuber it's a pretty dystopian read. Quite a nice fantasy for those who like publisher lock in though.