Good start on a hard question — how or whether to use #AI tools in #PeerReview.
https://www.researchsquare.com/article/rs-2587766/v1

"For the moment, we recommend that if #LLMs are used to write scholarly reviews, reviewers should disclose their use and accept full responsibility for their reports’ accuracy, tone, reasoning and originality."

PS: "For the moment" these tools can help reviewers string words together, not judge quality. We have good reasons to seek evaluative comments from human experts.

Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other Large Language Models in scholarly peer review

Background: The emergence of systems based on large language models (LLMs) such as OpenAI’s ChatGPT has created a range of discussions in scholarly circles. Since LLMs generate grammatically correct and mostly relevant (yet sometimes outright wrong, irrelevant or biased...

Update. I acknowledge that there's no bright line between using these tools to polish one's language and using them to shape one's judgments of quality. I also ack that these tools are steadily getting better at "knowing the field". That's why this is a hard problem.

One way to ensure that reviewers take #responsibility for their judgments is #attribution.

#PeerReview #OpenPeerReview

Update. I'm pulling a few other comments into this thread, in preparation for extending it later.

1. I have mixed feelings on #attribution in peer review. I see the benefits, but I also see the benefits of #anonymity.
https://twitter.com/petersuber/status/1412455826397204487

2. For #AI today, good #reviews are a harder problem than good #summaries.
https://fediscience.org/@petersuber/109954904433171308

3. Truth detection is a deep, hard problem. Automating it is even harder.
https://fediscience.org/@petersuber/109921214854932516

#PeerReview #OpenPeerReview

Peter Suber (@petersuber) on X

Mixed feelings about open peer review. Yes to all the benefits cited by others. But anonymity has benefits too. 1. It protects women, minority, & young/unknown authors from referee bias. 2. It protects the same scholars as referees when criticizing weak work by senior scholars.

X (formerly Twitter)

Update. I'm pulling in two of my Twitter threads on using #AI or #PredictionMarkets to estimate quality-surrogates (not quality itself). I should have kept them together in one thread, but it's too late now.

https://twitter.com/petersuber/status/1259521012196167681

https://twitter.com/petersuber/status/1196908657717342210

Peter Suber (@petersuber) on Twitter

“If a successful replication boosts the credibility of a research article, then does a prediction of a successful replication, from an honest prediction market, do the same, even to a small degree? https://t.co/fBtZ32mq6J”

Twitter

Update. I'm sure this has occurred to #AI / #LLM tool builders. Determining whether an assertion is #true is a hard problem and we don't expect an adequate software solution any time soon, if ever. But determining whether a #citation points to a real publication, and whether it's #relevant to the passage citing it, are comparatively easy. (Just comparatively.)

Some tools already cite sources. But when will tools promise that their citations are real and relevant — and deliver on that promise?

Update. I've been playing with #Elicit, one of the new #AI #search engines. Apart from answering your questions in full sentences, it cites peer-reviewed sources. When you click on one, Elicit helps you evaluate it. Quoting from a real example:

"Can I trust this paper?
• No mention found of study type
• No mention found of funding source
• No mention found of participant count
• No mention found of multiple comparisons
• No mention found of intent to treat
• No mention found of preregistration"

Update. Found in the wild: A peer-reviewer used #AI to write comments on a manuscript. The AI tool recommended that the author review certain sources, but nearly all of the recommended works were fake.
https://www.linkedin.com/feed/update/urn:li:share:7046083155149103105/

#Misconduct #NotHypothetical

Robin Bauwens on LinkedIn: A reviewer rejected my paper, and instead suggested me to familiarize…

A reviewer rejected my paper, and instead suggested me to familiarize myself with the following readings. I could not find them anywhere. After a control in…

Update. The US #NIH and Australian Research Council (#ARC) have banned the use of #AI tools for the #PeerReview of grant proposals. The #NSF is studying the question.
https://www.science.org/content/article/science-funding-agencies-say-no-using-ai-peer-review
(#paywalled)

Apart from #quality, one concern is #confidentiality. If grant proposals become part of a tool's training data, there's no telling (in the NIH's words) “where data are being sent, saved, viewed, or used in the future.”

#Funders

Update. If you *want* to use #AI for #PeerReview:

"Several publishers…have barred researchers from uploading manuscripts…[to] #AI platforms to produce #PeerReview reports, over fears that the work might be fed back into an #LLM’s training data set [&] breach contractual terms to keep work confidential…[But with] privately hosted [and #OpenSource] LLMs…one can be confident that data are not fed back to the firms that host LLMs in the cloud."
https://www.nature.com/articles/d41586-023-03144-w

How ChatGPT and other AI tools could disrupt scientific publishing

A world of AI-assisted writing and reviewing might transform the nature of the scientific paper.

Update. "Avg scores from multiple ChatGPT-4 rounds seems more effective than individual scores…If my weakest articles are removed… correlation with avg scores…falls below statistical significance, suggesting that [it] struggles to make fine-grained evaluations…Overall, ChatGPT [should not] be trusted for…formal or informal research quality evaluation…This is the first pub'd attempt at post-publication expert review accuracy testing for ChatGPT."
https://arxiv.org/abs/2402.05519

#AI #PeerReview

Can ChatGPT evaluate research quality?

Purpose: Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task. Design/methodology/approach: Test the extent to which ChatGPT-4 can assess the quality of journal articles using a case study of the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements. Findings: ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author's significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations. Research limitations: The data is self-evaluations of a convenience sample of articles from one academic in one field. Practical implications: Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use. Originality/value: This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.

arXiv.org
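
One possible reading of the averaging result is that each ChatGPT round adds its own independent noise to a score, and that averaging many rounds cancels some of it. That is an assumption, not a claim made in the abstract. The sketch below illustrates the idea on synthetic data only: the noise model, the `noise_sd` value and the function names are hypothetical, and only the counts of 51 articles and 15 rounds come from the abstract.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_articles=51, n_rounds=15, noise_sd=2.0, seed=0):
    """Return (mean per-round r, r of the averaged scores) on synthetic data."""
    rng = random.Random(seed)
    # Synthetic stand-in for the expert's quality judgements (not real data).
    quality = [rng.gauss(0, 1) for _ in range(n_articles)]
    # Assumed noise model: each round scores quality plus independent noise.
    rounds = [[q + rng.gauss(0, noise_sd) for q in quality]
              for _ in range(n_rounds)]
    per_round_r = [pearson(quality, scores) for scores in rounds]
    # Average each article's score across all rounds.
    averaged = [mean(article_scores) for article_scores in zip(*rounds)]
    return mean(per_round_r), pearson(quality, averaged)


single_r, averaged_r = simulate()
print(f"mean single-round r: {single_r:.3f}")
print(f"r of averaged scores: {averaged_r:.3f}")
```

Under this noise model the averaged scores track the synthetic judgements more closely than a typical single round does. Whether ChatGPT's real scoring errors behave like independent noise is a separate question that the sketch does not address.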

Update. Lancet Infectious Diseases on why it does not permit #AI in #PeerReview:
https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(24)00160-9/fulltext

1. In an experimental peer review report, #ChatGPT "made up statistical feedback and non-existent references."

2. "Peer review is confidential, and privacy and proprietary rights cannot be guaranteed if reviewers upload parts of an article or their report to an #LLM."

Update. "Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these [#CS] conferences could have been substantially modified by #LLMs, i.e. beyond spell-checking or minor writing updates."
https://arxiv.org/abs/2403.07183

#AI #PeerReview

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.

arXiv.org
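
The abstract does not spell out the model, so the following is only a toy illustration of corpus-level maximum likelihood estimation, not the paper's method. It assumes a single hypothetical "marker word" that appears in human-written and AI-modified reviews at known, made-up rates (`p_human`, `p_ai`), and then estimates the fraction `alpha` of AI-modified reviews in a synthetic corpus. All names and numbers are assumptions for illustration.

```python
import math
import random


def log_likelihood(alpha, observations, p_human, p_ai):
    """Log-likelihood of the corpus under a human/AI mixture with weight alpha."""
    total = 0.0
    for has_marker in observations:
        human = p_human if has_marker else 1 - p_human
        ai = p_ai if has_marker else 1 - p_ai
        total += math.log((1 - alpha) * human + alpha * ai)
    return total


def estimate_alpha(observations, p_human, p_ai, steps=200):
    """Grid-search the mixture weight in [0, 1] that maximizes the likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: log_likelihood(a, observations, p_human, p_ai))


def synthetic_corpus(n_docs, true_alpha, p_human, p_ai, seed=0):
    """Generate made-up observations: True if a review contains the marker word."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        # Pick the review's source, then draw the marker at that source's rate.
        rate = p_ai if rng.random() < true_alpha else p_human
        docs.append(rng.random() < rate)
    return docs


# Hypothetical rates and mixture weight, chosen only for the demonstration.
docs = synthetic_corpus(n_docs=2000, true_alpha=0.12, p_human=0.05, p_ai=0.40)
print("estimated alpha:", estimate_alpha(docs, p_human=0.05, p_ai=0.40))
```

The estimate uses only corpus-wide frequencies and never labels an individual review as AI-written, which fits the abstract's remark that some trends may be too subtle to detect at the individual level.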

Update. "We demonstrate how increased availability and access to #AI technologies through recent emergence of chatbots may be misused to write or conceal plagiarized peer-reviews."
https://link.springer.com/article/10.1007/s11192-024-04960-1

#PeerReview

Emerging plagiarism in peer-review evaluation reports: a tip of the iceberg? - Scientometrics

The phenomenon of plagiarism in peer-review evaluation reports remained surprisingly unrecognized, despite a notable rise of such cases in recent years. This study reports multiple cases of peer-review plagiarism recently detected in 50 different scientific articles published in 19 journals. Their in-depth analysis reveals that such reviews tend to be nonsensical, vague and unrelated to the actual manuscript. The analysis is followed by a discussion of the roots of such plagiarism, its consequences and measures that could counteract its further spreading. In addition, we demonstrate how increased availability and access to AI technologies through recent emergence of chatbots may be misused to write or conceal plagiarized peer-reviews. Plagiarizing reviews is a severe misconduct that requires urgent attention and action from all affected parties.

SpringerLink

Update. "Researchers should not be using tools like #ChatGPT to automatically peer review papers, warned organizers of top #AI conferences and academic publishers…Some researchers, however, might argue that AI should automate peer reviews since it performs quite well and can make academics more productive."
https://www.semafor.com/article/05/08/2024/researchers-warned-against-using-ai-to-peer-review-academic-papers

#PeerReview

Researchers warned against using AI to peer review academic papers | Semafor

Top AI conferences and academic publishers worry about intellectual integrity as more researchers use tools like ChatGPT

Update. The @CenterforOpenScience (#COS) and partners are starting a new project (Scaling Machine Assessments of Research Trustworthiness, #SMART) in which researchers voluntarily submit papers to both human and #AI reviewers, and then give feedback on the reviews. The project is now calling for volunteers.
https://www.cos.io/smart-prototyping

#PeerReview

SMART Prototyping

The Center for Open Science (COS), along with its collaborators, is building on the work completed during the DARPA-funded SCORE program, which demonstrated the potential of using algorithms to efficiently evaluate research claims at scale. The SCORE program supplements existing research evaluation methods, including human judgment, evidence aggregation, and systematic replication.

@petersuber @CenterforOpenScience I admit I am increasingly worried that the Center for Open Science is too high on its own supply and is more and more likely to have a net negative effect on science.