I have a preprint out estimating how many scholarly papers are written using ChatGPT etc. I estimate upwards of 60k articles (>1% of global output) published in 2023. https://arxiv.org/abs/2403.16887

How can we identify this? Simple: there are certain words that LLMs love, and they suddenly started showing up *a lot* last year. Twice as many papers call something "intricate", with big rises for "commendable" and "meticulous".

#bibliometrics #scholcomm #chatgpt

ChatGPT "contamination": estimating the prevalence of LLMs in the scholarly literature

The use of ChatGPT and similar Large Language Model (LLM) tools in scholarly communication and academic publishing has been widely discussed since they became easily accessible to a general audience in late 2022. This study uses keywords known to be disproportionately present in LLM-generated text to provide an overall estimate for the prevalence of LLM-assisted writing in the scholarly literature. For the publishing year 2023, it is found that several of those keywords show a distinctive and disproportionate increase in their prevalence, individually and in combination. It is estimated that at least 60,000 papers (slightly over 1% of all articles) were LLM-assisted, though this number could be extended and refined by analysis of other characteristics of the papers or by identification of further indicative keywords.

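As a rough sketch of the counting approach, assuming you already have per-year counts of papers matching each candidate word (all numbers below are invented for illustration; real counts would come from a bibliometric database):

```python
# Sketch of the indicator-word method: compare a word's 2023 per-paper
# rate against its pre-LLM baseline. All counts below are invented.

BASELINE_YEARS = range(2016, 2023)  # pre-ChatGPT publishing years

def excess_ratio(hits, totals, target_year=2023):
    """Ratio of the word's rate in target_year to its baseline mean rate."""
    baseline = sum(hits[y] / totals[y] for y in BASELINE_YEARS) / len(BASELINE_YEARS)
    return (hits[target_year] / totals[target_year]) / baseline

# Hypothetical data: papers mentioning "intricate", and total papers indexed.
intricate = {2016: 21_000, 2017: 22_000, 2018: 23_500, 2019: 25_000,
             2020: 27_000, 2021: 29_000, 2022: 31_000, 2023: 64_000}
totals = {y: 4_000_000 + 150_000 * (y - 2016) for y in range(2016, 2024)}

print(f"'intricate' vs baseline: {excess_ratio(intricate, totals):.1f}x")
```

On numbers like these, "intricate" comes out at roughly twice its pre-LLM baseline rate - the kind of jump the study looks for.
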
@generalising Fantastic work, Andrew!
Thank you so much. Now I can search web data for posts, searches, and media using the same token words. :)
@generalising I'm number 2!!
@generalising Do you have the data table for the 90 top words? I'd love to see how the also-rans performed vis-a-vis the top 10! :)
@Wikisteff No, these were all done by hand so I didn't want to spend a full week on doing all 100! Might be practical to test them all using the Dimensions API, though?
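Something like the sketch below, perhaps - the endpoint paths and DSL syntax are written from memory and DIMENSIONS_KEY is a placeholder, so check it against the current Dimensions docs before trusting it:

```python
# Sketch: count publications per year whose full text matches each
# candidate word, via the Dimensions DSL API. Endpoint paths and DSL
# syntax are from memory - verify against docs.dimensions.ai.
import requests

API = "https://app.dimensions.ai"
token = requests.post(f"{API}/api/auth.json",
                      json={"key": "DIMENSIONS_KEY"}).json()["token"]
headers = {"Authorization": f"JWT {token}"}

def count_hits(word, year):
    """Total publications in `year` with `word` in their full text."""
    query = (f'search publications in full_data for "\\"{word}\\"" '
             f'where year = {year} return publications limit 1')
    r = requests.post(f"{API}/api/dsl/v2", data=query.encode(), headers=headers)
    return r.json()["_stats"]["total_count"]

for word in ["intricate", "commendable", "meticulous"]:
    print(word, {y: count_hits(word, y) for y in range(2016, 2024)})
```
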
@generalising I was thinking of Google NGram / Google Trends as well... for a different use case, of course. :)
@generalising I computed the number of standard deviations by which 2023-2024 language use sits above a baseline quadratic time-series model fitted to the 2016-2022 data. All your control words came out significantly different from the model, except for "before" and "earlier".
@generalising It's not a great model; I should probably use an ARIMA for the baseline trend, but unlike quadratic regression, I don't have that implemented in my Excel library.
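For anyone wanting to reproduce this outside Excel, a minimal sketch of the same idea - quadratic baseline fitted to 2016-2022, z-scores for 2023-2024 from the residual spread (the rates are invented):

```python
# Sketch: fit a quadratic trend to 2016-2022 word rates, then express
# 2023-2024 as standard deviations above that baseline. Rates invented.
import numpy as np

years = np.arange(2016, 2025)
rates = np.array([5.2, 5.4, 5.5, 5.7, 5.9, 6.1, 6.3, 12.8, 15.1])  # per 1,000 papers

fit = years <= 2022
t = years - years.min()  # centre to avoid a poorly conditioned polyfit
coeffs = np.polyfit(t[fit], rates[fit], deg=2)
baseline = np.polyval(coeffs, t)
resid_sd = np.std(rates[fit] - baseline[fit], ddof=3)  # 3 fitted parameters

for y, r, b in zip(years[~fit], rates[~fit], baseline[~fit]):
    print(f"{y}: {(r - b) / resid_sd:+.1f} sd from quadratic baseline")
```
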
@generalising The LLM-boosted adjectives are all up, except for "fresh", "potent", and "ingenious". Enormous effect size in 2024, as you noted.
@generalising For the adverbs, insane significance levels for "meticulously", "methodically", "compellingly", "impressively", and "strategically"; alongside a significant decline in "reportedly", "excellently", and "undoubtedly".
@generalising All your synthetic tests do great by this measure, but of course you already knew that. :)

@Wikisteff this is interesting, thank you!

What I don't have a figure for is "what percentage of papers in any given year have full text", and it may not be constant over time. This was one of the reasons for including control words - they proxy for it and let us know what a reasonable bound for year-to-year change might be (I got ~5%). I'm not sure if that complicates your analysis?
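
As a sketch of what I mean by the controls bounding things (all rates below are invented, per 1,000 papers):

```python
# Sketch: control words as a proxy for full-text coverage. Year-to-year
# movement in the controls (~5% at most here) gives a noise floor for
# what "no real change" looks like. All rates are invented.

def yoy_change(rates):
    """Fractional year-over-year changes for a rate series."""
    return [(b - a) / a for a, b in zip(rates, rates[1:])]

control_2016_2023 = [40.1, 40.8, 41.2, 40.5, 41.9, 42.3, 41.8, 42.6]
target_2016_2023 = [5.2, 5.4, 5.5, 5.7, 5.9, 6.1, 6.3, 12.8]

print("control drift:", [f"{c:+.1%}" for c in yoy_change(control_2016_2023)])
print("target drift: ", [f"{c:+.1%}" for c in yoy_change(target_2016_2023)])
```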

@Wikisteff similarly, I think it's plausible the 2024 data is weird in interesting ways - e.g. over-representing certain types of paper in certain journals because they publish faster - and that might complicate analysis of it. Which is not to say the 2024 figures *aren't* going to be terrifyingly high whatever corrections we apply!

@generalising Yeah, this is a *great* point: with LLMs, it's easier to *publish* new papers, which will contaminate the sample in a parallel and different way. It would be great to sample a random subset of 100 "likelies", 100 "unlikelies", and 100 "unsures" and do more detailed stylometrics on them to see if you can pull out differences between groups here.

Future work! :) :)
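
If anyone wants a starting point, a sketch of the sampling step (the grouping thresholds and record format are entirely made up):

```python
# Sketch: draw up to 100 papers from each of three groups (by indicator-
# word density) for stylometric follow-up. Thresholds and the record
# format are made up for illustration.
import random

def group(paper):
    density = paper["indicator_words_per_1000_tokens"]  # hypothetical field
    if density > 2.0:
        return "likely"
    if density < 0.2:
        return "unlikely"
    return "unsure"

def sample_groups(papers, n=100, seed=42):
    rng = random.Random(seed)
    samples = {}
    for label in ("likely", "unlikely", "unsure"):
        pool = [p for p in papers if group(p) == label]
        samples[label] = rng.sample(pool, min(n, len(pool)))
    return samples
```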

@generalising NB. This from a friend on Twitter just today:

@Wikisteff believe it or not, these extreme cases were what made me think about full-text digging originally. Except of course there's only a handful of these - peer review is pretty good at weeding the most blatant stuff out (and I'd assume a lot of it is editorially desk-rejected even before that step). So it was amusing but all pretty low-level.

Then the adjective list came out and I thought, hey, this might actually show up at scale! :-)

@generalising ...and you were *not* wrong!

As a futurist working for the Government of Canada on, among other things, the medium and long-term consequences of generative artificial intelligence in society, I am hugely interested in the time series here... and in ways to quantify the extent to which our productive capacity is colonized by AI.

In this context, your work, though early, is highly important!

@generalising That is an excellent point. I hadn't thought about the imputed base rates being subject to time-varying pressures. Although, that said, my quadratic regression *should* account for slow underlying changes in the data, even within series.

However, step changes due to policy changes cannot be modelled in this way. 🤔

@Wikisteff Credit where it's due - I took the sample list from an earlier study! https://arxiv.org/abs/2403.07183 (p 15, 16) I think this is a bit of an idiosyncratic list due to the peer-review context (hence it's all adjectives/adverbs, almost all positive) and there will definitely be other distinctive terms, some unpredictable - it would be quite interesting to do some larger analysis to try and find them.
Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.

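At its core, that paper's corpus-level estimator is a one-parameter mixture model. A toy sketch of the idea - the one-dimensional "style score" and Gaussian reference densities below are stand-ins, not the paper's actual token-level model:

```python
# Sketch: corpus-level maximum-likelihood estimate of the fraction alpha
# of documents drawn from an "AI" reference distribution vs a "human"
# one. The 1-d style score and Gaussian densities are toy stand-ins.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def estimate_alpha(p_ai, p_human):
    """Maximise sum_i log(alpha * p_ai[i] + (1 - alpha) * p_human[i])."""
    nll = lambda a: -np.sum(np.log(a * p_ai + (1 - a) * p_human))
    return minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded").x

# Toy corpus: 5,000 docs, 10% genuinely from the "AI" distribution.
rng = np.random.default_rng(0)
is_ai = rng.random(5000) < 0.10
score = np.where(is_ai, rng.normal(1.0, 1.0, 5000), rng.normal(0.0, 1.0, 5000))
alpha = estimate_alpha(norm.pdf(score, 1.0, 1.0), norm.pdf(score, 0.0, 1.0))
print(f"estimated AI fraction: {alpha:.3f}")  # should land near 0.10
```
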
@generalising It's a fantastic idea!
In 2022 I used fine-grained stylometrics to identify the unique-ish "fists" of posters and their proxy accounts on Twitter, doing some hypothesis testing of co-authorship amongst accounts in the aftermath of the 2022 Convoy Protest here in Ottawa - but I hadn't thought of using them for bibliometrics and AI!
It's a genius move! :)
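For a flavour of what "fist" matching looks like in its crudest form - function-word frequency profiles compared by cosine similarity (the word list and texts are placeholders, not what I actually used):

```python
# Sketch: crude authorship-"fist" comparison via function-word frequency
# profiles and cosine similarity. A placeholder, not the real method.
import math
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "was", "it", "for", "on", "with", "as", "but", "not"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical usage: compare two accounts' pooled posts.
a = profile("the cat sat on the mat and it was not a small mat")
b = profile("a dog ran in the park but it was not the same park")
print(f"fist similarity: {cosine(a, b):.2f}")
```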
@Wikisteff Why hadn't I heard of that work before? Fascinating!
@mirgray There's a LOT you can do with stylometrics. I'm still kind of hoping that LLMs can be used to identify the fists of individual authors in their training data reliably, as clearly the data are in there ("please write a sonnet about how bias in decision AIs is a challenging issue in the style of William Shakespeare").
@Wikisteff @mirgray Could you please consider creating a programming language named “Shakespeare”?
@Wikisteff I wasn't clear. I meant: why haven't I heard of your work on convoy-related tweets?
@mirgray Oh, great question!
Unpublished. I did it privately, not for publication, not with work.
Work stuff, we publish. :)