generalising,
@generalising@mastodon.flooey.org avatar

I have a preprint out estimating how many scholarly papers are written using ChatGPT etc. I estimate upwards of 60k articles (>1% of global output) published in 2023. https://arxiv.org/abs/2403.16887

How can we identify this? Simple: there are certain words that LLMs love, and they suddenly started showing up a lot last year. Twice as many papers call something "intricate", with big rises for "commendable" and "meticulous".
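The marker-word idea described above can be sketched as a simple frequency count. This is a toy illustration, not the preprint's actual pipeline: the marker list here is just the words named in the post, and the corpora and counting choices are made up for demonstration.

```python
import re
from collections import Counter

# Marker words mentioned in the thread; the preprint uses a longer list
# taken from an earlier peer-review study.
MARKERS = {"intricate", "commendable", "meticulous", "notable"}

def marker_counts(abstracts):
    """Count how many abstracts contain each marker word (case-insensitive,
    at most once per abstract)."""
    counts = Counter()
    for text in abstracts:
        words = set(re.findall(r"[a-z]+", text.lower()))
        for m in MARKERS & words:
            counts[m] += 1
    return counts

def excess_rate(counts_now, counts_baseline, n_now, n_baseline):
    """Rough per-marker excess: current usage rate minus baseline rate.
    Positive values suggest usage above the historical norm."""
    return {m: counts_now[m] / n_now - counts_baseline[m] / n_baseline
            for m in MARKERS}
```

Comparing `excess_rate` across years is the crude version of the idea: a sudden jump in the rate of these words, relative to a pre-LLM baseline, is the signal.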

rysiek,
@rysiek@mstdn.social avatar

@generalising ooooh!

ArchaeoIain,
@ArchaeoIain@archaeo.social avatar

@generalising but if LLMs derive their models from papers that are already published, how can there be massive increases in such words? I like the idea, but this sounds strange.

joeroe,
@joeroe@archaeo.social avatar

@ArchaeoIain @generalising Overfitting to certain sources/authors in their training data, I suppose? "Notable" for example is classic Wikipedianese.

Wikisteff,
@Wikisteff@mastodon.social avatar

@generalising Fantastic work, Andrew!
Thank you so much. Now I can search web data for posts, searches, and media using the same token words. :)

Wikisteff,
@Wikisteff@mastodon.social avatar

@generalising I'm number 2!!

generalising,
@generalising@mastodon.flooey.org avatar

@Wikisteff Credit where it's due - I took the sample list from an earlier study! https://arxiv.org/abs/2403.07183 (p 15, 16) I think this is a bit of an idiosyncratic list due to the peer-review context (hence it's all adjectives/adverbs, almost all positive) and there will definitely be other distinctive terms, some unpredictable - it would be quite interesting to do some larger analysis to try and find them.

Wikisteff,
@Wikisteff@mastodon.social avatar

@generalising It's a fantastic idea!
I used fine-grained stylometrics to identify the unique-ish "fists" of posters and their proxy accounts in Twitter posts in 2022 to do some hypothesis testing of co-authorship amongst accounts in the aftermath of the 2022 Convoy Protest here in Ottawa, but I hadn't thought of using them for bibliometrics and AI!
It's a genius move! :)

mirgray,

@Wikisteff Why hadn't I heard of that work before? Fascinating!

Wikisteff,
@Wikisteff@mastodon.social avatar

@mirgray There's a LOT you can do with stylometrics. I'm still kind of hoping that LLMs can be used to identify the fists of individual authors in their training data reliably, as clearly the data are in there ("please write a sonnet about how bias in decision AIs is a challenging issue in the style of William Shakespeare").
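A minimal sketch of the stylometric "fist" idea mentioned above — not Wikisteff's actual method, just a common textbook approach: represent each author by their function-word frequencies and compare fingerprints with cosine similarity. The function-word list is an assumption; real stylometric work uses far richer feature sets.

```python
import math
import re
from collections import Counter

# Toy function-word list (illustrative only).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]

def fist_vector(text):
    """Relative frequency of each function word — a crude stylistic fingerprint."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)
    counts = Counter(tokens)
    return [counts[w] / n for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two fingerprint vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two texts by the same author should score closer to 1.0 than texts by different authors; hypothesis testing of co-authorship would then compare similarities across many account pairs.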

franco_vazza,
@franco_vazza@mastodon.social avatar

@generalising
Very cool!
from what I understand, if a paper contains both "intricate" and "meticulous", or "intricate" and "notable", there is a very high chance it is by LLM (although of course that's not proof... especially now that your work will be noticed).

Is there hope to get automatic LLM detection with ~99% confidence? Of course the entire point is that LLMs mimic human language, but you are showing that language correlations appear and are strong indicators, and they may be hard to polish away.

generalising,
@generalising@mastodon.flooey.org avatar

@franco_vazza there's always a baseline of usage, but that pair together has suddenly become a lot more common.

I don't think this approach is great for detecting LLM involvement in any individual paper (there are much more sophisticated tools for that) but it works OK for estimation at a much broader scale.
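The corpus-level estimation mentioned above can be reduced to simple arithmetic: if a marker's usage rate exceeds its pre-LLM baseline, the excess rate times the corpus size gives a rough count of affected papers. This is only an illustration of the logic; the preprint's actual estimate combines multiple markers and is more careful about baselines and trends.

```python
def estimate_excess_papers(rate_now, rate_baseline, total_papers):
    """Corpus-level estimate: papers beyond what the baseline rate predicts.

    rate_now      -- fraction of current papers using the marker word
    rate_baseline -- fraction in a pre-LLM baseline period
    total_papers  -- size of the current corpus
    """
    return max(rate_now - rate_baseline, 0.0) * total_papers
```

For example, a marker that doubles from 1% to 2% of a million-paper corpus implies roughly 10,000 excess papers — which is why this works for broad-scale estimation but says little about any individual paper.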
