Wikipedia has a cheat sheet of well-known tells for identifying generated text. (With an appropriate warning not to over-index on minor ones as absolute proof) https://en.m.wikipedia.org/wiki/Wikipedia:WikiProject_AI_Cleanup/AI_catchphrases

Please stop sending me replies like "TIL I'm an LLM, because I do one of these seventeen things!" I WROTE TWO SENTENCES AND ONE OF THEM WAS ABOUT THIS
@0xabad1dea anyone who got good essay marks at school probably learned to do at least two of the Seventeen Things.
@fishidwardrobe @0xabad1dea definitely more than 2. The same goes for native speakers of the British English derivatives used across the Commonwealth
@lilstevie @fishidwardrobe @0xabad1dea If you learn those rules it's as a super simplified model suitable for elementary school exercises, and for the sake of being able to meaningfully break them. Not because writing is supposed to look like that.

@fishidwardrobe @0xabad1dea Most of the patterns under the "Language and tone" section are actually good writing practices in the right context, such as writing a college essay, a news article, or a travel brochure, but an important part of writing is knowing what you're writing for, and adapting your language to that format. Most of the "Language and tone" category are language patterns that would be inappropriate in an encyclopedia format (such as Wikipedia), but may be perfectly fine elsewhere.

I'd bet a lot of these "tells" for AI-generated text also catch a lot of plagiarised edits, where something is just copied directly from an article, travel brochure, etc. instead of being rephrased in objective language for an encyclopedia. I guess an argument could be made that generative AI is also just plagiarism with extra steps.

@0xabad1dea I'm relieved in ways I can't express that I'm out of education entirely, because based on attitudes I encountered I have total confidence that I'd encounter so many people with no revulsion at just submitting slop as assignments.

I've high confidence that many lecturers would not give a damn about false positive accusations of slop submissions against students, to the point of lecturers handing responsibility for detecting slop to an LLM, & feeding student work to that LLM.

@0xabad1dea personal mental note: put the warning BEFORE the main line. We're becoming trigger happy, polarized?
@0xabad1dea AND FURTHERMORE.... 😁
@0xabad1dea TIL I'm an LLM because I have reading but no comprehension. :P
@0xabad1dea i guess social media users are the dual of llms, because instead of hallucinating false new information on output, they ignore true old information on input...
@0xabad1dea maybe they only read an ai summary of your two sentences
@0xabad1dea My personal preference is for em-dashes to have no spaces, so I’d be devastated if that were the tell for bots.
@shanecelis @0xabad1dea You needn't be devastated. 😀 However, as one who has suffered too long from watching software not wrapping text at em dashes, I personally shall continue to be heretical — and add spaces.
@winterknell @0xabad1dea Of all the Hills you had to come to this one—to die on. ;)
@shanecelis @winterknell @0xabad1dea This hill I die on too! Trying to get consistent acceptable behaviour across eg LaTeX and HTML encouraged me to wrap em- but not en- dashes in most cases!
@0xabad1dea My love for em-dashes and markdown formatting is making me question my own existence.
@elricofmelnibone note that it specifically means markdown *on wikipedia*, which does not support markdown but rather a predecessor format
@0xabad1dea Can’t AI companies (and actors who want to be undetected) just feed this page as “patterns to avoid”?
@KyberNull wouldn't work super well, nor do AI companies particularly care if you notice AI generated text is AI generated text
@KyberNull @0xabad1dea i'm going only by intuition here, but i think it'd cause other, more obvious tells to pop up. also, these things are **really** bad at *not* doing some specified thing. (see "room with no elephants")
@KyberNull @0xabad1dea just the surface level of using one word more or less often, but many of the real patterns described on that page are harder to avoid.
@0xabad1dea TIL I'm an AI chatbot.
@antsu @0xabad1dea chat geepeetee doesnt say "TIL" so youre probably good
@antsu I literally put "do not over-index" in the two-sentence-long post
@0xabad1dea Apologies, you are absolutely correct! I seem to have overlooked the advice to not "over-index" — which was indeed included in your original prompt. As a large language model (LLM), my skills will continue to improve as technology advances and my training data set is expanded, making mistakes like this less frequent.
@antsu okay okay yes you got an unwilling laugh out of me
@0xabad1dea @antsu i can’t help but read “do not over-index” in the chatgpt “do not hallucinate” voice
@0xabad1dea *points at em-dashes and emojis used as bullet points all over the text*

"these code points are too high for a human hand to type"

@0xabad1dea #TIL that I'm just an LLM bot

I use some Russian #typography rules, which differ from the English ones, and, sadly, LLMs tend to use the same rules when it comes to dashes.

In RU typography, the en dash is used to separate numbers, and it has no spaces on the left and right. Like this: 123–456–789.

And the em dash is used to separate parts of a sentence, and it should have a space on the left and right — like this.

@evgandr @0xabad1dea Well… it is on the English Wikipedia referring to English text 😉

But on RU keyboards, you use the same Unicode codepoints, right? So U+002D for everything, right?

@tajpulo @0xabad1dea Yep, it was just funny to treat myself like a bot 🤖 beep-boop🙂. And also I wanted to write a bit about ru-typography

> you use the same Unicode codepoints

Yes, most people just use the same codepoint (-). But for people who want to use the typography symbols properly, there is Birman's layout (https://ilyabirman.net/typography-layout/) or the Compose key on X-server-based systems
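
For reference, here is a minimal Python sketch (not from anyone in the thread) of the three characters being discussed and the RU-style spacing described above. The helper names `describe`, `join_numbers`, and `join_clauses` are hypothetical, chosen only for illustration; the dashes are written as escapes so the source stays plain ASCII.

```python
import unicodedata

# The three characters discussed in the thread, as escapes.
HYPHEN_MINUS = "\u002d"  # what most keyboards type for everything
EN_DASH = "\u2013"
EM_DASH = "\u2014"


def describe(ch: str) -> str:
    """Return a 'U+XXXX NAME' label for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def join_numbers(parts: list[str]) -> str:
    """RU convention as described in the thread: en dash, no spaces."""
    return EN_DASH.join(parts)


def join_clauses(left: str, right: str) -> str:
    """RU convention as described in the thread: em dash with a space on each side."""
    return f"{left} {EM_DASH} {right}"


if __name__ == "__main__":
    for ch in (HYPHEN_MINUS, EN_DASH, EM_DASH):
        print(describe(ch))
    print(join_numbers(["123", "456", "789"]))
    print(join_clauses("it should have spaces", "like this"))
```

The point is only that the hyphen-minus, en dash, and em dash are three distinct code points, so which one shows up in a text depends mostly on the keyboard layout or input method used.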

@evgandr @0xabad1dea Ah, interesting. Thank you for sharing ☺️
@0xabad1dea TIL I'm an LLM, because I do one of these seventeen things!
@0xabad1dea Thanks for sharing this!
@0xabad1dea haha, I am ai then!
@0xabad1dea LLM is like pollution now.

@0xabad1dea Isn't this basically just another "Who's adapting faster" situation?

The more detailed the list becomes, the easier it is to simply adapt AI generated content to avoid these things, eventually making it harder and harder to tell if something is written by AI.

@christopherklay you're not wrong in that yes, you can use this list to intentionally improve AI writing to pass it off as authentic. However, in practice, AI companies don't care if you notice something you're reading is AI generated – they already got paid when it was generated – and people who are using an LLM to generate text because they either don't understand the subject themselves or don't speak the language well are also the least likely to be able to polish it up to the point no-one would be suspicious
@0xabad1dea @christopherklay In my experience, people who use it to generate text also think ChatGPT writes great prose.

@0xabad1dea The average social media comment and the like definitely wouldn't be changed because of this, but I'd argue that news sites, for example, could come up with "refining" steps fairly easily.

The potential gain in viewers from not looking like AI, compared to the flood of obvious AI spam, would alone justify a few extra steps.

@christopherklay @0xabad1dea people generating AI slop content have already conceded that they don't really care about the content.
@tedmielczarek @0xabad1dea People generating AI content to make money do care - if it makes them more money.
@christopherklay @0xabad1dea When you say "simply", that sounds to me like the siren song of AI: that you can just try anything, and maybe it will work. But if you read that list more closely, these are more like signs of defects, not just styles, so fixing them is not simple. A person who knows what they're talking about and wrote this way can simply edit their writing to be clearer and less clichéd. A person would have a harder time if they were trying to cover up the fact that they don't know as much as they want people to think they know. I suspect likewise an AI would need to be able to better retrieve and process actual information to avoid these mistakes.
@0xabad1dea it’s fascinating to look at the breakdown between items that are “chatbots have this quirk of style that isn’t bad per se but is a tell”, items that are “chatbots write poorly in a fairly consistent way”, and items that are “chatbots just absolutely cannot follow certain Wikipedia styles and conventions”
@0xabad1dea we're getting closer to an antidote.
@0xabad1dea Funnily enough, several of those points are exactly how someone learning English as a foreign language is taught to write in order to get a higher exam score.
@0xabad1dea It's fascinating to me that you can tell where the AI learned things - all that florid language comes from advertising copy and fails to pass as impartial enough for Wiki. The listing comes from SEO techniques and traditional teaching tools for high school level essay writing. Once you move to other contexts, it'll be harder to use those as a tell.
@jilder @0xabad1dea the citation at the bottom of the page for "why does AI use 'delve' so much" is fascinating - the raw LLM goes through a human-curation "fine tuning" phase before entering production, which needs tons of proofreader labor. and that labor in turn comes from Nigeria (cheap), where people use "delve" much more frequently in conversation than in the US / Europe. hence, the LLM is conditioned to write Nigerian English and it becomes a tell for the rest of us