TIL that Kenyan workers have been used so much to train AI systems that standard writing by Kenyan people is often flagged as AI-generated when it is not (which means they can be discriminated against for jobs, exams, etc.)
https://marcusolang.substack.com/p/im-kenyan-i-dont-write-like-chatgpt

Edit: many people pointed out below that the article is about texts written in very classically trained English being flagged as AI-generated, which is the case for many Kenyans. It is documented that many Kenyan workers have been hired to train LLMs, but I assumed that this was the reason for the detection, and it may not be. Sorry about that, and thanks for the feedback (feel free to continue the discussion here).

I'm Kenyan. I Don't Write Like ChatGPT. ChatGPT Writes Like Me.

I'm calm. I'm calm. I promise.

this man's mind
@tek I can't help but find this oddly fascinating. 😼 🤔

@tek
> "Kenyan workers have been used so much to train AI systems"

Where in the linked post does it say that?

@ki @tek it doesn’t. The linked article is still worth reading, but it’s not the synopsis that OP says it is.

@zed @tek
thanks for confirming, I was doubting my reading comprehension

it is a good article

@ki
i think that part was rather assumed to be part of preexisting context ("widely known")
@zed @tek

It’s ambiguous how to interpret that — one extreme is “Kenyan workers were paid by AI companies to tune model output”; another is “the prose style required of ex-British-Empire students is so common in the online corpus that it shaped LLM prose”.

The latter is supported by the essay.

“the writer from Lagos, from Mumbai, from Kingston, from right here in Nairobi, […] was taught that precision was the highest form of respect for both the language and the reader”

Hear, hear.

@zed @ki @tek

@ki @tek

the synopsis is wrong

try this: high achieving Kenyan students were trained to write in a very clear, traditional style that has a certain rhythm & format that readers / listeners respond to because we (English speakers & writers worldwide, not just Kenyans) absorbed a common English language culture from the late British Empire

LLMs have now also absorbed this so nowadays doubters & racists assume Kenyans' polished writing wasn't written by them but by ChatGPT

@ki @tek it rather says that tons of old colonial British English babble was ingested by the slop machine because there is no copyright on it anymore.
@f4grx @ki @tek since when do they care about copyright?

@ki @tek As others have noted, it doesn’t, but it is something that has been well documented, including in Karen Hao’s excellent book, Empire of AI (https://en.wikipedia.org/wiki/Empire_of_AI).

These platforms leveraged RLHF, which often relied on English speaking, economically advantageous populaces with good connectivity to refine the feedback (Venezuelans were also used, but more so for image tagging/categorization).

@tek as a scientist who wrote hundreds of research articles and grants: ChatGPT writes in the style it was trained on: books, articles, curated sources. People who scrutinize GPT's writing style surely do not read beyond their social media bubbles.
@tek That could also explain why the image generating parts often collapse to a state where people tend to be darker skinned and black haired. (and why altright eejits have been whining about these image generators being "too woke")
@JorisMeys
iirc genai tends to have the opposite problem of whitewashing stuff usually
@tek

@tek but this isn’t what the text says?

the writer is arguing that the texts that were used to train AI made it sound like someone who went through the Kenyan education system. it is speaking of the system and of LLMs in general, and how these AI detectors often flag non-first-language speakers’ texts as AI, since teaching English in those places relies on exactly the kind of structure we now attribute to AI.

@agatha Indeed, many Kenyan workers have generated texts for AI, but it is not clear that this is the reason for that detection. I updated the post, thanks
@tek There's another dimension to this: even before the chatbot pandemic, enterprising Africans, many in Kenya, were doing assignments for students worldwide, cheaply if you had hard currency. Hence "Kenyan" style was incorporated in the training data of the chatbots from the start.
@tek
"So, when you read my work - when you see our work - what are you really seeing? Are you seeing a robot's soulless prose? Or are you seeing the image of our Standard Eight English teacher, Mrs. Amollo, her voice echoing in our minds - a voice that spoke with the clipped, precise accent of a bygone era - reminding us to connect our paragraphs with a suitable linking phrase?"

@tek

"Kenyan workers have been used so much to train AI systems, that standard writing by Kenyan people is often flagged as AI generated while it is not"

That is not what the article is saying.

It is, however, an excellent article.

@teun Yeah I made an edit to my post to clarify, thanks for the feedback
@tek two birds with one stone, for the white supremacist techbros.
@tek Personally I like the way Kenyan ppl talk and if AI can help me sound like them, that's about the first good thing I've heard about it!
Interesting. But somehow his writing style doesn't feel AI-like to me. I guess there are other clues I personally pick up on for AI-generated texts. Excessive use of lists is a big one. Especially when the first words in each list item are bold. Unnecessary details that no real human would specify, too.
@tek He didn't even write like an AI in the slightest.

@tek Kenyan workers, anyone who reads a lot of fantasy or sci-fi, or pretty much anyone who had to learn English as a second (or third, or fourth, whatever) language. These folks are all likely to have an above-average vocabulary.

It's galling that a moronic minority of nAtIvE sPeAkeRs is levelling these silly accusations.

@tek Not just their writing is commonly tagged as AI. I've actually seen a lot of people post on here that their own writing (from pre AI times) is often flagged as 100% AI.

Also I've seen academics recommend using AI to rewrite your own text until it passes the AI detectors, as you'd otherwise have to deal with being accused of using AI to generate the text. (The irony is probably obvious)

@agowa338 @tek
what a great plan, I'm sure it's feasible to test against all AI detection systems

@Doomed_Daniel @tek

No it's not, but you can put it through all you've access to...

(Still quite shitty and you shouldn't have to worry about these things when you didn't use AI for it...)

@agowa338 @tek
the problem is I guess if one of them actually works (better than the rest) it will detect not "suspected AI" but "certainly AI" and it might drag you even deeper into the shit than you'd otherwise be

@Doomed_Daniel @tek

None of them work. All of them are magic 8 balls.

That's exactly why people that weren't using AI are now using AI to rewrite their work to get a lower score in these oracles...

@tek This is not what this interesting article says. It says that AI detectors misclassify texts written by Kenyans, not because Kenyans were used to train AI but because they were taught English at school in a certain way, which looks "robotic" to US readers.
@bortzmeyer
@tek I understand it's the same teaching approach in Nigeria, where a lot of AI ghostwriting was/is done.
@tek it is certainly possible to discern human writing from ai slop, but it requires more subtle clues than a word list.

@tek

This looks interesting. I'd boost it if it wasn't a link to Substack. Substack also platforms fascists. That's a red line for me. It should be a red line for *all* anti-fascists.

@LevZadov I agree and I hate how much great content is written on Substack but also I don't want to avoid reading/sharing great content for that reason (but I understand why you may not)

@tek

Q: If ten people sit at a table and a Nazi sits down with them, how many Nazis are sitting at the table?

A: Ask a German.

@LevZadov
Me sharing a link to a blog post by a Kenyan writer on Substack and people supporting the 3rd Reich sounds like a pretty similar situation, I don't know how I didn't realize that before.
(this is sarcasm)

@tek

This person supports a publisher of Nazi propaganda. That's evil. He's blocked.

@tek tell me the technical class hasn't paid attention in english since third grade without telling me they haven't paid attention in english since third grade:

"same-length sentences are AI"

@tek "It accidentally replicated the linguistic ghost of the British Empire." 
@tek Thanks for the edit. Still an interesting read and topic.
@tek shit, it actually does make sense. I used to talk to Kenyan colleagues quite a lot around 2018-2019, and somehow completely forgot that actually yes, ChatGPT indeed sounds like them, in a way.