Had a little scraping fun with Austria's yellow press. German explainer, sadly, no English captions.

https://www.youtube.com/watch?v=4rvPK6jB3l4

Let me explain. In Austria, governments feed the media with government funding and ads. No matter the party.

A few years ago, they invented yet another money funnel: digital transformation funding. Young online media are explicitly excluded; it is aimed at established media houses only.

They submit project proposals, a politically appointed jury waves them through. 🧵

The above two screenshots are the funded projects for Austria's two biggest free yellow press media houses. They (OE24 and Heute) are run-of-the-mill yellow press outlets: basically the antithesis of journalism.

As you can see, both handed in "AI journalism" projects in 2024. My theory: we should see an increase in LLM-generated text markers across the years 2021 (no GPT), 2022 (GPT release), 2023 (initial evaluations), 2024 (experimental adoption), and 2025 (rollout as part of the journalistic pipeline).

What markers could we use? Well, those aren't sophisticated media houses, so let's take the silliest of all: em-dashes (or rather en-dashes, cause German).

You see, LLMs LOVE emitting dashes. We could use more sophisticated markers, like stylometric features. But that's work, and if we can find a signal just based on dashes, that's enough for an initial evaluation.

For statistical significance, we need a LOT of data, namely, thousands of articles for each year. Thankfully, media websites generally have useful sitemaps, so Google & Co. can easily index their content.

Those sitemaps contain links to every individual article ever published on those sites. The content is static HTML, so we can easily parse the article date, category, title, and content.
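
For the curious, here is a minimal sketch of reading such a sitemap with Python's standard library. It is not the scraper used for this thread. The sample XML and its example.com URLs are made up, and it assumes the sites follow the standard sitemaps.org format.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical excerpt. Real sitemaps are fetched over HTTP and are often
# split into an index file that points to many child sitemaps.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/leute/some-article/123</loc>
    <lastmod>2025-10-03T08:15:00+02:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/welt/another-article/456</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return a list of (url, lastmod or None) tuples from a urlset sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        if loc:
            # lastmod is optional in the protocol, so it may be missing.
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


for url, lastmod in parse_sitemap(SAMPLE_SITEMAP):
    print(url, lastmod)
```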

Based on this, I scraped all articles for the month of October for the years 2021, 2022, 2023, 2024, and 2025.
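
Picking out the October sample could look roughly like this. It is a sketch that assumes every article comes with an ISO 8601 publication timestamp. The function name and the sample timestamps are illustrative and not taken from the actual scraper.

```python
from datetime import datetime

# Years covered by the sample described above.
YEARS = {2021, 2022, 2023, 2024, 2025}


def is_october_sample(iso_timestamp):
    """True if the timestamp falls in October of one of the sampled years."""
    # Note: a trailing "Z" is only accepted by fromisoformat on Python 3.11+.
    published = datetime.fromisoformat(iso_timestamp)
    return published.month == 10 and published.year in YEARS


# Hypothetical timestamps.
stamps = ["2025-10-03T08:15:00+02:00", "2025-11-01T00:00:00+01:00", "2023-10-31"]
print([s for s in stamps if is_october_sample(s)])
```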

You can get the data here:
Heute
https://sink.mariozechner.at/api/share/be17c72de69d81b7d7c3c4664b30717b/file/heute-2021-2025.zip

OE24
https://sink.mariozechner.at/api/share/bb5c7a15e2dbac8c1c6d79e0dea46a58/file/oe24-2021-2025.zip

Then, the only thing that's left is counting em/en-dashes in all the articles and visualizing the results with fancy charts. Let's discuss the OE24 results first. The Heute results are analogous.
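
The counting itself is about as simple as it sounds. A minimal sketch, assuming the article body is already available as plain text (the sample sentence is made up):

```python
EN_DASH = "\u2013"  # the dash typically used in German typography
EM_DASH = "\u2014"


def count_dashes(text):
    """Count en- and em-dashes; plain hyphens (as in "E-Mail") are ignored."""
    return text.count(EN_DASH) + text.count(EM_DASH)


sample = "Wien \u2013 so viel steht fest \u2013 bleibt teuer."
print(count_dashes(sample))  # 2
```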

The first chart shows what percentage of articles in October of each year had no (green), 1-3 (yellow), 4-7 (orange), 8-12 (red), or more than 12 en/em-dashes.
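
The bucketing behind such a chart can be sketched like this. The bucket boundaries are the ones named above. The function names and the sample counts are made up for illustration.

```python
from collections import Counter

BUCKETS = ["0", "1-3", "4-7", "8-12", ">12"]


def bucket(dash_count):
    """Map a per-article dash count to its chart bucket."""
    if dash_count == 0:
        return "0"
    if dash_count <= 3:
        return "1-3"
    if dash_count <= 7:
        return "4-7"
    if dash_count <= 12:
        return "8-12"
    return ">12"


def bucket_shares(dash_counts):
    """Percentage of articles per bucket for one year's worth of counts."""
    total = len(dash_counts)
    if total == 0:
        return {b: 0.0 for b in BUCKETS}
    tally = Counter(bucket(n) for n in dash_counts)
    return {b: 100.0 * tally[b] / total for b in BUCKETS}


# Hypothetical dash counts for eight articles.
print(bucket_shares([0, 0, 2, 5, 9, 13, 0, 1]))
```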

The proportions are quite stable from 2021-2024. In 2025, you can see an enormous increase in the categories 1-3, 4-7, 8-12, and +12.

You can find a "live" version of all the charts and tables for OE24 here:

https://sink.mariozechner.at/api/share/b6bab98a893f4ab068bffbe08a89658f/file/oe24-dash-timeline.html

We can make it even more interesting by looking at it in detail for individual article categories. OE24 divides its articles into categories like Madonna (horoscopes), tierschutz (animal welfare), leute (people), welt (world), etc.

Here the average number of dashes per article is shown in blue. Notice anything?
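
The per-category average is a plain group-by. A sketch, assuming each article has been reduced to a (category, year, dash count) triple; the rows below are invented and are not taken from the data set:

```python
from collections import defaultdict


def mean_dashes_by_category(articles):
    """Average dashes per article, keyed by (category, year)."""
    sums = defaultdict(lambda: [0, 0])  # (category, year) -> [dash total, article count]
    for category, year, dashes in articles:
        acc = sums[(category, year)]
        acc[0] += dashes
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


# Hypothetical rows.
rows = [("leute", 2024, 0), ("leute", 2024, 2), ("leute", 2025, 6), ("welt", 2025, 3)]
for key, mean in sorted(mean_dashes_by_category(rows).items()):
    print(key, mean)
```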

The second chart in each category shows the same as the large chart above: the percentage of articles with 0, 1-3, 4-7, 8-12, or +12 dashes.

In both charts in each category, the difference between 2021-2024 and 2025 is significant.
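
If you want to put a number on "significant", one common option is a two-proportion z-test on the share of articles that contain at least one dash. This is a sketch of that idea only, not necessarily the test behind the charts, and the counts in the example are made up.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    hits_* = articles with at least one dash, n_* = all articles in that sample.
    Returns (z, p_value); z is positive when sample b has the larger share.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all hits or all misses: no measurable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: 100 of 1000 articles in one year, 300 of 1000 in the next.
z, p = two_proportion_z(100, 1000, 300, 1000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

With thousands of articles per year, even small shifts come out as statistically significant, so the size of the jump matters more than the p-value.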

Categories like Shopping, Animal Welfare, People seem to now be largely written with the help of LLMs. Take a look at the site above and see for yourself. Below the charts there are raw numbers for the statistics connoisseurs among you.

What does this mean now? Does OE24 use LLMs? Yes, we already know that from @plocaploca.bsky.social. The data here just underpin it numerically.

What we can additionally derive: the whole thing began in the year 2025, so after the funded AI projects.

And here are the same charts for the other medium, Heute. Same exact pattern.

You can find the live version here:
https://sink.mariozechner.at/api/share/3ed2ca43bcab525c6424fd678768f843/file/heute-dash-timeline.html

That means we all helped OE24 and HEUTE with our taxes to increasingly feed their readership with LLM slop. Go us!

A bunch of Austrian journalists indicated to me that this is a much more widespread problem. Guess I'll have to do more scraping (possibly with more sophisticated markers).

And if you find this kind of work educational, useful, or just entertaining, and if you have disposable income, consider supporting our zero-overhead charity. More details here:

https://mastodon.gamedev.place/@badlogic/113931946046141128

Mario Zechner (@badlogic@mastodon.gamedev.place)

If you find this useful, please consider donating to our 🇺🇦 charity. We have zero overhead; every cent of your donation goes towards buying €50 food vouchers, which we send to Ukrainian families who have fled to Austria. https://cards-for-ukraine.at We are also 100% transparent. Every order, invoice, payment receipt, etc., can be found here: https://drive.google.com/drive/folders/1PxOL8A44bIRU1Hdoq87_2iXSLNmnMXQr?usp=drive_link You can read about the charity's history here: https://mariozechner.at/posts/2024-07-15-two-years-in-review/#toc_0

This thread is dedicated in part to @pluralistic :)

Special shout out to my dear computational linguist friend Jenia (PhD) over on Bluesky. She validated my findings with more sophisticated analysis, using stylometric features on the same data.

You can find her thread with her initial findings here:

https://bsky.app/profile/schenior.bsky.social/post/3m4ma4h66rs2q

Science!

Jenia (Женя) (@schenior.bsky.social)

Okok, using this chance to try out ✨stylometrics w AI✨ - tonight just seeing how far we can push this with str_count-ing basic features. I looked at:
- en-dash
- em-dash
- mean sentence length
- max sentence length
- comma
- quotes <">
- Fancy Quotes <„>
Per category which exists for all 5 years
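
A rough Python take on counting those basic features, for anyone who wants to play along. This is not Jenia's code. The naive sentence splitter and the choice to measure sentence length in whitespace-separated tokens are assumptions.

```python
import re


def basic_features(text):
    """Count a few simple stylometric features in one article's text."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Length in whitespace-separated tokens (a free-standing dash counts as one).
    lengths = [len(s.split()) for s in sentences]
    return {
        "en_dash": text.count("\u2013"),
        "em_dash": text.count("\u2014"),
        "comma": text.count(","),
        "straight_quote": text.count('"'),
        "low_quote": text.count("\u201e"),  # German opening quote
        "mean_sentence_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_sentence_len": max(lengths, default=0),
    }


# Made-up sample text.
sample = "Er sagte: \u201eNein\u201c. Dann ging er \u2013 ohne Eile, ohne Ziel \u2013 nach Hause!"
print(basic_features(sample))
```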

Finally, full disclosure: I am in no way, shape, or form opposed to the use of LLMs in general. I use them a lot myself, for tasks I know they can accomplish. And I always keep myself as the human in the loop. In fact, the code to render the charts above was generated by an LLM to save time, and then manually reviewed and validated by me.

What I am opposed to is the mindless use of LLMs to generate slop, or worse, supposed journalistic content. Especially on taxpayer money.

@badlogic welcome to my blocklist, “AI” apologist

@badlogic nein-doch-oh.gif

Thanks for the meticulous research!