I put together my own system for tracking the total number of Mastodon users over time, as reported for the instances tracked by https://instances.social/

It's a delightful (to me) combination of different tricks - git scraping, my git-history and s3-credentials tools, Datasette Lite and an Observable notebook to plot the chart at the end.

I describe how it all works in detail here: https://simonwillison.net/2022/Nov/20/tracking-mastodon/

Or you can jump straight in to play with my notebook: https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time

Mastodon instances

This is a good example of me forcing myself to live the "If you do a project, you should write about it" rule - I was SO tempted to get this thing working and then go to bed, but I made myself do the extra 45 minutes of work to turn it into a blog post.

https://simonwillison.net/2022/Nov/6/what-to-blog-about/

What to blog about

You should start a blog. Having your own little corner of the internet is good for the soul! But what should you write about? It’s easy to get hung up …

Added a disclaimer to the notebook at https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time - just in case people start taking those numbers as the gospel truth as to the human population of Mastodon
Mastodon users and statuses over time

Gathered by scraping the JSON from https://instances.social/ every 20 minutes using this repository: https://github.com/simonw/scrape-instances-social For full details about how this works, see Tracking Mastodon user numbers over time with a bucket of tricks on my blog. How much should you trust these numbers? The user number here is calculated by adding up the number of registered users reported for every server in the https://instances.social/instances.json file published by https://instances.social/ This

Observable
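
The total described in that notebook summary is a plain sum over the per-server user counts. A minimal sketch of that sum, assuming instances.json parses to a flat list of objects with a `users` field (the field name matches the `_sort=users` column used elsewhere in this thread; the sample records are made up):

```python
import json


def total_users(instances):
    """Sum the 'users' field across instance records.

    Records with a missing or non-numeric count are skipped
    rather than treated as errors.
    """
    total = 0
    for instance in instances:
        try:
            total += int(instance.get("users"))
        except (TypeError, ValueError):
            continue
    return total


# Made-up sample data standing in for the real instances.json
sample = json.loads(
    '[{"name": "a.example", "users": "120"},'
    ' {"name": "b.example", "users": 30},'
    ' {"name": "c.example", "users": null}]'
)
print(total_users(sample))
```

Because every server self-reports its own number, a sum like this is only as trustworthy as the least honest server in the list.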

Had a couple of complaints that my chart is misleading because it doesn't start the y axis from zero

I don't see that myself - I was VERY careful to make the y axis values as prominent as possible to avoid any potential for confusion there

But since people asked, I've added a from-zero chart to the notebook too

https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time

Unsurprisingly it's not very interesting - it's effectively a horizontal line at ~4.7m!

If you have ideas for better ways to present the data I've collected (I'm certain there's huge room for improvement here) you can fork my notebook on Observable and try them out!

After struggling for a while to figure out the best way to add a "new users per hour" chart I spotted there was an incoming change suggestion implementing exactly that... from Observable/D3 author Mike Bostock!

So that's now available in the notebook too: https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time


Some spam instances just showed up with fake user numbers that completely broke my charts (pushing the number of users from ~4.5m to 80m+)

Issue about that here - I'll fix my pipeline to avoid these in the morning https://github.com/simonw/scrape-instances-social/issues/4

Ignore angelfire glitch instances · Issue #4 · simonw/scrape-instances-social

https://lite.datasette.io/?json=https%3A%2F%2Fraw.githubusercontent.com%2Fsimonw%2Fscrape-instances-social%2Fmain%2Finstances.json#/data/instances?_filter_column=&_filter_op=exact&_filter_v...

GitHub
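
One way a pipeline could guard against instances like these is to drop records before summing. The sketch below is not the fix from the issue: the `angelfire` substring is taken only from the issue title, and the `max_users` threshold and sample records are hypothetical values chosen for illustration.

```python
def is_plausible(instance, max_users=5_000_000, blocked=("angelfire",)):
    """Return True if an instance record looks safe to include in totals.

    max_users and blocked are hypothetical defaults, not values
    taken from the real pipeline.
    """
    name = str(instance.get("name", "")).lower()
    if any(fragment in name for fragment in blocked):
        return False
    try:
        users = int(instance.get("users") or 0)
    except (TypeError, ValueError):
        return False
    return 0 <= users <= max_users


# Made-up records: two ordinary servers and two with absurd counts
sample = [
    {"name": "a.example", "users": 1200},
    {"name": "glitch.angelfire.example", "users": 40_000_000},
    {"name": "huge.example", "users": 60_000_000},
    {"name": "b.example", "users": None},
]
kept = [instance["name"] for instance in sample if is_plausible(instance)]
print(kept)
```

A fixed ceiling is crude, since a legitimately large server could eventually cross it, so any real threshold would need revisiting as the network grows.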

One of the neat things about Git scraping is that everything on GitHub is served with open CORS headers, which means JS apps can load that data even if the original source didn't enable CORS

So here's a Datasette Lite link for exploring the instances.json data from https://instances.social/ as an interactive table!

https://lite.datasette.io/?json=https%3A%2F%2Fraw.githubusercontent.com%2Fsimonw%2Fscrape-instances-social%2Fmain%2Finstances.json#/data/instances?_sort=users&_sort_by_desc=on

Mastodon instances
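
The link above follows a simple pattern: the raw GitHub URL is percent-encoded into the `?json=` parameter, and the table path goes after the `#`. A small sketch of building such a link with the standard library (the helper name `datasette_lite_json_url` is made up for this example):

```python
from urllib.parse import quote


def datasette_lite_json_url(json_url, fragment=""):
    """Build a Datasette Lite link that loads a JSON file as a table.

    json_url is percent-encoded in full (including ':' and '/'),
    and fragment, if given, is appended after '#'.
    """
    url = "https://lite.datasette.io/?json=" + quote(json_url, safe="")
    if fragment:
        url += "#" + fragment
    return url


raw = "https://raw.githubusercontent.com/simonw/scrape-instances-social/main/instances.json"
link = datasette_lite_json_url(
    raw, "/data/instances?_sort=users&_sort_by_desc=on"
)
print(link)
```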

@simon oh shit I am really grokking how you've made datasette even more web than it already was

@anildash Datasette Lite really was mostly meant to be an elaborate joke - running a server-side web app entirely in the browser - but it's fast becoming one of my favourite pieces of the whole ecosystem

Turns out a 12MB loading weight in 2022 (to load in a full copy of Python compiled to WebAssembly) isn't nearly as prohibitive as I had expected!

@simon

Then limiting/blocking instances is not just for proper moderation, but for proper statistics too.

@simon the author has been tweaking what instances to count for a few hours https://github.com/TheKinrar/instances/commits/master . Some obvious things I've had to filter out are duplicates, after normalizing names, and instances that deliberately publish wrong numbers. Many other cases aren't that clear and I still got a sudden jump of about a million users 🤷‍♂️.

Let's hope for things to settle down soon.

GitHub - TheKinrar/instances: Mastodon instances list

Mastodon instances list. Contribute to TheKinrar/instances development by creating an account on GitHub.

GitHub
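
The duplicate filtering mentioned above can be sketched as follows. The normalization rule here (trim whitespace, lowercase, drop a trailing dot) is an assumption for illustration; the thread does not say which rules were actually applied.

```python
def normalize_name(name):
    """Hypothetical normalization: trim, lowercase, drop a trailing dot."""
    return name.strip().lower().rstrip(".")


def dedupe_instances(instances):
    """Keep the first record seen for each normalized instance name."""
    seen = set()
    unique = []
    for instance in instances:
        key = normalize_name(instance["name"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(instance)
    return unique


# Made-up records where the same server appears under two spellings
sample = [
    {"name": "Social.Example", "users": 500},
    {"name": "social.example.", "users": 500},
    {"name": "other.example", "users": 20},
]
print([instance["name"] for instance in dedupe_instances(sample)])
```

Without a step like this, a server listed under two spellings would have its users counted twice in the total.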

@simon ... the data now looks a LOT better. It actually was a long overdue fix for autodiscovery, and now you can find instances like yours or sigmoid.social there, among 10,000 others that were previously missing.

https://mastodon.xyz/@TheKinrar/109381846167480060

TheKinrar (@[email protected])

I pushed a few fixes and improvements to instances.social this night and it is now tracking about six times more instances than it was before (2200 => 12800). Autodiscovery of instances had been broken for some time now, and obviously with all the new users from the last weeks, came many new instances. See the full list on https://instances.social/list/advanced and https://instances.social/list/old (the latter being the "legacy" list, a plain html table, which is quite... heavy for browsers).

Mastodon
@mauforonda @simon I'm downloading the new instances.social json file and I don't see the instances publishing (obviously) wrong numbers. Do you have any idea what changed?
@estebanmoro @mauforonda @simon I found this (see image) when I downloaded the data a few days ago:

@estebanmoro @simon

These were the really problematic ones. They would amount to 60 million new users, but luckily were filtered out about 6 hours ago.

@simon I never thought about the existence of spam instances before (naive, I know 😅 ). Is there such a thing as a community sourced list of spam instances that can be used to blacklist spammers?
@michael there are strong community practices for sharing instances that should be blocked for abusive behavior from what I've seen - instances that fake their user numbers and break scripts that try to measure the size of the ecosystem might be a new (and low priority) issue though

@simon

In case you haven't seen it.

some interesting (and more complex) graphs (from the same source) are available via @mastodonusercount

source code here https://github.com/gallizoltan/usercount

and you'll be happy to know it's in python ;)

GitHub - gallizoltan/usercount: User statistics bot for Mastodon

User statistics bot for Mastodon. Contribute to gallizoltan/usercount development by creating an account on GitHub.

GitHub
@simon when you are trying to measure the change from a known event (or indexing), plotting the change by percentage from that start date can be helpful. You’d just have to pick the date. This is especially good for comparing several values that might have different ranges (stock prices etc). Good explainer: https://www.dallasfed.org/research/basics/indexing.aspx
Federal Reserve Bank of Dallas

As part of the nation's central bank, the Dallas Fed plays an important role in monetary policy, bank supervision and regulation, and the operation of a nationwide payments system.
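
Indexing as suggested here means dividing every value by the value at the chosen start date and multiplying by 100, so each series begins at 100 and later points read as a percentage of the starting level. A rough sketch with made-up numbers (not real Mastodon counts):

```python
def index_to_base(values, base_index=0):
    """Rescale a series so the value at base_index becomes 100."""
    base = values[base_index]
    if base == 0:
        raise ValueError("cannot index against a base value of zero")
    return [100 * value / base for value in values]


# Made-up series with very different magnitudes
users = [4_000_000, 4_400_000, 5_000_000]
statuses = [200, 230, 260]

print(index_to_base(users))     # [100.0, 110.0, 125.0]
print(index_to_base(statuses))  # [100.0, 115.0, 130.0]
```

Once both series start at 100 they can share one axis, which is what makes this useful for comparing values with different ranges.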

@jonkeegan oh that's fantastic advice, thanks! Need to figure out how to do that with Observable Plot
@jonkeegan OK, I had a clumsy go at doing that - I've added an "Experimental charts start here" section to my notebook https://observablehq.com/@simonw/mastodon-users-and-statuses-over-time#cell-230

@simon "misleading" charts often come from a mismatch between what a chart is supposed to illustrate and what it's being used to illustrate. Often the latter is in someone's own head, and there's nothing you can do about it.

I wonder if, in this case, taking the first derivative wouldn't help; explicitly show the rate of climb rather than imply it by showing the raw count.
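
A discrete version of that first derivative is the change in the total between consecutive scrapes, scaled to an hourly rate. This is a sketch under the assumption that the data is a time-sorted list of (timestamp, total) pairs; the sample values are invented, spaced 20 minutes apart to mirror the scrape interval.

```python
from datetime import datetime


def users_per_hour(samples):
    """Return (timestamp, rate) pairs for consecutive samples.

    samples is a list of (datetime, total_users) sorted by time.
    The rate is new users per hour over the preceding interval.
    """
    rates = []
    for (t0, u0), (t1, u1) in zip(samples, samples[1:]):
        seconds = (t1 - t0).total_seconds()
        if seconds <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((t1, (u1 - u0) * 3600 / seconds))
    return rates


# Invented totals, 20 minutes apart
samples = [
    (datetime(2022, 11, 20, 0, 0), 1000),
    (datetime(2022, 11, 20, 0, 20), 1100),
    (datetime(2022, 11, 20, 0, 40), 1150),
]
for timestamp, rate in users_per_hour(samples):
    print(timestamp.isoformat(), rate)
```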

@simon Wait until you give them a logarithmic scale...
@simon would it then be better perhaps to use a logarithmic scale? That should then hopefully give you finer granularity of the interesting data points.
@simon @awsamuel Hey Simon, haven't used your stuff much, pardon if this is a dumb question, my impression is there's no way to look back in history?

@timbray @awsamuel the key point of truth is the commit history of the scraped JSON file itself: https://github.com/simonw/scrape-instances-social/commits/main/instances.json

Since digging through that commit log is a bit clumsy I use my "git-history" CLI tool to turn that history into a SQLite database you can query - that's what's generating the counts.db database you can play with here https://lite.datasette.io/?url=https://scrape-instances-social.s3.amazonaws.com/counts.db#/counts/item_version_detail

Actions · simonw/scrape-instances-social

https://instances.social/instances.json. Contribute to simonw/scrape-instances-social development by creating an account on GitHub.

GitHub
@timbray @awsamuel it's also possible to generate a much larger SQLite database with rows for every individual Mastodon instance and when it was changed - I'm not yet running that in GitHub Actions but you can build that by checking out the repo and running the build-instance-history.sh script

@timbray @awsamuel but I only started scraping data 24 hours ago so I don't have anything captured before then, sadly

@mastodonusercount uses the same data source as I do I believe and has been running for multiple years

@simon @timbray Thanks so much for the quick reply! I fully relate to the problem of capturing data from the past! and also I <3 your data magic. (I do a lot of data work myself but not THIS hardcore!!)

As someone who's been thinking about this data/growth carefully, can I ask if the numbers here pass the nod test for you? I'm wondering if we can trust this estimate of ~400k users pre-Elon. https://absolutelymaybe.plos.org/2022/11/13/mapping-the-mastodon-migration-is-it-a-one-way-trip-or-an-each-way-bet-for-science-twitter/

Mapping the Mastodon Migration: Is It a One-Way Trip or an Each-Way Bet for Science Twitter? - Absolutely Maybe

Like many people, I opened a Mastodon account on the weekend the Musk era began on Twitter. This post picks up from…

Absolutely Maybe
@simon huh. their db doesn’t have your instance or mine in it.
@jesse yeah I noticed that, one of the reasons I'm suspicious as to how trustworthy the overall numbers are

@simon I've seen a popn bot that is suggesting our popn is around 7M3

Will have to find time to have a play and see if I can find the disparity

@simon Probably should stop talking about Mastodon users and talk about Fediverse users or similar.
Does eg @[email protected] count as a Mastodon user? https://tantek.com/2022/301/t1/twittermigration-bridgyfed-mastodon-indieweb
Do other users of https://fed.brid.gy/ count as Mastodon users?
They both should and shouldn’t. They are interoperable, but they do not run Mastodon and Mastodon itself is not a protocol.
Interesting number: Mastodon market share in Fediverse?
#TwitterMigration, first time? Have posted notes to https://tantek.com/ since 2010, POSSEd tweets & #AtomFeed. Added one .htaccess line today, and thanks to #BridgyFed, #Mastodon users can follow my #IndieWeb site @[email protected] No Mastodon install or account needed. Just one line in .htaccess: RewriteRule ^.well-known/(host-meta|webfinger).* https://fed.brid.gy/$0 [redirect=302,last] is enough for Mastodon users to search for and follow that @[email protected] username. Took a little more work to setup Bridgy Fed to push new posts to followers. Note by the way both the redundancy & awkwardness (it’s not a clickable URL) of such @-@ (AT-AT) usernames when you’re already using your own domain. Why can’t Mastodon follow a username of “@tantek.com”? Or just “tantek.com”? And either way expanding it internally if need be to the AT-AT syntax. Why this regression from what we had with classic feed readers where a domain was enough to discover & follow a feed? Also, why does following show a blank result? Contrast that with classic feed readers which immediately show you the most recent items in a feed you subscribed to. Lastly (for now), I asked around and no one knew of a simple public way to “preview” or “validate” that @[email protected] actually “worked”. You have to be *logged-in* to a Mastodon instance and search for a username to check to see if it works. Contrast that with https://validator.w3.org/feed/ which you can use without any log-in to validate your classic feed file. Why these regressions from the days of feed readers? - Tantek

@voxpelli @tantek.com I'd love to gather those numbers but I honestly have no idea how one could do that - Mastodon does at least provide an API that returns a count of users - https://fedi.simonwillison.net/api/v1/instance - but what does the concept of a "user" even mean for other ActivityPub implementations? Number of unique inboxes perhaps?
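
For reference, a sketch of reading that count from a Mastodon server. It assumes the v1 instance response carries the number under `stats.user_count`, which is how Mastodon's v1 API has reported it; other ActivityPub software may not expose anything comparable. The sample payload is invented, and the network helper is defined but not called here.

```python
import json
from urllib.request import urlopen


def user_count_from_instance(payload):
    """Extract the user count from a parsed /api/v1/instance response."""
    return int(payload["stats"]["user_count"])


def fetch_user_count(base_url, timeout=10):
    """Fetch /api/v1/instance from a server and return its user count."""
    url = base_url.rstrip("/") + "/api/v1/instance"
    with urlopen(url, timeout=timeout) as response:
        return user_count_from_instance(json.load(response))


# Invented payload showing only the fields this sketch relies on
sample_payload = {
    "uri": "single-user.example",
    "stats": {"user_count": 1, "status_count": 42, "domain_count": 7},
}
print(user_count_from_instance(sample_payload))
```

Calling `fetch_user_count("https://fedi.simonwillison.net")` would query the server mentioned above, assuming it is reachable and still serves the v1 endpoint.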

@[email protected] @simon Yeah, probably have to fall back to blogosphere type of measures

There was some work in the IndieWeb on that I believe, eg: https://indieweb.org/indie-stats

Python script here: https://github.com/bear/indie-stats

And I guess @[email protected] and others from the Technorati era have an idea or two, likewise with @bradfitz who made the Google Social Graph API back in the day I believe

indie-stats - IndieWeb

@simon might be nice to put in a spinner loop to automatically update.

@simon

> Call it a TIL—that way you’re not promising anyone a revelation or an in-depth tutorial. You’re saying “I just figured this out: here are my notes, you may find them useful too”.

I recently resurrected my personal site/blog and have been roughly following this mindset and it's helped tremendously.

@simon

You are a champion for writing about it! I can barely write a tweet/toot let alone a blog post which is why mine is gathering dust 😆

@simon I'm really terrible at 'turning it into a blog post'. Something I need to work better on.
@simon very powerful bag of tricks!

@simon I have a question! ✋

When you set up something like this, do you anticipate it running forever? And if so, do you set up something else to check if it breaks, or for you to remember to check in on it?

@mala @kellan I've finally made peace with the fact that no online project is forever

But... I do try and pick tooling that's either free or virtually free and that's likely to keep going for a long time - hence why I'm using GitHub Actions, a tiny S3 bucket and an Observable notebook for this

I expect it will keep running without modifications for a LONG time

Weakest link with this one is probably if instances.social breaks or changes that JSON feed

Interesting! I didn't know about the "instances.json", and I see that that might be useful info for new users to pick an instance.

So, a few tweaks to https://github.com/comet-ml/kangas and now, with no-code:

pip install "kangas>=1.2.3"
kangas server https://instances.social/instances.json

Hints: sort on https_score, and filter on users.

#python #opensource

@simon

GitHub - comet-ml/kangas: 🦘 Explore multimedia datasets at scale

🦘 Explore multimedia datasets at scale. Contribute to comet-ml/kangas development by creating an account on GitHub.

GitHub