Excited to announce that I will be at #fediforum today speed demo-ing my latest project: an ActivityPub data observatory!

This observatory does not collect any user data or metadata. Instead I am looking at the *shape* (aka schema) of data being sent around the fediverse. This will let software devs ask questions like "How is a Mastodon 4.2.0 image post formatted differently from a Misskey 2024.7.0 image post?"

And we'll get real answers based on data rather than on poor documentation.
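As a rough illustration of the kind of question this enables, here is a minimal sketch of diffing two recorded shapes. It assumes shapes are stored as nested dictionaries whose values are type placeholders such as "<uri>"; the function names and both example shapes are invented for illustration and do not describe any real software's output.

```python
def flatten(shape, prefix=""):
    """Yield (path, placeholder) pairs for a nested shape."""
    if isinstance(shape, dict):
        for key, value in shape.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(shape, list):
        for item in shape:
            yield from flatten(item, prefix + "[]")
    else:
        yield prefix, shape


def diff_shapes(a, b):
    """Return the (path, placeholder) pairs found only in a and only in b."""
    left, right = set(flatten(a)), set(flatten(b))
    return sorted(left - right), sorted(right - left)


# Hypothetical shapes for an image post from two unnamed servers.
shape_a = {"type": "<string>", "attachment": [{"url": "<uri>", "blurhash": "<string>"}]}
shape_b = {"type": "<string>", "attachment": [{"url": "<uri>", "name": "<string>"}]}

only_a, only_b = diff_shapes(shape_a, shape_b)
print(only_a)
print(only_b)
```

Comparing flattened paths rather than whole documents keeps the report readable: each line names one field that one side sends and the other does not.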

Darius Kazemi Sep 14, 2024

I won't be actually LAUNCHING this tool until I've found out how you all would feel about it being opt-out vs opt-in. I will provide a longer blog post for you all to read with details, but in short:

It would be really helpful for general interop on the fedi if this were opt-out. But if people are generally freaked out by having technical details about software data formats being opt-out... I'll make it opt-in.

Quick explanation of the data scrubbing in the attached images

coreyja Sep 14, 2024

@darius not a fediverse dev (yet I guess lol)

But I’d definitely vote for this to be opt out, cause it sounds really useful for if I ever did want to do fediverse dev, which I’ve definitely considered!

blaine Sep 14, 2024

@darius nice! The only folks who *I* could imagine insisting on this being opt-in are Oracle's legal team, and they were told in no uncertain terms that this sort of data isn't even *eligible* for opt-out, even in the US of A. 😅

Darius Kazemi Sep 14, 2024

@blaine every morning I ask myself: "Am I going to do something today that Oracle's legal team won't like?" and if the answer is no I have already failed

Jamie Booth Sep 15, 2024

@darius
@blaine

I'm reasonably certain *thinking* the name "Oracle" probably qualifies as something their legal team objects to. 😁

(At least without a PO attached)

William Pietri Sep 14, 2024

@darius I'd say it's fine since it's not collecting user data. However, given how much sensitivity jerks have caused here, I'd suggest an explanation page that uses some of your own posts as examples, with detailed explanations. And for usability/accessibility reasons, it should be in text, and with much higher contrast. Machine representations look forbidding to non-technical people anyhow, but especially so when dark and hard to read.

Darius Kazemi Sep 14, 2024

@williampietri Yes, sorry, this is something I whipped up in a few minutes for a microblog post and is not going to be what my macroblog post looks like

Emelia 👸🏻 Sep 14, 2024

@darius oooh, I see, <uri> isn't a placeholder for an actual value, it's just an indicator of the value type

Darius Kazemi Sep 14, 2024

@thisismissem yes I am just storing the inferred type! I use https://github.com/triggerdotdev/schema-infer

GitHub - triggerdotdev/schema-infer: Infers JSON Schemas and Type Definitions from example JSON

Emelia 👸🏻 Sep 14, 2024

@darius I think capabilities comes from either pixelfed or gotosocial?

In #Flancia we'll meet Sep 14, 2024

@darius very nice, thanks for checking but to me it's super clear this is fine to scrape by default/be opt-out.

Erin 💽✨ Sep 14, 2024

@darius it would probably be very useful to also run the data through a JSON-LD processor and flag e.g. URIs being serialised as strings, undefined properties, etc
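A very rough sketch of the two flags mentioned here, for illustration only. It does not run a real JSON-LD processor: `TERMS` is a small hand-written stand-in for an expanded @context (loosely modelled on a few ActivityStreams terms), it only inspects top-level properties, and the function names and example activity are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: a hand-written stand-in for term definitions from a context.
# True  -> the term's value is an IRI (declared with "@type": "@id",
#          or a keyword alias such as id/type)
# False -> the term's value is a plain literal
TERMS = {
    "id": True,
    "type": True,
    "actor": True,
    "url": True,
    "content": False,
    "summary": False,
}


def looks_like_uri(text):
    """True for strings that parse as http(s) URLs with a host."""
    parsed = urlparse(text)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def flag_issues(activity, terms=TERMS):
    """Return (property, issue) pairs for top-level properties only."""
    issues = []
    for key, value in activity.items():
        if key.startswith("@"):
            continue  # JSON-LD keywords such as @context
        if key not in terms:
            issues.append((key, "undefined property"))
        elif not terms[key] and isinstance(value, str) and looks_like_uri(value):
            issues.append((key, "URI serialised as plain string"))
    return issues


example = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "id": "https://example.social/notes/1",
    "type": "Note",
    "summary": "https://example.social/about",
    "exampleExtension": 1,
}
print(flag_issues(example))
```

A real implementation would resolve the activity's own @context instead of a fixed table, since extension terms differ from server to server.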

Darius Kazemi Sep 14, 2024

@erincandescent yeah! any pointers to things that can help me infer more stuff would be great.

Emelia 👸🏻 Sep 14, 2024

@darius that's why there's the new security context in Mastodon: https://github.com/mastodon/mastodon/pull/31871

I think that's a backport candidate, as the context was used but not actually present in the @context's object

Fix security context sometimes not being added in LD-Signed activities by ClearlyClaire · Pull Request #31871 · mastodon/mastodon

Contexts included in activities are so far decided at serialization-time, but Linked-Data signature occurs after serialization time. In some cases, signed activities were missing the https://w3id.o...

Emelia 👸🏻 Sep 14, 2024

@darius given it's just collecting the unique shapes of data, I think it's perfectly fine to be opt-out, since there's no user data at all. (assuming you never store the raw data anywhere)

Darius Kazemi Sep 14, 2024

@thisismissem correct, I just scrub the data and throw it away

Emelia 👸🏻 Sep 14, 2024

@darius yeah, I think that with a clear privacy policy showing that you're not storing any personal information and are working only through relays, it should be fine for this to be opt-out

paramilitary organizer Sep 14, 2024

@darius Hm. Am I reading it right you would be logging that person x made a post with URL y on date z?
that might interfere with some people's want to not have their posts seen off fedi; that info could be used against someone even if they delete it later. "why're you posting while on the clock" fer a basic example.
the "in reply to" field as well might expose the shape of who you talk to in a concerning way

edit: it's clear that I don't get it, but I will try again after coffee

Darius Kazemi Sep 14, 2024

@t54r4n1 no, I am logging that "some person somewhere (but I don't know who or where, because I threw away that data) made a post with a 'URL' field that contains some kind of URL in it (but I don't know what, because I threw away that data)"

I'm not even logging the time something was posted! Just "there is a time field in this and it contains a time but I don't know what time"

Darius Kazemi Sep 14, 2024

@t54r4n1 like the second screenshot is the literal data I am recording, so like I am recording the word "<date-time>" instead of an actual date and time
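The scrubbing described above can be sketched roughly as follows. This is not the observatory's actual code (which, per the thread, uses schema-infer); it is a minimal stand-in that assumes a parsed JSON activity, keeps property names, and replaces every value with a type placeholder. The function names, the set of placeholders other than "<uri>" and "<date-time>", and the example activity are all assumptions.

```python
from datetime import datetime
from urllib.parse import urlparse


def _looks_like_datetime(text):
    """True for ISO 8601 timestamps such as 2024-09-14T12:00:00Z."""
    if "T" not in text:
        return False
    try:
        datetime.fromisoformat(text.replace("Z", "+00:00"))
    except ValueError:
        return False
    return True


def _looks_like_uri(text):
    parsed = urlparse(text)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def infer_placeholder(value):
    """Map a scalar to a type placeholder; the value itself is discarded."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "<boolean>"
    if isinstance(value, (int, float)):
        return "<number>"
    if value is None:
        return "<null>"
    if isinstance(value, str):
        if _looks_like_datetime(value):
            return "<date-time>"
        if _looks_like_uri(value):
            return "<uri>"
        return "<string>"
    return "<unknown>"


def scrub(value):
    """Recursively replace values with placeholders, keeping only structure."""
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        unique = []
        for item in value:
            shape = scrub(item)
            if shape not in unique:  # keep one copy of each distinct item shape
                unique.append(shape)
        return unique
    return infer_placeholder(value)


activity = {
    "type": "Create",
    "published": "2024-09-14T12:00:00Z",
    "object": {
        "id": "https://example.social/notes/1",
        "content": "hello",
        "sensitive": False,
        "to": ["https://a.example/x", "https://b.example/y"],
    },
}
print(scrub(activity))
```

Note that lists collapse to their distinct item shapes, so the number of recipients is not recorded either.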

☭🇧🇷🇺🇳🇵🇸 Sep 14, 2024

@darius I imagine that it will be #FreeSoftware and #OpenData so IMHO I can't see why this could not be opt-out.

Jeremy Bornstein Sep 19, 2024

This looks very cool, thank you!

As an implementor, one of the additional things I'm curious about is the commonalities (or lack thereof) among the structures of various URIs. Would you be open to, for example, analyzing common prefixes in a single activity, to notice whether the actor ID is or is not present as a portion of (say) the followers collection?
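One way this question could be posed without keeping any URIs is to record only a yes/no answer per field. The sketch below is a guess at what such a check might look like, not something proposed in the thread; the function name and the example actor are hypothetical.

```python
def prefix_report(obj, anchor="id"):
    """For each http(s)-valued field, record whether it sits under the anchor URI.

    Only booleans are returned, so no URI needs to be stored.
    """
    base = obj[anchor].rstrip("/")
    report = {}
    for key, value in obj.items():
        if key == anchor or not isinstance(value, str):
            continue
        if value.startswith(("http://", "https://")):
            report[key] = value.startswith(base + "/")
    return report


# Hypothetical actor document.
actor = {
    "id": "https://example.social/users/alice",
    "type": "Person",
    "followers": "https://example.social/users/alice/followers",
    "inbox": "https://example.social/inbox",
}
print(prefix_report(actor))
```

Here the followers collection is nested under the actor ID while the inbox is not, which is exactly the sort of structural difference an implementor might want to see aggregated across software.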

Tim Chambers Sep 14, 2024

@darius Very cool, Darius!

@darius Oh this is too cool!

Evan Prodromou Sep 14, 2024

@darius can you compare to browser.pub?

Darius Kazemi Sep 14, 2024

@evan yeah. On browser.pub I can say "hey help me take a look at these particular messages I know about". This observatory will surface information about stuff floating around the fedi that I don't even know about. For example I am already learning about server software I've never even heard of, and I would not have put that into browser.pub because I wouldn't have known it existed

Evan Prodromou Sep 14, 2024

@darius Ah, OK, interesting. Where does your network tap plug in?

Darius Kazemi Sep 14, 2024

@evan still figuring it out. Right now I am subscribing to a public relay as that is the most software-neutral source I could think of, but I am looking at other ingestion methods too. Importantly I want to ingest AP only... I'm not going to hit proprietary API endpoints like most scrapers do

Evan Prodromou Sep 14, 2024

@darius barf, no

Tim Chambers Sep 14, 2024

@darius Are you tracking #goblin? https://indieweb.social/@goblin@goblin.band

Darius Kazemi Sep 14, 2024

@tchambers never heard of it!

Tim Chambers Sep 14, 2024

@darius Brand new Tumblr-like service being built by @javi

Jenniferplusplus Sep 14, 2024

@darius how do you collect it? Do you just follow a bunch of actors on different software?

Darius Kazemi Sep 14, 2024

@jenniferplusplus right now I am subscribing to a public relay. I will probably turn this thing INTO a relay so that if server admins want to donate their data they can consider this a relay that just... doesn't send their data anywhere and instead records its shape

Jenniferplusplus Sep 14, 2024

@darius that seems like a kind of limited sample. It limits you to public messages that even can be sent to a relay. You'll never know what non-public messages look like, or messages from software that doesn't support relays.

Darius Kazemi Sep 14, 2024

@jenniferplusplus yes. I have to limit my sample if I'm not going to be a surveillance entity. It's a trade-off I'm willing to accept

django Sep 14, 2024

@darius love this, it might be even more useful than a test suite!

Darius Kazemi Sep 14, 2024

@django I think this will be really helpful for people writing tests for a test suite! Like it's one thing to write a test suite that tests conformance with a standard, it's another thing to write a test suite that tests conformance with actual software out in the wild

☭🇧🇷🇺🇳🇵🇸 Sep 14, 2024

@darius @mar amazing project!

groxx Sep 14, 2024

@darius sounds great and useful, thank you!

Marco Rogers Sep 14, 2024

@darius this looks cool.