Mastodawn

A #Fediverse tech idea I've been considering for a while.

Hashtags can sometimes be annoying, right? Their primary application is finding posts related to a certain topic, including following one. Therefore, if I look for "horses", I want to find everything horse-related.

The following hashtags should therefore lead to the same set of posts:
- #horse
- #horses
- #equines
- #equine
- #horsetodon

Right? But that's not the reality. Some people tag with just one of them, some with several, and some with none at all.

Similarly, #Döner, #Doener and perhaps even #Doner should lead to the same set of posts, right? And what about British-American spelling splits, where #Localization is also #Localisation, and perhaps even #l10n? And #LGBT, #LGBTQ, #LBGTQ+, #LGBT+, #LGBTQIA and so on really should be one hashtag as well.
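
For the pure spelling variants (#Döner, #Doener, #Doner), a comparison key can be computed without any synonym list. The following is only a minimal sketch: the class and method names are made up for illustration, and collapsing "ae"/"oe"/"ue" is an assumption that suits German but is lossy elsewhere (it would also turn "blue" into "blu"). It does nothing for #Localization vs. #Localisation or #l10n, which need an explicit synonym table.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Mastodon or any Fediverse software.
final class HashtagFolding {
    private HashtagFolding() {}

    /** Folds a hashtag into a key so that spelling variants compare equal. */
    static String foldKey(String tag) {
        String s = tag.startsWith("#") ? tag.substring(1) : tag;
        s = s.toLowerCase(Locale.ROOT);
        // Assumption: treat the digraphs ae/oe/ue like their umlaut spellings.
        s = s.replace("ae", "ä").replace("oe", "ö").replace("ue", "ü");
        // Decompose accented letters (NFD) and drop the combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        return s;
    }
}
```

With this sketch, `HashtagFolding.foldKey("#Döner")`, `foldKey("#Doener")` and `foldKey("#Doner")` all produce the same key, so an index keyed on the folded form would serve all three from one feed.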

So what to do?

I propose:
We should introduce a second symbol for something like "fuzzy hashtags". Not the '#' symbol, but another; perhaps '&' or '~'.

That way I could tag my post '~horse', and it could appear on all hashtag feeds concerning the different spellings of horses! Or I could search for such a fuzzy hashtag.

The different variants could be crowdsourced, or overridden by instance mods. You could also opt out of that system, most likely by keeping your posts from showing up in fuzzy searches.
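
As a rough illustration of how such a lookup might behave, here is a minimal sketch. Everything in it is an assumption made for the example: the `FuzzyTags` class, the `~` prefix handling, and the reading of "opt out" as "the post still matches its literal tags, but not their variants". Overlapping groups are not merged; a later `addGroup` call simply wins.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a fuzzy-hashtag lookup; not an existing API.
final class FuzzyTags {
    // Maps every known variant to the full group of variants it belongs to.
    private final Map<String, Set<String>> groupByVariant = new HashMap<>();

    private static String normalize(String tag) {
        // Strip a leading '~' or '#' and ignore case.
        return tag.replaceFirst("^[~#]", "").toLowerCase(Locale.ROOT);
    }

    /** Registers a group of variants (crowdsourced or set by instance mods). */
    void addGroup(String... variants) {
        Set<String> group = new TreeSet<>();
        for (String v : variants) group.add(normalize(v));
        for (String v : group) groupByVariant.put(v, group);
    }

    /** Returns all variants for a fuzzy tag, or just the tag itself if unknown. */
    Set<String> expand(String fuzzyTag) {
        String key = normalize(fuzzyTag);
        return groupByVariant.getOrDefault(key, Set.of(key));
    }

    /** Decides whether a post with the given tags shows up for a fuzzy tag. */
    boolean matches(Set<String> postTags, boolean authorOptedOut, String fuzzyTag) {
        // Opted-out authors are only found by the literal tag.
        Set<String> wanted = authorOptedOut ? Set.of(normalize(fuzzyTag)) : expand(fuzzyTag);
        for (String t : postTags) {
            if (wanted.contains(normalize(t))) return true;
        }
        return false;
    }
}
```

After `addGroup("horse", "horses", "equine", "equines", "horsetodon")`, a search for `~horse` would find a post tagged only #equines, unless its author opted out.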

This would also fix languages like German, where you have many inflected forms: #Männer #Mann #Mannes #Manns #Männern #Männers and so on should all lead to the same result.

Behold what I have to do to sign off on this post:
#MarkupLanguages #MarkupLanguage #Markup

Essojadojef Sep 24

@lianna That's smart. I like the idea of crowdsourced hashtag synonyms. They could be global, per instance and per user (with each level adding to or removing from the previous) and with the option to turn them off altogether for the user.
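
One possible reading of the global / per-instance / per-user layering, sketched in code. The `Layer` type and the `resolve` method are hypothetical names, and applying each level's additions before its removals is an assumption; the exact tag is always kept so that a layer cannot remove the tag the user actually asked for.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of layered synonym sets; not an existing API.
final class LayeredSynonyms {
    /** The changes one level (global, instance or user) makes for a single tag. */
    static final class Layer {
        final Set<String> add;
        final Set<String> remove;

        Layer(Set<String> add, Set<String> remove) {
            this.add = add;
            this.remove = remove;
        }
    }

    /**
     * Applies the layers in order (global first, user last). If the user has
     * turned synonyms off, only the exact tag is returned.
     */
    static Set<String> resolve(String tag, boolean enabled, List<Layer> layers) {
        Set<String> result = new TreeSet<>();
        result.add(tag);
        if (!enabled) return result;
        for (Layer layer : layers) {
            result.addAll(layer.add);
            result.removeAll(layer.remove);
        }
        result.add(tag); // the exact tag always matches
        return result;
    }
}
```

For example, if the global layer adds "horses", "equine" and "horsetodon", the instance adds "equines" but removes "horsetodon", and the user removes "equine", then resolving "horse" yields horse, horses and equines.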

I don't think a second symbol is useful tho. When clicking on a hashtag, one could see two tabs, one that includes synonyms and one with exact matches only. Or, when using the search box, one could use quotes like in search engines for exact matches of the hashtags.
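
The quoted-search variant could be as simple as the following sketch. `TagQuery` is a hypothetical name, and the rule "surrounding double quotes mean exact match" is taken from the suggestion above rather than from any existing search syntax.

```java
import java.util.Locale;

// Hypothetical sketch of parsing a hashtag search box entry.
final class TagQuery {
    final String tag;     // normalized tag without '#'
    final boolean exact;  // true if the user asked for exact matches only

    private TagQuery(String tag, boolean exact) {
        this.tag = tag;
        this.exact = exact;
    }

    /** Parses input such as  #Horse  (with synonyms) or  "#Horse"  (exact only). */
    static TagQuery parse(String input) {
        String s = input.trim();
        boolean quoted = s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
        if (quoted) s = s.substring(1, s.length() - 1).trim();
        if (s.startsWith("#")) s = s.substring(1);
        return new TagQuery(s.toLowerCase(Locale.ROOT), quoted);
    }
}
```

A search backend would then expand `tag` through the synonym table only when `exact` is false.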

🎸 POSS 🏁 Sep 24

@lianna maybe a text classification library like fastText could do this automatically.

fastText processes around 80 million words/second on my dual cpu cascade lake setup (xeon silver 4214)

🎸 POSS 🏁 Sep 24

@lianna the "dumb" way is to create a character bigram vector for every combination of keywords but this only helps if they are written similarly enough since you'll be left with a character string similarity value between 0 and 100%.
Put the keyword + the score into a map where keyword is the key and the similarity is the value.
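You can then cluster the keywords by taking the cosine similarity of their vectors, i.e. the dot product divided by the product of the vector lengths. For every shared key:
dot += simA * simB; normA += simA * simA; normB += simB * simB
result: dot / (Math.sqrt(normA) * Math.sqrt(normB))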

This way you can build a map of similar keywords to the keyword at hand.

This is incredibly inefficient compared to just using an indexed, dictionary-based model.
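
A minimal, self-contained sketch of the character-bigram comparison described above. It is not the code from the linked demo; the class and method names are invented for illustration, and it compares two keywords directly by the cosine similarity of their bigram count vectors. It works on Java `char` units, so characters outside the Basic Multilingual Plane are not handled properly.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: cosine similarity over character-bigram counts.
final class BigramSimilarity {
    private BigramSimilarity() {}

    /** Counts every pair of adjacent characters in the lower-cased word. */
    static Map<String, Integer> bigrams(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 <= w.length(); i++) {
            counts.merge(w.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns a value between 0.0 (no shared bigrams) and 1.0 (same bigram counts). */
    static double cosine(String a, String b) {
        Map<String, Integer> va = bigrams(a);
        Map<String, Integer> vb = bigrams(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : va.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = vb.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int c : vb.values()) normB += c * c;
        // Words shorter than two characters have no bigrams at all.
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The score only reflects spelling: "horse" and "horses" come out close to 1, while "horse" and "equine" share no bigram and score 0, so this approach cannot find true synonyms on its own.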

🎸 POSS 🏁 Sep 24

@lianna This might look something like this:
git.161.sh/Crimetoys/KeywordClusterDemo/src/commit/7b7e520a5dc7c90b629c6922a685e33c433fa5db/KeywordClusterDemo.java

lianna Sep 24

@Crimekillz@⚧.fm wouldn't that just match "word" with "wood" and "ford" too?

pawr Sep 24

@lianna thanx for pointing out this problem, it's been worrying me as well 👍

Marc Moskowitz Sep 30

@lianna The fanfiction archive AO3 has a much more complex (and human-intensive) system of canonical and synonymous tags to solve this problem. But AO3 is very centralized and has different needs.