Just gonna throw this out here - any Ruby gem for relatively quick & accurate language detection that you know of?

I didn't expect anyone to think I am asking for a General AI.

I just want a semi-accurate heuristic to spare users from having to tag the language manually.

Proprietary APIs like Google Translate are out of the question. I'm already using WhatLanguage, and it's not working well. If I can't find anything, welp, there's that. Not a big deal.

@Gargron Wait, so Mastodon is guessing our languages at random? That creates all sorts of problems for us CJK users. (See Han Unification.) We should at least have an option to manually pick the correct language for the text.
@hatsuki It's guessing, but there are no filters yet.
@Gargron Oops, now I see that none of the text has a lang attribute. This is a bug: CJK languages will not display correctly without lang attributes, due to Unicode Han Unification. I know this will clutter the UI, but I strongly recommend looking into this before Mastodon gets critical mass among Chinese-speaking communities. After that, it will be a very hard problem to solve...
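
To make the lang attribute point concrete, here is a minimal Ruby sketch of attaching one to rendered text. The helper name `tag_with_lang` and the choice of a `<p>` wrapper are assumptions for illustration only, not how Mastodon actually renders toots.

```ruby
require "erb"

# Hypothetical helper (not part of Mastodon): wraps text in a paragraph
# that carries a lang attribute, so the browser can choose glyphs that
# match the language. Both the text and the language tag are HTML-escaped.
def tag_with_lang(text, lang)
  escaped_text = ERB::Util.html_escape(text)
  escaped_lang = ERB::Util.html_escape(lang)
  %(<p lang="#{escaped_lang}">#{escaped_text}</p>)
end

puts tag_with_lang("a < b", "zh-Hant")
```

The language tag would have to come from somewhere, either a manual picker or a detector, which is the open question in this thread.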

@hatsuki @Gargron seems to me if it contains Unified Han and kana you call it "Japanese"; if it contains hangul with or without Unified Han you call it "Korean"; and if it contains Unified Han but no significant amount of those other things, you call it "Chinese." Done, and close enough. Telling Swedish from Norwegian is much harder.

As for needing a "lang" tag to choose fonts, nearly all the Web has that problem anyway.
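
The script-based heuristic described in this post could be sketched in Ruby roughly as follows. The function name `guess_cjk_language`, the `threshold` parameter (standing in for "no significant amount"), and the `:ja`/`:ko`/`:zh` return values are all assumptions for illustration. As a simplification, kana on its own is also treated as Japanese here.

```ruby
# Hypothetical helper: guesses :ja, :ko, :zh, or nil from the scripts used.
# `threshold` is an assumed cutoff for what counts as a "significant amount"
# of kana or hangul relative to all CJK characters in the text.
def guess_cjk_language(text, threshold: 0.1)
  han    = text.scan(/\p{Han}/).size
  kana   = text.scan(/[\p{Hiragana}\p{Katakana}]/).size
  hangul = text.scan(/\p{Hangul}/).size
  total  = han + kana + hangul
  return nil if total.zero? # no CJK characters at all

  # A significant share of kana: call it Japanese.
  return :ja if kana.to_f / total >= threshold && kana >= hangul
  # A significant share of hangul, with or without Han: call it Korean.
  return :ko if hangul.to_f / total >= threshold
  # Han with no significant amount of kana or hangul: call it Chinese.
  return :zh if han > 0

  nil
end

p guess_cjk_language("日本語のテキストです")
p guess_cjk_language("한국어 텍스트")
p guess_cjk_language("這是中文")
```

This only narrows things down to a language family guess; it cannot tell Traditional from Simplified Chinese, and a mixed-language post still gets a single answer.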

@mattskala @Gargron It's broken everywhere, so don't fix it. Sounds strange enough, but I am not the one doing the work, so I have nothing to complain about.

On detecting languages though, this assumes single-language posts. At the very least, it should be clear to the user which lang tag will be added when tooting.

@hatsuki @Gargron Well, Twitter seems to survive without good language detection. It *is* a little weird that "Haitian Creole" seems to be their default for unrecognizable stuff.
@mattskala @Gargron Unicode is a sad story.
@hatsuki @gargron @mattskala Han Unification plus the lack of language tags in Unicode is one reason why I aim to support ShiftJIS as well.