Just gonna throw this out here - any Ruby gem for relatively quick & accurate language detection that you know of?

I didn't expect anyone to think I am asking for a General AI.

I just want semi-accurate heuristic to spare users having to tag language manually.

Proprietary APIs like Google Translate are out of the question. Already using WhatLanguage, it's not working well. If I can't find anything, welp, there's that. Not a big deal.

@Gargron Wait, so Mastodon is guessing our languages at random? That creates all sorts of problems for us CJK users. (See Han Unification.) We should at least have an option to manually pick the correct language for the text.
@hatsuki It's guessing, but there are no filters yet.
@Gargron Oops, now I see that none of the text has a lang attribute. This is a bug: CJK languages will not display correctly without lang attributes, due to Unicode Han Unification. I know this will clutter the UI, but I strongly recommend looking into this before Mastodon reaches critical mass among Chinese-speaking communities; after that it will be a very hard problem to solve...

@hatsuki @Gargron seems to me if it contains Unified Han and kana you call it "Japanese"; if it contains hangul with or without Unified Han you call it "Korean"; and if it contains Unified Han but no significant amount of those other things, you call it "Chinese." Done, and close enough. Telling Swedish from Norwegian is much harder.

As for needing a "lang" tag to choose fonts, nearly all the Web has that problem anyway.
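The script-based heuristic above is easy to sketch in Ruby, since its regex engine understands Unicode script properties. This is just an illustration of that rule of thumb, not anything from Mastodon itself; the function name and return values are made up, and real text would want thresholds rather than a single matching character:

```ruby
# Toy version of the heuristic: kana implies Japanese, hangul implies
# Korean, and Han characters alone default to Chinese.
def guess_cjk_language(text)
  return :japanese if text =~ /[\p{Hiragana}\p{Katakana}]/
  return :korean   if text =~ /\p{Hangul}/
  return :chinese  if text =~ /\p{Han}/
  nil # no CJK scripts found; fall back to some other detector
end

guess_cjk_language("これはペンです")  # kana present, so :japanese
guess_cjk_language("안녕하세요")      # hangul, so :korean
guess_cjk_language("你好世界")        # Han only, so :chinese
```

As noted, this deliberately punts on distinguishing the Chinese variants, and a stray kana character in otherwise-Chinese text would misfire, but it is "close enough" in the sense described.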

@mattskala @Gargron "It's broken everywhere, so don't fix it." Sounds strange, but I am not the one doing the work, so I have nothing to complain about.

On detecting languages though, this assumes single-language posts. At the very least, it should be clear to the user which lang tag will be added when tooting.

@hatsuki @Gargron Well, Twitter seems to survive without good language detection. It *is* a little weird that "Haitian Creole" seems to be their default for unrecognizable stuff.
@mattskala @Gargron Unicode is a sad story.
@hatsuki @gargron @mattskala Han Unification + the lack of language tags in Unicode is a reason why I aim to support Shift JIS as well
@Gargron if it's not, it's something I've been meaning to work on for some time. Wrote a python port ages ago, should modernise.
@Gargron only thing that comes to mind is Python's Natural Language Toolkit. There's probably something similar for Ruby.
@Gargron https://github.com/jmhodges/rchardet is kind of old but it is not so old that the algorithm should be broken. Uses the same heuristics as web browsers: perhaps worth a shot.
@Gargron omg nevermind, I'm a dingus: encoding != language, nothing to see here
@Gargron there's an opportunity here to get the Mastodon users to train something custom that can be open source.
@Gargron you can train a system, the problem is there are too many languages
@Gargron if they are tagging already, you can use that data to train the system
@Gargron you don't need anything like AI. https://github.com/BYVoid/uchardet There should be gems for it too
@Gargron Have you looked at the implementation of WhatLanguage? If this #Python library looks like it is sufficiently different/better, we could port it to #Ruby: https://github.com/Mimino666/langdetect

@Gargron I tried out this library on a couple Japanese toots from the federated timeline and it detected they were Japanese every time.

The algorithm used looks pretty solid as well. Would you use a #Ruby port?

@Gargron you do not need something like AI; statistics can do it. I forget the one I used, but it was something similar to https://github.com/peterc/whatlanguage
@Gargron cld2 is accurate given enough text, which for a toot will sometimes work and sometimes not

@Gargron #NLProc'er here. Have you taken a look at https://github.com/diasks2/ruby-nlp?

Most NLP stuff isn't in Ruby (Python, Java, etc. are more prevalent).

@Gargron Look at N-gram based language detection libraries. There are many available, eg. https://github.com/optimaize/language-detector
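To make the n-gram idea concrete, here is a toy character-trigram matcher in Ruby. Everything here is invented for illustration (the tiny training samples, the scoring, the method names); real detectors like the library linked above use large per-language profiles, rank statistics, and smoothing:

```ruby
# Break text into overlapping character trigrams, e.g. "dog" in "the dog".
def trigrams(text)
  chars = text.downcase.gsub(/\s+/, " ").chars
  (0..chars.size - 3).map { |i| chars[i, 3].join }
end

# Toy "training data": one short sample per language.
SAMPLES = {
  english: "the quick brown fox jumps over the lazy dog and then the dog",
  german:  "der schnelle braune fuchs springt über den faulen hund und dann"
}

# Each profile counts how often each trigram appears in the sample.
PROFILES = SAMPLES.transform_values { |t| trigrams(t).tally }

# Score the input against each profile and pick the best overlap.
def guess(text)
  grams = trigrams(text)
  PROFILES.max_by { |_, profile| grams.sum { |g| profile.fetch(g, 0) } }.first
end

guess("the dog jumps")     # overlaps the English trigrams
guess("der hund springt")  # overlaps the German trigrams
```

This also shows why such detectors degrade on short, Twitter-esque text: a toot may contain only a handful of trigrams, so there is little signal to separate similar languages.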
@Gargron it seems like something a neural network might be good for! you'd probably need to train it for a long time though
@Gargron I'm actually planning on running a project (self-funded) to attempt to detect dominant languages on public Mastodon instances, as part of Radar.

@wogan @Gargron I have noticed a few people claiming they are learning Esperanto, but I do not see anyone actually writing in it here.

Social media actually does reflect the real world haha

@Gargron "A semi-accurate heuristic to spare users having to tag language manually"... Have you heard of Pierre Lévy's #IEML? Here's an aborted project envisioned back in the Identica era (2009):
http://perspective-numerique.net/wakka.php?wiki=PropositionCodageIEML (sorry, it's in French)
@Gargron you could just do a regex search for "ananas"?
@xor it should support Japanese and the various variants of Chinese.
@Gargron @xor whatlanguage?
@mattn @xor That's what v1.2 is using, and when it encounters Japanese text, it pretty much reports a random language each time.
@Gargron @xor because some Unicode code points are shared between Japanese and Chinese.
@mattn @xor uh no, it doesn't support either of those, so it just reports it as french, german, italian, polish, etc randomly
@Gargron ...I think that, perhaps, that might not be a good thing to roll into the core functionality? Perhaps instead build an API for modules other people can build for functionality like that if they so choose?
@munin @Gargron and perhaps a client setting for which language the user is using?
@Gargron You can't seriously expect a machine to do that accurately, can you?
@impiaaa accuracy is not required, only a close guess.
@impiaaa alternative is marking the language as whatever the user chose their UI to be in, or worse even, adding yet another control where you pick the language of the toot (bleh, I don't want even more controls!)
@Gargron What are you trying to achieve? Tag the toots with the correct lang attribute?
@andrewnez @Gargron as mentioned already, that doesn't work on Japanese: https://mastodon.social/@Gargron/3134956
@andrewnez @Gargron
> It works well on texts of over 10 words in length (e.g. blog posts or comments) and very poorly on short or Twitter-esque text, so be aware.
@Gargron There is probably some binding to the WordNet database; it's quite popular in the field