Just gonna throw this out here - any Ruby gem for relatively quick & accurate language detection that you know of?

I didn't expect anyone to think I am asking for a General AI.

I just want semi-accurate heuristic to spare users having to tag language manually.

Proprietary APIs like Google Translate are out of the question. Already using WhatLanguage, it's not working well. If I can't find anything, welp, there's that. Not a big deal.

@Gargron Wait, so Mastodon is guessing our languages at random? That creates all sorts of problems for us CJK users. (See Han Unification.) We should at least have an option to manually pick the correct language for the text.
@hatsuki It's guessing, but there are no filters yet.
@Gargron Oops, now I see that none of the text has a lang attribute. This is a bug: CJK languages will not display correctly without lang attributes, due to Unicode Han Unification. I know this will clutter the UI, but I strongly recommend looking into this before Mastodon reaches critical mass among Chinese-speaking communities; after that it will be a very hard problem to solve...

@hatsuki @Gargron seems to me if it contains Unified Han and kana you call it "Japanese"; if it contains hangul with or without Unified Han you call it "Korean"; and if it contains Unified Han but no significant amount of those other things, you call it "Chinese." Done, and close enough. Telling Swedish from Norwegian is much harder.

As for needing a "lang" tag to choose fonts, nearly all the Web has that problem anyway.
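The script-based heuristic above is easy to sketch in Ruby, since its regex engine understands Unicode script properties. This is just an illustration of that rule of thumb, not anything from Mastodon itself; the function name and return values are made up, and real text would want thresholds rather than a single matching character:

```ruby
# Toy version of the heuristic: kana implies Japanese, hangul implies
# Korean, and Han characters alone default to Chinese.
def guess_cjk_language(text)
  return :japanese if text =~ /[\p{Hiragana}\p{Katakana}]/
  return :korean   if text =~ /\p{Hangul}/
  return :chinese  if text =~ /\p{Han}/
  nil # no CJK scripts found; fall back to some other detector
end

guess_cjk_language("これはペンです")  # kana present, so :japanese
guess_cjk_language("안녕하세요")      # hangul, so :korean
guess_cjk_language("你好世界")        # Han only, so :chinese
```

As noted, this deliberately punts on distinguishing the Chinese variants, and a stray kana character in otherwise-Chinese text would misfire, but it is "close enough" in the sense described.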

@mattskala @Gargron "It's broken everywhere, so don't fix it." Sounds strange, but I am not the one doing the work, so I have nothing to complain about.

On detecting languages though, this assumes single-language posts. At the very least, it should be clear to the user which lang tag will be added when tooting.

@hatsuki @Gargron Well, Twitter seems to survive without good language detection. It *is* a little weird that "Haitian Creole" seems to be their default for unrecognizable stuff.
@mattskala @Gargron Unicode is a sad story.
@hatsuki @gargron @mattskala Han Unification + the lack of language tags in Unicode is a reason why I aim to support Shift JIS as well
@Gargron if it's not, it's something I've been meaning to work on for some time. Wrote a python port ages ago, should modernise.
@Gargron only thing that comes to mind is Python's Natural Language Toolkit. There's probably something similar for Ruby.
@Gargron https://github.com/jmhodges/rchardet is kind of old but it is not so old that the algorithm should be broken. Uses the same heuristics as web browsers: perhaps worth a shot.
@Gargron omg nevermind, I'm a dingus: encoding != language, nothing to see here
@Gargron there's an opportunity here to get the Mastodon users to train something custom that can be open source.
@Gargron you can train a system, the problem is there are too many languages
@Gargron if they are tagging already, you can use that data to train the system
@Gargron you don't need anything like AI. https://github.com/BYVoid/uchardet There should be gems for it too
@Gargron Have you looked at the implementation of WhatLanguage? If this #Python library looks like it is sufficiently different/better, we could port it to #Ruby: https://github.com/Mimino666/langdetect

@Gargron I tried out this library on a couple Japanese toots from the federated timeline and it detected they were Japanese every time.

The algorithm used looks pretty solid as well. Would you use a #Ruby port?

@Gargron you do not need something like AI; statistics can do it. I forget the one I used, but it was something similar to https://github.com/peterc/whatlanguage
@Gargron cld2 is accurate given enough text, which for a toot will sometimes work and sometimes not

@Gargron #NLProc'er here. Have you taken a look at https://github.com/diasks2/ruby-nlp?

Most NLP stuff isn't in Ruby (Python, Java, etc. are more prevalent).

@Gargron Look at N-gram based language detection libraries. There are many available, eg. https://github.com/optimaize/language-detector
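To make the n-gram idea concrete, here is a toy character-trigram matcher in Ruby. Everything here is invented for illustration (the tiny training samples, the scoring, the method names); real detectors like the library linked above use large per-language profiles, rank statistics, and smoothing:

```ruby
# Break text into overlapping character trigrams, e.g. "dog" in "the dog".
def trigrams(text)
  chars = text.downcase.gsub(/\s+/, " ").chars
  (0..chars.size - 3).map { |i| chars[i, 3].join }
end

# Toy "training data": one short sample per language.
SAMPLES = {
  english: "the quick brown fox jumps over the lazy dog and then the dog",
  german:  "der schnelle braune fuchs springt über den faulen hund und dann"
}

# Each profile counts how often each trigram appears in the sample.
PROFILES = SAMPLES.transform_values { |t| trigrams(t).tally }

# Score the input against each profile and pick the best overlap.
def guess(text)
  grams = trigrams(text)
  PROFILES.max_by { |_, profile| grams.sum { |g| profile.fetch(g, 0) } }.first
end

guess("the dog jumps")     # overlaps the English trigrams
guess("der hund springt")  # overlaps the German trigrams
```

This also shows why such detectors degrade on short, Twitter-esque text: a toot may contain only a handful of trigrams, so there is little signal to separate similar languages.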
@Gargron it seems like something a neural network might be good for! you'd probably need to train it for a long time though
@Gargron I'm actually planning on running a project (self-funded) to attempt to detect dominant languages on public Mastodon instances, as part of Radar.

@wogan @Gargron I have noticed a few people claiming they are learning Esperanto, but I do not see anyone actually writing in it here.

Social media actually does reflect the real world haha

@Gargron "A semi-accurate heuristic to spare users having to tag language manually"... Have you heard of Pierre Lévy's #IEML? Here's an aborted project envisioned back in the Identica era (2009):
http://perspective-numerique.net/wakka.php?wiki=PropositionCodageIEML (sorry, it's in French)
@Gargron you could just do a regex search for "ananas"?
@xor it should support Japanese and the various variants of Chinese.
@Gargron @xor whatlanguage?
@mattn @xor That's what v1.2 is using, and when it encounters Japanese text, it pretty much reports a random language each time.
@Gargron @xor because some Unicode code points are shared between Japanese and Chinese.
@mattn @xor uh no, it doesn't support either of those, so it just reports it as french, german, italian, polish, etc randomly
@Gargron ...I think that, perhaps, that might not be a good thing to roll into the core functionality? Perhaps instead build an API for modules other people can build for functionality like that if they so choose?
@munin @Gargron and perhaps a client setting for which language the user is using?
@Gargron You can't seriously expect a machine to do that accurately, can you?
@impiaaa accuracy is not required, only a close guess.
@impiaaa alternative is marking the language as whatever the user chose their UI to be in, or worse even, adding yet another control where you pick the language of the toot (bleh, I don't want even more controls!)
@Gargron What are you trying to achieve? Tag the toots with the correct lang attribute?
@andrewnez @Gargron as mentioned already, that doesn't work on Japanese: https://mastodon.social/@Gargron/3134956
@andrewnez @Gargron
> It works well on texts of over 10 words in length (e.g. blog posts or comments) and very poorly on short or Twitter-esque text, so be aware.
@Gargron There is probably some binding to the WordNet database; it's quite popular in the field