#Heliboard gesture data project's first batch has been run through the arcane scripts I wrote.

I only had to discard <1% of the gestures; those were duplicates. No obviously fake data!

269,845 gestures, 500+ contributors.

Only ~60% of the gestures are some form of English.

The other most common languages are:
German
Polish
French
Russian
Arabic
Swedish
Italian

There are other languages, with fewer than 1000 gestures, but I don't know their codes by heart... I will map those later!

eclexic Apr 9

@mindystclaire if I remember correctly you were interested in hearing about this?

lgsp Apr 9

@theeclecticdyslexic

That looks good, doesn't it? Do you need more data for any language in particular?

eclexic Apr 9

@lgsp I mostly need a variety of contributors to make sure we cover different ways people might gesture.

There are many languages with 0 examples... but the reality is that some languages simply don't work with the existing library, so you can't really expect much there. Contributors would have to manually force Heliboard to accept each gesture.

What I actually want to do is roll out a character limit slider so we can get more examples of shorter words. Shorter words are typically harder.

lgsp Apr 9

@theeclecticdyslexic

I contributed Italian samples and the proposed random words are quite funny, because they are completely random, so some words are absurd 😆
Maybe the char limit is one thing; the other could be a statistical selection based on some text, like publicly available books, so you get more samples of the most-used words.

eclexic Apr 9

@lgsp Yes! Some are absurd for sure. They are weighted by log(frequency), but I don't think the weights in the AOSP dicts are very good.

To give slightly more common words without boring people, we just raised those log values to the 4th power, which still skews towards uncommon words compared to real usage. Otherwise, in English, 7% of the prompts would be "the"... it would suck to contribute to.

The background collection should give a better distribution if we need it. I think active data is better though, because it is always labeled.
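
For anyone curious what that weighting does in practice, here is a minimal Python sketch, not the actual Heliboard code. The function names and the toy counts are made up for illustration, and it assumes the frequencies are raw counts greater than 1 (a count of 1 gets weight 0 and is never drawn).

```python
import math
import random


def sampling_weights(freqs, power=4):
    """Turn raw counts into weights of the form log(count) ** power.

    Assumption: every count is > 1, so each log is positive.
    """
    return {word: math.log(count) ** power for word, count in freqs.items()}


def share(weights, word):
    """Fraction of all draws expected to land on `word`."""
    return weights[word] / sum(weights.values())


def sample_words(freqs, k, power=4, rng=None):
    """Draw k prompt words (with replacement) using the log ** power weights."""
    rng = rng or random.Random()
    weights = sampling_weights(freqs, power)
    words = list(weights)
    return rng.choices(words, weights=[weights[w] for w in words], k=k)


# Made-up toy counts, only to show the shape of the effect.
toy_freqs = {"the": 70000, "keyboard": 900, "gesture": 300, "zephyr": 12}

raw_share = share(toy_freqs, "the")                          # plain frequency
log1_share = share(sampling_weights(toy_freqs, 1), "the")    # log(frequency)
log4_share = share(sampling_weights(toy_freqs, 4), "the")    # log(frequency) ** 4

prompts = sample_words(toy_freqs, k=5, rng=random.Random(0))
```

With these toy numbers, the share of "the" under log ** 4 sits between plain log weighting and raw frequency: common words come up more often than with log alone, but far less than they would in real text.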

eclexic Apr 9

@lgsp luckily, we don't really need the correct distribution. I am not planning on any ML yet.

Kalle Kniivilä Apr 9

@theeclecticdyslexic I'll send you some more in Esperanto!

eclexic 6d ago

@kallekn thanks! I appreciate it!