Has anyone used Dirk Hovy's MACE (Multi-Annotator Competence Estimation) and knows whether it can deal with items that receive multiple labels from the same annotator? #nlproc

Tatjana Scheffler 2d ago

ah I looked at the data structures it uses and expect it's not possible. That's really too bad. I wonder how hard it would be to extend...

Marc Schulder 1d ago

@tschfflr now that’s a blast from the past. I was Dirk’s office mate back then and remember suggesting that acronym to sneak in my RPG character’s name. Can’t help with your question though, sorry.

Tatjana Scheffler 1d ago

@MarcSchulder ah, cool, I didn't know that!
On my bike to work I had the idea to binarize the labels by using an item-label combination as the new items and then just estimating the annotator competence on these binary yes/no labels. Then I could use MACE directly. Looks like someone already did it that way (https://homepages.tuni.fi/annamaria.mesaros/pubs/martinmorato_eusipco2021.pdf).
This should work if the overlapping labels are rated more or less independently for each item (fine for my data). However, MACE also uses priors for the individual label prevalences, and more generally information about the overall label distribution (= harder or easier labels), and you'd completely lose this information.
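A minimal sketch of the binarization idea, in case it helps anyone trying the same thing. Everything here is an assumption for illustration: the annotator names, emoji ids and action-unit codes are made up, and the output layout (one row per new item, one column per annotator, empty cell where an annotator did not see the item) is my understanding of the CSV input MACE expects, so check it against the MACE README before relying on it.

```python
import csv
import io

def binarize(annotations, annotators, labels):
    """Turn multi-label annotations into binary yes/no judgements.

    annotations: dict mapping (annotator, item) -> set of labels chosen.
    Returns (keys, rows): keys[i] is the (item, label) pair that row i
    stands for; rows[i] holds "1"/"0" per annotator, or "" if that
    annotator never rated the item.
    """
    items = sorted({item for (_, item) in annotations})
    keys, rows = [], []
    for item in items:
        for label in labels:
            row = []
            for annotator in annotators:
                chosen = annotations.get((annotator, item))
                if chosen is None:
                    row.append("")  # annotator did not rate this item
                else:
                    row.append("1" if label in chosen else "0")
            keys.append((item, label))
            rows.append(row)
    return keys, rows

def to_csv(rows):
    """Serialize rows as comma-separated text, one new item per line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

# Hypothetical toy data: two annotators, two emojis, three action units.
annotations = {
    ("ann1", "emoji_a"): {"AU6", "AU12"},
    ("ann2", "emoji_a"): {"AU12"},
    ("ann1", "emoji_b"): {"AU15"},
}
annotators = ["ann1", "ann2"]
labels = ["AU6", "AU12", "AU15"]

keys, rows = binarize(annotations, annotators, labels)
csv_text = to_csv(rows)
```

With 20 labels every original item becomes 20 binary items, and most of them will be "0" for everyone, so agreement on absent labels may dominate the competence estimates. Keeping `keys` around makes it possible to map results back to the original item-label pairs.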

Marc Schulder 1d ago

@tschfflr Bummer. Maybe safer to separately evaluate the different label categories? (If I understood your setup correctly)

Tatjana Scheffler 1d ago

@MarcSchulder what I have is 30 untrained people who labelled a set of items with 20 labels, but several labels might apply for any given item (say, up to 5 or so make sense). The annotators are very unreliable, and I want to select the "better" ones based on their consistency, so I can remove the less reliable annotators.
Technically, the labels are not even independent of each other (they're facial action units applied to emojis, so if you have a "lip corners turned up" then "lip corners turned down" should be excluded) - but this info can probably be learned from the data by EM
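Before leaving the label dependencies to EM, a quick sanity check might be to count how often each pair of labels was assigned together by the same annotator on the same item. This is only a rough sketch under assumed names: the data layout and the action-unit codes below are hypothetical, and raw co-occurrence counts are a descriptive check, not part of MACE.

```python
from collections import Counter
from itertools import combinations

def label_cooccurrence(annotations):
    """Count label pairs chosen together in a single annotation.

    annotations: dict mapping (annotator, item) -> set of labels chosen.
    Returns a Counter keyed by alphabetically sorted (label_a, label_b).
    """
    counts = Counter()
    for chosen in annotations.values():
        for pair in combinations(sorted(chosen), 2):
            counts[pair] += 1
    return counts

# Hypothetical toy data; AU12 and AU15 stand in for the
# "lip corners up" / "lip corners down" pair that should exclude each other.
annotations = {
    ("ann1", "emoji_a"): {"AU6", "AU12"},
    ("ann2", "emoji_a"): {"AU12", "AU15"},
    ("ann3", "emoji_a"): {"AU6", "AU12"},
}

counts = label_cooccurrence(annotations)
```

Pairs that should be mutually exclusive but still show up with a non-zero count (here `("AU12", "AU15")` from `ann2`) point at annotations worth a closer look.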