Has anyone used Dirk Hovy's MACE (Multi-Annotator Competence Estimation) and knows if it can deal with items that receive multiple labels (by the same annotator)? #nlproc
ah, I looked at the data structures it uses and I expect it's not possible. That's really too bad. I wonder how hard it would be to extend...
@tschfflr now that's a blast from the past. I was Dirk's office mate back then and remember suggesting that acronym to sneak in my RPG character's name. Can't help with your question though, sorry.
@MarcSchulder ah, cool, I didn't know that!
On my bike to work I had the idea to binarize the labels by using an item-label combination as the new items and then just estimating the annotator competence on these binary yes/no labels. Then I could use MACE directly. Looks like someone already did it that way (https://homepages.tuni.fi/annamaria.mesaros/pubs/martinmorato_eusipco2021.pdf).
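A minimal sketch of that binarization idea, with made-up toy data (the annotator/item/label names are just placeholders). It turns each (item, label) combination into a new binary "item" and writes the kind of CSV MACE reads: one row per item, one column per annotator, empty cells where an annotator skipped the original item.

```python
import csv
import io

# Hypothetical multi-label annotations: annotator -> item -> set of chosen labels.
annotations = {
    "ann1": {"item1": {"A", "B"}, "item2": {"C"}},
    "ann2": {"item1": {"A"},      "item2": {"B", "C"}},
}
labels = ["A", "B", "C"]
annotators = sorted(annotations)
items = sorted({i for per_ann in annotations.values() for i in per_ann})

rows = []
for item in items:
    for label in labels:                 # each (item, label) pair becomes a binary item
        row = []
        for ann in annotators:
            chosen = annotations[ann].get(item)
            # empty cell = annotator never saw this item; else yes/no for this label
            row.append("" if chosen is None else ("1" if label in chosen else "0"))
        rows.append(row)

buf = io.StringIO()
csv.writer(buf).writerows(rows)          # this CSV can then be fed to MACE as-is
print(buf.getvalue())
```

With 2 items and 3 labels this yields 6 binary items; competence is then estimated on the yes/no decisions, as in the EUSIPCO paper linked above.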
This should work if the labels are rated basically independently for each item (fine for my data). However, MACE also uses priors for the individual label prevalences, and more generally information about the overall label distribution (i.e., which labels are harder or easier), and with binarization you'd completely lose this information.
@tschfflr Bummer. Maybe safer to separately evaluate the different label categories? (If I understood your setup correctly)
@MarcSchulder what I have is 30 untrained people who labelled a set of items with 20 labels, but several labels might apply to any given item (say, up to 5 or so make sense). The annotators are very unreliable, and I want to select the "better" ones based on their consistency, so I can remove the less reliable annotators.
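Once MACE has produced per-annotator competence scores, the selection step itself is trivial; a sketch with invented scores and an arbitrary cut-off (both would come from the actual MACE competence output in practice):

```python
# Hypothetical competence scores per annotator (e.g. as estimated by MACE).
competence = {"ann01": 0.82, "ann02": 0.31, "ann03": 0.67}

threshold = 0.5  # arbitrary cut-off; could also just keep the top-k annotators

# Keep only annotators at or above the threshold, most competent first.
kept = [ann for ann, c in sorted(competence.items(), key=lambda kv: -kv[1])
        if c >= threshold]
print(kept)
```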
Technically, the labels are not even independent of each other (they're facial action units applied to emojis, so if you have "lip corners turned up", then "lip corners turned down" should be excluded) - but this info can probably be learned from the data via EM