@MarcSchulder I have 30 untrained annotators who labelled a set of items using 20 labels, and several labels can apply to any given item (up to 5 or so is plausible). The annotators are very unreliable, so I want to identify the "better" ones by their consistency and remove the less reliable ones.
Technically, the labels are not even independent of each other (they're facial action units applied to emojis, so if an item has "lip corners turned up" then "lip corners turned down" should be excluded), but this info can probably be learned from the data by EM (expectation-maximization).
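
To make the setup concrete, here is a minimal sketch of one simple baseline for ranking annotators: each annotator's mean pairwise Jaccard agreement with every other annotator over the items they both labelled. This is not the EM approach mentioned above and it ignores the dependencies between labels. The function names, the data layout (`{annotator: {item: set_of_labels}}`) and the toy data are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two label sets; two empty sets count as full agreement."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def annotator_agreement(annotations):
    """Mean pairwise Jaccard agreement of each annotator with all the others.

    annotations: {annotator: {item: set_of_labels}} (assumed layout).
    Only items labelled by both annotators of a pair are compared.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a, b in combinations(annotations, 2):
        shared_items = annotations[a].keys() & annotations[b].keys()
        for item in shared_items:
            score = jaccard(annotations[a][item], annotations[b][item])
            for who in (a, b):
                totals[who] += score
                counts[who] += 1
    return {who: totals[who] / counts[who] for who in annotations if counts[who]}


# Hypothetical toy data: 3 annotators, 2 emoji items, action-unit style labels.
toy = {
    "ann1": {"e1": {"lip_corners_up", "cheeks_raised"}, "e2": {"brow_lowered"}},
    "ann2": {"e1": {"lip_corners_up", "cheeks_raised"},
             "e2": {"brow_lowered", "lips_pressed"}},
    "ann3": {"e1": {"lip_corners_down"}, "e2": {"jaw_drop"}},
}

scores = annotator_agreement(toy)
# Highest agreement first; the lowest-ranked annotators are removal candidates.
ranking = sorted(scores, key=scores.get, reverse=True)
```

One caveat with this kind of score: if most annotators are unreliable, agreeing with the crowd is not the same as being right, which is part of why a model-based approach such as EM may still be preferable.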