ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks

with Meysam Alizadeh & Maël Kubli

We find that zero-shot #ChatGPT:
> has better accuracy than #MTurk
> has better intercoder agreement than MTurk and trained coders
> is 20x cheaper than MTurk

https://arxiv.org/abs/2303.15056


Many NLP applications require manual data annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd-workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using a sample of 2,382 tweets, we demonstrate that ChatGPT outperforms crowd-workers for several annotation tasks, including relevance, stance, topics, and frames detection. Specifically, the zero-shot accuracy of ChatGPT exceeds that of crowd-workers for four out of five tasks, while ChatGPT's intercoder agreement exceeds that of both crowd-workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003 -- about twenty times cheaper than MTurk. These results show the potential of large language models to drastically increase the efficiency of text classification.
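The zero-shot setup described in the abstract amounts to sending each tweet, together with the codebook instructions for a task, to the model and asking for a single label. A minimal sketch of how such a prompt might be assembled is below; the instruction text, task, and labels are placeholders for illustration, not the actual codebooks used in the paper.

```python
# Hedged sketch of zero-shot annotation prompt construction.
# The codebook text, task, and label set here are hypothetical examples,
# not the ones used in the study.

def build_annotation_prompt(task_instructions: str, labels: list[str], tweet: str) -> str:
    """Combine codebook instructions, the label set, and one tweet
    into a single zero-shot classification prompt."""
    label_list = ", ".join(labels)
    return (
        f"{task_instructions}\n\n"
        f"Classify the following tweet into exactly one of these categories: {label_list}.\n"
        f"Answer with the category name only.\n\n"
        f"Tweet: {tweet}"
    )

# Example with placeholder codebook text for a relevance task:
prompt = build_annotation_prompt(
    task_instructions="You will annotate tweets about content moderation.",
    labels=["relevant", "irrelevant"],
    tweet="Platforms should be more transparent about moderation decisions.",
)
print(prompt)
```

In practice this prompt would be sent as a single user message to a chat-model API, and the returned label string compared against the human annotations to compute accuracy and intercoder agreement.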

@fgilardi Maybe I misunderstand the paper's argument (have only read the abstract) but isn't the purpose of much text annotation precisely to get a benchmark of *human* interpretation of text?
@RenseC to me, most of the time the purpose is to get consistent labels based on conceptual categories developed by researchers. Otherwise it's just opinions, no?
@fgilardi For such cases, comparing to AI indeed makes sense. But if you want to benchmark the performance of an NLP algorithm that aims to approximate human interpretation ("opinions"), then having an actual human benchmark seems indispensable.

@fgilardi 3 questions I had after reading:

a) How much is what you find a function of having tested this for Tweets (i.e. is ChatGPT particularly good for Twitter)?

b) If you use ChatGPT-based annotations for supervised learning, doesn't this amplify the biases/errors of ChatGPT?

c) Who'd still want to train classifiers if ChatGPT itself is already classifying accurately enough? (Is the argument about costs, or about transparency and the possibility of validating annotations?)

@ronpatz

a) we have tested only Tweets so far. Important to look at other kinds of data

b) might be, but unclear how that’s different from using manual annotations. Certainly important to evaluate

c) cost might still be an issue for very large datasets

Our main conclusion is simply that there's potential, but it's important to study these and other issues in more detail

@fgilardi Cool! Looking forward to seeing if I can replicate some of this on my data. I already wrote Mael about this, but I'd hypothesize it will be harder for an LLM to do this for my local discourse data... but it will be interesting to see how the purposes of human expert annotation will shift. I think in some ways it becomes even more important.

Also: you had good codebooks - how much were these the result of knowledge gained from iterating at the beginning of annotation?

@mario_angst_sci we didn't test different codebooks on ChatGPT because we wanted to use exactly the same ones we used for the research assistants
@fgilardi Yeah, that makes a lot of sense. I was more alluding to the fact that (at least for me) the final production codebooks are often the result of iterative rounds of annotation and discussion within the team.
That's one example, in my opinion, of where the purpose of annotation might shift: codebook development, test-set creation, refinement of complex concepts...
@mario_angst_sci oh yes absolutely, our codebooks involved the process you describe
@fgilardi which in a way is important in this context - the LLM still implicitly relied on (and probably could not have worked so well without) human annotators creating quality codebooks through immersion in the corpus. I think these nuances are going to be quite relevant to figure out in future workflows.