A Study of Label Errors and Their Impact on LLM Performance Evaluations
Everybody is a genius. But if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid. — Einstein (attributed)
Natural Language Processing (NLP) benchmarks have been instrumental in advancing the field, providing standardized datasets both for training models and for evaluating them. These datasets have evolved alongside the models themselves, growing in scale and in the number of tasks they can test. Initially they were manually verified, since expert labeling is clearly preferable. As datasets have grown, however, expert annotation has become cost-prohibitive and no longer feasible. At the same time, the demand for annotation has not slowed (indeed there is always more of it), so researchers have turned to crowd-sourcing.
While crowd-sourcing has reduced the cost and time of obtaining datasets, it has increased the amount of error they contain. No dataset is perfect: building one means trading off scale, efficiency, and expertise. Even experts make mistakes (due to task subjectivity, annotator fatigue, inattention, insufficient guidelines, and more), and datasets annotated by non-experts contain many more errors still.

These errors can have a number of unexpected effects: they can hurt a model's generalization if it is trained on them, or distort the evaluation of a model tested against them. Large language models (LLMs) have advanced rapidly and can learn a task in context, so some authors have proposed using them to annotate datasets in place of crowd-sourcing.
We find that across the four datasets, ChatGPT’s zero-shot accuracy is higher than that of MTurk for most tasks. For all tasks, ChatGPT’s intercoder agreement exceeds that of both MTurk and trained annotators. Moreover, ChatGPT is significantly cheaper than MTurk.
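As a rough illustration of what zero-shot LLM annotation looks like in practice, here is a minimal sketch. The label set, the prompt wording, and the `call_llm` helper are illustrative placeholders of this post, not the setup used in the work quoted above.

```python
# Minimal sketch of zero-shot LLM annotation.
# `call_llm` is a hypothetical function that sends a prompt to some LLM API
# and returns its text reply; it is not tied to any specific provider.

LABELS = ["positive", "negative", "neutral"]  # example label set

PROMPT_TEMPLATE = (
    "Classify the sentiment of the following text as one of {labels}.\n\n"
    "Text: {text}\n"
    "Label:"
)

def annotate(text: str, call_llm) -> str:
    """Ask the LLM for a single label; fall back to 'neutral' if the reply is unexpected."""
    prompt = PROMPT_TEMPLATE.format(labels=", ".join(LABELS), text=text)
    reply = call_llm(prompt).strip().lower()
    return reply if reply in LABELS else "neutral"
```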

Using LLMs looks especially promising for low-resource languages, where annotators can be hard to find and an LLM may be the only practical option. And if a dataset already exists, LLMs could be used to improve it by detecting likely errors, which experts can then re-annotate and correct.
Questions remain:
- How many errors are there in these benchmarks?
- Can LLMs identify them?
- How do expert, crowd-sourced, and LLM-based annotations compare in quality and efficiency?
- What are the implications of these errors on model performance and can we mitigate their impact?
In this study, the authors ask exactly these questions.
They start from the observation that LLMs can indeed identify label errors, but their judgments often disagree, so a human still has to review them. They therefore propose using an ensemble of LLMs: several different models, each queried with several different prompts, acting as a kind of pool of experts. Examples where the original label strongly disagrees with the ensemble's annotations are flagged as potentially mislabeled.
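A minimal sketch of this flagging step is below, assuming each ensemble member (an LLM paired with a prompt) has already produced one label per example. The function name, data layout, and the disagreement threshold are illustrative choices, not the exact procedure from the paper.

```python
from collections import Counter

def flag_mislabeled(original_labels, ensemble_labels, threshold=0.8):
    """Flag examples where the LLM ensemble strongly disagrees with the original label.

    original_labels: list of gold labels, one per example.
    ensemble_labels: list of lists; ensemble_labels[i] holds the labels assigned to
        example i by the different (LLM, prompt) pairs in the ensemble.
    threshold: fraction of ensemble votes that must contradict the original label
        before the example is flagged as potentially mislabeled.
    """
    flagged = []
    for i, (gold, votes) in enumerate(zip(original_labels, ensemble_labels)):
        counts = Counter(votes)
        disagreement = 1.0 - counts[gold] / len(votes)
        if disagreement >= threshold:
            # Record the index, the original label, and the ensemble's majority label.
            flagged.append((i, gold, counts.most_common(1)[0][0]))
    return flagged

# Example: three ensemble members, two examples.
print(flag_mislabeled(
    original_labels=["positive", "negative"],
    ensemble_labels=[["negative", "negative", "neutral"],   # strong disagreement -> flagged
                     ["negative", "negative", "positive"]], # mostly agrees -> kept
))
```

In this sketch the flagged examples are only surfaced, not silently relabeled; consistent with the discussion above, they would be passed to human experts for re-annotation.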