Benchmarking large language models for automated labeling: The case of issue report classification
Abstract
Context: Issue labeling is a fundamental task in software development, as it is critical for the effective management of software projects. This practice involves assigning labels to issues, such as bug or feature request, that denote tasks relevant to project management. To date, large language models (LLMs) have been proposed to automate this task, including both fine-tuned BERT-like models and zero-shot GPT-like models. Objectives: In this paper, we investigate which LLMs offer the best trade-off among classification performance, response time, hardware requirements, and response quality for issue report classification. Methods: We design and execute a comprehensive benchmark study to assess 22 generative decoder-only LLMs and 2 baseline BERT-like encoder-only models, which we evaluate on two different datasets of GitHub issues. Results: Generative LLMs demonstrate potential for zero-shot classification. However, their performance varies significantly across datasets, and they require substantial computational resources for deployment. In contrast, BERT-like models show more consistent performance and lower resource requirements. Conclusions: Based on the empirical evidence provided in this study, we discuss implications for researchers and practitioners. In particular, our results suggest that fine-tuning BERT-like encoder-only models yields consistent, state-of-the-art performance across datasets, even when only a small amount of labeled data is available for training.
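To illustrate the zero-shot setting evaluated in the study, the sketch below shows one plausible way to classify a GitHub issue with a generative LLM via the Hugging Face transformers pipeline. The model identifier, label set, and prompt wording are illustrative assumptions, not the configuration used in the paper.

```python
# Illustrative sketch only: zero-shot issue classification with a generative LLM.
# The model name, labels, and prompt are assumptions, not the paper's setup.
from transformers import pipeline

LABELS = ["bug", "feature", "question"]

# Any instruction-tuned causal LM can be substituted here.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def classify_issue(title: str, body: str) -> str:
    """Prompt the model to pick one label; fall back to the first label."""
    prompt = (
        "Classify the following GitHub issue with exactly one label from: "
        + ", ".join(LABELS) + ".\n"
        f"Title: {title}\nBody: {body}\nLabel:"
    )
    out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    completion = out[len(prompt):].lower()  # keep only the generated continuation
    return next((label for label in LABELS if label in completion), LABELS[0])

print(classify_issue("App crashes on startup", "Segfault when opening the main window."))
```

A fine-tuned BERT-like baseline would instead train an encoder-only classifier on labeled issues; the zero-shot approach above trades that labeling effort for larger models and heavier inference costs.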
Type
Publication
Information and Software Technology