Impact of Data Quality for Automatic Issue Classification Using Pre-Trained Language Models

Apr 1, 2024

Giuseppe Colavito

Filippo Lanubile

Nicole Novielli

Luigi Quaranta

Abstract

Issue classification aims to recognize whether an issue reports a bug, a request for enhancement, or a request for support. In this paper, we use pre-trained models for the automatic classification of issues and investigate how data quality affects classifier performance. None of the data quality filters we applied had a significant effect on model quality. As the root cause, we identify a threat to construct validity underlying the issue labeling. Editor’s note: Open Science material was validated by the Journal of Systems and Software Open Science Board.

Type

Journal article

Publication

Journal of Systems and Software

Last updated on Apr 1, 2024

BERT, GitHub, Issue Labeling, Issue Trackers, Label Correctness, Model Quality
