Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation

Improta, Cristina; Tufano, Rosalia; Liguori, Pietro; Cotroneo, Domenico; Bavota, Gabriele
2025

Abstract

Deep Learning (DL)-based code generators have seen significant advancements in recent years. Tools such as GitHub Copilot are used by thousands of developers with the main promise of a boost in productivity. However, researchers have recently questioned their impact on code quality, showing, for example, that code generated by DL-based tools may be affected by security vulnerabilities. Since DL models are trained on large code corpora, one may conjecture that the low-quality code they output is the result of low-quality code they have seen during training. However, there is very little empirical evidence documenting this phenomenon. Indeed, most previous work looks at the frequency with which commercial code generators (e.g., Copilot, ChatGPT) recommend low-quality code, without the possibility of relating this to their (publicly unavailable) training sets. In this paper, we investigate the extent to which low-quality code instances seen during training affect the quality of the code generated at inference time. We start by fine-tuning a pre-trained DL model on a large-scale dataset (>4.4M functions) representative of those usually adopted in the training of code generators. We show that 4.98% of the functions in this dataset exhibit one or more quality issues related to security, maintainability, coding practices, etc. We use the fine-tuned model to generate 551k Python functions, showing that 5.85% of them are affected by at least one quality issue. We then remove the low-quality functions from the training set and use the cleaned dataset to fine-tune a second model, which we use to generate the same 551k Python functions. We show that the model trained on the cleaned dataset exhibits performance in terms of functional correctness similar to the original model (i.e., the one trained on the whole dataset) while generating a statistically significantly lower percentage of low-quality functions (2.16%). Our study empirically documents the importance of high-quality training data for code generators.
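The abstract outlines two steps that are easy to picture in code: filtering flagged functions out of the fine-tuning corpus, and testing whether the drop in low-quality generations (5.85% vs. 2.16% of ~551k functions) is statistically significant. Below is a minimal Python sketch of both steps. Note the assumptions: find_quality_issues is a toy placeholder, not the static analyzers the authors actually used, and the two-proportion z-test is only one reasonable choice of test, since the abstract does not name the one applied in the study.

# Minimal sketch (assumptions flagged): step (1) drops flagged functions from the
# fine-tuning corpus; step (2) checks whether the reported drop in low-quality
# generations (5.85% -> 2.16% of ~551k functions) is statistically significant.
from statsmodels.stats.proportion import proportions_ztest


def find_quality_issues(source_code: str) -> list[str]:
    """Toy stand-in for the paper's static analysis step (NOT the authors' tooling).

    Flags two well-known risky patterns; the real checks cover security,
    maintainability, and coding-practice rules.
    """
    issues = []
    if "eval(" in source_code or "exec(" in source_code:
        issues.append("use of eval/exec")
    if "except:" in source_code:
        issues.append("bare except clause")
    return issues


def clean_corpus(functions: list[str]) -> list[str]:
    """Keep only functions with no flagged issues (the 'cleaned' training set)."""
    return [fn for fn in functions if not find_quality_issues(fn)]


corpus = [
    "def parse(x):\n    return eval(x)",   # flagged, would be removed
    "def add(a, b):\n    return a + b",    # clean, would be kept
]
print(len(clean_corpus(corpus)))  # -> 1

# Two-proportion z-test on the abstract's numbers; the paper's actual statistical
# test is not named in the abstract, so this is only one reasonable choice.
n = 551_000
flagged = [round(0.0585 * n), round(0.0216 * n)]
z_stat, p_value = proportions_ztest(count=flagged, nobs=[n, n])
print(f"z = {z_stat:.2f}, p = {p_value:.3g}")  # a tiny p-value indicates a significant drop
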
Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation / Improta, Cristina; Tufano, Rosalia; Liguori, Pietro; Cotroneo, Domenico; Bavota, Gabriele. - (2025), pp. 454-465. (33rd IEEE/ACM International Conference on Program Comprehension, ICPC 2025) [10.1109/icpc66645.2025.00056].
Files in this record:

Quality_In_Quality_Out_Investigating_Training_Datas_Role_in_AI_Code_Generation.pdf
Type: Publisher's version (PDF)
License: Publisher's copyright
Size: 606.58 kB
Format: Adobe PDF
Access: authorized users only (a copy can be requested)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/1008376
Citations
  • Scopus: 1