Automating the correctness assessment of AI-generated code for security contexts / Cotroneo, D.; Foggia, A.; Improta, C.; Liguori, P.; Natella, R. - In: THE JOURNAL OF SYSTEMS AND SOFTWARE. - ISSN 0164-1212. - 216:(2024). [10.1016/j.jss.2024.112113]

Automating the correctness assessment of AI-generated code for security contexts

Cotroneo D.; Foggia A.; Improta C.; Liguori P.; Natella R.
2024

Abstract

Evaluating the correctness of code generated by AI is a challenging open problem. In this paper, we propose a fully automated method, named ACCA, to evaluate the correctness of AI-generated code for security purposes. The method uses symbolic execution to assess whether the AI-generated code behaves like a reference implementation. We use ACCA to assess four state-of-the-art models trained to generate security-oriented assembly code and compare the results of the evaluation with different baseline solutions, including output similarity metrics, which are widely used in the field, and ChatGPT, the AI-powered language model developed by OpenAI. Our experiments show that our method outperforms the baseline solutions and assesses the correctness of the AI-generated code similarly to the human-based evaluation, which is considered the ground truth for the assessment in the field. Moreover, ACCA has a very strong correlation with the human evaluation (Pearson's correlation coefficient r = 0.84 on average). Finally, since it is a fully automated solution that does not require any human intervention, the proposed method assesses each code snippet in ∼0.17 s on average, which, based on our experience, is well below the average time human analysts need to inspect the code manually.
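As an illustration of the agreement measure mentioned in the abstract, the following minimal sketch computes Pearson's correlation coefficient between automated correctness verdicts and human verdicts. The function name `pearson_r` and the two verdict lists are hypothetical and invented for this example; they are not data or code from the paper, and the paper's own evaluation pipeline may differ.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical verdicts (1 = correct, 0 = incorrect) for eight code snippets.
# These values are made up for illustration only.
automated_verdicts = [1, 1, 0, 1, 0, 0, 1, 1]
human_verdicts = [1, 1, 0, 1, 0, 1, 1, 1]

r = pearson_r(automated_verdicts, human_verdicts)
print(f"Pearson r = {r:.2f}")
```

A value of r close to 1 indicates that the automated verdicts rise and fall with the human ones; with binary verdicts as above, this coefficient coincides with the phi coefficient.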
Use this identifier to cite or link to this document: https://hdl.handle.net/11588/972386