Information Retrieval (IR) techniques are being exploited by a growing number of tools that support Software Maintenance activities. This is because the lexical information that programmers embed in source code can be valuable for tasks such as concept location, clustering, or recovery of traceability links. However, such IR-based techniques rely on a consistent lexicon across the different artifacts, and their effectiveness can degrade if programmers introduce abbreviations (e.g., rect) and/or do not strictly follow naming conventions such as Camel Case (e.g., UTFtoASCII). In this paper we propose an approach, useful for all of these IR-based tools, that automatically splits identifiers into their component words and expands abbreviations. The solution runs in linear time by applying an approximate pattern matching technique to a graph-based model. Linear complexity makes it feasible to exploit a number of different dictionaries, referring to increasingly broader contexts, so that disambiguation is based on the knowledge gathered from the most appropriate domain. The proposed approach has been compared with other splitting and expansion techniques, using freely available oracles for the identifiers extracted from a number of C/C++ and Java open source systems. Results show an improvement in both splitting and expansion performance, in addition to a marked improvement in computational efficiency.
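
The problem described in the abstract can be illustrated with a minimal Python sketch. This is not the LINSEN algorithm: it replaces the linear-time approximate pattern matching on a graph-based model with plain dynamic programming and prefix matching, purely to show why a strict Camel Case rule fails on an identifier like UTFtoASCII and how dictionaries ordered from narrow to broad context could drive disambiguation. All function names and the toy dictionaries are assumptions made for this example.

```python
import re


def naive_camel_split(identifier):
    # Split only at lower-to-upper transitions, as a strict Camel Case rule would.
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", identifier).split()


def dictionary_split(identifier, words):
    # Segment the lowercased identifier into known words (dynamic programming,
    # NOT the linear-time technique of the paper).
    # best[i] holds one segmentation of text[:i], or None if none was found.
    text = identifier.lower()
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in words:
                best[end] = best[start] + [text[start:end]]
                break
    return best[len(text)]


def expand(token, dictionaries):
    # Try dictionaries from the narrowest context to the broadest.
    # A token that is already a word is kept; otherwise return the first
    # (alphabetically) word that the token is a prefix of.
    for words in dictionaries:
        if token in words:
            return token
        matches = sorted(w for w in words if w.startswith(token))
        if matches:
            return matches[0]
    return token


# Hypothetical toy dictionaries: source-file context first, general words last.
local_words = {"rectangle", "draw"}
general_words = {"rectify", "rector", "utf", "to", "ascii"}

print(naive_camel_split("UTFtoASCII"))                # ['UTFto', 'ASCII']
print(dictionary_split("UTFtoASCII", general_words))  # ['utf', 'to', 'ascii']
print(expand("rect", [local_words, general_words]))   # rectangle
```

With the narrow dictionary consulted first, rect expands to rectangle; with only the general dictionary it would expand to rectify, which is the kind of ambiguity that context-ordered dictionaries are meant to resolve.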

LINSEN: An Efficient Approach to Split Identifiers and Expand Abbreviations / Corazza, Anna; Di Martino, Sergio; Maggio, Valerio. - (2012), pp. 233-242. (Paper presented at the 28th IEEE International Conference on Software Maintenance, held in Riva del Garda, Italy, September 23rd - 30th, 2012) [10.1109/ICSM.2012.6405277].

LINSEN: An Efficient Approach to Split Identifiers and Expand Abbreviations

Corazza, Anna; Di Martino, Sergio; Maggio, Valerio
2012

9781467323123


Use this identifier to cite or link to this document: https://hdl.handle.net/11588/509767