Weighing lexical information for software clustering in the context of architecture recovery

Corazza, Anna; Di Martino, Sergio; Maggio, Valerio; Scanniello, Giuseppe

doi:10.1007/s10664-014-9347-3

In the literature, some approaches have been proposed to partition software systems into meaningful subsystems by exploiting the lexical information that programmers provide in the source code. However, these techniques usually do not consider the programming language constructs in which the lexicon appears (e.g., comments, class names, method names), even though it is common experience that programmers take different degrees of care in choosing terms for different constructs. In this paper, we present a novel lexical-based software clustering technique that exploits the contribution of terms placed in six different parts of the source code (i.e., zones), namely Class Names, Attribute Names, Method Names, Parameter Names, Comments, and Source Code Statements. These zones convey information with different levels of relevance, so their contributions should be weighted differently according to the specificities of the analyzed software system. To this aim, we define a probabilistic model of the data whose parameters are estimated automatically by the Expectation-Maximization algorithm. The resulting weights are then exploited to generate the software partitions with two distinct clustering algorithms, properly customized to make them more suitable for the specific domain. The overall technique has been assessed in a case study conducted on a dataset of 16 open source software systems, whose results are presented in the paper. In particular, we experimentally observed that the use of both the defined zones and the Expectation-Maximization algorithm improves the overall quality of results.
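The abstract does not spell out the probabilistic model, so the following is only an illustrative sketch of how zone weights could be estimated with Expectation-Maximization. It assumes a simple mixture in which each zone has a fixed term distribution (relative term frequencies) and EM estimates only the mixing weights. The function names `zone_distributions` and `em_zone_weights`, the toy data, and the mixture form itself are assumptions made for illustration, not the authors' formulation.

```python
from collections import Counter


def zone_distributions(zone_terms):
    """Relative term frequencies per zone (hypothetical helper)."""
    dists = {}
    for zone, terms in zone_terms.items():
        counts = Counter(terms)
        total = sum(counts.values())
        dists[zone] = {t: c / total for t, c in counts.items()}
    return dists


def em_zone_weights(dists, observed_terms, iterations=50):
    """Estimate mixing weights of fixed per-zone distributions with EM.

    Assumed model: P(term) = sum over zones of weight[zone] * P(term | zone).
    """
    zones = list(dists)
    weights = {z: 1.0 / len(zones) for z in zones}  # uniform start
    for _ in range(iterations):
        expected = {z: 0.0 for z in zones}
        used = 0
        for term in observed_terms:
            # E-step: responsibility of each zone for this term.
            joint = {z: weights[z] * dists[z].get(term, 0.0) for z in zones}
            norm = sum(joint.values())
            if norm == 0.0:
                continue  # term unseen in every zone; skip it
            used += 1
            for z in zones:
                expected[z] += joint[z] / norm
        if used == 0:
            break
        # M-step: new weight is the average responsibility.
        weights = {z: expected[z] / used for z in zones}
    return weights


# Toy usage with two of the six zones (made-up data).
toy_zones = {
    "class_names": ["parser", "parser", "lexer"],
    "comments": ["todo", "fix", "parser"],
}
toy_weights = em_zone_weights(
    zone_distributions(toy_zones),
    ["parser", "parser", "lexer", "parser"],
)
print(toy_weights)
```

In this toy setting the weights always sum to one, and the zone whose vocabulary better explains the observed terms receives the larger weight. A real system would use all six zones and feed the learned weights into the similarity measure used by the clustering algorithms.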

Weighing lexical information for software clustering in the context of architecture recovery / Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. - In: EMPIRICAL SOFTWARE ENGINEERING. - ISSN 1382-3256. - 21:1(2016), pp. 72-103. [10.1007/s10664-014-9347-3]