Dendrogram slicing through a permutation test approach for mixed data / Palazzo, Lucio; Vistocco, Domenico; Palumbo, Francesco. - (2023). (Paper presented at DSSV-ECDA 2023, Joint conference of Data Science, Statistics & Visualisation and the European Conference on Data Analysis, held in Antwerp).
Dendrogram slicing through a permutation test approach for mixed data
Lucio Palazzo; Domenico Vistocco; Francesco Palumbo
2023
Abstract
DESPOTA (DEndrogram Slicing through a PermutatiOn Test Approach) is a clustering technique that cuts the branches of a hierarchical clustering tree at different heterogeneity levels to find the best partition among those the tree allows. At each node, DESPOTA runs a permutation test of the null hypothesis that the two descending branches belong to a single cluster. The number of clusters is thus chosen through separate permutation tests, which take into account the minimal cost necessary for combining two branches and the cost associated with the merging process. DESPOTA is data-driven and does not rely on any distributional assumptions. Mixed data comprise both numeric and categorical features, and mixed datasets occur frequently in many domains, such as biology, education, and healthcare. Such datasets are often clustered to identify structure and group similar individuals. However, mathematical operations are difficult to apply directly to mixed features, which makes clustering challenging. This paper extends the applicability of DESPOTA to mixed-type data. To this end, the original agglomerative procedure is called into question and a divisive approach is proposed. The proposed approach requires only the distance matrices and is therefore well suited to mixed data.

File | Access | License | Size | Format
---|---|---|---|---
DSSV_ECDA_All_Abstracts-9.pdf | Open access | Publisher's copyright | 18.67 MB | Adobe PDF
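
The abstract describes a node-level permutation test that needs only a distance matrix. The Python sketch below illustrates that general idea under stated assumptions and is not the authors' procedure: `gower_distance` is a simple Gower-style dissimilarity chosen here for illustration, `split_statistic` (mean between-branch distance over mean within-branch distance) is a stand-in for the DESPOTA cost ratio, and all function names are hypothetical.

```python
import numpy as np

def gower_distance(num, cat):
    """Gower-style dissimilarity (an assumption, not taken from the abstract):
    range-scaled |difference| for numeric columns, 0/1 mismatch for
    categorical columns, averaged over all columns."""
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

def split_statistic(D, a, b):
    """Stand-in statistic: mean between-branch distance divided by the mean
    within-branch distance. Assumes at least one branch has two or more
    members and that within-branch distances are not all zero."""
    between = D[np.ix_(a, b)].mean()
    within = []
    for g in (a, b):
        if len(g) > 1:
            sub = D[np.ix_(g, g)]
            within.append(sub[np.triu_indices(len(g), k=1)])
    return between / np.concatenate(within).mean()

def node_permutation_test(D, a, b, n_perm=999, seed=0):
    """Test H0 'the two branches form a single cluster' by reshuffling
    branch membership among the pooled observations."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    pooled = np.concatenate([a, b])
    observed = split_statistic(D, a, b)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if split_statistic(D, perm[:len(a)], perm[len(a):]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (exceed + 1) / (n_perm + 1)

# Toy mixed data: two well-separated groups (synthetic, for illustration only)
data_rng = np.random.default_rng(1)
num = np.vstack([data_rng.normal(0, 1, (10, 2)), data_rng.normal(8, 1, (10, 2))])
cat = np.array([["a"]] * 10 + [["b"]] * 10)
D = gower_distance(num, cat)
observed, p_value = node_permutation_test(D, np.arange(10), np.arange(10, 20))
print(f"statistic = {observed:.3f}, p-value = {p_value:.3f}")
```

A small p-value is evidence against the single-cluster null, so the node would be split. Because the test touches only `D`, any dissimilarity suited to mixed features could replace the Gower-style one used here.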