Dendrogram slicing through a permutation test approach for mixed data / Palazzo, Lucio; Vistocco, Domenico; Palumbo, Francesco. - (2023). (Paper presented at DSSV-ECDA 2023, Joint conference of Data Science, Statistics & Visualisation and the European Conference on Data Analysis, held in Antwerp).

Dendrogram slicing through a permutation test approach for mixed data

Lucio Palazzo;Domenico Vistocco;Francesco Palumbo
2023

Abstract

DESPOTA (DEndrogram Slicing through a PermutatiOn Test Approach) is a clustering technique that cuts the branches of a hierarchical clustering tree at different heterogeneity levels to find the optimal partition among those the tree allows. At each node, DESPOTA performs a permutation test under the null hypothesis that the two descending branches support a single cluster. The number of clusters is chosen from these separate permutation tests, which take into account the minimal cost necessary for combining two branches and the cost associated with the merging process. DESPOTA is data-driven and does not rely on any distributional assumptions. Mixed data comprise both numeric and categorical features, and mixed datasets occur frequently in many domains, such as biology, education, and healthcare. They are often clustered to identify structure and to group similar individuals. However, mathematical operations are difficult to apply directly to mixed features, which makes clustering a hard task. This paper extends the applicability of DESPOTA to mixed-type data. To this end, the original agglomerative procedure is questioned and a divisive approach is proposed. The presented approach requires only the distance matrices and is therefore well suited to mixed data.
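
The idea of testing a split using only a distance matrix can be illustrated with a minimal Python sketch. It is not the DESPOTA procedure: the abstract does not give the test statistic, so the sketch substitutes a generic label-permutation test on the ratio of mean between-group to mean within-group distance. The function names (`gower_distance`, `split_statistic`, `permutation_pvalue`), the Gower-style dissimilarity, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gower_distance(num, cat):
    """Gower-style dissimilarity (assumed here, not taken from the paper):
    range-scaled absolute differences for numeric columns, simple mismatch
    for categorical columns, averaged over all columns."""
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

def split_statistic(dist, labels):
    """Stand-in statistic: mean between-group / mean within-group distance."""
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    between = dist[~same].mean()
    within = dist[same & off_diag].mean()
    return between / within

def permutation_pvalue(dist, labels, n_perm=999, seed=0):
    """P-value for H0: the two groups form a single cluster, obtained by
    randomly reassigning the group labels and recomputing the statistic."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = split_statistic(dist, labels)
    count = sum(
        split_statistic(dist, rng.permutation(labels)) >= observed
        for _ in range(n_perm)
    )
    return (count + 1) / (n_perm + 1)

# Toy mixed data: two numeric features and one categorical feature.
rng = np.random.default_rng(1)
num = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
cat = np.array([["a"]] * 10 + [["b"]] * 10)
labels = np.array([0] * 10 + [1] * 10)  # candidate split at one node

dist = gower_distance(num, cat)
p_value = permutation_pvalue(dist, labels)
print(f"p-value for the candidate split: {p_value:.3f}")
```

A small p-value argues against merging the two groups into one cluster. Because the test only reads `dist`, any dissimilarity suited to mixed data could be swapped in for the Gower-style one used here.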

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/961201