Synthetic Resampling Algorithm for Response Class Imbalance in Supervised Learning: Application to Road Accident Severity Prediction / Mauriello, Filomena; Aria, Massimo; Siciliano, Roberta; Galante, Francesco; Montella, Alfonso. - In: TRANSPORTATION RESEARCH PROCEDIA. - ISSN 2352-1465. - 95 (2026), pp. 504-511. (Euro Working Group on Transportation Annual Meeting 2025 - EWGT2025) [10.1016/j.trpro.2026.02.064].
Synthetic Resampling Algorithm for Response Class Imbalance in Supervised Learning: Application to Road Accident Severity Prediction
Mauriello, Filomena (first author); Aria, Massimo; Siciliano, Roberta; Galante, Francesco; Montella, Alfonso
2026
Abstract
Road traffic injuries are a leading cause of death worldwide and are projected to become even more critical by 2030. Understanding the factors influencing crash severity is essential for developing effective safety interventions. However, crash data often suffer from severe class imbalance, especially when distinguishing between fatal and non-fatal accidents. Traditional machine learning algorithms tend to perform poorly under these conditions, favoring the majority class and misclassifying critical minority cases. To address this, we propose a novel resampling algorithm, SONCA (Synthetic Over-sampling for Numerical and Categorical variables), designed to balance datasets containing mixed data types. Unlike existing oversampling methods, SONCA handles numerical, ordinal, nominal, and dichotomous variables. We evaluated SONCA using both parametric (Logit) and non-parametric (CART) models on an imbalanced dataset, PTW-ISTAT. The original models failed to detect the minority class effectively, while models estimated on SONCA-resampled data showed substantial improvements in True Positive Rate, G-mean, and F-measure. These results demonstrate SONCA’s potential as a flexible, model-agnostic preprocessing tool for addressing class imbalance in diverse real-world scenarios.
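
The three evaluation measures named in the abstract can be computed from a binary confusion matrix. The sketch below is illustrative only and is not taken from the paper: it assumes the common definitions (G-mean as the geometric mean of the True Positive Rate and the True Negative Rate, F-measure with equal weight on precision and recall), and the function name `imbalance_metrics` and the counts in the usage lines are hypothetical.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Compute imbalance-aware metrics from binary confusion-matrix counts.

    Assumed definitions (not confirmed by the source):
      TPR       = TP / (TP + FN)
      TNR       = TN / (TN + FP)
      G-mean    = sqrt(TPR * TNR)
      F-measure = 2 * precision * TPR / (precision + TPR)
    """
    total = tp + fn + fp + tn
    # Guard every ratio against an empty denominator.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(tpr * tnr)
    f_measure = (
        2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "tpr": tpr,
        "tnr": tnr,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


# Hypothetical counts: a model that always predicts the majority class.
print(imbalance_metrics(tp=0, fn=50, fp=0, tn=950))

# Hypothetical counts: a model that detects most minority cases.
print(imbalance_metrics(tp=40, fn=10, fp=100, tn=850))
```

In the first call accuracy is high even though no minority case is detected, which is why imbalance studies tend to report TPR, G-mean, and F-measure rather than accuracy alone.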

| File | Size | Format |
|---|---|---|
| Synthetic Resampling Algorithm - TransportationResearchProcedia-EWGT2025.pdf (open access; description: full article; type: publisher's version (PDF); license: Creative Commons) | 483.14 kB | Adobe PDF |

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


