Interpretable and robust tree-based methodology for imbalanced classification in driving safety assessment / Vannucci, G., D'Ambrosio, A., Siciliano, R.. - In: QUALITY AND QUANTITY. - ISSN 1573-7845. - (2026), pp. -1. [10.1007/s11135-026-02877-w]
Interpretable and robust tree-based methodology for imbalanced classification in driving safety assessment
Vannucci, Giulia; D'Ambrosio, Antonio; Siciliano, Roberta
2026
Abstract
Road safety is a crucial dimension of transport system reliability. This paper analyzes the large-scale SHRP2 (Second Strategic Highway Research Program) Naturalistic Driving Study database. The main goal is to identify causal relationships among external factors, human factors, and safety-critical driver behaviors at the level of specific driving situations. The paper presents a three-step interpretable machine learning framework that integrates: Multiple Correspondence Analysis, as an unsupervised exploration of latent risk profiles; Classification Trees for imbalanced outcomes, enhanced by the LIFT measure to explain specific risk rules and overcome the limitations of standard splitting criteria; and Random Forests combined with Global Sensitivity Analysis, which provides a more robust Variable Importance ranking than standard metrics. As the main result, we identify latent behavioral typologies and quantify how contextual and human factors influence the probability that a safety-critical driving event results in a Crash or Near-Crash. This transparent approach yields actionable insights into how individual and environmental factors interact to determine driving safety. Beyond road transport, the proposed approach can be generalized to other complex service contexts where risk monitoring and quality improvement rely on heterogeneous behavioral data.
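The abstract does not spell out how the LIFT measure is computed in the paper. As a minimal sketch, assuming the common definition of lift (the rate of the rare class among events covered by a rule, divided by its rate in the whole sample), the idea can be illustrated on toy data. The `Event` record, its field names, and the counts below are hypothetical and are not taken from SHRP2.

```python
from collections import namedtuple

# Hypothetical toy record; field names are illustrative, not SHRP2 variables.
Event = namedtuple("Event", ["lighting", "distracted", "crash_or_near_crash"])


def lift(events, rule):
    """Lift of a rule under the assumed definition:
    P(positive | rule holds) / P(positive in the whole sample).
    A value above 1 means the rule concentrates the rare outcome."""
    covered = [e for e in events if rule(e)]
    positives_total = sum(1 for e in events if e.crash_or_near_crash)
    positives_covered = sum(1 for e in covered if e.crash_or_near_crash)
    if not covered or positives_total == 0:
        raise ValueError("lift is undefined: empty rule or no positive events")
    # Cross-multiplied form of (positives_covered / len(covered))
    # divided by (positives_total / len(events)).
    return (positives_covered * len(events)) / (len(covered) * positives_total)


# Made-up imbalanced sample: 5 positives out of 100 events.
events = (
    [Event("dark", True, True)] * 3
    + [Event("dark", True, False)] * 7
    + [Event("day", False, True)] * 2
    + [Event("day", False, False)] * 88
)


def dark_and_distracted(e):
    """Example rule, standing in for a path to a tree leaf."""
    return e.lighting == "dark" and e.distracted


print(lift(events, dark_and_distracted))
```

In this toy sample the rule covers 10 events, 3 of them positive, against a base rate of 5 in 100, so the rule's positive rate is six times the overall rate. A leaf like this would still be labeled with the majority (negative) class by a standard tree, which is one reason a lift-style measure can be more informative for imbalanced outcomes.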


