Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models / Pianese, Alessandro; Cozzolino, Davide; Poggi, Giovanni; Verdoliva, Luisa. - (2024). (Paper presented at the ACM Workshop on Information Hiding and Multimedia Security) [10.1145/3658664.3659662].
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
2024
Abstract
Generalization is a major issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which ever more accurate synthesis methods are developed, it is important to design techniques that also work well on data they were not trained for. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection. To this end, the detection problem is reformulated as a speaker verification task, and fake audio is exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech samples are needed for training, severing any link with the generation method at the root. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and largely outperforming them on out-of-distribution data.
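The verification-based paradigm described in the abstract can be sketched as follows: embed the test utterance and a small reference set of genuine fragments of the claimed identity with a (pre-trained) speaker encoder, then flag the sample as fake when its similarity to the references falls below a threshold. This is a minimal illustrative sketch, not the authors' implementation: the embeddings here are toy random vectors standing in for features a large pre-trained model would produce, and the similarity measure (max cosine similarity) and threshold value are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    # standard cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_identity(test_emb, reference_embs, threshold=0.7):
    """Return (is_genuine, score), where score is the maximum cosine
    similarity of the test embedding to any reference fragment of the
    claimed identity. The threshold is a hypothetical working value."""
    score = max(cosine_similarity(test_emb, r) for r in reference_embs)
    return score >= threshold, score

# Toy stand-ins for embeddings a pre-trained speaker encoder would output.
rng = np.random.default_rng(0)
identity_center = rng.normal(size=256)          # "true voice" of the identity
references = [identity_center + 0.05 * rng.normal(size=256) for _ in range(5)]
genuine = identity_center + 0.05 * rng.normal(size=256)  # same voice, new clip
fake = rng.normal(size=256)                     # unrelated (synthesized) voice

print(verify_identity(genuine, references))  # high score -> accepted as genuine
print(verify_identity(fake, references))     # low score  -> flagged as fake
```

Because only genuine reference fragments are needed, the detector never sees a fake sample during setup, which is exactly what decouples it from any particular generation method.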