Govindaraju, Venu Siddapura; Marrone, Stefano; Sansone, Carlo. Multimodal Biometric Authentication Using Lip Motion and Spoken Passphrases. In: Workshops and Competitions Hosted by the 23rd International Conference on Image Analysis and Processing (ICIAP 2025), vol. 16170, pp. 545-555, 2025. doi: 10.1007/978-3-032-11381-8_45.
Multimodal Biometric Authentication Using Lip Motion and Spoken Passphrases
Govindaraju, Venu Siddapura; Marrone, Stefano; Sansone, Carlo
2025
Abstract
This work presents a dual-factor biometric authentication system developed for the BIOVID Challenge 2025, targeting open-set verification scenarios using synchronized audio-visual data. Each authentication attempt consists of an MP4 video containing a spoken passphrase and the corresponding lip motion. Our system employs a dual-stream deep learning model leveraging (i) a ResNet3D-18 network with a Bidirectional GRU to extract spatiotemporal visual features from lip motion and (ii) a fine-tuned ECAPA-TDNN (Emphasized Channel Attention, Propagation, and Aggregation Time Delay Neural Network) model to generate robust audio embeddings from speech. These modality-specific representations are fused using a Gated Multimodal Unit (GMU), which outputs a 256-dimensional joint embedding used both for classification and for embedding-based matching in identity verification. A composite loss function combining triplet loss (with semi-hard negative mining) and cross-entropy loss is used for training. The system is evaluated using 3-fold cross-validation, achieving an average accuracy of 71.36%, an equal error rate (EER) of 28.61%, an attack presentation classification error rate (APCER) of 28.68%, and a bona fide presentation classification error rate (BPCER) of 28.55%. To enable operation in an open-set decision-making scenario, we propose a threshold-based cosine-similarity reject option.
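
As a rough illustration of the fusion and open-set decision described in the abstract, the sketch below (in PyTorch) shows a Gated Multimodal Unit combining a visual and an audio embedding into a 256-dimensional joint representation, followed by a threshold-based cosine-similarity check against enrolled templates that rejects unknown identities. This is a minimal sketch, not the authors' implementation: the input dimensions (512 for the visual stream, 192 for ECAPA-TDNN), the module names, and the threshold value of 0.6 are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code): a Gated Multimodal Unit (GMU) fusing
# precomputed visual and audio embeddings, plus a cosine-similarity reject rule
# for open-set verification. Dimensions and the threshold are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultimodalUnit(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=192, fused_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, fused_dim)        # h_v = tanh(W_v x_v)
        self.audio_proj = nn.Linear(audio_dim, fused_dim)          # h_a = tanh(W_a x_a)
        self.gate = nn.Linear(visual_dim + audio_dim, fused_dim)   # z = sigmoid(W_z [x_v; x_a])

    def forward(self, visual_emb, audio_emb):
        h_v = torch.tanh(self.visual_proj(visual_emb))
        h_a = torch.tanh(self.audio_proj(audio_emb))
        z = torch.sigmoid(self.gate(torch.cat([visual_emb, audio_emb], dim=-1)))
        return z * h_v + (1.0 - z) * h_a                           # 256-d joint embedding

def verify(probe_emb, enrolled_embs, threshold=0.6):
    """Open-set decision: accept the best-matching enrolled identity only if its
    cosine similarity exceeds the threshold; otherwise reject as unknown."""
    sims = F.cosine_similarity(probe_emb.unsqueeze(0), enrolled_embs, dim=-1)
    best_sim, best_id = sims.max(dim=0)
    if best_sim >= threshold:
        return best_id.item(), best_sim.item()
    return None, best_sim.item()

# Toy usage with random tensors standing in for the two network outputs.
gmu = GatedMultimodalUnit()
visual_emb, audio_emb = torch.randn(512), torch.randn(192)
probe = gmu(visual_emb, audio_emb)
gallery = torch.randn(5, 256)  # 256-d templates of five enrolled identities
print(verify(probe, gallery))
```

In practice the rejection threshold would be calibrated on validation scores (for instance at the EER operating point) rather than fixed a priori, but the decision rule itself is the one sketched above.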


