
Multimodal Biometric Authentication Using Lip Motion and Spoken Passphrases / Govindaraju, Venu Siddapura; Marrone, Stefano; Sansone, Carlo. - In: Workshops and Competitions Hosted by the 23rd International Conference on Image Analysis and Processing (ICIAP 2025), Italy, 2025, vol. 16170, pp. 545-555. DOI: 10.1007/978-3-032-11381-8_45.

Multimodal Biometric Authentication Using Lip Motion and Spoken Passphrases

Govindaraju, Venu Siddapura; Marrone, Stefano; Sansone, Carlo
2025

Abstract

This work presents a dual-factor biometric authentication system developed for the BIOVID Challenge 2025, targeting open-set verification scenarios using synchronized audio-visual data. Each authentication attempt consists of an MP4 video containing a spoken passphrase and the corresponding lip motion. Our system employs a dual-stream deep learning model leveraging (i) a ResNet3D-18 network with a Bidirectional GRU to extract spatiotemporal visual features from lip motion and (ii) a fine-tuned ECAPA-TDNN (Emphasized Channel Attention, Propagation, and Aggregation Time Delay Neural Network) model to generate robust audio embeddings from speech. These modality-specific representations are fused by a Gated Multimodal Unit (GMU), which outputs a 256-dimensional joint embedding used both for classification and for embedding-based identity matching. Training uses a composite loss combining triplet loss (with semi-hard negative mining) and cross-entropy loss. Evaluated with 3-fold cross-validation, the system achieves an average accuracy of 71.36%, an EER of 28.61%, an APCER of 28.68%, and a BPCER of 28.55%. To enable operation in an open-set scenario, we propose a threshold-based cosine-similarity reject option.
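The GMU fusion and the open-set reject option described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: the elementwise gate is a simplification of the full GMU (which uses weight matrices and a tanh on each stream), and the gate weights, embedding dimensions, and threshold value are all illustrative assumptions — only the 256-dimensional joint embedding and the cosine-threshold reject option are stated in the abstract.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gmu_fuse(h_v, h_a, gate_w_v, gate_w_a, gate_b=0.0):
    """Elementwise simplification of a Gated Multimodal Unit:
    a learned gate z in (0, 1) decides, per dimension, how much of
    the visual vs. audio feature enters the joint embedding:
        z_i = sigmoid(w_v * h_v[i] + w_a * h_a[i] + b)
        h_i = z_i * h_v[i] + (1 - z_i) * h_a[i]
    """
    fused = []
    for v, a in zip(h_v, h_a):
        z = sigmoid(gate_w_v * v + gate_w_a * a + gate_b)
        fused.append(z * v + (1.0 - z) * a)
    return fused

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verify_open_set(probe, gallery, threshold=0.5):
    """Threshold-based reject option for open-set verification:
    accept only if the probe's best cosine match against the
    enrolled (gallery) embeddings clears the threshold; otherwise
    reject the attempt as an unknown identity."""
    best = max(cosine_similarity(probe, g) for g in gallery)
    return ("accept" if best >= threshold else "reject"), best
```

In deployment the threshold would be calibrated on held-out genuine/impostor score distributions (e.g., at the EER operating point) rather than fixed a priori.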
ISBN: 9783032113801; 9783032113818
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/1039183