Identifying Cardiological Disorders in Spanish via Data Augmentation and Fine-Tuned Language Models

Romano, A.; Riccio, G.; Postiglione, M.; Moscato, V.

This study presents a novel approach to Biomedical Named Entity Recognition (BioNER), specifically tailored for the cardiology domain. The challenge of adapting models to specific fields is addressed by combining cross-domain transfer learning with data augmentation. The process begins with fine-tuning a compact Biomedical Transformer model on the DisTEMIST corpus, enabling it to capture general biomedical concepts. This model is then further trained on the CardioCCC corpus, a cardiology-specific dataset, enhancing its ability to identify and interpret cardiological entities. A data augmentation strategy is then employed, leveraging Context Similarity and K-Nearest Neighbors (KNN) to generate augmented datasets, which further improves the model's ability to recognize medical entities. The final step is a NER Fusion strategy, which combines outputs from multiple BioNER taggers to bolster robustness and accuracy in entity recognition. Experimental results from the MultiCardioNER challenge demonstrate the effectiveness of the proposed approach. Our framework surpasses the median F1 score of 0.7566 by approximately 4%, achieving a score of 0.791, which is only 2% lower than the top submission, despite being based on much smaller language models.
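The abstract does not say how the NER Fusion step combines the taggers' outputs. As a minimal sketch only, the Python below assumes each tagger returns entity spans as (start, end, label) tuples and fuses them by span-level voting. The function name `fuse_spans`, the `min_votes` parameter, and the example label are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed span format: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def fuse_spans(tagger_outputs: Iterable[Iterable[Span]], min_votes: int = 1) -> List[Span]:
    """Keep every span predicted by at least `min_votes` taggers.

    min_votes=1 is a plain union (favours recall); raising it moves
    towards majority voting (favours precision). This is an illustrative
    scheme, not necessarily the fusion rule used by the authors.
    """
    votes: Counter = Counter()
    for spans in tagger_outputs:
        # A tagger votes at most once per identical span.
        votes.update(set(spans))
    kept = [span for span, count in votes.items() if count >= min_votes]
    return sorted(kept)


if __name__ == "__main__":
    # Hypothetical outputs of two taggers on the same document.
    tagger_a = [(0, 8, "ENFERMEDAD")]
    tagger_b = [(0, 8, "ENFERMEDAD"), (20, 30, "ENFERMEDAD")]
    print(fuse_spans([tagger_a, tagger_b], min_votes=1))
    print(fuse_spans([tagger_a, tagger_b], min_votes=2))
```

With `min_votes=1` the sketch keeps both spans; with `min_votes=2` it keeps only the span both taggers agree on. Overlapping but non-identical spans are treated as distinct here, which a real fusion strategy would likely need to resolve.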

Identifying Cardiological Disorders in Spanish via Data Augmentation and Fine-Tuned Language Models / Romano, A., Riccio, G., Postiglione, M., Moscato, V. - 3740:(2024), pp. 207-222. (25th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2024).