Bioinformatics and nutrigenetics pose complex analytical challenges, especially when analyzing the vast amounts of genomic data on human genetic polymorphisms. Interpreting the health implications of interactions between genotype and environmental factors requires refined data-driven approaches. To this end, this study applies topic modeling techniques to a dataset of 37,042 genomic literature abstracts to dissect the functional implications of genetic polymorphisms for food tolerances, allergies, diet-induced oxidative stress, and xenobiotic metabolism. Our methodology identified 50 topics, 21 of which are relevant to nutrigenetics. We examined the model structure and inter-topic relationships using hierarchical clustering and similarity matrices. This streamlined approach facilitated the identification and selection of key topics, highlighting synergistic interactions and critical overlaps essential for further analyses. The final dataset comprises 10,238 papers together with variation data for 3,911 genes. Our approach offers a novel avenue for organizing genetic knowledge pertinent to personalized dietary therapy and preventive healthcare, showcasing the potential of machine learning in genomic data analysis. The code implemented for this study is openly available at https://github.com/johndef64/grpm_bertopic.
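The inter-topic analysis mentioned in the abstract (a similarity matrix followed by hierarchical clustering) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation, which is available in the linked repository; the topic labels, the three-dimensional vectors, the helper names `cosine_similarity`, `similarity_matrix`, and `average_linkage`, and the choice of average linkage are all assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Square matrix of pairwise cosine similarities."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


def average_linkage(sim, n_clusters):
    """Agglomerative clustering on a similarity matrix.

    Starts with one cluster per item and repeatedly merges the two
    clusters with the highest mean pairwise similarity until only
    n_clusters remain.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [sim[i][j] for i in clusters[a] for j in clusters[b]]
                score = sum(pairs) / len(pairs)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy topic vectors (not taken from the study's model).
topic_vectors = {
    "lactose intolerance": [0.9, 0.1, 0.0],
    "gluten sensitivity": [0.8, 0.2, 0.1],
    "oxidative stress": [0.1, 0.9, 0.2],
    "xenobiotic metabolism": [0.0, 0.3, 0.9],
}

names = list(topic_vectors)
sim = similarity_matrix([topic_vectors[n] for n in names])
groups = average_linkage(sim, n_clusters=2)

for group in groups:
    print([names[i] for i in group])
```

In a real workflow the vectors would come from the fitted topic model (for example, topic embeddings or term-weight vectors) rather than hand-written values, and a library routine would normally replace the quadratic merge loop shown here.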
Advanced Topic Modeling in Genomics: Towards Personalized Dietary Recommendations Through BERTopic Analysis / De Filippis, G. M.; Rinaldi, A. M.; Russo, C.; Tommasino, C. - 15343 (2025), pp. 3-17. (26th International Conference on Information Integration and Web Intelligence, iiWAS 2024, svk 2024) [10.1007/978-3-031-78093-6_1].
Advanced Topic Modeling in Genomics: Towards Personalized Dietary Recommendations Through BERTopic Analysis
De Filippis G. M.; Rinaldi A. M.; Russo C.; Tommasino C.
2025


