LAPSE:2025.0524
Published Article

Multi-Omics biological embeddings for ML-models
June 27, 2025
Abstract
Machine learning algorithms have led to the development of numerous vector embeddings for biological entities such as metabolites, proteins, genes, and enzymes. However, these embeddings often lack contextual information because each is specialized to a single omics type. Disease progression and biosynthesis pathways are increasingly understood through complex, multi-layered networks that integrate diverse omics data and intricate signaling and reaction sequences. Capturing these relationships in a meaningful way requires embeddings that account for both functional and multi-modal dependencies. We propose an embedding approach that unifies these different biological modalities by treating them as directions in a shared space rather than as isolated data types. Similar to how word embeddings in natural language processing reveal meaningful relationships (e.g., Tokyo - Japan + UK = London, indicating a directional representation of capitals), we can model genes and proteins in a way that captures their inherent connections. A gene implies information about the protein it encodes, and vice versa, forming a structured and interpretable representation of biological pathways. Our model, inspired by NLP techniques, breaks down pathway sequences into contextual pairs spanning different omics types. By placing adjacent pathway steps close together in the embedding space, the embeddings reflect biologically relevant relationships, enhancing their interpretability and utility. Because these embeddings are generated from pathway sequences, they can be applied to optimize reaction pathways, aiding retrosynthesis in microbiomes, drug development, and even human health interventions.
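The two ideas in the abstract, splitting pathway sequences into cross-omics context pairs and reading relationships as directions in a shared space, can be illustrated with a minimal sketch. This is not the authors' implementation: the window size, the `cross_omics_only` filter, the entity names (`gene_A`, `protein_A`, and so on), and the hand-picked 2-D vectors are all assumptions chosen only to make the example small and runnable.

```python
import math

def context_pairs(pathway, window=1, cross_omics_only=True):
    """Build skip-gram-style (center, context) pairs from one pathway.

    pathway: list of (entity_id, omics_type) tuples in pathway order.
    With cross_omics_only=True, pairs within the same omics type are dropped.
    """
    pairs = []
    for i, (center, center_type) in enumerate(pathway):
        start = max(0, i - window)
        stop = min(len(pathway), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue
            context, context_type = pathway[j]
            if cross_omics_only and context_type == center_type:
                continue
            pairs.append((center, context))
    return pairs

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the entity closest to v[b] - v[a] + v[c], excluding a, b, c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [k for k in vectors if k not in (a, b, c)]
    return max(candidates, key=lambda k: cosine(query, vectors[k]))

# Hypothetical toy pathway: a gene, the protein it encodes, two metabolites.
toy_pathway = [
    ("gene_A", "gene"),
    ("protein_A", "protein"),
    ("metabolite_X", "metabolite"),
    ("metabolite_Y", "metabolite"),
]
pairs = context_pairs(toy_pathway, window=1)

# Made-up 2-D vectors in which "gene -> protein" is the direction (0, 1).
toy_vectors = {
    "gene_A": [1.0, 0.0],
    "protein_A": [1.0, 1.0],
    "gene_B": [3.0, 0.0],
    "protein_B": [3.0, 1.0],
    "metabolite_X": [0.0, 2.0],
}
nearest = analogy(toy_vectors, "gene_A", "protein_A", "gene_B")
```

In this toy setup the metabolite-to-metabolite pairs are filtered out, and `nearest` resolves to `protein_B`, mirroring the Tokyo - Japan + UK = London analogy. In practice the pairs would feed a trained embedding model rather than hand-written vectors.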
Record ID
Keywords
Biological Pathways, Biosynthesis, Chemical fingerprints, Drug Discovery, multi-omics
Subject
Suggested Citation
Otte LB, Hogstrand C, Mardinoglu A, Guo M. Multi-Omics biological embeddings for ML-models. Systems and Control Transactions 4:2316-2321 (2025) https://doi.org/10.69997/sct.136974
Author Affiliations
Otte LB: King's College London, Department of Engineering, London, UK
Hogstrand C: King's College London, Department of Analytical, Environmental and Forensic Sciences, London, UK
Mardinoglu A: Centre for Host-Microbiome Interactions, Faculty of Dentistry, Oral & Craniofacial Sciences, King's College London, London, SE1 9RT, United Kingdom; Science for Life Laboratory, KTH - Royal Institute of Technology, Stockholm, Sweden
Guo M: King's College London, Department of Engineering, London, UK
Journal Name
Systems and Control Transactions
Volume
4
First Page
2316
Last Page
2321
Year
2025
Publication Date
2025-07-01
Version Comments
Original Submission
Other Meta
PII: 2316-2321-1210-SCT-4-2025, Publication Type: Journal Article
Version History
[v1] (Original Submission)
Jun 27, 2025
Verified by curator on
Jun 27, 2025
This Version Number
v1
URL Here
http://psecommunity.org/LAPSE:2025.0524
Record Owner
PSE Press
Links to Related Works
References Cited
- Ron Caspi, Tomer Altman, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research, 42(D1):D459-D471, 2014 https://doi.org/10.1093/nar/gkt1103
- Manish Kumar. A beginner's guide for understanding extended-connectivity fingerprints (ECFPs), Mar 2021
- Xinmeng Liao, Mehmet Ozcan, et al. Open MoA: revealing the mechanism of action (MoA) based on network topology and hierarchy. Bioinformatics, 39(11):btad666, 2023 https://doi.org/10.1093/bioinformatics/btad666
- Javier Lopez-Ibáñez, Florencio Pazos, and Monica Chagoyen. Predicting biological pathways of chemical compounds with a profile-inspired approach. BMC bioinformatics, 22(1):320, 2021 https://doi.org/10.1186/s12859-021-04252-y
- Tomas Mikolov, Ilya Sutskever, Kai Chen, et al. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013
- Kumar Saurabh Singh, Justin JJ van der Hooft, et al. Integrative omics approaches for biosynthetic pathway discovery in plants. Natural Product Reports, 39(9):1876-1896, 2022 https://doi.org/10.1039/D2NP00032F
- Raman, K. and Chandra, N., 2009. Flux balance analysis of biological systems: applications and challenges. Briefings in bioinformatics, 10(4), pp.435-449 https://doi.org/10.1093/bib/bbp011
- Honda S, Shi S, Ueda HR. SMILES Transformer: Pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738. 2019
- Durant, J. L., Leland, B. A., Henry, D. R., & Nourse, J. G. (2002). Reoptimization of MDL keys for use in drug discovery. Journal of Chemical Information and Computer Sciences, 42(6), 1273-1280 https://doi.org/10.1021/ci010132r
- Subramanian I, Verma S, et al. Multi-omics data integration, interpretation, and its application. Bioinformatics and Biology Insights. 2020 Jan;14:1177932219899051 https://doi.org/10.1177/1177932219899051
- Muller, E., Shiryan, I. and Borenstein, E., 2024. Multi-omic integration of microbiome data for identifying disease-associated modules. Nature Communications, 15(1), p.2621 https://doi.org/10.1038/s41467-024-46888-3
- Hussain, M.H., Mohsin, et al., 2022. Multiscale engineering of microbial cell factories: A step forward towards sustainable natural products industry. Synthetic and systems biotechnology, 7(1), pp.586-601 https://doi.org/10.1016/j.synbio.2021.12.012

