LAPSE:2026.0315
Published Article

Chemical Language Transformers for the Inverse Design of Novel Surfactants
June 12, 2026
Abstract
Rapid, sustainable redesign of large functional molecules demands efficient exploration of vast chemical spaces. Chemical language models (CLMs), especially transformers, can learn long-range structure-property relationships and enable fast candidate generation after training. However, inverse molecular design is ill-posed, since many structures can meet the same target, and conditioned generation often decodes to invalid or off-spec molecules. To address this challenge, we propose a CLM-based inverse design framework that optimises latent representations toward target properties and explicitly evaluates round-trip fidelity, i.e., whether candidates remain on-target after decoding and forward re-evaluation. To improve reliability, we introduce post-decoding beam re-ranking using round-trip consistency and a predictor-guided minimal-edit repair step that corrects invalid near-misses while preserving closeness to the target property. We demonstrate the approach on surfactant critical micelle concentration (CMC) design, benchmarking existing large pretrained CLMs against our lightweight domain-trained CLM. The framework produces a high proportion of valid and diverse molecules (~90%) while maintaining target property error near 1%. Moreover, atom-level saliency analysis confirms that the generated structures follow established surfactant design rules, supporting interpretable structure-property control. Overall, the framework provides an efficient and broadly applicable solution to reliable inverse design of novel functional molecules.
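The post-decoding beam re-ranking mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rerank_by_round_trip` and the helpers `toy_is_valid` and `toy_predict` are hypothetical stand-ins, and the abstract does not specify the scoring rule. The sketch assumes that round-trip consistency is measured as the absolute gap between the forward-predicted property of a decoded candidate and the target, and that invalid strings are simply dropped before ranking.

```python
from typing import Callable, List, Tuple


def rerank_by_round_trip(
    candidates: List[str],
    target: float,
    is_valid: Callable[[str], bool],
    predict: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Rank decoded beam candidates by |predict(candidate) - target|.

    Invalid candidates are discarded and duplicates are scored once.
    The scoring rule is an assumption made for illustration.
    """
    seen = set()
    scored = []
    for smiles in candidates:
        if smiles in seen or not is_valid(smiles):
            continue
        seen.add(smiles)
        scored.append((smiles, abs(predict(smiles) - target)))
    # Smallest round-trip error first.
    return sorted(scored, key=lambda pair: pair[1])


def toy_is_valid(smiles: str) -> bool:
    """Toy validity check (balanced parentheses only), NOT real chemistry."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smiles) > 0


def toy_predict(smiles: str) -> float:
    """Toy property model: the value falls by 0.5 per 'C' character.

    A placeholder for a trained forward property predictor.
    """
    return -0.5 * smiles.count("C")


if __name__ == "__main__":
    beam = ["CCCC", "CCCCCC(", "CCCCCCCC", "CCCCCCO", "CCCC"]
    for smiles, error in rerank_by_round_trip(beam, -3.2, toy_is_valid, toy_predict):
        print(smiles, round(error, 3))
```

In practice the validity check would be a proper chemical parser and the predictor a trained model; passing both in as callables keeps the ranking step independent of those choices.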
Keywords
chemical language models, interpretable AI, inverse molecular design, surfactants, transformers
Suggested Citation
Rogers AW, Zillmer R, Lane A, Kowalski A, Zhang D. Chemical Language Transformers for the Inverse Design of Novel Surfactants. Systems and Control Transactions 5:903-909 (2026) https://doi.org/10.69997/sct.161720
Author Affiliations
Rogers AW: The University of Manchester, Department of Chemical Engineering, Manchester, UK
Zillmer R: Unilever, R&D Port Sunlight, Liverpool, UK
Lane A: Unilever, R&D Port Sunlight, Liverpool, UK
Kowalski A: Unilever, R&D Port Sunlight, Liverpool, UK
Zhang D: The University of Manchester, Department of Chemical Engineering, Manchester, UK; Unilever, R&D Port Sunlight, Liverpool, UK
Journal Name
Systems and Control Transactions
Volume
5
First Page
903
Last Page
909
Year
2026
Publication Date
2026-06-12
Version Comments
Original Submission
Other Meta
PII: 0903-0909-55-SCT-5-2026, Publication Type: Journal Article
Version History
[v1] (Original Submission)
Jun 12, 2026
Verified by curator on
Jun 12, 2026
This Version Number
v1
URL Here
http://psecommunity.org/LAPSE:2026.0315
Record Owner
PSE Press

