LAPSE:2026.0510
Published Article

Enhancing Control in Chemical Processes using Reinforcement from Human Feedback
June 12, 2026
Abstract
Reinforcement learning (RL) presents a promising alternative to model-based advanced control schemes, such as model predictive control (MPC), whose application can be limited by highly complex system models. However, incorporating constraints in RL remains challenging, and formulating a suitable optimization objective is not straightforward. Reinforcement learning from human feedback (RLHF) offers an approach to derive the RL reward function from human expert preferences, enabling the incorporation of process knowledge. In this work, we present the application of RLHF to fine-tune an approximate MPC controller with suboptimal performance. We demonstrate that combining conventional reward formulations with RLHF, along with varying trajectory segment lengths for collecting human feedback, improves the control methodology for a batch bioreactor by enhancing safety and accounting for long-term effects. Furthermore, direct preference-based policy optimization (DPPO) represents a promising alternative for directly fine-tuning learning-based controllers while circumventing explicit reward model design.
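The abstract describes deriving the RL reward function from expert preferences over trajectory segments of varying length. The sketch below illustrates one common way this is done in the RLHF literature (a Bradley-Terry preference model over summed segment rewards, as in Christiano et al.); it is not taken from the article. The linear reward model, the feature layout (one row per time step), and all function names are assumptions made for illustration only.

```python
import numpy as np

def segment_return(weights, segment):
    """Sum of predicted rewards over one segment.

    Assumption: a hypothetical linear reward model r = w^T x, where each
    row of `segment` is the feature vector x of one time step.
    Segments may have different lengths (different numbers of rows).
    """
    return float(np.sum(segment @ weights))

def preference_probability(weights, seg_a, seg_b):
    """Bradley-Terry probability that segment A is preferred over segment B."""
    diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
    return 1.0 / (1.0 + np.exp(-diff))

def preference_loss(weights, seg_a, seg_b, label):
    """Cross-entropy loss for one labeled pair.

    label = 1.0 if the expert preferred A, 0.0 if the expert preferred B.
    """
    p = preference_probability(weights, seg_a, seg_b)
    eps = 1e-12  # guards against log(0)
    return -(label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps))

# Illustrative data only: two segments of different length, two features.
w = np.array([1.0, 0.0])
seg_a = np.ones((3, 2))
seg_b = np.zeros((5, 2))
p_a = preference_probability(w, seg_a, seg_b)
```

Minimizing this loss over many labeled pairs fits the reward weights to the expert's choices; the fitted reward could then be combined with a conventional reward term before policy training. Because segment rewards are summed, longer segments carry larger return magnitudes, which is one reason the choice of segment length matters.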
Record ID
LAPSE:2026.0510
Keywords
Human feedback, Model predictive control, Reinforcement learning
Subject
Suggested Citation
H G, D B, S L. Enhancing Control in Chemical Processes using Reinforcement from Human Feedback. Systems and Control Transactions 5:2457-2465 (2026) https://doi.org/10.69997/sct.156501
Author Affiliations
H G: Technische Universität Dortmund, Laboratory of Process Automation Systems, Emil-Figge-Straße 70, Dortmund 44227, Germany
D B: Technische Universität Dortmund, Laboratory of Process Automation Systems, Emil-Figge-Straße 70, Dortmund 44227, Germany
S L: Technische Universität Dortmund, Laboratory of Process Automation Systems, Emil-Figge-Straße 70, Dortmund 44227, Germany
Journal Name
Systems and Control Transactions
Volume
5
First Page
2457
Last Page
2465
Year
2026
Publication Date
2026-06-12
Version Comments
Original Submission
Other Meta
PII: 2457-2465-231-SCT-5-2026, Publication Type: Journal Article
Version History
[v1] (Original Submission)
Jun 12, 2026
Verified by curator on
Jun 12, 2026
This Version Number
v1
Citations
Most Recent
This Version
URL Here
https://psecommunity.org/LAPSE:2026.0510
Record Owner
PSE Press

