Proceedings of ESCAPE 36
ISSN: 2818-4734
Volume: 5 (2026)
LAPSE:2026.0393
Published Article
A Multimodal Framework Integrating Procedural Texts and Visual Perception for Laboratory Safety Monitoring
Shuo Xu, Jinsong Zhao
June 12, 2026
Abstract
Laboratory safety is procedure-dependent: required personal protective equipment (PPE) and permissible actions vary across experiments and across experimental steps, yet most vision-based monitoring remains appearance-driven and often produces generic warnings without reliable procedural context. We propose a multimodal framework for step-aware safety monitoring in laboratory videos. The framework first localizes procedural context through clip-level step prediction and protocol alignment to identify the experiment and current step. Given this context, it retrieves step-specific safety constraints, extracts evidence of step-relevant equipment and interactions using an equipment database, and prompts a video-capable vision-language model (VLM) to generate structured (JSON) monitoring reports supported by retrieved constraints and visual evidence. Experiments on protocol-annotated molecular biology lab videos show that our approach improves the mean score from 0.4352 to 0.6430 and reduces the missing rate from 65.00% to 33.75% relative to a video-only baseline, demonstrating more faithful and step-specific safety judgments.
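The flow described in the abstract (step prediction, protocol alignment, constraint retrieval, evidence extraction, structured report) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the `PROTOCOLS` table, the PPE constraints, the token-overlap alignment, and the function names `align_step`, `build_prompt`, and `draft_report` are hypothetical. No VLM is called; the rule-based `draft_report` only shows the kind of JSON structure such a model might be asked to return.

```python
import json

# Hypothetical protocol database: experiment -> ordered steps with safety constraints.
# The experiments, steps, and constraints are invented for this sketch.
PROTOCOLS = {
    "dna_extraction": [
        {"step": "add lysis buffer", "constraints": ["gloves", "lab coat"]},
        {"step": "centrifuge sample",
         "constraints": ["gloves", "lab coat", "closed rotor lid"]},
    ],
    "gel_electrophoresis": [
        {"step": "load gel wells", "constraints": ["gloves", "safety goggles"]},
    ],
}


def tokens(text):
    """Lowercase whitespace tokenization used for the overlap score."""
    return set(text.lower().split())


def align_step(predicted_label, protocols):
    """Return (experiment, step_index) with the highest Jaccard token overlap.

    Returns (None, -1) when no protocol step shares a token with the prediction.
    """
    best = (None, -1, 0.0)
    query = tokens(predicted_label)
    for experiment, steps in protocols.items():
        for index, entry in enumerate(steps):
            candidate = tokens(entry["step"])
            union = query | candidate
            score = len(query & candidate) / len(union) if union else 0.0
            if score > best[2]:
                best = (experiment, index, score)
    return best[0], best[1]


def build_prompt(experiment, entry, evidence):
    """Assemble the text that would accompany the video clip in a VLM request."""
    return (
        f"Experiment: {experiment}\n"
        f"Current step: {entry['step']}\n"
        f"Required safety constraints: {', '.join(entry['constraints'])}\n"
        f"Detected equipment and interactions: {', '.join(sorted(evidence))}\n"
        "Return a JSON report with keys: step, observed, missing, compliant."
    )


def draft_report(experiment, step_index, protocols, evidence):
    """Rule-based stand-in for the model output: constraints minus observed evidence."""
    entry = protocols[experiment][step_index]
    missing = sorted(set(entry["constraints"]) - set(evidence))
    return {
        "experiment": experiment,
        "step": entry["step"],
        "constraints": entry["constraints"],
        "observed": sorted(evidence),
        "missing": missing,
        "compliant": not missing,
    }


# Example: a clip-level prediction plus evidence from a (hypothetical) detector.
experiment, step_index = align_step("centrifuge the sample", PROTOCOLS)
evidence = {"gloves", "lab coat"}
prompt = build_prompt(experiment, PROTOCOLS[experiment][step_index], evidence)
report = draft_report(experiment, step_index, PROTOCOLS, evidence)
print(json.dumps(report, indent=2))
```

In this sketch the same visual evidence can be compliant for one step and non-compliant for another, because the constraints are looked up per step. That is the step-aware behavior the abstract contrasts with appearance-driven monitoring.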
Keywords
Artificial Intelligence, Laboratory Safety Monitoring, Vision-Language Model
Suggested Citation
Xu S, Zhao J. A Multimodal Framework Integrating Procedural Texts and Visual Perception for Laboratory Safety Monitoring. Systems and Control Transactions 5:1503-1512 (2026) https://doi.org/10.69997/sct.104078
Author Affiliations
Xu S: Tsinghua University, Department of Chemical Engineering, Beijing, China. State Key Laboratory of Chemical Engineering and Low-Carbon Technology, Tsinghua University
Zhao J: Tsinghua University, Department of Chemical Engineering, Beijing, China. State Key Laboratory of Chemical Engineering and Low-Carbon Technology, Tsinghua University
Journal Name: Systems and Control Transactions
Volume: 5
First Page: 1503
Last Page: 1512
Year: 2026
Publication Date: 2026-06-12
Version Comments: Original Submission
Other Meta: PII: 1503-1512-12-SCT-5-2026, Publication Type: Journal Article
License: CC BY-SA 4.0
Version History: [v1] (Original Submission), Jun 12, 2026
Verified by curator on: Jun 12, 2026
This Version Number: v1
Record URL: https://psecommunity.org/LAPSE:2026.0393
Record Owner: PSE Press