Optimization Pathways for Transformer-Based Multimodal Pre-training Models in Open-Vocabulary Tasks
Abstract
closed-set assumption constraints, serving as an important research topic in the multimodal field. The Transformer architecture offers strong feature modeling capabilities and serves as the backbone of multimodal pre-training models. In open-vocabulary settings, however, these models still suffer from insufficient cross-modal alignment and limited generalization. This study summarizes the main optimization directions for Transformer-based multimodal pre-training models in open-vocabulary tasks. It systematically analyzes the basic frameworks and application logic of the optimization pathways from three aspects: pre-training strategies, modal fusion methods, and fine-tuning methods. It also summarizes current research challenges and development trends, providing a reference for related research.
DOI: http://dx.doi.org/10.70711/aitr.v3i12.9461