Optimization Pathways for Transformer-Based Multimodal Pre-training Models in Open-Vocabulary Tasks

Haimin Zhai

doi:10.70711/aitr.v3i12.9461

Abstract

Open-vocabulary tasks require models to identify categories that do not appear in the training set, going beyond the traditional closed-set assumption, and are an important research topic in multimodal learning. The Transformer architecture has strong feature modeling capabilities and serves as the backbone of multimodal pre-training models. However, these models still face problems in open-vocabulary settings, such as insufficient modal alignment and limited generalization. This study summarizes the main optimization directions of Transformer-based multimodal pre-training models in open-vocabulary tasks, systematically analyzes the basic frameworks and application logic of the optimization pathways from three aspects (pre-training strategies, modal fusion methods, and fine-tuning methods), and outlines current research challenges and development trends, providing a reference for related research.

Keywords

Transformer; Multimodal Pre-training; Open-Vocabulary Task; Modal Alignment; Model Optimization
