
Research on Multi-Modal Retrieval System of E-Commerce Platform Based on Pre-Training Model

Bingbing Zhang, Yi Han, Xiaofei Han

Abstract


In this paper, a multi-modal retrieval system for e-commerce platforms is proposed that integrates three advanced pre-training models: BLIP, CLIP, and CLIP Interrogator. The system addresses the limitations of traditional keyword-based product search by enabling more accurate and efficient image-text matching. We trained and evaluated our approach on 413,000 image-text pairs from the Google Conceptual Captions dataset. Our method introduces a novel feature fusion mechanism and combines the strengths of several pre-trained models to achieve comprehensive visual-semantic understanding. The system shows strong performance in both everyday commercial scenarios and complex artistic product descriptions. Experimental results show that the proposed method can effectively generate detailed, context-aware descriptions and accurately match user queries to product images. The system's adaptability and semantic understanding make it particularly valuable for improving the user experience of e-commerce applications. This research contributes to the development of intelligent shopping platforms by bridging the gap between textual queries and visual content. In particular, the integration of the CLIP model significantly enhances the retrieval system's understanding of user intent and product semantics, making product recommendations more accurate and the search process more targeted.
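
The abstract does not include implementation details. The sketch below illustrates, under stated assumptions, how the two stages described above can be combined with off-the-shelf checkpoints: CLIP to score a free-text query against candidate product images, and BLIP to generate a description of the best-matching product. The checkpoint names (openai/clip-vit-base-patch32, Salesforce/blip-image-captioning-base), file paths, and example query are illustrative assumptions, not the authors' configuration, and the paper's feature fusion mechanism is not reproduced here.

# Minimal sketch of CLIP-based text-to-image product matching plus BLIP captioning.
# Checkpoints, image paths, and the query are illustrative assumptions.
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    BlipForConditionalGeneration, BlipProcessor,
)

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

# Candidate product images (illustrative paths) and a free-text user query.
product_images = [Image.open(p).convert("RGB") for p in ("bag.jpg", "shoe.jpg")]
query = "red leather handbag with gold buckle"

# CLIP scores the query against every product image in a shared embedding space.
inputs = clip_proc(text=[query], images=product_images, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = clip_model(**inputs).logits_per_text  # shape: (1, num_images)
best_idx = int(logits.argmax(dim=-1))

# BLIP generates a natural-language description of the best-matching product.
blip_inputs = blip_proc(images=product_images[best_idx], return_tensors="pt")
with torch.no_grad():
    caption_ids = blip_model.generate(**blip_inputs, max_new_tokens=30)
caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)

print(f"best match: image {best_idx}, caption: {caption}")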

Keywords


Multi-modal Retrieval; E-commerce; CLIP; BLIP; Image-text Matching




DOI: http://dx.doi.org/10.70711/aitr.v2i9.6879
