Research on Multi-Modal Retrieval System of E-Commerce Platform Based on Pre-Training Model
Abstract
This paper presents a multi-modal retrieval system for e-commerce platforms built on three pre-trained models: BLIP, CLIP, and CLIP Interrogator. The system addresses the limitations of traditional keyword-based product search by enabling more accurate and efficient image-text matching. We trained and evaluated our approach on 413,000 image-text pairs from the Google Conceptual Captions dataset. Our method introduces a novel feature fusion mechanism that combines the strengths of several pre-trained models to achieve comprehensive visual-semantic understanding. The system performs strongly in both everyday commercial scenes and complex artistic product descriptions. Experimental results show that the proposed method effectively generates detailed, context-aware descriptions and accurately matches user queries to product images. Its adaptability and semantic understanding make it particularly valuable for improving the user experience of e-commerce applications. This research contributes to the development of intelligent shopping platforms by bridging the gap between textual queries and visual content. Notably, the integration of the CLIP model significantly enhances the retrieval system's understanding of user intent and product semantics, making product recommendations more accurate and the search process more targeted.
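As a rough illustration of the pipeline the abstract describes, the sketch below indexes a small product catalog by fusing each item's CLIP image embedding with the CLIP embedding of a BLIP-generated caption, then ranks items against a text query by cosine similarity. This is a minimal sketch, not the authors' implementation: the Hugging Face checkpoint names, the tiny in-memory catalog, and the equal-weight late fusion are illustrative assumptions, since the paper's actual fusion mechanism is not reproduced here.

import torch
from PIL import Image
from transformers import (
    BlipForConditionalGeneration,
    BlipProcessor,
    CLIPModel,
    CLIPProcessor,
)

# Pre-trained checkpoints (illustrative choices, not necessarily the paper's).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").eval()
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

@torch.no_grad()
def caption(image):
    """Generate a product description with BLIP."""
    inputs = blip_proc(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)

@torch.no_grad()
def clip_image(image):
    """Unit-normalized CLIP image embedding."""
    feats = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def clip_text(texts):
    """Unit-normalized CLIP text embeddings."""
    inputs = clip_proc(text=texts, return_tensors="pt", padding=True, truncation=True)
    feats = clip.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical catalog of product images (file names are placeholders).
paths = ["red_sneaker.jpg", "desk_lamp.jpg", "hiking_backpack.jpg"]
images = [Image.open(p).convert("RGB") for p in paths]

# Index each product by fusing its visual embedding with the embedding of
# its BLIP-generated caption. Equal weights are an assumption made for
# illustration; they stand in for the paper's fusion mechanism.
index = []
for img in images:
    fused = 0.5 * clip_image(img) + 0.5 * clip_text([caption(img)])
    index.append(fused / fused.norm(dim=-1, keepdim=True))
index = torch.cat(index)                      # (N, d)

# Retrieval: rank products by cosine similarity to the user's text query.
query = clip_text(["shoes for jogging"])      # (1, d)
scores = (query @ index.T).squeeze(0)         # (N,)
best = scores.argmax().item()
print(f"best match: {paths[best]} (score={scores[best].item():.3f})")

Fusing the caption embedding with the image embedding lets textual metadata and visual content both influence the match, which is one simple way to realize the image-text matching the abstract attributes to the combined BLIP/CLIP setup.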
DOI: http://dx.doi.org/10.70711/aitr.v2i9.6879