
Analysis and Prospects of the Application of Large Language Models in the Field of Image Captioning

Sanxing Cui

Abstract


With the continuous development of computer neural networks and image processing technologies, the application of large language models to image captioning has become a research hotspot. This article provides a brief overview of the current performance of major large language models on image captioning tasks. It traces the development of image captioning techniques and analyzes the current state of large language model applications in this area. Through a detailed comparison of relevant datasets, evaluation metrics, and algorithm performance, it examines the practical effects and potential of large language models in image captioning. The article also highlights challenges that large language models face in this field and identifies future research directions, aiming to provide reference and insight for the field's further development.

Keywords


Computer Neural Networks; Image Processing and Computer Vision; Large Language Models; Image Captioning


DOI: http://dx.doi.org/10.70711/aitr.v2i5.5270
