
Image Captioning: A Survey on its methods and Implementation.

19 pages · Published: August 6, 2024

Abstract

This literature review navigates the broad landscape of image captioning, an interdisciplinary field combining natural language processing and computer vision. We begin with a detailed examination of the CNN-to-Bi-CARU model, an attention-based bidirectional architecture for comprehensive contextual information extraction. Applying this model to image captioning requires detecting image features and objects and identifying them precisely. Attention mechanisms are essential for keeping the generated words aligned with the image content as the focus shifts during caption generation. The CNN-to-Bi-CARU model also addresses efficiency concerns, requiring less inference time to produce captions for images. Its stability is acknowledged, even as improvements are proposed toward a refined BDR-GRU system. The experimental phase investigates different loss functions and optimizers, leading to the selection of cross-entropy loss and the Adam optimizer, which yield better BLEU-4 scores and accuracy. A new framework is then introduced for estimating significant regions in images. The approach relies on image captioning, incorporating semantic information and estimating important regions from the subject and object words contained in the captions. Experimental results confirm that the technique can estimate important regions with a sensitivity rivaling human perception. For remote sensing image captioning, the exploration concludes with an encoder-decoder model. Instead of traditional token generation, the model produces continuous output representations, using a proposed loss function to optimize semantic similarity at the sequence level. This approach may have a great impact on language generation in the context of remote sensing imagery. Surveying the diverse methods explored, the problems identified, and the innovations realized, this paper provides an overview of the landscape and a call for further research. The importance of stability and loss functions in this emerging area underscores its dynamic nature and portends improved image captioning. In conclusion, this survey presents an overview of the field's current state, serving as a basis for further improvement and exploration in the fascinating area of image captioning.
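To make the training setup mentioned in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' code, of a captioning decoder trained with cross-entropy loss and the Adam optimizer and scored with BLEU-4. The model sizes, vocabulary size, GRU decoder, and toy data are illustrative assumptions.

# Minimal sketch (not the paper's implementation): CNN-feature -> GRU decoder
# captioner, cross-entropy loss, Adam optimizer, BLEU-4 evaluation.
import torch
import torch.nn as nn
from nltk.translate.bleu_score import sentence_bleu

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 5000, 256, 512  # assumed sizes

class CaptionDecoder(nn.Module):
    """GRU decoder that turns an image feature vector into token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.init_h = nn.Linear(HIDDEN_DIM, HIDDEN_DIM)  # image feature -> h0
        self.gru = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, image_feat, captions):
        h0 = torch.tanh(self.init_h(image_feat)).unsqueeze(0)  # (1, B, H)
        emb = self.embed(captions)                             # (B, T, E)
        hidden, _ = self.gru(emb, h0)
        return self.out(hidden)                                # (B, T, V) logits

decoder = CaptionDecoder()
criterion = nn.CrossEntropyLoss()                 # cross-entropy over tokens
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

# One training step on dummy data standing in for CNN features and captions.
image_feat = torch.randn(4, HIDDEN_DIM)           # stand-in for CNN output
captions = torch.randint(0, VOCAB_SIZE, (4, 12))  # token ids, <bos> ... <eos>
logits = decoder(image_feat, captions[:, :-1])    # predict the next token
loss = criterion(logits.reshape(-1, VOCAB_SIZE), captions[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# BLEU-4: uniform weights over 1- to 4-gram precisions.
reference = [["a", "dog", "runs", "on", "the", "beach"]]
candidate = ["a", "dog", "is", "running", "on", "the", "beach"]
print(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))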
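The sequence-level objective described for the remote sensing encoder-decoder can likewise be sketched: the decoder emits continuous embedding vectors rather than discrete tokens, and training maximizes semantic similarity with the reference embeddings. The mean-pooling and cosine form below are illustrative assumptions, not the paper's exact loss.

# Hedged sketch of a sequence-level semantic-similarity loss over continuous
# output representations (assumed form; the surveyed paper's loss may differ).
import torch
import torch.nn.functional as F

pred = torch.randn(4, 12, 256, requires_grad=True)  # predicted embeddings (B, T, E)
target = torch.randn(4, 12, 256)                    # reference word embeddings

# Pool each sequence to one vector, then penalize low cosine similarity.
loss = 1.0 - F.cosine_similarity(pred.mean(dim=1), target.mean(dim=1)).mean()
loss.backward()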

Keyphrases: artificial intelligence, assistive technology, bilingual evaluation understudy, blind individuals, convolutional neural network, image captioning, interpretation, keywords, long short term memory, machine learning, neural networks, quality of life, real time visual, recurrent neural network, visual impairment

In: Rajakumar G (editor). Proceedings of 6th International Conference on Smart Systems and Inventive Technology. Kalpa Publications in Computing, vol 19, pages 303-321.

BibTeX entry
@inproceedings{ICSSIT2024:Image_Captioning_Survey_its,
  author    = {Thirrunavukkarasu Ramasamy Radhakrishnan and Arun Thangaraju and Aravind Jayakumar and Dharanash Sundaramoorthi and Kishore Gurusamy},
  title     = {Image Captioning: A Survey on its methods and Implementation.},
  booktitle = {Proceedings of 6th International Conference on Smart Systems and Inventive Technology},
  editor    = {Rajakumar G},
  series    = {Kalpa Publications in Computing},
  volume    = {19},
  publisher = {EasyChair},
  bibsource = {EasyChair, https://easychair.org},
  issn      = {2515-1762},
  url       = {/publications/paper/7dmH4},
  doi       = {10.29007/43tn},
  pages     = {303-321},
  year      = {2024}}