I. Introduction
To generate more precise captions, modern image captioning models integrate attention mechanisms. The notion of attention was originally studied in neuroscience to understand how humans differentiate and recognize objects [1], [2]. Attention models were introduced to image captioning in the well-known paper "Show, Attend and Tell" [3], in which the authors combined an attention mechanism with a Recurrent Neural Network (RNN) decoder. One advantage of this mechanism is that it alleviates the difficulty Long Short-Term Memory (LSTM)/RNN models have in handling longer sentences, since attention allows context to be maintained throughout the generated sentence.
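To make the idea concrete, the listing below is a minimal sketch of soft (additive) attention over spatial image features inside one RNN/LSTM decoding step. The layer names, feature dimensions, and the LSTMCell decoder are illustrative assumptions for this sketch, not the exact architecture of [3]; the point is only that the context vector is recomputed at every word, which is how attention preserves context over longer sentences.

import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    # Illustrative additive attention module; dimensions are assumptions.
    def __init__(self, feature_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feature_dim, attn_dim)    # project image region features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)   # project current decoder state
        self.score = nn.Linear(attn_dim, 1)                  # scalar relevance score per region

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, num_regions, feature_dim) spatial CNN features
        # hidden:   (batch, hidden_dim) current LSTM hidden state
        energy = torch.tanh(self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # attention weights
        # Context vector: attention-weighted sum of region features,
        # recomputed at every decoding step so the decoder can refocus
        # on different image regions as the sentence grows.
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)          # (batch, feature_dim)
        return context, alpha

# Usage sketch for a single decoding step (hypothetical sizes).
attn = SoftAttention(feature_dim=512, hidden_dim=256, attn_dim=128)
decoder = nn.LSTMCell(input_size=512 + 300, hidden_size=256)  # context + word embedding
features = torch.randn(4, 196, 512)            # e.g. a 14x14 CNN feature map, flattened
h, c = torch.zeros(4, 256), torch.zeros(4, 256)
word_emb = torch.randn(4, 300)                 # embedding of the previously generated word
context, alpha = attn(features, h)
h, c = decoder(torch.cat([context, word_emb], dim=1), (h, c))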