I. Introduction
To generate more precise captions, modern image captioning models integrate attention mechanisms. The notion of attention was originally studied in neuroscience to understand how humans differentiate and recognize objects [1], [2]. Attention models were introduced to image captioning in the well-known paper "Show, Attend and Tell" [3], in which the authors combined an attention mechanism with a Recurrent Neural Network (RNN) decoder. One advantage of this mechanism is that it alleviates the difficulty Long Short-Term Memory (LSTM)/RNN models have in handling longer sentences, since attention allows context to be maintained throughout the generated sentence.
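To make the idea concrete, the listing below is a minimal sketch of soft (additive) attention over spatial image features inside one RNN/LSTM decoding step. The layer names, feature dimensions, and the LSTMCell decoder are illustrative assumptions for this sketch, not the exact architecture of [3]; the point is only that the context vector is recomputed at every word, which is how attention preserves context over longer sentences.

import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    # Illustrative additive attention module; dimensions are assumptions.
    def __init__(self, feature_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feature_dim, attn_dim)    # project image region features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)   # project current decoder state
        self.score = nn.Linear(attn_dim, 1)                  # scalar relevance score per region

    def forward(self, features: torch.Tensor, hidden: torch.Tensor):
        # features: (batch, num_regions, feature_dim) spatial CNN features
        # hidden:   (batch, hidden_dim) current LSTM hidden state
        energy = torch.tanh(self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1))
        alpha = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # attention weights
        # Context vector: attention-weighted sum of region features,
        # recomputed at every decoding step so the decoder can refocus
        # on different image regions as the sentence grows.
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)          # (batch, feature_dim)
        return context, alpha

# Usage sketch for a single decoding step (hypothetical sizes).
attn = SoftAttention(feature_dim=512, hidden_dim=256, attn_dim=128)
decoder = nn.LSTMCell(input_size=512 + 300, hidden_size=256)  # context + word embedding
features = torch.randn(4, 196, 512)            # e.g. a 14x14 CNN feature map, flattened
h, c = torch.zeros(4, 256), torch.zeros(4, 256)
word_emb = torch.randn(4, 300)                 # embedding of the previously generated word
context, alpha = attn(features, h)
h, c = decoder(torch.cat([context, word_emb], dim=1), (h, c))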