Introduction
Recently, a massive number of images have been stored digitally and transferred over the Internet as a significant source of information [1], [2]. Computer vision (CV) techniques enable computers to interpret the visual world and therefore enable various promising applications, such as information retrieval, human-computer interaction, assistance for visually impaired people, and child education [3], [4]. Image captioning is a task at the intersection of natural language processing (NLP) and CV that performs a multi-modal transformation from images to text [5]. As an important and challenging domain of artificial intelligence (AI), automatically generating image descriptions is gaining considerable interest [6]. The objective of image captioning is to produce linguistically plausible sentences that are semantically correct with respect to the image content [7], [8]. Thus, describing an image involves two main components: language processing and visual understanding. To guarantee that the generated sentence is grammatically and semantically correct, NLP and CV technologies must be properly integrated to handle the issues created by combining the two modalities [9].
Understanding an image depends essentially on obtaining its features [10], [11]. The systems utilized for this purpose are broadly classified into (1) deep machine learning (DML) assisted techniques and (2) traditional machine learning (ML) based techniques [12]. In the DML technique, features are automatically learned from the training dataset, and such models can handle diverse and large sets of videos and images [13], [14]. For instance, the Convolutional Neural Network (CNN) is extensively employed for learning features, and classifiers such as Softmax are utilized for classification [15]. Generally, a CNN is followed by a Recurrent Neural Network (RNN) to create captions. In contrast, in the traditional ML technique, handcrafted features are broadly applied [16]. In this technique, features are extracted from the input dataset and then passed to classifiers, such as Support Vector Machines (SVM), to categorize the object. Since handcrafted features are task-specific, extracting features from a diverse and large collection of data is not feasible [17]. Furthermore, real-world data such as videos and images are complex and have different semantic interpretations [18].
Although image captioning research has made significant progress, several challenges remain to be solved. Annotation is significant, and image captioning models may struggle with ambiguous images. In addition, data processing is a key procedure for image captioning; more specifically, selecting the optimal hyperparameters and handling imbalanced datasets are two main issues that affect the training process. Furthermore, real-time and multi-modal processing remain limitations of most existing image captioning models. Recent image captioning systems have not focused on hyperparameter selection, even though it affects the effectiveness of the model. In particular, hyperparameters such as the batch size, epoch count, and learning rate must be chosen carefully to obtain improved performance. As trial-and-error hyperparameter tuning is a tiresome and error-prone procedure, meta-heuristic algorithms can be applied instead. Consequently, in this work, the Lightning Search Algorithm (LSA) is used to select the parameters of the Hybrid Convolutional Neural Network (HCNN) model.
This article develops an LSA with HCNN-based Image Captioning System (LSAHCNN-ICS) for NLP. The presented LSAHCNN-ICS method develops an end-to-end model that employs the CNN-based ShuffleNet as an encoder and the HCNN as a decoder. In the encoding part, the ShuffleNet model derives feature descriptors of the image. In the decoding part, the text description is generated using the HCNN model. To achieve improved captioning results, the LSA is applied as a hyperparameter tuning strategy. The experimental validation of the presented LSAHCNN-ICS system is performed on benchmark datasets.
The rest of this study is organized as follows. Section II gives a literature review of image captioning techniques. Section III presents the proposed LSAHCNN-ICS method, and Section IV delivers the experimental validation. Lastly, Section V concludes the work.
Related Works
Wang and Huang [19] presented a local representation-enhanced recurrent convolution network (Lore-RCN). The authors developed a visual convolution network to obtain an improved local linguistic context, which integrates selective local visual information and models short-term neighbouring relations. In addition, they designed a linguistic convolution network to obtain improved linguistic representations, which explicitly models long- and short-term connections to leverage the information in preceding linguistic tokens. He and Lu [20] suggested an end-to-end method that relies on an RNN as a decoder and a deep CNN as an encoder. To obtain superior image features for captioning, the authors presented a highly modularized multi-branch CNN that can improve accuracy while keeping the number of hyperparameters unaltered.
Al-Malla et al. [21] proposed an attention-based encoder-decoder deep framework that combines convolutional features extracted by a CNN pre-trained on ImageNet (Xception) with object features extracted by the YOLOv4 method pre-trained on MS COCO. Prudviraj et al. [22] introduced a multiscale feature fusion network (M-FFN) for image captioning tasks to incorporate distinct features and contextual information of images. Specifically, the authors take advantage of a multiscale feature pyramid network (MSFPN) to incorporate global contextual information through atrous convolutions at the top layers of the CNN. Faiyaz Khan et al. [23] elaborated an end-to-end image captioning method employing a multi-modal architecture that integrates a 1D-CNN for encoding sequence data with a pre-trained ResNet50 image encoder that extracts region-based visual features.
An effective structure for captioning remote sensing images (RSI) was presented in [24]. This structure relies on multi-level attention and multilabel attribute graph convolution. More precisely, the presented multi-level attention component adaptively concentrates on particular spatial features and, among them, on features of certain scales. In addition, the attribute graph convolution network (GCN) component uses the attribute graph to learn highly efficient attribute features for image captioning. Dong et al. [25] investigated a Dual Graph Convolution Network (Dual-GCN) with transformer and curriculum learning for image captioning. Notably, the authors of [25] did not only utilize an object-level GCN to capture the object-to-object spatial relationships in a single image. With the well-designed Dual-GCN, they make the linguistic transformer better understand the connections between distinct objects in a single image and make full use of similar images as auxiliary data to generate a reasonable caption for a single image.
Wang and Gu [26] presented a Joint Relationship Attention Network (JRAN), which newly explores the relationships among image features. Specifically, JRAN exploits semantic features as a supplement to region features and fully learns two kinds of relationships: the visual relationships among region features and the visual-semantic relationships between region and semantic features. Wang and Gu [27] examined the Double-Level Relationship Network (DLRN), which newly treats the local and global features of an image as complementary and enhances the relationships among features. The network learns distinct hierarchies of visual relationships by applying graph attention for local-level relationship enhancement and pixel-level relationship enhancement, respectively.
In [28], the authors analysed local visual modelling with grid features for image captioning, which is vital for generating correct and detailed captions. To accomplish this objective, they presented a Locality-Sensitive Transformer Network (LSTNet) with two novel designs: Locality-Sensitive Attention and Locality-Sensitive Fusion (LSF). In [29], a Local Relation Network (LRN) was designed over the objects and image regions, which not only determines the relationships between the objects and image regions but also creates significant context-based features corresponding to every region of the image. Lastly, a modified LSTM with an attention process concentrates on relevant contextual information, spatial locations, and deep visual features.
Different from the above recent methods, we propose to enhance the image captioning process by incorporating a CNN and the LSA. The former is important for feature extraction, while the latter is useful for optimizing hyperparameter tuning. We note that some of the related works have been evaluated on the Flickr8k, Flickr30k, and MSCOCO datasets, while others have been tested on a limited number of datasets. Besides, the related works on image captioning did not consider the number of channels.
The Proposed Model
In this study, an innovative LSAHCNN-ICS algorithm is developed for image captioning in NLP. The introduced LSAHCNN-ICS technique relies on an end-to-end model comprising two major parts: the CNN-based ShuffleNet as an encoder and the HCNN as a decoder. Automated image captioning systems employ an encoder-decoder framework in which the encoder extracts features from an image, whereas the decoder generates the text description. In this case, the ShuffleNet model is exploited to extract features from the image, and the HCNN model acts as a decoder that produces the text description. Fig. 1 represents the block diagram of the LSAHCNN-ICS model.
A. Encoder Unit: ShuffleNet
In the encoding part, the ShuffleNet model derives feature descriptors of the image. The proposed model uses ShuffleNet as the backbone CNN to extract visual features from the input image. ShuffleNet's compact architecture and efficient computation make it suitable for extracting image features. The encoding unit captures the fine-grained information, encodes it, and generates fixed-size vectors. ShuffleNet is related in concept to ResNet, MobileNet, and Xception. Depthwise separable convolution and channel shuffle are used to enhance the ResNet architecture, which ensures network performance and improves operational efficiency [30]. Different from the residual structure, which directly combines the deep and non-deep features obtained by numerous convolutions, the inverted residual block splits the input feature maps into two branches, X1 and X2, which carry the deep and non-deep features, and lastly applies channel shuffle to fuse the deep and non-deep features. Fig. 2 illustrates the framework of the ShuffleNet technique. Assume that the input layer is separated into G groups over the overall number of channels. A squeeze-and-excitation (SE) channel attention mechanism is also applied: the squeeze operation compresses each channel $U_{c}$ of the $H\times W$ feature maps into a channel descriptor $Z_{c}$ through global average pooling, \begin{equation*} Z_{c}=F_{sq}\left ({U_{c} }\right)=\frac {1}{H\times W}\sum \nolimits _{i=1}^{H} \sum \nolimits _{j=1}^{W} U_{c} \left ({i,j }\right). \tag{1}\end{equation*}
Then, a linear mapping and an activation function are applied to the feature vector to handle non-linear conditions and best adapt to the complex correlations among channels. Lastly, the evaluated channel weights are multiplied with the deep feature maps to obtain the output. The SE method weakens insignificant features and strengthens significant ones by controlling the scale of each channel, which makes the extracted features more directional. This channel attention can be inserted between any feature maps.
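For illustration, the following minimal PyTorch sketch shows the two operations described above: the channel shuffle across G groups and the squeeze-and-excitation channel attention whose squeeze step is Eq. (1). The tensor shapes, group count, and reduction ratio of 4 are illustrative assumptions, not the exact configuration of the proposed encoder.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across G groups so grouped convolutions can exchange information."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into G groups
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group channel axes
    return x.view(n, c, h, w)                  # flatten back: channels are now shuffled

class SqueezeExcite(nn.Module):
    """Channel attention: squeeze (Eq. 1), excitation, then rescale the feature maps."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, h, w = u.shape
        z = u.mean(dim=(2, 3))                 # squeeze: global average pooling, Eq. (1)
        s = self.fc(z).view(n, c, 1, 1)        # excitation: per-channel weights in (0, 1)
        return u * s                           # reweight the deep feature maps

# Example on a dummy feature map: a 4-group shuffle followed by channel attention
x = torch.randn(1, 8, 16, 16)
y = SqueezeExcite(8)(channel_shuffle(x, groups=4))
```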
B. Decoder Unit: LSA With HCNN Model
In the decoding part, the text description is generated using the HCNN model. The outputs of the image feature encoder and the word sequence encoder are combined and fed as input into the HCNN model. The HCNN model produces a softmax prediction over all vocabulary words for the next word in the sequence, and the word with the maximum probability is chosen. This procedure is repeated until the ending token is produced. The HCNN architecture, which generates the text description, comprises a sequential connection of a CNN and an LSTM [31]. In its original formulation [31], this method extracts complicated features from many sensor parameters gathered for forecasting power demand and preserves complex irregular trends. Firstly, the upper layer of the HCNN comprises the CNN. The CNN can ingest different parameters that affect power utilization, namely sub-metering, voltage, and intensity. Furthermore, household features such as time, date, household occupancy, and residents' behaviour are modelled as metadata in the CNN layer. The CNN comprises an input layer, which accepts the sensor variables as input, multiple hidden layers, and an output unit that passes the extracted features to the LSTM. The convolution layer applies the convolution operation to the incoming multivariate time sequence and passes the result to the following layer. Every convolutional neuron processes the information of its receptive field. The convolution operation reduces the parameter count and allows the HCNN network to be deeper. The output $y_{ij}^{l}$ of the $j$-th kernel at position $i$ in layer $l$ is computed from the kernel weights $w_{m,j}^{l}$, the bias $b_{j}^{l}$, the kernel size $M$, and the activation function $s(\cdot)$ as \begin{align*} y_{ij}^{1}&=s\left ({b_{j}^{1}+\sum \limits _{m=1}^{M} w_{m,j}^{1} x_{i+m-1,j}^{0} }\right) \tag{2}\\ y_{ij}^{l}&=s\left ({b_{j}^{l}+\sum \limits _{m=1}^{M} w_{m,j}^{l} x_{i+m-1,j}^{l-1} }\right) \tag{3}\end{align*}
The pooling layer reduces the spatial size of the representation to decrease the computation and the number of parameters of the network. After the convolutional layer, a pooling layer aggregates the outputs of neuron clusters in one layer into a single neuron in the following layer. Eq. (4) characterizes the max-pooling operation with pooling window $R$ and stride $T$: \begin{equation*} p_{ij}^{l}=\max _{r\in R}y_{i\times T+r,j}^{l-1} \tag{4}\end{equation*}
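To make Eqs. (2)–(4) concrete, the short NumPy sketch below evaluates one convolutional output channel and its max-pooled counterpart for a single input channel of a multivariate sequence. The kernel size M = 3, the pooling window T = 2, and the sigmoid activation s(·) are illustrative choices, not the settings used in the reported experiments.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def conv1d_channel(x, w, b, s=sigmoid):
    """Eqs. (2)-(3): y_i = s(b + sum_m w_m * x_{i+m-1}) along one input channel."""
    M = len(w)
    return np.array([s(b + np.dot(w, x[i:i + M])) for i in range(len(x) - M + 1)])

def max_pool1d(y, T):
    """Eq. (4): p_i = max over non-overlapping windows of size T."""
    usable = (len(y) // T) * T
    return y[:usable].reshape(-1, T).max(axis=1)

x = np.random.randn(32)        # one channel of a multivariate input sequence
w = np.random.randn(3)         # kernel of size M = 3
y = conv1d_channel(x, w, b=0.1)
p = max_pool1d(y, T=2)
```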
The LSTM forms the lower layer of the HCNN and stores temporal information about significant features. The output $p_{t}$ of the preceding CNN layer is received by the gate units. The LSTM encompasses forget, input, and output gates. The memory cell of the LSTM updates its state through the activations of the gating units, which are constrained to values between zero and one, as follows [31]:\begin{align*} i_{t}&=\sigma \left ({W_{pi}p_{t}+W_{hi}h_{t-1}+W_{ci}\circ c_{t-1}+b_{i} }\right) \tag{5}\\ f_{t}&=\sigma \left ({W_{pf}p_{t}+W_{hf}h_{t-1}+W_{cf}\circ c_{t-1}+b_{f} }\right) \tag{6}\\ o_{t}&=\sigma \left ({W_{po}p_{t}+W_{ho}h_{t-1}+W_{co}\circ c_{t}+b_{o} }\right) \tag{7}\end{align*}
Here, $i_{t}$, $f_{t}$, and $o_{t}$ denote the input, forget, and output gates, respectively, $\sigma$ is the logistic activation, $\circ$ denotes element-wise multiplication, and $W$ and $b$ are the weight matrices and biases. The cell state $c_{t}$ and the hidden output $h_{t}$ are then updated as \begin{align*} c_{t}&=f_{t}\circ c_{t-1}+i_{t}\circ \sigma \left ({W_{pc}p_{t}+W_{hc}h_{t-1}+b_{c} }\right) \tag{8}\\ h_{t}&=o_{t}\circ \sigma \left ({c_{t} }\right) \tag{9}\end{align*}
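The gated update of Eqs. (5)–(9) can be written down directly, as in the minimal NumPy sketch below. The dimensions are arbitrary, the logistic function is used for the cell candidate exactly as the equations state, and the weight dictionary `W` is a notational convenience introduced here for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(p_t, h_prev, c_prev, W):
    """One LSTM update following Eqs. (5)-(9); W holds all weights, peephole vectors, and biases."""
    i = sigmoid(W['pi'] @ p_t + W['hi'] @ h_prev + W['ci'] * c_prev + W['bi'])   # input gate, Eq. (5)
    f = sigmoid(W['pf'] @ p_t + W['hf'] @ h_prev + W['cf'] * c_prev + W['bf'])   # forget gate, Eq. (6)
    c = f * c_prev + i * sigmoid(W['pc'] @ p_t + W['hc'] @ h_prev + W['bc'])     # cell state, Eq. (8)
    o = sigmoid(W['po'] @ p_t + W['ho'] @ h_prev + W['co'] * c + W['bo'])        # output gate, Eq. (7)
    h = o * sigmoid(c)                                                           # hidden output, Eq. (9)
    return h, c

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((d_h, d_in)) for k in ('pi', 'pf', 'pc', 'po')}
W.update({k: rng.standard_normal((d_h, d_h)) for k in ('hi', 'hf', 'hc', 'ho')})
W.update({k: rng.standard_normal(d_h) for k in ('ci', 'cf', 'co', 'bi', 'bf', 'bc', 'bo')})
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W)
```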
The final layers of the HCNN are fully connected (FC) layers. They are utilized to generate the text by producing the softmax prediction over the vocabulary.
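Putting the pieces together, a compact PyTorch sketch of the decoder described above is given next: the projected image descriptor is fused with the embedded word sequence, passed through a convolutional layer and an LSTM, and mapped by fully connected layers to vocabulary scores. The layer sizes, the fusion by concatenation, the vocabulary size, and the omission of pooling are assumptions made here for brevity.

```python
import torch
import torch.nn as nn

class HCNNDecoder(nn.Module):
    """Hybrid CNN-LSTM decoder sketch: conv layer on the fused sequence, LSTM, then FC layers."""
    def __init__(self, vocab_size=5000, embed_dim=256, img_dim=1024, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)          # project the ShuffleNet descriptor
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, vocab_size)                # FC layer producing word scores

    def forward(self, img_feat, words):
        seq = self.embed(words)                                # (B, L, E) embedded word sequence
        img = self.img_proj(img_feat).unsqueeze(1)             # (B, 1, E) image descriptor
        x = torch.cat([img, seq], dim=1)                       # fuse image and word inputs
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)       # convolution over the sequence
        out, _ = self.lstm(x)
        return self.fc(out)                                    # softmax is applied in the loss

logits = HCNNDecoder()(torch.randn(2, 1024), torch.randint(0, 5000, (2, 12)))
```

At inference time, the word with the highest softmax probability would be appended to the sequence and the procedure repeated until the ending token is produced, as described above.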
To achieve improved captioning results, the LSA is applied as a hyperparameter tuning strategy. The LSA is a metaheuristic method inspired by the natural phenomenon of lightning [32]. Its main idea is a generalization of the hypotheses underlying the model of step leader propagation. The LSA executes a search procedure through fast particles called projectiles, which move through the search space. The projectile plays the same role as the solution representations used in other evolutionary mechanisms, such as the "chromosome", "particle", or "individual". The LSA combines three kinds of projectiles, which are defined as follows:
Transition projectile: these projectiles form the early population of step leaders. Each projectile is generated using a random value drawn from the uniform probability distribution given in Eq. (10) [32]:
\begin{align*} f\left ({x }\right)=\begin{cases} \displaystyle \frac {1}{b-a}& a\le x\le b\\ \displaystyle 0& x < a \text { or } x>b\end{cases} \tag{10}\end{align*}
Space projectile: these projectiles are updated and evolved such that one of them becomes the leader. The updating process is formulated as follows:
\begin{equation*} p_{j}^{s}=p_{i}^{s}\pm exprnd\left ({D }\right) \tag{11}\end{equation*}
In this expression, $exprnd(D)$ is a random number drawn from the exponential distribution of Eq. (12), whose shaping parameter $\mu$ is set to the distance $D$ between the lead projectile $p^{L}$ and the space projectile $p_{i}^{s}$ given in Eq. (13): \begin{align*} f\left ({x }\right)=\begin{cases} \displaystyle \frac {1}{\mu }e^{\frac {-x}{\mu }}& x>0\\ \displaystyle 0& x\le 0\end{cases} \tag{12}\end{align*} \begin{equation*} D=\left |{ p^{L}-p_{i}^{s} }\right | \tag{13}\end{equation*}
Lead projectile: it characterizes the optimal solution found so far. It is updated as [32]:
\begin{equation*} p_{new}^{L}=p^{L}+normrnd\left ({0, E_{k} }\right) \tag{14}\end{equation*}
Here, $normrnd(0, E_{k})$ is a random number generated from the normal distribution of Eq. (15) with mean $\mu =0$ and standard deviation $\sigma =E_{k}$, where the kinetic energy $E_{k}$ of the lead projectile decreases with the iteration counter $t$ over the maximum number of iterations $T$ according to Eq. (16):\begin{align*} f\left ({x }\right)&=\frac {1}{\sigma \sqrt {2\pi }}e^{\frac {-(x-\mu)^{2}}{2\sigma ^{2}}} \tag{15}\\ E_{k}&=2.05-2\exp \left ({\frac {-5\left ({T-t }\right)}{T} }\right) \tag{16}\end{align*}
The updated space and lead projectiles replace the old projectiles and extend the channel development only if their energy (quality) is better than that of the old ones.
Algorithm 1 Pseudocode of LSA
Initializing Max iteration, channel time
Initializing lead tips energy
Produce transition projectiles at random using equation (10)
Assess the performance of projectiles
Iteration = 1
While iteration ≤ Max_iteration do
Upgrading lead tips energy using equation (16)
Upgrading best and worst leaders using equation (14)
If Max channel time is obtained, then
Move step leader from the worst location towards the best
Rearrange channel time
End if
Upgrading kinetic energy and its direction using equation (16)
Upgrading space and leader projectile
Assess the performance of the projectile
If fork then
Produce two symmetrical channels at the fork point
Remove the channel that has lower energy
End if
Iteration = iteration +1
End while
Return optimum step leader
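The pseudocode above can be rendered as a compact NumPy sketch, given below. This is a simplified, illustrative implementation: the forking step is omitted, the population size, channel time, and bounds are arbitrary, and the toy fitness function stands in for the captioning performance that would be obtained with a candidate set of HCNN hyperparameters (learning rate, batch size, epoch count).

```python
import numpy as np

def lsa(fitness, bounds, n_proj=10, max_iter=50, channel_time=5, seed=0):
    """Minimal Lightning Search Algorithm sketch (no forking) for minimising `fitness`."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    pop = rng.uniform(lo, hi, size=(n_proj, len(lo)))     # transition projectiles, Eq. (10)
    fit = np.array([fitness(p) for p in pop])
    stagnation = 0
    for t in range(max_iter):
        energy = 2.05 - 2 * np.exp(-5 * (max_iter - t) / max_iter)   # lead-tip energy, Eq. (16)
        best, worst = np.argmin(fit), np.argmax(fit)
        if stagnation >= channel_time:                    # channel elimination after stagnation
            pop[worst], fit[worst] = pop[best].copy(), fit[best]
            stagnation = 0
        leader = pop[best]
        improved = False
        for i in range(n_proj):
            if i == best:                                 # lead projectile update, Eq. (14)
                cand = leader + rng.normal(0.0, energy, size=leader.shape)
            else:                                         # space projectile update, Eqs. (11)-(13)
                dist = np.abs(leader - pop[i])
                cand = pop[i] + rng.choice([-1, 1]) * rng.exponential(dist + 1e-12)
            cand = np.clip(cand, lo, hi)
            f_cand = fitness(cand)
            if f_cand < fit[i]:                           # keep only more energetic channels
                pop[i], fit[i] = cand, f_cand
                improved = True
        stagnation = 0 if improved else stagnation + 1
    best = np.argmin(fit)
    return pop[best], fit[best]

# Toy usage: minimise the sphere function over [-5, 5]^3
best_x, best_f = lsa(lambda x: float(np.sum(x ** 2)), np.array([[-5.0, 5.0]] * 3))
```

In the proposed system, each projectile would encode one candidate hyperparameter configuration, and the fitness evaluation would correspond to training and validating the HCNN with that configuration.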
The novel LSA-based tuning strategy is thereby employed to determine the optimal values of the HCNN hyperparameters, such as the learning rate, batch size, and epoch count.
Performance Validation
The proposed model is simulated using Python 3.8.5 on a PC with an i5-8600K CPU, a GeForce GTX 1050 Ti 4GB GPU, 16GB RAM, a 250GB SSD, and a 1TB HDD. The parameter settings are as follows: learning rate 0.01, dropout 0.5, batch size 5, epoch count 50, and ReLU activation.
The experimental image captioning results of the LSAHCNN-ICS model are investigated on three databases (Flickr8k, Flickr30K, and MSCOCO), as given in Table 1. The Flickr8k dataset is a standard collection for sentence-based image description and search, containing 8,000 images, each paired with 5 diverse captions that clearly describe the salient entities and events. The Flickr30k database comprises 31,000 images gathered from Flickr, along with 5 reference sentences provided by human annotators. The MS COCO (Microsoft Common Objects in Context) database is a large-scale dataset for object detection, segmentation, key-point detection, and captioning. These datasets have been widely used in various computer vision and natural language processing tasks, providing rich and comprehensive resources for image captioning and understanding research.
Fig. 3 illustrates some sample images. A brief comparative analysis is made with recent methods, namely the Neural Image Caption (NIC) [36], soft-attention [37], hard-attention [37], Spatial and Channel-wise Attention with CNN VGG (SCA-CNN-VGG) [38], and CNN [39] methods.
In this study, three measures were employed for experimental validation: BLEU, METEOR, and CIDEr. BLEU is a commonly used metric for estimating the quality of generated text; higher BLEU values indicate better translation quality. The METEOR measure relies on the weighted harmonic mean of unigram precision and unigram recall, and it evaluates the alignment accuracy between the candidate and reference translations. The CIDEr index treats every sentence as a "document" and represents it as a TF-IDF vector; the score is the cosine similarity between the generated caption and the reference captions.
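For reference, the snippet below shows how BLEU-1 through BLEU-4 are typically computed with NLTK on a toy reference/candidate pair; the tokenisation and smoothing choices are assumptions and may differ from the exact evaluation scripts used to produce the scores reported here.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "dog", "runs", "on", "the", "grass"],
               ["a", "brown", "dog", "running", "through", "grass"]]]   # multiple references per image
candidates = [["a", "dog", "is", "running", "on", "the", "grass"]]      # generated caption

smooth = SmoothingFunction().method1
for n, weights in enumerate([(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                             (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)], start=1):
    score = corpus_bleu(references, candidates, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```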
Table 2 presents the overall image captioning analysis of the LSAHCNN-ICS algorithm on the Flickr8K database. The experimental values demonstrate that the NIC approach shows the lowest performance, while the soft-attention and hard-attention models exhibit improved outcomes.
Along with that, the SCA-CNN-VGG and CNN models depict closer image captioning performance. However, the LSAHCNN-ICS model outperforms the existing approaches with an increased BLEU-1 of 70.15, BLEU-2 of 53.61, BLEU-3 of 43.70, BLEU-4 of 29.50, METEOR of 26.66, and CIDEr of 43.60.
The training accuracy curves of the LSAHCNN-ICS model are shown in Fig. 4.
Fig. 5 displays the corresponding training loss curves of the LSAHCNN-ICS model.
A broad comparison of the image captioning results of the LSAHCNN-ICS methodology with other systems on the Flickr30K database is described in Table 3. The obtained results show that the NIC methodology reaches the lowest image captioning outcomes.
Figs. 7, 8, and 9 present further experimental results of the LSAHCNN-ICS model.
Discussion
The results confirm the enhancements of the LSAHCNN-ICS model in the image captioning process. The better efficiency of the developed approach is due to the integration of the LSA-based hyperparameter tuning strategy and the unique characteristics of the HCNN model. Since manual hyperparameter tuning impacts the effectiveness of a DL model, automated hyperparameter tuning using the LSA helps to accomplish improved performance over other DL models. By systematically searching through the space of possible configurations, the LSA determines the combination of hyperparameter values that optimizes the model's performance on a specific task, such as image captioning. In addition, the LSA explores different configurations to find the combination that enhances the model's understanding of image content and improves the quality of the generated captions.
More precisely, Table 5 shows the effectiveness of recent systems that considered the same datasets tested in our work. The results obtained by our framework are superior to those reported in [20] for the Flickr8K and Flickr30k datasets. In addition, our results are better than those obtained in [22] in terms of the METEOR metric. Table 5 also illustrates that the LSAHCNN-ICS approach demonstrates better results for the MSCOCO dataset compared to the results achieved in [20] and [23]. Moreover, LSAHCNN-ICS provides better results than [25] in terms of the METEOR metric.
Despite the advantages of our framework, we note that our study is limited to three datasets: Flickr8k, Flickr30K, and MSCOCO. In addition, our system generates only one caption per image, while some applications need to generate multiple captions per image according to a specific purpose and perspective.
Conclusion
In this article, a novel LSAHCNN-ICS methodology was developed for image captioning in NLP. The presented LSAHCNN-ICS technique is an end-to-end model comprising two major parts: the CNN-based ShuffleNet as an encoder and the HCNN as a decoder. In the encoding part, the ShuffleNet model derives feature descriptors of the image, and in the decoding part, the text description is generated using the HCNN model. To achieve improved captioning results, the LSA is applied as a hyperparameter tuning strategy. The simulation analysis of the presented LSAHCNN-ICS technique was performed on benchmark databases, and the achieved results report the superior outcomes of the LSAHCNN-ICS algorithm over existing systems, with maximum CIDEr values of 43.60, 59.54, and 135.14 on the Flickr8k, Flickr30k, and MSCOCO datasets, respectively. The enhanced performance is owing to the addition of the LSA-assisted hyperparameter tuning process and the unique characteristics of the HCNN model. Therefore, the proposed model can be used to improve assistive technology and aid the visually impaired in comprehending their environment. In future work, the efficiency of the LSAHCNN-ICS method can be improved by using a weighted voting ensemble DL model. The developed method could also be enriched by designing hybrid meta-heuristic algorithms for hyperparameter tuning. The proposed image captioning model can be computationally expensive, especially when dealing with large images and complex architectures, whereas real-time applications require efficient and optimized models to generate captions quickly. The time required for generating captions can introduce significant latency in real-time applications, so reducing the inference time is essential to ensure a smooth user experience. Moreover, in real-time scenarios, the model may encounter objects or scenes it has not seen during training; robustness to such unseen concepts is crucial for accurate and relevant captions.
ACKNOWLEDGMENT
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R408), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. Research Supporting Project number (RSPD2024R521), King Saud University, Riyadh, Saudi Arabia. This study is supported via funding from Prince Sattam bin Abdulaziz University project number (PSAU/2023/R/1444). This study is partially funded by the Future University in Egypt (FUE).