Introduction
Recently, a massive number of images have been stored digitally and transferred over the Internet as a significant source of information [1], [2]. Computer vision (CV) techniques enable computers to interpret the visual world and therefore enable various promising applications, such as information retrieval, human-computer interaction, assistance for visually impaired people, and child education [3], [4]. Image captioning is a task at the intersection of natural language processing (NLP) and CV that performs a multi-modal transformation from images to text [5]. As an important and challenging domain of artificial intelligence (AI), automatically generating image descriptions is gaining considerable interest [6]. The objective of image captioning is to produce linguistically plausible sentences that are semantically correct with respect to the image content [7], [8]. Thus, describing an image involves two main components: language processing and visual understanding. To guarantee that the generated sentence is grammatically and semantically correct, NLP and CV technologies must be properly integrated to handle the issues created by combining the two modalities [9].
Understanding an image depends essentially on obtaining its features [10], [11]. The systems utilized for this purpose are broadly classified into (1) deep machine learning (DML) assisted techniques and (2) traditional machine learning (ML) based techniques [12]. In the DML technique, features are automatically learned from the training dataset, and such models can handle diverse and large sets of videos and images [13], [14]. For instance, the Convolutional Neural Network (CNN) is extensively employed for learning features, and classifiers such as Softmax are utilized for classification [15]. Generally, a CNN is followed by a Recurrent Neural Network (RNN) to create captions. In contrast, in the traditional ML technique, handcrafted features are broadly applied [16]. In this technique, features are extracted from the input dataset and then passed to classifiers, such as Support Vector Machines (SVM), to categorize the object. Since handcrafted features are task-specific, extracting features from a diverse and large collection of data is not feasible [17]. Furthermore, real-world data such as videos and images are complex and have different semantic interpretations [18].
Although image captioning research has made significant progress, several challenges remain to be solved. Annotation is significant, and image captioning models may struggle with ambiguous images. In addition, data processing is a key procedure for image captioning; more specifically, selecting the optimal hyperparameters and handling imbalanced datasets are two main issues that affect the training process. Furthermore, real-time and multi-modal processing remain limitations of most existing image captioning models. Recent image captioning systems have not focused on hyperparameter selection, even though it affects the effectiveness of the model. In particular, hyperparameters such as the batch size, epoch count, and learning rate must be chosen carefully to obtain improved performance. As trial-and-error hyperparameter tuning is a tiresome and error-prone procedure, meta-heuristic algorithms can be applied instead. Consequently, in this work, the Lightning Search Algorithm (LSA) is used to select the parameters of the Hybrid Convolutional Neural Network (HCNN) model.
This article develops an LSA with HCNN-based Image Captioning System (LSAHCNN-ICS) for NLP. The presented LSAHCNN-ICS method develops an end-to-end model that employs the CNN-based ShuffleNet as an encoder and the HCNN as a decoder. In the encoding part, the ShuffleNet model derives feature descriptors of the image. In the decoding part, the text description is generated using the HCNN model. To achieve improved captioning results, the LSA is applied as a hyperparameter tuning strategy. The experimental validation of the presented LSAHCNN-ICS system is performed on benchmark datasets.
The rest of this study is organized as follows. Section II gives a literature review of image captioning techniques. Section III presents the proposed LSAHCNN-ICS method, and Section IV delivers the experimental validation. Lastly, Section V concludes the work.
Related Works
Wang and Huang [19] presented a local representation-enhanced recurrent convolution network (Lore-RCN). The authors developed a visual convolution network to obtain an improved local linguistic context, which integrates selective local visual information and models short-term neighbouring relations. In addition, they designed a linguistic convolution network to obtain improved linguistic representations, which explicitly models long- and short-term connections to leverage the information in preceding linguistic tokens. He and Lu [20] suggested an end-to-end method that relies on an RNN as a decoder and a deep CNN as an encoder. To obtain superior image features for captioning, the authors presented a highly modularized multi-branch CNN that can improve accuracy while keeping the number of hyperparameters unaltered.
Al-Malla et al. [21] proposed an attention-based encoder-decoder deep framework that combines convolutional features extracted by a CNN pre-trained on ImageNet (Xception) with object features extracted by the YOLOv4 method pre-trained on MS COCO. Prudviraj et al. [22] introduced a multiscale feature fusion network (M-FFN) for image captioning tasks to incorporate distinct features and contextual information of images. Specifically, the authors take advantage of a multiscale feature pyramid network (MSFPN) to incorporate global contextual information through atrous convolutions at the top layers of the CNN. Faiyaz Khan et al. [23] elaborated an end-to-end image captioning method employing a multi-modal architecture that integrates a 1D-CNN for encoding sequence data with a pre-trained ResNet50 image encoder that extracts region-based visual features.
An effective structure for captioning remote sensing images (RSI) was presented in [24]. This structure relies on multi-level attention and multilabel attribute graph convolution. More precisely, the presented multi-level attention component adaptively concentrates on particular spatial features and, among them, on features of certain scales. In addition, the attribute graph convolution network (GCN) component uses the attribute graph to learn highly efficient attribute features for image captioning. Dong et al. [25] investigated a Dual Graph Convolution Network (Dual-GCN) with transformer and curriculum learning for image captioning. Notably, the authors of [25] did not only utilize an object-level GCN to capture the object-to-object spatial relationships in a single image. With the well-designed Dual-GCN, they make the linguistic transformer better understand the connections between distinct objects in a single image and make full use of similar images as auxiliary data to generate a reasonable caption for a single image.
Wang and Gu [26] presented a Joint Relationship Attention Network (JRAN), which newly explores the relationships among image features. Specifically, JRAN exploits semantic features as a supplement to region features and fully learns two kinds of relationships: the visual relationships among region features and the visual-semantic relationships between region and semantic features. Wang and Gu [27] examined the Double-Level Relationship Network (DLRN), which newly treats the local and global features of an image as complementary and enhances the relationships among features. The network learns distinct hierarchies of visual relationships by applying graph attention for local-level relationship enhancement and pixel-level relationship enhancement, respectively.
In [28], the authors analysed local visual modelling with grid features for image captioning, which is vital for generating correct and detailed captions. To accomplish this objective, they presented a Locality-Sensitive Transformer Network (LSTNet) with two novel designs: Locality-Sensitive Attention and Locality-Sensitive Fusion (LSF). In [29], a Local Relation Network (LRN) was designed over the objects and image regions, which not only determines the relationships between the objects and image regions but also creates significant context-based features corresponding to every region of the image. Lastly, a modified LSTM with an attention process concentrates on relevant contextual information, spatial locations, and deep visual features.
Different from the above recent methods, we propose to enhance the image captioning process by incorporating a CNN and the LSA. The former is important for feature extraction, while the latter is useful for optimizing hyperparameter tuning. We note that some of the related works have been evaluated on the Flickr8k, Flickr30k, and MSCOCO datasets, while others have been tested on a limited number of datasets. Besides, the related works on image captioning did not consider the number of channels.
The Proposed Model
In this study, an innovative LSAHCNN-ICS algorithm is developed for image captioning in NLP. The introduced LSAHCNN-ICS technique relies on an end-to-end model comprising two major parts: the CNN-based ShuffleNet as an encoder and the HCNN as a decoder. Automated image captioning systems employ an encoder-decoder framework in which the encoder extracts features from an image, whereas the decoder generates the text description. In this case, the ShuffleNet model is exploited to extract features from the image, and the HCNN model acts as a decoder that produces the text description. Fig. 1 represents the block diagram of the LSAHCNN-ICS model.
A. Encoder Unit: ShuffleNet
In the encoding part, the ShuffleNet model derives feature descriptors of the image. The proposed model uses ShuffleNet as the backbone CNN to extract visual features from the input image. ShuffleNet's compact architecture and efficient computation make it suitable for extracting image features. The encoding unit captures the fine-grained information, encodes it, and generates fixed-size vectors. ShuffleNet is related in concept to ResNet, MobileNet, and Xception. Depthwise separable convolution and channel shuffle are used to enhance the ResNet architecture, which ensures network performance and improves operational efficiency [30]. Different from the residual structure, which directly combines the deep and non-deep features obtained by numerous convolutions, the inverted residual block splits the input feature maps into two branches, X1 and X2, which carry the deep and non-deep features, and lastly applies channel shuffle to fuse the deep and non-deep features. Fig. 2 illustrates the framework of the ShuffleNet technique. Assume that the input layer is separated into G groups over the overall number of channels. A squeeze-and-excitation (SE) channel attention mechanism is also applied: the squeeze operation compresses each channel $U_{c}$ of the $H\times W$ feature maps into a channel descriptor $Z_{c}$ through global average pooling, \begin{equation*} Z_{c}=F_{sq}\left ({U_{c} }\right)=\frac {1}{H\times W}\sum \nolimits _{i=1}^{H} \sum \nolimits _{j=1}^{W} U_{c} \left ({i,j }\right). \tag{1}\end{equation*}
Then, a linear mapping and an activation function are applied to the feature vector to handle non-linear conditions and best adapt to the complex correlations among channels. Lastly, the evaluated channel weights are multiplied with the deep feature maps to obtain the output. The SE method weakens insignificant features and strengthens significant ones by controlling the scale of each channel, which makes the extracted features more directional. This channel attention can be inserted between any feature maps.
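For illustration, the following minimal PyTorch sketch shows the two operations described above: the channel shuffle across G groups and the squeeze-and-excitation channel attention whose squeeze step is Eq. (1). The tensor shapes, group count, and reduction ratio of 4 are illustrative assumptions, not the exact configuration of the proposed encoder.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across G groups so grouped convolutions can exchange information."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into G groups
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group channel axes
    return x.view(n, c, h, w)                  # flatten back: channels are now shuffled

class SqueezeExcite(nn.Module):
    """Channel attention: squeeze (Eq. 1), excitation, then rescale the feature maps."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, h, w = u.shape
        z = u.mean(dim=(2, 3))                 # squeeze: global average pooling, Eq. (1)
        s = self.fc(z).view(n, c, 1, 1)        # excitation: per-channel weights in (0, 1)
        return u * s                           # reweight the deep feature maps

# Example on a dummy feature map: a 4-group shuffle followed by channel attention
x = torch.randn(1, 8, 16, 16)
y = SqueezeExcite(8)(channel_shuffle(x, groups=4))
```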
B. Decoder Unit: LSA With HCNN Model
In the decoding part, the text description is generated using the HCNN model. The outputs of the image feature encoder and the word sequence encoder are combined and fed as input into the HCNN model. The HCNN model produces a softmax prediction over all vocabulary words for the next word in the sequence, and the word with the maximum probability is chosen. This procedure is repeated until the ending token is produced. The HCNN architecture, which generates the text description, comprises a sequential connection of a CNN and an LSTM [31]. In its original formulation [31], this method extracts complicated features from many sensor parameters gathered for forecasting power demand and preserves complex irregular trends. Firstly, the upper layer of the HCNN comprises the CNN. The CNN can ingest different parameters that affect power utilization, namely sub-metering, voltage, and intensity. Furthermore, household features such as time, date, household occupancy, and residents' behaviour are modelled as metadata in the CNN layer. The CNN comprises an input layer, which accepts the sensor variables as input, multiple hidden layers, and an output unit that passes the extracted features to the LSTM. The convolution layer applies the convolution operation to the incoming multivariate time sequence and passes the result to the following layer. Every convolutional neuron processes the information of its receptive field. The convolution operation reduces the parameter count and allows the HCNN network to be deeper. The output $y_{ij}^{l}$ of the $j$-th kernel at position $i$ in layer $l$ is computed from the kernel weights $w_{m,j}^{l}$, the bias $b_{j}^{l}$, the kernel size $M$, and the activation function $s(\cdot)$ as \begin{align*} y_{ij}^{1}&=s\left ({b_{j}^{1}+\sum \limits _{m=1}^{M} w_{m,j}^{1} x_{i+m-1,j}^{0} }\right) \tag{2}\\ y_{ij}^{l}&=s\left ({b_{j}^{l}+\sum \limits _{m=1}^{M} w_{m,j}^{l} x_{i+m-1,j}^{l-1} }\right) \tag{3}\end{align*}
The pooling layer reduces the spatial size of the representation to decrease the computation and the number of parameters of the network. After the convolutional layer, a pooling layer aggregates the outputs of neuron clusters in one layer into a single neuron in the following layer. Eq. (4) characterizes the max-pooling operation with pooling window $R$ and stride $T$: \begin{equation*} p_{ij}^{l}=\max _{r\in R}y_{i\times T+r,j}^{l-1} \tag{4}\end{equation*}
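To make Eqs. (2)–(4) concrete, the short NumPy sketch below evaluates one convolutional output channel and its max-pooled counterpart for a single input channel of a multivariate sequence. The kernel size M = 3, the pooling window T = 2, and the sigmoid activation s(·) are illustrative choices, not the settings used in the reported experiments.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def conv1d_channel(x, w, b, s=sigmoid):
    """Eqs. (2)-(3): y_i = s(b + sum_m w_m * x_{i+m-1}) along one input channel."""
    M = len(w)
    return np.array([s(b + np.dot(w, x[i:i + M])) for i in range(len(x) - M + 1)])

def max_pool1d(y, T):
    """Eq. (4): p_i = max over non-overlapping windows of size T."""
    usable = (len(y) // T) * T
    return y[:usable].reshape(-1, T).max(axis=1)

x = np.random.randn(32)        # one channel of a multivariate input sequence
w = np.random.randn(3)         # kernel of size M = 3
y = conv1d_channel(x, w, b=0.1)
p = max_pool1d(y, T=2)
```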
The LSTM forms the lower layer of the HCNN and stores temporal information about significant features. The output $p_{t}$ of the preceding CNN layer is received by the gate units. The LSTM encompasses forget, input, and output gates. The memory cell of the LSTM updates its state through the activations of the gating units, which are constrained to values between zero and one, as follows [31]:\begin{align*} i_{t}&=\sigma \left ({W_{pi}p_{t}+W_{hi}h_{t-1}+W_{ci}\circ c_{t-1}+b_{i} }\right) \tag{5}\\ f_{t}&=\sigma \left ({W_{pf}p_{t}+W_{hf}h_{t-1}+W_{cf}\circ c_{t-1}+b_{f} }\right) \tag{6}\\ o_{t}&=\sigma \left ({W_{po}p_{t}+W_{ho}h_{t-1}+W_{co}\circ c_{t}+b_{o} }\right) \tag{7}\end{align*}
Here, $i_{t}$, $f_{t}$, and $o_{t}$ denote the input, forget, and output gates, respectively, $\sigma$ is the logistic activation, $\circ$ denotes element-wise multiplication, and $W$ and $b$ are the weight matrices and biases. The cell state $c_{t}$ and the hidden output $h_{t}$ are then updated as \begin{align*} c_{t}&=f_{t}\circ c_{t-1}+i_{t}\circ \sigma \left ({W_{pc}p_{t}+W_{hc}h_{t-1}+b_{c} }\right) \tag{8}\\ h_{t}&=o_{t}\circ \sigma \left ({c_{t} }\right) \tag{9}\end{align*}
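The gated update of Eqs. (5)–(9) can be written down directly, as in the minimal NumPy sketch below. The dimensions are arbitrary, the logistic function is used for the cell candidate exactly as the equations state, and the weight dictionary `W` is a notational convenience introduced here for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(p_t, h_prev, c_prev, W):
    """One LSTM update following Eqs. (5)-(9); W holds all weights, peephole vectors, and biases."""
    i = sigmoid(W['pi'] @ p_t + W['hi'] @ h_prev + W['ci'] * c_prev + W['bi'])   # input gate, Eq. (5)
    f = sigmoid(W['pf'] @ p_t + W['hf'] @ h_prev + W['cf'] * c_prev + W['bf'])   # forget gate, Eq. (6)
    c = f * c_prev + i * sigmoid(W['pc'] @ p_t + W['hc'] @ h_prev + W['bc'])     # cell state, Eq. (8)
    o = sigmoid(W['po'] @ p_t + W['ho'] @ h_prev + W['co'] * c + W['bo'])        # output gate, Eq. (7)
    h = o * sigmoid(c)                                                           # hidden output, Eq. (9)
    return h, c

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((d_h, d_in)) for k in ('pi', 'pf', 'pc', 'po')}
W.update({k: rng.standard_normal((d_h, d_h)) for k in ('hi', 'hf', 'hc', 'ho')})
W.update({k: rng.standard_normal(d_h) for k in ('ci', 'cf', 'co', 'bi', 'bf', 'bc', 'bo')})
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W)
```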
The final layers of the HCNN are fully connected (FC) layers. They are utilized to generate the text by producing the softmax prediction over the vocabulary.
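Putting the pieces together, a compact PyTorch sketch of the decoder described above is given next: the projected image descriptor is fused with the embedded word sequence, passed through a convolutional layer and an LSTM, and mapped by fully connected layers to vocabulary scores. The layer sizes, the fusion by concatenation, the vocabulary size, and the omission of pooling are assumptions made here for brevity.

```python
import torch
import torch.nn as nn

class HCNNDecoder(nn.Module):
    """Hybrid CNN-LSTM decoder sketch: conv layer on the fused sequence, LSTM, then FC layers."""
    def __init__(self, vocab_size=5000, embed_dim=256, img_dim=1024, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)          # project the ShuffleNet descriptor
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, vocab_size)                # FC layer producing word scores

    def forward(self, img_feat, words):
        seq = self.embed(words)                                # (B, L, E) embedded word sequence
        img = self.img_proj(img_feat).unsqueeze(1)             # (B, 1, E) image descriptor
        x = torch.cat([img, seq], dim=1)                       # fuse image and word inputs
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)       # convolution over the sequence
        out, _ = self.lstm(x)
        return self.fc(out)                                    # softmax is applied in the loss

logits = HCNNDecoder()(torch.randn(2, 1024), torch.randint(0, 5000, (2, 12)))
```

At inference time, the word with the highest softmax probability would be appended to the sequence and the procedure repeated until the ending token is produced, as described above.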
To achieve improved captioning results, the LSA is applied as a hyperparameter tuning strategy. The LSA is a metaheuristic method inspired by the natural phenomenon of lightning [32]. Its main idea is a generalization of the hypotheses underlying the model of step leader propagation. The LSA executes a search procedure through fast particles called projectiles, which move through the search space. The projectile plays the same role as the solution representations used in other evolutionary mechanisms, such as the "chromosome", "particle", or "individual". The LSA combines three kinds of projectiles, which are defined as follows:
Transition projectile: these projectiles form the early population of step leaders. Each projectile is generated using a random value drawn from the uniform probability distribution given in Eq. (10) [32]:
\begin{align*} f\left ({x }\right)=\begin{cases} \displaystyle \frac {1}{b-a}& a\le x\le b\\ \displaystyle 0& x < a \text { or } x>b\end{cases} \tag{10}\end{align*}
Space projectile: these projectiles are updated and evolved such that one of them becomes the leader. The updating process is formulated as follows:
\begin{equation*} p_{j}^{s}=p_{i}^{s}\pm exprnd\left ({D }\right) \tag{11}\end{equation*}
In this expression, $exprnd(D)$ is a random number drawn from the exponential distribution of Eq. (12), whose shaping parameter $\mu$ is set to the distance $D$ between the lead projectile $p^{L}$ and the space projectile $p_{i}^{s}$ given in Eq. (13): \begin{align*} f\left ({x }\right)=\begin{cases} \displaystyle \frac {1}{\mu }e^{\frac {-x}{\mu }}& x>0\\ \displaystyle 0& x\le 0\end{cases} \tag{12}\end{align*} \begin{equation*} D=\left |{ p^{L}-p_{i}^{s} }\right | \tag{13}\end{equation*}
Lead projectile: it characterizes the optimal solution found so far. It is updated as [32]:
\begin{equation*} p_{new}^{L}=p^{L}+normrnd\left ({0, E_{k} }\right) \tag{14}\end{equation*}
Here, $normrnd(0, E_{k})$ is a random number generated from the normal distribution of Eq. (15) with mean $\mu =0$ and standard deviation $\sigma =E_{k}$, where the kinetic energy $E_{k}$ of the lead projectile decreases with the iteration counter $t$ over the maximum number of iterations $T$ according to Eq. (16):\begin{align*} f\left ({x }\right)&=\frac {1}{\sigma \sqrt {2\pi }}e^{\frac {-(x-\mu)^{2}}{2\sigma ^{2}}} \tag{15}\\ E_{k}&=2.05-2\exp \left ({\frac {-5\left ({T-t }\right)}{T} }\right) \tag{16}\end{align*}
The updated space and lead projectiles replace the old projectiles and extend the channel development only if their energy (quality) is better than that of the old ones.
Algorithm 1 Pseudocode of LSA
Initializing Max iteration, channel time
Initializing lead tips energy
Produce transition projectiles at random using equation (10)
Assess the performance of projectiles
Iteration = 1
While iteration ≤ Max_iteration do
Upgrading lead tips energy using equation (16)
Upgrading best and worst leaders using equation (14)
If Max channel time is obtained, then
Move step leader from the worst location towards the best
Rearrange channel time
End if
Upgrading kinetic energy and its direction using equation (16)
Upgrading space and leader projectile
Assess the performance of the projectile
If fork then
Produce two symmetrical channels at the fork point
Remove the channel that has lower energy
End if
Iteration = iteration +1
End while
Return optimum step leader
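The pseudocode above can be rendered as a compact NumPy sketch, given below. This is a simplified, illustrative implementation: the forking step is omitted, the population size, channel time, and bounds are arbitrary, and the toy fitness function stands in for the captioning performance that would be obtained with a candidate set of HCNN hyperparameters (learning rate, batch size, epoch count).

```python
import numpy as np

def lsa(fitness, bounds, n_proj=10, max_iter=50, channel_time=5, seed=0):
    """Minimal Lightning Search Algorithm sketch (no forking) for minimising `fitness`."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    pop = rng.uniform(lo, hi, size=(n_proj, len(lo)))     # transition projectiles, Eq. (10)
    fit = np.array([fitness(p) for p in pop])
    stagnation = 0
    for t in range(max_iter):
        energy = 2.05 - 2 * np.exp(-5 * (max_iter - t) / max_iter)   # lead-tip energy, Eq. (16)
        best, worst = np.argmin(fit), np.argmax(fit)
        if stagnation >= channel_time:                    # channel elimination after stagnation
            pop[worst], fit[worst] = pop[best].copy(), fit[best]
            stagnation = 0
        leader = pop[best]
        improved = False
        for i in range(n_proj):
            if i == best:                                 # lead projectile update, Eq. (14)
                cand = leader + rng.normal(0.0, energy, size=leader.shape)
            else:                                         # space projectile update, Eqs. (11)-(13)
                dist = np.abs(leader - pop[i])
                cand = pop[i] + rng.choice([-1, 1]) * rng.exponential(dist + 1e-12)
            cand = np.clip(cand, lo, hi)
            f_cand = fitness(cand)
            if f_cand < fit[i]:                           # keep only more energetic channels
                pop[i], fit[i] = cand, f_cand
                improved = True
        stagnation = 0 if improved else stagnation + 1
    best = np.argmin(fit)
    return pop[best], fit[best]

# Toy usage: minimise the sphere function over [-5, 5]^3
best_x, best_f = lsa(lambda x: float(np.sum(x ** 2)), np.array([[-5.0, 5.0]] * 3))
```

In the proposed system, each projectile would encode one candidate hyperparameter configuration, and the fitness evaluation would correspond to training and validating the HCNN with that configuration.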
The novel LSA-based tuning strategy is thereby employed to determine the optimal values of the HCNN hyperparameters, such as the learning rate, batch size, and epoch count.
Performance Validation
The proposed model is simulated using Python 3.8.5 on a PC with an i5-8600K CPU, a GeForce GTX 1050 Ti 4GB GPU, 16GB RAM, a 250GB SSD, and a 1TB HDD. The parameter settings are as follows: learning rate 0.01, dropout 0.5, batch size 5, epoch count 50, and ReLU activation.
The experimental image captioning results of the LSAHCNN-ICS model are investigated on three databases (Flickr8k, Flickr30K, and MSCOCO), as given in Table 1. The Flickr8k dataset is a standard collection for sentence-based image description and search, containing 8,000 images, each paired with 5 diverse captions that clearly describe the salient entities and events. The Flickr30k database comprises 31,000 images gathered from Flickr, along with 5 reference sentences provided by human annotators. The MS COCO (Microsoft Common Objects in Context) database is a large-scale dataset for object detection, segmentation, key-point detection, and captioning. These datasets have been widely used in various computer vision and natural language processing tasks, providing rich and comprehensive resources for image captioning and understanding research.
Fig. 3 illustrates some sample images. A brief comparative analysis is made with recent methods, namely the Neural Image Caption (NIC) [36], soft-attention [37], hard-attention [37], Spatial and Channel-wise Attention with CNN VGG (SCA-CNN-VGG) [38], and CNN [39] methods.
In this study, three measures were employed for experimental validation: BLEU, METEOR, and CIDEr. BLEU is a commonly used metric for estimating the quality of generated text; higher BLEU values indicate better translation quality. The METEOR measure relies on the weighted harmonic mean of unigram precision and unigram recall, and it evaluates the alignment accuracy between the candidate and reference translations. The CIDEr index treats every sentence as a "document" and represents it as a TF-IDF vector; the score is the cosine similarity between the generated caption and the reference captions.
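For reference, the snippet below shows how BLEU-1 through BLEU-4 are typically computed with NLTK on a toy reference/candidate pair; the tokenisation and smoothing choices are assumptions and may differ from the exact evaluation scripts used to produce the scores reported here.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "dog", "runs", "on", "the", "grass"],
               ["a", "brown", "dog", "running", "through", "grass"]]]   # multiple references per image
candidates = [["a", "dog", "is", "running", "on", "the", "grass"]]      # generated caption

smooth = SmoothingFunction().method1
for n, weights in enumerate([(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                             (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)], start=1):
    score = corpus_bleu(references, candidates, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```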
Table 2 presents the overall image captioning analysis of the LSAHCNN-ICS algorithm on the Flickr8K database. The experimental values demonstrate that the NIC approach shows the lowest performance, while the soft-attention and hard-attention models exhibit improved outcomes.
Along with that, the SCA-CNN-VGG and CNN models depict closer image captioning performance. However, the LSAHCNN-ICS model outperforms the existing approaches with an increased BLEU-1 of 70.15, BLEU-2 of 53.61, BLEU-3 of 43.70, BLEU-4 of 29.50, METEOR of 26.66, and CIDEr of 43.60.
The training accuracy curves of the LSAHCNN-ICS model are shown in Fig. 4.
Fig. 5 displays the corresponding training loss curves of the LSAHCNN-ICS model.
A broad comparison of the image captioning results of the LSAHCNN-ICS methodology with other systems on the Flickr30K database is described in Table 3. The obtained results show that the NIC methodology reaches the lowest image captioning outcomes.
Figs. 7, 8, and 9 present further experimental results of the LSAHCNN-ICS model.
Discussion
The results confirm the enhancements of the LSAHCNN-ICS model in the image captioning process. The better efficiency of the developed approach is due to the integration of the LSA-based hyperparameter tuning strategy and the unique characteristics of the HCNN model. Since manual hyperparameter tuning impacts the effectiveness of a DL model, automated hyperparameter tuning using the LSA helps to accomplish improved performance over other DL models. By systematically searching through the space of possible configurations, the LSA determines the combination of hyperparameter values that optimizes the model's performance on a specific task, such as image captioning. In addition, the LSA explores different configurations to find the combination that enhances the model's understanding of image content and improves the quality of the generated captions.
More precisely, Table 5 shows the effectiveness of recent systems that considered the same datasets tested in our work. The results obtained by our framework are superior to those reported in [20] for the Flickr8K and Flickr30k datasets. In addition, our results are better than those obtained in [22] in terms of the METEOR metric. Table 5 also illustrates that the LSAHCNN-ICS approach demonstrates better results for the MSCOCO dataset compared to the results achieved in [20] and [23]. Moreover, LSAHCNN-ICS provides better results than [25] in terms of the METEOR metric.
Despite the advantages of our framework, we note that our study is limited to three datasets: Flickr8k, Flickr30K, and MSCOCO. In addition, our system generates only one caption per image, while some applications need to generate multiple captions per image according to a specific purpose and perspective.
Conclusion
In this article, a novel LSAHCNN-ICS methodology was developed for image captioning in NLP. The presented LSAHCNN-ICS technique is an end-to-end model comprising two major parts: the CNN-based ShuffleNet as an encoder and the HCNN as a decoder. In the encoding part, the ShuffleNet model derives feature descriptors of the image, and in the decoding part, the text description is generated using the HCNN model. To achieve improved captioning results, the LSA is applied as a hyperparameter tuning strategy. The simulation analysis of the presented LSAHCNN-ICS technique was performed on benchmark databases, and the achieved results report the superior outcomes of the LSAHCNN-ICS algorithm over existing systems, with maximum CIDEr values of 43.60, 59.54, and 135.14 on the Flickr8k, Flickr30k, and MSCOCO datasets, respectively. The enhanced performance is owing to the addition of the LSA-assisted hyperparameter tuning process and the unique characteristics of the HCNN model. Therefore, the proposed model can be used to improve assistive technology and aid the visually impaired in comprehending their environment. In future work, the efficiency of the LSAHCNN-ICS method can be improved by using a weighted voting ensemble DL model. The developed method could also be enriched by designing hybrid meta-heuristic algorithms for hyperparameter tuning. The proposed image captioning model can be computationally expensive, especially when dealing with large images and complex architectures, whereas real-time applications require efficient and optimized models to generate captions quickly. The time required for generating captions can introduce significant latency in real-time applications, so reducing the inference time is essential to ensure a smooth user experience. Moreover, in real-time scenarios, the model may encounter objects or scenes it has not seen during training; robustness to such unseen concepts is crucial for accurate and relevant captions.
ACKNOWLEDGMENT
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R408), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. Research Supporting Project number (RSPD2024R521), King Saud University, Riyadh, Saudi Arabia. This study is supported via funding from Prince Sattam bin Abdulaziz University project number (PSAU/2023/R/1444). This study is partially funded by the Future University in Egypt (FUE).