I. Introduction
Multi-modal approaches have attracted increasing attention in recent years owing to their ability to represent information distributed across multiple modalities. A multi-modal representation is essential when a single modality cannot capture the objective of a task. Answering questions about an image (visual question answering) [1]–[3], describing a scene in natural language (image captioning) [4]–[6], generating diverse, meaningful, and goal-oriented questions (visual question generation) [7]–[9], and various real-life applications [10], [11] are representative multi-modal tasks.

Visual question generation (VQG) is a generative task that focuses on producing questions from images. Considerable effort has recently been devoted to incorporating VQG into many applications [7], [12]. The main purpose of the VQG process is to generate task-oriented, relevant, and diverse questions from images. Question generation is not a one-to-one mapping: for a given image, the objective is not to predict one specific question; rather, a set of plausible questions can be generated from a single image.

Generated questions should be task-specific, in the sense that the model should be flexible enough to generate questions conditioned on answer types. For instance, if we want the model to generate questions whose answers are binary, the questions will follow a specific structure (e.g. questions starting with "is" or "do/did"); if we want the model to generate questions whose answers are counts, the questions should start with "how many". Furthermore, the generated questions should be relevant to the objective of the task. Questions such as "how many objects are there in the image?" and "what is the color of the object?" are very generic; such uninformative questions add little to the question generation process. To elicit a high-level understanding of a scene through an answer, the model should focus on generating relevant questions. Finally, diversity in the VQG task enables the model to generate questions that do not appear in the training data. These unseen settings make it possible to examine the robustness of many models [1], [7] in terms of performance.
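As a minimal formalization of the desiderata above (the notation here is ours, not taken from any specific prior model), the VQG objective can be cast as learning a conditional distribution over questions rather than a deterministic mapping:

$$q \sim P_{\theta}(q \mid I, c),$$

where $I$ is the input image, $c$ is an optional answer-type condition (e.g. binary or counting), and $\theta$ denotes the model parameters. Sampling several questions $q$ from this distribution for a single image reflects the one-to-many nature of the task; conditioning on $c$ provides task specificity, and grounding the distribution in $I$ is what keeps the generated questions relevant to the scene.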