I. Introduction
One of the most basic yet difficult computer vision tasks is scene segmentation, which aims to separate and parse a scene image into various image regions related to semantic categories, including background texture (such as the mountain and the sky) and distinct objects (e.g., car, person, and bike) at the pixel level. The outcome of semantic segmentation maintains the very same matching resolution as the input image, calling it a dense prediction. Understanding images is an important step in vision systems that has countless uses. Finding patterns of interaction among class categories for segmenting an image into semantically meaningful categories is a major challenge for an image segmentation model. Accurate pixel categorization into object categories in complicated natural situations is a difficult task due to the visual heterogeneity across structured and unstructured items [l].