Introduction
Surround-view fisheye cameras have gained significant attention due to their wide field of view (FoV), particularly in the context of autonomous driving [1]. The first wide-angle rear-view camera paired with a TV display was introduced in General Motors' Buick Centurion concept model in 1956 [2]. In 2008, BMW implemented surround-view cameras for parking assistance [3]. By 2018, the United States mandated rear-view cameras to prevent reversing accidents [4]. These fisheye cameras are also commonly used in other domains such as video surveillance [5] and augmented reality [6]. Today, surround-view cameras are standard in many vehicles and are used in computer vision applications such as cross-traffic alerts [7], object detection [8], and automated parking [9].
Typical automotive surround-view systems employ four cameras positioned around a vehicle to capture a full 360° view of its near-field surroundings.
The top figure displays a typical automotive surround-view system, which includes four fisheye cameras mounted at the front, rear, and on each side mirror of the vehicle. The bottom figure provides a near-field perspective, capturing a complete 360° view of the vehicle's immediate surroundings.
Integrating accurate semantic segmentation algorithms with fisheye images enhances autonomous vehicles' capabilities, allowing them to interpret the captured scenes containing elements such as roads, pedestrians, and other critical objects around the vehicle and to navigate adeptly across diverse driving scenarios, as discussed in [10]. However, a significant challenge for semantic segmentation algorithms arises from the geometric distortion inherent in surround-view optical fisheye lenses [11]. Fig. 2 illustrates the geometric distortion in one of the fisheye images from WoodScape [12], where objects appear more deformed in scale and size due to the spatially variant distortion, especially those closer to the camera, with considerable pixel distortion at the image boundary.
Illustration of object deformation and larger pixel distortion at the boundary of the WoodScape fisheye image.
One of the challenges associated with fisheye semantic segmentation is the limited availability of fully annotated fisheye datasets; few public datasets are available for fisheye images, resulting in limited research in the fisheye domain [1]. For example, the Oxford RobotCar dataset is one such dataset that offers fisheye camera images for autonomous driving [13]. It includes over 100 repetitions of a consistent route through Oxford, U.K., captured over a year, and is used primarily for long-term localization and mapping. OmniScape is a synthetic dataset featuring front-view fisheye RGB images, captured using cameras mounted on a motorcycle, designed for semantic segmentation research but lacking semantic annotations [14]. Additionally, KITTI-360 [15] is a suburban driving dataset that provides a wide range of input modalities, extensive semantic instance annotations, and precise localization aimed at supporting research in vision, graphics, and robotics; however, it does not provide semantic annotations for its fisheye images. One of the earliest publicly available real-world datasets for this domain is the WoodScape dataset [12], featuring surround-view fisheye images for automotive applications to advance multi-task, multi-camera computer vision research for autonomous driving. The dataset includes four fisheye cameras that together provide a full 360° view of the vehicle's near-field surroundings.
Another associated challenge is that fisheye perception tasks have been studied less than those for pinhole cameras, which exhibit minimal geometric distortion. Although CNN methods have become a cornerstone of computer vision, achieving remarkable success in tasks such as image classification, object detection, and semantic segmentation [17], they are not without limitations. The deformation of objects in fisheye images complicates the learning process for CNNs, which rely on translation invariance as an inductive bias and have fixed receptive fields. Consequently, the model needs to learn all distorted versions of an object efficiently, increasing the complexity of the sample set. Therefore, CNN algorithms face difficulties adapting to fisheye cameras due to the pronounced radial distortion in feature maps, which may not be accounted for during training [18]. Moreover, with fisheye images, using a bounding-box technique for object detection becomes more complex, as the box does not fit the distorted objects optimally. From this perspective, more sophisticated representations, such as curved bounding boxes that account for the known radial distortion of fisheye lenses rather than fixed rectangular boxes, were discussed in [8]. It should be noted that, apart from the geometric distortion, other optical artifacts are present in automotive fisheye cameras [11]; however, this paper does not consider these other optical effects.
CNN algorithms are primarily designed for rectified pinhole camera images. A common yet simplistic approach for applying CNNs to fisheye imagery involves rectifying fisheye images before applying standard algorithms; however, this method introduces significant drawbacks, such as a reduced field of view and resampling distortion artifacts near the edges [19]. These limitations have fueled a growing interest in exploring alternative or complementary approaches, particularly regarding robustness, interpretability, and adaptability in image analysis tasks. Hence, specialized CNN architectures capable of effectively handling these geometric distortions are required. The adoption of surround-view systems has become more widespread, and recent research efforts have begun to address this challenge through deformable convolution methods [20].
Spherical images, often captured by dedicated 360° cameras or created by stitching multiple images together, exhibit their own projection-induced geometric distortions, which has motivated a separate line of research on spherical perception models.
Most recently, advancements have been made to build dynamic CNNs and ViTs by introducing deformable convolutions as an intrinsic component for addressing the unique distortions in spherical images [25], [26]. Additionally, following the remarkable performance of ViTs [27], [28] in image analysis tasks, some recent studies have attempted to build ViTs that handle the unique geometry of spherical images, as discussed in [24], [29], [30], [31]. Moreover, spherical transformers [24], [29], [30], [31] account for the unique deformations in surround-view spherical images.
Proposed framework for fisheye semantic segmentation using spherical neural networks.
Fisheye images taken from the real-world Woodscape dataset and segmented by the assessed spherical segmentation models. (a) Original Fisheye Image, (b) Ground Truth Mask, and (c), (d), (e) are example segmented results from the spherical models PASS, OOSS, and Trans4PASS+, respectively.
In summary, our contributions are as follows:
We provide a comprehensive survey of existing models, including CNNs and ViTs, applied to semantic segmentation in both spherical and fisheye domains for Autonomous Driving.
Notably, we apply spherical models, originally designed for spherical images, to the fisheye image domain; although fisheye images are not spherical, they share similar geometric distortion characteristics with spherical images.
Our comparison represents the first attempt to apply efficient semantic segmentation models based on spherical methods to raw real-world WoodScape [12] and synthetic fisheye datasets (i.e., SynWoodScape and SynCityscapes-Fisheye [16], [35]).
We demonstrate the effectiveness of fisheye semantic segmentation with spherical neural networks, which achieve better segmentation accuracy on fisheye images while maintaining real-time performance.
The structure of the paper is as follows: Section II presents a review of the commonly used CNN and ViT models for semantic segmentation. It also discusses enhancements of these models designed for semantic segmentation, evaluates their significance for spherical and fisheye images, and surveys existing CNNs and ViTs for surround-view imaging and their associated challenges. Section III outlines the methods and datasets used for fisheye semantic segmentation, including a comparison of the spherical models with standard CNNs and ViTs on real-world and synthetic fisheye datasets. Section IV describes the experimental setup, results, and analysis. Finally, Section V presents the concluding remarks and highlights potential future research directions for the autonomous vehicle community.
Background
This section provides a comprehensive overview of the types of surround-view images commonly used in different applications and advancements in semantic segmentation algorithms. We explore how these algorithms have been adapted specifically for surround-view images, focusing on methods that address challenges such as image distortion in spherical and fisheye domains. Additionally, we discuss the most widely used ViTs for semantic segmentation and the development of dynamic algorithms to handle these distortions more effectively.
A. Types of Surround-View Images
Two main types of omnidirectional images exist: spherical and fisheye images [36]. Other omnidirectional camera modalities exist, such as catadioptric cameras [37], [38], but these are relatively uncommon (particularly in the automotive space) and, as such, are not considered in this work. Both spherical and fisheye cameras are increasingly relevant in computer vision due to their ability to capture a wide field of view, providing a comprehensive visual context that is invaluable across various applications. Spherical images are typically created by stitching multiple images together, resulting in a broad, continuous view ideal for applications such as virtual reality, surveillance, and mapping systems [23]. In contrast, fisheye images achieve a wide FoV through a single lens, making them suitable for real-time applications where capturing a large area in a single shot is essential, such as in automotive surround-view systems, robotics, and immersive media [39]. Both image types enhance situational awareness and spatial understanding, which are crucial for navigation and scene-understanding tasks. However, they pose unique challenges due to geometric distortions inherent in the projection methods that map three-dimensional scenes onto two-dimensional surfaces. Addressing these challenges requires specialized algorithms and models to maximize their potential while mitigating issues related to image quality and computational complexity [11]. Despite their differences, spherical and fisheye images exhibit superficially similar distortion characteristics, particularly where objects near the periphery experience greater distortion and deformation, as illustrated in Fig. 5, where dual fisheye cameras are mapped to a spherical image via equirectangular projection. Even though differences exist, for example in that vertical lines map to vertical lines in the spherical image and in the “stretching” at the poles, this paper explores the application of spherical models to fisheye imagery.
Fisheye vs. Spherical Image: On the left (Fisheye Image), straight lines appear curved at the edges of the image, especially the ceiling and floor tiles, and objects near the center experience significant deformation (like the Buddha statues in the image). On the right (Spherical Image), features near the poles appear stretched (e.g., ceiling design and floor tiles), whereas those along the equator retain a more natural appearance. Both image types exhibit geometric distortions unique to their projection methods. (Image reproduced with permission from Paul Bourke.)
B. Surround View Image Rectification and Drawbacks
Surround-view image distortions can be corrected to allow the use of standard perception algorithms. Although this approach can expedite the development of fisheye camera perception, it introduces several issues. Firstly, it is theoretically impossible to fully rectify a fisheye image into a rectilinear view because the horizontal field of view exceeds 180°. Secondly, even partial rectification introduces resampling artifacts and reduces the effective field of view, particularly near the image periphery [19].
Fisheye image rectification methods: (a) Rectilinear mapping; (b) Piecewise linear mapping; (c) Cylindrical mapping. Left: Raw image; Right: Rectified image.
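To make the cylindrical mapping above concrete, the following minimal sketch remaps an equidistant fisheye image onto a cylindrical view with OpenCV. The focal length, output size, and the equidistant model r = fθ are illustrative assumptions rather than the calibration of any camera discussed in this paper.

```python
# Minimal sketch of cylindrical rectification for an equidistant fisheye image.
import cv2
import numpy as np

def cylindrical_rectify(fisheye, f=300.0, out_w=1280, out_h=800):
    cx, cy = fisheye.shape[1] / 2.0, fisheye.shape[0] / 2.0
    phi = (np.arange(out_w) - out_w / 2.0) / f   # horizontal angle (rad)
    h = (np.arange(out_h) - out_h / 2.0) / f     # normalized cylinder height
    phi, h = np.meshgrid(phi, h)
    # Ray on the unit cylinder -> 3D viewing direction.
    x, y, z = np.sin(phi), h, np.cos(phi)
    theta = np.arccos(z / np.sqrt(x**2 + y**2 + z**2))  # angle from optical axis
    r = f * theta                                       # equidistant projection r = f * theta
    norm = np.sqrt(x**2 + y**2) + 1e-9
    map_x = (cx + r * x / norm).astype(np.float32)
    map_y = (cy + r * y / norm).astype(np.float32)
    return cv2.remap(fisheye, map_x, map_y, cv2.INTER_LINEAR)
```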
C. Semantic Segmentation: Methods and Enhancement
Semantic segmentation involves assigning a class label to each pixel in an image, identifying entities such as pedestrians, roads, or curbs. This field has gained significant attention and has made substantial progress since the introduction of Fully Convolutional Networks (FCNs) [41], [42], which framed semantic segmentation as an end-to-end per-pixel classification task. Following FCNs, subsequent efforts have enhanced segmentation performance by using encoder-decoder architectures [43], aggregating high-resolution representations [44], widening receptive fields [45], and collecting contextual priors [46]. Inspired by non-local blocks, self-attention [47] is leveraged to establish long-range dependencies within FCNs. Contemporary architectures substitute convolutional backbones with transformer ones [48], [49]. Thus, image understanding can be viewed from a sequence-to-sequence learning perspective with dense prediction transformers [27], [28], [50] and semantic segmentation transformers [51], [52]. More recently, Multi-Layer Perceptron (MLP) based architectures [53], [54], [55], [56] that alternate spatial and channel mixing have shown promise in tackling visual recognition tasks.
However, most of these methods are designed for narrow-FoV pinhole images, which have minimal distortion, and tend to suffer significant performance degradation when applied to fisheye semantic segmentation. Therefore, this study aims to tackle semantic segmentation with state-of-the-art computer vision algorithms, specifically spherical methods, to effectively manage the unique geometry of fisheye images through parallel MLP-based channel-wise mixing and pooling mixing mechanisms. MLP-based mixing uses fully connected layers for spatial or channel-wise information mixing, while pooling mixing aggregates features across spatial or channel dimensions using pooling operations such as average or max pooling.
D. Deformable and Dynamic ViTs
Recently, studies have focused on making ViTs dynamic due to their widespread application. For instance, the non-overlapping Dynamic Attention Transformer (DAT) [57] and the anchor-based Deformable Prediction Transformer (DPT) [58] incorporate deformable designs in the later stages of the encoder, leveraging Feature Pyramid Network (FPN) decoders adapted from their CNN counterparts. Additionally, Deformable DETR [20] employs deformable attention to enhance feature maps, while PS-ViT [59] identifies discriminative regions through a progressive sampling module. Other approaches also aim to improve the efficacy of ViTs by dynamically modeling relevant dependencies using query grouping [54] or adaptively optimizing the number of informative tokens [59], [60], [61], [62]. However, all these efforts focus on pinhole images with a narrow FoV and may not effectively handle semantic segmentation tasks on distorted, wide-FoV imagery. Therefore, there is a need to develop a segmentation transformer tailored for wide-FoV images such as fisheye images. Such a model would effectively adapt to fisheye perspectives by learning to mitigate severe object deformation and fisheye distortion at the image boundary and for objects closer to the camera.
E. Dynamic CNN Methods for Surround-View Image Semantic Segmentation
CNN-based approaches have recently outperformed traditional computer vision methods in semantic segmentation tasks using pinhole front cameras [63]. However, autonomous vehicles require a broader FoV to perceive their surroundings effectively in urban traffic scenarios. To address this need, Deng et al. [64] developed the Overlapping Pyramid Pooling Network (OPP-Net), which utilizes a zoom augmentation technique with multiple focal lengths to generate various fisheye images along with their corresponding annotations to enhance the model's generalization performance. The OPP-Net was trained and evaluated on the Cityscapes dataset [65], explicitly focusing on fisheye images in urban traffic contexts. Extensive experiments demonstrated the effectiveness of the zoom augmentation approach, showing that the OPP-Net performed well in urban traffic scenarios. However, no experiments have been conducted to assess the model's performance on real-world fisheye datasets that exhibit natural distortion effects, raising questions about its generalizability.
Kumar et al. [19] introduced a multi-task visual perception network designed to process unrectified fisheye images, enabling vehicles to sense their surroundings, including semantic segmentation based on standard methods. Their approach featured a shared encoder for significant computational efficiency and synergized decoders that allow multiple visual perception tasks, such as object detection, semantic segmentation, and depth estimation, to support each other. They also proposed a novel camera geometry-based adaptation mechanism to encode the fisheye distortion model during training and inference. However, they mitigated boundary distortion artifacts by cropping different fields of view with varying dimensions of the WoodScape data. Moreover, multi-task visual perception models benefit from representation learning shared across the associated tasks rather than learning separate representations in the encoder for each task, which is computationally expensive and requires costly annotated labeled data.
Saez et al. [66] developed a fisheye dataset for semantic segmentation by leveraging the Cityscapes dataset [65] and proposed a real-time semantic segmentation technique that adapts the Efficient Residual Factorized Network (ERFNet) to fisheye road scenes, leveraging only synthetic images instead of a real-world dataset. Moreover, their approach struggles to segment smaller classes, such as traffic signals and poles, when tested on real-world data.
To tackle the distortion characteristics of fisheye images, Paul et al. [67] proposed a novel training method by employing Transfer Learning (TL) on existing semantic segmentation models that concentrate on a single task and modality. To achieve this, they investigated six different fine-tuning configurations using the WoodScape fisheye image segmentation dataset [68]. Furthermore, they introduced a pre-training stage that learns from perspective images instead of fisheye data by applying a fisheye transformation before employing transfer learning. One limitation of this study is the limited exploration of training strategies during the pre-training stage, which could be investigated for better mIoU scores during semantic segmentation. Moreover, transfer learning does not always deliver optimal results.
Further, Koki et al. [69] introduced an approach to exploit the distortion characteristics of fisheye data by first extracting distorted class labels based on an object's distance from the center of the image. They investigated how a contrastive learning methodology can enforce a model's representation space to reflect the distortion and semantic interaction inherent within fisheye data. However, they evaluated their approach explicitly on the object recognition task rather than on the semantic segmentation task.
Paul et al. [70] introduced FishSegSSL, a novel semi-supervised framework for fisheye image segmentation. Their framework features three key components: pseudo-label filtering, dynamic confidence thresholding, and robust substantial augmentation. They evaluated their approach using the real-world Woodscape dataset [12]. However, they utilized a benchmark framework initially developed for perspective images to address the data scarcity problem in fisheye image segmentation through semi-supervised learning rather than focusing on fisheye distortion characteristics.
Deng et al. [71] proposed a Restricted Deformable Convolution (RDC) to deal with the distortion issues in fisheye images. RDC enables effective geometric transformation modeling by learning the shape of the convolutional filter, applying deformable convolution [72] to the last convolutional layers conditioned on the input feature map. In addition, the authors presented a zoom augmentation technique for converting perspective images into fisheye images, which facilitates the creation of a large-scale training set of surround-view camera images. A multi-task learning (MTL) strategy is used to train for real-world surround-view camera images by combining real-world and transformed images. The models were trained on the Cityscapes [65], fisheyeCityScapes [35], and SYNTHIA [73] datasets and tested on transformed fisheye images. However, no evaluation has been done on real-world surround-view images.
Yaozu et al. [35] presented a 7-degrees-of-freedom (7-DoF) augmentation technique for converting rectilinear perspective images into fisheye images. It models the spatial relationship between the real-world fisheye coordinate system (6-DoF) and the virtual fisheye camera's focal length (1-DoF). During the training phase, rectilinear images are turned into fisheye images in 7-DoF to replicate those taken by real cameras with various locations, orientations, and focal lengths. In this way, the authors improve the model's accuracy and robustness when dealing with distorted fisheye data. The 7-DoF augmentation provides a generic solution for fisheye semantic segmentation with definite parameter settings for autonomous driving augmentation, and the authors created the fisheyeCityScapes dataset [35]. However, [35] mainly focuses on the data augmentation method for fisheye semantic segmentation and does not design the network structure according to the actual characteristics of fisheye images.
To address the performance degradation caused by camera viewpoint changes and to effectively consider the spatially variant distortion characteristics of fisheye images, Cho et al. [10] proposed a viewpoint augmentation technique to enhance fisheye image semantic segmentation. Using the fisheye camera projection model, they unprojected the image into 3D space on a unit sphere. They then simulated camera orientation and position changes by transforming the 3D points on the unit sphere. Finally, they generated augmented images by reprojecting the transformed 3D points back to 2D using the fisheye projection model. They evaluated their method on the real-world WoodScape dataset [12]. However, the process involves extensive pre-processing to incorporate the distortion characteristics of the fisheye dataset into the training.
Clement et al. [74] demonstrated that Deformable convolutions can be applied to existing CNNs without altering their pre-trained weights. This approach benefits systems that utilize multiple image modalities by allowing reliable adjustments without the need for retraining from scratch. They showed that Deformable components could be trained independently without fine-tuning, eliminating the need for large datasets of labeled fisheye images. Despite achieving high performance, their work has two limitations: it lacks direct validation on real-world surround-view images and does not compare their network with other modern techniques.
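The idea of attaching trainable deformable layers to a frozen, pre-trained network can be sketched as follows. The channel sizes, the placeholder backbone, and the zero-initialized offsets are illustrative assumptions, not the configuration used in [74].

```python
# Minimal sketch: a deformable convolution head on top of a frozen feature extractor.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableHead(nn.Module):
    def __init__(self, in_ch=256, out_ch=256, k=3):
        super().__init__()
        # Regular conv that predicts 2 offsets (dx, dy) per kernel sample.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)   # start as a standard convolution
        nn.init.zeros_(self.offset.bias)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))

# Usage: append to a frozen backbone; only the DeformableHead parameters are updated.
backbone = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.ReLU())  # placeholder backbone
for p in backbone.parameters():
    p.requires_grad = False
head = DeformableHead()
feats = head(backbone(torch.randn(1, 3, 64, 64)))
```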
El Jurdi et al. [75] investigated deformable convolutions for fisheye image segmentation, demonstrating their effectiveness in modeling the geometric transformations and distortions of fisheye images. Their study explored multiple integration strategies for deformable convolutional blocks within a residual U-Net, highlighting the importance of using surround-view data as an augmentation method for front-view fisheye image segmentation. Their experiments showed that adding more deformable convolutional blocks to the front view improves results, and that training on surround-view images yields better performance for front-view segmentation than training solely on front-view images. Additionally, they found that replacing rectangular convolutions with deformable convolutions in a fully residual U-Net enhances efficiency on both the WoodScape [12] and SynWoodScape [76] datasets. They propose that using a class-reweighting scheme and integrating deformable convolutions into the transformer framework could improve the model's ability to handle fisheye image deformations while maintaining computational efficiency, which could be considered for our study.
To address the scarcity of fisheye image segmentation data, Lee et al. [77] introduced a novel data augmentation algorithm that combines camera images with point cloud data from LiDAR sensors. Their approach involves stitching multiple images from different datasets to generate synthetic fisheye images. Using both CNNs and ViT architectures, they demonstrated that their augmentation technique significantly enhances model performance, particularly in detecting small objects within fisheye segmentation tasks. However, while this algorithm effectively improves semantic segmentation through data augmentation, it relies heavily on LiDAR-based augmentation rather than leveraging real-world fisheye data, which may limit its applicability in scenarios where LiDAR data is unavailable or impractical.
F. Dynamic CNN Models for Spherical Image Semantic Segmentation
Literature studies indicate that spherical images exhibit geometric distortions similar to those observed in fisheye images [24]. Consequently, research efforts to address distortions in spherical image semantic segmentation using dynamic CNN architectures may also improve semantic segmentation in fisheye images. To address the challenges of spherical image distortion, researchers have developed spherical CNNs. Zhao et al. [78] introduced a method that samples different non-regular regions of spherical images and applies the same convolution kernel across various locations to account for the varying distortion effects. This approach also tackles the boundary problem in image classification, a common issue for spherical images. They conducted experiments using the CIFAR [79] and MNIST [80] datasets, and their evaluation was limited to classification tasks. They did not explore semantic segmentation with the introduced distortions, nor did they validate their network on real-world surround-view images that exhibit significant radial distortion. Additionally, given the complexity of automotive lens geometries, spherical models may be a suitable fit for this domain; thus, generalizing spherical CNNs to handle more complex fisheye surfaces effectively is a promising direction [1].
Zang et al. [81] introduced a novel method to perform CNN operations on spherical images and addressed semantic segmentation of omnidirectional images on icosahedron spheres by proposing HexUNET, an orientation-aware CNN framework. They evaluated their method on the dataset of [82] and on Omni-SYNTHIA [73]. However, they evaluated their network only on synthetic rather than real-world datasets.
Yang et al. [25] proposed a panoramic annular semantic segmentation (PASS) framework, incorporating a data augmentation method that distorts perspective images in the training set. They employed planar ERF-PSPNet after unfolding and partitioning the spherical images. Their evaluation involved testing different fields of view (FoVs) in the panorama, and the analysis revealed that as the FoV increases, the per-segment results tend to decrease.
Yang et al. [26] proposed an unsupervised omnidirectional semantic segmentation framework that integrates multiple data sources. Their multi-source data distillation approach significantly enhances learning efficiency. It allows efficient CNNs, such as ERF-PSPNet, to effectively handle the large field of view of omnidirectional images for semantic segmentation. The results demonstrate that a single model can detect various object classes, improving the semantic understanding of real-world environments. Xu et al. [83] created a dataset of spherical images by stitching synthetic images captured from different directions using the SYNTHIA dataset. Their results demonstrate that spherical images enhance segmentation performance. Additionally, they found that training the model with spherical images featuring a 180-degree field of view (FoV) leads to improved performance. Moreover, models trained with spherical images exhibit better resilience to distortion.
Orhan et al. [84] addressed the distortion in surround-view spherical images by proposing UNet-equiconv, which employs equirectangular convolution to explicitly model the offsets in the convolution kernel, thereby mitigating the effects of distortion. They conducted several experiments to compare the performance of UNet with standard convolution against UNet-equiconv. Their results indicate that equirectangular convolution leads to significant improvements, particularly for small-sized objects that are often weakly represented. The authors primarily trained their network using the Cityscapes dataset [65] and subsequently fine-tuned it on spherical images, finding that training on spherical images further enhanced the results.
Hu et al. [85] introduced a novel semantic segmentation network designed for spherical images of outdoor scenes in autonomous applications. The network employs an encoder-decoder architecture, where the encoder includes a distortion convolutional module (DCM), a residual network, and atrous spatial pyramid pooling (ASPP). The DCM is responsible for correcting image distortion, while the decoder utilizes a deep feature aggregation network (DFAN) to combine low- and high-level features, enhancing segmentation accuracy. Their experiments demonstrate the network's robust performance across various outdoor scenes, highlighting its potential value for measurement applications. However, the DCM is not lightweight, which may affect computational efficiency in scenarios that demand high processing speeds.
In addition, Sekkat et al. [36] conducted a comprehensive comparative analysis of different deep learning approaches for processing various representations of omnidirectional images, including perspective, equirectangular, spherical, and fisheye formats. The study aims to identify the most effective method for segmenting road scenes in omnidirectional images, utilizing real-world perspective images, synthetic perspective images, simulated fisheye images, and a test set of real-world fisheye images. Their qualitative and quantitative analyses indicate that models employing planar convolutions outperform those using spherical convolutions. Furthermore, the findings suggest that models trained on omnidirectional images transfer more effectively to standard perspective images than vice versa. However, the study does not evaluate the performance of state-of-the-art ViTs on different omnidirectional image formats.
Cao et al. [86] proposed a novel Occlusion-Aware Seamless Segmentation (OASS) framework, leveraging the UnmaskFormer architecture. This framework integrates distortion- and occlusion-aware features within a transformer-based design to address the challenges posed by distortions and occlusions in panoramic image segmentation. To improve unsupervised domain adaptation and mitigate the effects of diverse occlusions, they introduced the Amodal-oriented Mix (AoMix) method, effectively bridging the gap between pinhole and panoramic image domains. Furthermore, they developed the BlendPASS dataset to support their framework. Evaluating the OASS framework on diverse autonomous driving datasets, particularly those containing surround-view fisheye images, could provide valuable insights into its scalability and generalization capabilities.
Recently, Zheng et al. [21] introduced an Open Panoramic Semantic Segmentation (OPS) method to address the challenge of open-vocabulary panoramic semantic segmentation for unrestricted scene understanding. Their approach trains models on pinhole images in the source domain within an open-vocabulary setting and evaluates them on open panoramic images in the target domain. The approach enables zero-shot panoramic segmentation by incorporating a Deformable Adapter Network (DAN) and a novel data augmentation technique called Random Equirectangular Projection (RERP). Despite the advantages of open-vocabulary models, Zheng et al. acknowledged that their performance lags behind that of closed-vocabulary settings. Moreover, the architectural design does not fully leverage the geometric properties of wide-FoV panoramic images.
G. ViTs for Surround-View Fisheye Images
With advancements in ViTs focused on narrow fields of view (FoV), significant progress has recently been made in developing ViTs for the semantic segmentation of spherical images in autonomous driving. However, using standard CNNs or ViTs on such data presents challenges due to projection and distortion errors that arise when mapping spherical images to a rectangular grid. In [87], Shi et al. introduced FishDreamer. This framework enhances ViTs by incorporating a Polar-aware Cross-Attention (PCA) module to manage polar distributions and distortions in fisheye images. They utilized the Cityscapes-BF [65] and KITTI360-BF [15] datasets for training and evaluation. The PCA effectively addresses the unique polar distortions and improves semantic content generation. However, the generalization of FishDreamer to real-world fisheye images remains untested.
Carlsson et al. [88] developed the HEAL-SWIN transformer, designed to operate on high-resolution spherical data by integrating the HEALPix grid with an adapted SWIN transformer [27]. They leverage the hierarchical similarities between HEALPix and SWIN to implement effective windowing and shifting mechanisms, particularly useful for handling partial spherical data. Their work explicitly addresses the fisheye datasets [12] and [16], processing them as distortion-free spherical signals to correct fisheye image distortions. However, they did not report class-wise Intersection over Union (IoU) scores for smaller objects, focusing instead on overall mIoU scores for both datasets, which can be biased by the dominant classes. Moreover, they note that their Swin-UNet [89] architecture, which serves as the basis for the HEAL-SWIN setup, is not state-of-the-art in semantic segmentation tasks. Implementing a modern ViT decoder head with HEALPix could potentially enhance performance. Furthermore, their model lacks equivariance to spherical rotations, as discussed in [90]. For tasks requiring equivariance, such as semantic segmentation, incorporating rotational equivariance could significantly improve performance and provide valuable insights into local transformation equivariance. Sekkat et al. [31] introduced the Trans4PASS model, designed to address geometric distortions in spherical images through Deformable Patch Embedding (DPE) and Deformable MLP (DMLP) components. Despite its lightweight architecture, comprising approximately 14 million parameters, Trans4PASS achieves a notable mean Intersection over Union (mIoU) score of 51.2%, comparable to larger models like ResNet-101, which has around 44 million parameters and a mIoU of 52.0%. This model shows promise for adaptation across various imaging domains, including pinhole, fisheye, and spherical images.
Building on this work, [24] further enhances Trans4PASS with a more efficient decoder featuring the DMLPv2 module and parallel token mixing mechanisms, naming it Trans4PASS+. They also proposed a Mutual Prototypical Adaptation (MPA) strategy to transfer semantic information from label-rich to label-scarce domains and introduce the SynPASS dataset for supervised training. This dataset supports a Synthetic-to-Real (SYN2REAL) domain adaptation paradigm, providing a new perspective compared to the Pinhole-to-Spherical (PIN2PAN) scenario. Continued research is needed to fully evaluate Trans4PASS's effectiveness across diverse datasets, particularly in fisheye domains with more significant distortions.
Furthermore, [91] employed the Trans4PASS+ network discussed in [31], taking into account the 3D properties of the original spherical images.
Zheng et al. [92] introduced a computationally efficient and straightforward Unsupervised Domain Adaptation (UDA) framework that effectively addresses distortion issues in spherical semantic segmentation. They propose a Distortion-Aware Attention (DA) mechanism that captures the distribution of neighboring pixels without relying on geometric constraints. Additionally, they present a class-wise feature aggregation (CFA) module that iteratively refines feature representations using a memory bank. However, their UDA approach is limited to adapting from a single source dataset to a single target dataset. Exploring multi-source domain adaptation for spherical semantic segmentation could yield valuable insights. Furthermore, investigating how to leverage different spherical projections for enhanced knowledge transfer would be beneficial.
Recently, Zhang et al. [93] introduced the GoodSAM model, which aims to create an efficient spherical semantic segmentation model using the Segment Anything Model (SAM) [94]. This approach addresses the challenges of applying SAM directly to spherical segmentation. The GoodSAM model incorporates Distortion-Aware Rectification (DAR) and Multi-Level Knowledge Adoption (MKA) modules to produce reliable ensemble logits and facilitate effective knowledge transfer for spherical images. Fine-tuning GoodSAM could enhance its applicability as a foundational segmentation model for various surround-view images.
The developments mentioned above highlight that while prior work has attempted to reduce fisheye distortion through architectural adjustments and training augmentations, most efforts have focused on data augmentation, multitask learning, modifying model layers, preprocessing, or generating synthetic fisheye data to improve semantic segmentation performance. However, no existing work has guided models to learn a representation space that captures the interaction between distortion and semantic context in fisheye data. Spherical models have addressed this issue with distortion-aware modules in the network, which could be adapted for the fisheye domain.
Methodology for Fisheye Semantic Segmentation Using Spherical Methods
Building on existing research in fisheye and spherical image segmentation methods, we address the distortions introduced by fisheye lenses in semantic segmentation tasks by leveraging advanced spherical CNN techniques [25], [26] and spherical ViT-based methods [24], [31]. Table 1 provides an overview of the spherical models, with links to the GitHub repositories where they have been made available. We used real-world and synthetic fisheye datasets for our analysis to ensure a robust evaluation. The real-world dataset was WoodScape [12], while the synthetic datasets comprised SynWoodScape [16] and Synthetic Cityscapes-Fisheye [35]. To enable a comprehensive comparison, we evaluated various models, including both spherical and non-spherical approaches. The spherical models included spherical CNNs (PASS and OOSS) and spherical ViTs (Trans4PASS and Trans4PASS+). As non-spherical baselines, we evaluated Swin-UNet [89] and UNet [41]. We have also included networks designed or trained for fisheye imagery in our WoodScape results: we report the results from our previous work on a deformable UNet architecture for fisheye [95] and from [67], where a SegFormer model [52] is presented with a bespoke fine-tuning method across multiple datasets (final fine-tuning on WoodScape). Both of these methods provided results only for fisheye. Table 1 summarizes all of the state-of-the-art models that we found to be in use for spherical image semantic segmentation, highlighting their main characteristics and code availability; this table underpins the rationale behind our model selection for the proposed approach, evaluation, and analysis. Finally, the assessment compared methods using both quantitative metrics and qualitative analysis, with particular emphasis on visualizing the segmentation of smaller objects during test-set evaluations.
A. Dataset
This subsection defines the fisheye datasets used to evaluate our approach. The selected datasets provide semantic RGB images and their corresponding semantic annotation masks, simplifying the training process by eliminating the need for additional annotation efforts. We evaluated our approach using both real-world and synthetic datasets (Fig. 7). Below is a complete description of each dataset and its characteristics.
1) WoodScape
The WoodScape dataset [12] is a real-world collection of 8,234 annotated fisheye images captured from four distinct vehicle viewpoints: Front View (FV), Mirror-View Right (MVR), Mirror-View Left (MVL), and Rear View (RV). These images, recorded in diverse driving environments, provide semantic annotations for 10 classes: road, lane markings, curb, person, rider, car, bicycle, motorcycle, traffic sign, and void. Notably, many annotated classes, particularly those located near the periphery of the fisheye images, exhibit significant distortion, making their segmentation more challenging. Additionally, the dataset suffers from a pronounced class imbalance, with dominant classes such as road, lane markings, and curb being overrepresented compared to underrepresented classes like bicycles, motorcycles, and traffic signs. This imbalance and peripheral distortion highlight the need for models capable of effectively handling these disparities.
2) SynWoodScape
The SynWoodScape dataset [16] is a synthetic adaptation of the real-world WoodScape dataset [12] and provides 2000 data samples. It includes 25 distinct categories, such as unlabeled, building, fence, other, pedestrian, pole, road line, road, sidewalk, vegetation, four-wheeler vehicle, wall, traffic sign, sky, ground, bridge, rail track, guard rail, traffic light, water, terrain, two-wheeler vehicle, static, dynamic, and ego vehicle, for the semantic segmentation task.
The dataset features a diverse set of annotated fisheye images captured from multiple viewpoints, including FV, MVR, MVL, and RV, effectively simulating real-world driving scenarios. However, despite its diversity, the SynWoodScape dataset has notable limitations, particularly the limited number of available samples and class imbalance. These issues may hinder the development of models that generalize well across all segmentation classes. Consequently, a few categories (unlabeled, bridge, dynamic, guard rail, and terrain) are missing from the available samples and were therefore excluded from this study.
3) SynCityscape Fisheye
The Cityscapes dataset [65] is a renowned benchmark widely used in autonomous driving research. It contains street scenes captured in 50 cities, offering 5,000 samples of urban environments. The dataset defines 19 semantic classes: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle, and bicycle for semantic segmentation tasks. This diverse set of categories supports comprehensive scene understanding for urban driving scenarios. Due to the limited availability of fisheye datasets for semantic segmentation tasks, we applied the seven-degree-of-freedom (7-DoF) data augmentation method proposed in [35]. This technique converts the original rectilinear images into synthetic fisheye images, effectively simulating the distortion patterns typical of fisheye cameras. The original images, with a resolution of 2048 × 1024 pixels, were transformed in this way to produce the SynCityscapes-Fisheye data used in our experiments.
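For illustration, converting a rectilinear frame into a synthetic fisheye image under a simple equidistant model can be sketched as follows. The focal lengths and output size are assumptions; the full seven-DoF augmentation of [35] additionally samples the virtual camera pose and focal length, which is omitted here.

```python
# Illustrative sketch: rectilinear (pinhole) image -> synthetic fisheye image.
import cv2
import numpy as np

def perspective_to_fisheye(img, f_pin=1000.0, f_fish=350.0, out_size=1024):
    h, w = img.shape[:2]
    cx_p, cy_p = w / 2.0, h / 2.0
    c = out_size / 2.0
    u, v = np.meshgrid(np.arange(out_size), np.arange(out_size))
    # Incident angle of each fisheye output pixel under the equidistant model r = f * theta.
    r = np.sqrt((u - c) ** 2 + (v - c) ** 2) + 1e-9
    theta = np.clip(r / f_fish, 0, np.pi / 2 - 1e-3)
    # Project the corresponding viewing ray back onto the pinhole image plane.
    rp = f_pin * np.tan(theta)
    map_x = (cx_p + rp * (u - c) / r).astype(np.float32)
    map_y = (cy_p + rp * (v - c) / r).astype(np.float32)
    # For label masks, use cv2.INTER_NEAREST so class indices are preserved.
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_CONSTANT)
```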
B. Spherical Models Used
To evaluate our approach on fisheye datasets, we selected two spherical CNNs and two spherical ViTs, chosen for their proven efficiency in spherical image semantic segmentation tasks and their optimized network parameters. This selection was also guided by the limited accessibility of certain spherical networks, as the full configurations detailed in their corresponding research studies were not entirely available.
1) Spherical PASS Network
To evaluate the fisheye datasets, we utilized the ERFPSPNet (PASS) model, an efficient spherical CNN designed for omnidirectional image processing [25]. This model employs an encoder-decoder architecture specifically optimized for learning spherical image features. The encoder is adapted from the well-established ERFNet [97], offering a balance between inference speed and segmentation accuracy. It integrates a pyramid pooling module from PSPNet [46], where the feature pyramid is upsampled and fused with input features. This design enables the network to capture larger spatial contexts, improving its receptive field and overall segmentation performance.
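As an illustration of the pyramid pooling idea that PASS inherits from PSPNet [46], the following minimal sketch pools features at several scales, upsamples them back, and fuses them with the input; the bin sizes and channel widths are assumptions and do not reproduce the exact PASS configuration.

```python
# Minimal pyramid pooling module in the style of PSPNet.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=128, bins=(1, 2, 4, 8)):
        super().__init__()
        branch_ch = in_ch // len(bins)
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                          nn.BatchNorm2d(branch_ch),
                          nn.ReLU(inplace=True))
            for b in bins])
        self.project = nn.Conv2d(in_ch + branch_ch * len(bins), in_ch, 1)

    def forward(self, x):
        size = x.shape[-2:]
        # Pool at several scales, upsample back, and fuse with the input features.
        pooled = [F.interpolate(b(x), size=size, mode='bilinear',
                                align_corners=False) for b in self.branches]
        return self.project(torch.cat([x] + pooled, dim=1))
```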
2) Spherical OOSS Network
Another spherical CNN method we employed for training on fisheye datasets is the ERFPSPNet+scSE (OOSS) network [26], which is based on the ERFPSPNet (PASS) [25] encoder-decoder architecture with some key modifications. The primary enhancement involves integrating lightweight attention modules with the pyramid structure to extract global contextual information in full-view spherical images, resulting in the ERFPSPNet+scSE model. This variation appends the pyramid pooling module with concurrent spatial and channel attention [98], recalibrating feature maps independently across the channel and spatial dimensions to make them more informative. The scSE (spatial and channel Squeeze-and-Excitation) mechanism extends the traditional squeeze-and-excitation attention [99] by complementarily squeezing along the channels and exciting spatial regions. This concurrent recalibration relaxes the local context constraint, placing more emphasis on relevant spatial features, which is crucial for effectively interpreting spherical images.
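The concurrent spatial and channel recalibration described above can be sketched as a small PyTorch module; the reduction ratio and the additive fusion of the two excitation paths are illustrative assumptions.

```python
# Hedged sketch of a concurrent spatial and channel squeeze-and-excitation (scSE) block.
import torch
import torch.nn as nn

class SCSE(nn.Module):
    def __init__(self, ch, reduction=16):
        super().__init__()
        # Channel excitation (cSE): squeeze spatially, re-weight channels.
        self.cse = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(ch, ch // reduction, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(ch // reduction, ch, 1),
                                 nn.Sigmoid())
        # Spatial excitation (sSE): squeeze channels, re-weight spatial locations.
        self.sse = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        # Concurrent recalibration: the two excitations are combined additively here.
        return x * self.cse(x) + x * self.sse(x)
```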
3) Spherical Distortion-Aware Trans4PASS Network
In this study, we adopt the distortion-aware Trans4PASS [31] architecture to address fisheye geometric distortions in semantic segmentation tasks. Building on traditional transformers [28], [52], Trans4PASS employs a multi-scale pyramid feature structure with two model variants: tiny (T) and small (S). The tiny model uses {2, 2, 2, 2} layers, while the small model uses {3, 4, 6, 3} layers across its four stages. The encoder progressively down-samples the input features across these stages to build a multi-scale feature pyramid, and two deformable components address the geometric distortion:
Deformable Patch Embedding (DPE): Applied in both encoder and decoder, it enables uniform feature extraction despite distortions.
Deformable MLP (DMLP): Works alongside DPE in the decoder to adaptively mix and interpret feature tokens without relying on a “squeeze and excite” module or parallel channels.
4) Spherical Distortion-Aware Trans4PASS+ Network
Another spherical distortion-aware architecture used for our approach evaluation is the Trans4PASS+ architecture [24], which is an upgraded version of the previously defined Trans4PASS [31] to address the challenge of fisheye geometric distortion in the semantic segmentation task. This network tackles spherical image distortions through two key design elements: (1) a Deformable Patch Embedding (DPE) module and (2) a Deformable MLP (DMLP) module, which collaborates with the DPE in the decoder by adaptively mixing and interpreting the feature tokens extracted via the DPE. The Trans4PASS+ model enhances the DMLP module with a parallel token mixing mechanism, improving its ability to handle complex distortion patterns. The details of the DPE and enhanced DMLPv2 modules are provided in the following sections.
Deformable Patch Embedding (DPE): The standard patch embedding (PE) used in conventional ViTs [28], [47], [100] relies on fixed patch positions, so the transformer overlooks the shape distortions of objects and the spherical geometric distortions introduced by the equirectangular projection of surround-view images. Inspired by deformable convolution [72] and overlapping patch embeddings [48], the DPE is applied separately to the input images in the encoder and to the feature maps in the decoder during the patchifying process. The DPE module learns a distortion-aware offset, which helps capture spatial relationships among objects within distorted patches. These adaptive offsets are predicted from the input feature map, enabling distortion-aware processing of surround-view spherical images.
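The DPE idea, a learnable offset predictor feeding a deformable patchifying convolution, can be sketched as follows. The embedding dimension, patch size, and zero-initialized offsets are assumptions for illustration and not the exact Trans4PASS/Trans4PASS+ configuration.

```python
# Minimal sketch of a distortion-aware deformable patch embedding.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformablePatchEmbed(nn.Module):
    def __init__(self, in_ch=3, embed_dim=64, patch=4):
        super().__init__()
        # Predicts a learnable (dx, dy) offset for every kernel sample so the
        # effective patch shape can follow the fisheye distortion.
        self.offset = nn.Conv2d(in_ch, 2 * patch * patch,
                                kernel_size=patch, stride=patch)
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)
        self.proj = DeformConv2d(in_ch, embed_dim,
                                 kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, C, H, W)
        tokens = self.proj(x, self.offset(x))      # (B, D, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D) token sequence
```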
Deformable MLP (DMLP): In contrast to vanilla MLP-based modules [27] that lack adaptivity and thus weaken token mixing for surround-view images, the DMLP-based decoder, connected with the DPE module, performs adaptive token mixing throughout the feature parsing process. This approach accounts for the deformation properties of surround-view images. The enhanced DMLPv2 module adds two parallel token mixing paths, a channel-wise mixing branch (CX) and a pooling-based mixing branch (PX):
The CX facilitates space-consistent but channel-wise feature re-weighting, enhancing the features by emphasizing informative channels.
The multi-scale PX and DMLP focus on spatial-wise sampling using fixed or adaptive offsets, resulting in mixed tokens that are highlighted in relevant positions.
These enhancements improve the model's ability to model discriminative information, boosting its generalization ability against domain shifts. Additionally, the efficient token mixing provided by PX reduces model complexity, employing a single MLP layer for a more lightweight yet effective model for semantic segmentation.
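A heavily simplified sketch of this parallel token mixing is given below; the branch widths, pooled-window size, and residual fusion are assumptions, and the deformable offset-based mixing of the full DMLPv2 module is omitted for brevity.

```python
# Hedged sketch of parallel channel-wise (CX) and pooling-based (PX) token mixing.
import torch
import torch.nn as nn

class ParallelTokenMixer(nn.Module):
    def __init__(self, dim=64, pool=3):
        super().__init__()
        self.cx = nn.Sequential(nn.LayerNorm(dim),
                                nn.Linear(dim, dim), nn.GELU(),
                                nn.Linear(dim, dim))               # channel-wise mixing
        self.px = nn.AvgPool2d(pool, stride=1, padding=pool // 2)  # pooled spatial mixing
        self.fuse = nn.Linear(dim, dim)

    def forward(self, x, hw):                      # x: (B, N, D), hw: (H, W) with N = H*W
        B, N, D = x.shape
        cx = self.cx(x)
        grid = x.transpose(1, 2).reshape(B, D, *hw)
        px = self.px(grid).flatten(2).transpose(1, 2)
        return self.fuse(cx + px) + x              # residual fusion of both branches
```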
C. Non-Spherical Models
In our experiments, we also employ two non-spherical models, one based on a CNN and the other on a ViT, to serve as baselines for comparison with the spherical models. A brief discussion of these models is provided below.
1) U-Net
The U-Net [42] is a foundational semantic segmentation model with a symmetric encoder-decoder architecture that enhances feature extraction across multiple scales and facilitates the integration of features from various levels of the network. Its distinctive U-shape, combined with skip connections between corresponding encoder and decoder stages, supports the reuse of features and reduces information loss as data progresses through the network. This design significantly improves segmentation accuracy by maintaining detailed spatial information throughout the entire process, and we therefore include it as a CNN baseline for comparison.
2) Swin Transformer
The Swin Transformer (SWIN) [27] is a highly adaptable architecture that efficiently handles images of varying resolutions using a shifted window-based self-attention mechanism. It divides images into fixed-size windows, performs self-attention within each, and progressively reduces feature map resolution while increasing dimensionality. This approach captures both fine details and broader features. We considered a UNet-like variant of the SWIN Transformer known as Swin-UNet [89], which connects encoding and decoding layers with skip connections. The encoding layers include patch merging for better feature extraction, while the decoding layers replace patch merging with patch expansion to restore the original resolution.
D. Evaluation Metrics
To assess the effectiveness of our semantic segmentation approach, we used the mean Intersection over Union (mIoU) and pixel accuracy (Acc) as our evaluation metrics because they capture the quality of a model's predictions in terms of both the overall segmentation accuracy and the spatial precision of each pixel's classification. IoU measures the overlap percentage between the predicted segmentation areas and the ground truth and is a key indicator of semantic segmentation performance. For each class, IoU is calculated by comparing the overlap area to the total area covered by both the prediction and the ground truth. The mIoU is then derived by averaging these scores across all classes, offering a detailed overview of the model's effectiveness across the entire dataset. Equation (1) gives the Intersection over Union (IoU) score for a single class, (2) provides the mean IoU score averaged across all classes, and (3) presents the formula for pixel accuracy, which evaluates the accuracy of the segmentation across all classes.
\begin{align*}
{\text{IoU}}_{c} =& \frac{{\text{TP}}_{c}}{{\text{TP}}_{c} + {\text{FP}}_{c} + {\text{FN}}_{c}} \tag{1}\\
{\text{mIoU}} =& \frac{1}{C} \sum _{c=1}^{C} {\text{IoU}}_{c} \tag{2}\\
{\text{Pixel Accuracy}} =& \frac{\sum _{c=1}^{C} {\text{TP}}_{c}}{\sum _{c=1}^{C} \left({\text{TP}}_{c} + {\text{FN}}_{c}\right)} \tag{3}
\end{align*}
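For reference, the metrics in (1)-(3) can be computed from a per-class confusion matrix as in the following sketch; the toy labels at the end are purely illustrative.

```python
# Straightforward computation of IoU, mIoU, and pixel accuracy from a confusion matrix.
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def metrics(conf):
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp                      # predicted as class c but wrong
    fn = conf.sum(axis=1) - tp                      # class c pixels that were missed
    iou = tp / np.maximum(tp + fp + fn, 1)          # Eq. (1), per class
    miou = np.nanmean(iou)                          # Eq. (2)
    pixel_acc = tp.sum() / np.maximum(conf.sum(), 1)  # Eq. (3)
    return iou, miou, pixel_acc

# Toy example with two classes on a tiny label map.
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(metrics(confusion_matrix(pred, gt, num_classes=2)))
```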
Experiment and Results
In this section, we present the performance results of our approach, including a detailed analysis that covers both quantitative and qualitative aspects. The quantitative analysis delivers a statistical overview, focusing on semantic segmentation performance with and without the Spherical CNNs and ViT methods. Meanwhile, the qualitative analysis provides detailed insights and a contextual understanding of the results.
A. Experiment Setup
1) Dataset Description
For the experiments and analyses, we partitioned the Woodscape and SynWoodscape datasets into training, validation, and test sets with splits of 80%, 10%, and 10%, respectively, ensuring a uniform distribution of all views across the partitions. Additionally, our generated Synthetic Cityscapes-fisheye dataset contains 5,000 finely annotated frames, with 2,975 frames for training, 500 for validation, and 1,525 for testing. For our experiments, we utilized 2,975 frames for training and 500 frames for validation.
2) Training Settings
In this work, we train models using an NVIDIA GeForce RTX 3080 with 10 GB of memory, leveraging the PyTorch framework.
All models are trained with an initial learning rate of
Additionally, image augmentation techniques, as outlined in Table 2(a), are applied during training to enhance the generalization ability of the models. These include random resizing and horizontal flipping, which simulate diverse conditions such as varying image scales, orientations, and perspectives. To further improve robustness, color-based transformations are used, including random brightness and contrast adjustments, CLAHE, and hue-saturation-value changes, applied with an overall probability of 0.9. Gaussian noise is also added with a ratio of 0.2 to mimic real-world imaging conditions. For computational efficiency, all evaluation models were adapted to process fisheye images resized to a fixed, reduced resolution.
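A sketch of this augmentation pipeline, expressed with the albumentations library, is shown below; the scale limits and the target resolution are assumptions, since only the probabilities are stated above.

```python
# Training augmentation pipeline sketch using albumentations.
import albumentations as A

train_transform = A.Compose([
    A.RandomScale(scale_limit=0.2, p=0.5),          # random resizing
    A.HorizontalFlip(p=0.5),
    A.OneOf([                                        # color-based transforms
        A.RandomBrightnessContrast(),
        A.CLAHE(),
        A.HueSaturationValue(),
    ], p=0.9),                                       # overall probability of 0.9
    A.GaussNoise(p=0.2),                             # mimic sensor noise
    A.Resize(height=512, width=512),                 # assumed network input size
])

# Masks are passed alongside images so identical geometric transforms are applied:
# augmented = train_transform(image=img, mask=seg_mask)
```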
Moreover, we train and test all models using multi-view data (i.e., FV, MVL, MVR, RV) for all analyses to effectively capture the wide range of distortions and variances characteristic of fisheye images during training and testing.
To address the highly imbalanced nature of the datasets, we fine-tuned models using Cross-Entropy Loss [102], Focal Loss [103], and OHEM-Cross-Entropy Loss [104]. We also incorporated a class-weighting scheme from [105] to emphasize underrepresented classes. Initial training phases utilized standard Cross-Entropy and Focal Loss, while subsequent phases applied OHEM-Cross Entropy loss to focus on challenging examples and improve segmentation of less dominant classes. We trained the spherical ViTs (Trans4PASS and Trans4PASS+) base models, with a layer configuration of {3, 4, 6, 3} [24], over fisheye datasets to enhance semantic segmentation. Table 2(b) summarizes the key parameters for training the Spherical ViT models.
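The staged loss setup can be sketched as follows, assuming 255 as the void label and illustrative values for the OHEM threshold and the minimum number of kept pixels; the class weights themselves follow the scheme of [105] and are not reproduced here.

```python
# Hedged sketch: class-weighted cross-entropy and an OHEM-style cross-entropy.
import torch
import torch.nn.functional as F

def weighted_ce(logits, target, class_weights):
    # class_weights: 1-D tensor of per-class weights emphasizing rare classes.
    return F.cross_entropy(logits, target, weight=class_weights, ignore_index=255)

def ohem_ce(logits, target, thresh=0.7, min_kept=100_000):
    # Per-pixel loss; keep only pixels whose correct-class probability is below
    # `thresh` (i.e., loss above -log(thresh)), or at least `min_kept` hardest pixels.
    loss = F.cross_entropy(logits, target, reduction='none', ignore_index=255).flatten()
    hard = loss[loss > -torch.log(torch.tensor(thresh))]
    if hard.numel() < min_kept:
        hard, _ = loss.topk(min(min_kept, loss.numel()))
    return hard.mean()
```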
Furthermore, Spherical ViTs [24], [31] are known to be data-hungry and often require fine-tuning for optimal performance. To address this, two training approaches were investigated for enhanced segmentation. The first approach involved training spherical ViT models from scratch, without utilizing pre-trained weights, to evaluate their inherent generalization capabilities. The second approach leveraged SegFormer pre-trained weights [52] to fine-tune the spherical ViT models. By comparing the performance of models trained with and without pre-trained weights, we assess the impact of transfer learning in the context of fisheye segmentation using spherical models.
We evaluated the model's effectiveness using mean Intersection over Union (mIoU), class-wise IoU, and pixel accuracy (Acc) metrics over the test set. These metrics provided a framework for assessing the semantic segmentation performance across the fisheye datasets.
B. Quantitative Fisheye Semantic Segmentation Results
This section compares fisheye image semantic segmentation using per-class mean Intersection over Union (mIoU) and pixel accuracy (Acc) results. The evaluation focuses on spherical and non-spherical methods across three test sets: WoodScape, SynWoodScape, and Synthetic Cityscapes-Fisheye.
1) Woodscape
Table 3 highlights that the spherical ViT (Trans4PASS+) is the best-performing model on the WoodScape real-world dataset, which presents substantial distortions due to the fisheye camera lens setup. With a mIoU of 0.67 and an average accuracy of 0.95 (highlighted in green), it significantly outperforms other architectures. This superior performance suggests that the ViT-based model, enhanced by the Trans4PASS+ framework, excels in capturing complex spatial relationships and finer details critical for segmenting objects in urban driving environments. The model's ability to handle smaller, more challenging categories such as Person, Rider, Motorcycle, and Traffic Sign is particularly notable in Fig. 11. These categories are typically challenging to segment due to their smaller size, irregular shapes, and frequent occlusion in dynamic driving scenes.
This figure presents a visualization of the fisheye datasets. The first column shows the original images, followed by their respective segmentation and overlay masks. The first row features a frame from the WoodScape [12] dataset, the second row displays a frame from the SynCityscapes-Fisheye [35] dataset, and the third row presents a frame from the SynWoodscape [16] dataset.
Illustration of the training and validation loss curves for the spherical model (Trans4PASS+) during training on the WoodScape fisheye dataset, without fine-tuning.
Train vs. validation loss curves for different models trained on the SynWoodscape dataset. (a) and (b) show the curves for Trans4PASS and Trans4PASS+, both exhibiting overfitting, as evidenced by the divergence of validation loss from training loss after a few epochs. This suggests the models struggle to generalize to unseen validation data, likely due to the limited number of available SynWoodscape samples. In contrast, (c), (d), (e), and (f) display the train vs. validation loss curves for OOSS, PASS, U-Net, and Swin-UNet, respectively. These models show no divergence between the training and validation loss during training, indicating better generalization.
Qualitative results on the Woodscape test set, obtained using a spherical model across Multiviews (FV, MVL, MVR, RV), demonstrate the model's generalization on a real-world dataset. The performance is highlighted with yellow-mustard boxes, focusing on the deformed object shapes and distorted edges compared to the ground truth masks.
Qualitative comparison of regular methods vs. spherical models on Woodscape for semantic segmentation, where mustard boxes indicate performance degradation with regular transformers due to spherical distortions, and green boxes highlight regions of improved segmentation accuracy using spherical transformers.
In contrast, architectures like Swin-UNet and UNet exhibit consistently poor performance across these smaller categories, particularly in Motorcycle, Rider, and Person. Their reliance on localized receptive fields limits their ability to model long-range dependencies, which are essential for understanding complex, distorted scenes in fisheye imagery. This limitation results in poor IoU scores, often below 0.11 for these categories, indicating that these models struggle to generalize well in fisheye imaging scenarios.
Although Spherical CNN (PASS) and Spherical CNN (OOSS) offer improvements over U-Net and Swin-UNet, they still fall short of the Trans4PASS+ model. While their architectures are better suited to handling spherical distortions, they lack the global attention mechanisms needed to effectively capture finer details and distant object relationships in urban driving scenes. To compare with other fisheye segmentation methods, Table 3 also presents the published results from [95] (De-UNet), which uses deformable convolutional layers to address fisheye distortions, and from [67], which fine-tunes the SegFormer model with MiT-B3 and MiT-B2 backbones on the WoodScape dataset to improve fisheye segmentation performance. While both approaches are tailored for fisheye semantic segmentation, they fall short compared to the spherical model, particularly in terms of class-wise IoU and mIoU. These findings highlight the potential of spherical techniques in tackling the challenges of fisheye semantic segmentation.
Moving to pre-trained weights, as seen in Table 4, both spherical models, Trans4PASS+ and Trans4PASS, show significant improvement when fine-tuned using SegFormer pre-trained weights [52] on the WoodScape dataset. However, Trans4PASS+ outperforms Trans4PASS, particularly in complex, smaller categories such as Person, Rider, and Traffic Sign. These gains can be attributed to the transformer-based architecture of Trans4PASS+, which enhances feature extraction and spatial understanding.
When comparing the results of training from scratch with those using pre-trained weights (Tables 3 and 4), it becomes evident that training from scratch with spherical ViTs yields better per-class IoU scores on the real-world WoodScape dataset. The reason is that pre-trained models often perform better when the target dataset shares characteristics with the pretraining dataset. However, the domain mismatch between standard pretraining datasets (such as ImageNet) and the WoodScape dataset (which involves unique urban environments and fisheye image distortions) hinders the transferability of learned features. Moreover, pre-trained models may overfit to patterns learned from the pretraining data, limiting their ability to learn the distinct patterns required by a new domain. This is particularly evident in challenging categories where pre-trained models struggle, such as smaller or occluded objects in urban driving environments. In contrast, models trained from scratch can adapt better to these domain-specific challenges. As shown in Fig. 8, the validation loss decreases over epochs when training from scratch, indicating that the Trans4PASS+ model is learning effectively without signs of overfitting.
To further analyze the spherical ViT results, we compared them with the recently proposed open-vocabulary spherical ViT (OOOPS) [21]. We trained and tested the OOOPS model using the same setup described in their paper (see Table 3). This setup enabled us to evaluate the performance of the OOOPS model in a closed vocabulary setting with the real-world Woodscape dataset [12], as all models were trained and evaluated in closed settings. In this context, the model underperformed on boundary classes, as acknowledged by Zheng et al. [21]. It struggled to produce segmentation results for motorcycles and traffic signs—objects commonly found at the boundary of fisheye images. However, OOOPS achieved better results than other spherical and non-spherical models in object categories like ‘Person’ (0.57) and ‘Bicycle’ (0.77).
2) SynCityscapes-Fisheye
Table 5 provides a comparative performance analysis of different spherical and non-spherical segmentation models applied to the SynCityscapes-fisheye dataset across multiple object categories.
From Table 5, Spherical ViT (Trans4PASS+) stands out as the best performer, achieving the highest mean Intersection over Union (mIoU) of 0.54. The model recognizes larger, well-defined classes such as Road, Sidewalk, Building, Sky, and Vegetation, consistently achieving high IoU scores. Additionally, it demonstrates considerable improvements in more challenging categories such as Traffic Light (0.30), Traffic Sign (0.39), and Motorcycle (0.25). These results indicate that Trans4PASS+ is particularly adept at capturing crucial but less frequently occurring objects in urban scenes, reflecting its capacity to model complex spatial relationships and dynamic objects.
In contrast, traditional models such as U-Net and Swin-UNet struggle with smaller and more dynamic categories like Bicycle and Motorcycle, where their IoU scores drop significantly. This suggests that these models have limitations when handling complex and highly variable objects in scenes with dynamic elements. Although spherical CNN (PASS) and Spherical CNN (OOSS) offer modest improvements over baseline architectures, they still underperform when compared to the transformer-based approach of Trans4PASS+.
The results in Table 5 demonstrate that the pre-trained spherical model generally outperforms those trained from scratch. This performance gap can be attributed to the visual similarities between the SynCityscapes-Fisheye dataset and datasets such as ImageNet, on which the SegFormer models were originally pre-trained. These similarities facilitate better feature transfer, enabling pre-trained models to leverage previously learned patterns and features for segmentation, leading to enhanced performance compared with models without such prior training.
Moreover, the last column in Table 5 presents the class-wise Intersection over Union (IoU) scores and the mean IoU (mIoU) for the SynCityscapes-fisheye dataset when evaluated with the Spherical-ViT (Trans4PASS+) model trained from scratch. Notably, the model struggles to effectively learn smaller classes, showing no results for some of these classes during evaluation. For instance, it has difficulty segmenting traffic signs, motorcycles, and riders, which are typically harder to segment due to their smaller size and less frequent appearance in datasets.
This highlights that the spherical model still requires fine-tuning to handle the SynCityscapes-Fisheye dataset effectively. Because this dataset is produced with the 7-DoF data augmentation technique, a workaround for the scarcity of fisheye datasets, its distortions differ from, and are less accurate than, those of real-world fisheye imagery, posing a further challenge for the spherical ViT (Trans4PASS+). Although the model is specifically built to manage the inherent distortions of fisheye lenses, it may struggle to learn effective representations of these synthetic distortions when trained from scratch. These factors contribute to the lower performance observed in this case.
3) SynWoodscape
Table 6 provides a comparative performance analysis of different spherical and non-spherical segmentation models applied to the SynWoodscape dataset across multiple object categories.
A close analysis of Table 6 reveals that U-Net performs relatively well on specific classes, especially for larger, well-structured objects like Road (0.88), Static (0.92), and Ego Vehicle (0.99). However, it struggles with smaller, more complex objects like Fence (0.20) and Pole (0.33). This is likely due to U-Net's simpler convolutional architecture, which may not be sufficient for capturing fine details and irregular boundaries in these smaller objects. Nevertheless, U-Net maintains a higher overall mIoU score than other models, indicating better performance for larger, easier-to-segment classes. In addition, the spherical models (PASS and OOSS) outperform Swin-UNet for harder-to-classify classes, particularly those that exist near boundaries. Fig. 9(c), (d), and (f) illustrate the training versus validation loss and IoU curves during the training on the SynWoodscape dataset.
On the other hand, the spherical ViT architecture performs more poorly than expected, particularly in challenging categories such as Pedestrian (0.05), Traffic Light (0.07), and Two-Wheeler (0.06). The primary reason for this underperformance is that spherical ViTs require large datasets to train effectively. However, the SynWoodscape dataset contains only approximately 2000 images, which is insufficient for data-hungry models like Trans4PASS+. Consequently, the spherical models overfit to the SynWoodscape data: the limited training set is, we hypothesize, insufficient for effectively learning the spatial relationships and distortions characteristic of fisheye images. With so few samples, the models struggle to generalize to the diverse and complex scenes in SynWoodscape. Fig. 9(a), (b), and (e) illustrate the training versus validation loss curves for Trans4PASS, Trans4PASS+, and U-Net, respectively, during training on the SynWoodscape dataset (i.e., 2000 samples), providing a clearer comparison of their behavior. In particular, Fig. 8 demonstrates that overfitting does not occur when Trans4PASS+ is trained on the larger WoodScape dataset (compare with the Fig. 9(b) curve for the SynWoodscape dataset).
C. Qualitative Results
For a more detailed qualitative analysis of the different side views in the WoodScape dataset, Fig. 10 showcases segmentation results from the WoodScape test set processed by the spherical ViT (Trans4PASS+) model. These results demonstrate its capacity to handle fisheye image distortions, especially at the image boundaries. The left column displays the original fisheye images, characterized by their wide field of view (FoV) and the distortions typical at the edges of such images. The middle column features the corresponding segmentation maps, where different colors represent various street elements, including roads, vehicles, and pedestrians. Despite the severe distortions introduced by fisheye lenses, the Trans4PASS+ model detects and segments these objects accurately. In the right column, areas highlighted with yellow boxes focus on regions where the model's performance is particularly strong, accurately segmenting objects even at the distorted boundaries. This success is largely due to the spherical projection technique employed by Trans4PASS+ through its Deformable Patch Embedding (DPE) module, which maps the distorted 2D fisheye images onto a spherical surface. By accounting for the fisheye's distortion characteristics, this technique normalizes the distortions and preserves spatial relationships, allowing for more accurate segmentation.
For a comparative qualitative analysis, Fig. 11 contrasts the results of non-spherical vs. spherical models, highlighting the challenges and solutions in addressing fisheye distortions. The first and second columns show the original fisheye images with their respective ground truth masks, providing a reference for comparison. The subsequent columns display the segmented outputs from various non-spherical and spherical models. Notably, the results from Swin-UNet (highlighted with yellow boxes) illustrate its inability to accurately segment objects, particularly near the distorted edges of the fisheye images. Swin-UNet struggles with these distortions due to its fixed Patch Embedding (PE) and MLP layers, which hinder its ability to adapt to the deformations in object shapes caused by the fisheye lens.
In contrast, the fourth column presents the results from the spherical transformer (Trans4PASS+), a model specifically designed to handle such distortions. This spherical model leverages Deformable Patch Embedding (DPE) and Deformable Multilayer Perceptron (DMLP) modules to learn object shapes and account for fisheye distortions during training, leading to much better segmentation results. The yellow and green boxes over the predicted images emphasize areas where the spherical ViT model excels compared to U-Net and Swin-UNet. These visual cues highlight the substantial improvements in handling fisheye distortions, showcasing Trans4PASS+'s superior performance in segmenting distorted objects.
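To make the idea of deformable patch embedding concrete, the following is a simplified conceptual sketch of offset-based patch sampling; it illustrates only the general mechanism of learning content-dependent sampling offsets and is not the actual Trans4PASS+ implementation [24].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDeformablePatchEmbed(nn.Module):
    """Conceptual sketch: patch tokens are sampled on a grid shifted by learned
    offsets, so patches can follow deformed object shapes near fisheye edges."""
    def __init__(self, in_ch=3, embed_dim=64, patch=4):
        super().__init__()
        self.patch = patch
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        self.offset = nn.Conv2d(in_ch, 2, kernel_size=patch, stride=patch)

    def forward(self, x):
        b, _, h, w = x.shape
        # base sampling grid in normalized [-1, 1] coordinates
        gy, gx = torch.meshgrid(
            torch.linspace(-1, 1, h // self.patch, device=x.device),
            torch.linspace(-1, 1, w // self.patch, device=x.device),
            indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        offsets = self.offset(x).permute(0, 2, 3, 1) * 0.1  # small learned shifts
        feats = self.proj(x)                                 # regular patch tokens
        return F.grid_sample(feats, grid + offsets, align_corners=True)
```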
Likewise, Figs. 12 and 13 present a comparative qualitative analysis of various spherical and non-spherical models applied to the SynCityscapes-Fisheye and SynWoodscape datasets. In the SynCityscapes-Fisheye qualitative test set evaluation, the spherical ViT (Trans4PASS+) model demonstrates superior performance, particularly in detecting smaller classes, especially along the edges of the images. This capability highlights its effectiveness in handling fisheye distortions and accurately segmenting intricate objects within complex scenes. Conversely, on the SynWoodscape test set, where the available data is limited to only 2000 images, the U-Net model outperforms the others in identifying smaller objects at the edges. This performance can be attributed to U-Net's established strengths in segmentation tasks on datasets with fewer examples, where it can leverage its architectural advantages to extract meaningful features effectively. These figures illustrate how different models exhibit varying strengths depending on the dataset's characteristics and the nature of the challenges presented by fisheye distortions.
Qualitative comparison of regular methods vs. spherical models on SynCityscapes-fisheye for semantic segmentation.
Qualitative comparison of regular methods vs. spherical models on SynWoodscape for semantic segmentation.
This comparison highlights the critical importance of specialized architectures, such as Spherical CNNs (e.g., PASS and OOSS) and ViTs like Trans4PASS and Trans4PASS+, in processing fisheye images. Unlike traditional models like U-Net and Swin-UNet, which are not explicitly designed to handle non-linear manifolds (e.g., fisheye data), spherical models tailored for spherical data significantly improve segmentation accuracy and robustness in fisheye imagery. Notably, state-of-the-art models like Trans4PASS+ effectively address the challenges of fisheye distortions, particularly in applications like autonomous driving.
D. Complexity and Real-Time Performance
Table 7 summarizes the real-time performance of various spherical and non-spherical models used in fisheye semantic segmentation for autonomous driving applications, presenting their Average Inference Time per Image, inference speed, and parameter counts, measured on an NVIDIA GeForce RTX 3080 with 10 GB of memory. With lower parameter counts, PASS (2.5 M) and OOSS (2.6 M) achieve excellent real-time speeds of 142.23 FPS and 137.32 FPS, making them ideal for resource-constrained, real-time applications like autonomous driving. Trans4PASS and Trans4PASS+, with higher parameter counts (14 M and 25 M), balance efficiency and advanced feature extraction while maintaining real-time performance at 49.12 FPS and 41.07 FPS.
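The reported timings can be reproduced with a simple measurement loop of the following form (a sketch; the input resolution and iteration counts are assumptions):

```python
import time
import torch

# Measures average GPU inference time per image, FPS, and parameter count.
@torch.no_grad()
def benchmark(model, size=(1, 3, 512, 1024), warmup=20, runs=200):
    model.eval().cuda()
    x = torch.randn(size, device="cuda")
    for _ in range(warmup):                     # warm-up before timing
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    avg = (time.perf_counter() - start) / runs
    params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{avg * 1000:.2f} ms/image, {1 / avg:.2f} FPS, {params:.1f} M params")
```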
In contrast, non-spherical models are less suitable for fisheye data. U-Net, though fast at 282.48 FPS, cannot handle fisheye distortions effectively, while Swin-UNet, with a larger parameter count of 27.3 M, falls below the real-time threshold at 27.67 FPS. The design and real-time capabilities of spherical models make them better suited for practical applications requiring both efficiency and accuracy.
However, we have also shown that Spherical models (particularly Spherical ViTs) have a greater tendency to overfit when trained on smaller datasets (such as SynWoodscape) compared to non-spherical models. One must take this into account when training models.
From the observations and analyses, it is clear that spherical models outperform non-spherical models when tested on both real-world and synthetic datasets. This underlines the promising adaptability of spherical models for processing fisheye data, a frequent requirement in autonomous driving scenarios. However, to fully unlock the potential of Spherical ViTs, careful consideration must be given to the size and diversity of fisheye datasets used for training. Ensuring adequate data complexity is essential for these models to accurately capture and learn the intricate patterns inherent in fisheye imagery.
Conclusion
In this paper, we presented semantic segmentation techniques for fisheye images using spherical approximations and compared their performance with non-spherical methods. Fisheye cameras, which capture a wide field of view (FOV) ranging from
To tackle the distortions in the fisheye segmentation task, we implemented spherical models based on Spherical CNNs and Spherical ViTs, and compared their performance against traditional non-spherical models. Our evaluation utilized the Woodscape, SynCityscapes-Fisheye, and SynWoodscape datasets.
Our comparison reveals that spherical models outperform non-spherical methods in detecting smaller objects, particularly near the image boundaries where distortion is most pronounced. Notably, spherical approximations improved segmentation accuracy without the need for unsupervised domain adaptation techniques. Spherical segmentation methods thus narrow the performance gap between standard and fisheye segmentation, delivering improved accuracy while maintaining computational efficiency. Hence, these models exhibit flexibility, effectively handling both spherical and fisheye data, making them versatile across applications.
Despite these promising results, spherical models have certain limitations. Firstly, they tend to overfit when trained on smaller datasets. Secondly, while still capable of real-time performance, they tend to be slower than the non-spherical methods. Finally, from a more theoretical perspective, the reliance on spherical approximations may not fully capture the unique characteristics of fisheye lens distortion, especially the exact shape of objects in more complex or dynamic environments, such as those encountered in autonomous driving.
For future work, we propose extending this research to additional tasks such as object detection and panoptic segmentation in fisheye images. Fine-tuning spherical models with edge-aware loss functions could improve object shape predictions and enhance segmentation accuracy. Additionally, we aim to develop a dedicated FishNet framework specifically tailored for fisheye data. This framework could incorporate both extrinsic and intrinsic parameters of fisheye lenses to leverage spherical approximations better. Given that spherical models are inherently distortion-aware, this advancement holds the potential to drive future innovations and set a new benchmark for fisheye semantic segmentation in the autonomous driving industry.
ACKNOWLEDGMENT
The authors would like to thank Paul Bourke for providing the images in Fig. 5.