Introduction
360° video/image, also known as panoramic, spherical or OmniDirectional Video (ODV), is a new type of multimedia that provides users with an immersive experience. For example, people can wear a Virtual Reality (VR) headset and move their heads to look around a virtual world; 360° video is then mandatory to achieve real-time, realistic rendering of the scene. These techniques are useful in real-world applications, e.g., displaying panoramic images/videos for vehicle driving, shopping, sightseeing and so on. Collecting such data generally requires specific equipment, such as the Yi Halo or the GoPro Odyssey. RGB-D cameras can also be used to capture depth for 3-DoF or 6-DoF rendering, enabling depth estimation [1], [2], semantic segmentation [3]–[5] and salience prediction [6]–[8].
Another related topic is neural view rendering. The idea is to use deep learning approaches to explicitly [18]–[22] or implicitly [24]–[28] discover the 3D structure of objects for novel view synthesis [18], [22], [24] or neural rerendering [29]–[31]. In contrast with 360° video processing, novel view rendering usually takes an object as the view center. By pointing the camera toward it, different viewpoints of the same object are captured, possibly using a depth camera to obtain additional geometric information. These views are then interpolated using a trained neural network. Many public synthetic datasets [32]–[34] are available for this task.
In contrast with both 360° video and novel view rendering, but bridging the gap between them, our goal is to achieve camera-centered, 360° panoramic novel view interpolation. More precisely, our method allows using a single, ordinary camera to capture a few reference views of the real world and predict the intermediate views in between, achieving horizontal 360° view synthesis from this sparse input. Hence we refer to our proposed approach as See360. Note that, as opposed to 360° video, we do not need any complex device to capture omnidirectional views for scene understanding. Our method also differs from novel view rendering since our goal is to capture the 3D structure of the surroundings rather than the structure of a single object. The problem we are tackling is thus more challenging. Since it is not possible to predict an unseen view without any prior, we use two reference views (left and right) to “interpolate” the intermediate views. The angular distance between references can be 60° or even up to 120° (note that the most common field of view (FOV) for cameras is 55°, which ensures that the input views will overlap). All occluded and missing scene parts in between are estimated by our method, while achieving smooth view transitions. In addition, we can achieve view interpolation for 360° panoramic videos/animations. Note that our proposed method is purely image-based view synthesis without requiring any knowledge of depth or 3D information; hence its data requirements are low and common cameras can be used for view synthesis, although spatial misalignment may occur in the presence of large moving objects, low scene overlap or sudden scene changes.
Figure 1 illustrates our method: we attach a single camera to an electronic tripod head (Figure 1(a)), which can automatically rotate the camera over the full 360° range, to capture T sparse reference views and a dense set of S (S>T) intermediate views. For training, we place the camera at the four corners of a street to collect 4 such datasets (at each location, we capture 6 images to cover the 360° view), enabling the model to learn the 3D structure of the street. At the testing stage, we randomly place the camera in between, for instance here at the center of the street (the red dot in Figure 1(b)). We then use T=2 views at 120° and 180° from this new position as references, to predict views over the whole angular range in between. As shown in Figure 1(c), the generated images change seamlessly along with the camera pose and are perceptually similar to the ground truth. We also depict the residual difference between prediction and ground truth, magnified by 5 for better visualization. The very small differences are mainly located around edges, which indicates that the global information matches in the low frequency domain, despite some losses of high frequency information. This good pixel fidelity makes it possible to use the generated images for other applications, such as semantic segmentation (see Figure 1(d)).
(a) 360° capture device. (b) The coordinates where we place the camera for data collection. Given reference views at 120° and 180°, our proposed See360 model can render views at different angles from left to right. (c) Comparison of novel views (including our prediction, the ground truth and the residues between them) and (d) comparison of segmentation results. The view changes from the left and right references can be observed from the buildings and trees.
To summarize, we implicitly discover 3D correlations between reference views by using explicit 2D affine transformation to render novel views. Our key contributions are:
To generate high-quality, photo-realistic images without requiring 3D information, we propose a Multi-Scale Affine Transformer (MSAT) to render reference views in the feature domain using 2D affine transformations. Instead of learning a one-shot affine transform of the reference views, we learn multiple affine transforms in a coarse-to-fine manner to match the reference features for view synthesis.
Furthermore, to allow users to interactively manipulate views at any angle, we introduce the Conditional Latent space AutoEncoder (C-LAE). It consists of 1) patch-based correlation coefficient estimation and 2) conditional angle encoding. The former finds global features for 3D scene coding, while the latter introduces target angles as one-hot binary codes for view interpolation.
In addition, we provide two different types of datasets to train and test our model. The first consists of synthetic 360° images collected from virtual worlds: UrbanCity360 and Archinterior360. The second consists of real 360° images collected from real-world indoor and outdoor scenes: HungHom360 and Lab360. Semantic segmentation maps are also provided for all datasets. Our tests in the wild show that See360 can also be used, to some extent, on unknown real scenes. With only a small number of training images (about 24) required, 10 minutes of training are enough to achieve 360° view rendering.
Related Work
In this section, we give a detailed review of previous related works on 1) 360° video processing, 2) neural view rendering and 3) 3D-aware view synthesis. Note that our work is also related to classic image processing problems such as image warping [59], [60], image stitching [61], [62] and inpainting [63], [64]. However, these methods do not study camera-pose-guided view interpolation, whereas our proposed method predicts novel views given random camera poses.
A. 360° Video/Image Processing
360° video/image has become increasingly popular and drawn great attention. With commercially available head-mounted displays (HMDs), users can freely move their heads to have immersive experiences. New challenges have recently emerged for 360° video/image processing: 1) storage and transmission and 2) viewpoint-centric processing. For the first challenge, a video/image of very high resolution is required to properly cover the whole 360° field of view.
Furthermore, to avoid viewers’ motion sickness [10], a high frame rate is required. Many organizations have developed compression standards for 360° video/image, such as MPEG-I [11] and JPEG-360 [12]. However, 360° video/image compression is still an ongoing research topic, since visual quality assessment (VQA) is needed to evaluate the degradation caused by compression. In real-world environments, the generation of 360° video/image follows the equi-rectangular projection (ERP) corresponding to the spherical coordinate system.
Therefore, the resulting images need to be projected to the standard view for artifact-free visualization. With the development of deep learning, 360° videos have been well studied in many fields, such as depth estimation and semantic segmentation. For example, Zioulis et al. [15] designed an autoencoder to model depth on omnidirectional imagery. Eder et al. introduced a plane-aware loss for dense depth estimation [17], as well as a tangent transform to mitigate spherical distortion for segmentation [3]. Projecting images onto icosahedral spheres [16] is another option for 360° video/image processing.
B. Neural View Rendering
Neural view rendering is the creation of photo-realistic imagery of virtual worlds. By learning a 3D scene representation, rendering methods can generate images of a variety of complex real-world phenomena. Let us focus on two applications: 1) neural rerendering and 2) novel view synthesis. Among neural rerendering methods, Neural Rerendering in the Wild [29] uses a neural network to synthesize realistic views of tourist landmarks under various lighting conditions. Pittaluga et al. [30] proposed to learn an inverse reconstruction from point clouds to realistic novel views with unknown key-point scale, orientation and multiple image sources. Neural avatar [31] is a deep network that produces body renderings for various body poses and camera positions. For novel view synthesis, the idea is to use multiple images or 3D models to render a new view of the object. The works in [35]–[38], [67]–[69] propose generative adversarial networks for unsupervised learning of 3D representations; they can achieve rendering at arbitrary 3D poses by a rigid-body transformation of the 3D features. By incorporating depth, point cloud or voxel information, [39]–[41] achieve better 3D reconstruction via an encoder-decoder structure. [42], [43] create mosaics via an online deep blending pipeline from multi-view stereo images. Dupont et al. [44] set a new path for neural rendering by enforcing equivariance between changes in viewpoint and changes in the latent space.
C. 3D-Aware View Synthesis
After introducing 360° video/image processing and neural view rendering, let us further differentiate our proposed approach from other works: 3D awareness is the core idea behind the different view synthesis approaches. In order to achieve better visual fidelity, the generative adversarial network is the most popular model and has been successfully used in many works [19], [20], [24], [29]–[31]. Therefore, our method should be compared with three specific, key architectures: conditional GAN [23], HoloGAN [35], and ATSal [9].
Figure 2 compares these three architectures with our proposed See360.
Comparison of generative image models. G is the generator and D is the discriminator. “real” and “fake” are the ground truth and generated images. (a) and (b) are used for object-centered novel view rendering, (c) is used for salience prediction on 360° video, and (d) is our proposed See360 model for view rendering. (a) Conditional GAN: it takes the target pose as a label for training. (b) HoloGAN: it explicitly explores the camera pose by directly using a 3D rigid-body transformation. (c) ATSal: it uses cubemap projection (CMP) to project the panorama into multiple views for further processing. (d) See360: it transfers 3D pose parameters to 2D affine parameters for view synthesis.
Though HoloGAN achieves good performance on object-centered view synthesis, it is not suitable for 360° view synthesis, for two reasons. First, its 3D convolution is built on the assumption that, whatever the change of view, the 3D representation of the given object remains the same. This works well for predicting different views of a given 3D object, but the method cannot predict new views of new 3D objects without retraining. Second, HoloGAN works well on simple objects with simple textures and no background, like chairs or faces. For complex scenes in real 3D worlds, it produces very blurry images [44]. ATSal is a model designed for 360° salience prediction. To find the spots where visual attention should focus, it processes panoramic, spherical videos with 2D convolutions. It needs forward and backward cubemap projections to decompose the video into different views and synthesize back the generated result. One of its problems is that such panoramas are not commonly used: they require specific collection, storage and processing mechanisms.
In comparison, we propose See360 to achieve 360° view synthesis via a generative adversarial network. Instead of learning a complex 3D feature representation of the surroundings, we take two reference views and the target viewing angle to implicitly learn the 2D affine transformation that is equivalent to the new 3D view to be synthesized. Hence we tackle 3D view synthesis as a 2D image fusion problem with explicit pose control, and complement it with a model able to fill the missing/occluded regions.
Method
To render a novel view in a given camera pose, See360 extends traditional GANs by introducing a Conditional Latent space AutoEncoder (C-LAE) that maps the 3D camera pose to 2D image projection. It consists of 1) a novel cross-patch convolution and 2) an effective conditional angle encoding. Specifically, See360 fuses the two reference images by learning an equivalent 2D affine transform from their 3D camera poses, enabling the generated images to be a weighted interpolation of the two references.
To ensure realistic visual quality of the novel view, See360 also includes a Multi-Scale Affine Transformer (MSAT) that renders edges and textures in a coarse-to-fine manner. Eventually, the generator produces realistic views that can fool the discriminator. Similar to recent works [45]–[47], the GAN structure in See360 relies on a regular CNN as the discriminator, the key contribution being our new generator.
Figure 3 shows the complete structure of the generator of our proposed See360. It takes two references (left and right views) to render the novel view corresponding to the target camera pose.
The complete structure of the proposed See360 generator. It takes the left and right views as references to render a novel view at the target camera pose. It consists of 1) an encoder with a Conditional Latent space AutoEncoder (C-LAE) and 2) a Multi-Scale Affine Transformer (MSAT). Inside the C-LAE, a Cross Patch convolution finds the view projections from left to right and from right to left, respectively. The target camera pose is injected for explicit view control via Conditional Angle Encoding. The resulting affine transforms are then used in the MSAT for reconstruction.
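To make the data flow of Figure 3 concrete, the following is a minimal structural sketch in PyTorch; all module and variable names (See360Generator, clae, msat, etc.) are illustrative placeholders rather than the authors' implementation.

import torch.nn as nn

class See360Generator(nn.Module):
    """Structural sketch of Figure 3: encoder -> C-LAE -> MSAT -> decoder."""
    def __init__(self, encoder, clae, msat, decoder):
        super().__init__()
        self.encoder = encoder    # shared CNN encoder applied to both references
        self.clae = clae          # Conditional Latent space AutoEncoder (Sec. III-A)
        self.msat = msat          # Multi-Scale Affine Transformer (Sec. III-B)
        self.decoder = decoder    # maps fused features back to an RGB view

    def forward(self, img_left, img_right, angle_code):
        # 1) Encode both references into multi-scale feature pyramids.
        feats_left = self.encoder(img_left)
        feats_right = self.encoder(img_right)
        # 2) C-LAE: cross-patch correlation plus conditional angle encoding
        #    predicts per-scale affine parameters for each reference.
        affines_left, affines_right = self.clae(feats_left, feats_right, angle_code)
        # 3) MSAT: warp and fuse the reference features at every scale.
        fused = self.msat(feats_left, feats_right, affines_left, affines_right)
        # 4) Decode the fused features into the novel view at the target pose.
        return self.decoder(fused)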
A. Conditional Latent Space AutoEncoder
There are two key processes in our proposed Conditional Latent space AutoEncoder (C-LAE): 1) Cross Patch Convolution (the pink box in Figure 3) and 2) Conditional Angle Encoding (the cyan box in Figure 3). The former performs feature matching to learn the correspondence maps, while the latter handles camera pose manipulation.
1) Cross Patch Convolution:
Given the input left and right views, there are 2D regions in the images that overlap in terms of content and can thus be used as spatial clues to learn the hidden 3D correlation. In order to find these 2D overlapping regions, we use both the left and right feature maps to search for pattern similarity. In other words, instead of looking for spatial correlation in the image space, we look for similarity in the feature space through a trainable convolution process.
To achieve this, as shown in Figure 4, we perform Cross patch convolution. We firstly split the left features into patches $x$ and convolve them over the right feature map $Y$, and symmetrically split the right features into patches $y$ and convolve them over the left feature map $X$:
\begin{align*} left\rightarrow right \quad S_{i,j}^{left}&=\sum _{a=0}^{h-1} \sum _{b=0}^{w-1} x_{a,b}\cdot Y_{i-a, j-b} \\ right\rightarrow left \quad S_{i,j}^{right}&=\sum _{a=0}^{h-1} \sum _{b=0}^{w-1} y_{a,b}\cdot X_{i-a, j-b} \tag{1}\end{align*}
where $h \times w$ is the patch size and $S^{left}$, $S^{right}$ are the resulting feature correspondence maps.
The use of the Cross patch convolution has two merits in our framework: 1) we can significantly reduce the feature dimension to find the key features in the left and right views, and 2) we apply the Cross patch convolution symmetrically to both left and right views, so that we obtain two independent feature correspondence maps.
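As a concrete illustration, the snippet below sketches one direction of Eq. (1) in PyTorch, treating patches cut from one reference's features as convolution kernels that are slid over the other reference's features; the patch size, stride and padding are assumptions made for this example, not the paper's settings.

import torch
import torch.nn.functional as F

def cross_patch_correlation(feat_a, feat_b, patch=8, stride=8):
    """feat_a, feat_b: (1, C, H, W) feature maps of the two reference views.
    Returns one response map per patch of feat_a (Eq. 1, one direction)."""
    # Cut feat_a into (C, patch, patch) kernels.
    kernels = F.unfold(feat_a, kernel_size=patch, stride=stride)      # (1, C*p*p, N)
    n = kernels.shape[-1]
    kernels = kernels.transpose(1, 2).reshape(n, feat_a.shape[1], patch, patch)
    # Slide every patch of feat_a over feat_b.
    corr = F.conv2d(feat_b, kernels, padding=patch // 2)              # (1, N, H', W')
    return corr

# Symmetric use for both directions of Eq. (1):
# S_left  = cross_patch_correlation(F_left,  F_right)   # left  -> right
# S_right = cross_patch_correlation(F_right, F_left)    # right -> left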
2) Conditional Angle Encoding:
After obtaining the two feature correspondence maps, we reshape them into 1D vectors and use one fully connected layer to find the implicit 3D representation as
\begin{equation*} h\left ({g(S) }\right) = g(S) * (1 + \sigma (z)) + \mu (z) \tag{2}\end{equation*}
where $g(S)$ is the latent code obtained from the correspondence maps $S$, $z$ is the one-hot code of the target viewing angle, and $\sigma(\cdot)$ and $\mu(\cdot)$ are learned functions that scale and shift the latent code to condition it on the target pose.
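As a sketch of Eq. (2), the module below modulates the latent code with a one-hot target-angle code; the use of single linear layers for $\sigma(\cdot)$ and $\mu(\cdot)$, the latent dimension and the default code length of 13 are assumptions made for illustration (13 matches the one-hot example given later in Section IV-G).

import torch
import torch.nn as nn

class ConditionalAngleEncoding(nn.Module):
    def __init__(self, latent_dim, code_len=13):
        super().__init__()
        self.sigma = nn.Linear(code_len, latent_dim)  # scale branch, sigma(z)
        self.mu = nn.Linear(code_len, latent_dim)     # shift branch, mu(z)

    def forward(self, g_s, z):
        # g_s: (B, latent_dim) latent code from the correspondence maps
        # z:   (B, code_len) one-hot target-angle code
        return g_s * (1.0 + self.sigma(z)) + self.mu(z)   # Eq. (2)

# Example: one-hot code selecting the 7th of 13 angle positions.
z = torch.zeros(1, 13); z[0, 6] = 1.0
h = ConditionalAngleEncoding(latent_dim=256)(torch.randn(1, 256), z)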
B. Multi-Scale Affine Transformer
For view generation, we propose a Multi-Scale Affine Transformer (MSAT) to reconstruct the image from coarse to fine scales. The idea is to parameterize the rigid-body transform by 2D rotation, scaling and shearing, followed by bilinear resampling. As shown in Figure 5, the affine transformation is a simple image warping model which uses a six-parameter matrix T to describe the parametric planar transformation.
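For clarity, a standard factorization consistent with the rotation, scaling and shearing mentioned above is given below; the symbols $\alpha$, $s_h$, $s_x$, $s_y$ are introduced here only for illustration, and the exact parameterization used in the implementation may differ. The translation entries $\theta_{13}$, $\theta_{23}$ are set to zero, as discussed next.
\begin{equation*} T=\begin{bmatrix} \theta _{11} & \theta _{12} & \theta _{13}\\ \theta _{21} & \theta _{22} & \theta _{23} \end{bmatrix}, \qquad \begin{bmatrix} \theta _{11} & \theta _{12}\\ \theta _{21} & \theta _{22} \end{bmatrix} = \underbrace {\begin{bmatrix} \cos \alpha & -\sin \alpha \\ \sin \alpha & \cos \alpha \end{bmatrix}}_{\text {rotation}} \underbrace {\begin{bmatrix} 1 & s_{h}\\ 0 & 1 \end{bmatrix}}_{\text {shear}} \underbrace {\begin{bmatrix} s_{x} & 0\\ 0 & s_{y} \end{bmatrix}}_{\text {scale}} \end{equation*}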
Note that we do not consider translation in this work; details regarding pose sampling are discussed in Section IV. Instead of learning one uniform set of affine parameters, we learn a separate affine transform for the feature maps at each scale. In other words, the network is not designed for one-shot projection but for multi-scale projection. Hence, at each feature scale, we can finely adjust the projection before fusing the results. Mathematically, we can describe the process as
\begin{equation*} F_{out}=\sum _{j=1}^{3} Conv\big (T_{left}^{j} F_{left}^{j} \oplus T_{right}^{j} F_{right}^{j} \big) \tag{3}\end{equation*}
where $F_{left}^{j}$ and $F_{right}^{j}$ are the reference features at scale $j$, $T_{left}^{j}$ and $T_{right}^{j}$ are the corresponding affine transforms, and $\oplus$ denotes the fusion of the warped features.
As illustrated in Figure 3, we learn three sets of affine matrices, one for each feature scale, so that the reference features are aligned progressively from coarse to fine before being fused into the final result.
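A minimal PyTorch sketch of Eq. (3) is given below, using affine_grid / grid_sample for the per-scale 2D warps. The choice of channel-wise concatenation for $\oplus$ and the bilinear up-sampling to a common resolution before summation are assumptions of this sketch, not details confirmed by the paper.

import torch
import torch.nn.functional as F

def affine_warp(feat, theta):
    """Warp a (B, C, H, W) feature map with a (B, 2, 3) affine matrix."""
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

def msat_fuse(feats_left, feats_right, thetas_left, thetas_right, convs, out_size):
    """feats_*: lists of per-scale features (coarse to fine); convs: one conv
    block per scale; out_size: (H, W) of the fused output."""
    out = 0
    for f_l, f_r, t_l, t_r, conv in zip(feats_left, feats_right,
                                        thetas_left, thetas_right, convs):
        # Eq. (3): Conv(T_left F_left (+) T_right F_right) at this scale.
        warped = torch.cat([affine_warp(f_l, t_l), affine_warp(f_r, t_r)], dim=1)
        fused = conv(warped)
        # Bring every scale to a common resolution, then sum over the scales.
        out = out + F.interpolate(fused, size=out_size,
                                  mode='bilinear', align_corners=False)
    return out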
C. Loss Functions
Our See360 model works as an image generator (G) that fuses two references into a novel view. To encourage the model to generate images with photo-realistic visual quality, we designed a simple 2D CNN as a multi-scale style discriminator (D), following the same idea as in [45], [46], to supervise the generation. Specifically, we use two discriminators ($D_{1}$ and $D_{2}$) operating at different image scales, and optimize the following min-max objective:
\begin{align*}&\hspace {-0.5pc} \min _{G} \max _{D_{1}, D_{2}} \sum _{k=1,2} \mathcal {L}(G(\theta, I^{left}, I^{right}), \\&\qquad\qquad\qquad\qquad\qquad D_{k}(I_{\theta }^{pre}, I^{left}, I^{right}, M_{\theta }^{gt})) \tag{4}\end{align*}
where $\theta$ is the target camera pose, $I^{left}$ and $I^{right}$ are the reference views, $I_{\theta }^{pre}$ is the predicted view and $M_{\theta }^{gt}$ is the ground-truth semantic segmentation map of the target view.
1) Generator Loss:
The generator is designed to match the image content and style between the prediction and the ground truth. For the predicted view $I_{\theta }^{pre}$ and its ground truth $I_{\theta }^{gt}$, we compute the projection distribution (pd) loss between deep features $\phi(\cdot)$ extracted from the two images:
\begin{equation*} \mathcal {L}_{pd}= \sum _{j=1} W_{p}(\phi (I_{\theta }^{pre}), \phi (I_{\theta }^{gt})) \tag{5}\end{equation*}
where $W_{p}$ measures the discrepancy between the projected feature distributions at the $j$-th layer.
We also adopt a feature matching loss that compares the discriminator features of the ground truth and of the prediction across the $C$ layers of $D_{1}$ and $D_{2}$:
\begin{align*}&\hspace {-0.5pc}\mathcal {L}_{feat}= \frac {1}{C}\sum _{j=1}^{C} \lVert D_{1,2}^{j}(I_{\theta }^{gt}, I^{left}, I^{right}, M_{\theta }^{gt}) \\&\qquad\qquad\qquad\qquad\qquad\qquad - D_{1,2}^{j}(I_{\theta }^{pre}, I^{left}, I^{right}, M_{\theta }^{gt}) \rVert \tag{6}\end{align*}
The total generator loss combines the adversarial, SSIM, projection distribution, feature matching and Laplacian terms, weighted by the corresponding $\lambda$ factors:
\begin{align*}&\hspace {-0.5pc}\mathcal {L}_{G}= \mathcal {L}_{adv}(G, D_{1,2}) + \lambda _{ssim}(1-\mathcal {L}_{ssim}) \\&\qquad\qquad\qquad\qquad+ \lambda _{pd}\mathcal {L}_{pd} + \lambda _{feat}\mathcal {L}_{feat} + \lambda _{lap}\mathcal {L}_{lap} \tag{7}\end{align*}
where the adversarial loss is
\begin{align*}&\hspace {-0.5pc} \mathcal {L}_{adv}= \frac {1}{2}\sum _{j=1}^{2} \log \left [{D_{j} (I_{\theta }^{gt}, I^{left}, I^{right}, M_{\theta }^{gt})}\right ] \\&\qquad\qquad\qquad+ \log \left [{1-D_{j}(I_{\theta }^{pre}, I^{left}, I^{right}, M_{\theta }^{pre})}\right ] \tag{8}\end{align*}
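To show how these terms combine in practice, below is a hedged PyTorch sketch of the generator objective of Eq. (7). The non-saturating form of the adversarial term, the placeholder callables for the SSIM, projection distribution and Laplacian losses, and the $\lambda$ weights are all illustrative assumptions rather than the paper's exact settings.

import torch

def generator_loss(pred, gt, d_feats_pred, d_feats_gt, d_logits_pred,
                   ssim, pd_loss, lap_loss,
                   lam_ssim=1.0, lam_pd=1.0, lam_feat=10.0, lam_lap=1.0):
    # Adversarial term (non-saturating variant used here for the sketch):
    # the generator tries to make D score the prediction as real.
    l_adv = -torch.log(d_logits_pred + 1e-8).mean()
    # Feature matching, Eq. (6): distance between D features of gt and prediction.
    l_feat = sum(torch.mean(torch.abs(fg - fp))
                 for fg, fp in zip(d_feats_gt, d_feats_pred)) / len(d_feats_gt)
    return (l_adv
            + lam_ssim * (1.0 - ssim(pred, gt))     # SSIM term in Eq. (7)
            + lam_pd * pd_loss(pred, gt)            # projection distribution, Eq. (5)
            + lam_feat * l_feat
            + lam_lap * lap_loss(pred, gt))         # Laplacian pyramid loss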
D. Data Collection and Generation
Our target is to render 360° views for both indoor and outdoor environments. Most existing datasets are collected either as object-centered views from a virtual environment [32] or in indoor environments [33], [34]. To demonstrate the versatility and efficiency of the proposed See360 model, we provide two different types of datasets for training and evaluation: 1) virtual-world data and 2) real-world data. The basic process was to fix the camera at a given location and then horizontally rotate it (around the y-axis) to capture images covering the 360° view. The angular step was set to 5°, enabling us to collect 72 views at each location. For each scene, we also collected the segmentation maps. Note also that the datasets were captured so as to avoid moving objects, so that the static 3D structure can be learned by the proposed model.
For the virtual-world data, we used the open-source project UnrealCV [50] to render indoor and outdoor views from virtual worlds. Based on the available toolkit, we randomly placed the camera (with distances ranging from 5 m to 100 m) and controlled it to capture images at different angles. For the indoor environment, we used the realistic virtual world “ArchinteriorsVol2Scene2”, which describes a house with one bedroom and one bathroom; we refer to it as “Archinterior360”. For the outdoor environment, we used another realistic virtual world called “UrbanCity”, where the scene is a street block; we refer to it as “UrbanCity360”. For both UrbanCity360 and Archinterior360, we randomly placed the camera at 100 different locations to capture the scenes, and used another 10 locations to capture images for evaluation. In summary, each dataset includes 7200 training images and 720 testing images. We also collected semantic segmentation maps from UnrealCV [50] for training.
For the real-world data, as shown in Figure 1(a) and (b), we installed the camera on an electronic tripod head to collect different views. We collected both indoor and outdoor datasets for evaluation (with distances ranging from 5 m to 30 m). For the outdoor dataset, we placed the camera on a street of Hung Hom, Hong Kong, randomly choosing 14 locations and collecting 72 views at each, for a total of 1008 images; we refer to this dataset as “HungHom360”. For the indoor dataset, we captured views inside our laboratory and refer to it as “Lab360”.
As shown in Figure 6, the four datasets (UrbanCity360, Archinterior360, HungHom360 and Lab360) cover a large variety of indoor and outdoor environments with a large variety of contents. For example, UrbanCity360 contains views of a whole street block, with buildings, trees, street lamps and many other outdoor objects. More importantly, it exhibits very strong lighting changes, such as shadows on the ground. Archinterior360 has more structural textures coming from the interior design, requiring the view rendering model to have good 3D spatial awareness. HungHom360 features natural lighting with a little fog; because it was captured in an open area, the depth of the views changes a lot, which makes the rendering of novel views quite difficult. For Lab360, the challenge is that the ceiling lights cause uneven lighting conditions; moreover, the number of small objects makes view prediction difficult.
Experiments
A. Datasets
Since our goal is to achieve 360° neural view rendering without any prior about the 3D environment (e.g. no depth map, point cloud or other information), the left and right views we provide as input need to overlap slightly. In our experiments, we set the angle between them to 60° (which is close to the FOV limit of most common cameras). This ensures limited overlap, so that the two references show quite different parts of the scene. To complete a 360° view, we can use 6 such references and interpolate a maximum of 72 - 6 = 66 intermediate views. As introduced in Section III-A, we set the length of the one-hot angle code to 13, covering the two references and the 11 intermediate positions in 5° steps.
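As a small worked example of this setup, the helper below maps a target angle to its one-hot code given the left reference angle. The function name and the assumption that the code spans the positions from the left to the right reference in 5° steps are ours, but the resulting 13-bit code matches the in-the-wild example given later in Section IV-G.

def one_hot_angle_code(left_ref_deg, target_deg, step=5, ref_spacing=60):
    """Build the one-hot target-angle code for a view between two references."""
    length = ref_spacing // step + 1                 # positions from left to right reference
    index = (target_deg - left_ref_deg) % 360 // step
    assert 0 <= index < length, "target must lie between the two references"
    return "".join("1" if i == index else "0" for i in range(length))

# e.g. references at 120 and 180 degrees, target view at 150 degrees:
# one_hot_angle_code(120, 150) -> "0000001000000"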
UrbanCity360 and Archinterior360 are virtual-world rendered datasets that consist of 7200 training images and 720 testing images of size
HungHom360 and Lab360 are real-world rendered datasets with size
B. Implementation Details
1) Network Architecture:
As shown in Figure 3, the generator of the proposed See360 method firstly downsamples the feature maps 3 times by a factor of 2. Each downsampling unit consists of 1 convolution layer (with kernel size
2) Methods to Compare With:
Since this work is, to the best of our knowledge, the first on camera-centered 360° neural view rendering, there is no existing approach tackling exactly the same problem that we can compare with. However, as shown in Section II, a few works have explored novel view rendering by proposing different network architectures. For comparison, we chose three state-of-the-art methods and implemented their structures to solve our problem: Conditional GAN [23], HoloGAN [35] and Pix2pixHD [45], as shown in Figure 2. We followed the authors' original designs, with a few modifications to fit our goal. For Conditional GAN, we concatenated the left and right references as input, and the camera pose was directly introduced in the generator and discriminator as a condition for generation. For HoloGAN, we replaced the input 3D model with 2 reference images and added 4 convolution layers to extract image features; we then followed the design in Figure 2(b) to add 3D convolution, projection and 2D convolution to generate images. For Pix2pixHD, we used the same network as the Conditional GAN with four modifications: 1) added Instance normalization on every convolution layer, 2) added
C. Evaluation Metrics
1) Quantitative Evaluation:
To objectively evaluate the data fidelity of view rendering, we used PSNR, SSIM and running time. To avoid boundary effects, the computation of PSNR and SSIM excludes an 8-pixel-wide region around the image boundaries.
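As an illustration of this protocol, a minimal sketch of the cropped-border PSNR computation is given below (the function name and the uint8 value range are assumptions); SSIM can be computed analogously on the same cropped images with any standard implementation.

import numpy as np

def psnr_cropped(pred, gt, border=8, max_val=255.0):
    """pred, gt: (H, W, 3) arrays of the predicted and ground-truth views."""
    # Discard an 8-pixel border before measuring fidelity, as described above.
    p = pred[border:-border, border:-border].astype(np.float64)
    g = gt[border:-border, border:-border].astype(np.float64)
    mse = np.maximum(np.mean((p - g) ** 2), 1e-12)   # guard against identical images
    return 10.0 * np.log10(max_val ** 2 / mse)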
2) Qualitative Evaluation:
To evaluate the visual quality of view rendering beyond simply inspecting the generated images, we used LPIPS [54] to measure deep feature similarity; it is widely used in image processing tasks [55]–[57]. Using HRNet [53], we also estimated the semantic segmentation map of each generated image and compared it with the ground truth via average pixel accuracy (%).
D. Comparisons With Prior Works
Let us compare our approach with the other methods on the 4 different datasets, using the 5 evaluation metrics just discussed. Results are shown in Table I. As illustrated in Section III-B, we sparsely selected 6 references (0°, 60°, 120°, 180°, 240° and 300°) to cover the whole 360° view. For each pair of adjacent references, we predicted one view every 5°, leading to 66 novel views in total per test location.
For visual comparison, Figure 7 shows two examples from the virtual-world datasets and two examples from the real-world datasets (where C-GAN stands for Conditional GAN). We visualize both the generated images and the segmentation maps. In Figure 7A, we can see that the proposed See360 reconstructs the details of the buildings, while the other methods distort the textures. For example, though the 30° views predicted by C-GAN and Pix2pixHD look like real scenes, they fail to match the correct viewing angle. In Figure 7B, we can see clear differences among the methods. The challenge of this scene is that the lighting conditions are vividly rendered by the virtual engine and the interior structure is highly correlated. C-GAN, Pix2pixHD and HoloGAN all fail to reconstruct the scene. Hugin's results, on the other hand, suffer from holes in the new scenes; in the 30° and 90° examples of Archinterior360, it fails either to fill the whole scene or to predict the unseen contents. In contrast, See360 clearly reconstructs the wooden ceiling and the windows. For the real-world examples in Figure 7C and D, we can see that C-GAN, Pix2pixHD and HoloGAN fail to adapt their models to these real scenes. Hugin still suffers from holes in the new scenes; it even misses the building in the 90° view of HungHom360. Our proposed See360 can still predict the real scenes (outdoor or indoor) at different viewing angles. For example, in Figure 7C, for the view at 210°, See360 correctly predicts the trees on the left and the colorful statue on the right. In Figure 7D, See360 not only estimates the workplace but also renders the lighting conditions, such as the shadows cast by the lights.
Examples of our proposed UrbanCity360 (A), Archinterior360 (B), HungHom360 (C) and Lab360 (D) datasets. The UrbanCity360 and Archinterior360 are rendered from the virtual world. The HungHom360 and Lab360 are rendered from the real world.
These results show that our proposed See360 method successfully achieves 360° view synthesis on both virtual-world and real-world scenes, whereas the other three methods fail. Conditional GAN and Pix2pixHD can solve the problem to some extent but do not provide the 3D awareness necessary to render from a specific viewing angle; the reason is that they use the camera pose as an extra input of the network to implicitly control the view rendering. On the contrary, our model explicitly injects the camera pose into the network as a 2D affine transformation in the feature space. HoloGAN does not work across different datasets because it uses 3D convolution to explicitly discover the 3D representation; however, the real scene is not object-centered. Since views from different angles describe different 3D structures, 3D convolution cannot uniformly project different scenes into a single 3D space for estimation. In contrast, our proposed See360 method learns to use an equivalent 2D affine transform to estimate the 3D model for novel views at different angles.
E. Ablation Study
In the ablation study, we consider the effect of three key components: 1) Cross patch correlation (CPC), 2) Multi-Scale Affine Transformer (MSAT) and 3) the multi-scale style discriminator with or without semantic segmentation (SS). For the comparison, we refer to these three configurations as CPC, MSAT and SS. In Table II, we report PSNR, SSIM and LPIPS on each dataset to quantitatively measure their importance.
In Table II, “Full Pipeline” is our complete See360 model. To test the three key components, we successively removed each of them from the full pipeline and retrained the model. We find that the most important component is the Multi-Scale Affine Transformer (MSAT): it allows the scene to be reconstructed from coarse to fine, filling in details at different scales, while adaptively estimating different 2D affine parameters at each scale to fuse the reference views. The “CPC” component, used to extract global feature correspondences, also improves performance by about 0.3 dB in PSNR and 0.01 in SSIM. We regard “SS” as an extra constraint on image generation: as shown in Equation 8, by using the semantic segmentation as layout information (the contour of the whole scene) and combining it with the generated images, the discriminator learns to distinguish whether the generated image has structural information similar to the ground truth.
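A minimal sketch of how this layout information can enter the discriminator is given below, assuming simple channel-wise concatenation of the candidate image, the two references and the segmentation map; the concatenation scheme and the channel counts are assumptions of this sketch, consistent only with the arguments of D in Eq. (4) and Eq. (8).

import torch

def discriminator_input(image, img_left, img_right, seg_map):
    # image:   (B, 3, H, W) prediction or ground-truth view
    # seg_map: (B, 1, H, W) semantic layout of the target view
    # The discriminator then scores the stacked tensor as real or fake.
    return torch.cat([image, img_left, img_right, seg_map], dim=1)  # (B, 10, H, W)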
In Table III, we present another ablation study using the different losses introduced in Section III-C. The key losses are the Laplacian loss (lap loss), the Projection Distribution loss (pd loss) and the VGG feature loss (VGG loss). The See360 variants denote models trained with different combinations of losses, and See360 denotes the best model that makes use of all three. We measured PSNR, SSIM and LPIPS on the UrbanCity360 dataset. Using the lap loss improves the SSIM because it computes the structural similarity of images from differences of Gaussians. Using the pd loss and the VGG loss improves the LPIPS but lowers the PSNR and SSIM scores because of the trade-off between pixel distortion and visual quality [66]. Combining all three loss terms, See360 achieves the best LPIPS as well as the second-best PSNR and SSIM, which indicates that our proposed See360 can synthesize photo-realistic novel views well.
F. Robust and Flexible 3D Awareness
As discussed in Section II, our proposed See360 method can recognize different camera poses to synthesize distinct views. In our training setting, the largest angle interval (distance) between the left and right references was 60°, and See360 could predict novel views every 5°. The two examples in Figure 8 further demonstrate the flexible 3D awareness of our method: the proposed See360 model can recognize every camera pose and render novel scenes at different viewing angles. The visual quality of the predicted views is good enough for view interpolation, which suggests that the method has the potential to be used for 360° scenery videos. We provide a video demo at https://youtu.be/P1JHx7ViSpI. Please note the smooth motion transitions and view changes. There can be some blurring or distortions due to sudden, large structural changes or additional noise in the reference images; the visual quality could be further improved by using larger datasets and image pre-processing.
Visualization of view synthesis of fine camera poses on UrbanCity360 and HungHom360 datasets.
To test the limits of the See360 model, we also investigated whether it can use references with a larger angle distance while still predicting the novel views correctly. We used UrbanCity360 as an example to show the results of using references with different angle distances.
Visualization of view synthesis using references with different angle distances. The upper figure shows the PSNR results against the camera pose; different line colors represent the different reference settings used for testing. Below, we show two predicted views at camera poses 210° and 330°. The average PSNR is listed in the top right corner.
Another aspect of robust 3D awareness is fast adaptation for practical applications, i.e., the ability to reuse a pre-trained model on new scenes with minimal fine-tuning. We collected real-world datasets to evaluate this versatility of the proposed See360 model. Since the real-world datasets contain far fewer images for training, we first pre-trained the model on the virtual-world datasets and then fine-tuned it on the targeted real-world datasets for better prediction. As shown in Table I, we only fine-tuned our models for 100 epochs to adapt to the real-world datasets, which required about 10 minutes on a computer with one GPU. Figure 10 illustrates the robustness of See360 across datasets by showing the intermediate results of fine-tuning See360 on the HungHom360 dataset. As the number of training epochs increases, the visual quality improves, e.g. for the stairs and trees. Moreover, even after only 5 epochs of training, the model already achieves reasonable results.
Visualization of view synthesis on HungHom360 dataset. We demonstrate the intermediate training results at different epochs.
G. Novel View Synthesis in the Wild
So far, we have presented evaluations on the few datasets on which the model was trained. In addition, the proposed model also works, to some extent, in the wild. Given two reference views of an unknown scene, we do not even need to know the angle between the left and right references: each intermediate view is represented by a unique one-hot code, so all we need is the one-hot code of the target camera pose. To illustrate this, we collected a few views inside a university campus by randomly rotating the camera and used any two of them as references to interpolate the middle view, e.g., by feeding the one-hot camera pose code “0000001000000” for prediction. Note again that we do not train on these unknown scenes.
We directly use the models pre-trained on UrbanCity360 and Archinterior360 for prediction. As shown in Figure 11, even though UrbanCity360 and Archinterior360 are virtual-world datasets, both models can render novel views with similar structures. Another interesting finding is that the UrbanCity360 model achieves better results than the Archinterior360 model. A possible reason is that UrbanCity360 is also an outdoor dataset with diverse contents, including buildings and trees, and is therefore a better match for unknown outdoor scenes.
Visualization of view synthesis on unknown scenes captured inside a campus. We do not train the model but directly use two pre-trained models for estimation: 1) pre-trained UrbanCity360 model and 2) pre-trained Archinterior360 model.
However, when the photos show views that differ strongly from our training data, such as photos taken in other countries, the proposed See360 model may not work properly. Since we cannot actually travel to different places to test different views, as shown in Figure 12, we used Google Maps to locate three landmarks in Paris, London and New York: the Louvre museum, the British Museum and a Manhattan street. Google Maps provides panoramic views, so we captured each panorama and cut it into three sub-images to mimic multi-view image capture. We took the left and right images as references to predict the center view, directly applying the proposed See360 (trained on our own data) to these three scenes. See360 can predict some of the views, but with artifacts. These artifacts can be explained by two factors, which also point to clear directions for improvement: 1) the panorama is a wide view blended from several images, so it contains discontinuous lens distortion; therefore, the extracted left and right reference images do not meet the criteria of our training setting. 2) The captured views come from around the world and are therefore very different from the views we used for training. We use these challenging examples to show that the architecture of See360 is solid and can be further improved when larger datasets are used for training.
Novel view synthesis of unknown scenes captured around the world, including the Louvre museum in Paris, the British Museum in London and a Manhattan street in New York. The left and right views are cut from a panorama and serve as references, and the center view is the one being predicted.
Conclusion
In this paper, we open up a new research direction: using neural rendering to generate views at different viewing angles from only two overlapping input images, a key contribution towards helping people understand their surroundings. Our novel view synthesis method, called See360, builds on two carefully designed components: 1) a Multi-Scale Affine Transformer (MSAT) and 2) a Conditional Latent space AutoEncoder (C-LAE). The key insight is that we recast 3D view rendering as an equivalent 2D affine transformation in the feature space. We further contribute two types of datasets, consisting of synthetic and real images respectively, for training and evaluation. Our See360 model has the potential to be used in 360° video processing for virtual reality. In future work, we plan to extend See360 to combine both 2D and 3D information for high-resolution image/video rendering.
ACKNOWLEDGMENT
The authors thank Dr. Li-Wen Wang for helping with the data collection and equipment setup.