Introduction
The acquisition of 3D models is a frequent problem in computer graphics and vision. Most existing methods, such as laser scanning and multi-view reconstruction, are based on observations of surface color. Consequently, the surface is assumed to be opaque and approximately Lambertian. These methods cannot be directly applied to transparent objects because the appearance of a transparent object is indirectly observed owing to the complex refraction and reflection light paths at the interface between air and transparent materials.
A core technical challenge in 3D transparent object reconstruction is handling the dramatic changes in appearance that occur when observing an object in a multi-view setting. Slight changes in an object's shape can lead to nonlocal changes in appearance owing to the complexity of light paths. To address this issue, we utilized ray-pixel correspondence (i.e., the correspondence between a camera ray and a pixel on a static background pattern displayed on a monitor) and ray-ray correspondence (i.e., the correspondence between a camera ray and the incident ray from the background pattern) to provide light path constraints that facilitate 3D transparent object reconstruction [1]–[3]. A differentiable refraction-tracing technique can be applied to reduce the complexity of the capture setting, and the 3D shape can be recovered through ray-pixel correspondences, as shown in Ref. [4]. However, in this method, the transparent object must be placed on a turntable under controlled lighting conditions. Li et al. [5] trained a physics-based neural network to handle complex light paths for 3D transparent objects. The network was trained on a synthetic dataset with a differentiable path-tracing rendering technique. This method optimizes surface normals in a latent space; thus, it can reconstruct 3D transparent objects under natural lighting conditions when given an environment map and a few images as input. However, it frequently produces overly smooth reconstruction results.
In this study, we consider how to combine the advantages of explicit meshes and multilayer perceptron (MLP) networks in a hybrid representation to address the problem of reconstructing transparent objects under natural lighting conditions using images captured with a handheld camera. This representation can be reconstructed through optimization using a differentiable path-tracing rendering technique. The key idea is to use an MLP to encode a vertex displacement field (VDF) defined on a base mesh to reconstruct surface details, wherein the base mesh is created using multi-view silhouette images. Our design is motivated by two observations. First, the representation of functions using MLPs has been demonstrated to be efficient in optimization and robust to noise [6]–[8]. The MLP network parameterizes the VDF globally with its weight parameters; hence, it implicitly provides global constraints on changes in the VDF. Second, defining the MLP-parameterized VDF on the base mesh reduces the search space during optimization [9]. This significantly accelerates the optimization process compared to MLP-based volumetric representations.
The advantage of our hybrid representation is that it allows for relaxation of the capture setting. Because the global smoothness constraints between vertex displacements are implied in the MLP weights, the ray-pixel correspondence required in the optimization can be significantly relaxed to a ray-cell correspondence in our pipeline. Consequently, we can simplify the background pattern design and develop a robust single-image environment matting (EnvMatt) algorithm for handling images captured under natural lighting conditions. Compared to the capture settings used in the work of Wu et al. [2] and Lyu et al. [4], our handheld capture setting is low-cost and simple. Moreover, we propose to represent the VDF using a small number of local MLPs, each responsible for encoding a local VDF. This strategy enables the design of small-scale MLPs to further accelerate the optimization process. A fusion module is designed to disperse the gradient information of the vertex displacement vectors to their neighboring local MLPs. This module helps maintain global constraints between local MLPs and produces high-quality reconstruction results.
The contributions of this study are summarized as follows.
We present a hybrid representation that employs an explicit mesh and local-MLP-based functions to represent the detailed surface of transparent objects. This approach enables us to design small-scale MLPs to accelerate our optimization algorithm's convergence and achieve high-quality 3D reconstruction results for transparent objects.
We propose a ray-cell correspondence as a relaxed representation of the light path constraint. The ray-cell correspondence is easier to capture, leading to a simplified capture setting under natural lighting conditions. Furthermore, it also eases the implementation of the EnvMatt algorithm.
The experimental results demonstrate that our method can produce 3D models with details for a variety of transparent objects, as illustrated in Fig. 1. With our simplified capture setting under natural light conditions, our reconstruction results were superior to those of state-of-the-art 3D reconstruction algorithms for transparent objects.
Our reconstruction results paired with the associated renderings of three transparent objects. The fine surface details can be reconstructed well via our method using images captured with a handheld camera under natural lighting conditions.
Related Work
Our algorithm is designed on the basis of considerable previous research. Here, we review the literature most related to the present work, including studies on transparent object reconstruction, differentiable rendering, and EnvMatt.
Transparent object reconstruction. Many transparent object reconstruction techniques utilize special hardware setups, including polarization [10]–[12], time-of-flight cameras [13], tomography [14], moving point light sources [15], [16], and light field probes [17]. The proposed algorithm is most closely related to shape-from-distortion and light-path triangulation. Kutulakos and Steger [1] formulated the reconstruction of a refractive or mirror-like surface as a light path triangulation problem. Given a function that maps each point in an image onto a 3D “reference point” that is indirectly projected onto it, the authors characterized a set of reconstructible cases that depend only on the number of points along a light path. The mapping function can be estimated using an EnvMatt algorithm with a calibrated acquisition setup, yielding ray-point correspondences. A ray-ray correspondence can be uniquely determined with two distinct reference points along the same ray.
In accordance with light path triangulation, one reconstructible case is that of single-refraction surfaces [18]–[20], particularly fluid surfaces [21]–[23]. Another tractable case is that of transparent objects when rays undergo refraction twice [24]–[26]. Wu et al. [2] recently reconstructed the full shape of a transparent object by first extracting ray-ray correspondences and then performing separate optimization and multi-view fusion. Lyu et al. [4] proposed the extraction of per-view ray-point correspondences using the EnvMatt algorithm in Ref. [27], and utilized differentiable rendering to progressively optimize an initial mesh.
In addition to optimization-based methods, deep learning techniques can also be incorporated to resolve depth-normal ambiguities [28], [29]. Li et al. [5] suggested performing optimization in the feature space to obtain surface normals. Subsequently, they performed multi-view feature mapping and 3D point-cloud reconstruction to obtain a 3D shape. Their method works on a simple acquisition setting with only one known environment map and approximately 10 captured images. However, their reconstructed transparent object may lose some details owing to the domain gap between the real-world images and synthetic training data.
Differentiable rendering. In accordance with the simulation level of light transport, differentiable rendering algorithms in computer graphics can be roughly divided into three categories: differentiable rasterization [30]–[34], differentiable volumetric rendering [7], [8], [35], [36], and differentiable ray tracing [37]–[43]. Differentiable rasterization can be used to optimize a mesh itself or its features, and the neural network parameters defined on the mesh. Differentiable volumetric rendering can be used to optimize implicit shape representations, such as the implicit occupancy function [44], [45], signed distance function (SDF) [7], [46], and unsigned distance function [47]. Differentiable rendering has also been used to optimize deep surface light fields [9], where per-vertex view-dependent reflections are represented using an MLP. While we also utilize surface-based MLPs, our focus is different: our method employs local MLPs to represent the VDF locally to reconstruct surface details and designs a fusion layer to avoid discontinuities at the overlapped surface areas.
Considering that a light path with refraction is determined by the front and back surfaces of a transparent object, the geometry can be optimized in an iterative manner with forward ray tracing and backward gradient propagation. To this end, our algorithm exploits differentiable ray tracing to handle the light paths of the reflected and refracted rays on the surface of transparent objects.
Environment matting. EnvMatt, which captures how an object refracts and reflects environment light, can be viewed as an extension of alpha matting [48], [49]. Image-based refraction and reflection are represented as pixel-texel (texture pixel) correspondences, in which environments are represented as texture maps. The seminal work of Zongker et al. [27] extracted EnvMatt from a series of 1D Gray codes, assuming that each pixel is only related to a rectangular background region. Chuang et al. [50] extended this work to recover a more accurate model at the expense of using more structured light backdrops. They also proposed a simplified EnvMatt algorithm that uses only a single backdrop. A pixel-texel correspondence search can also be performed in the wavelet [51] and frequency [52] domains. The number of required patterns can be reduced by combining them with a compressive sensing technique [53]. Chen et al. [54] recently presented a deep learning framework called TOM-Net to estimate EnvMatt as a refractive flow field. The aforementioned methods require images to be captured under controlled lighting conditions (e.g., in a dark room) to avoid the influence of ambient light. Wexler et al. [55] developed an EnvMatt algorithm for handling natural-scene backgrounds. However, their method required capturing a set of images using a fixed camera and a moving background.
Overview
Our transparent object reconstruction pipeline is shown in Fig. 2. It begins by reconstructing an object's rough shape (initial shape) from a collection of multi-view silhouettes. Instead of the space-carving method [56], we utilized the MLP-based signed distance function (SDF) in IDR [7] to obtain a smooth initial shape, as shown in Fig. 2. Subsequently, we employ MLP networks to represent the vertex displacement field (VDF) on the initial shape to reconstruct the surface details. This hybrid surface representation combines an explicit mesh and MLP-based neural networks. In the following section, we detail the hybrid representation and the optimization algorithm for reconstructing the representation from multi-view images.
Hybrid representation. We choose to encode the surface details with a VDF because it is defined on a 2D manifold instead of the entire 3D space, which reduces the search space of the optimization algorithm and produces high-quality reconstruction results. Moreover, we define the displacement field on the mesh vertices to simplify the optimization. Such a hybrid representation combines explicit vertex optimization, which accelerates convergence, with an MLP-based neural representation, as in IDR [7], which enforces global constraints among vertices and improves the robustness of the optimization.
Rather than encoding the VDF using a single MLP, we found that representing the field with a set of small local MLPs achieves better results. As shown in Fig. 2, each local MLP encodes the displacement vectors of the vertices within one cluster extracted from the mesh of the initial shape using the variational shape approximation (VSA) algorithm [57]. To avoid mesh discontinuities across local MLPs, we also add a fusion layer that blends the displacement vectors of neighboring vertices based on geodesic distances on the mesh [58].
Optimization for VDF. The VDF is optimized based on the multi-bounce (up to two bounces) light path constraints and the consistency between the rendering of our representation and the captured RGB images. The rendering procedure is performed using a recursive differentiable path-tracing algorithm [59].
The light path constraint due to multi-bounce refraction is approximated by a mapping function that maps each pixel in the input image onto a pixel of the background pattern image, which can be obtained using an EnvMatt algorithm. We store the background image as a texture. However, we found that traditional EnvMatt algorithms are either restricted to using multiple images with a fixed camera or are sensitive to natural light conditions. Consequently, we designed a grid-based background pattern to establish the correspondence between a foreground pixel and a grid cell of the background pattern, i.e., the ray-cell correspondence.
In the remainder of this paper, we first describe our data pre-processing steps (Section 4.1), including the image acquisition setup and grid-based single-image EnvMatt algorithm. Then, we present the details of the initial shape reconstruction (Section 4.2) and surface optimization steps (Section 4.3).
Method
4.1 Pre-Processing
Data acquisition. We captured images using a Canon EOS 60D digital single-lens reflex camera. The transparent object to be captured was placed on a desk with preprinted AprilTags [60] underneath; the AprilTags were used to facilitate image registration. To capture the ray-cell correspondences for EnvMatt, we placed an iPad as a monitor behind the transparent object to display a grid-based background pattern. The displayed pattern was switched after every four captured images, and the iPad was repositioned after every 60 images, to provide denser ray-cell correspondences.
Grid-based single-image EnvMatt. Similar to Ref. [50], we assume that a transparent object has no intrinsic color. Therefore, the correspondence between pixels that cover the object's surface and pixels on the background pattern can be calculated by searching in color space. However, we found that the color ramp pattern in Ref. [50] is sensitive to ambient light under natural light conditions because of the invertible and smooth properties of the pattern image in RGB space. Therefore, we designed grid-based background patterns in this study. The color of each grid cell is constant and is designed to create sharp boundaries between cells. As shown in Fig. 5, each pattern consists of a grid of cells, in which a sparse set of cells is assigned salient colors and the remaining cells are filled with checkerboard colors.
The proposed EnvMatt algorithm calculates ray-cell correspondences for each pixel. Given a captured image, for each pixel $\boldsymbol{p}$ with color $\boldsymbol{c}_{\boldsymbol{p}}$, its corresponding location $\boldsymbol{u}$ on the background pattern is determined as
\begin{equation*}\boldsymbol{u}=\begin{cases}\boldsymbol{c}_{r}\left(\underset{i}{\arg\!\min}\Vert \boldsymbol{c}_{s}^{i}-\boldsymbol{c}_{\boldsymbol{p}}\Vert\right), & \min\limits_{i}\Vert \boldsymbol{c}_{s}^{i}-\boldsymbol{c}_{\boldsymbol{p}}\Vert < \gamma_{1}\\ \text{inf}, & \min\limits_{i,j}\left(\Vert \boldsymbol{c}_{h}^{i}-\boldsymbol{c}_{\boldsymbol{p}}\Vert, \Vert \boldsymbol{c}_{s}^{j}-\boldsymbol{c}_{\boldsymbol{p}}\Vert\right) > \gamma_{2}\\ \text{none}, & \text{otherwise}\end{cases}\tag{1}\end{equation*}
where $\boldsymbol{c}_{s}^{i}$ and $\boldsymbol{c}_{h}^{i}$ denote the designed colors of the salient and checkerboard cells, respectively, $\boldsymbol{c}_{r}(\cdot)$ returns the center of the matched salient cell, and $\gamma_{1}$, $\gamma_{2}$ are matching thresholds. The label "inf" indicates that the traced ray terminates outside the pattern, and "none" indicates that no salient correspondence is found.
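To illustrate Eq. (1), the following minimal NumPy sketch classifies a single pixel; the color arrays, cell centers, and the thresholds $\gamma_1$ and $\gamma_2$ are placeholder assumptions rather than the exact values used in our experiments.

```python
import numpy as np

def envmatt_label(c_p, salient_colors, salient_centers, checker_colors,
                  gamma1=0.05, gamma2=0.3):
    """Per-pixel ray-cell labeling following Eq. (1).

    c_p             : (3,) observed RGB color of the pixel, in [0, 1]
    salient_colors  : (Ns, 3) designed colors of the salient cells
    salient_centers : (Ns, 2) 2D centers of the salient cells on the pattern
    checker_colors  : (Nc, 3) colors used for the checkerboard cells
    gamma1, gamma2  : matching / rejection thresholds (placeholder values)
    """
    d_salient = np.linalg.norm(salient_colors - c_p, axis=1)
    d_checker = np.linalg.norm(checker_colors - c_p, axis=1)

    if d_salient.min() < gamma1:
        # Close enough to one salient color: return that cell's center (u).
        return salient_centers[np.argmin(d_salient)]
    if min(d_salient.min(), d_checker.min()) > gamma2:
        # Far from every pattern color: the ray terminates outside the pattern.
        return "inf"
    # Otherwise the ray most likely ends in a checkerboard cell: no salient match.
    return "none"
```

Both the salient matches and the "no correspondence" labels are used later as light path constraints (Section 4.3.2).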
Image registration and 3D reconstruction. After capturing the images (see the example images in Fig. 3), we used the 3D reconstruction software RealityCapture [61] to register them, which enables us to trace rays in a unified coordinate frame during differentiable rendering. Because the iPad position is changed every 60 images to provide more ray-cell correspondences, we independently reconstructed a textured 3D scene (including the iPad) for every group of 60 images, resulting in several components in RealityCapture, where each component records one independently reconstructed 3D scene. All components are then registered based on the AprilTags [60] beneath the object, as shown in Fig. 4.
Considering that the background pattern displayed on the iPad changes after every four captured images, incorrect matching points are produced on the iPad's surface, causing the reconstruction of the 3D iPad plane to fail. To address this issue, we displayed additional AprilTags surrounding the background patterns to add extra matching points and guarantee a successful reconstruction of the iPad plane.
Each 3D pattern plane is thereby located in the unified coordinate frame, so that refracted rays can be intersected with the displayed pattern during differentiable rendering.
4.2 Initial Shape Reconstruction
We utilized IDR [7] with silhouette (mask) loss to obtain the initial shape of a transparent object. The object masks were manually annotated on several selected images; the number of masks used is listed in Table 1. Because we only used the silhouette loss, the “neural renderer” MLP in IDR was removed. As shown in Fig. 6, IDR with silhouette loss produces smoother reconstruction results than the space-carving algorithm. After reconstructing the initial mesh, we uniformly scale the mesh such that the diameter of its bounding ball equals one, extract a fine-grained mesh, and perform edge-collapse mesh simplification. During simplification, we fix the target edge length of each triangle at 0.005.
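As a small illustration of this post-processing, the sketch below (NumPy only) uniformly scales the vertices so that the bounding-ball diameter equals one; the subsequent refinement and edge-collapse simplification to the 0.005 target edge length would be delegated to a mesh-processing library and are only indicated by a comment.

```python
import numpy as np

def normalize_to_unit_ball(vertices):
    """Uniformly scale mesh vertices so that their bounding ball has diameter 1.

    vertices : (N, 3) array of mesh vertex positions.
    Returns the scaled vertices; faces are unchanged by a uniform scale.
    """
    center = 0.5 * (vertices.min(axis=0) + vertices.max(axis=0))  # AABB center
    centered = vertices - center
    radius = np.linalg.norm(centered, axis=1).max()   # enclosing-ball radius
    return centered / (2.0 * radius)                  # diameter becomes 1

# The normalized mesh is then refined and simplified by edge collapses until
# the target edge length of each triangle is roughly 0.005.
```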
Scene mesh and pattern planes. The iPad is moved twice. During the image registration step, all images are registered to the same global coordinate frame.
Grid-based single-image EnvMatt procedure. (a) Input image. (b) EnvMatt results: colored pixels indicate that their traced rays terminate inside cells with designed salient colors; black pixels indicate that the rays terminate inside cells with checkerboard colors; and gray pixels indicate that the rays terminate outside the pattern. (c) The designed pattern. The circles indicate the centers of salient cells. (d) Chosen colors for salient and checkerboard cells.
Space carving vs. IDR with silhouette loss. The space-carving method produces artifacts in the occluded areas (indicated by the red arrow).
4.3 Surface Optimization Through Differentiable Rendering
Given the initial shape as the base mesh, we first group the mesh triangles into several clusters and then assign each cluster to a surface-based MLP. Thus, the VDF can be computed as a fusion of the outputs of the surface-based local MLPs. In particular, for each vertex $\boldsymbol{x}_{i}$, the local MLP assigned to its cluster $C_{\boldsymbol{x}_{i}}$ predicts a raw displacement vector \begin{equation*}\hat{\delta}\boldsymbol{x}_{i}=\text{MLP}_{C_{\boldsymbol{x}_{i}}}(\boldsymbol{x}_{i})\tag{2}\end{equation*}
To avoid discontinuity at the cluster boundaries, we introduce a differentiable fusion layer that blends the raw displacements of neighboring vertices, weighted by $w(\boldsymbol{x}_{i},\boldsymbol{x}_{j})$ computed from geodesic distances on the mesh, to obtain the final displacement vector \begin{equation*}\delta \boldsymbol{x}_{i}=\sum\limits_{j}w(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\cdot\hat{\delta}\boldsymbol{x}_{j}\tag{3}\end{equation*}
The architecture of each local MLP is illustrated in Fig. 7; it contains two fully connected (FC) layers. For each MLP, each input 3D vertex on the surface is first mapped to a 99-dimensional feature using positional encoding; we used 16 positional encoding frequencies, so the encoded feature has $3 + 2\times3\times16 = 99$ dimensions.
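The PyTorch sketch below shows the overall structure of this representation: a positional encoding that lifts a 3D vertex to 99 dimensions (16 frequencies), one small two-FC-layer MLP per cluster (Eq. (2)), and a fusion layer that blends the per-cluster outputs with precomputed geodesic-distance weights (Eq. (3)). The hidden width, the ReLU activation, and the dense weight matrix are illustrative assumptions rather than our exact implementation.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=16):
    """Map (N, 3) points to (N, 3 + 2*3*num_freqs) features (99-D for 16 freqs)."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * math.pi * x))
        feats.append(torch.cos((2.0 ** k) * math.pi * x))
    return torch.cat(feats, dim=-1)

class LocalVDF(nn.Module):
    """Per-cluster local MLPs plus a fusion layer (Eqs. (2) and (3))."""

    def __init__(self, num_clusters, hidden=128, num_freqs=16):
        super().__init__()
        in_dim = 3 + 2 * 3 * num_freqs                      # 99-D encoded vertex
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 3))
            for _ in range(num_clusters)])

    def forward(self, verts, cluster_id, fusion_w):
        """verts: (N, 3) base-mesh vertices; cluster_id: (N,) cluster index per
        vertex; fusion_w: (N, N) row-normalized geodesic weights w(x_i, x_j)."""
        enc = positional_encoding(verts)
        raw = torch.zeros_like(verts)                       # per-cluster outputs (Eq. 2)
        for c, mlp in enumerate(self.mlps):
            mask = cluster_id == c
            if mask.any():
                raw[mask] = mlp(enc[mask])
        return fusion_w @ raw                               # fused displacements (Eq. 3)
```

The displaced surface is then simply `verts + model(verts, cluster_id, fusion_w)`, which keeps the base-mesh connectivity fixed while only the MLP weights are optimized.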
Local MLP representation. Each local MLP is responsible for representing the vertex displacements inside a VSA cluster (shown as the colored patch on the object surface). A fusion layer is used to fuse the vertex displacements output by the local MLPs into a smooth VDF on the surface.
In the following, we first describe how to extract clusters from the base mesh and then describe the details of the designed loss terms and our optimization procedure.
4.3.1 Cluster Extraction
We utilized the Variational Shape Approximation (VSA) algorithm [57] to segment the initial shape into several clusters. The VSA algorithm tends to merge nearly co-planar triangles into the same cluster by minimizing a normal-based distortion error (the $\mathcal{L}^{2,1}$ metric) between the triangles in each cluster and the cluster's proxy plane.
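As a rough illustration of the distortion that VSA minimizes, the NumPy sketch below evaluates the area-weighted $\mathcal{L}^{2,1}$ error of one cluster against its proxy normal; the full VSA algorithm, which alternates distortion-minimizing region growing with proxy fitting, is not reproduced here.

```python
import numpy as np

def cluster_l21_error(verts, faces, proxy_normal):
    """Area-weighted L^{2,1} error of a triangle cluster w.r.t. its proxy normal.

    verts        : (N, 3) vertex positions
    faces        : (M, 3) vertex indices of the triangles in the cluster
    proxy_normal : (3,)   unit normal of the cluster's proxy plane
    """
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    cross = np.cross(v1 - v0, v2 - v0)
    area = 0.5 * np.linalg.norm(cross, axis=1)
    normals = cross / np.linalg.norm(cross, axis=1, keepdims=True)
    return np.sum(area * np.sum((normals - proxy_normal) ** 2, axis=1))
```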
4.3.2 Loss Terms
We search for the weight parameters of the local MLPs by minimizing the loss function given in Eq. (4):
\begin{align*}\mathcal{L}_{\text{total}}= & \lambda_{\text{rgb}}\mathcal{L}_{\text{rgb}}+ \lambda_{\text{corr}}\mathcal{L}_{\text{corr}}+ \lambda_{\text{ncorr}}\mathcal{L}_{\text{ncorr}}\\ & + \lambda_{\text{sil}}\mathcal{L}_{\text{sil}}+ \lambda_{\text{reg}}\mathcal{L}_{\text{reg}}\tag{4}\end{align*}
RGB loss. The RGB loss measures the difference between the pixel color in the captured image and the color rendered by recursively tracing the camera ray through our hybrid representation.
In particular, the camera ray through each pixel $\boldsymbol{p}$ is traced recursively: at each intersection with the object surface it spawns a reflection ray and a refraction ray, and the refraction ray is traced for up to two bounces before a color is fetched from the scene texture or the background pattern, as illustrated in Fig. 8.
Recursive ray-tracing procedure. During rendering, the refraction and reflection rays fetch colors from the scene texture and background pattern. If pixel
For each refraction ray above, the refraction color is attenuated along the light path according to the Fresnel term:
\begin{align*}\mathcal{F}^{\langle t1,t2\rangle}= & \frac{1}{2}\left(\frac{\eta^{\mathrm{i}}\, \boldsymbol{r}_{\boldsymbol{p}}^{t1}\cdot \boldsymbol{n}_{2}-\eta^{\mathrm{o}}\, \boldsymbol{r}_{\boldsymbol{p}}^{t2}\cdot \boldsymbol{n}_{2}}{\eta^{\mathrm{i}}\, \boldsymbol{r}_{\boldsymbol{p}}^{t1}\cdot \boldsymbol{n}_{2}+\eta^{\mathrm{o}}\, \boldsymbol{r}_{\boldsymbol{p}}^{t2}\cdot \boldsymbol{n}_{2}}\right)^{2} +\frac{1}{2}\left(\frac{\eta^{\mathrm{o}}\, \boldsymbol{r}_{\boldsymbol{p}}^{t1}\cdot \boldsymbol{n}_{2}-\eta^{\mathrm{i}}\, \boldsymbol{r}_{\boldsymbol{p}}^{t2}\cdot \boldsymbol{n}_{2}}{\eta^{\mathrm{o}}\, \boldsymbol{r}_{\boldsymbol{p}}^{t1}\cdot \boldsymbol{n}_{2}+\eta^{\mathrm{i}}\, \boldsymbol{r}_{\boldsymbol{p}}^{t2}\cdot \boldsymbol{n}_{2}}\right)^{2}\tag{5}\\ \boldsymbol{c}_{\boldsymbol{p}}^{t1}= & \left(1- \mathcal{F}^{\langle t1,t2\rangle}\right)\left(\frac{\eta^{t1}}{\eta^{t2}}\right)^{2} \boldsymbol{c}_{\boldsymbol{p}}^{t2}\tag{6}\end{align*}
where $\eta^{\mathrm{i}}$ and $\eta^{\mathrm{o}}$ are the IORs on the incident and transmitted sides of the interface, $\boldsymbol{r}_{\boldsymbol{p}}^{t1}$ and $\boldsymbol{r}_{\boldsymbol{p}}^{t2}$ are the ray directions before and after the refraction event, $\boldsymbol{n}_{2}$ is the surface normal at the intersection, and $\boldsymbol{c}_{\boldsymbol{p}}^{t1}$, $\boldsymbol{c}_{\boldsymbol{p}}^{t2}$ are the colors carried along the corresponding ray segments.
In our experiments, we set the IOR of air to 1.0003 and the IOR of the object material to 1.52. As shown in Fig. 8, the reflection color is weighted by the Fresnel term $\mathcal{F}$ accordingly.
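A minimal NumPy sketch of Eqs. (5) and (6) is given below; the mapping of the superscripts to function arguments (incident versus transmitted IOR, incoming versus refracted direction) is our reading of the notation, and the directions and normal are assumed to be unit vectors.

```python
import numpy as np

ETA_AIR, ETA_GLASS = 1.0003, 1.52   # IORs used in our experiments

def fresnel_reflectance(r_in, r_out, n, eta_in, eta_out):
    """Unpolarized Fresnel reflectance at a refraction event (Eq. (5)).

    r_in, r_out : unit directions of the ray before / after refraction
    n           : unit surface normal at the intersection
    eta_in/out  : IORs on the incident / transmitted side
    """
    cos_i = abs(float(np.dot(r_in, n)))
    cos_t = abs(float(np.dot(r_out, n)))
    r_s = (eta_in * cos_i - eta_out * cos_t) / (eta_in * cos_i + eta_out * cos_t)
    r_p = (eta_out * cos_i - eta_in * cos_t) / (eta_out * cos_i + eta_in * cos_t)
    return 0.5 * (r_s ** 2 + r_p ** 2)

def attenuate_refraction_color(c_next, fresnel, eta_in, eta_out):
    """Carry the fetched color back along the refraction ray (Eq. (6))."""
    return (1.0 - fresnel) * (eta_in / eta_out) ** 2 * c_next
```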
The reflection and refraction colors are fetched from the textured scene mesh or the corresponding pattern. If a reflected ray does not intersect the scene mesh or the pattern, we only use the refraction color. The RGB loss over the valid foreground pixels $\boldsymbol{M}^{t}$ is then \begin{equation*}\mathcal{L}_{\text{rgb}}=\frac{1}{\vert \boldsymbol{M}^{t}\vert}\sum\limits_{\boldsymbol{p}}\boldsymbol{M}_{\boldsymbol{p}}^{t}\Vert \boldsymbol{c}_{\boldsymbol{p}}^{in}-\boldsymbol{c}_{\boldsymbol{p}}\Vert_{1}\tag{7}\end{equation*}
Correspondence loss and no-correspondence loss. The correspondence loss $\mathcal{L}_{\text{corr}}$ penalizes the deviation between $\boldsymbol{v}_{\boldsymbol{l}_{\boldsymbol{p}}^{t2}}$, the point where the refraction ray traced from pixel $\boldsymbol{p}$ hits the pattern plane, and the matched salient cell center $\boldsymbol{u}$ obtained by our EnvMatt algorithm: \begin{equation*}\mathcal{L}_{\text{corr}}=\frac{1}{\vert \boldsymbol{M}^{t}\vert}\sum\limits_{\boldsymbol{p}}\boldsymbol{M}_{\boldsymbol{p}}^{t}d\left(\boldsymbol{v}_{\boldsymbol{l}_{\boldsymbol{p}}^{t2}}, \boldsymbol{u}\right)\tag{8}\end{equation*}
\begin{equation*}d(\boldsymbol{q}_{1},\boldsymbol{q}_{2})=\begin{cases}\Vert \boldsymbol{q}_{1}-\boldsymbol{q}_{2}\Vert_{2}, & \text{if}\ \Vert \boldsymbol{q}_{1}-\boldsymbol{q}_{2}\Vert_{\infty} > l/2\\ 0, & \text{otherwise}\end{cases}\tag{9}\end{equation*}
where $l$ is the side length of a grid cell; the distance is truncated to zero once the traced point already falls inside the matched cell, which relaxes the ray-pixel constraint to a ray-cell constraint.
For pixels with no salient correspondence (labeled "none" in Eq. (1)), the refraction ray should not terminate inside any salient cell. Letting $\boldsymbol{g}$ denote the center of the nearest salient cell, we penalize such terminations with \begin{equation*}\mathcal{L}_{\text{ncorr}}=-\frac{1}{\vert \boldsymbol{M}^{t}\vert}\sum\limits_{\boldsymbol{p}}\boldsymbol{M}_{\boldsymbol{p}}^{t}\hat{d}\left(\boldsymbol{v}_{\boldsymbol{l}_{\boldsymbol{p}}^{t2}}, \boldsymbol{g}\right)\tag{10}\end{equation*}
\begin{equation*}\hat{d}(\boldsymbol{q}_{1},\boldsymbol{q}_{2})=\begin{cases}\Vert \boldsymbol{q}_{1}-\boldsymbol{q}_{2}\Vert_{2}, & \text{if}\ \Vert \boldsymbol{q}_{1}-\boldsymbol{q}_{2}\Vert_{\infty} < 0.5\\ 0, & \text{otherwise}\end{cases}\tag{11}\end{equation*}
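The two truncated distances of Eqs. (9) and (11) can be written as below (NumPy sketch); `cell_size` stands for the cell side length $l$, and the 0.5 threshold is taken directly from Eq. (11).

```python
import numpy as np

def corr_distance(q1, q2, cell_size):
    """Relaxed ray-cell distance of Eq. (9): zero once the traced point q1
    already lies inside the matched cell centered at q2."""
    if np.max(np.abs(q1 - q2)) > 0.5 * cell_size:
        return float(np.linalg.norm(q1 - q2))
    return 0.0

def ncorr_distance(q1, q2):
    """Truncated distance of Eq. (11); with the negative sign in Eq. (10) it
    pushes traced points away from salient cells they should not hit."""
    if np.max(np.abs(q1 - q2)) < 0.5:
        return float(np.linalg.norm(q1 - q2))
    return 0.0
```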
Silhouette loss and regularization loss. We also added silhouette loss [63] to the annotated object masks similar to the initial shape reconstruction step (Section 4.2). Moreover, to further constrain the optimization, we added a regularization loss as Eq. (12):
\begin{equation*}\mathcal{L}_{\text{reg}}=\lambda_{\text{ls}}\mathcal{L}_{\text{ls}}+\lambda_{\text{nc}}\mathcal{L}_{\text{nc}}+\lambda_{\text{pc}}\mathcal{L}_{\text{pc}}\tag{12}\end{equation*}
\begin{align*}& \mathcal{L}_{\text{ls}}= \sum\limits_{\boldsymbol{v}_{i}}\Big\Vert\frac{1}{\vert \mathcal{N}(\boldsymbol{v}_{i})\vert}\sum\limits_{\boldsymbol{v}_{j}\in \mathcal{N}(\boldsymbol{v}_{i})}(\boldsymbol{v}_{j}- \boldsymbol{v}_{i})\Big\Vert_{2}\tag{13}\\ & \mathcal{L}_{\text{nc}}= \sum\limits_{e\in \mathcal{E}}(1-\log(1+ \boldsymbol{n}_{1}^{e}\cdot \boldsymbol{n}_{2}^{e}))\tag{14}\\ & \mathcal{L}_{\text{pc}}=\frac{1}{\vert \mathcal{S}^{1}\vert} \sum\limits_{\boldsymbol{x}^{1}\in \mathcal{S}^{1}} \min\limits_{\boldsymbol{x}^{2}\in \mathcal{S}^{2}}\Vert \boldsymbol{x}^{1}-\boldsymbol{x}^{2} \Vert_{2}^{2} + \frac{1}{\vert \mathcal{S}^{2}\vert} \sum\limits_{\boldsymbol{x}^{2}\in \mathcal{S}^{2}} \min\limits_{\boldsymbol{x}^{1}\in \mathcal{S}^{1}}\Vert \boldsymbol{x}^{1}-\boldsymbol{x}^{2} \Vert_{2}^{2}\tag{15}\end{align*}
where $\mathcal{L}_{\text{ls}}$ is a uniform Laplacian smoothness term summed over all mesh vertices, $\mathcal{L}_{\text{nc}}$ encourages consistent normals $\boldsymbol{n}_{1}^{e}$ and $\boldsymbol{n}_{2}^{e}$ of the two faces adjacent to each edge $e\in\mathcal{E}$, and $\mathcal{L}_{\text{pc}}$ is the symmetric chamfer distance between the point sets $\mathcal{S}^{1}$ and $\mathcal{S}^{2}$.
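A PyTorch sketch of the three regularization terms is shown below; the padded one-ring indices, the per-edge face-normal pairs, and the two point sets are placeholder inputs assumed to be precomputed from the mesh.

```python
import torch

def laplacian_loss(verts, nbr_idx, nbr_mask):
    """Uniform Laplacian smoothness (Eq. (13)).
    nbr_idx : (N, K) padded one-ring vertex indices
    nbr_mask: (N, K) 1.0 for valid neighbors, 0.0 for padding
    """
    nbrs = verts[nbr_idx]                                        # (N, K, 3)
    mean_offset = ((nbrs - verts[:, None, :]) * nbr_mask[..., None]).sum(1) \
                  / nbr_mask.sum(1, keepdim=True).clamp(min=1.0)
    return mean_offset.norm(dim=1).sum()

def normal_consistency_loss(n1, n2):
    """Normal consistency across shared edges (Eq. (14));
    n1, n2: (E, 3) unit normals of the two faces adjacent to each edge."""
    return (1.0 - torch.log1p((n1 * n2).sum(dim=1))).sum()

def chamfer_loss(s1, s2):
    """Symmetric chamfer distance between two point sets (Eq. (15))."""
    d = torch.cdist(s1, s2)                                      # (N1, N2)
    return d.min(dim=1).values.pow(2).mean() + d.min(dim=0).values.pow(2).mean()
```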
Remark
To increase stability during optimization, we removed three types of camera rays: (1) nearly perpendicular to the surface normals at their intersection with the surface during recursive ray tracing
Because the light path inside a transparent object is complex, optimizing the surface shape based on the local RGB loss alone was not sufficient, even with our pyramid loss or a perceptual loss (VGG loss). Consequently, we utilized the correspondence-based losses to obtain gradients that move the traced intersection points toward their matched cells on the background pattern.
4.4 Implementation Details
Initial shape optimization. As described earlier, our initial shape reconstruction step is based on IDR [7]. We found that with only silhouette loss, an insufficient number of rays may cause holes on the surfaces or sometimes generate another surface beneath the surface of the object. Thus, we increased the number of rays sampled from an image to 20,800, and each batch contained rays sampled from three images. We set the learning rate as
Surface-based MLP optimization. In the surface-based MLP optimization step, we randomly cropped
View selection. We captured images and moved the iPad such that, for each high-curvature region not near the silhouettes, the salient cells in the background pattern were refracted by this region in more than two of the captured images. Because the object geometry was not available at capture time, the high-curvature regions were identified by visual inspection. The masks to be manually annotated were selected adaptively with a view-selection algorithm, which iteratively adds views whose viewing directions differ by at least 60 degrees from all previously selected views. In some cases, a few extra views were then added, as determined by checking the visual hull.
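The greedy rule described above can be sketched as follows (NumPy); `view_dirs` are assumed to be unit viewing directions of the registered cameras, and the 60-degree threshold comes from the description above.

```python
import numpy as np

def select_views(view_dirs, min_angle_deg=60.0):
    """Greedily pick views whose directions differ by at least `min_angle_deg`
    from every previously selected view.  view_dirs: (V, 3) unit vectors."""
    cos_thresh = np.cos(np.deg2rad(min_angle_deg))
    selected = [0]                                    # start from the first view
    for i in range(1, len(view_dirs)):
        cosines = view_dirs[selected] @ view_dirs[i]  # cos of angles to chosen views
        if np.all(cosines < cos_thresh):              # far enough from all of them
            selected.append(i)
    return selected
```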
Experiments
We applied our algorithm to reconstruct the 3D shapes of five transparent objects, as shown in Fig. 1 and Fig. 10, which were made from glass or crystals. The captured images of the five objects are shown in Fig. 9. The size of each object, the number of input images, the number of manually annotated masks, and the number of random patterns with moving frequency are listed in Table 1. In the following section, we demonstrate the advantages of our surface-based local MLP representation, perform ablation studies, and compare our method with state-of-the-art transparent object reconstruction methods.
Captured images for five transparent objects: the cat object, the cow object, the dog object, the trophy object, and the brick object with a bumpy front surface.
Reconstruction results and their corresponding rendering results for a trophy object and a brick object with a bumpy front surface.
To quantitatively evaluate reconstruction accuracy, we painted each object with DPT-5 developer as in Ref. [2] and scanned it with a 3D scanner to obtain a ground-truth mesh in metric units (m), as shown in Fig. 11. We compared the reconstructed results with the ground truth after aligning them using ICP [71], and evaluated the reconstruction by measuring the chamfer distance between the two point clouds.
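A sketch of this evaluation using Open3D is shown below; the file paths, the ICP correspondence threshold, and the specific chamfer convention (mean of nearest-neighbor distances in both directions) are assumptions for illustration.

```python
import numpy as np
import open3d as o3d

def evaluate(recon_path, gt_path, icp_threshold=0.005):
    """Align the reconstruction to the ground-truth scan with ICP, then report
    a symmetric chamfer distance between the two point clouds (in meters)."""
    recon = o3d.io.read_point_cloud(recon_path)   # points from the reconstructed mesh
    gt = o3d.io.read_point_cloud(gt_path)         # points from the scanned mesh
    reg = o3d.pipelines.registration.registration_icp(
        recon, gt, icp_threshold)                 # point-to-point ICP by default
    recon.transform(reg.transformation)

    d_rg = np.asarray(recon.compute_point_cloud_distance(gt))
    d_gr = np.asarray(gt.compute_point_cloud_distance(recon))
    return d_rg.mean() + d_gr.mean()
```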
5.1 Evaluations
5.1.1 Surface-Based Local MLP Representation
We performed a comparison to verify the importance of the surface-based local MLP representation. To this end, four other shape representations were included: (1) explicit mesh vertices, as in Ref. [4], denoted by Vert; (2) mesh vertices with the advanced optimizer in Ref. [66], denoted by Vert-LS (Large Step); (3) an SDF encoded by a single MLP, similar to IDR [7], denoted by Refract-IDR; and (4) an SDF encoded by a single MLP with explicit flows, as in Ref. [67], denoted by SDF-EF (explicit flows). All representations were optimized using the same loss functions as in Section 4.3. The implementation details of the representations used in the comparison are as follows.
Vert. This representation explicitly optimizes the position of each vertex using the same loss as ours and with an ADAM optimizer. The learning rate was set to
Vert-LS. This representation is similar to Vert but with the gradient calculation method and the new optimizer proposed in Ref. [66]. The gradient steps in Ref. [66] already involve the Laplacian energy. Therefore, we removed the explicit Laplacian smoothness loss. We set the diffusion weight
Refract-IDR. The Refract-IDR representation is a modified version of IDR [7]. We keep the mask loss and eikonal loss of IDR and extend its sphere tracing to handle two-bounce refraction and one-bounce reflection rays. The RGB loss in IDR is replaced by our RGB loss, and we also add our corr and ncorr losses, with the same weight for each loss term as in our method. During optimization, the gradient of the normal at each intersection point is back-propagated to the MLP through the autograd operation. The Refract-IDR network has eight hidden layers of 256 dimensions each. The activation function, level of positional encoding, optimizer, and learning rate are the same as in IDR.
SDF-EF. The SDF-EF representation is the same as in Mehta et al. [67]. It also uses an SDF encoded by a single MLP to represent the surface. Unlike Refract-IDR, it back-propagates the gradients to the MLP network through explicit mesh vertices extracted with the marching cubes algorithm [67], [72]; during optimization, a mesh is extracted by marching cubes at each iteration. The architecture of the SDF MLP, activation function, level of positional encoding, optimizer, and learning rate are the same as in Refract-IDR. We also add the Laplacian smoothness term, as in Ref. [67], to regularize the surface.
As shown in Fig. 11 and Table 2, our Surf-MLP representation outperformed the other representations. Explicit mesh optimization introduces high-frequency artifacts. Although the Vert-LS method reduces these artifacts, it introduces folds in some areas. Meanwhile, the Refract-IDR and SDF-EF representations converge slowly and produce overly smooth results that lose some details.
The explicit mesh optimization described above yields artifacts because explicit optimization is sensitive to noise. These artifacts can be reduced by optimizing the vertices in a coarse-to-fine manner, i.e., by progressively remeshing during explicit mesh optimization. We denote Vert and Vert-LS with remeshing as Vert R and Vert-LS R, and compared our method with them. Similar to Lyu et al. [4], we remesh after a fixed number of iterations (30 epochs in our experiment); at each remeshing stage, the target triangle size is fixed, and it decreases from stage to stage. As shown in Fig. 12, remeshing improves the quality of the reconstructed results compared with Vert and Vert-LS. However, our method still outperformed these two methods in terms of the chamfer distance to the ground truth, as shown in Fig. 12. We believe this is because the weights of the loss terms must be re-tuned after each remeshing stage; if the weights are not set adaptively, the Vert R and Vert-LS R results may suffer from over-smoothing or local high-frequency artifacts.
5.1.2 Comparison with Li et al. [5] and Lyu et al. [4]
The method proposed by Li et al. [5] needs to capture images in an environment large enough to meet the distant-illumination assumption for the environment map. Thus, we chose to evaluate this method using images obtained by rendering the ground-truth mesh illuminated by an environment map (no background geometry). Specifically, we used our ground-truth mesh with one environment map to render multi-view images, corresponding object masks, and mirror sphere images, as shown in Fig. 13. The generated environment map is shown in Fig. 13. A total of 36 images were rendered for visual hull reconstruction, and 10 of them were uniformly sampled as input images to the network. As illustrated in Fig. 14, this method produces smooth surfaces that lose detail near the object boundaries, whereas our reconstructed model preserves relatively high-precision details compared with the ground-truth mesh. In addition to the results using 10 views, we also tested 20 input views. For the method of Li et al. [5], adding more views did not improve the details of the reconstructed surface in this experiment, as shown in Fig. 14; here, 10 or 20 views were used as network inputs, and the initial visual hull was computed from all 36 images.
For the comparison with Lyu et al. [4], since their dataset only provides ray-pixel correspondences and masks, we replaced our RGB loss, corr loss, and ncorr loss with the refraction loss of Ref. [4]. As illustrated in Fig. 15, our reconstructed model preserves more surface detail. Qualitative and quantitative comparison results are shown in Fig. 15 and Table 3, where "mask diff" is the mean L1 distance between the rendered masks and the input masks. As shown in the fifth column of Table 3, the ground-truth mesh is not precisely consistent with the input masks, which may be due to the DPT-5 developer coating used during scanning. The chamfer distances of our results are slightly larger than those of this method; however, our mask difference is smaller, which means that our reconstructed model matches the input masks better. We added the mask-difference measurement in this experiment because the silhouettes provided by Lyu et al. [4] are much more accurate than segmentation results.
5.1.3 Number of Clusters
We first performed ablation studies to evaluate the influence of the number of clusters (i.e., the number of MLPs). Figure 16 shows that our surface-based local MLP representation obtains better results as the number of clusters increases from 1 to 50, 100, and 150. We also compared our local MLP representation to a global MLP with nine hidden layers of 256 dimensions. For the global MLP, we also tested two different positional encoding settings, using the number of positional encoding frequencies
5.2 Ablation Study
We remove each loss term individually to evaluate its impact on the reconstruction result of the cat object, as shown in Fig. 17. Overall, the correspondence loss
In the second row of Fig. 17, we demonstrate the purpose of the RGB loss
Influence of the number of MLPs (#MLP) and the MLP architecture, measured by the chamfer distance between the reconstructed mesh and the ground-truth mesh. PE_k/d_l: positional encoding with $k$ frequencies and hidden layers of dimension $l$.
Ablation study of loss terms on the cat object. The numbers indicate the chamfer distances between the reconstructed mesh and the ground truth mesh.
Limitations and Future Work
Our environment-matting algorithm can only find correspondences under the assumption that the transparent object has no intrinsic color; consequently, our method cannot reconstruct colored transparent objects. In addition, because the ray-cell correspondences are sparse, our method requires more views to reconstruct the surface and may miss some details, especially for surfaces with complex occlusions. Another limitation is that the object masks must be annotated manually. In the future, it would be interesting to investigate how to integrate variables for the color or other material properties of the transparent object to overcome the "no intrinsic color" limitation, and how to extract accurate transparent object masks based on single-image EnvMatt.
Conclusions
In this study, we developed a method to reconstruct 3D shapes of transparent objects from handheld captured images under natural light conditions. Our method comprises two components: a surface-based MLP representation that encodes the vertex displacement field based on the initial shape, and a surface optimization through differentiable rendering and EnvMatt. We used an iPad as a background to provide ray-cell correspondences, a simplified capture setting, to facilitate the optimization. Our method can produce high-quality reconstruction results with fine details under natural lighting conditions.
Declaration of Competing Interest
The authors have no competing interests to declare that are relevant to the content of this article.
ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their constructive comments. Weiwei Xu is partially supported by “Pioneer” and “Leading Goose” R&D Program of Zhejiang (No. 2023C01181). Jiamin Xu is partially supported by National Natural Science Foundation of China (No. 62302134) and Zhejiang Provincial Natural Science Foundation (No. LQ24F020031). This paper is supported by Information Technology Center and State Key Lab of CAD&CG, Zhejiang University.