Introduction
Computer vision has always been concerned with understanding the 3D world around us. One of the main challenges when dealing with 3D data is the representation strategy, which has been addressed over the years by introducing various discrete representations, including voxel grids, point clouds, and meshes. Each representation has its advantages and disadvantages, especially when it comes to processing it with deep learning, leading to the development of a plethora of ad-hoc algorithms [1], [2], [3] for each coexisting representation. Hence, no standard way to store and process 3D data has yet emerged.
Recently, a new representation has been proposed: Neural Fields (NFs) [4]. They are continuous functions defined at all spatial coordinates, parameterized by a neural network such as a Multi-Layer Perceptron (MLP). In the context of 3D world representation, various types of NFs have been explored. Some of the most common NFs utilize the Signed/Unsigned Distance Field (SDF/UDF) [5], [6], [7], [8] and the Occupancy Field (OF) [9], [10] to represent the 3D surfaces or volumes of the objects in the scene. Alternatively, strategies seeking to capture both geometry and appearance often leverage the Radiance Field (RF), as shown in the pioneering approach NeRF [11].
Representing a 3D scene by encoding it with a continuous function parameterized as an MLP decouples the memory cost of the representation from the spatial resolution. In other words, starting from the same fixed number of parameters, it is possible to reconstruct a surface with arbitrarily fine resolution or to render an image with arbitrarily high quality. Furthermore, the same neural network architecture can be used to learn various field functions, offering the possibility of a unified framework for representing 3D objects.
Owing to their efficacy and potential benefits, 3D NFs are garnering growing interest from the scientific community, as evidenced by the frequent publication of novel and impressive results [8], [12], [13], [14]. This leads us to speculate that, in the near future, NFs could establish themselves as a standard way to store and communicate 3D data. It is conceivable that repositories hosting digital twins of 3D objects, exclusively realized as MLPs, might become widely accessible.
The above scenario prompts an intriguing research question: can 3D NFs be directly processed using deep learning pipelines for solving downstream tasks, as it is commonly done with discrete representations such as point clouds or images? For instance, is it feasible to classify an object by directly processing the corresponding NeRF without rendering any image from it?
Since NFs are neural networks, there is no straightforward way to process them. A recent work in the field, Functa [15], fits the whole dataset with a shared network conditioned on a different embedding for each sample. In this formulation, a possible solution is to use such embeddings as the input for downstream tasks. Nevertheless, representing an entire dataset through a shared network poses a formidable learning challenge, as the network struggles to accurately fit all the samples (see Section VII).
On the contrary, recent studies, including SIREN [16] and others [17], [18], [19], [20], [21], have demonstrated that it is possible to achieve high-quality reconstructions by tailoring an individual network to each input sample. This holds true even when dealing with complex 3D shapes or images. Furthermore, constructing an individual NF for each object is more adaptable to real-world deployment, as it does not require the availability of the entire dataset to fit each individual sample. The increasing popularity of such methodologies suggests that adopting the practice of fitting an individual network is likely to become commonplace in learning NFs.
Therefore, in the earlier version of this paper [22], we explored conducting downstream tasks with deep learning pipelines on 3D data represented as individual NFs. Recently, several methods addressing this topic have been published, such as NFN [42], NFT [23], and DWSNet [24], all of which process individual NFs, supporting this paradigm.
Using NFs as input or output data is intrinsically non-trivial, as the MLP of a single NF can encompass hundreds of thousands of parameters. However, deep models inherently present a significantly redundant parameterization of the underlying function, as shown in [25], [26]. As a result, we explore whether and how an answer to the research question mentioned earlier might be identified within a representation learning framework. We present an approach that encodes individual NFs into compact and meaningful embeddings, making them suitable for diverse downstream tasks. We name this framework nf2vec.
Overview of our framework. Left: NFs hold the potential to provide a unified representation of the 3D world. Center: our framework, dubbed nf2vec, embeds each input NF into a compact, task-agnostic latent code by processing only the NF weights. Right: these embeddings can be fed to standard deep learning pipelines to solve a variety of downstream tasks.
Our framework has at its core an encoder designed to produce a task-agnostic embedding representing the input NF by processing only the NF weights. These embeddings can seamlessly be used in downstream deep learning pipelines, as we validate for various tasks, like classification, retrieval, part segmentation, unconditioned generation, completion, and surface reconstruction. Remarkably, the last two tasks become achievable by learning a straightforward mapping between the embeddings generated using our framework, as embeddings derived from NFs exist in low-dimensional vector spaces, regardless of the underlying implicit function. For instance, we can learn the mapping between NFs of incomplete objects into NFs of normal ones. Then, we can complete shapes by exploiting this mapping, e.g., we can map the NF of an airplane with a missing wing into the NF of a complete airplane. Furthermore, we show that it is possible to learn a mapping between the embedding spaces of different kinds of NFs, e.g., to recover the geometry of an object given only the embedding of its NeRF (Section VI).
This paper builds on our previous work [22], with revisions to the overall framework and thorough experiments on novel scenarios. Specifically, the key differences with [22] are:
In [22], we focused solely on neural fields representing the surfaces of 3D objects. In this extended version, we also tackle the processing of neural fields capturing objects’ geometry and appearance. Specifically, we extend our framework to perform deep learning tasks on NeRFs by directly processing their MLP weights.
The processing of MLPs parametrizing NFs has been investigated in works published concurrently with or after [22]. We extend our literature review by including these recent papers, and we evaluate them to foster progress in this emerging topic and facilitate future comparisons.
Overall, the summary of our work contributions is:
We propose and investigate the novel research problem of applying deep learning directly on individual NFs representing 3D objects.
We introduce nf2vec, a framework designed to derive a meaningful and compact representation of an input NF solely by processing its weights, without the need to sample the underlying function.
We demonstrate that a range of tasks, typically tackled with intricate frameworks tailored to specific representations, can be effectively executed using simple deep learning tools on NFs embedded by nf2vec, regardless of the signal underlying the NFs.
We demonstrate the versatility of nf2vec by successfully applying it to neural fields that capture either the geometry alone or the combined information of both geometry and appearance of 3D objects.
We analyze recent methods for processing NFs in terms of classification accuracy and representation quality, and we build the first evaluation benchmark for NF classification.
Additional details, code, and datasets are available at https://cvlab-unibo.github.io/nf2vec.
Related Work
Neural Fields: Recent approaches have shown the ability of MLPs to parameterize fields representing any physical quantity of interest [4]. The works focusing on representing 3D shapes with MLPs rely on fitting functions such as the unsigned distance [6], the signed distance [5], [7], [10], [27], [28], [29], or the occupancy [9], [30]. Among these approaches, sinusoidal representation networks (SIRENs) [16] use periodical activation functions to capture the high-frequency details of the input data. In addition to representing shapes, some of these models have been extended to encode object appearance [11], [27], [31], [32], [33], or to include temporal information [34]. Among these recent approaches, modeling the radiance field of a scene [11] has proven to be the critical factor in obtaining excellent scene representations. In our work, we employ NFs encoding SDF, UDF, OF, and RF as input data for deep learning pipelines.
Deep Learning on Neural Networks: Several works have explored using neural networks to process other neural networks. [35] utilizes a network's weights as input and forecasts its classification accuracy. Another approach [36] involves learning a network representation via a self-supervised learning strategy applied to the network weights.
These works view neural networks as algorithms, primarily focusing on forecasting properties like accuracy. In contrast, some recent studies handle networks that implicitly represent 3D data, thus tackling various tasks directly from their weights, essentially treating neural networks as input/output data. Functa [15] tackles this scenario by acquiring priors across the entire dataset using a shared network and subsequently encoding each sample into a concise embedding employed for downstream discriminative and generative tasks. We note that in this formulation, each neural field is parametrized by both the shared network and the embedding. It is worth pointing out that, though not originally proposed as a framework to process neural fields, DeepSDF [5] learns dataset priors by optimizing a reconstruction objective through a shared auto-decoder network conditioned on a shape-specific embedding. Thus, the embeddings learned by DeepSDF may be used for neural processing tasks, as done in Functa.
However, shared network frameworks face several challenges: they struggle to reconstruct the underlying signal with high fidelity and require an entire dataset to learn the neural field of an object. In response, recent approaches have shifted their focus to processing NFs learned on individual data, e.g., a specific object or scene. The first framework adopting this view was proposed in the previous version of this paper [22]. This approach leverages representation learning to condense individual NFs of 3D shapes into embeddings, serving as input for subsequent tasks. [40] has recently built upon this idea to learn a bidirectional mapping between image/text and NeRF latent spaces. Recognizing that MLPs exhibit weight space symmetries [41], where hidden neurons can be permuted across layers without altering the network's function, recent approaches such as DWSNet [24], NFN [42], and NFT [23] leverage these symmetries as an inductive bias to create innovative architectures tailored for MLPs. DWSNet and NFN design neural layers equivariant to the permutations inherent in MLPs. In contrast, NFT achieves permutation equivariance by removing the positional encoding from a Transformer architecture. A recent work by [43] sidesteps the need to deal with MLP symmetries by proposing a Transformer-based architecture that processes NFs equipped with tri-planar grid features, focusing on those discrete features only.
Recently, HyperDiffusion [44] has proposed a generative diffusion approach to synthesize NF parameters. Like us, it employs MLPs optimized to represent individual data.
Learning to Represent NFs
This paper explores the possibility and the methodology of directly utilizing NFs for downstream tasks. Specifically, can we classify an object implicitly encoded in a NF, and if so, how? As outlined in Section I, we condense the redundant information encoded in the weights of NFs into latent codes by a representation learning framework. These codes can then be efficiently processed using standard deep-learning pipelines. Our framework, dubbed nf2vec, is described in detail in the remainder of this section.
3D Neural Fields: A field is a physical quantity defined for all domain coordinates. We focus on fields describing the 3D world, thus operating on spatial coordinates $\mathbf{p} = (x, y, z) \in \mathbb{R}^3$. A 3D NF is such a field parameterized by the weights of a neural network, typically an MLP, fit to an individual object or scene.
Encoder: The encoder takes as input the weights of a NF and produces a compact embedding that encodes all the relevant information of the input NF. Designing an encoder for NFs poses a challenge in handling weights efficiently to avoid excessive memory usage. While a straightforward solution might involve using an MLP encoder to map flattened weight vectors to the desired dimension, this approach becomes impractical for larger NFs. For instance, given a 4-layer, 512-neuron NF, mapping its 800K parameters to a 1024-dimensional embedding space would require an encoder with roughly 800M parameters, making this approach prohibitive. Thus, we focus on developing an encoder architecture that scales gracefully with the size of the input NF.
Following conventional practice [16], [17], [18], [19], [20], we consider NFs composed of an MLP with several hidden layers, each with the same number of neurons. We stack the weights and biases of these layers into a single matrix, whose rows are then processed independently by the encoder before being aggregated into a compact embedding.
Encoder architecture. Left: Given a NF, we stack its weights and biases to form a matrix. Right: each row of the matrix is processed independently by a sequence of linear layers, and a final max pooling across rows produces the compact embedding.
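To make the row-wise design concrete, the following is a minimal PyTorch sketch of such an encoder; the class and function names (NFEncoder, stack_hidden_layers) and the layer sizes are illustrative assumptions, not the actual nf2vec implementation.

```python
import torch
import torch.nn as nn

class NFEncoder(nn.Module):
    """Sketch of a row-wise NF encoder (hypothetical names and sizes).

    Rows of the stacked weight matrix are processed independently by shared
    linear layers; max pooling across rows yields a single embedding, so the
    encoder size depends on the NF width, not on its total parameter count.
    """
    def __init__(self, row_dim: int, embed_dim: int = 1024):
        super().__init__()
        self.row_net = nn.Sequential(
            nn.Linear(row_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, rows: torch.Tensor) -> torch.Tensor:
        feats = self.row_net(rows)      # (num_rows, embed_dim)
        return feats.max(dim=0).values  # pool across rows -> (embed_dim,)

def stack_hidden_layers(mlp: nn.Module) -> torch.Tensor:
    """Stack weights and biases of the hidden linear layers into one matrix;
    assumes all hidden layers share the same width."""
    rows = []
    for m in mlp.modules():
        if isinstance(m, nn.Linear):
            rows.append(m.weight.detach())             # (width, width)
            rows.append(m.bias.detach().unsqueeze(0))  # (1, width)
    return torch.cat(rows, dim=0)
```

With this design, a wider NF only enlarges the row dimension, whereas a deeper NF only adds rows, which is why the encoder scales gracefully as quantified in Table I.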
Our proposed architecture scales gracefully to bigger input NFs, as supported by the analysis in Table I, which reports the number of parameters of our encoder compared to those of a generic MLP encoder while varying the input NF dimension.
It is worth observing that the randomness involved in fitting an individual NF (weights initialization, data shuffling, etc.) causes the weights in the same position of the NF architecture not to share the same role across NFs. Thus, before fitting, we initialize all NFs with the same random vector, a simple strategy that aligns weights across NFs and that we analyze in depth in Section VIII.
Decoder: When learning to encode NFs, we are interested in storing the information about the represented object rather than the values of the input weights. Therefore, the adopted decoder predicts the original field values rather than reconstructing the input weights in an auto-encoder fashion. In particular, during training, we adopt an implicit decoder inspired by [5], which takes as input the embedding produced by the encoder and a spatial coordinate $\mathbf{p}$, and predicts the value of the field at that location.
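A minimal sketch of such an implicit decoder follows, under the assumption of plain ReLU layers and simple concatenation conditioning; the actual conditioning mechanism may differ.

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Embedding + 3D coordinate -> field value. out_dim is 1 for UDF/SDF/OF
    and 4 (density + RGB) for radiance fields."""
    def __init__(self, embed_dim: int = 1024, out_dim: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, z: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # z: (Q, embed_dim) embeddings, p: (Q, 3) query coordinates
        return self.net(torch.cat([z, p], dim=-1))
```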
Training and inference of nf2vec.
Training: We train our encoder and decoder following the input NFs training strategy. For instance, when dealing with UDF, SDF, and OF representing 3D surfaces, we supervise the framework directly using the ground truth field values computed from point clouds, voxel grids, or triangle meshes representing those surfaces. In contrast, when processing NeRFs, we employ volumetric rendering [11] on the radiance field values predicted by the decoder to obtain the RGB intensities of image pixels, and we supervise the framework directly with a regression loss between predicted and true RGB values.
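For reference, the volumetric rendering of [11] estimates the color of a ray $\mathbf{r}$ from the densities $\sigma_i$ and colors $\mathbf{c}_i$ predicted at $N$ samples along the ray:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big),$$

where $\delta_i$ is the distance between adjacent samples; the regression loss is then computed between $\hat{C}(\mathbf{r})$ and the ground-truth pixel color.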
To better understand the procedure, let us consider the example where we aim to learn to represent UDFs. We create a set of 3D queries paired with the values of the UDF at those locations. The decoder takes as input the embedding produced by the encoder, concatenated with the 3D coordinates of a query point, and estimates the UDF at this location. The whole encoder-decoder is supervised to minimize the discrepancy between the estimated and correct UDF values.
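Assuming the encoder and decoder sketched above, one training step of the UDF example could look as follows; the L1 discrepancy is an assumption, as the text only specifies that estimated and correct UDF values are compared.

```python
import torch
import torch.nn.functional as F

def udf_training_step(encoder, decoder, optimizer, nf_rows, queries, gt_udf):
    """nf_rows: stacked weight matrix of one NF; queries: (Q, 3); gt_udf: (Q, 1)."""
    optimizer.zero_grad()
    z = encoder(nf_rows)                             # (embed_dim,)
    z = z.unsqueeze(0).expand(queries.shape[0], -1)  # one code per query point
    pred = decoder(z, queries)                       # (Q, 1) estimated UDF
    loss = F.l1_loss(pred, gt_udf)                   # discrepancy to minimize
    loss.backward()
    optimizer.step()
    return loss.item()
```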
Inference: After the overall framework has been trained end to end, the frozen encoder can be used to compute embeddings of unseen NFs with a single forward pass (see Fig. 3 right) while the implicit decoder can be used, if needed, to reconstruct the discrete representation given an embedding. We highlight that no discrete representations are required at inference time.
The presented nf2vec framework is agnostic to the field encoded by the input NFs: the same encoder and decoder architectures can be employed whether the NFs represent UDF, SDF, OF, or RF.
Latent Space Properties
We train nf2vec on different types of NFs and investigate the properties of the resulting latent spaces.
Reconstruction: In Fig. 4, we compare 3D shapes reconstructed from NFs unseen during training with those reconstructed by the nf2vec decoder starting from the corresponding embeddings.
Interpolation: In Figs. 6 and 7, we linearly interpolate between two object embeddings produced by nf2vec and decode the interpolated embeddings, observing smooth transitions between the two shapes.
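The interpolation itself is a plain convex combination in the latent space; in the sketch below, decode_shape is a hypothetical helper standing for field decoding followed by mesh extraction.

```python
import torch

ts = torch.linspace(0.0, 1.0, steps=8)
# z_a, z_b: embeddings of the two objects produced by the frozen encoder;
# decode_shape is hypothetical: query the decoder on a dense grid of
# coordinates, then extract the surface (e.g., via marching cubes).
shapes = [decode_shape(decoder, (1 - t) * z_a + t * z_b) for t in ts]
```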
Additionally, given two input NeRFs, we render images from networks obtained by interpolating their weights. In Fig. 8, we compare these results with those obtained from the interpolation of nf2vec embeddings.
t-SNE Visualization of the Latent Space: In Fig. 9, we provide the t-SNE visualization of the embeddings produced by nf2vec.
t-SNE visualizations of nf2vec embeddings.
Deep Learning on 3D Shapes
This section shows how several tasks dealing with 3D shapes can be tackled by working only with nf2vec embeddings as input and/or output data.
General Settings: In all the experiments reported in this section, we convert 3D discrete representations into NFs featuring 4 hidden layers with 512 nodes each, using the SIREN activation function [16]. We discard the input and output layers of SIREN MLPs when processing them with nf2vec.
Point Cloud Retrieval: We examine the potential of using nf2vec embeddings for shape retrieval: given the embedding of a query point cloud, we retrieve the shapes whose embeddings are closest in the latent space.
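A sketch of this retrieval procedure, assuming Euclidean distance in the embedding space:

```python
import torch

def retrieve(query_z: torch.Tensor, gallery_z: torch.Tensor, k: int = 5):
    """query_z: (D,) embedding; gallery_z: (N, D) gallery embeddings."""
    dists = torch.cdist(query_z.unsqueeze(0), gallery_z).squeeze(0)  # (N,)
    return torch.topk(dists, k, largest=False).indices  # k nearest shapes
```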
Point cloud retrieval qualitative results. Given the embedding of a query shape (leftmost column), we show the shapes associated with the closest embeddings in the latent space.
Shape Classification: We then address the problem of classifying point clouds, meshes, and voxel grids. We use three datasets for point clouds: ShapeNet10, ModelNet40, and ScanNet10 [52]. When dealing with meshes, we conduct our experiments on the Manifold40 dataset [3]. Finally, we use ShapeNet10 again for voxel grids, quantizing clouds into voxel grids of fixed resolution.
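The quantization of clouds into grids can be sketched as follows; this is a simple binary occupancy quantization, and the exact procedure used in the experiments may differ.

```python
import torch

def voxelize(points: torch.Tensor, resolution: int) -> torch.Tensor:
    """Quantize a point cloud (N, 3) into a binary occupancy grid."""
    pts = points - points.min(dim=0).values
    pts = pts / pts.max().clamp(min=1e-8)      # normalize into [0, 1]
    idx = (pts * (resolution - 1)).long()      # per-point voxel indices
    grid = torch.zeros(resolution, resolution, resolution)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```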
Finally, in Fig. 11 (left), we present the baseline inference times, assuming discrete point clouds are available at test time, and compare them with those of our nf2vec-based classification pipeline.
Time required to classify NFs encoding point clouds.
Point Cloud Part Segmentation: The classification and retrieval tasks explore the potential of utilizing nf2vec embeddings as holistic representations of 3D shapes. Here, we investigate whether these embeddings can also support tasks requiring dense predictions, such as part segmentation, whose goal is to assign a semantic label to each point of a cloud.
Shape Generation: So far, we have validated that NFs can be used as input to standard deep learning machinery thanks to nf2vec. We now investigate whether NFs can also be obtained as the output of a deep learning pipeline, i.e., whether novel shapes can be generated by training a generative model on the embeddings produced by nf2vec.
Learning a Mapping Between nf2vec Embedding Spaces: Finally, we exploit the possibility of learning a mapping between different nf2vec latent spaces, which enables tasks such as shape completion, e.g., mapping the embedding of the NF of an incomplete shape into the embedding of the NF of the complete one, and surface reconstruction, e.g., mapping embeddings of NFs learned from point clouds into embeddings of NFs representing meshes.
Deep Learning on NeRFs
In this section, our focus shifts to processing NFs encoding both the geometry and the appearance of objects, i.e., NeRFs. The goal is to illustrate the efficacy of nf2vec in performing deep learning tasks directly on NeRFs.
General Settings: In all experiments detailed within this section, we learn NeRFs from images using an MLP comprising three hidden layers with 64 nodes each. We utilize the ReLU activation function between all layers except the final layer, which computes the density and RGB values without any activation function. NeRFs take as input the frequency encoding of the 3D coordinates as in [11]. NeRFs are trained using an MSE loss between rendered and ground-truth pixel intensities.
NeRF Retrieval: We first investigate the quality of nf2vec embeddings of NeRFs by means of the retrieval task: given the embedding of a query NeRF, we find its k-nearest neighbors in the latent space.
We also implement two baseline approaches: the single-view and multi-view baselines. Both strategies rely on a ResNet50 [58] backbone pre-trained on ImageNet [59]. We extract feature vectors with ResNet50 from each image. Given a single image for the single-view baseline or 9 images for the multi-view baseline, we find the k-nearest neighbors in the ResNet50 feature space. We compare the classes of query and retrieved objects and compute the mAP for different values of k.
NeRF retrieval qualitative results. Given the embedding of a query NeRF (leftmost column), we show the objects associated with the closest embeddings in the latent space.
NeRF Classification: This section investigates the task of predicting the category of an object represented by a NeRF. In this scenario, only NeRFs would be available as input data.
Our approach processes nf2vec embeddings of NeRFs with a simple fully connected classifier.
As the discrete representations used to learn NeRFs are a set of images depicting the same object, selecting a proper baseline is not straightforward. In our experiment, we choose ResNet50 [58] as the baseline classifier. The network predicts the class for a given input image. Given this architecture, akin to the retrieval experiment, we propose two types of baseline approaches, single-view and multi-view. In the former, we train the network on a single rendering for each NeRF obtained from the same fixed pose, while for the latter, we employ 9 renderings for each NeRF from different viewpoints. At test time, regarding the single-view approach, we test the network on images rendered from unseen NeRFs employing the same training pose. Concerning the multi-view baseline, we render 9 images from the training viewpoints for each unseen NeRF, obtaining 9 distinct predictions per object, which we aggregate by majority voting, i.e., by selecting the most frequently predicted class.
We report the accuracy results on ShapeNetRender [60] in Table VI. Moreover, we also report the time required to classify NeRFs in Table VII, highlighting the impact of each of the main pipeline steps, such as the computation of embeddings with the nf2vec encoder and the classification of the resulting embeddings.
NeRF Generation: We experiment here with the task of generating novel NeRFs: as done for 3D shapes, we train a generative model on the embeddings produced by nf2vec and render images from the generated embeddings through the nf2vec decoder.
Learning a Mapping Between Embedding Spaces: We explore here whether it is possible to learn a transfer network that maps nf2vec embeddings of NeRFs into nf2vec embeddings of UDFs, i.e., whether the 3D geometry of an object can be recovered from the embedding of its NeRF.
We run this experiment on ShapeNetRender [60], using the rendered images to learn NeRFs and the corresponding 3D models to learn UDF neural fields. Then, we train nf2vec separately on the two sets of NFs and learn a transfer network that maps NeRF embeddings into the corresponding UDF embeddings.
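A minimal sketch of such a transfer network, assuming a plain MLP trained with a regression loss between mapped and target embeddings; both the architecture and the loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

transfer = nn.Sequential(      # maps NeRF embeddings to UDF embeddings
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024),
)
# z_nerf, z_udf: paired embeddings of the same object produced by the two
# nf2vec models (frozen); the transfer network is the only trained module.
loss = F.mse_loss(transfer(z_nerf), z_udf)
```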
Comparison With Recent Approaches
As outlined in Section I, several contemporary works addressing the problem of processing neural fields have been proposed recently. For all methods, the goal is to perform deep learning tasks such as classification using as input data a NF, i.e., data represented with continuous functions. We can divide these methods into two categories: those relying on a shared network and those focusing on individual NFs. In the former case, referred to here as Shared, the NF is defined as a shared network trained on all training samples, plus a distinct vector representing each object. Typically, this vector is processed to perform downstream tasks. This is the case of Functa [15] and DeepSDF [5]. In the latter case, denoted as Individual, the NF is typically an MLP trained on a single object or scene. In this scenario, the MLP weights are processed directly to perform the downstream tasks. This is the case of our framework, nf2vec, as well as of DWSNet [24], NFN [42], and NFT [23].
In this section, we investigate the characteristics of each category of techniques, showing that Shared frameworks are problematic, as they cannot reconstruct the underlying signal with high fidelity and need a whole dataset to learn the neural field of an object. Moreover, we build the first benchmark of NF classification by comparing recent approaches in this area.
Representation Quality: We first investigate the representation quality of Shared approaches compared to Individual ones. Specifically, we compare the reconstructions of explicit meshes from SDF neural fields with ground truth meshes on the Manifold40 test set. We report the quantitative comparisons in Table VIII, using two metrics: the Chamfer Distance (CD) as defined in [49], and the F-Score as defined in [62]. We note that we use the SIREN MLP described in Section V to represent SDF with Individual frameworks. In the first two rows, we note that Shared methods achieve poor reconstruction performance. Indeed, we believe that representing a whole dataset with a shared network is a difficult learning task, and the network struggles to accurately fit all the samples. Individual methods instead do not suffer from this problem and achieve very good reconstruction performance. Moreover, we believe that the approaches based on Shared networks struggle to represent unseen samples the further they are from the training distribution. Hence, in the foreseen scenario where NFs become a standard representation for 3D data hosted in public repositories, relying on a single shared network may imply the need to frequently retrain the model upon uploading new samples, which, in turn, would change the embeddings of all the previously stored data. On the contrary, adding a new object to the repository would not cause any issue with individual NFs, where one learns a network for each data point. Finally, we also provide a qualitative perspective of the aforementioned problem in Figs. 22 and 23. The visualizations confirm the results of Table VIII, with shared network frameworks struggling to properly represent the ground-truth shapes, while individual NFs enable high-fidelity reconstructions. We note that the quality of our DeepSDF reconstructions, where a single model is trained on the whole dataset, is inferior to the one reported in [5], where instead a different auto-decoder is trained for each class. This approach is not applicable in our case, as it would require knowing in advance the class label of each shape in order to choose the right auto-decoder.
We believe that these results highlight that frameworks based on a single shared network cannot be used as a medium to represent objects as NFs, because of their limited representation power when dealing with large and varied datasets and because of their difficulty in representing new shapes not available at training time.
Classification Accuracy: We compare recent methods in the NFs classification task. The goal is to predict the category of objects represented within the input NFs without recreating the discrete signals. Specifically, we test all methods on UDF obtained from point clouds of ModelNet40 [47] and on SDF learned from meshes of Manifold40 [3]. We compare nf2vec against the Shared frameworks Functa [15] and DeepSDF [5], and the Individual frameworks DWSNet [24], NFN [42], and NFT [23].
As we can see from the results reported in Table IX, Functa and nf2vec achieve the best accuracy, performing on par with each other.
Using the Same Initialization for NFs
The need to align the multitude of NFs that can approximate a given shape is a challenging research problem that has to be dealt with when using NFs as input data. We empirically found that fixing the weights initialization to a shared random vector across NFs is a viable and simple solution to this problem.
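In practice, this amounts to drawing the initialization once and reusing it for every NF before fitting, e.g. (a toy sketch; architecture and counts are illustrative):

```python
import copy
import torch.nn as nn

def make_nf() -> nn.Module:
    return nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))  # toy NF

template = make_nf()  # random initialization drawn once
nfs = [copy.deepcopy(template) for _ in range(100)]  # all NFs share the init
# each nfs[i] is then fit independently to its own shape
```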
We report here an experiment to assess whether the order of data or other sources of randomness arising while fitting NFs affect the repeatability of the embeddings computed by nf2vec: we fit multiple NFs on the same shapes, keeping the shared initialization fixed while varying all other sources of randomness, and measure the L2 distances between the resulting embeddings.
L2 distances between nf2vec embeddings of NFs fit on the same shapes under different sources of randomness.
Seeking a proof with a stronger theoretical foundation, we turn our attention to the recent work Git Re-Basin [63], whose authors show that the loss landscape of neural networks contains (nearly) a single basin after accounting for all possible permutation symmetries of hidden units. The intuition behind this finding is that, given two neural networks trained with equivalent architectures but different random initializations, data orders, and potentially different hyperparameters or datasets, it is possible to find a permutation of one network's weights such that, when linearly interpolating between their weights, all intermediate models enjoy performance similar to the two endpoints, a phenomenon denoted as linear mode connectivity.
Intrigued by this finding, we conducted a study to assess whether initializing NFs with the same random vector, which we found to be key to the effectiveness of nf2vec, leads to linear mode connectivity between them. Specifically, we linearly interpolate between the weights of two NFs representing the same shape and evaluate the fitting loss over a fixed batch of points at each interpolation step.
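A sketch of this interpolation study follows; loss_fn is a hypothetical callable standing for the NF fitting loss evaluated on a fixed batch of query points.

```python
import copy
import torch

@torch.no_grad()
def loss_along_path(nf_a, nf_b, batch, loss_fn, steps: int = 11):
    """Evaluate the fitting loss while linearly interpolating NF weights."""
    sd_a, sd_b = nf_a.state_dict(), nf_b.state_dict()
    probe = copy.deepcopy(nf_a)
    losses = []
    for t in torch.linspace(0.0, 1.0, steps):
        probe.load_state_dict({k: (1 - t) * sd_a[k] + t * sd_b[k] for k in sd_a})
        losses.append(loss_fn(probe, batch).item())
    return losses
```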
The results of this experiment are reported for four different shapes in Fig. 25. It is possible to note that, as shown by the blue curves, when interpolating between NFs obtained from the same weights initialization, the loss value at each interpolation step is nearly identical to those of the boundary NFs. On the contrary, the red curves highlight how there is no linear mode connectivity at all between NFs obtained from different weights initializations.
Linear mode connectivity study. Each plot shows the variation of the loss function over the same batch of points when interpolating between two NFs representing the same shape. The red line describes the interpolation between NFs initialized differently, whereas the blue line shows the interpolation between NFs initialized with the same random vector.
[63] also proposes different algorithms to estimate the permutation needed to obtain linear mode connectivity between two networks. We applied the weight matching algorithm proposed in Section III-B of [63] to our NFs and inspected the resulting permutations. Remarkably, when applied to NFs obtained from the same weights initialization, the retrieved permutations are identity matrices, both when the target NFs represent the same shape and when they represent different ones. Instead, the permutations computed for NFs obtained from different initializations are far from identity matrices.
All these results favor the hypothesis that our technique of initializing NFs with the same random vector leads to linear mode connectivity between different NFs. We believe that the possibility of performing meaningful linear interpolation between the weights occupying the same positions across different NFs can be interpreted by considering corresponding weights as carrying out the same role in terms of feature detection units, explaining why the nf2vec encoder can extract meaningful information directly from NF weights.
The experiments in this section were conducted on NFs with sine and ReLU activation functions, as those are the activations used throughout this paper. To further validate the applicability of our method to SIRENs and ReLU NFs, we show in Table X the comparable results obtained by classifying nf2vec embeddings of NFs based on either activation function.
Limitations
We point out three main limitations of our approach: i) although NFs capture continuous geometric cues, in some cases deep learning on NF embeddings still trails the accuracy of specialized architectures operating on discrete representations; ii) our framework requires all input NFs to share the same architecture and the same weights initialization (Section VIII); iii) fitting an individual NF for each sample introduces a non-negligible preprocessing cost.
Concluding Remarks
We have shown that it is possible to apply deep learning on individual NFs representing 3D shapes and object-centric radiance fields. Our approach leverages a task-agnostic encoder which embeds NFs into compact and meaningful latent codes without accessing the underlying function. We have shown that these embeddings can be fed to standard deep-learning machinery to solve various tasks effectively. Moreover, we have introduced the first benchmark for the task of NF classification, showing that our proposal obtains the best score (on par with Functa [15]) while preserving the ability to reconstruct the input dataset with high quality.
In the future, we plan to go beyond NFs of 3D objects by applying nf2vec to NFs representing different kinds of signals, such as entire scenes or other modalities.
We reckon that our work may foster the adoption of NFs as a unified 3D representation, overcoming the current fragmentation of 3D structures and processing architectures.