I. Introduction
Modeling 3D scenes from multi-view images drives advances in robotic perception and navigation, enabling tasks such as inspection, path planning, and obstacle avoidance [1], [2], [3], [4]. These tasks demand accurate, high-resolution 3D representations to perform reliably [2]. Recent methods such as Neural Radiance Fields (NeRF) [5] excel at object-centric scene reconstruction and novel view synthesis, producing photorealistic renderings. Extensions of NeRF to large-scale scenes [4], [6] are particularly relevant for the expansive environments encountered in robotic applications. However, the heavy computational cost of these methods, such as the long rendering times of Mip-NeRF 360 [6], limits their practical use in robotics, where real-time processing is essential. To reduce this cost, several methods employ auxiliary explicit voxel grids to encode local features [7], [8], [9]; the sketch below illustrates the idea. While these approaches lower computation, they often compromise visual quality, which is critical for precise robotic tasks.
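As a rough illustration of why explicit grids are cheaper, the following sketch replaces a per-sample MLP evaluation with a trilinear feature lookup from a dense voxel grid. This is a minimal, generic NumPy example under stated assumptions, not the implementation of any cited method; the grid resolution R, feature width C, and the helper interp_features are illustrative choices.

```python
import numpy as np

# Hypothetical grid: R^3 voxels, each storing a learned C-dim feature vector.
R, C = 128, 16
grid = np.random.rand(R, R, R, C)  # in practice these features are optimized

def interp_features(pts):
    """Trilinearly interpolate grid features at points in [0, 1)^3, shape (N, 3)."""
    x = pts * (R - 1)                 # continuous voxel coordinates
    i0 = np.floor(x).astype(int)      # lower corner indices, shape (N, 3)
    i1 = np.minimum(i0 + 1, R - 1)    # upper corner indices, clamped to grid
    f = x - i0                        # fractional offsets in [0, 1), shape (N, 3)
    out = 0.0
    # Accumulate the 8 corner contributions, weighted by trilinear coefficients.
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                ix = i1[:, 0] if dx else i0[:, 0]
                iy = i1[:, 1] if dy else i0[:, 1]
                iz = i1[:, 2] if dz else i0[:, 2]
                w = ((f[:, 0] if dx else 1 - f[:, 0])
                     * (f[:, 1] if dy else 1 - f[:, 1])
                     * (f[:, 2] if dz else 1 - f[:, 2]))
                out = out + w[:, None] * grid[ix, iy, iz]
    return out                        # interpolated features, shape (N, C)

pts = np.random.rand(1024, 3)         # e.g., samples along camera rays
feats = interp_features(pts)          # cheap lookup instead of a deep MLP query
```

The per-sample cost here is eight array reads and a handful of multiplies, versus many dense matrix products for an MLP, which is the trade-off the cited grid-based methods exploit at the expense of some visual fidelity.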