1. Introduction
Synthesizing high-quality 3D content is an essential yet highly demanding task for numerous applications, including gaming, filmmaking, robotic simulation, autonomous driving, and upcoming VR/AR scenarios. With an increasing number of 3D content datasets, the computer vision and graphics community has witnessed soaring research interest in the field of 3D geometry generation [2], [12], [36], [38], [40], [60], [68], [73]. Despite remarkable success in 3D geometry modeling, generating object appearance, i.e., textures, remains bottlenecked by laborious human effort. Texture creation typically requires a substantial amount of time for design and adjustment, as well as extensive 3D modeling expertise with tools such as Blender. As a result, automatically designing and augmenting textures has not yet been fully industrialized due to the heavy demand for human expertise and the associated financial cost.
We introduce SceneTex, a text-driven texture synthesis architecture for 3D indoor scenes. Given scene geometries and text prompts as input, SceneTex generates high-quality and style-consistent textures via depth-to-image diffusion priors.
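To make the role of a depth-to-image diffusion prior concrete, the sketch below shows how a depth-conditioned diffusion model can turn a depth map rendered from scene geometry into a prompt-driven image of that view. It is a minimal illustration using the Hugging Face diffusers ControlNet pipeline; the model identifiers, file paths, and prompt are assumptions for illustration, and it does not reproduce SceneTex's actual texture-optimization pipeline.

```python
# Minimal sketch of a depth-to-image diffusion prior (assumed model IDs and
# depth-map path; NOT the full SceneTex texture-synthesis pipeline).
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth map rendered from the scene geometry for one camera view (assumed path).
depth = Image.open("renders/living_room_depth_view0.png").convert("RGB")

# Depth-conditioned ControlNet on top of Stable Diffusion.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The text prompt drives the appearance, while the depth map keeps the
# generated view aligned with the underlying scene geometry.
image = pipe(
    "a cozy Scandinavian living room, warm wooden furniture",
    image=depth,
    num_inference_steps=30,
).images[0]
image.save("stylized_view0.png")
```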