1. Introduction
Novel view synthesis has recently seen impressive progress, driven by neural networks that learn representations well suited to view synthesis tasks. Most prior approaches in this domain assume that the scene is static or that it is observed from multiple synchronized input views. However, these restrictions are violated by most videos shared on the Internet today, which frequently feature scenes with diverse dynamic content (e.g., humans, animals, vehicles) recorded by a single camera.