1. Introduction
Motion planning in dynamic environments requires forecasting how the scene will imminently evolve. What representation should we forecast to support planning? In practice, standard autonomy stacks forecast a semantic, object-centric representation by building perceptual modules such as object detection, tracking, and prediction [42]. However, in the context of machine learning, training these modules comes at an enormous annotation cost, requiring massive amounts of data to be manually annotated with object labels, including both 3D trajectories and semantic categories (e.g., cars, pedestrians, bicyclists, etc.). With autonomous fleets gathering petabytes of data, it is impossible to label data at a rate that keeps up with the rate of data collection.