1. Introduction
Visual grounding enables machines to localize the visual region referred to by a natural language description. The task has received wide attention in terms of both datasets [54], [31], [19] and methods [16], [46], [53], [50]. However, most previous visual grounding studies focus on images [54], [31], [19] and videos [57], [38], [51], which are 2D projections of inherently 3D visual scenes. The recently proposed 3D visual grounding task [1], [4] aims to localize the region of a 3D scene, in the form of a 3D bounding box, that is referred to by a natural language description. 3D visual grounding has various applications, including autonomous agents [40], [47], human-machine interaction in augmented/mixed reality [20], [22], and intelligent vehicles [29], [12], among others.