CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes | IEEE Conference Publication