Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision | IEEE Conference Publication | IEEE Xplore