1. Introduction
Computer vision involves a host of tasks, such as boundary detection, semantic segmentation, surface estimation, object detection, image classification, to name a few. While Convolutional Neural Networks (CNNs) [32] have been shown to be successful at effectively handling most vision tasks, in the current literature most works focus on individual tasks and devote all of a CNN's power to maximizing task-specific performance. In our understanding a joint treatment of multiple problems can result not only in simpler and faster models, but will also be a catalyst for reaching out to other fields. One can expect that such all-in-one, “swiss knife” architectures will become indispensable for general AI, involving, for instance, robots that will be able to recognize the scene they are in, identify objects, navigate towards them, and manipulate them.