Understanding Road Layout From Videos as a Whole


Abstract:

In this paper, we address the problem of inferring the layout of complex road scenes from video sequences. To this end, we formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently. In contrast to prior work, we exploit the following three novel aspects: leveraging camera motions in videos, including context cues and incorporating long-term video information. Specifically, we introduce a model that aims to enforce prediction consistency in videos. Our model consists of one LSTM and one Feature Transform Module (FTM). The former implicitly incorporates the consistency constraint with its hidden states, and the latter explicitly takes the camera motion into consideration when aggregating information along videos. Moreover, we propose to incorporate context information by introducing road participants, e.g. objects, into our model. When the entire video sequence is available, our model is also able to encode both local and global cues, e.g. information from both past and future frames. Experiments on two data sets show that: (1) Incorporating either global or contextual cues improves the prediction accuracy and leveraging both gives the best performance. (2) Introducing the LSTM and FTM modules improves the prediction consistency in videos. (3) The proposed method outperforms the SOTA by a large margin.
Date of Conference: 13-19 June 2020
Date Added to IEEE Xplore: 05 August 2020
Conference Location: Seattle, WA, USA

1. Introduction

Understanding 3D properties of road scenes from single or multiple images is an important and challenging task. Semantic segmentation [2], [3], [41], (monocular) depth estimation [9], [16], [40] and road layout estimation [10], [25], [28], [38] are some well-explored directions for single image 3D scene understanding. Compared to image-based inputs, videos provide the opportunity to exploit more cues such as temporal coherence, dynamics and context [19], yet 3D scene understanding in videos is relatively under-explored, especially with long-term inputs. This work takes a step towards 3D road scene understanding in videos through a holistic consideration of local, global and consistency cues.
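To make the idea of motion-aware temporal aggregation concrete, here is a minimal sketch, not the paper's actual Feature Transform Module: it assumes top-view features live on a regular grid, that camera ego-motion reduces to an integer cell shift, and that fusion is a simple convex blend. All function names and the blending rule are illustrative assumptions.

```python
import numpy as np

def warp_topview(feat, dx, dy):
    """Shift a top-view (bird's-eye) feature map by the camera's ego-motion.

    feat: (H, W) array of top-view features from the previous frame.
    dx, dy: integer cell offsets derived from the camera motion.
    Cells that move out of the field of view are zero-filled.
    """
    h, w = feat.shape
    out = np.zeros_like(feat)
    dst_ys = slice(max(0, dy), min(h, h + dy))
    dst_xs = slice(max(0, dx), min(w, w + dx))
    src_ys = slice(max(0, -dy), min(h, h - dy))
    src_xs = slice(max(0, -dx), min(w, w - dx))
    out[dst_ys, dst_xs] = feat[src_ys, src_xs]
    return out

def fuse(hidden, frame_feat, alpha=0.5):
    """Blend the motion-compensated hidden state with the current frame's
    features (a stand-in for the learned recurrent update)."""
    return alpha * hidden + (1.0 - alpha) * frame_feat

# Toy sequence: the camera moves one grid cell forward per frame (dy = 1),
# so past evidence is shifted before being fused with the new observation.
hidden = np.zeros((4, 4))
frames = [np.eye(4) for _ in range(3)]
for f in frames:
    hidden = fuse(warp_topview(hidden, dx=0, dy=1), f)
```

The point of the warp step is that, unlike a plain LSTM, evidence from past frames is spatially aligned to the current camera pose before aggregation, so a road attribute observed earlier reinforces the prediction at its new top-view location rather than a stale one.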

Figure: Given perspective images (top left) that capture a 3D scene, our goal is to predict the layout of complex driving scenes in top-view both accurately and coherently.

