
RGB-D camera pose estimation using deep neural network


Abstract:

This paper presents a study of RGB-D camera pose estimation using deep learning techniques. The proposed network architecture is composed of two components: a convolutional neural network (CNN) for exploiting the visual information, and a Long Short-Term Memory (LSTM) block for incorporating the temporal information. The CNN, more precisely an RGB-D variant of GoogLeNet, functions as a feature-oriented camera pose estimator, while the LSTM works as a temporal filter that models the pose transition. A modified loss function is also proposed to help regulate the convergence of the pose parameters. Experimental results show that the combination of CNN and LSTM achieves higher pose estimation accuracy, while the pipeline structure defined in the network also provides flexibility for handling different scenarios.
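The abstract describes a two-stage pipeline: a CNN regresses a per-frame representation from the RGB-D input, and an LSTM filters those representations over time before the pose is read out. The sketch below illustrates only that structure; the small generic encoder, the 4-channel input handling, the translation-plus-quaternion output, and all layer sizes are assumptions standing in for the paper's RGB-D GoogLeNet variant, and the modified loss function is not reproduced here.

import torch
import torch.nn as nn

class CNNLSTMPoseNet(nn.Module):
    """Sketch of a CNN + LSTM camera pose estimator.

    Assumptions (not taken from the paper): a small generic CNN stands in
    for the RGB-D GoogLeNet variant, and the pose is parameterized as a
    3-D translation plus a unit quaternion.
    """
    def __init__(self, feat_dim=256, hidden_dim=128):
        super().__init__()
        # CNN encoder over 4-channel RGB-D frames (feature-oriented estimator).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # LSTM acts as a temporal filter over the per-frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Regress translation (3) and rotation as a quaternion (4).
        self.fc_trans = nn.Linear(hidden_dim, 3)
        self.fc_rot = nn.Linear(hidden_dim, 4)

    def forward(self, frames):
        # frames: (batch, seq_len, 4, H, W) RGB-D clips.
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.lstm(feats)
        trans = self.fc_trans(hidden)
        quat = nn.functional.normalize(self.fc_rot(hidden), dim=-1)
        return trans, quat

if __name__ == "__main__":
    model = CNNLSTMPoseNet()
    clip = torch.randn(2, 5, 4, 224, 224)  # 2 sequences of 5 RGB-D frames
    t, q = model(clip)
    print(t.shape, q.shape)  # torch.Size([2, 5, 3]) torch.Size([2, 5, 4])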
Date of Conference: 14-16 November 2017
Date Added to IEEE Xplore: 08 March 2018
Conference Location: Montreal, QC, Canada

1. Introduction

As witnessed in recent years, deep learning has established itself as a well-generalized foundation for solving a variety of machine intelligence problems, especially in the domains of object recognition, activity understanding, and natural language processing. Although these systems serve different purposes, they share a common notion: a complex high-level recognition problem can be approximated by a deep neural network, typically implemented with convolutional neural networks (CNNs) or recurrent neural networks (RNNs). In terms of CNN structure, one of the most widely adopted examples is the Inception module defined in [1], also known as GoogLeNet. By cascading these modules into a pipeline, the network achieves a fairly high recognition rate. Earlier work can be traced back to [2], in which the AlexNet structure was defined, while more recent studies such as the VGG network [3] and ResNet [4] have also drawn considerable attention from the community. In fact, by investigating the interaction between the data and the network layers, one can infer that the convolution and pooling layers essentially act as feature extractors, while the dense layers, a.k.a. fully connected layers, synthesize the convolutional features into a final decision. As such, typical applications of CNNs focus mainly on static scene analysis, for example still-image analysis. In contrast, the natural structure of RNNs defines the routing of information flow over time, which is better suited to temporal processing systems. In recent studies, much effort has been made to introduce RNNs into time-sensitive processing systems. For example, the Long Short-Term Memory (LSTM), a special kind of RNN, has been applied to the problem of human action recognition [5], [6]. Briefly speaking, the CNN excels at extensive feature perception, while the LSTM is better suited to sequential information analysis.
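As a concrete illustration of the Inception idea cited from [1], the sketch below builds one module from parallel 1x1, 3x3, and 5x5 convolution branches plus a pooled branch, concatenated along the channel dimension. The branch widths are illustrative choices for this sketch, not GoogLeNet's actual configuration.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Minimal sketch of an Inception-style module: parallel branches with
    different receptive fields, concatenated along the channel axis.
    Branch widths here are illustrative, not those used in GoogLeNet."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        # 1x1 branch.
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        # 1x1 reduction followed by 3x3 convolution.
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU())
        # 1x1 reduction followed by 5x5 convolution.
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU())
        # Max pooling followed by a 1x1 projection.
        self.bp = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        # All branches preserve spatial size, so outputs concatenate cleanly.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 64, 28, 28)
y = InceptionModule(64, 32, 16, 48, 8, 16, 16)(x)
print(y.shape)  # torch.Size([1, 112, 28, 28])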

