What to Learn: Features, Image Transformations, or Both?


Abstract:

Long-term visual localization is an essential problem in robotics and computer vision, but remains challenging due to the environmental appearance changes caused by lighting and seasons. While many existing works have attempted to solve it by directly learning invariant sparse keypoints and descriptors to match scenes, these approaches still struggle with adverse appearance changes. Recent developments in image transformations such as neural style transfer have emerged as an alternative to address such appearance gaps. In this work, we propose to combine an image transformation network and a feature-learning network to improve long-term localization performance. Given night-to-day image pairs, the image transformation network transforms the night images into day-like conditions prior to feature matching; the feature network learns to detect keypoint locations with their associated descriptor values, which can be passed to a classical pose estimator to compute the relative poses. We conducted various experiments to examine the effectiveness of combining style transfer and feature learning and its training strategy, showing that such a combination greatly improves long-term localization performance.
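The following is a minimal PyTorch sketch of the two-network pipeline described above. The module names (TransformNet, FeatureNet), their layer choices, and the tensor sizes are placeholder assumptions for illustration only, not the architectures used in the paper; descriptor matching and the classical pose solver are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformNet(nn.Module):
    # Placeholder night-to-day image transformation network (e.g., a
    # style-transfer generator); the real architecture is an assumption.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

class FeatureNet(nn.Module):
    # Placeholder feature network producing a dense keypoint-score map and
    # per-pixel descriptors, as sketched in the abstract.
    def __init__(self, desc_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.score_head = nn.Conv2d(32, 1, 1)        # keypoint detection scores
        self.desc_head = nn.Conv2d(32, desc_dim, 1)  # descriptors

    def forward(self, x):
        f = self.backbone(x)
        scores = torch.sigmoid(self.score_head(f))
        descs = F.normalize(self.desc_head(f), dim=1)
        return scores, descs

# A night query image is first mapped to a day-like image, then both images
# are passed through the same feature network; descriptor matching and a
# classical relative-pose estimator (not shown) would follow.
night = torch.rand(1, 3, 192, 256)
day_map = torch.rand(1, 3, 192, 256)

transform_net, feature_net = TransformNet(), FeatureNet()
day_like = transform_net(night)
scores_q, descs_q = feature_net(day_like)
scores_m, descs_m = feature_net(day_map)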
Date of Conference: 01-05 October 2023
Date Added to IEEE Xplore: 13 December 2023
Conference Location: Detroit, MI, USA


I. Introduction

The goal of long-term metric localization is to estimate the 6-DoF pose of a robot with respect to a visual map. However, long-term localization remains challenging under drastic appearance changes caused by illumination variations, such as in day-night scenarios. Traditional point-based localization approaches find correspondences between local features extracted from images using hand-crafted descriptors (e.g., SIFT [1], SURF [2], ORB [3]), and then recover the full 6-DoF camera pose. Such hand-crafted features, however, are not robust under extreme appearance changes due to low repeatability. To address this, experience-based visual navigation methods [4]–[6] propose storing intermediate experiences to achieve long-term localization. For instance, Multi-Experience Visual Teach & Repeat [5], [6] retrieves the most relevant experiences to perform SURF feature matching during a more challenging repeat, bridging the appearance gap and localizing against the initially taught path.
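As a concrete illustration of this classical point-based pipeline, the sketch below uses OpenCV's ORB detector, brute-force Hamming matching, and essential-matrix-based relative pose recovery; the image file names and camera intrinsics K are placeholder assumptions.

import cv2
import numpy as np

# Placeholder pinhole intrinsics; in practice these come from calibration.
K = np.array([[718.0, 0.0, 320.0],
              [0.0, 718.0, 240.0],
              [0.0, 0.0, 1.0]])

img_map = cv2.imread("map_frame.png", cv2.IMREAD_GRAYSCALE)
img_query = cv2.imread("query_frame.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute binary descriptors in both images.
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img_map, None)
kp2, des2 = orb.detectAndCompute(img_query, None)

# Brute-force Hamming matching with cross-check for ORB's binary descriptors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Essential matrix with RANSAC, then relative rotation R and
# unit-scale translation t between the two views.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                  prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
print("Rotation:\n", R, "\nTranslation (up to scale):\n", t.ravel())

Cross-check matching and RANSAC outlier rejection help with spurious correspondences, but they cannot compensate when keypoint repeatability itself collapses under day-night appearance change, which is the failure mode motivating the learned components above.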
