I. Introduction
Visual tracking of objects in a sequence of images is a fundamental step in many computer vision applications. In several of them, for example vision-based robot control [1], the final objective of visual tracking is not only the localization of the objects in the image but also the estimation of the camera displacement with respect to a reference frame. Most visual tracking algorithms focus on estimating the 2D projective transformation of the objects between the images. In this paper, we consider visual tracking methods that minimize a dissimilarity measure between a reference template and the current image using parametric models of the 2D projective transformation (see for example [2], [3], [4], [5], [6]). The camera displacement between the reference frame (attached to the reference template) and the current frame (attached to the current image) can then be extracted from the image transformation using the camera intrinsic parameters [7].

Our objective here is instead to track a piecewise-planar scene (whose model is known a priori) in the 2D image and, at the same time, to accurately estimate the 3D displacement of the camera. It is well known that an accurate measure of the displacement can be obtained by adding known constraints on the camera displacement and/or on the shape of the observed objects (see for example [8], [9], [10], [11]). These constraints are generally expressed in Euclidean space. For example, the displacement of a camera mounted on a mobile robot can be constrained to lie in a plane. Similarly, the shape of the observed objects may be partially known, and different regions of the same rigid object undergo the same 3D displacement even though they do not undergo the same image transformation. For example, when the scene is piecewise-planar, several planar patches can be independently extracted and tracked, which corresponds to estimating a projective homography for each plane. In the ideal case, the computed homographies provide a coherent camera displacement. In practice, this hardly happens unless we explicitly impose the constraint that all the planes are rigidly attached to each other: two poorly textured planar patches may yield contradictory displacements. In addition, it is generally difficult to transform this Euclidean information into simple constraints on the 2D image transformation.

To overcome these difficulties, the primary objective of this paper is to compute the explicit dependency between the 2D image transformation parameters and the 3D camera displacement parameters. This makes it easier to integrate Euclidean constraints into the visual tracking algorithm, and it gives a direct and more accurate estimate of the camera motion with respect to a reference frame attached to the observed scene.
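For a single plane, this dependency can be made explicit through the standard plane-induced homography relation (the notation below is ours and is given only for illustration):

\[
\mathbf{G} \propto \mathbf{K} \left( \mathbf{R} + \frac{\mathbf{t}\,\mathbf{n}^{*\top}}{d^{*}} \right) \mathbf{K}^{-1}
\]

where \(\mathbf{G}\) is the (3x3) projective homography warping the reference template into the current image, \(\mathbf{K}\) is the matrix of camera intrinsic parameters, \((\mathbf{R}, \mathbf{t})\) are the rotation and translation of the camera between the reference and current frames, and \(\mathbf{n}^{*}\) and \(d^{*}\) are the normal and distance of the plane expressed in the reference frame. Parameterizing the warp directly by \((\mathbf{R}, \mathbf{t})\) through such a relation couples all the planar patches of a rigid scene to a single camera displacement.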
Several papers have addressed the problem of template-based tracking under Euclidean constraints. In [11], the authors avoid the explicit computation of the Jacobian relating the variation of the 3D pose parameters to the variation of the appearance in the image by using implicit function derivatives. In that case, the minimization of the image error is performed with a coarse approximation of the inverse compositional algorithm [2], [12]. In [13], the authors extend the method proposed in [3] to a homographic warping. The method assumes that the true camera pose can be approximated by the current estimated pose (i.e. that the camera displacement is sufficiently small). In addition, the Euclidean constraints are not directly imposed during the tracking: once the homography has been estimated, the rotation and the translation of the camera are extracted from it [7]. In [10], the authors go one step further and extend the method proposed in [3] by including the constraint that a set of control points on a three-dimensional surface undergo the same camera displacement.

The aim of our work is to propose a visual tracking algorithm suitable for integration into vision-based robot control systems. This implies a real-time implementation and the ability to track objects under fast camera motions. In order to meet these constraints, we suppose that the structure of the scene is known, which limits our system to six unknowns (the translation and the rotation of the camera). Other methods consider both the motion and the structure as unknowns (see for example [14]); in that case, the number of unknowns is considerably higher, which reduces the real-time performance of the algorithm. Furthermore, even when some unknowns are fixed, these methods generally rely on first-order approximations (e.g. the linearization involved in the Extended Kalman Filter proposed in [14]).
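To make the two-step baseline discussed above concrete (first estimate a homography, then extract the camera displacement from it [7]), the following minimal Python sketch uses OpenCV's homography estimation and decomposition; the point correspondences and intrinsic matrix are illustrative assumptions, not data or code from any of the cited methods:

```python
# A minimal sketch of the two-step baseline: (1) estimate the projective
# homography between the reference template and the current image from
# point correspondences, (2) decompose it into candidate camera
# displacements using the intrinsic parameters.
# The correspondences and intrinsics below are illustrative assumptions.
import numpy as np
import cv2

K = np.array([[800.0,   0.0, 320.0],   # assumed intrinsic parameters
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Matched pixel coordinates (N >= 4) in the reference and current images.
pts_ref = np.array([[100, 100], [400, 120], [380, 300], [120, 280]], np.float32)
pts_cur = np.array([[110, 105], [415, 130], [390, 310], [125, 290]], np.float32)

# Step 1: 2D image transformation (projective homography) between frames.
H, _ = cv2.findHomography(pts_ref, pts_cur, cv2.RANSAC, 3.0)

# Step 2: extract candidate (R, t, n) triplets from H given K. Up to four
# solutions are returned; the translation is recovered only up to the
# unknown plane depth, and the physically valid solution must still be
# selected (e.g. the plane must lie in front of the camera).
n_sol, rotations, translations, normals = cv2.decomposeHomographyMat(H, K)
for R, t, n in zip(rotations, translations, normals):
    print("R =\n", R, "\nt =", t.ravel(), "\nn =", n.ravel())
```

Note that in this baseline the Euclidean rigidity of the scene enters only after the homography has been estimated, which is precisely the limitation that the explicit parameterization by the camera displacement discussed above is meant to avoid.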