Journals & Magazines >IEEE Transactions on Aerospac... >Volume: 60 Issue: 3

Bridging the Domain Gap in Satellite Pose Estimation: A Self-Training Approach Based on Geometrical Constraints

Abstract:

Unsupervised domain adaptation in satellite posed estimation aimed at alleviating the annotation cost for training deep models has been gaining attention. To this end, we...Show More

Metadata

Abstract:

Unsupervised domain adaptation in satellite posed estimation aimed at alleviating the annotation cost for training deep models has been gaining attention. To this end, we propose a self-training framework based on the domain-agnostic geometrical constraints. Specifically, we train a neural network to predict the 2-D keypoints of a satellite and then use perspective-n-point (PnP) to estimate the pose. The poses of target samples are regarded as latent variables to formulate the task as a minimization problem. Furthermore, we leverage fine-grained segmentation to tackle the information loss issue caused by abstracting the satellite as sparse keypoints. Finally, we iteratively solve the minimization problem in two steps: pseudolabel generation and network training. Experimental results show that our method adapts well to the target domain. Moreover, our method won the first place on the sunlamp task of the second international Satellite Pose Estimation Competition.

Published in: IEEE Transactions on Aerospace and Electronic Systems ( Volume: 60, Issue: 3, June 2024)

Page(s): 2500 - 2514

Date of Publication: 01 March 2023

ISSN Information:

DOI: 10.1109/TAES.2023.3250385

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Estimating the pose of an uncooperative spacecraft is crucial in numerous space missions, such as debris removal [1], on-orbit servicing [2], and assets refueling [3]. Over the last two decades, numerous relative navigation systems have been proposed based on LiDARs [4], [5] and cameras [6], [7], [8], [9], [10], [11]. Active sensors, including radars and LiDARs, require larger mass and higher power consumption compared to visual sensors. Moreover, a stereo system requires a large baseline and relies on robust feature matching to obtain depth information. Therefore, monocular vision-based navigation systems have gained increasing attention from the academical and industrial fields.

Traditionally, monocular satellite pose estimation approaches require hand-crafted feature extraction [6], [7], [8], [9], which limits their performance on challenging environments, such as occlusions, harsh lighting conditions, reflective materials, and complex structures. Recently, deep learning-based methods have achieved success in satellite pose estimation [11], thanks to the powerful feature representation ability of deep models. However, training neural networks requires large-scale datasets, while collecting real images of spacecrafts and annotating their 6-DoF poses are time-consuming, notoriously laborious, and difficult. Therefore, recent deep models are trained and evaluated using synthetic images [11], [12], [13], [14], [15]. Due to the inherent discrepancy between real and synthetic images, the deep models that are fully supervised by synthetic images usually show deteriorated performance when being deployed in real scenarios. This issue is revealed by the large gap between scores on the real and synthetic leaderboards.¹

In computer vision, unsupervised domain adaptation (UDA) [16], [17], [18], [19], [20], [21] is adopted in scenarios where the labels of real samples are scarce, by training deep models using labeled synthetic images and unlabeled real images. Motivated by these practices, Park et al. [22] created the next generation spacecraft pose estimation dataset (SPEED+) with focus on the synthetic-to-real domain gap in satellite pose estimation. Moreover, based on the SPEED+ dataset, the Advanced Concepts Team (ACT) of the European Space Agency (ESA) and the Space Rendezvous Laboratory (SLAB) at Stanford University coorganized the second international Satellite Pose Estimation Competition (SPEC2021)² to boost the research of bridging the domain gap.

Unlike common UDA tasks in computer vision, a calibrated camera is used to measure the pose of the same satellite under various environments in space missions. For UDA in satellite pose estimation in SPEC2021, only the environmental settings are different across domains while the satellite structure and the camera parameters are the same (as shown in Fig. 1). Therefore, given a pose, the locations of keypoints and the satellite masks are the same in different domains, which are referred to as the domain-agnostic geometrical constraints. Besides, due to intense imaging noise, challenging illumination variations, and diverse poses, this task poses additional challenges compared with common UDA problems [17], [18], [19].

Fig. 1.

Characteristic of the UDA task in satellite pose estimation. The satellite structures are the same in different domains with different illumination conditions. The samples come from three distinct domains, including synthetic (well illuminated), lightbox (low contrast, highly noisy), and sunlamp (specular reflection). Top: Original images from [22] with the 2-D keypoints and the reprojected wireframe model. Bottom: Fine-grained masks. Note that, the visualization results on the synthetic image is obtained using the ground-truth pose, while the poses used on the lightbox and sunlamp images are predicted by our network.

Show All

Several UDA approaches [20], [21], [23], [24], [25], [26] have been proposed to explore self-training paradigms to improve performance on the target domain. Nonetheless, these methods are not specifically designed for UDA in satellite pose estimation, since the domain-agnostic geometrical constraints are not fully explored. Meanwhile, previous satellite pose estimation approaches [12], [15] represent a satellite as a set of 2-D keypoints, and then estimate the satellite pose using the perspective-n-point (PnP) algorithms [27]. However, as shown in Fig. 1, sparse keypoints stand for only semantic parts of the satellite. Such sparse representation leads to a significant loss of information, which hampers knowledge transfer across domains.

To tackle the above problems, we formulate UDA of satellite pose estimation as a minimization problem under a self-training framework. First, we formulate the geometrical constraints as a projection function, which maps the predefined 3-D keypoints onto the source and target images using the same camera parameters. Based on the projection function, we propose a basic self-training framework by taking the poses of target samples as latent variables, which are jointly optimized with the network parameters. Second, we leverage fine-grained segmentation to extend the basic framework. Specifically, we enhance the geometrical constraints with a rendering function. Similar to the projection function, the rendering function maps the 3-D mesh of the satellite to fine-grained masks, using the same camera parameters in different domains. Therefore, we take fine-grained segmentation as an auxiliary task of keypoints regression. Furthermore, as the masks provide dense descriptions with structural information, we perform adversarial training by aligning the predicted masks of the source and target samples. Finally, we iteratively optimize the network parameters and generate pseudolabels to solve the minimization problem. Experimental results demonstrate the effectiveness of our framework. Moreover, our method won the first and the third place on two leaderboards of the second international Satellite Pose Estimation Competition,³ respectively.

Our contributions can be summarized as follows.

We explore the domain-agnostic geometrical constraints to propose a self-training framework for UDA in satellite pose estimation.
We leverage fine-grained masks to address the information loss problem caused by abstracting the satellite as sparse keypoints.
Our method significantly improves the accuracy of satellite pose estimation without using real annotations.

SECTION II.

Related Work

Object pose estimation aims at recovering the 3-D position and 3-D rotation of an object in the camera-centered coordinate system. Traditional approaches [28], [29] rely on local features, suffering from texture-less objects and background clutter. Recently, CNN-based methods have dominated most object pose estimation tasks. Numerous approaches [30], [31], [32], [33], [34], [35] have been proposed to estimate poses using putative 2-D-3-D correspondences and the PnP algorithms. To achieve better efficiency, several methods [36], [37], [38], [39] are introduced to directly regress poses from monocular images. Other methods [40], [41], [42] learn the latent representations of rotation and recover poses by exploring the image retrieval paradigms. Since these methods focus on household objects in indoor scenarios [43], [44], they face significant challenges caused by wide-range depth variations and illumination changes in outer space [45].

Satellite pose estimation is a special case of object pose estimation. Spacecraft pose network (SPN) [14] is the first deep learning-based approach for satellite pose estimation. Specifically, the 3-D rotation is recovered by discretizing the viewpoint spaces into bins, and then the 3-D translation is estimated using the geometrical constraints. In other top-performing approaches, the pose estimation problem is formulated as a task of localizing semantic keypoints on the convex areas of a satellite by taking various representations, such as heatmap [12], vector [13], and set [15]. These methods crop the satellite from the input images using a well-trained object detector to address scale variations. To achieve better efficiency, Hu et al. [45] handled the scale problem in a single-stage way by introducing a sampling strategy. However, these methods are trained and tested on synthetic data. They usually undergo significant performance degradation when being applied to real images due to the domain gap [46].

Unsupervised domain adaptation aims at addressing the domain mismatch problem. It is a promising direction to circumvent the laborious and time-consuming procedures of data annotation. Several UDA paradigms have been studied for different vision tasks. Adversarial learning aligns both domains with a discriminator. The alignment can be achieved at image-level [16], feature-level [17], [18], and output-level [19]. Self-training methods are introduced to utilize target samples to train the model by generating pseudolabels [23], [24], minimizing the entropy loss [20], [25], or employing the teacher-student framework [21], [26]. Nonetheless, it is challenging to adopt these approaches in UDA of satellite pose estimation, which is different to the common UDA tasks in computer vision.

SECTION III.

Method

In this section, we first introduce the UDA task of satellite pose estimation in Section III-A and the PnP-based solution to monocular satellite pose estimation in Section III-B. Then, we formulate the task as a minimization problem in a basic self-training framework in Section III-C, which is extended by leveraging fine-grained segmentation in Section III-D. We present the solution to the minimization problem in Section III-E. Finally, we give a mathematical proof of the geometrical constraints in Section III-F. The overview of our method is shown in Fig. 2.

$Fig. 2. - Overview of our self-training framework. The satellite is represented as a set of sparse keypoints and dense fine-grained masks. The source image $\mathbf {I}_{s}$ is first transformed into a target-like source image $\mathbf {I}^{\prime }_{s}$ using a well-trained CycleGAN [48]. The neural network consists of a backbone $\mathcal {G}_{f}$, a heatmap head $\mathcal {G}_{h}$, and a mask head $\mathcal {G}_{m}$. The optimization comprises two iterative steps. In the first step, pseudomask $\tilde{\mathbf {Y}}_{t}$ and pseudoheatmap $\tilde{\mathbf {H}}_{t}$ are fixed, and the network $\mathcal {G}_{f,h,g}$ are trained using multitask losses. In the second step, the parameters of the neural network are fixed, and the predicted mask $\hat{\mathbf {Y}}_{t}$ and heatmap $\hat{\mathbf {H}}_{t}$ are used to generate pseudolabels. (Best viewed in color.).$

Fig. 2.

Overview of our self-training framework. The satellite is represented as a set of sparse keypoints and dense fine-grained masks. The source image $\mathbf {I}_{s}$ is first transformed into a target-like source image $\mathbf {I}^{\prime }_{s}$ using a well-trained CycleGAN [48]. The neural network consists of a backbone $\mathcal {G}_{f}$, a heatmap head $\mathcal {G}_{h}$, and a mask head $\mathcal {G}_{m}$. The optimization comprises two iterative steps. In the first step, pseudomask $\tilde{\mathbf {Y}}_{t}$ and pseudoheatmap $\tilde{\mathbf {H}}_{t}$ are fixed, and the network $\mathcal {G}_{f,h,g}$ are trained using multitask losses. In the second step, the parameters of the neural network are fixed, and the predicted mask $\hat{\mathbf {Y}}_{t}$ and heatmap $\hat{\mathbf {H}}_{t}$ are used to generate pseudolabels. (Best viewed in color.).

Show All

A. Problem Formulation

In UDA of satellite pose estimation, the satellite structure and the camera parameters are the same in the source and target domains. The intrinsic matrix of the camera is denoted by $\mathbf {K}$. Then, we are given $N_{s}$ source images $\mathcal {X}_{s}=\lbrace \mathbf {I}_{s}^{i}\rbrace _{i=1}^{N_{s}}$ with 6-DoF pose annotations $\mathcal {T}_{s}=\lbrace [\mathbf {R}_{s}^{i}|\bm {t}_{s}^{i}] \rbrace _{i=1}^{N_{s}}$ and $N_{t}$ unlabeled target images $\mathcal {X}_{t}=\lbrace \mathbf {I}_{t}^{j} \rbrace _{j=1}^{N_{t}}$.

We assume that the poses in the source and target domains are sampled from the same distribution \begin{equation*} P(\mathbf {R}_{s}, \bm {t}_{s}) = P(\mathbf {R}_{t}, \bm {t}_{t}). \tag{1} \end{equation*} View SourceHowever, given a pose consisting of $\mathbf {R}$ and $\bm {t}$, the source and target images are sampled from different distributions \begin{equation*} P(\mathbf {I}_{s}|\mathbf {R}_{s}, \bm {t}_{s}) \ne P(\mathbf {I}_{t}|\mathbf {R}_{s}, \bm {t}_{s}). \tag{2} \end{equation*} View SourceIt reveals that the source and target samples are sampled from different joint distributions \begin{align*} &P(\mathbf {I}_{s}, \mathbf {R}_{s}, \bm {t}_{s}) = P(\mathbf {I}_{s}|\mathbf {R}_{s}, \bm {t}_{s}) P(\mathbf {R}_{s}, \bm {t}_{s}) \\ &\ne P(\mathbf {I}_{t}, \mathbf {R}_{t}, \bm {t}_{t}) = P(\mathbf {I}_{s}|\mathbf {R}_{t}, \bm {t}_{t}) P(\mathbf {R}_{t}, \bm {t}_{t}). \tag{3} \end{align*} View SourceEquation 3 shows that the independent and identically distributed (i.i.d.) assumption is violated, which is referred to as the domain shifts in satellite pose estimation. In this article, we aim to train a neural network $\mathcal {G}(\mathbf {I})\mapsto [\mathbf {R}|\bm {t}]$, that reduces the distribution shifts across domains.

B. PnP-Based Solution

As suggested by Kisantal et al. [11], the PnP-based methods significantly outperform direct regression in monocular satellite pose estimation. Therefore, we follow the PnP-based [12], [15] method to estimate the satellite poses.

We assume the texture-less 3-D mesh $\mathcal {M}$ of the satellite is given. Hence, we select $N_{p}$ 3-D landmarks $\mathcal {P} = \lbrace \bm {P}^{k}\rbrace _{k=1}^{N_{p}}$ on the mesh surface. Given the camera parameters and a pose, the 3-D landmarks are reprojected onto 2-D keypoints $\lbrace \bm {p}^{k}\rbrace _{k=1}^{N_{p}}$, which are represented as a heatmap $\mathbf {H}$. We construct a neural network consisting of a backbone $\mathcal {G}_{f}$ and a heatmap head $\mathcal {G}_{h}$ and train the network using source samples in a fully supervised way \begin{equation*} \min _{\mathcal {G}_{f,h}} \frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \mathcal {L}_{h}(\hat{\mathbf {H}}_{s}^{i}, \mathbf {H}_{s}^{i}) \tag{4} \end{equation*} View Sourcewhere $\hat{\mathbf {H}}_{s}^{i} = \mathcal {G}_{f}(\mathcal {G}_{h}(\mathbf {I}_{s}))$ is the predicted heatmap, $\mathbf {H}_{s}^{i}$ is the ground truth. For heatmap regression, $\mathcal {L}_{h}$ is adopted as the adaptive wing loss [47] with the default parameter setting. During the inference stage, the predicted heatmap $\hat{\mathbf {H}}$ is decoded into 2-D keypoints, which are used to build putative 2-D-3-D correspondences $\lbrace (\hat{\bm {p}}^{k}, \bm {P}^{k})\rbrace _{k=1}^{N_{p}}$. Finally, the pose is estimated by solving the PnP problem [15] \begin{equation*} \hat{\mathbf {R}}, \hat{\bm {t}} = \min _{\mathbf {R},\bm {t}} \sum _{k=1}^{N_{p}} \phi (\Vert \lambda _{k} \hat{\bm {p}}^{k}- \mathbf {K}(\mathbf {R}\bm {P}^{k} + \bm {t})\Vert _{2}) \tag{5} \end{equation*} View Sourcewhere $\lambda _{k}$ is the depth of landmark $\bm {P}^{k}$ and $\phi (\cdot)$ is the Huber loss.

Due to the domain shifts, the neural network that trained on the source domain usually suffers from performance degradation when being applied on target images. To tackle this issue, we train the network using labeled source samples and unlabeled target samples as described in Section III-C.

C. Basic Self-Training Framework

To fully exploit unlabeled target samples, we propose a self-training framework by leveraging the geometrical constraints. Specifically, we define the function that projects 3-D landmarks onto the 2-D heatmap as \begin{equation*} \mathbf {H}=\mathcal {F}(\mathbf {R}, \bm {t}, \mathcal {P}, \mathbf {K}) \tag{6} \end{equation*} View Sourcewhich provides the same geometrical constraints in the source and target domains, since the source and target samples share the same $\mathcal {P}$ and $\mathbf {K}$.

For source samples, we supervise the network using (4). The ground-truth heatmap $\mathbf {H}_{s}$ is obtained by projecting landmarks $\mathcal {P}$ using the function $\mathcal {F}$ with the pose annotation $[\mathbf {R}_{s}| \bm {t}_{s}]$. For unlabeled target samples, we take each target pose as a latent variable, which is sent to the function $\mathcal {F}$ to obtain the pseudoheatmap $\tilde{\mathbf {H}}_{t}$. Therefore, we can unify (4) and (5) in an objective function to exploit unlabeled target images. Specifically, the task is formulated as a minimization perform by simultaneously optimizing the network parameters and the target poses \begin{align*} &\min _{\mathcal {G}_{f,h}, \mathcal {T}_{t}} \frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \mathcal {L}_{h}(\hat{\mathbf {H}}_{s}^{i}, \mathbf {H}_{s}^{i}) + \frac{1}{N_{t}}\sum _{j=1}^{N_{t}} \mathcal {L}_{h}(\hat{\mathbf {H}}_{t}^{j}, \tilde{\mathbf {H}}_{t}^{j}) \\ &\begin{array}{lll}\rm {s.t.} \quad &\tilde{\mathbf {H}}_{t}^{j} = \mathcal {F}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {P}, \mathbf {K}) \\ &\mathcal {T}_{t}^{i} = [\mathbf {R}_{t}^{j}|\bm {t}_{t}^{j}] \in \mathbb {SE}_{3} \cup \lbrace \mathbf {0}\rbrace, \; j = 1, 2,\ldots, N_{t} \end{array}\tag{7} \end{align*} View Sourcewhere the zero set $\lbrace \mathbf {0}\rbrace$ is introduced to tackle the issues of less confident predictions on target samples, which are ignored during the network training by assigning their poses to $\mathbf {0}$. The criterion to select confident predictions is presented in Section III-E.

Equation 7 can be solved iteratively. During optimization, the network $\mathcal {G}_{f,h}$ is encouraged to predict accurate heatmaps on source and target images. Meanwhile, the target poses should converge to the ground-truth. However, the heatmap-based representation of sparse keypoints introduces severe information loss. Consequently, the optimization process is prone to converging at local optima, leading to deteriorated performance.

D. Extended Framework With Multitask Learning

To tackle the above problem, we note that the fine-grained masks in Fig. 1 provide rich dense descriptions with domain-agnostic structural context. Therefore, we apply segmentation as an auxiliary task of heatmap regression to improve pose estimation performance. In the following paragraphs, we first perform output-level alignment [19] using adversarial training, and then extend the minimization problem in (7) with the auxiliary task.

As shown in Fig. 2, we extend the basic network with a mask head $\mathcal {G}_{m}$ to predict the fine-grained mask $\hat{\mathbf {Y}} = \mathcal {G}_{m}(\mathcal {G}_{f}(\mathbf {I}))$ for each image. These masks contain domain-agnostic information, which can be used to align the source and target samples by adversarial training. Hence, we build a discriminator $\mathcal {G}_{d}$ to predict the domain label of each sample. The discriminator $\mathcal {G}_{d}$ receives the detached mask predictions from both domains and is trained using the BCE loss \begin{align*} \mathcal {L}_{d}(\hat{\mathbf {Y}}_{t}, \hat{\mathbf {Y}}_{s}) =& -\frac{1}{HW}\sum _{i=1}^{H}\sum _{j=1}^{W} \log (\mathcal {G}_{d}(\hat{\mathbf {Y}}_{s})(i, j, 1)) \\ &+\log (\mathcal {G}_{d}(\hat{\mathbf {Y}}_{t})(i, j, 0)) \tag{8} \end{align*} View Sourcewhere $\hat{\mathbf {Y}}_{s}$ and $\hat{\mathbf {Y}}_{t}$ are the predicted masks of the source and target samples, respectively. To bridge the domain gap, we expect the mask predicted by the network $\mathcal {G}_{f,m}$ to fool the discriminator. Therefore, we perform an adversarial loss on the predicted masks of the target images and backpropagate the gradient \begin{equation*} \mathcal {L}_{adv}(\hat{\mathbf {Y}}_{t}) = -\frac{1}{HW}\sum _{i=1}^{H}\sum _{j=1}^{W}\log (\mathcal {G}_{d}(\hat{\mathbf {Y}}_{t})(i, j, 1)). \tag{9} \end{equation*} View Source

Furthermore, we introduce a rendering function $\mathcal {R}$ to enhance the geometrical constraints. Given the pose $[\mathbf {R}|\bm {t}]$, the mask $\mathbf {Y}$ can be obtained through the rendering function \begin{equation*} \mathbf {Y} = \mathcal {R}(\mathbf {R}, \bm {t}, \mathcal {M}, \mathbf {K}). \tag{10} \end{equation*} View SourceSimilar to the projection function $\mathcal {F}$, the rendering function $\mathcal {R}$ also provides the same geometrical constraints on source and target samples. This is because the 3-D mesh of the satellite and the camera parameters are the same in different domains. For source samples, the network $\mathcal {G}_{f,m}$ is supervised using the ground-truth masks. For unlabeled target images, the pseudomasks are obtained by sending the latent variables in (7), i.e., the target poses, to the rendering function $\mathcal {R}$. Therefore, the minimization problem depicted in (7) is extended with the tasks of segmentation and adversarial training \begin{align*} &\min _{\mathcal {G}_{f,h,m}, \mathcal {T}_{t}} \max _{\mathcal {G}_{d}} \frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{s}^{i}, \mathbf {H}_{s}^{i}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{s}^{i}, \mathbf {Y}_{s}^{i})\right] \\ &+ \frac{1}{N_{t}}\sum _{j=1}^{N_{t}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{t}^{j}, \tilde{\mathbf {H}}_{t}^{j}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{t}^{j}, \tilde{\mathbf {Y}}_{t}^{j}) + \lambda _{a} \mathcal {L}_{adv}({\hat{\mathbf {Y}}_{t}^{j}}) \right] \\ \\ &\rm {s.t.}\quad \tilde{\mathbf {H}}_{t}^{j} = \mathcal {F}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {P}, \mathbf {K}),\; \tilde{\mathbf {Y}}_{t}^{j} = \mathcal {R}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {M}, \mathbf {K}) \\ &\qquad\;\, \mathcal {T}_{t}^{i} = [\mathbf {R}_{t}^{j}|\bm {t}_{t}^{j}] \in \mathbb {SE}_{3} \cup \lbrace \mathbf {0}\rbrace, \; j = 1, 2,\ldots, N_{t} \tag{11} \end{align*} View Sourcewhere $\lambda _{m}$ and $\lambda _{a}$ are the weights for the mask loss and the adversarial loss, respectively. $\mathcal {L}_{m}$ is the cross-entropy loss for segmentation. We refer to $\tilde{\mathbf {Y}}$ as the pseudomask. Again, by optimizing the loss in (11), the network $\mathcal {G}_{f,h,m}$ should perform well on the target domain, while the pseudolabels, including pseudoheatmaps, masks, and poses, should approximate the ground truth.

E. Iterative Optimization and Pseudolabel Generation

We observe that the variables in (11) can be divided into two classes: the network parameters and the poses of target samples. Following [23], we adopt iterative procedures to optimize (11).

Fix the pseudolabels (or initialize $\mathcal {T}_{t}$ as $\lbrace \mathbf {0}\rbrace$) and train the network $\mathcal {G}_{f,h,m,d}$.
Fix the network, optimize poses $\mathcal {T}_{t}$, and generate pseudolabels.

We take the combination of these two steps as one round and take several rounds to optimize (11).

In the first step, when the pseudoheatmaps and pseudomasks are fixed, the minimization problem in (7) is simplified as \begin{align*} &\min _{\mathcal {G}_{f,h,m}} \max _{\mathcal {G}_{d}} \frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{s}^{i}, \mathbf {H}_{s}^{i}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{s}^{i}, \mathbf {Y}_{s}^{i})\right] \\ &+ \frac{1}{N_{t}}\sum _{j=1}^{N_{t}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{t}^{j}, \tilde{\mathbf {H}}_{t}^{j}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{t}^{j}, \tilde{\mathbf {Y}}_{t}^{j}) + \lambda _{a} \mathcal {L}_{adv}({\hat{\mathbf {Y}}_{t}^{j}}) \right]. \tag{12} \end{align*} View SourceIt is equal to train the neural network using the labeled source samples and the pseudolabeled target samples. In the second step, when the network parameters are fixed, the minimization problem in (11) is simplified as \begin{align*} &\min _{\mathcal {T}_{t}} \frac{1}{N_{t}}\sum _{j=1}^{N_{t}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{t}^{j}, \tilde{\mathbf {H}}_{t}^{j}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{t}^{j}, \tilde{\mathbf {Y}}_{t}^{j}) \right] \\ &\begin{array}{rl}\rm {s.t.}\quad &\tilde{\mathbf {H}}_{t}^{j} = \mathcal {F}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {P}, \mathbf {K}),\; \tilde{\mathbf {Y}}_{t}^{j} = \mathcal {R}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {M}, \mathbf {K}) \\ &\mathcal {T}_{t}^{i} = [\mathbf {R}_{t}^{j}|\bm {t}_{t}^{j}] \in \mathbb {SE}_{3} \cup \lbrace \mathbf {0}\rbrace, \; j = 1, 2,\ldots, N_{t}. \\ \end{array}\tag{13} \end{align*} View SourceIt is equal to optimize the target poses to minimize the loss, by simultaneously aligning the heatmaps and masks generated using functions $\mathcal {F}$ and $\mathcal {R}$ with the ones predicted by the network. However, for the classification loss $\mathcal {L}_{m}$, there is a trivial solution by ignoring all pseudoheatmaps, i.e., $\mathcal {T}_{s}=\lbrace \mathbf {0}\rbrace$. Therefore, we leverage the predicted heatmaps to estimate poses and then generate pseudolabels.

Fig. 3 shows the pipeline to generate pseudoheatmap and mask using pose estimation based on the geometrical constraints. Specifically, we first fix the network and send the target image into the network to obtain the predicted heatmap $\hat{\mathbf {H}}_{t}$. Then, $\hat{\mathbf {H}}_{t}$ is decoded into the 2-D coordinates of keypoints, which are then used to generate putative correspondences $\lbrace (\hat{\bm {p}}_{t}^{k}, \bm {P}^{k})\rbrace _{k=1}^{N_{p}}$. Next, the estimated pose $[\hat{\mathbf {R}}|\hat{\bm {t}}]$ and the number of inliers $N_{in}$ can be obtained by solving (5). We kindly refer to the readers to our previous work [15] for the detail solution to (5). Furthermore, we take $N_{in}$ as the confidence of the estimated pose and select the estimated pose as a pseudopose if $N_{in} \geq N_{th}$. Otherwise, the pseudopose is set to $\mathbf{0}$ and the corresponding sample is excluded during network retraining. A smaller $N_{th}$ encourages more estimated poses to be used to generate pseudolabels for model training. In addition, for target samples with $N_{in} < N_{th}$, we generate pseudomasks using the approach proposed by [24].

Fig. 3.

Pipeline to generate pseudoheatmap and mask based on the geometrical constraints.

Show All

F. Discussion

Our self-training framework is built upon the geometrical constraints, which are provided by the projection function $\mathcal {F}$ and the rendering function $\mathcal {R}$. We try to provide a mathematical proof of the geometrical constraints in this section from the perspective of probability distribution. Since the functions $\mathcal {F}$ and $\mathcal {R}$ have the same formation, we only present the proof of the function $\mathcal {F}$ in the remainder of this section.

Equation (6) has four input parameters, including satellite pose $[\mathbf {R}|\bm {t}]$, 3-D landmarks $\mathcal {P}$, and camera intrinsic matrix $\mathbf {K}$. Since $\mathcal {P}$ and $\mathbf {K}$ are constant, if the poses of the satellite in the source and target domain are equal, the heatmaps in both domains are the same \begin{align*} &{}[\mathbf {R}_{s}|\bm {t}_{s}]=[\mathbf {R}_{t}|\bm {t}_{t}] \Rightarrow \\ & \mathcal {F}(\mathbf {R}_{s}, \bm {t}_{s}, \mathcal {P}, \mathbf {K}) = \mathcal {F}(\mathbf {R}_{t}, \bm {t}_{t}, \mathcal {P}, \mathbf {K}) \Rightarrow \\ & \mathbf {H}_{s} = \mathbf {H}_{t}. \tag{14} \end{align*} View SourceTherefore, when conditioned on the satellite pose, the heatmap distributions in source and target domains are the same \begin{equation*} P(\mathbf {H}_{s}|\mathbf {R}_{s}, \bm {t}_{s}) = P(\mathbf {H}_{t}|\mathbf {R}_{t}, \bm {t}_{t}). \tag{15} \end{equation*} View SourceOn the other hand, the heatmap distributions can be marginalized on the satellite poses as following: \begin{align*} P(\mathbf {H}_{s}) &= \int \int P(\mathbf {H}_{s}|\mathbf {R}_{s}, \bm {t}_{s}) P(\mathbf {R}_{s}=\mathbf {R}, \bm {t}_{s}=\bm {t}) \;\mathrm{d}\mathbf {R}\mathrm{d}\bm {t}\\ P(\mathbf {H}_{t}) &= \int \int P(\mathbf {H}_{t}|\mathbf {R}_{t}, \bm {t}_{t}) P(\mathbf {R}_{t}=\mathbf {R}, \bm {t}_{t}=\bm {t}) \;\mathrm{d}\mathbf {R}\mathrm{d}\bm {t}.\\ \tag{16} \end{align*} View SourceBy combining (1), (15), and (16), we can get the conclusion that the heatmaps in the source and target domain have the same distribution, i.e., \begin{equation*} P(\mathbf {H}_{s}) = P(\mathbf {H}_{t}). \tag{17} \end{equation*} View SourceTherefore, the geometrical constraint provided by the projection function $\mathcal {F}$ is domain-agnostic.

What'more, the similar conclusion also holds for the rendering function $\mathcal {R}$, i.e., $P(\mathbf {Y}_{s}) = P(\mathbf {Y}_{t})$. The fine-grained mask provides dense representation, while the keypoints heatmap is sparse and leads to significant information loss. Hence, we perform output-level adaptation to match mask distributions between source and target domain in Section III-D.

SECTION IV.

Experimental Results

We present experimental details and results in this section. We first introduce the dataset and the metrics used in the experiments in Section IV-A. Then, Section IV-B presents the details of implementation. Next, the key components of the proposed approach are studied in Section IV-D. We compare our approach with the state-of-the-art methods in Section IV-E. Finally, Section IV-F presents the runtime analysis.

A. Dataset and Metrics

Dataset: We conduct experiments on the SPEED+ [22] dataset to demonstrate the effectiveness of our method. The SPEED+ [22] dataset comprises images of the Tango spacecraft from the PRISMA [49] mission, consisting of three distinct domains, i.e., synthetic, lightbox, and sunlamp. Each image has a resolution of $1920 \times 1200$ and contains a single object. The synthetic domain comprises 59 960 images labeled with poses, which are generated using an OpenGL-based stimulator. The lightbox and sunlamp domains contain 6740 and 2791 images of a model of the same spacecraft captured in a robotic simulation environment. The satellite in the lightbox domain is illuminated by several lightboxes to approximate the diffuse light of Earth, while the same object in the sunlamp domain is exposed to an arc lamp to simulate the direct sunlight. Since the annotations for lightbox and sunlamp domains are not released, there are two UDA tasks with respect to SPEED+, including synthetic$\to$lightbox and synthetic$\to$sunlamp.

Metrics: We adopt the metrics used in SPEC2021. The rotation error is defined as the angle between the predicted quaternion $\hat{\bm {q}}$ and the ground truth $\bm {q}$, i.e., $E_{q} = 2\arccos (\hat{\bm {q}}^{T} \bm {q})$. The translation error is defined as the difference between the predicted value $\hat{\bm {t}}$ and the ground truth $\bm {t}$, i.e., $E_{t} = \Vert \hat{\bm {t}} - \bm {t} \Vert _{2}$. Given an image, the scores for rotation and translation are defined as ${S}_{q}=E_{q}$ and ${S}_{t}=E_{t}/\Vert \bm {t}\Vert _{2}$, respectively. The overall score is given as \begin{equation*} S = {\begin{cases}0, & \text{if}\quad {S}_{q} < \theta _{q} \; \text{and} \; {S}_{t} < \theta _{t} \\ {S}_{q} + {S}_{t}, & \text{otherwise} \end{cases}} \tag{18} \end{equation*} View Sourcewhere $\theta _{q}=0.169^\circ$ and $\theta _{t}=2.173 \times 10^{-3}$ are the thresholds, which are determined by the calibration results of the facility used to create the dataset [22].

B. Implementation Details

Mesh reconstruction and data preparation: Since the 3-D mesh of the satellite is not provided in SPEED+, we reconstruct the 3-D mesh $\mathcal {M}$ and 3-D landmarks $\mathcal {P}$ using source samples. However, due to illumination variations and material discrepancies, the texture of satellite is not domain-agnostic, while the geometrical model are the same for the source and target samples. Therefore, we reconstruct the texture-less 3-D mesh of the satellite to leverage the geometrical constraints.

The pipeline for 3-D mesh reconstruction and annotation is shown in Fig. 4. Note that, the source images can have two types of background: the earth background and the black background. Since the earth background usually introduces noise during reconstruction, we first train a classifier to select images with a black background. Next, to tackle the scale issue, we select 1000 images using the criterion of 4.5 m $\leq \Vert \bm {t}\Vert _{2} \leq$ 5 m, where $\bm {t}$ is the translation of the satellite to the camera. We first reconstruct the coarse mesh using the multiview stereo (MVS) approach provided by colmap [50], [51]. Next, we use the neural implicit surfaces (NeuS) [52] approach to refine the mesh. The refined mesh is then annotated with five categories, including antenna 1–3, solar panel, and body. Meanwhile, we select 11 landmarks on the surface of the mesh, following previous works [12], [15]. Finally, given the ground-truth poses for source images and the pseudoposes for target images, we use Blender to render the mesh for fine-grained masks, and use the pinhole camera model [53] to obtain 2-D keypoints.

Fig. 4.

Pipeline of 3-D mesh reconstruction and annotation.

Show All

Architecture details: Our network comprises four modules: a backbone, a mask head, a heatmap head, and a discriminator. We construct the backbone using a transformer-based HRNet network [55], i.e., HRFormer-S with 7.8 M parameters. The backbone extracts feature maps at $\frac{1}{4}$ resolution of the input images. The output channel number is 32. The heatmap head and the mask head are constructed using an atrous-spatial-pyramid-pooling (ASPP) module [54], respectively. As shown in Fig. 5, the module consists of five parallel branches: a global average pooling layer, a 1 × 1 convolution layer, and three 3 × 3 atrous convolution layers with rates of (6, 12, 18). Then, all feature maps are concatenated into one feature, whose channel number is adjusted using a 1 × 1 convolution layer. The discriminator has $4\times 4$ convolutional layers with channel numbers (16, 32, 64, 128, 1). The first and the second layers have a stride of 1 while others have a stride of 2. Each convolutional layer except the last one is followed by a leaky ReLU parameterized by 0.2.

Fig. 5.

Prediction head based on ASPP [54].

Show All

Experimental details: We implement the network using the PyTorch library and train our model using the AdamW [56] optimizer. All images are resized to the resolution of $640 \times 400$. The source images are first translated into target-like images using a CycleGAN [48] to reduce the bias towards the source domain. During training, we apply different data augmentation strategies on source and target samples using Albumentations [57] with the default parameter setting. The data augmentation on target samples composes of random horizontal and vertical flipping, random translating, scaling, and rotating. On source samples, the additional augmentation includes random gaussian noise and random blur. We adopt a multilevel learning strategy to enhance domain adaptation by applying the prediction heads after stages 3 and 4 of the backbone. The balance parameters of the losses in stages 3 and 4 are set to 0.1 and 1, respectively. During pseudolabel generation, threshold $N_{th}$ is empirically set to 8.

C. Source-Only Validation

We first carry out synthetic-only experiments to analyse the influence of the domain gap between source and target domains. To this end, we only use the training split of synthetic domain to train our network, and then test the model on the validation split of synthetic domain, lightbox and sunlamp domains. Note that the GycleGAN [48] is not applied as well.

Table I lists the experimental results of ours and 3 methods reported by Park et al. [22]. Specifically, SPN [14] and KRN [13] require a separate detector to crop the satellite before keypoints regression. In the experiment, the detection stage is skip and the ground-truth bounding boxes are used to crop the target. KRN [13] regresses a $1\times \text{2}N$ vector to encode the locations of 2-D keypoints and then constructs putative 2-D-3-D correspondences, which are used to estimate the pose using RANSAC [58] and PnP [59]. HigherHRNet [14]+EPnP [14] directly regresses the keypoints from the images and then estimates the poses using EPnP [14]. Our method and HigherHRNet [14]+EPnP [14] obtain the best performance on the synthetic images, illustrating that heatmap-based keypoints regression is capable of fitting the training data. However, all these methods generalize less to the real images. It demonstrates that the domain gap can lead to significant performance degradation.

TABLE I Synthetic-Only Performances of the Baseline Models, Tested on Synthetic Validation, Lightbox, and Sunlamp

D. Ablation Study

We conduct a series of ablation experiments to investigate the critical components of our approach, including self-training, adversarial training, mask prediction, and pseudolabel generation with the geometrical constraints. Due to the unavailability of pose labels of target samples, we manually annotate 100/50 images from the lightbox/sunlamp domain by selecting semantic keypoints and solving the PnP problem. We adopt the manually annotated samples as the validation set. For each setting, we take three rounds to optimize (11) and define the first round as the pretraining stage. During pretraining, we set the initial learning rate to 0.001 and train the network for 12 epochs with a batch size of 5. For the second and the third rounds, we decrease the learning rate to 0.0005 and optimize the network for 10 epochs, which contains 10 000 steps. The results are reported in Table II. The neural network trained after round $j$ under the $i$th setting is denoted by $\mathcal {G}_{i}^{j}$. Especially, model $\mathcal {G}^{0}_{1}$ is the trivial baseline, as it is trained on target-like images transformed using a CycleGAN and no other UDA method is used.

TABLE II Ablation Study of the Key Components on the SPEED+ Dataset

Self-training: In each setting, models $\mathcal {G}^{1}$ and $\mathcal {G}^{2}$ achieve smaller pose estimation errors than model $\mathcal {G}^{0}$. This demonstrates that the self-training framework can promote model performance for the UDA task in satellite pose estimation.

Adversarial training: We construct a baseline $\mathcal {G}_{1}$ within the basic framework by optimizing (7). Then, we extend the framework with adversarial training, by aligning the features extracted by the backbone. The model trained in the extended framework is denoted by $\mathcal {G}_{2}$. For each round, model $\mathcal {G}_{2}^{i}$ significantly outperforms the baseline $\mathcal {G}_{1}^{i}$, $i=0,1,2$. After the pretraining stage, model $\mathcal {G}_{2}^{0}$ reduce the translation error from 1.0069 m/0.9986 m to 0.2793 m/0.5211 m for the lightbox/sunlamp domain. When round = 2, model $\mathcal {G}_{2}^{2}$ reduces the scores by half for both domains. These results demonstrate the effectiveness of adversarial training.

Mask prediction: We further extend the baseline with multitask learning by appending a mask head after the backbone. The mask head predicts a binary mask (in model $\mathcal {G}_{3}$) or a fine-grained mask (in model $\mathcal {G}_{5}$) of the satellite. Different from model $\mathcal {G}_{2}$, we apply adversarial training by aligning the predictions of the mask head. Model $\mathcal {G}_{3}$ achieves more accurate pose estimation results than model $\mathcal {G}_{2}$ in all rounds except the last one in terms of total score $S$. Another experimental evidence is provided by the comparison between model $\mathcal {G}_{3}$ and model $\mathcal {G}_{5}$. When round = 0, model $\mathcal {G}_{5}^{0}$ significantly reduces the estimation error by nearly 30% and 50% in terms of score $S$ for both domains. The key reason is that the mask head in model $\mathcal {G}_{5}$ provides more fine-grained predictions and thus effectively enhances structural constraints and contextual information. Moreover, using the proposed self-training framework, model $\mathcal {G}_{5}^{2}$ achieves the best pose estimation results in all metrics and on all tasks. Therefore, leveraging fine-grained segmentation as an auxiliary task has positive impacts on 2-D keypoints regression, resulting in better performance of satellite pose estimate across domains.

To better analyze the function of the fine-grained segmentation, we compare the mean square error (MSE) between the ground-truth and the predicted keypoints on the validation set after each epoch. Specifically, we report the MSEs of the pretrained models, including $\mathcal {G}_{1}^{0}$, $\mathcal {G}_{3}^{0}$, and $\mathcal {G}_{5}^{0}$, and the results are shown in Fig. 6. Note that, model $\mathcal {G}_{1}^{0}$ always achieves the largest MSEs on lightbox and sunlamp domains, while $\mathcal {G}_{3}^{0}$ benefits from the binary segmentation task. Moreover, $\mathcal {G}_{5}^{0}$ shows the highest accuracy by predicting the fine-grained masks. It illustrates that the fine-grained segmentation can effectively improve the domain adaptation performance, and thus, prevent the optimization of (11) converging at local optima.

Fig. 6.

Mean square errors (MSE) between the predicted and ground-truth keypoints in the pretraining stage.

Show All

Geometrical constraints: We also study the role of the geometrical constraints, which are used during pseudolabel generation. We directly generate pseudoheatmaps and pseudomasks according to model predictions. The model trained using this setting is denoted by $\mathcal {G}_{4}$. In terms of three metrics on both domains, models $\mathcal {G}_{4}^{1}$ and $\mathcal {G}_{4}^{2}$ show degraded performance compared to $\mathcal {G}_{5}^{1}$ and $\mathcal {G}_{5}^{2}$, respectively. The performance degradation can be ascribed to the annotation noise in pseudolabels. With the geometrical constraints, models $\mathcal {G}_{5}^{1}$ and $\mathcal {G}_{5}^{2}$ are trained using more clean and more accurate pseudolabels, and thus, show superior performance. The visual comparison between pseudolabels generated with and without geometrical constraints is illustrated in Fig. 7.

Fig. 7.

Visualization of pseudolabels of target samples. The first column shows the pseudokeypoints generated without and with the geometrical constraints in green crosses and blue points, respectively. The red vectors illustrate the differences between the two types of pseudokeypoints. The second/last column shows pseudomasks generated without/with the geometrical constraints.

Show All

Multitask learning: We adopt adversarial training and mask segmentation to promote keypoint heatmap regression. In Fig. 8, we study the effectiveness of multitask learning strategies, by comparing the heatmap losses of model $\mathcal {G}_{0}^{0}$, $\mathcal {G}_{2}^{0}$, and $\mathcal {G}_{5}^{0}$ during training. On each domain, model $\mathcal {G}_{0}^{0}$ overfits more to source samples and generalizes less to target images, because it is only trained using source data. Model $\mathcal {G}_{2}^{0}$ performs feature level adaptation using adversarial training. Although model $\mathcal {G}_{2}^{0}$ has larger heatmap losses, it achieves better pose estimation accuracy than model $\mathcal {G}_{0}^{0}$. In contrast, $\mathcal {G}_{5}^{0}$ performs output-level adaptation by introducing fine-grained segmentation, resulting in the smallest losses and the best pose estimation accuracy on each domain. The reason is that the feature level adaptation is performed in the high-dimensional space, leading to the alignment of easier patterns [19]. Consequently, the feature distributions cannot be effectively matched. This demonstrates the effectiveness of our multitask learning strategies.

$Fig. 8. - Comparisons of the training curves of model $\mathcal {G}_{0}^{0}$, $\mathcal {G}_{2}^{0}$, and $\mathcal {G}_{5}^{0}$.$

Fig. 8.

Comparisons of the training curves of model $\mathcal {G}_{0}^{0}$, $\mathcal {G}_{2}^{0}$, and $\mathcal {G}_{5}^{0}$.

Show All

E. Comparison With the State-of-The-Art Methods

We take KRN [13] and SPNv2 [60] as the baseline methods. SPNv2 [60] is based on EfficientDet [61] and comprises three prediction heads: the EfficientPose head [62] for object presence, bounding box, target rotation and translation; the heatmap head for the 2-D heatmaps; the segmentation head for the binary mask of the satellite. Other technologies employed by SPNv2 include multiscale design [61], [62], extensive data augmentation [63], style augmentation [64], AdaBN [65], and entropy minimization. Different from these methods, we utilize the geometrical constraints to develop a self-training framework and explore the fine-grained segmentation to boost performance.

To achieve better performance, we add an upsampling layer followed by a $3\times 3$ convolution after the backbone and thus increase the feature resolution by a scaling factor 2. We take multiple rounds to optimize (11) and then compare our method with KRN [13], SPNv2 [60], and top-performing methods in SPEC2021. The results are listed in Table III. On the sunlamp domain, our approach outperforms all other methods in terms of translation and rotation scores, taking the first place in the challenge. More importantly, our method surpasses KRN [13] trained using real annotations by more than half in terms of score $S$. On the lightbox domain, our approach shows competitive performance and has won the third place in the challenge. Fig. 9 visualizes results estimated by the models trained with and without the proposed framework. Note that, our model can handle intense imaging noise, complicated illuminations, surface reflection, and pose variations.

TABLE III Comparison With the State-of-The-Art Methods and the Top-Performing Methods in SPEC2021

Fig. 9.

Visualization of pose estimation on target samples. The results achieved with and without our framework are shown in green and red colors, respectively.

Show All

F. Runtime Performance

We follow the experimental settings in ablation study and compare the runtime performance of the models using different backbones and feature map resolutions. The experiments are conducted on a PC with an Nvidia GTX 3090 GPU. Table IV reports the results. We first replace the transformer-based backbone with a CNN-based backbone, i.e., HRNet [66]. Although the running time is shorter at the expense of larger parameter sizes and memory consumption, the pose estimation performance drops from 0.221/0.099 to 0.253/0.163 on lightbox/sunlamp in terms of score $S$. Furthermore, we add an upsampling layer followed by a $3\times 3$ convolution after the backbone, to increase the feature resolution by a scaling factor 2. We observe significant improvements of pose estimation at the expense of increased computation. However, the increase in running time is only about 0.35% (which is negligible) since the upsampling module and the prediction heads are very simple.

TABLE IV Run-Time Analysis With Different Backbones and Feature Resolutions

SECTION V.

Limitation and Conclusion

A. Limitation

One apparent limitation of our method is that only the predicted heatmaps are used to generate pseudoposes in Section III-E, while the predicted masks are ignored. Another limitation would be the sparse representation of the satellite using a set of keypoints. Fig. 10 visualizes the worst two predictions on each domain. Note that the pose estimation could be less stable when these keypoints are invisible due to truncation, low light, or high reflection. The third limitation of this work is that the implementations of the projection function $\mathcal {F}$ and the rendering function $\mathcal {R}$ are nondifferentiable and performed offline. Future work is needed, specifically in incorporating a differentiable rendering engine to simultaneously achieve online self-training and pose refinement based on fine-grained masks.

Fig. 10.

Illustration of failure cases. The keypoints are reprojected using the predicted pose.

Show All

B. Conclusion

This article explores the domain-agnostic geometrical constraints to achieve unsupervised domain adaptation in satellite pose estimation. The task is formulated as a minimization problem in a self-training framework by taking the target poses as latent variables. Meanwhile, the fine-grained segmentation is introduced as an auxiliary task to improve performance. The experimental results demonstrate that our method achieves superior performance.

References is not available for this document.

Bridging the Domain Gap in Satellite Pose Estimation: A Self-Training Approach Based on Geometrical Constraints

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

Introduction

Related Work