Introduction
Estimating the pose of an uncooperative spacecraft is crucial in numerous space missions, such as debris removal [1], on-orbit servicing [2], and asset refueling [3]. Over the last two decades, numerous relative navigation systems have been proposed based on LiDARs [4], [5] and cameras [6], [7], [8], [9], [10], [11]. Active sensors, including radars and LiDARs, require larger mass and higher power consumption than visual sensors. Moreover, a stereo system requires a large baseline and relies on robust feature matching to obtain depth information. Therefore, monocular vision-based navigation systems have gained increasing attention from both academia and industry.
Traditionally, monocular satellite pose estimation approaches require hand-crafted feature extraction [6], [7], [8], [9], which limits their performance in challenging environments involving occlusions, harsh lighting conditions, reflective materials, and complex structures. Recently, deep learning-based methods have achieved success in satellite pose estimation [11], thanks to the powerful feature representation ability of deep models. However, training neural networks requires large-scale datasets, while collecting real images of spacecraft and annotating their 6-DoF poses is time-consuming, notoriously laborious, and difficult. Therefore, recent deep models are trained and evaluated using synthetic images [11], [12], [13], [14], [15]. Due to the inherent discrepancy between real and synthetic images, deep models fully supervised by synthetic images usually show deteriorated performance when deployed in real scenarios. This issue is revealed by the large gap between scores on the real and synthetic leaderboards.
In computer vision, unsupervised domain adaptation (UDA) [16], [17], [18], [19], [20], [21] is adopted in scenarios where the labels of real samples are scarce, by training deep models using labeled synthetic images and unlabeled real images. Motivated by these practices, Park et al. [22] created the next-generation spacecraft pose estimation dataset (SPEED+), which focuses on the synthetic-to-real domain gap in satellite pose estimation. Moreover, based on the SPEED+ dataset, the Advanced Concepts Team (ACT) of the European Space Agency (ESA) and the Space Rendezvous Laboratory (SLAB) at Stanford University co-organized the second international Satellite Pose Estimation Competition (SPEC2021) to promote research on bridging the domain gap.
Unlike common UDA tasks in computer vision, space missions use a calibrated camera to measure the pose of the same satellite under various environments. For UDA in satellite pose estimation in SPEC2021, only the environmental settings differ across domains, while the satellite structure and the camera parameters remain the same (as shown in Fig. 1). Therefore, given a pose, the locations of the keypoints and the satellite masks are identical in different domains; we refer to this property as the domain-agnostic geometrical constraints. In addition, intense imaging noise, challenging illumination variations, and diverse poses make this task more difficult than common UDA problems [17], [18], [19].
Characteristics of the UDA task in satellite pose estimation. The satellite structure is the same across domains, while the illumination conditions differ. The samples come from three distinct domains: synthetic, lightbox, and sunlamp.
Several UDA approaches [20], [21], [23], [24], [25], [26] have explored self-training paradigms to improve performance on the target domain. Nonetheless, these methods are not specifically designed for UDA in satellite pose estimation, since they do not fully exploit the domain-agnostic geometrical constraints. Meanwhile, previous satellite pose estimation approaches [12], [15] represent a satellite as a set of 2-D keypoints and then estimate the satellite pose using perspective-n-point (PnP) algorithms [27]. However, as shown in Fig. 1, sparse keypoints describe only a few semantic parts of the satellite. Such a sparse representation leads to a significant loss of information, which hampers knowledge transfer across domains.
To tackle the above problems, we formulate UDA of satellite pose estimation as a minimization problem under a self-training framework. First, we formulate the geometrical constraints as a projection function, which maps the predefined 3-D keypoints onto the source and target images using the same camera parameters. Based on the projection function, we propose a basic self-training framework that takes the poses of target samples as latent variables, which are jointly optimized with the network parameters. Second, we leverage fine-grained segmentation to extend the basic framework. Specifically, we enhance the geometrical constraints with a rendering function. Similar to the projection function, the rendering function maps the 3-D mesh of the satellite to fine-grained masks using the same camera parameters in different domains. Therefore, we take fine-grained segmentation as an auxiliary task of keypoint regression. Furthermore, as the masks provide dense descriptions with structural information, we perform adversarial training by aligning the predicted masks of the source and target samples. Finally, we iteratively optimize the network parameters and generate pseudolabels to solve the minimization problem. Experimental results demonstrate the effectiveness of our framework. Moreover, our method won first place and third place on the two leaderboards of the second international Satellite Pose Estimation Competition, respectively.
Our contributions can be summarized as follows.
We explore the domain-agnostic geometrical constraints to propose a self-training framework for UDA in satellite pose estimation.
We leverage fine-grained masks to address the information loss problem caused by abstracting the satellite as sparse keypoints.
Our method significantly improves the accuracy of satellite pose estimation without using real annotations.
Related Work
Object pose estimation aims at recovering the 3-D position and 3-D rotation of an object in the camera-centered coordinate system. Traditional approaches [28], [29] rely on local features and thus struggle with texture-less objects and background clutter. Recently, CNN-based methods have dominated most object pose estimation tasks. Numerous approaches [30], [31], [32], [33], [34], [35] have been proposed to estimate poses using putative 2-D-3-D correspondences and the PnP algorithms. To achieve better efficiency, several methods [36], [37], [38], [39] are introduced to directly regress poses from monocular images. Other methods [40], [41], [42] learn latent representations of rotation and recover poses by exploring image retrieval paradigms. Since these methods focus on household objects in indoor scenarios [43], [44], they face significant challenges caused by wide-range depth variations and illumination changes in outer space [45].
Satellite pose estimation is a special case of object pose estimation. Spacecraft pose network (SPN) [14] is the first deep learning-based approach for satellite pose estimation. Specifically, the 3-D rotation is recovered by discretizing the viewpoint space into bins, and then the 3-D translation is estimated using the geometrical constraints. In other top-performing approaches, the pose estimation problem is formulated as a task of localizing semantic keypoints on the convex areas of a satellite by taking various representations, such as heatmap [12], vector [13], and set [15]. These methods crop the satellite from the input images using a well-trained object detector to address scale variations. To achieve better efficiency, Hu et al. [45] handled the scale problem in a single-stage way by introducing a sampling strategy. However, these methods are trained and tested on synthetic data. They usually undergo significant performance degradation when applied to real images due to the domain gap [46].
Unsupervised domain adaptation aims at addressing the domain mismatch problem. It is a promising direction to circumvent the laborious and time-consuming procedures of data annotation. Several UDA paradigms have been studied for different vision tasks. Adversarial learning aligns both domains with a discriminator; the alignment can be achieved at the image level [16], feature level [17], [18], or output level [19]. Self-training methods utilize target samples to train the model by generating pseudolabels [23], [24], minimizing the entropy loss [20], [25], or employing the teacher-student framework [21], [26]. Nonetheless, it is challenging to adopt these approaches for UDA in satellite pose estimation, which differs from common UDA tasks in computer vision.
Method
In this section, we first introduce the UDA task of satellite pose estimation in Section III-A and the PnP-based solution to monocular satellite pose estimation in Section III-B. Then, we formulate the task as a minimization problem in a basic self-training framework in Section III-C, which is extended by leveraging fine-grained segmentation in Section III-D. We present the solution to the minimization problem in Section III-E. Finally, we give a mathematical proof of the geometrical constraints in Section III-F. The overview of our method is shown in Fig. 2.
Overview of our self-training framework. The satellite is represented as a set of sparse keypoints and dense fine-grained masks.
A. Problem Formulation
In UDA of satellite pose estimation, the satellite structure and the camera parameters are the same in the source and target domains. The intrinsic matrix of the camera is denoted by $\mathbf {K}$. The source domain provides $N_{s}$ labeled samples, where each image $\mathbf {I}_{s}$ is annotated with the rotation $\mathbf {R}_{s}$ and translation $\bm {t}_{s}$ of the satellite, whereas the target domain provides $N_{t}$ unlabeled images $\mathbf {I}_{t}$.
We assume that the poses in the source and target domains are sampled from the same distribution, i.e.,
\begin{equation*}
P(\mathbf {R}_{s}, \bm {t}_{s}) = P(\mathbf {R}_{t}, \bm {t}_{t}). \tag{1}
\end{equation*}
However, the imaging environments differ across the two domains, so the conditional distributions of the images given the same pose are not equal
\begin{equation*}
P(\mathbf {I}_{s}|\mathbf {R}_{s}, \bm {t}_{s}) \ne P(\mathbf {I}_{t}|\mathbf {R}_{s}, \bm {t}_{s}). \tag{2}
\end{equation*}
Consequently, the joint distributions of images and poses differ between the two domains
\begin{align*}
&P(\mathbf {I}_{s}, \mathbf {R}_{s}, \bm {t}_{s}) = P(\mathbf {I}_{s}|\mathbf {R}_{s}, \bm {t}_{s}) P(\mathbf {R}_{s}, \bm {t}_{s}) \\
&\ne P(\mathbf {I}_{t}, \mathbf {R}_{t}, \bm {t}_{t}) = P(\mathbf {I}_{t}|\mathbf {R}_{t}, \bm {t}_{t}) P(\mathbf {R}_{t}, \bm {t}_{t}). \tag{3}
\end{align*}
This mismatch between the joint distributions constitutes the domain gap to be bridged.
B. PnP-Based Solution
As suggested by Kisantal et al. [11], the PnP-based methods significantly outperform direct regression in monocular satellite pose estimation. Therefore, we follow the PnP-based [12], [15] method to estimate the satellite poses.
We assume that the texture-less 3-D mesh $\mathcal {M}$ of the satellite and a set of $N_{p}$ predefined 3-D keypoints $\mathcal {P} = \lbrace \bm {P}^{k}\rbrace _{k=1}^{N_{p}}$ are available. A network $\mathcal {G}_{f,h}$, composed of a backbone $\mathcal {G}_{f}$ and a heatmap head $\mathcal {G}_{h}$, regresses the 2-D keypoint heatmaps $\hat{\mathbf {H}}$ from an input image. Given $N_{s}$ labeled source samples, the network is trained by minimizing the heatmap loss $\mathcal {L}_{h}$ between the predicted heatmaps $\hat{\mathbf {H}}_{s}^{i}$ and the ground-truth heatmaps $\mathbf {H}_{s}^{i}$
\begin{equation*}
\min _{\mathcal {G}_{f,h}} \frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \mathcal {L}_{h}(\hat{\mathbf {H}}_{s}^{i}, \mathbf {H}_{s}^{i}) \tag{4}
\end{equation*}
At inference time, the 2-D keypoint locations $\hat{\bm {p}}^{k}$ are extracted from the predicted heatmaps, and the pose is recovered by minimizing the reprojection error
\begin{equation*}
\hat{\mathbf {R}}, \hat{\bm {t}} = \mathop {\arg \min }_{\mathbf {R},\bm {t}} \sum _{k=1}^{N_{p}} \phi (\Vert \lambda _{k} \hat{\bm {p}}^{k}- \mathbf {K}(\mathbf {R}\bm {P}^{k} + \bm {t})\Vert _{2}) \tag{5}
\end{equation*}
where $\hat{\bm {p}}^{k}$ is expressed in homogeneous coordinates, $\lambda _{k}$ is the corresponding depth, and $\phi (\cdot)$ is a robust kernel. This problem is solved with off-the-shelf PnP algorithms [27].
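In practice, (5) can be handled with off-the-shelf robust PnP routines. The following sketch illustrates this step with OpenCV; the argmax-based keypoint extraction, the RANSAC reprojection threshold, and the array layout are illustrative assumptions rather than the exact implementation used in this work.

```python
import cv2
import numpy as np

def pose_from_heatmaps(heatmaps, keypoints_3d, K):
    """Recover [R|t] from predicted keypoint heatmaps via robust PnP, as in (5).

    heatmaps:     (N_p, H, W) predicted heatmaps; if they are smaller than the
                  image, rescale the extracted coordinates accordingly.
    keypoints_3d: (N_p, 3) predefined 3-D keypoints P^k in the body frame.
    K:            (3, 3) camera intrinsic matrix shared by both domains.
    """
    n_p, h, w = heatmaps.shape
    # Use the argmax of each heatmap as the 2-D keypoint estimate p^k.
    idx = heatmaps.reshape(n_p, -1).argmax(axis=1)
    keypoints_2d = np.stack([idx % w, idx // w], axis=1).astype(np.float64)

    # Robust (RANSAC) PnP minimizes the reprojection error of (5).
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        keypoints_3d.astype(np.float64), keypoints_2d, K.astype(np.float64),
        None, reprojectionError=3.0, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None, None, False
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix from the Rodrigues vector
    return R, tvec.reshape(3), True
```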
Due to domain shifts, a neural network trained on the source domain usually suffers from performance degradation when applied to target images. To tackle this issue, we train the network using labeled source samples and unlabeled target samples, as described in Section III-C.
C. Basic Self-Training Framework
To fully exploit unlabeled target samples, we propose a self-training framework by leveraging the geometrical constraints. Specifically, we define the function that projects 3-D landmarks onto the 2-D heatmap as
\begin{equation*}
\mathbf {H}=\mathcal {F}(\mathbf {R}, \bm {t}, \mathcal {P}, \mathbf {K}) \tag{6}
\end{equation*}
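To make (6) concrete, the sketch below projects the predefined 3-D keypoints with the shared intrinsics $\mathbf {K}$ and renders one Gaussian blob per keypoint; the heatmap resolution and the Gaussian width are illustrative assumptions.

```python
import numpy as np

def project_keypoints(R, t, keypoints_3d, K):
    """Project 3-D keypoints onto the image plane: p^k ~ K (R P^k + t)."""
    cam = keypoints_3d @ R.T + t       # (N_p, 3) keypoints in the camera frame
    uv = cam @ K.T                     # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]      # perspective division -> (N_p, 2)

def heatmaps_from_pose(R, t, keypoints_3d, K, hw=(480, 768), sigma=2.0):
    """Projection function F in (6): a pose is mapped to per-keypoint heatmaps.

    hw must match the resolution implied by K (scale K if the heatmaps are
    rendered at a lower resolution than the original image).
    """
    h, w = hw
    uv = project_keypoints(R, t, keypoints_3d, K)
    ys, xs = np.mgrid[0:h, 0:w]
    # One Gaussian blob per keypoint, centered at its projected location.
    return np.exp(-((xs[None] - uv[:, 0, None, None]) ** 2 +
                    (ys[None] - uv[:, 1, None, None]) ** 2) / (2.0 * sigma ** 2))
```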
For source samples, we supervise the network using (4). The ground-truth heatmap $\mathbf {H}_{s}^{i}$ is generated by the projection function $\mathcal {F}$ from the annotated pose. For target samples, the poses are unknown; we therefore treat them as latent variables $\mathcal {T}_{t} = \lbrace \mathcal {T}_{t}^{j}\rbrace _{j=1}^{N_{t}}$ and use $\mathcal {F}$ to generate pseudoheatmaps $\tilde{\mathbf {H}}_{t}^{j}$. The network parameters and the latent poses are then jointly optimized by solving
\begin{align*}
&\min _{\mathcal {G}_{f,h}, \mathcal {T}_{t}} \frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \mathcal {L}_{h}(\hat{\mathbf {H}}_{s}^{i}, \mathbf {H}_{s}^{i}) + \frac{1}{N_{t}}\sum _{j=1}^{N_{t}} \mathcal {L}_{h}(\hat{\mathbf {H}}_{t}^{j}, \tilde{\mathbf {H}}_{t}^{j}) \\
&\begin{array}{lll}\rm {s.t.} \quad &\tilde{\mathbf {H}}_{t}^{j} = \mathcal {F}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {P}, \mathbf {K}) \\
&\mathcal {T}_{t}^{j} = [\mathbf {R}_{t}^{j}|\bm {t}_{t}^{j}] \in \mathbb {SE}_{3} \cup \lbrace \mathbf {0}\rbrace, \; j = 1, 2,\ldots, N_{t} \end{array}\tag{7}
\end{align*}
Equation (7) can be solved iteratively. During optimization, the network parameters $\mathcal {G}_{f,h}$ and the latent target poses $\mathcal {T}_{t}$ are updated alternately, as detailed in Section III-E. However, the sparse keypoints describe only a few semantic parts of the satellite, and such a sparse representation discards much of the structural information, which limits knowledge transfer across domains.
D. Extended Framework With Multitask Learning
To tackle the above problem, we note that the fine-grained masks in Fig. 1 provide rich dense descriptions with domain-agnostic structural context. Therefore, we apply segmentation as an auxiliary task of heatmap regression to improve pose estimation performance. In the following paragraphs, we first perform output-level alignment [19] using adversarial training, and then extend the minimization problem in (7) with the auxiliary task.
As shown in Fig. 2, we extend the basic network with a mask head $\mathcal {G}_{m}$, which predicts a fine-grained mask $\hat{\mathbf {Y}}$ of the satellite from the shared backbone features. In addition, we introduce a discriminator $\mathcal {G}_{d}$ that classifies, for each pixel of the $H\times W$ mask, whether the prediction comes from the source or the target domain. The discriminator is trained by minimizing the cross-entropy loss
\begin{align*}
\mathcal {L}_{d}(\hat{\mathbf {Y}}_{t}, \hat{\mathbf {Y}}_{s}) =& -\frac{1}{HW}\sum _{i=1}^{H}\sum _{j=1}^{W} \big [ \log (\mathcal {G}_{d}(\hat{\mathbf {Y}}_{s})(i, j, 1)) \\
&+ \log (\mathcal {G}_{d}(\hat{\mathbf {Y}}_{t})(i, j, 0)) \big ] \tag{8}
\end{align*}
where the channel indices 1 and 0 denote the source and target domains, respectively. The network, in turn, is trained to fool the discriminator by minimizing the adversarial loss
\begin{equation*}
\mathcal {L}_{adv}(\hat{\mathbf {Y}}_{t}) = -\frac{1}{HW}\sum _{i=1}^{H}\sum _{j=1}^{W}\log (\mathcal {G}_{d}(\hat{\mathbf {Y}}_{t})(i, j, 1)) \tag{9}
\end{equation*}
which encourages the predicted target masks to be indistinguishable from the source masks at the output level.
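A minimal PyTorch sketch of this output-level alignment is given below; the per-pixel discriminator architecture and channel sizes are illustrative assumptions, and only the label convention (channel 1 for source, channel 0 for target) follows (8) and (9).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskDiscriminator(nn.Module):
    """Fully convolutional discriminator G_d over predicted masks Y_hat."""

    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 2, 3, padding=1))  # per-pixel target/source logits

    def forward(self, y_hat):
        return F.log_softmax(self.net(y_hat), dim=1)  # per-pixel log-probabilities

def discriminator_loss(d, y_hat_s, y_hat_t):
    """L_d in (8): classify source masks as channel 1 and target masks as 0."""
    log_p_s = d(y_hat_s.detach())  # stop gradients into the pose network
    log_p_t = d(y_hat_t.detach())
    return -(log_p_s[:, 1].mean() + log_p_t[:, 0].mean())

def adversarial_loss(d, y_hat_t):
    """L_adv in (9): push target masks to be classified as source (channel 1)."""
    return -d(y_hat_t)[:, 1].mean()
```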
Furthermore, we introduce a rendering function $\mathcal {R}$, which maps the 3-D mesh $\mathcal {M}$ of the satellite to a fine-grained mask using the same camera intrinsics in both domains
\begin{equation*}
\mathbf {Y} = \mathcal {R}(\mathbf {R}, \bm {t}, \mathcal {M}, \mathbf {K}). \tag{10}
\end{equation*}
With the rendering function, the masks of target samples can also be generated from the latent poses, and the extended minimization problem is formulated as
\begin{align*}
&\min _{\mathcal {G}_{f,h,m}, \mathcal {T}_{t}} \max _{\mathcal {G}_{d}} \frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{s}^{i}, \mathbf {H}_{s}^{i}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{s}^{i}, \mathbf {Y}_{s}^{i})\right] \\
&+ \frac{1}{N_{t}}\sum _{j=1}^{N_{t}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{t}^{j}, \tilde{\mathbf {H}}_{t}^{j}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{t}^{j}, \tilde{\mathbf {Y}}_{t}^{j}) + \lambda _{a} \mathcal {L}_{adv}({\hat{\mathbf {Y}}_{t}^{j}}) \right] \\
&\rm {s.t.}\quad \tilde{\mathbf {H}}_{t}^{j} = \mathcal {F}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {P}, \mathbf {K}),\; \tilde{\mathbf {Y}}_{t}^{j} = \mathcal {R}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {M}, \mathbf {K}) \\
&\qquad\;\, \mathcal {T}_{t}^{j} = [\mathbf {R}_{t}^{j}|\bm {t}_{t}^{j}] \in \mathbb {SE}_{3} \cup \lbrace \mathbf {0}\rbrace, \; j = 1, 2,\ldots, N_{t} \tag{11}
\end{align*}
where $\mathcal {L}_{m}$ is the segmentation loss, and $\lambda _{m}$ and $\lambda _{a}$ balance the segmentation and adversarial terms.
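For illustration, a heavily simplified version of the rendering function $\mathcal {R}$ in (10) is sketched below. It only produces a binary silhouette by filling every projected mesh triangle; the fine-grained, multi-part masks used in this work additionally require per-part labels and a depth test, which are omitted here.

```python
import cv2
import numpy as np

def render_silhouette(R, t, vertices, faces, K, hw=(480, 768)):
    """Simplified rendering function R in (10): pose + mesh -> binary mask.

    vertices: (V, 3) mesh vertices in the satellite body frame.
    faces:    (F, 3) integer vertex indices of the mesh triangles.
    """
    h, w = hw
    cam = vertices @ R.T + t                       # vertices in the camera frame
    uv = cam @ K.T
    uv = np.round(uv[:, :2] / uv[:, 2:3]).astype(np.int32)

    mask = np.zeros((h, w), dtype=np.uint8)
    for tri in faces:
        # The union of all projected triangles equals the silhouette, so no
        # visibility test is needed for a binary mask.
        cv2.fillConvexPoly(mask, uv[tri], 1)
    return mask
```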
E. Iterative Optimization and Pseudolabel Generation
We observe that the variables in (11) can be divided into two classes: the network parameters and the poses of target samples. Following [23], we adopt iterative procedures to optimize (11).
1) Fix the pseudolabels (or initialize $\mathcal {T}_{t}$ as $\lbrace \mathbf {0}\rbrace$) and train the network $\mathcal {G}_{f,h,m,d}$.
2) Fix the network, optimize the poses $\mathcal {T}_{t}$, and generate pseudolabels (a minimal sketch of this alternation is given below).
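The following sketch shows the outer loop of this alternation; the two callables and the numbers of rounds and epochs are placeholders rather than the settings used in our experiments.

```python
def self_training(train_one_epoch, generate_pseudo_labels,
                  num_rounds=5, epochs_per_round=10):
    """Alternating optimization of (11).

    train_one_epoch:        callable running one epoch of (12) with the
                            current pseudolabels.
    generate_pseudo_labels: callable implementing (13); returns fresh
                            pseudoheatmaps/pseudomasks for all target samples.
    """
    pseudo_labels = {}                 # latent poses T_t initialized as {0}
    for _ in range(num_rounds):
        # Step 1: fix the pseudolabels and train the network G_{f,h,m,d}.
        for _ in range(epochs_per_round):
            train_one_epoch(pseudo_labels)
        # Step 2: fix the network, optimize the poses, and refresh pseudolabels.
        pseudo_labels = generate_pseudo_labels()
    return pseudo_labels
```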
In the first step, when the pseudoheatmaps and pseudomasks are fixed, the minimization problem in (11) is simplified as
\begin{align*}
&\min _{\mathcal {G}_{f,h,m}} \max _{\mathcal {G}_{d}} \frac{1}{N_{s}}\sum _{i=1}^{N_{s}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{s}^{i}, \mathbf {H}_{s}^{i}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{s}^{i}, \mathbf {Y}_{s}^{i})\right] \\
&+ \frac{1}{N_{t}}\sum _{j=1}^{N_{t}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{t}^{j}, \tilde{\mathbf {H}}_{t}^{j}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{t}^{j}, \tilde{\mathbf {Y}}_{t}^{j}) + \lambda _{a} \mathcal {L}_{adv}({\hat{\mathbf {Y}}_{t}^{j}}) \right]. \tag{12}
\end{align*}
In the second step, when the network parameters are fixed, the problem reduces to optimizing the latent target poses
\begin{align*}
&\min _{\mathcal {T}_{t}} \frac{1}{N_{t}}\sum _{j=1}^{N_{t}} \left[ \mathcal {L}_{h}(\hat{\mathbf {H}}_{t}^{j}, \tilde{\mathbf {H}}_{t}^{j}) + \lambda _{m} \mathcal {L}_{m}(\hat{\mathbf {Y}}_{t}^{j}, \tilde{\mathbf {Y}}_{t}^{j}) \right] \\
&\begin{array}{rl}\rm {s.t.}\quad &\tilde{\mathbf {H}}_{t}^{j} = \mathcal {F}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {P}, \mathbf {K}),\; \tilde{\mathbf {Y}}_{t}^{j} = \mathcal {R}(\mathbf {R}_{t}^{j}, \bm {t}_{t}^{j}, \mathcal {M}, \mathbf {K}) \\
&\mathcal {T}_{t}^{j} = [\mathbf {R}_{t}^{j}|\bm {t}_{t}^{j}] \in \mathbb {SE}_{3} \cup \lbrace \mathbf {0}\rbrace, \; j = 1, 2,\ldots, N_{t}. \\
\end{array}\tag{13}
\end{align*}
Fig. 3 shows the pipeline for generating the pseudoheatmap and pseudomask using pose estimation based on the geometrical constraints. Specifically, we first fix the network and feed the target image into it to obtain the predicted heatmap and mask. Then, the pose of the target sample is estimated from the predicted heatmap using the PnP algorithm in (5). Finally, the pseudoheatmap and pseudomask are regenerated from the estimated pose with the projection function $\mathcal {F}$ and the rendering function $\mathcal {R}$, so that the pseudolabels always satisfy the geometrical constraints.
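The sketch below outlines this pseudolabel generation step, reusing the illustrative helpers defined earlier (pose_from_heatmaps, heatmaps_from_pose, and render_silhouette); the data loader format and the model output dictionary are assumptions.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(model, target_loader, keypoints_3d, vertices, faces, K):
    """Step 2 of the alternation: predict -> PnP -> re-project, following (13)."""
    pseudo_labels = {}
    for image, index in target_loader:      # assumed batches of one (image, index)
        heatmaps = model(image)["heatmaps"][0].cpu().numpy()
        # Estimate the target pose from the predicted heatmaps via robust PnP.
        R, t, ok = pose_from_heatmaps(heatmaps, keypoints_3d, K)
        if not ok:                           # keep T_t^j = 0 if PnP is unreliable
            pseudo_labels[int(index)] = None
            continue
        # Re-impose the geometrical constraints: the pseudoheatmap and pseudomask
        # are regenerated from the estimated pose, not copied from the predictions.
        pseudo_labels[int(index)] = {
            "heatmap": heatmaps_from_pose(R, t, keypoints_3d, K),
            "mask": render_silhouette(R, t, vertices, faces, K),
        }
    return pseudo_labels
```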
F. Discussion
Our self-training framework is built upon the geometrical constraints, which are provided by the projection function $\mathcal {F}$ in (6) and the rendering function $\mathcal {R}$ in (10). In the following, we show that the heatmaps generated by $\mathcal {F}$ follow the same distribution in the source and target domains.
Equation (6) has four input parameters, namely the rotation $\mathbf {R}$, the translation $\bm {t}$, the 3-D keypoints $\mathcal {P}$, and the camera intrinsics $\mathbf {K}$. Since $\mathcal {P}$ and $\mathbf {K}$ are shared by the source and target domains, identical poses yield identical heatmaps
\begin{align*}
&{}[\mathbf {R}_{s}|\bm {t}_{s}]=[\mathbf {R}_{t}|\bm {t}_{t}] \Rightarrow \\
& \mathcal {F}(\mathbf {R}_{s}, \bm {t}_{s}, \mathcal {P}, \mathbf {K}) = \mathcal {F}(\mathbf {R}_{t}, \bm {t}_{t}, \mathcal {P}, \mathbf {K}) \Rightarrow \\
& \mathbf {H}_{s} = \mathbf {H}_{t}. \tag{14}
\end{align*}
Hence, for any given pose, the conditional distributions of the generated heatmaps are identical across domains
\begin{equation*}
P(\mathbf {H}_{s}|\mathbf {R}_{s}, \bm {t}_{s}) = P(\mathbf {H}_{t}|\mathbf {R}_{t}, \bm {t}_{t}). \tag{15}
\end{equation*}
Marginalizing over the pose distribution gives
\begin{align*}
P(\mathbf {H}_{s}) &= \int \int P(\mathbf {H}_{s}|\mathbf {R}_{s}, \bm {t}_{s}) P(\mathbf {R}_{s}=\mathbf {R}, \bm {t}_{s}=\bm {t}) \;\mathrm{d}\mathbf {R}\,\mathrm{d}\bm {t}\\
P(\mathbf {H}_{t}) &= \int \int P(\mathbf {H}_{t}|\mathbf {R}_{t}, \bm {t}_{t}) P(\mathbf {R}_{t}=\mathbf {R}, \bm {t}_{t}=\bm {t}) \;\mathrm{d}\mathbf {R}\,\mathrm{d}\bm {t}. \tag{16}
\end{align*}
Combining (1), (15), and (16), the marginal distributions of the heatmaps are also identical
\begin{equation*}
P(\mathbf {H}_{s}) = P(\mathbf {H}_{t}). \tag{17}
\end{equation*}
Moreover, a similar conclusion holds for the rendering function $\mathcal {R}$ in (10), since the 3-D mesh $\mathcal {M}$ and the camera intrinsics $\mathbf {K}$ are also shared across domains. Therefore, the labels generated through the geometrical constraints are domain agnostic, which justifies the proposed self-training framework.
Experimental Results
We present the experimental details and results in this section. We first introduce the dataset and the metrics used in the experiments in Section IV-A. Then, Section IV-B presents the implementation details, and Section IV-C reports the source-only validation. Next, the key components of the proposed approach are studied in Section IV-D. We compare our approach with the state-of-the-art methods in Section IV-E. Finally, Section IV-F presents the runtime analysis.
A. Dataset and Metrics
Dataset: We conduct experiments on the SPEED+ [22] dataset to demonstrate the effectiveness of our method. The SPEED+ [22] dataset comprises images of the Tango spacecraft from the PRISMA [49] mission and consists of three distinct domains, i.e., synthetic, lightbox, and sunlamp. The synthetic domain provides labeled images rendered with a graphics engine, whereas the lightbox and sunlamp domains contain unlabeled hardware-in-the-loop images captured under diffuse and direct high-intensity illumination, respectively.
Metrics: We adopt the metrics used in SPEC2021. The rotation error $S_{q}$ is defined as the angle between the predicted quaternion $\hat{\bm {q}}$ and the ground-truth quaternion $\bm {q}$, i.e., $S_{q} = 2\arccos (|\langle \hat{\bm {q}}, \bm {q}\rangle |)$. The translation error $S_{t}$ is the $\ell _{2}$ error between the predicted translation $\hat{\bm {t}}$ and the ground truth $\bm {t}$, normalized by the ground-truth distance, i.e., $S_{t} = \Vert \hat{\bm {t}} - \bm {t}\Vert _{2} / \Vert \bm {t}\Vert _{2}$. The total pose error is computed with thresholds $\theta _{q}$ and $\theta _{t}$ that account for the annotation precision of the hardware-in-the-loop imagery
\begin{equation*}
S = {\begin{cases}0, & \text{if}\quad {S}_{q} < \theta _{q} \; \text{and} \; {S}_{t} < \theta _{t} \\
{S}_{q} + {S}_{t}, & \text{otherwise} \end{cases}} \tag{18}
\end{equation*}
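A sketch of the metric is given below, assuming unit-norm quaternions; the threshold values are deliberately left as parameters and should be taken from the official SPEC2021 evaluation toolkit.

```python
import numpy as np

def pose_score(q_pred, q_gt, t_pred, t_gt, theta_q, theta_t):
    """SPEC2021-style pose error in (18).

    q_pred, q_gt: unit quaternions; t_pred, t_gt: translation vectors.
    theta_q, theta_t: rejection thresholds from the official evaluation
    (their values are omitted here).
    """
    # Rotation error S_q: angle between the predicted and ground-truth quaternions.
    s_q = 2.0 * np.arccos(np.clip(abs(np.dot(q_pred, q_gt)), -1.0, 1.0))
    # Translation error S_t: L2 error normalized by the ground-truth distance.
    s_t = np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)) / np.linalg.norm(t_gt)
    # Errors below the precision of the hardware-in-the-loop setup are zeroed out.
    return 0.0 if (s_q < theta_q and s_t < theta_t) else s_q + s_t
```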
B. Implementation Details
Mesh reconstruction and data preparation: Since the 3-D mesh of the satellite is not provided in SPEED+, we reconstruct the 3-D mesh $\mathcal {M}$ of the Tango spacecraft from the labeled synthetic images.
The pipeline for 3-D mesh reconstruction and annotation is shown in Fig. 4. Note that, the source images can have two types of background: the earth background and the black background. Since the earth background usually introduces noise during reconstruction, we first train a classifier to select images with a black background. Next, to tackle the scale issue, we select 1000 images using the criterion of 4.5 m
Architecture details: Our network comprises four modules: a backbone, a mask head, a heatmap head, and a discriminator. We construct the backbone using a transformer-based HRNet network [55], i.e., HRFormer-S with 7.8 M parameters. The backbone extracts feature maps at
Experimental details: We implement the network using the PyTorch library and train our model using the AdamW [56] optimizer. All images are resized to the resolution of
C. Source-Only Validation
We first carry out source-only experiments, in which the models are trained only on the labeled synthetic images without any adaptation to the target domains.
Table I lists the experimental results of our method and three methods reported by Park et al. [22]. Specifically, SPN [14] and KRN [13] require a separate detector to crop the satellite before keypoint regression. In this experiment, the detection stage is skipped and the ground-truth bounding boxes are used to crop the target. KRN [13] regresses a vector of 2-D keypoint coordinates from the cropped region.
D. Ablation Study
We conduct a series of ablation experiments to investigate the critical components of our approach, including self-training, adversarial training, mask prediction, and pseudolabel generation with the geometrical constraints. Due to the unavailability of pose labels of target samples, we manually annotate 100/50 images from the lightbox/sunlamp domain as validation sets.
Self-training: In each setting, models
Adversarial training: We construct a baseline
Mask prediction: We further extend the baseline with multitask learning by appending a mask head after the backbone. The mask head predicts a binary mask (in model
To better analyze the function of the fine-grained segmentation, we compare the mean square error (MSE) between the ground-truth and the predicted keypoints on the validation set after each epoch. Specifically, we report the MSEs of the pretrained models, including
Mean square errors (MSE) between the predicted and ground-truth keypoints in the pretraining stage.
Geometrical constraints: We also study the role of the geometrical constraints, which are used during pseudolabel generation. We directly generate pseudoheatmaps and pseudomasks according to model predictions. The model trained using this setting is denoted by
Visualization of pseudolabels of target samples. The first column shows the pseudokeypoints generated without and with the geometrical constraints in green crosses and blue points, respectively. The red vectors illustrate the differences between the two types of pseudokeypoints. The second/last column shows pseudomasks generated without/with the geometrical constraints.
Multitask learning: We adopt adversarial training and mask segmentation to promote keypoint heatmap regression. In Fig. 8, we study the effectiveness of multitask learning strategies, by comparing the heatmap losses of model
Comparison of the training curves of the models with different multitask learning strategies.
E. Comparison With the State-of-the-Art Methods
We take KRN [13] and SPNv2 [60] as the baseline methods. SPNv2 [60] is based on EfficientDet [61] and comprises three prediction heads: the EfficientPose head [62] for object presence, bounding box, target rotation and translation; the heatmap head for the 2-D heatmaps; the segmentation head for the binary mask of the satellite. Other technologies employed by SPNv2 include multiscale design [61], [62], extensive data augmentation [63], style augmentation [64], AdaBN [65], and entropy minimization. Different from these methods, we utilize the geometrical constraints to develop a self-training framework and explore the fine-grained segmentation to boost performance.
To achieve better performance, we add an upsampling layer followed by a
Visualization of pose estimation on target samples. The results achieved with and without our framework are shown in green and red colors, respectively.
F. Runtime Performance
We follow the experimental settings in the ablation study and compare the runtime performance of models using different backbones and feature map resolutions. The experiments are conducted on a PC with an Nvidia RTX 3090 GPU. Table IV reports the results. We first replace the transformer-based backbone with a CNN-based backbone, i.e., HRNet [66]. Although the running time is shorter, at the expense of a larger parameter size and memory consumption, the pose estimation performance drops from 0.221/0.099 to 0.253/0.163 on the two target domains.
Limitation and Conclusion
A. Limitation
One apparent limitation of our method is that only the predicted heatmaps are used to generate pseudoposes in Section III-E, while the predicted masks are ignored. Another limitation is the sparse representation of the satellite as a set of keypoints. Fig. 10 visualizes the worst two predictions on each domain. Note that the pose estimation can become unstable when these keypoints are invisible due to truncation, low light, or strong reflections. A third limitation is that the implementations of the projection function $\mathcal {F}$ and the rendering function $\mathcal {R}$ rely on the predefined 3-D keypoints and the reconstructed 3-D mesh, so the quality of the generated pseudolabels depends on the accuracy of this 3-D model.
Illustration of failure cases. The keypoints are reprojected using the predicted pose.
B. Conclusion
This article explores domain-agnostic geometrical constraints to achieve unsupervised domain adaptation in satellite pose estimation. The task is formulated as a minimization problem in a self-training framework that treats the target poses as latent variables. Meanwhile, fine-grained segmentation is introduced as an auxiliary task to improve performance. The experimental results demonstrate that our method achieves superior performance.