Introduction
Detecting, tracking, and following objects of interest is critical to several robotics use-cases, such as industrial automation, logistics and warehousing, healthcare, and security [1], [2], [3], [4]. Notably, one of the key drivers of continuous progress in robust object-following systems is the combination of computer vision and deep learning [5], [6], where training deep convolutional networks on large labeled datasets has enabled tremendous strides in this area. Specifically, object following relies on video segmentation and tracking, which can be categorized into distinct subtasks. These include interactive (scribble- or click-based) video segmentation [7], where a user draws a box around or clicks on the object to segment and track; mask-guided video segmentation [8], [9], [10], [11], which assumes a mask of the object to track is given; and automatic video segmentation [12], [13], [14], [15], [16], which assumes no user interaction: methods must provide a set of non-overlapping object candidates that span the video sequence. However, these candidates are not specific, meaning the segmentation is applied to all visible objects without recognizing the desired one. Thus, to automatically identify the object to follow, numerous detection approaches have been suggested [17], [18], such as RCNN and its variants [19], [20], [21], YOLO and its variants [22], [23], [24], and more [25], [26]. However, existing robotic systems for object detection and following suffer from two notable shortcomings:
They are closed-set, i.e., the set of objects to detect and follow is assumed to be available a priori (during the training phase). Thus, such systems are only able to handle a fixed set of object categories [2], [27], [28], [29], [30], limiting their adaptability; adapting to newer object categories necessitates finetuning the model.
Additionally, the objects of interest are specified (queried) only by a class label, which is often unintuitive for end-users, imposing restrictions on how they interact with the system [2], [31], [32].
Deep learning is currently undergoing another wave of ever-more performant and robust model design, with increasingly large multimodal models trained on internet-scale data containing billions of images, text snippets, and audio clips. These highly capable models (e.g., CLIP [33], DINO [34]) have demonstrated impressive performance in open-set scenarios (i.e., the objects of interest are supplied only at inference time, without task-specific training) [36], [37]. Notably, recent robotics approaches using foundation models have shown impressive open-set interaction abilities [38], [39], [40], [41], [42], and extended robustly to multimodal applications [43], [44], [45], [46]. However, integrating these models into real-time, resource-constrained robotic systems poses significant challenges due to their large model size and high inference latency.
A. Our Contributions
We address the aforementioned gaps by developing an open-set, real-time, any-object-following approach, which can flexibly adapt to categories specified at inference time via multiple modalities, including text, images, and clicks. Specifically, we present the follow anything system (FAn):
an open-set, multimodal approach to detect, segment, track, and follow any object in real-time (>6 FPS on an 8 GB GPU). The desired object may be specified via a text prompt, an image, a bounding box, or a click.
a unified system that is easily deployed on a robot platform (in our work, a micro aerial vehicle). The system includes real-time processing of input image streams and visual-servoing control loops for following the object of interest.
built with a re-detection mechanism that accounts for scenarios where the object of interest is occluded or tracking is lost. This mechanism can function autonomously or with human guidance, ensuring the object is successfully identified and tracked again to maintain continuity in the tracking process.
We validate our system by autonomously detecting, tracking, and following a multitude of mobile agents including a drone, an RC car, and a manually operated brick.
Our Approach: FAn
Open-vocabulary object following: Given (1) a robotic system (here, a micro aerial vehicle) equipped with an onboard camera, and (2) an object of interest within the onboard camera's field-of-view (specified either as a text prompt, an image, a bounding box, or a click); the object following task involves detecting the object of interest and producing robot controls (in our case, velocity commands) such that the robot keeps following the object over time.
FAn system overview: FAn uses a combination of state-of-the-art ViT models, optimizes them for real-time performance, and unifies them into a single system. In particular, we leverage the segment anything model (SAM) [35] for segmentation and DINO [34] and CLIP [33] for general-purpose visual features, and we design a lightweight detection and semantic segmentation scheme by combining the features from CLIP and DINO with the class-agnostic instance segmentation produced by SAM. We use the (Seg)AOT [9], [47] and SiamMask [48] models for real-time tracking, and design a lightweight visual servoing controller for object following; see Fig. 1 for an illustration.
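The overall flow described above can be summarized as a simple perception-control loop. The following is a minimal sketch under our own naming, not the actual FAn code: `detector`, `tracker`, and `controller` are hypothetical objects standing in for the SAM/DINO/CLIP detection stage, the (Seg)AOT/SiamMask tracker, and the visual-servoing controller.

```python
# Hypothetical high-level sketch of the detect -> track -> servo loop.
def follow_anything(stream, query, detector, tracker, controller):
    mask = None
    for frame in stream:                        # real-time RGB frames
        if mask is None:
            # Open-vocabulary detection: SAM masks + DINO/CLIP features vs. the query.
            mask = detector.detect(frame, query)
            if mask is None:
                continue                        # nothing matched yet; try next frame
        else:
            mask, ok = tracker.step(frame, mask)  # (Seg)AOT / SiamMask-style tracking
            if not ok:
                mask = None                       # triggers re-detection next frame
                continue
        # Lightweight visual servoing: steer so the object stays centered in the frame.
        controller.send(controller.compute(mask, frame.shape))
```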
A. Real-Time Open-Vocabulary Object Detection
We first describe our lightweight object detection and segmentation pipeline that builds atop SAM, CLIP, and DINO. Our system takes as input an RGB frame from a video stream, represented as a 3D tensor of shape height × width × 3 (RGB channels).
Embedding input queries: To detect the desired object referred to by the input query, we compute two sets of features from the current frame.
First, we compute the (binary) instance segmentation masks by applying SAM to the input frame.
Second, we extract the pixel-wise feature descriptors by applying DINO (or CLIP) to the input frame.
Similarity scores: Given a query (in the form of text, image, or click), we first extract a query feature descriptor, and then compute the cosine similarity between this descriptor and the feature descriptor of each segmentation mask.
Single query detection: If the similarity between a mask's descriptor and the query descriptor exceeds a threshold, the mask is reported as a detection of the desired object.
Multi-class detection:
Should the user provide a set of queries (e.g., labels for other objects present in the scene), each segmentation mask is assigned to the query whose descriptor is most similar (in cosine similarity) to the mask's descriptor.
After this process, each pixel is assigned a label from the provided set of queries, and only the masks assigned to the desired object are kept for tracking.
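To make the above concrete, here is a minimal sketch of the per-mask query assignment, assuming SAM masks, per-pixel DINO/CLIP descriptors, and query descriptors are already computed. The mean-pooling of per-pixel features into a per-mask descriptor and the threshold value are illustrative assumptions, not necessarily FAn's exact choices.

```python
import numpy as np

def assign_masks_to_queries(masks, pixel_feats, query_feats, tau=0.5):
    """masks: list of HxW bool arrays (e.g., from SAM); pixel_feats: HxWxD per-pixel
    descriptors (e.g., from DINO/CLIP); query_feats: QxD query descriptors;
    tau: similarity threshold for the single-query case."""
    labels = []
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-8)
    for m in masks:
        # Per-mask descriptor: mean of the per-pixel descriptors inside the mask
        # (illustrative pooling choice).
        f = pixel_feats[m].mean(axis=0)
        f = f / (np.linalg.norm(f) + 1e-8)
        sims = q @ f                       # cosine similarity to each query
        best = int(np.argmax(sims))
        # Keep the mask only if its best similarity exceeds the threshold.
        labels.append(best if sims[best] > tau else -1)
    return labels                          # -1 marks "no query matched"
```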
Follow anything (FAn) is a real-time robotic system to detect, track, and follow objects in an open-vocabulary setting. Objects of interest may be specified using text, images, or clicks. FAn leverages foundation models like CLIP [33], DINO [34], and SAM [35] to compute segmentation masks that best align with the queried objects. These objects are tracked across video frames while accounting for occlusion and object re-emergence, enabling real-time following of the objects of interest by a robot platform.
FAn outputs illustrated on an input frame containing four whales, with one click query on a whale and another on the water. First, SAM extracts multiple masks; then, based on DINO features, FAn classifies each mask according to which of the given queries (water/whale) it matches. Finally, whales are detected by selecting the masks whose DINO feature descriptors are closest to the whale query descriptor. NOTE: Heat maps are shown in the click (query) figures.
Automatic detection experiments (SAM-and-DINO). Examples of our automatic detection scheme for detecting drones, bricks, and RC cars. The examples include (from left to right): the original input frame, the SAM segmentation masks, and the DINO + cosine-similarity semantic segmentation and detection.
Heat maps showing the pixels' semantic similarity. For every pixel, its feature descriptor is extracted; then, the cosine similarity is computed between this descriptor and the descriptor of a focal-point pixel (indicated by a yellow arrow).
Fast automatic detection experiments (DINO only): Examples of our fast automatic detection scheme on detecting (1) whales, (2) drones, (3) RC cars, and (4) toy bricks. This approach is much faster and works very well for detecting the desired object. However, it provides less “clean” segmentations/masks.
Automatic re-detection via cross-trajectory stored ViT features. (1) At every frame, we store the DINO features representing the tracked object. (2) Once the object is lost, we (3) either apply a segmentation model or get suggested masks from the tracker; for every mask, we compute its DINO descriptor and (4) compare it to the pre-computed ones. If a high similarity is obtained, we keep tracking the object; otherwise, we repeat (3) on the next frame.
Automatic tracking, following, and re-detection. The tracked object is indicated by the yellow arrow; we also show the results of the re-detection mechanism in the last two rows.
Trajectory comparison. We report the mean Euclidean distance between every point in the target object's trajectory and its closest point on the quadrotor's trajectory.
Left: Our custom-built quadrotor. Top right: Successful automatic detection via text queries (SAM+CLIP) on low-resolution images; text queries used, from left to right: “a toy car” (single query), “a drone” (single query), and “a whale”+“water” (multi-class). Bottom right: In some cases, the raw data from the cropped masks (used to obtain pixel-wise features from CLIP) does not provide enough information for CLIP: since the image is of low resolution and the mask is small, CLIP produces inaccurate descriptors, and thus FAn may fail to detect the objects.
Automatic detection experiments via text queries (SAM-and-CLIP) on high-resolution data.
Manual queries: We provide users the option to manually draw bounding boxes (or provide outputs from a customized domain-specific detector) around the objects they wish to track, or alternatively, to click on one or two pixels within the object (in real-time from the video stream). After user selection, we use SAM to accurately segment and obtain the object mask. This method ensures precise control over tracking, making it suitable for high-accuracy detection scenarios.
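As a concrete illustration of the click/box workflow, the sketch below uses the public `segment_anything` API to turn a single click into an object mask; the checkpoint path, image file, and click coordinates are placeholders rather than values from our system.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint and frame; any official SAM checkpoint works here.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

frame = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# A single user click at pixel (x, y) on the object (label 1 = foreground);
# a bounding box could be passed via the `box` argument instead.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[420, 310]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
object_mask = masks[np.argmax(scores)]   # keep SAM's highest-scoring proposal
```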
B. Fast Detection for Limited Hardware
Off-the-shelf implementations of foundation models like SAM and DINO are not well-suited for real-time onboard detection, segmentation, and tracking. SAM takes several seconds to compute segmentation masks per frame. While we evaluated the recently proposed FastSAM [52] model and obtained a considerable speedup, the detection pipeline was still not fast enough for real-time operation on limited hardware.
Fast detection by (solely) grouping DINO features: To mitigate this compute bottleneck, we instead propose to first obtain coarse detections by grouping DINO features. These coarse detections may further be refined by periodically computing segmentation masks and tracking these over time, effectively rendering the overall system operable at high frame rates. To obtain coarse detections, (i) we extract the pixel-wise descriptors by applying DINO to the input frame, (ii) compute the cosine similarity between each pixel descriptor and the query descriptor(s), and (iii) group the pixels whose similarity exceeds a threshold into a coarse detection mask for the desired object.
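A minimal sketch of this coarse grouping step, assuming per-pixel DINO descriptors for the frame and a query descriptor are available. Thresholding the cosine-similarity map and keeping the largest connected component are our illustrative choices for the "grouping" rule, not necessarily the exact one used in FAn.

```python
import numpy as np
import cv2

def coarse_detect(pixel_feats, query_feat, tau=0.55):
    """pixel_feats: HxWxD per-pixel DINO descriptors; query_feat: D-dim query
    descriptor. Returns a coarse binary mask for the best-matching region, or None."""
    f = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    sim = f @ q                                   # HxW cosine-similarity map
    binary = (sim > tau).astype(np.uint8)         # keep pixels similar to the query
    # Group contiguous high-similarity pixels; keep the largest component.
    n, comp = cv2.connectedComponents(binary)
    if n <= 1:
        return None                               # nothing passed the threshold
    sizes = [(comp == i).sum() for i in range(1, n)]
    return comp == (1 + int(np.argmax(sizes)))
```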
Optimizing DINO runtime: We speed up DINO using two optimization techniques: quantization (reducing numerical precision) and tracing (converting the dynamic computation graph into a static one). See Table I for runtime details of all models used in our system. We report the running time for each model independently, not as part of the whole system. Note that some models automatically reshape inputs to a constant size. We also compare the runtime of our detection phase with the popular Grounded-SAM [53] method in Table II.
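The two optimizations named above can be reproduced with standard PyTorch utilities. The sketch below applies dynamic quantization and tracing to a publicly available DINO backbone loaded via torch.hub; the model variant, input resolution, and quantized layer set are illustrative assumptions, not FAn's exact configuration.

```python
import torch

# Illustrative backbone; FAn's exact DINO variant/config may differ.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits8").eval()

# (1) Dynamic quantization: run Linear layers in int8 to cut compute and memory.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# (2) Tracing: freeze the forward pass into a static graph for faster inference
# (valid only for the fixed input size used during tracing).
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(quantized, example)

with torch.no_grad():
    feats = traced(example)          # optimized inference on the fixed-size input
```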
C. Extracting Per-Pixel Feature Descriptors
While a few methods adapt foundation models like CLIP to provide per-pixel descriptors, these methods [54], [55], [56], [57] require model re-training or finetuning on an image-text aligned dataset. This often results in concepts absent from the finetuning set being forgotten by the model, as demonstrated in ConceptFusion [51]. To counteract this, [51] presents a zero-shot method for constructing pixel-aligned features that combine local (region-level) information with the global (image-level) context encoded in models like CLIP. For efficiency (real-time processing) purposes, we adapt part of this method in our system when using CLIP to provide pixel-wise feature descriptors; however, we only use their ablated baseline, which computes purely local 2D features by extracting a bounding box around each segmentation mask (obtained from SAM) and passing it through the CLIP encoder. For DINO, we use [50] as is, and find that its pixel-wise feature descriptors are inherently informative and more efficient to compute.
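A minimal sketch of the purely local baseline described above: crop the bounding box of each SAM mask, encode the crop with CLIP, and use the resulting embedding as the descriptor for every pixel in that mask. The CLIP variant (`ViT-B/32`), the `clip` package, and the helper name are illustrative assumptions.

```python
import numpy as np
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")   # illustrative CLIP variant

def clip_mask_descriptors(rgb, masks):
    """rgb: HxWx3 uint8 frame; masks: list of HxW bool arrays from SAM.
    Returns one normalized CLIP embedding per mask (shared by all its pixels)."""
    descs = []
    for m in masks:
        ys, xs = np.nonzero(m)
        # Purely local 2D feature: bounding-box crop of the mask region.
        crop = rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        x = preprocess(Image.fromarray(crop)).unsqueeze(0)
        with torch.no_grad():
            e = model.encode_image(x)[0]
        descs.append(e / e.norm())          # region-level CLIP feature for this mask
    return descs
```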
D. Re-Detecting a Lost Object
We offer three re-detection methods for temporary object loss during tracking, catering to different needs. Our system automatically initiates re-detection when needed, and users can choose the level of support before starting the FAn pipeline. The first level relies on the tracker itself to re-detect the object; it is the fastest but least robust option, occasionally leading to false detections of similar objects. The second approach involves human-in-the-loop re-detection, requiring a user to click or draw a bounding box when tracking is lost; this assumes human availability, which is not always possible. To mitigate this, we also propose an automatic re-detection technique.
Automatic re-detection via cross-trajectory stored ViT features: To enable robust and accurate autonomous re-detection of the tracked (lost) object, we provide a mechanism that stores feature descriptors of the tracked object at different stages of the tracking process; these stored features are then used to find the object once it is lost. Specifically, at every frame we store the DINO descriptor representing the currently tracked object. Once the object is lost, we compute DINO descriptors for candidate masks (obtained either from a segmentation model or from the tracker's suggestions) and compare them against the stored descriptors; if a sufficiently high similarity is found, we resume tracking the object, and otherwise we repeat the process on the next frame.
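A minimal sketch of this descriptor-storing mechanism, under the assumption that a DINO descriptor of the tracked object is available at every frame and that candidate masks yield comparable descriptors once the object is lost. The memory size, similarity threshold, and max-over-memory matching rule are illustrative choices.

```python
import numpy as np

class ReDetector:
    """Sketch of the re-detection memory: store normalized DINO descriptors of the
    tracked object over time, then match candidate masks against them when lost."""
    def __init__(self, sim_thresh=0.7, max_store=100):
        self.bank, self.sim_thresh, self.max_store = [], sim_thresh, max_store

    def store(self, obj_descriptor):
        d = obj_descriptor / (np.linalg.norm(obj_descriptor) + 1e-8)
        self.bank.append(d)
        self.bank = self.bank[-self.max_store:]    # keep only the most recent descriptors

    def match(self, candidate_descriptors):
        """Return the index of the candidate best matching the stored object,
        or None if no candidate is similar enough (keep searching next frame)."""
        bank = np.stack(self.bank)
        best_idx, best_sim = None, self.sim_thresh
        for i, c in enumerate(candidate_descriptors):
            c = c / (np.linalg.norm(c) + 1e-8)
            sim = float((bank @ c).max())          # best similarity across the memory
            if sim > best_sim:
                best_idx, best_sim = i, sim
        return best_idx
```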
Experiments
Hardware: We use a quadrotor equipped with an RGB camera (see Fig. 9). The quadrotor is custom-built with a Pixhawk running the PX4 flight control software. The camera data, along with other telemetry, is streamed directly via the “herelink” digital transmission system to a remote ground station computer equipped with an NVIDIA GeForce RTX 2070 and an Intel i7-10750H CPU, running Ubuntu 20.04.5 LTS. The ground station runs the tracking algorithm and sends control commands to the quadrotor via MAVLink. To enable indoor testing, the quadrotor is also equipped with an onboard computer that runs MAVROS and interfaces with an external Vicon motion capture system to obtain position estimates.
Implementation and system details: We outline key details of our system. Run-time improvement. We enhance segmentation/detection performance by compressing SAM and DINO through quantization and tracing, and by using FastSAM. For tracking, we offer support for the fast SiamMask [48] tracker; see Table I for runtime (FPS) details. Flight controller. For versatility, we use PX4, an open-source flight control software stack, to interface with our quadrotor. The MAVSDK Python library is used to send velocity commands for 3D motion and yaw control, streamlining integration with PX4-based drones in future projects. Visual servoing. We mount the onboard camera on the bottom of the quadrotor, facing the ground. At relatively small translational velocities, the first-order approximations of the roll and pitch angles are close to zero. In addition, we fix the drone's altitude and yaw angle. This simplifies the visual servoing task to 2D plane tracking using proportional control. We use a proportional controller based on pixel distances to center the object in the frame and employ a low-pass filter to smooth quadrotor trajectories, ensuring accuracy in challenging scenes. Video streaming. To process frames from an online video stream in real-time, we implemented a low-latency online streamer using the OpenCV library in Python. This streamer continuously reads frames with a parallel thread and maintains a buffer size of 1, ensuring immediate access to the latest frame when needed. Software. We mainly use Torch, cv2, and mavsdk; see our https://github.com/alaamaalouf/FollowAnything project page for full details.
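The low-latency streamer can be sketched as follows, assuming the stream is reachable through OpenCV's VideoCapture; the class name and source argument are placeholders, and since not all capture backends honor the buffer-size hint, a background thread keeps overwriting a single slot so only the newest frame is ever returned.

```python
import threading
import cv2

class LatestFrameStreamer:
    """Sketch of a low-latency streamer: a background thread keeps reading frames
    so the main loop always gets the newest one (effective buffer of size 1)."""
    def __init__(self, source=0):
        self.cap = cv2.VideoCapture(source)
        self.cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)   # ask the backend for a 1-frame buffer
        self.frame, self.lock, self.running = None, threading.Lock(), True
        threading.Thread(target=self._reader, daemon=True).start()

    def _reader(self):
        while self.running:
            ok, frame = self.cap.read()
            if ok:
                with self.lock:
                    self.frame = frame             # overwrite: only the latest is kept

    def read(self):
        with self.lock:
            return None if self.frame is None else self.frame.copy()

    def close(self):
        self.running = False
        self.cap.release()
```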
A. Real-Time Object Following Experiments
We tested (i) our overall system for detecting, tracking, re-detecting, and following RC cars, drones, and bricks in real-time. Here we used the SAM+DINO and DINO-SOLO approaches for the detection task on all of the tested objects; the provided queries are clicks on the desired objects from other pictures (we provide a script for obtaining these click queries). Both approaches worked seamlessly for detecting and tracking the desired objects. (ii) We demonstrate our system for re-detecting an object that gets occluded during tracking. Specifically, during the following experiments, the RC car and the brick pass under a “tunnel” twice, and our re-detection mechanism is able to recover and resume tracking. Fig. 7(a), (b), and (c) show different scenarios during the following. We encourage the reader to view the demos on our https://github.com/alaamaalouf/FollowAnything project webpage and in the https://www.youtube.com/watch?v=6Mgt3EPytrw explainer video. (iii) In addition, we recorded the actual 3D trajectory coordinates of the following quadrotor and the target object to assess the robustness of our tracking system. Specifically, we recorded continuous tracking data for over 4 minutes while following a ground robot. We report the mean Euclidean distance between every point in the target object's trajectory and its closest point on the quadrotor's trajectory.
B. Zero-Shot Detection Experiments
Data: We stored the tracking and detection streams from the SAM+DINO following experiments and used them to test the FAn system and its different variants for zero-shot detection. For each of the tested objects, we picked multiple frames during tracking and detection, showcasing diverse object positions and diverse scenes. In addition, we also test on our private set of whale images.
Comparison: We quantitatively compare the suggested methods and analyze their advantages and disadvantages. We applied each of SAM+DINO, SAM+CLIP, and DINO-SOLO to assess their efficacy in detecting the object within the given data. We report both True Positive and False Positive detection results. Furthermore, we conducted a comparative analysis involving an alternative version of our approach, which consists of two variations. (i) Majority Voting: We assigned each pixel in the mask to its most similar query, and subsequently assigned the mask the label that was most frequently selected across all mask pixels. (ii)
The threshold
C. Mask Quality Experiments
We compare the mask quality of our detection methods (DINO-SOLO, SAM+DINO/CLIP). We use the first video from the Cholec80 dataset [58], which has mask annotations for body parts and tools across frames during surgery. We aim to detect the “grasper” tool and track it across frames. Table V reports (1) the mean intersection over union (mIoU) between the detections and the annotated data across frames, and (2) the true-positive detection percentage of the desired object. We also test how the detected mask quality affects tracking, reporting (3) the mIoU of the desired tracked object after initialization by each of the detection methods. Queries: “body part”, “background”, and “surgery tool”.
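For reference, a minimal sketch of how a per-frame mIoU of this kind can be computed from binary masks; treating frames where both masks are empty as IoU 1.0 is our assumption, not necessarily the convention used in Table V.

```python
import numpy as np

def miou(pred_masks, gt_masks):
    """Mean intersection-over-union across frames; pred_masks and gt_masks are
    same-length lists of HxW boolean arrays (prediction vs. annotation per frame)."""
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else 1.0)   # both empty -> count as perfect
    return float(np.mean(ious))
```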
D. Discussion and Conclusions
SAM+DINO: Figs. 3 and 2 show example results for real-time detection via SAM+DINO. Tables IV and III indicate that the detection achieves a high level of accuracy for cars and whales, and performs well for drones and bricks, but may occasionally miss certain instances. After analyzing the results, it becomes apparent that when SAM generates reliable regions/segmentations, DINO consistently assigns the correct labels to each of these regions, ensuring precise and appropriate object detection. However, in cases where SAM fails to capture these regions accurately (resulting in inadequate segmentations), the object goes undetected. This scenario is exemplified by 4 drone objects and 3 bricks in the dataset, where SAM fails to identify the mask of the drone/brick (see Fig. 11). Regarding the accurate DINO classifications, Fig. 4 offers an explanation: it depicts heatmaps based on cosine similarities between the DINO feature descriptor of each pixel and that of a designated focal-point pixel. The visualizations demonstrate that pixels sharing similar semantic characteristics exhibit a high degree of similarity in their DINO features.
With vs without SAM. Right: SAM creates high-quality segmentation masks compared to DINO-SOLO (not using SAM). Left: SAM might miss important regions in the image.
DINO-SOLO: In Fig. 5 we show several examples showcasing the efficiency of our rapid automated detection system. This approach is significantly faster and performs admirably in detecting the desired objects. Moreover, in many cases where SAM fails to provide a mask for the desired object, DINO-SOLO can still detect it. However, the resulting masks are not of high quality compared to the masks obtained from SAM, which may potentially affect the tracking performance; see Fig. 11 and Table V.
SAM+CLIP: Examples of detection via “text” prompts through SAM+CLIP are shown in Fig. 9. For the tested low-resolution images, SAM+CLIP detections are not as robust: the method yields less precise similarity scores, increasing the likelihood of missed detections, particularly for objects lacking unique shapes, such as the brick. Additionally, in some cases, since the image has low resolution, a small (correct) mask does not contain enough raw information and is thus misclassified. Fig. 9 shows an example of low-resolution images for such scenarios; we further discuss why this happens when using CLIP and not DINO later in the discussion. We note that this method is still beneficial for our system: the main idea is that we need only one accurate detection with high confidence (e.g., obtained by further increasing the similarity threshold) to initialize the tracker, which then maintains the object across subsequent frames.
SAM limitations. With vs. without: SAM might miss important regions in the image. When the desired object lies in these regions, it cannot be detected, and thus DINO+SAM yields fewer true-positive detections compared to DINO-SOLO. On the other hand, DINO+SAM provides high-quality masks once the object is detected, while DINO-SOLO masks are less refined. See Tables V, III, and IV.
Queries: Using multiple queries to annotate other objects that might be in the scene reduces the number of false positives, leading to a more robust and reliable system.
DINO vs CLIP: The method we use to obtain pixel-wise features from DINO [50] is faster and provides better per-pixel descriptors than the method used for CLIP, since it requires a single forward pass over the whole image to compute the per-pixel features. In addition, when using DINO, the method computes per-patch/pixel features while taking into account the full image, as it simply utilizes the patch-wise descriptors (outputs of the query, key, or value matrices in some attention layer of the transformer), thus providing descriptors with richer whole-image context. With CLIP, the method uses SAM to extract masks [51] and then applies CLIP to crops of these masks to extract features for all pixels in each mask; thus, it is less efficient and might yield less meaningful features when applying CLIP to small crops with limited raw data.
The competing methods: We found no improvement with other variants like the majority-voting assignment described above.
Summary: FAn bridges the gap between SOTA computer vision and robotic systems, providing an end-to-end solution for detecting, tracking, and following any object. Its open-set, multimodal, and real-time capabilities, its adaptability to different environments, and its open-source code make it a valuable tool.