Journals & Magazines >IEEE Robotics and Automation ... >Volume: 9 Issue: 4

Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time

Abstract:

Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and s...Show More

Metadata

Abstract:

Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this letter, we present a robotic system to detect, track, and follow any object in real-time. Our approach, dubbed follow anything (FAn), is an open-vocabulary and multimodal model – it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle), and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6–8 GB) graphics card, achieving a throughput of 6–20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source our code on our project webpage. We also encourage the reader to watch our 5-minute explainer video.

Published in: IEEE Robotics and Automation Letters ( Volume: 9, Issue: 4, April 2024)

Page(s): 3283 - 3290

Date of Publication: 14 February 2024

ISSN Information:

DOI: 10.1109/LRA.2024.3366013

Funding Agency:

Contents

SECTION I.

Introduction

Detecting, tracking, and following objects of interest is critical to several robotics use-cases, such as industrial automation, logistics and warehousing, healthcare, and security [1], [2], [3], [4]. Notably, one of the key drivers of continuous progress in providing robust object-following systems is the combination of computer vision and deep learning [5], [6], where training deep convolutional networks on large labeled datasets have made tremendous strides in this area. Specifically, the object following task relies on the video segmentation and tracking task, which can be categorized into distinct subtasks. These include interactive (scribble or click-based) video segmentation [7], where a user draws a box around or clicks on the object to segment and track, mask-guided video segmentation [8], [9], [10], [11], which assumes the presence of a mask to track, and automatic video segmentation [12], [13], [14], [15], [16], which assumes that the user does not interact with the algorithm to obtain the segmentation masks; methods should provide a set of object candidates with no overlapping pixels that span through the video sequence, however, these candidates are not specific, meaning that the segmentation will be applied to all of the seen objects, and not recognize the desired object. Thus, to automatically identify the required object to follow, numerous detection approaches have been suggested [17], [18] such as RCNN and its variants [19], [20], [21], YOLO and its variants [22], [23], [24], and more [25], [26]. However, existing robotic systems for object detection and following suffer two notable shortcomings:

They are closed-set, i.e., the set of objects to detect and follow is assumed to be available a priori (during the training phase). Thus, such systems are only able to handle a fixed set of object categories [2], [27], [28], [29], [30], limiting their adaptability; adapting to newer object categories necessitates finetuning the model.
Additionally, the objects of interest are specified (queried) only by a class label, which is often unintuitive for end-users to specify, imposing restrictions on how users interact with the system [2], [31], [32].

Deep learning is currently undergoing another wave of ever-more performant and robust model design, with the creation of increasingly big and multimodal models trained on internet scale amount data containing billions of images, text, and audio. These highly capable models (e.g., CLIP [33], DINO [34]) have demonstrated impressive performance in open-set scenarios (i.e., the objects of interest are only supplied at inference time, and not trained for a specific task) [36], [37]. Notably, recent robotics approaches using foundation models have shown impressive open-set interaction abilities [38], [39], [40], [41], [42], and extended robustly to multimodal applications [43], [44], [45], [46]. However, integrating these models into real-time resource-constrained robotic systems poses significant challenges, due to their large model size and high inference latency.

A. Our Contributions

We address the pre-discussed gaps by developing an open-set real-time any object following approach, which can flexibly adapt to categories specified at inference time, via multiple modalities including text, images, and clicks. Specifically, we present the follow anything system (FAn):

an open-set, multimodal approach to detect, segment, track, and follow any object in real-time ( $>6$ FPS on a 8 GB GPU). The desired object may be specified via a text prompt, an image, a bounding box, or a click.
a unified system that is easily deployed on a robot platform (in our work, a micro aerial vehicle). The system includes real-time processors for input image streams and visual-servoing control loops for following the object of interest.
built with re-detection mechanisms that account for scenarios where the object of interest is occluded or tracking is lost. This mechanism can function autonomously or with human guidance, ensuring the object is successfully identified and tracked again, maintaining continuity in the tracking process.

We validate our system by autonomously detecting, tracking, and following a multitude of mobile agents including a drone, an RC car, and a manually operated brick.

SECTION II.

Our Approach: FAn

Open-vocabulary object following: Given (1) a robotic system (here, a micro aerial vehicle) equipped with an onboard camera, and (2) an object of interest within the onboard camera's field-of-view (specified either as a text prompt, an image, a bounding box, or a click); the object following task involves detecting the object of interest, and producing robot controls $u_{t}$ at each time step $t$ such that the object of interest is constrained to always completely be within the field of view of the onboard camera. This is an extremely challenging task; it necessitates correctly identifying the object of interest and determining its position relative to the robot's onboard camera frame, all the while accounting for variations in the environment, background clutter, object size, etc. It also then requires the object to be continuously tracked across time; while at the same time, the robot controller needs to output a sequence of stable velocities (or accelerations) and simultaneously ensure the stability of the robot and the visibility of the tracked object.

FAn system overview: FAn uses a combination of state-of-the-art ViT models, optimizes them for real-time performance, and unifies them into a single system. In particular, we leverage the segment anything model (SAM) [35] for segmentation, DINO [34], and CLIP [33] for general-purpose visual features, and design a lightweight detection and semantic segmentation scheme by combining the features from CLIP and DINO with the class-agnostic instance segmentation determined by SAM. We use the (Seg)AOT [9], [47] and SiamMask [48] models for real-time tracking, and design a lightweight visual servoing controller for object following; see Fig. 1 for illustration.

A. Real-Time Open-Vocabulary Object Detection

We first describe our lightweight object detection and segmentation pipeline that builds atop SAM, CLIP, and DINO. Our system takes as input an RGB frame from a video stream, represented by a 3D tensor $F \in \mathbb {R}^{h\times w \times 3}$ , and a query $q$ representing the desired object to detect in the video, (e.g., a text “a blue whale”, an image of a whale, or a click on a whale from another image). The object detection subsystem is tasked to detect the object specified by the user query $q$ in an input image frame. We use Seg to denote the class-agnostic instance segmentation operator (SAM [35] or Mask2Former [49]). Seg takes as an input the current frame $F$ , and outputs a set of $n$ masks $\lbrace M_{1},\ldots, M_{n} \rbrace := \mathtt {Seg}(F)$ ( $n$ depends on the input frame and is not a constant), where each mask $M_{i} \in \mathbb {R}^{h\times w}$ is a binary matrix with ones in the indices of pixels defining the corresponding segmented object, and zeros elsewhere. We also use Desc to denote a feature extractor model; which in our case is either the DINO or CLIP vision transformer (ViT) model. These models extract pixel-wise feature descriptors using techniques described e.g., in [50], [51] and summarized in Section II-C. Desc receives as an input the current frame $F$ , and outputs a descriptor tensor $D:=\mathtt {Desc}(F)\in \mathbb {R}^{h\times w \times d}$ , where for every pixel in $F$ , a descriptor vector of dimension $d$ is constructed. This $d$ dimensional vector encapsulates the semantic information about its corresponding pixel. Additionally, Desc can be also used to provide a feature descriptor $v := \mathtt {Desc}(q)\in \mathbb {R}^{d}$ for the input query $q$ .

Embedding input queries: To detect the desired object referred to by the query $q$ , we start by computing the feature descriptor of the query $q$ : $v := \mathtt {Desc}(q),$ such that $v$ encodes the information in the feature space representing the object described by the query $q$ . Now the system starts receiving frames from the stream, and for every frame $F_{i}$ ( $i:= 1,2,3\cdots$ ), FAn applies the following steps.

First, we compute the (binary) instance segmentation masks by applying $\mathtt {Seg}$ on $F_{i}$ , $\lbrace M_{1},\ldots, M_{n} \rbrace _{i}:= \mathtt {Seg}(F_{i})$ . Intuitively, this step partitions the frame into $n$ objects (regions) and a background, however, none of these objects are classified as labeled/identified objects. Additionally, these regions might intersect. Hence, what is missing, is to predict for each region, whether it is the desired object to track or not. In the case where a set of queries $Q = \lbrace q_{1},\ldots, q_{m} \rbrace$ is given, the goal is to classify which query amongst the $m$ provided matches (if any) best each segment $M_{j}\in \mathtt {Seg}(F_{i})$ . This brings us to the second step.

Second, we extract the pixel-wise descriptors by applying $\mathtt {Desc}$ on $F_{i}$ , $D_{i}:=\mathtt {Desc}(F_{i})\in \mathbb {R}^{h\times w \times d}$ . After this step, $D_{i}$ contains $h\cdot w$ descriptors, where each descriptor corresponds to a pixel in the input image. To compare each region (mask) with the given input query $q$ , we need to aggregate these per-pixel descriptors to form region-level descriptors. We find the average pooling aggregation operator to be fast and effective for this purpose. This not only provides us with a more generic descriptor encapsulating all of the features across the specific mask but also improves the performance of the downstream system modules. Opposed to comparing the $q$ query's feature descriptor $v$ to all of the per-pixel descriptors associated with a specific mask, we only need to compare the aggregate region-level descriptors. Thus, the next step in our pipeline involves computing the mean feature descriptor $v_{j}$ for each segmentation region, i.e., for every $j\in \lbrace 1,\ldots,n \rbrace$ : $v_{j}:= \frac{1}{\mathtt {{non}-zero}({M_{j}})}\sum _{p\in D_{i}[M_{j}]} p$ , where $\mathtt { {non}-zero}(M)$ denotes the number of non-zero entries in binary matrix $M$ , and $D_{i}[M_{j}]$ denotes the set of $d$ dimensional vectors from $D_{i}$ corresponding to the non-zero pixel entries in the mask $M_{j}$ . The vector $v_{j}$ , encodes the semantic information representing the region of the segment $M_{j}$ in the features space. For every region (segment/mask) $j\in \lbrace 1,\ldots,n \rbrace$ , we have its corresponding descriptor $v_{j}$ .

Similarity scores: Given a query (in the form of text, image, or click), we first extract a query feature descriptor $v$ by applying a modality-specific encoder (CLIP for text-query, DINO or CLIP image encoder followed by average pooling for image-query, directly selecting the closest pixel/patch feature for click-query). To match this query to the current image, we compute the cosine similarity between each region descriptor $v_{j}$ , and the query feature descriptor $v$ as $\cos (v_{j},v):=\frac{v_{j}^{T} v}{\Vert v_{j}\Vert \Vert v\Vert + \varepsilon }$ , where $\varepsilon >0$ is a small constant, for numerical stability. This is the fourth step, and it intuitively measures how similar each mask (region) is to the query features descriptor.

Single query detection: If the similarity $\cos (v_{j},v)$ between the given query and the mask feature descriptor is larger than a given threshold $\alpha$ , we assign the region corresponding to this mask in the original frame the label of the query.

Multi-class detection: Should the user provide a set of queries $Q = \lbrace q_{1},\ldots, q_{m} \rbrace$ , the system computes the descriptor $v^{k}:= \mathtt {Desc}(q_{k})$ for every $q_{k}\in Q$ , then, for every pair of query descriptor $v^{k}$ and region descriptor $v_{j}$ , it computes: $\cos (v_{j},v^{k})$ . Now, for every $j\in \lbrace 1,\ldots,n \rbrace$ , it finds its most similar query: $\max _{k\in \lbrace 1,\ldots,m \rbrace } \cos (v_{j},v^{k}).$ Finally, if the cosine similarity between the query vector ( $v^{k}$ ), and the mask descriptor ( $v_{j}$ ), exceeds a threshold $\alpha$ , we assign the label of the query to the region in the original frame corresponding to this mask, otherwise, it is considered “non-labeled”.

After this process, each pixel is assigned a label from $\lbrace 1,\ldots,m \rbrace$ , or 0 if unlabeled. Fig. 2 provides an illustration of the whole detection flow, and Figs. 3 and 10 present results on detecting objects via SAM+DINO, and SAM+CLIP respectively.

Fig. 1.

Follow anything (FAn) is a real-time robotic system to detect, track, and follow objects in an open-vocabulary setting. Objects of interest may be specified using text, images, or clicks. FAn leverages foundation models like CLIP [33], DINO [34], and SAM [35] to compute segmentation masks that best align with the queried objects. These objects are tracked across video frames while accounting for occlusion and object re-emergence; enabling real-time following of objects of interest by a robot platform.

Show All

Fig. 2.

FAn outputs illustrations on input frame of four whales with a click query on a whale and a click query on water. First, SAM extracts multiple masks, then, based on DINO features, FAn classifies each mask to what object it refers from the given queries (water/whales). Finally, whales are detected by assigning the masks whose DINO feature descriptor is closest to the whales' query descriptor. NOTE: Heat maps are shown in the click (query) figures.

Show All

Fig. 3.

Automatic detection experiments (SAM-and-DINO). Examples of our automatic detection scheme for detecting Drones, Bricks, and RC Cars. The examples include (from left to right): the original input frame, the outputs of SAM segmentation masks, and DINO + Cosine similarity semantic segmentation and detection.

Show All

Fig. 4.

Heat maps showing the pixels' semantic similarity. For every pixel, its feature descriptor is extracted then cosine similarity is computed between its descriptor and a focal point pixel descriptor (pointed at by a yellow arrow).

Show All

Fig. 5.

Fast automatic detection experiments (DINO only): Examples of our fast automatic detection scheme on detecting (1) whales, (2) drones, (3) RC cars, and (4) toy bricks. This approach is much faster and works very well for detecting the desired object. However, it provides a less “clean” segmentations/masks.

Show All

Fig. 6.

Automatic re-detection via cross trajectory stored ViT features. (1) At every frame, we store the DINO features representing the tracked object. (2) Once the object is lost, we (3) either apply a segmentation model or get suggested masks from the tracker, for every mask, we compute the DINO descriptors, and (4) compare it to the pre-computed ones. If a high similarity is obtained we keep tracking the object, else, we repeat (3) on the next frame.

Show All

Fig. 7.

Automatic tracking, following, and re-detection. The tracked object is referred to by the yellow arrow, we also show the results of the re-detection mechanism in the last two rows.

Show All

Fig. 8.

Trajectory comparison. We report the mean Euclidean distance between every point in the $x,y$ plane of the quadrotor and its aligned point in the plane (closest point) of the followed object.

Show All

Fig. 9.

Left: Our custom-built quadrotor. Right-up: Successful automatic detection via text queries (SAM+CLIP) on low-resolution images; text queries used from left to right: “a toy car” (single query), “a drone” (single query), and “a whale”+“water” (multiclass). Right bottom: In some cases, the raw data from the cropped masks (to get pixel-wise features from CLIP) does not provide enough information for CLIP---since the image is of low resolution and the mask is small causing it to provide not accurate descriptors and thus FAn may not detect the objects.

Show All

Fig. 10.

Automatic detection experiments via text queries (SAM-and-Clip) on high resolution data.

Show All

Manual queries: We provide the users an option to manually draw bounding boxes (or provide outputs from a customized domain-specific detector) around the objects they wish to track, or alternatively, click on one or two pixels within the object (in real-time from the video stream). After user selection, we use SAM to accurately segment and obtain the object mask. This method ensures precise control over tracking, making it suitable for high-accuracy detection scenarios.

B. Fast Detection for Limited Hardware

Off-the-shelf implementations of foundation models like SAM and DINO are not well-suited for real-time onboard detection, segmentation, and tracking. SAM takes several seconds to compute segmentation masks per frame. While we evaluated the recently proposed FastSAM [52] model and obtained a $15\times$ speedup on our hardware with comparable performance, the best runtime achieved by FastSAM is between 10 and 12 FPS, which is still insufficient for detecting fast-moving objects. This is because segmentation outputs also need to be supplemented by features from ViT models, and the detection submodules.

Fast detection by (solely) grouping DINO features:1 To mitigate this compute bottleneck, we instead propose to first obtain coarse detections by grouping DINO features. These coarse detections may further be refined by periodically computing segmentation masks and tracking these over time, effectively rendering the overall system operable at high frame rates. To obtain coarse detections, (i) we extract the pixel-wise descriptors by applying $\mathtt {Desc}$ (DINO) on the current input frame $F_{i}$ , $D_{i}:=\mathtt {Desc}(F_{i})\in \mathbb {R}^{h\times w \times d},$ (ii) given the inputs set of queries $Q = \lbrace q_{1},\ldots, q_{m}\rbrace$ , the system computes the cosine similarity $\cos (v_{h,w},v^{k})$ for each pair of query $q_{k} \in Q$ (where $v^{k}:= \mathtt {Desc}(q_{k})$ ) and pixel-wise descriptor vector $v_{h,w}$ . Next, as previously, (iii) for each pixel, it picks the closest (most similar) query, i.e., the one with the maximum cosine similarity. Now, (iv) if the cosine similarity between the query vector $v^{k}$ and the pixel feature descriptor $v_{h,w}$ surpasses a specified threshold $\alpha$ , we assign the label of the query to the corresponding pixel in the original frame $F_{i}$ . Otherwise, it is considered as “non-labeled”. Then, (v) we build a binary matrix $B_{i}\in \mathbb {R}^{h\times w}$ with 1 in pixels that are mapped to the desired object (to detect) and 0 elsewhere. Finally (vi) apply the cv2.connectedComponents function on $B_{i}$ . This function receives a binary image ( $B_{i}$ ) where white regions (pixels with label 1) on a black background (pixels with label 0) represent connected components. The function assigns unique integer labels to each connected component and labels background pixels as 0. We have used it since we might detect more than one object, each in a different region of the frame, this function provides us with each object with its unique mask. See Fig. 5 for experiments leveraging the detection module proposed here.

Optimizing DINO runtime: We speed up DINO using two optimization techniques: Quantization (reduces numerical precision) and tracing (converts dynamic graphs into static ones). See Table I for runtime details of all the used models in our system. We report the running time for each model independently, not as part of the whole system. Note that some models automatically reshape inputs to a constant size. We also compare the runtime of our detection phase, with the popular Grounded-SAM [53] method in Table II.

TABLE I Runtime in Frames Per Second (FPS) for All of the Used Models on an NVIDIA GeForce RTX 2070 Onboard a Laptop

TABLE II Runtime in Frames Per Second (FPS) for the Detection Phase of the System on an NVIDIA GeForce RTX 2070, Compared to the Popular Open-Source library [53], Using a More Powerful GPU of NVIDIA RTX 3090

C. Extracting Per-Pixel Feature Descriptors

While a few methods adapt foundation models like CLIP to provide per-pixel descriptors, these methods [54], [55], [56], [57] require model re-training or finetuning on an image-text aligned dataset. This often results in concepts absent in the finetuning set being forgotten by the models as demonstrated in ConceptFusion [51]. To counteract this, [51] presents a zero-shot method for constructing pixel-aligned features that combine local (region-level) data with global (image-level) context included in models like CLIP. For efficiency (real-time processing) purposes, we adapt part of this method in our system when using CLIP for providing pixel-wise feature descriptors, however, we only use their ablated baseline which computes purely local 2D features by extracting a bounding box around each segmentation mask (obtained from SAM) and passes them through the CLIP encoder. For DINO, we use [50] as is, and find that their pixel-wise feature descriptors are inherently informative and more efficient.

D. Re-Detecting a Lost Object

We offer three re-detection methods for temporary object loss during tracking, catering to different needs. Our system automatically initiates re-detection when needed, and users can choose the level of support before starting the FAn pipeline: The first level relies on the tracker to re-detect the object, it's the fastest and less robust, occasionally leading to false detections of similar objects. The second approach involves human-in-the-loop re-detection, requiring a user to click/draw a bounding box when tracking is lost, assuming human availability, which isn't always possible. To mitigate this, we also propose an automatic re-detection technique.

Automatic re-detection via cross trajectory stored ViT features: To enable a robust and accurate autonomous re-detection of the tracked (lost) object, we provide a feature-descriptor storing mechanism for the tracked object in different stages of the tracking process, these stored features, will be used to find the object once lost. Specifically, we suggest the following. Let $\tau > 0$ be an integer. During the tracking, at each iteration $i$ such that $i\;{\rm mod}\;{\tau }= 0$ , define $M^{obj}_{i}$ to be the mask denoting the current tracked object in the frame, we first apply Desc on the current frame $F_{i}$ to obtain $D_{i}:=\mathtt {Desc}(F_{i})\in \mathbb {R}^{h\times w \times d}$ , then we compute the mean descriptor of the current tracked object as: $v^{obj}_{i}:= \frac{1}{\mathtt {{non}-zero}(M^{obj}_{i})}\sum _{p\in D_{i}[{M^{obj}_{i}}]} p$ . This feature represents the tracked object in the $i$ th step. We thus store this descriptor and add it to the set of previously computed descriptors to obtain the set $V^{obj}:= \lbrace v^{obj}_{0}, v^{obj}_\tau,v^{obj}_{2\tau },\ldots,v^{obj}_{i} \rbrace.$ Now, whenever the system loses the tracked object, we apply the following recovery mechanism. The system goes back to the detection stage, with a query feature descriptor at hand as $v = \frac{1}{| V^{obj} |}\sum _{v^{obj} \in V^{obj}} v^{obj},$ seeking the closest region from the segmented frame, and thus re-detecting the object. Here, the segmentation might be given by the segmentation model (e.g., SAM), or by the tracker which tries to re-detect the lost object. Note we use the mean to gain faster performance for real-time applications, however, other techniques can be used to improve the robustness; see Fig. 6.

SECTION III.

Experiments

Hardware: We use a quadrotor equipped with an RGB camera (see Fig. 9). The quadrotor is custom-built with a Pixhawk running PX4 flight control software. The camera data is streamed directly to a remote ground station computer equipped with an NVIDIA GeForce RTX 2070, and Intel i7-10750H CPU, with Ubuntu 20.04.5 LTS, using the “herelink” digital transmission system along with other telemetry data. The ground station runs the tracking algorithm and sends control commands to the quadrotor via Mavlink. To enable indoor testing, the quadrotor is also equipped with an onboard computer that runs MAVROS and interfaces with an external Vicon motion capture system to get the position.

Implementation and system details: We outline key details of our system: Run-time improvement. We enhance segmentation/detection performance by compressing SAM and DINO through quantization and tracing and using FastSAM. For tracking, we offer support for the fast SiamMask [48] tracker; see Table I for runtime (FPS) details. Flight controller. For versatility, we used PX4, open-source flight control software, to interface with our quadrotor. The MAVSDK Python library is used to send velocity commands for 3D motion and yaw control, streamlining integration with PX4-based drones in future projects. Visual servoing. We mount the onboard camera on the bottom of the quadrotor facing the ground. At relatively small translational velocities the first-order approximations of roll and pitch angles are close to zero. In addition, we fixed the drone altitude and yaw angle. This simplifies the visual servoing task to 2D plane tracking using proportional control. We use a proportional controller based on pixel distances to center the object in the frame and employ a lowpass filter to smooth quadrotor trajectories, ensuring accuracy in challenging scenes. Video Streaming. To process frames from an online video stream in real-time, we implemented a low-latency online streamer using the OpenCV library in Python. This streamer continuously reads frames with a parallel thread and maintains a buffer size of 1, ensuring immediate access to the latest frame when needed. Software. We mainly use Torch, cv2, and mavsdk; see our https://github.com/alaamaalouf/FollowAnythingproject page for full details.

A. Real Time Object Following Exprements

We tested (i) our overall system for detecting, tracking, re-detecting, and following: RC cars, drones, and bricks in real-time. Here we used SAM+DINO and DINO-SOLO approaches for the detection task on all of the tested objects - the provided queries are clicks on the desired objects from other pictures (we provide a script for obtaining these click queries). Both approaches worked seamlessly for detecting and tracking the desired objects. (ii) We demonstrate our system for re-detecting an object that gets occluded from the scene during tracking. Specifically, during the following experiments, the RC car and the brick pass under a “tunnel” twice, and our re-detection mechanism is able to recover and resume tracking. Fig. 7(a), (b), and (c) show different scenarios during the following. We encourage the reader to view the demos on our https://github.com/alaamaalouf/FollowAnythingproject webpage and in the https://www.youtube.com/watch?v=6Mgt3EPytrwexplainer video. (iii) In addition, we recorded the actual 3D trajectory coordinates of the following quadrotor and the target object to assess the robustness of our tracking system. Specifically, we recorded continuous tracking data for over 4 minutes while following a ground robot. We report the mean Euclidean distance between every point in the $x,y$ plane of the quadrotor and its aligned point in the plane (closest point) of the followed object. This experiment was conducted 4 times; using PID vs proportional-only as a controller, and using SAM+DINO vs DINO-SOLO as a detector. The results are reported in Fig. 8. We can see that the drone follows the object smoothly and accurately using the different controllers and detectors. We also visualize both trajectories for the case of SAM+DINO.

B. Zero-Shot Detection Exprements

Data: We stored the tracking and detection streams from the SAM+DINO following experiment and used it to test the FAn system and its different variants for zero-shot detection. For each of the tested objects, we picked multiple frames during the tracking and detection showcasing diverse object positions and diverse scenes. Other than that, we also use our private set of whale images to test on.

Comparison: We quantitatively compare the suggested methods and analyze their advantages and disadvantages. We applied each of SAM+DINO, SAM+CLIP, and DINO-SOLO to assess their efficacy in detecting the object within the given data. We report both True Positive and False Positive detection results. Furthermore, we conducted a comparative analysis involving an alternative version of our approach, which consists of two variations. (i) Majority Voting: We assigned each pixel in the mask to its most similar query, and subsequently assigned the mask the label that was most frequently selected across all mask pixels. (ii) $K$ -Means: For each mask, we retained a set of $K>1$ representatives based on the $K$ -means algorithm. We then gauged the similarity of these representatives with the provided queries and assigned the mask a label based on the majority consensus among these $K$ representatives. Our testing encompassed two scenarios: (i) The system was presented with multiple queries representing the environment, including “a robot leg, a box, a ground” (in the whales experiment, these queries were replaced with “water”), along with the desired query “a drone, a toy car, a brick, a whale” (Table IV) and (ii) The system was given a single desired query (Table III).

TABLE III true Positive Detections Divided by the Number of Object Appearances (Provided Next to the Object Name), and Number of False Positive Detections (Number in Brackets If Any); Single Query Test

TABLE IV true Positive Detections Divided by the Number of Object Appearances (Provided Next to the Object Name), and Number of False Positive Detections (Number in Brackets If Any); Multiple Queries Test

The threshold $\alpha$ : For all methods, we tuned $\alpha$ to minimize the false positive detections while achieving a fine true positive detection rate; In our system, it's acceptable to not immediately identify the intended object, but our priority is to prevent the detection and tracking of an incorrect target. We used 0.35 for SAM+DINO and 0.23 for DINO+CLP in all experiments. For DINO-SOLO 0.4 and 0.6 were used in the multiple queries and single query experiments, respectively.

C. Mask Quality Experements

We compare the mask quality of our detection methods (DINO-SOLO, SAM+DINO/CLIP). We use the first video from the Cholec80 dataset [58], which has mask annotation for body parts and tools across frames during surgery. We aimed to detect the “grasper” tool and track it across frames. Table V reports (1) the mean intersection over union (mIoU) of the detection and the annotated data across frames and (2) the true positive detection percentage of the desired object, we also test how the detected mask quality affects the tracking; we report (3) the mIoU of the desired tracked object after each of the detection methods detected it. Queries: “body part”, “background”, and “surgery tool”.

TABLE V mIoU for Tracking and Detection (DINO-SOLO vs SAM+DINO or CLIP)

D. Discussion and Conclusions

SAM+DINO: Figs. 3 and 2 show example results for real-time detection via SAM+DINO. Tables IV and III, indicate that the detection achieves a high level of accuracy for cars and whales, and performs well for drones and bricks - but may occasionally miss certain instances. After analyzing the results, it becomes apparent that when SAM generates reliable regions/segmentation, DINO consistently assigns the correct labels to each of these regions, ensuring precise and appropriate object detection. However, in cases where SAM fails to capture these regions accurately (resulting in inadequate segmentations), the object goes undetected. This scenario is exemplified by 4 drone object in the dataset and 3 bricks, where SAM fails to identify the mask of the drone/brick (see Fig. 11). Regarding accurate DINO classifications, we offer explanations illustrated in Fig. 4. These figures depict heatmaps based on cosine similarity calculations between DINO feature descriptors of each pixel and a designated focal point pixel. The visualizations demonstrate that pixels sharing similar semantic characteristics exhibit a high degree of similarity in their DINO features.

Fig. 11.

With vs without SAM. Right: SAM creates high-quality segmentation masks compared to DINO-SOLO (not using SAM). Left: SAM might miss important regions in the image.

Show All

DINO-SOLO: In Fig. 5 we show several examples showcasing the efficiency of our rapid automated detection system. This approach is significantly faster and performs admirably in detecting the desired objects. Even more, in many cases when SAM misses providing the desired object a mask, using DINO-SOLO can still detect the object. However, the resulting masks are not of high quality compared to the masks obtained from SAM, and this may potentially affect the tracking performance; see Fig. 11 and Table V.

SAM+CLIP: Examples for detection via “text” prompts through SAM+CLIP are shown in Fig. 9. For the tested low-resolution images, SAM+CLIP detections are not as robust, the method yields less precise similarity scores, increasing the likelihood of missed detections, particularly for objects lacking unique shapes like the brick. Additionally, in some cases, as the image has low resolution, if the object has a small (correct) mask, it does not present enough raw information and is thus misclassified. Fig. 9 shows an example of low-resolution images for such scenarios; we further discuss why this happens when using CLIP and not DINO in our discussion later. We note that this method is still beneficial for our system, the main idea is that we need one accurate detection with high confidence (e.g., with further increasing $\alpha$ ) for the desired object and then we can start the object-following scheme, thus, we can still benefit from the multimodality of the system. Additionally, as this method requires only the text prompt and not an image/clicks, it is much easier to utilize. To verify our claims regarding the reason for the dropped performance of FAn when using SAM+CLIP, we tested it on high-resolution images. Here, the reasoning and detection are robust justifying our claims. We conduct the following 4 tests: (i) Standard detection, e.g., “detect a whale”, (ii) scene reasoning-based detection, e.g., “detect the boy holding the ball”. (ii) Special attribute-based detection, like, “detect the white dog”. (iv) Special prior knowledge-based detection. In this case, the system should have prior knowledge of a specific object like its name/nickname. For example “detect Messi/Cristiano Ronaldo”. (v) Special prior knowledge& attribute based detection. e.g., “detect a Real Madrid player”. See result in Fig. 10.

SAM limitations. With vs without: SAM might miss important regions in the image. When the desired object is in these regions it will be impossible to detect it and thus DINO+SAM yields fewer true positive detections compared to DINO-SOLO. On the other hand, DINO+SAM provides high-quality masks once the object is detected while DINO-SOLO masks are less refined. See Tables V, III, and IV.

Queries: Using multiple queries to annotate other objects that might be in the scene reduces the number of False positives leading to a more robust and reliable system.

DINO vs CLIP: The method we are using to obtain pixel-wise features from DINO [50] is faster and provides better descriptors for every pixel compared to the method used for CLIP. This is because it requires one forward pass on the whole image to compute the per-pixel features. In addition, when using DINO, the method computes the per patch/pixel features while taking into count the full image, as it simply utilizes the patch-wise descriptors (outputs of the query, key, or value matrix in some attention layer of the transformer) of DINO, thus providing descriptors with richer context of the whole image. In CLIP, the method uses SAM to extract masks [51] and then applies CLIP on crops of these masks to extract features for all pixels in this mask, thus, it is less efficient and might yield less meaningful features when applying CLIP on small crops with limited raw data.

The competing methods: We found no improvement with other variants like $K$ -means and majority voting; often, our original methods performed better. Also, the $K$ -means variant runs at 0.03 FPS, and the majority voting runs at 0.32 FPS.

Summary: FAn bridges the gap between SOTA computer vision and robotic systems, providing an end-to-end solution for detecting, tracking, and following any object. Its open set, multimodal, real-time capabilities, adaptability to different environments, and open-source code make it a valuable tool.

References is not available for this document.

Follow Anything: Open-Set Detection, Tracking, and Following in Real-Time

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency: