Introduction
Advancements in artificial intelligence (AI) and robotics are leading to increased deployment of robots in challenging environments. While current robots are frequently teleoperated and require close human supervision, there is a drive towards greater robotic autonomy, enabling robots to work for longer periods without supervision.
When an autonomous robot encounters exceptions or challenges in real-world deployment, it is common for the robot to request intervention from a human supervisor, shifting from full autonomy to partial autonomy or teleoperation, i.e., intermittent supervision. To facilitate effective intermittent supervision, a robot needs to inform its human operator ‘what has been happening’ while it has been operating autonomously. Current approaches often require the human operator to remain attentive throughout the deployment [1] or to manually examine raw data collected by the robot, such as video recordings from onboard cameras or robot status logs, which is time-consuming, incurs high cognitive load, and requires expert knowledge to interpret [2]. We are therefore motivated to investigate how data collected during autonomous robot operation can be distilled to facilitate efficient human supervision.
Recent advancements in Large Language Models (LLMs) for robotics have paved the way for more natural human-robot interaction. However, most existing research focuses on mapping natural language commands to a robot's plan and actions [3], or generating conversational responses in social interactions [4]. Using LLMs to generate robot-to-human communication that aids human comprehension of robot status and intervention in functional tasks remains largely unexplored, with only a single study [5] exploring the use of a custom-trained LLM for generic and query-driven robot action summarisation.
We propose to employ Foundation Models to summarise robot data and improve intermittent supervision in HRI, as shown in Fig. 1. In particular, we propose to use pre-trained Video Foundation Models (ViFMs) to summarise long (40 min) egocentric robot videos, assisting operators in rapidly identifying object occurrences and temporal events in a multi-robot search-and-rescue scenario [1].
Diagram of the proposed robot summary generation system. The system generates generic or query-driven summaries of long egocentric robot videos in the form of storyboards, short videos, or text to help a user review the robots' autonomous history.
There are two main challenges in adopting ViFMs for generating robot summaries for humans:
Summaries generated by ViFMs may not capture relevant information for a specific task. A user may need to explicitly query the model for targeted answers.
The generated summary may be inaccurate. The modality in which the summary is presented may influence a user's ability to verify the summary and calibrate their trust towards the information provided.
To investigate ViFMs for robot summary generation, we develop a framework that provides a user with either generic or query-driven summaries of robot videos in the following output modalities: a storyboard of key images, a short summary video, and a text summary. We evaluate the proposed framework in a user study (n = 30) to compare participants' task performance and subjective experience using these different summary generation methods.
To the best of our knowledge, this is the first work that incorporates ViFMs in a zero-shot setting to generate summary communication in multiple natural modalities from a robot to a human for intermittent supervision. Our work illustrates the benefits and limitations of such models, as well as the transferability of ViFMs trained on general data to a specific HRI task scenario.
Related Work
Video Summarisation: A video consists of frames [6] displayed at a specific frequency to create the perception of motion [7]. Initial works in video summarisation considered video data without connection to natural language. There are two main approaches to video summarisation: static and dynamic [6]. Static (key-frame) summarisation selects representative frames to create a storyboard, while dynamic summarisation (video skimming) identifies and chronologically arranges key video segments [6]. SumMe [8] and TVSum [9] are two benchmark datasets commonly used in video summarisation, both providing human-annotated frame importance scores.
Transformer-based video summarisation methods [10], [11] excel at extracting global dependencies and multi-hop relationships between video frames. The spatiotemporal vision transformer (STVT) model [11] achieved SOTA performance on the SumMe and TVSum datasets by training on both inter-frame and intra-frame information. We will refer to this line of work as ‘generic video summarisation’.
In contrast to generic summarisation, Sharghi et al. [12] introduce ‘query-focused video summarisation’, where user preferences are incorporated into the summarisation process in the form of text queries. However, these queries were limited in scope and vocabulary. With the advancements in LLMs and, later, multi-modal foundation models (of which ViFMs are a subset), open-ended natural language queries became possible, enabling more sophisticated and contextually relevant video summarisation. We refer to these summaries as ‘query-driven’ summaries.
ViFMs for Video Summarisation: ViFMs learn a general-purpose representation for various video understanding tasks, by leveraging large-scale datasets. A detailed survey of ViFMs can be found in [13]. These models are typically evaluated on diverse benchmark datasets, e.g., MVBench [14].
Video summarisation is not a common task for ViFMs. Therefore, we explore video understanding tasks capable of generating video skims, storyboards, or textual outputs based on a video and a natural language query.
Madan et al. [13] identify video content understanding and descriptive understanding among video understanding tasks. Video content understanding tasks can entail: abstract understanding (e.g., classification, retrieval), temporal understanding (e.g., action localization), and spatio-temporal understanding (e.g., object tracking, segmentation). Descriptive understanding tasks, such as VQA and captioning, focus on generating a textual description of the video content.
In terms of generating summaries as visual outputs, retrieval, which involves finding video segments containing specific actions, objects, or scenes, is most closely related to our work. LanguageBind [15] achieves SOTA performance in zero-shot video-text retrieval, using language as a binding modality to establish semantic connections between textual descriptions and corresponding visual content.
While both tasks under descriptive video understanding can generate textual summaries, models that excel in VQA are preferred for their ability to provide summaries for subjective, creative, and logical questions. VideoChat2 [14] excelled in challenging temporal video understanding tasks on the MVBench [14] benchmark, outperforming SOTA ViFMs by over 15%.
An important sub-field of video summarisation is egocentric summarisation, reviewed by Plizzari et al. [16]. This survey emphasises the importance of egocentric video question answering (VQA) as a key enabler for a wide range of assistive applications centred on HRI. It further identifies key open challenges in egocentric VQA, including the use of unrestricted query concepts to reflect real-life scenarios, application to outdoor environments, extension to modalities beyond text, handling long videos, the limited capability of vision-language models to adapt to egocentric data, and evaluation in interaction with users. Our study addresses these key challenges.
Foundation Models for Robot Communication: Firoozi et al. [17] provide an extensive survey on how foundation models improve robot capabilities. However, the use of foundation models for robots communicating with humans for intermittent supervision remains a crucial yet underexplored aspect. Tellex et al. [18] reviewed the use of natural language in robotics, finding that most research focuses on robots following human instructions [3]. There is limited research on robot-to-human communication or two-way communication, which our study addresses. Specifically, we investigate unprompted generic summarisation (robot-to-human communication) and interactive ‘query-driven summarisation’ (two-way communication).
Continual and lifelong robot learning, enabling robots to continuously learn and adapt to new information and environments with minimal human intervention, is another research direction leveraging foundation models [5], [19], [20], [21]. The emphasis is on efficient representation and retrieval of long-duration data. For example, Dechant et al. [5] use an LLM for generic and query-driven robot action summarisation, training a single model to both summarise and answer questions. Our work differs from this line of work by proposing a general framework that allows the integration of any existing ViFM without modifying model architectures or dataset-specific fine-tuning. While prior work commonly uses a static validation set, we emphasise human-in-the-loop evaluation and include visual modalities in addition to text.
Framework Design
There are two main challenges in adopting ViFMs for generating robot summaries for humans. First, how should a summary be generated? A generic summary generation process may require less user effort, but the generic summary may not capture relevant information for a specific task. On the other hand, explicit user queries may provide more relevant summaries, but may require additional user expertise, time and effort. Second, how should the summarised information be presented to the user? The presentation modality may influence the user's ability to verify the information and appropriately calibrate their trust in the summary.
We developed a framework to provide users with generic or query-driven summaries using ViFMs, in the form of a short summary video (video skim), V; a storyboard of key images, S; or a language-based text summary, T. The proposed framework is shown in Fig. 2.
Proposed framework for generating generic and query-driven summaries in the form of storyboard, video or text. There are 4 main steps:
A. Generic Summary Pipeline
A generic summary is generated without considering any input from the user. In the preprocessing step, the original long video is read at its original frame rate. Each frame is then encoded with ResNet18 [22] and fed into the relevance generation process to produce frame-level importance scores. Frame-level importance is generated by STVT [11]. By combining temporal inter-frame correlations and spatial intra-frame attention, STVT captures important temporal and spatial content. The model was trained to learn frame importance from human-created summaries on the TVSum dataset [9]. While the videos used to train STVT range from
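A minimal sketch of the frame-encoding step above is given below, extracting per-frame ResNet18 features with torchvision; the 224x224 resize, ImageNet normalisation, and pooled 512-d output are illustrative assumptions rather than the exact settings of our implementation.

```python
# Sketch of per-frame feature extraction with ResNet18 (torchvision).
# The resize/normalisation values are standard ImageNet defaults, used here
# for illustration only.
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier -> 512-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_frames(frames):
    """frames: list of HxWx3 uint8 RGB arrays -> (N, 512) feature tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)
```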
During the selection process, important frames are selected to create a generic storyboard while important segments are selected to create a generic video skim. To generate the storyboard, a greedy approach is followed, where the frames selected (
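The greedy selection can be sketched as follows, using a hypothetical criterion for illustration: frames are taken in order of decreasing importance while a minimum temporal separation keeps the storyboard diverse.

```python
def greedy_storyboard(importance, k, min_gap):
    """Select up to k frame indices greedily by importance.

    importance: per-frame importance scores (list of floats).
    min_gap: minimum index separation between selected frames, used here as a
    simple diversity constraint (a hypothetical choice for illustration).
    """
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    selected = []
    for idx in order:
        if all(abs(idx - s) >= min_gap for s in selected):
            selected.append(idx)
        if len(selected) == k:
            break
    return sorted(selected)
```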
To generate a summary video, first, the frame-level sequence is transformed to a segment-level sequence using Kernel-based Temporal Segmentation (KTS) [23]. We use a cosine kernel change point detection algorithm, where the number of change points is not specified in the optimization algorithm. KTS receives frame level features, generated in the same way as for the storyboard modality. We constrain each selected segment to a minimum duration of
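Although our implementation uses KTS [23], a similar cosine-kernel change-point segmentation can be sketched with the open-source ruptures library as a stand-in; the penalty value and minimum segment size below are illustrative assumptions.

```python
# Cosine-kernel change-point detection over frame features, using ruptures as
# a stand-in for KTS [23]. Penalty and min_size are illustrative assumptions.
import numpy as np
import ruptures as rpt

def segment_video(frame_features: np.ndarray, min_size: int, penalty: float):
    """frame_features: (N, d) array -> list of (start, end) frame index pairs."""
    algo = rpt.KernelCPD(kernel="cosine", min_size=min_size).fit(frame_features)
    change_points = algo.predict(pen=penalty)  # number of segments not fixed a priori
    starts = [0] + change_points[:-1]
    return list(zip(starts, change_points))
```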
A generic text description is produced by VideoChat2 [14] based on the generic video summary. The model receives a generic query: it is given context about the robot's environment and task, and asked to describe the video in detail. Implementation details and parameter values are given under the G column in Table I.
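A hypothetical example of such a generic query (not the exact prompt used in the study) is shown below.

```python
# Hypothetical example of the generic query given to VideoChat2; the exact
# wording used in the study is not reproduced here.
generic_query = (
    "The following video was recorded by the front camera of an autonomous "
    "robot exploring an underground environment during a search-and-rescue "
    "mission. Describe in detail what the robot encountered, including any "
    "notable objects and events."
)
```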
B. Query-Driven Summary Pipeline
The main difference between the generic summary and the query-driven summary is the additional input of a user query, provided as text. Users are free to query however they wish to understand what the robot has been doing in their absence, going beyond the limited set of query options adopted in previous research [24].
In the preprocessing step, the long video is either broken down into frames (for storyboard) or segments (for video and text). The original video can be read at its original frame rate or a lower frame rate, depending on the availability of computational resources.
Both the text-based user query and the frames/segments are embedded using LanguageBind [15]. To generate relevance scores, the dot product between the query embedding and frame/segment-level embedding is used. It is important that both the query embedding and frame/segment embedding are in the same embedding space for the dot product to generate a valid similarity score.
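A minimal sketch of this relevance-scoring step is given below, operating on embeddings assumed to come from the LanguageBind text and video encoders; the L2 normalisation is an illustrative choice that makes the dot product behave like a cosine similarity.

```python
import numpy as np

def relevance_scores(query_emb: np.ndarray, segment_embs: np.ndarray) -> np.ndarray:
    """query_emb: (d,) query embedding; segment_embs: (n, d) frame/segment
    embeddings, both assumed to lie in the same (LanguageBind) embedding space.
    Returns an (n,) vector of relevance scores."""
    # L2-normalise so the dot product behaves like cosine similarity
    # (the normalisation is an illustrative assumption; the key requirement is
    # a shared embedding space).
    q = query_emb / np.linalg.norm(query_emb)
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    return s @ q
```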
Selecting the frames (
The top
C. User Study Design
With this framework in place to generate summaries to inform a user about what the robot has been doing in their absence, we designed a user study to evaluate whether users could benefit from such a summarisation framework. We consider a search and rescue mission undertaken by a fleet of robots in a complex underground environment (tunnel, urban, and cave) where they explore the previously unknown environment, searching for specific objects [1]. After extended autonomous exploration, a human operator is asked to identify key objects and events encountered by the robots during the operator's absence by either reviewing the raw video data, or summaries of the data generated by the proposed framework.
The user study is developed as a mixed-model design to investigate the effects of using summaries by ViFMs on task performance and usability. The user study was approved by the CSIRO research ethics committee. A total of
(left): Study procedure and (sub-)tasks. (right): Sample user study interface for query-driven storyboard summary. When the user enters a query into the ‘User Query’ box, selected images are displayed below the box based on the raw video given at bottom left. Questions are given on the right side.
Participants were randomly assigned to either the generic summarisation pipeline
The control task, C, consisted of only the raw video without any AI-generated summary. Users were allowed to watch the long raw video at the original or
The questions were designed with varying difficulty to assess the user's ability to identify relevant information in the video content with or without the aid of different types of summaries. Each task contains two sub-tasks with different types of questions: object-related and event-related, i.e., question types
We examine the following hypotheses in this user study:
H1: Distilling data improves: H1a accuracy, and H1b time, for retrieving relevant information as opposed to providing raw data (task score of G/Q > C and task time of G/Q < C).
H2: The magnitude of the improvement in: H2a accuracy, and H2b time, depends on the summary type (G/Q) and modality (V/S/T).
D. Implementation
1) Dataset
The videos (RGB data stream) used for the study were collected from a fleet of robots exploring tunnels and urban undergrounds while completing a search and rescue mission [1]. Front-facing egocentric videos of two robots are used: the quadruped robot Spot and the All-terrain Robot (ATR), shown in Fig. 4 (left). Each video is 40 min long and is read at 15 fps. An example frame from the videos is shown in Fig. 4 (right).
2) Intermittent Supervision Tasks
Each task (C, V, S, T) includes
The
3) Computational Considerations
Generic summaries were pre-generated, while query-driven summaries were generated in real time with the backend models running on an H100 GPU. As such, the computational requirements of the two pipelines were different. The models chosen under each pipeline also contribute to the difference in computational requirements. None of the models used for the implementation were fine-tuned on the test data.
In the ‘preprocessing’ stage, the Q pipeline used a lower frame rate to manage memory requirements when running in real time, as both ViFMs (LanguageBind and VideoChat2) are loaded into memory simultaneously.
4) Equalising Information Across Conditions and Modalities
To ensure that a similar amount of information is provided across conditions, the duration of video and text information, and the number of storyboard images was equalised. The threshold
Second, the information content across modalities should be balanced. Balancing information density between video and storyboard proved challenging: while video offers detailed information, it can be overwhelming; a storyboard, though less detailed, provides a concise overview. We prioritised relevance in video selection (top-k/knapsack algorithms; see the sketch below), whereas for the storyboard we employed a greedy approach to ensure diversity. To generate the text summary, in both pipelines VideoChat2 [14] receives the already-processed summary video, from which only 100 frames are considered due to resource limitations.
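As a sketch of the knapsack-based video selection, segments can be chosen to maximise total relevance under a total summary-duration budget; the one-second discretisation below is an illustrative assumption.

```python
def knapsack_select(segments, budget_s):
    """0/1 knapsack over video segments.

    segments: list of (relevance, duration_s) tuples; budget_s: total summary
    duration budget in seconds. Returns indices of the selected segments.
    Durations are discretised to whole seconds for the DP table.
    """
    n = len(segments)
    W = int(budget_s)
    dp = [[0.0] * (W + 1) for _ in range(n + 1)]
    for i, (value, dur) in enumerate(segments, start=1):
        w = max(1, int(round(dur)))
        for cap in range(W + 1):
            dp[i][cap] = dp[i - 1][cap]
            if w <= cap:
                dp[i][cap] = max(dp[i][cap], dp[i - 1][cap - w] + value)
    # Backtrack to recover the chosen segment indices.
    chosen, cap = [], W
    for i in range(n, 0, -1):
        if dp[i][cap] != dp[i - 1][cap]:
            chosen.append(i - 1)
            cap -= max(1, int(round(segments[i - 1][1])))
    return sorted(chosen)

# Example: knapsack_select([(0.9, 12.0), (0.4, 30.0), (0.7, 8.0)], budget_s=20)
# returns [0, 2], filling the 20 s budget with the two most relevant segments.
```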
Results Analysis and Discussion
The analysis investigates the two hypotheses: H1 (a summary brings benefits) and H2 (summary type and modality matter). The first considers the predictor/independent variable ‘summary type’ (G, Q, ‘No Summary’ = C), while the second excludes the C data to compare modalities in the GvsQ conditions. Where relevant, question type
When fitting models to explain the relationship between dependent and independent variables, the fitting formula is iteratively refined by initially including all possible interactions (two-way, and three-way where relevant), then progressively eliminating insignificant ones until only statistically significant interactions remain. We first tested whether presentation order affected the outcomes; no significant effects were detected for either outcome. We run
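This model-refinement procedure can be illustrated with statsmodels on synthetic placeholder data; the column names and values below are hypothetical and not the study data.

```python
# Illustration of iterative model refinement with statsmodels; the data are
# random placeholders, not the study data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "summary_type":  rng.choice(["C", "G", "Q"], size=n),
    "question_type": rng.choice(["object", "event"], size=n),
    "correct":       rng.integers(0, 2, size=n),  # binary task score
})

# Start with all two-way interactions ...
full = smf.logit("correct ~ C(summary_type) * C(question_type)", data=df).fit(disp=0)
# (in practice, full.pvalues is inspected to decide which terms to drop)
# ... then drop non-significant interactions and refit the reduced model.
reduced = smf.logit("correct ~ C(summary_type) + C(question_type)", data=df).fit(disp=0)
print(reduced.summary())
```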
Importance of Distilling Information: We first analyse (H1): distilling information improves accuracy (H1a) and efficiency (H1b) as opposed to providing raw data. The baseline (intercept) for s is
a) Accuracy: On average, users answered 43.3%, 36.3%, and 56.7% of questions correctly for the C, G, and Q summary types, respectively.
To test Hypothesis H1a, we ran a logistic regression with ‘summary type’ and ‘question type’ predictors. Both summary type Q (
These findings suggest that only query-based information distillation enhances user accuracy, partially supporting the hypothesis H1a: score of
b) Efficiency: To test Hypothesis H1b, we ran an ordinary least squares (OLS) regression on time with ‘summary type’ as the predictor. Summary type Q increased the predicted users' time (
Importance of GvsQ and Modality: We next test the hypotheses H2: the magnitude of the improvement in accuracy (H2a) and time (H2b) depends on the summary type GvsQ and presentation modality (V/S/T). The baseline for s is
c) Accuracy: To test Hypothesis H2a, we ran a logistic regression with ‘summary type’, ‘question type’ and ‘modality’ as predictors. Summary type Q (
These results support the hypothesis H2a, emphasising that allowing users to actively shape the information they receive leads to better accuracy than merely providing them generic summaries.
The choice of modality can significantly influence task accuracy, with S potentially offering advantages over both V and T in certain contexts (type of question matters - see below). A possible cause of the significantly decreased accuracy in T may be the lack of visual information for the user to verify the summary.
The interaction effects between modality and question type revealed complex patterns. While storyboards generally improved performance, they were less effective for event-related questions compared to object-related questions: unlike a video, static images cannot show the unfolding of an event.
d) Efficiency: To test Hypothesis H2b, we ran an OLS regression on time with ‘summary type’ and ‘modality’ as predictors. Summary type Q increased the predicted users' time (
Regarding interaction modality, the results indicate that storyboards and text may offer a more efficient means of information acquisition than video summaries which require sequential viewing. It is only with S that the accuracy and time are both improved: with V and T, we observed a trade-off between accuracy and time.
Recall that the summaries were generated live in the Q condition. The model generation time was included in our measurement of task time. On average, generating a summary takes 15.5 s, 9.43 s, and 23.2 s, with standard deviations of 0.246 s, 0.298 s, and 0.724 s, for the V, S, and T modalities respectively. Generating the text summaries takes the longest, as they require the summary video to be generated first. This is followed by the video summary, as the segments need to be concatenated to create a shorter video, whereas the storyboard can be shown right after the selection process. Even though it takes longer to generate T summaries, they are still more time-efficient than V in total task time.
Cause of Errors and Task Time: We analysed the source of errors by comparing the model outputs, user answers, and ground truth answers. The models demonstrated an average 63.0% accuracy rate. Without any AI assistance, users demonstrated an overall accuracy of 43.3%. With assistance, the overall task accuracy of the users was 46.5%, but varied across summarisation type and modality (see Table II). Note that the quality of the model response is impacted by the user query, and that in generating their answer, users can rely solely on the model's recommendations, or refer to the raw video, leading to a complex interplay between user queries, model responses, and user interpretations.1
We also performed an analysis of task time by adjusting for the time to generate query summaries. This resulted in the impact of summary type Q on time being non-significant under both H1b and H2b. However, we do not exclude the model generation time in our regression analysis reported in Table II, as users had the opportunity to read the questions or watch the raw video while the query-based models were generating summaries, thus excluding generation time would provide an unfair advantage to the query-based models.
Post-hoc Analysis:
1) User Confidence
While not directly related to the initial hypotheses, another variable of interest is the confidence of the users. Confidence was rated on a
The data was organised in the same way as the hypothesis testing. Summary type G (
2) Usability
Users gave a usability score for each task by answering the standard SUS questionnaire [25], on which we ran an ordinal logistic regression. We detected no significant effects of summary type or modality on the usability score.
3) Familiarity
The pre-study survey consisted of 8 custom questions to capture users' familiarity with robots and video summarisation models, and their trust in such models. A logistic regression was run on task performance (s) with the individual familiarity questions as predictors. To keep the model simple, only one-way interactions were considered. We did not find familiarity to have a significant effect on the score.
4) Question-Level Analysis
We further analysed the questions with the highest and lowest scores. The lowest score for V was counting the blue barrels, where
The question with the lowest score for the S task was ‘Did you knock down a door while trying to pass through it?’. Even though an image of the fallen door was included in the 24 images provided under G summary,
5) Post-Study Survey Answers
During the post-study survey, the users were asked if the provided summaries were helpful in their task completion. Participants in the Q group unanimously reported that AI tools helped them, compared to
Main Findings: The study demonstrates that query-driven summaries significantly improve retrieval accuracy compared to raw data; however, this advantage comes with the trade-off of increased time spent on the task. Among the modalities, storyboards are particularly effective, improving accuracy and reducing time compared to video and text summaries. However, storyboards are less effective for event-related questions, where other modalities might be more suitable. Our work demonstrates ViFMs as a promising tool for intermittent robot supervision via query-driven summaries. Further, our findings on participants' performance difference between object and event questions highlight the room for improvement in ViFMs' capabilities in object localization and counting.
Limitations: In this study, only a subset of SOTA ViFMs was tested in a context-specific search-and-rescue scenario. New models are continuously being developed that can achieve higher accuracies on benchmark datasets. However, our user study evaluation highlights that the success of ViFMs for intermittent robot supervision in a human-robot team is influenced by a complex interaction between environment, user, and model. Improving model accuracy alone is likely insufficient. Errors may arise due to the quality of the query, the quality of model responses, the interpretation of the model response, or the inherent difficulty of the task or environment. Thus, further investigation is required to understand how to adopt ViFMs effectively to support intermittent robot supervision.
Our study allowed users to enter their queries free-form. Providing fixed templates that generate structured prompts to the model could potentially improve accuracy, but may be less suited for gathering new information unknown at the time these templates were designed.
In our study, summarising across different modalities had the potential for information imbalance. Despite the potential imbalance of information between video and storyboard, our results suggest that a well-curated storyboard can be more effective than a higher-information content summary video.
Conclusion and Future Work
We investigated the efficacy of ViFMs in generating robot summaries for intermittent supervision. Query-driven summaries significantly improved task accuracy compared to generic summaries and raw video, albeit at the cost of increased task completion time. The modality of summary presentation also played a crucial role, with storyboards potentially offering a balance between accuracy and efficiency compared to video and text formats. The lack of correlation between user familiarity and performance suggests that well-designed summaries can effectively bridge knowledge gaps for users with varying levels of expertise.
While the study demonstrates the potential of query-driven summaries by ViFMs for improving robot-to-human and two-way communication, it also exposes critical constraints: model limitations, particularly in detecting less-prominent environmental features; modality dependencies, particularly for events; and the time-accuracy trade-off of interactive summarisation. By addressing key future research questions, such as understanding and modelling implicit human knowledge structures, enhancing human interpretation of robot perception through simultaneous multi-modal summaries, and creating computationally efficient yet semantically rich summarisation techniques, we can progress towards more adaptive and intelligent human-robot communication.
ACKNOWLEDGMENT
The authors would like to thank Pavan Sikka, CSIRO, for his valuable insights and contributions to this work.