Introduction
Advancements in artificial intelligence (AI) and robotics are leading to increased deployment of robots in challenging environments. While current robots are frequently teleoperated and require close human supervision, there is a drive towards greater robotic autonomy, enabling robots to work for longer periods without supervision.
When an autonomous robot encounters exceptions or challenges in real-world deployment, it is common for the robot to request intervention from a human supervisor, shifting from full autonomy to partial autonomy or teleoperation, i.e., intermittent supervision. To facilitate effective intermittent supervision, a robot needs to inform its human operator ‘what has been happening’ while it has been operating autonomously. Current approaches often require the human operator to remain attentive throughout the deployment [1] or to manually examine raw data collected by the robot, such as video recordings from onboard cameras or robot status logs, which is time-consuming, incurs high cognitive load, and requires expert knowledge to interpret [2]. We are therefore motivated to investigate how data collected during autonomous robot operation can be distilled to facilitate efficient human supervision.
Recent advancements in Large Language Models (LLMs) for robotics have paved the way for more natural human-robot interaction. However, most existing research focuses on mapping natural language commands to a robot's plan and actions [3], or generating conversational responses in social interactions [4]. Using LLMs to generate robot-to-human communication that aids human comprehension of robot status and intervention in functional tasks remains largely unexplored, with only a single study [5] exploring the use of a custom-trained LLM for generic and query-driven robot action summarisation.
We propose to employ Foundation Models to summarise robot data and improve intermittent supervision in HRI, as shown in Fig. 1. In particular, we propose to use pre-trained Video Foundation Models (ViFMs) to summarise long (40 min) egocentric robot videos, assisting operators in rapidly identifying object occurrences and temporal events in a multi-robot search-and-rescue scenario [1].
Diagram of the proposed robot summary generation system. The system generates generic or query-driven summaries of long egocentric robot videos in the form of storyboards, short videos, or text to help a user review the robots' autonomous history.
There are two main challenges in adopting ViFMs for generating robot summaries for humans:
Summaries generated by ViFMs may not capture relevant information for a specific task. A user may need to explicitly query the model for targeted answers.
The generated summary may be inaccurate. The modality in which the summary is presented may influence a user's ability to verify the summary and calibrate their trust towards the information provided.
To investigate ViFMs for robot summary generation, we develop a framework that provides a user with either generic or query-driven summaries of robot videos in the following output modalities: a storyboard of key images, a short summary video, and a text summary. We evaluate the proposed framework in a user study (n = 30) to compare participants' task performance and subjective experience using these different summary generation methods.
To the best of our knowledge, this is the first work that incorporates ViFMs in a zero-shot setting to generate summary communication in multiple natural modalities from a robot to a human for intermittent supervision. Our work illustrates the benefits and limitations of such models, as well as the transferability of ViFMs trained on general data to a specific HRI task scenario.
Related Work
Video Summarisation: A video consists of frames [6] displayed at a specific frequency to create the perception of motion [7]. Initial works in video summarisation considered video data without connection to natural language. There are two main approaches to video summarisation: static and dynamic [6]. Static (key-frame) summarisation selects representative frames to create a storyboard, while dynamic summarisation (video skimming) identifies and chronologically arranges key video segments [6]. SumMe [8] and TVSum [9] are two benchmark datasets commonly used in video summarisation, both providing human-annotated frame importance scores.
Transformer-based video summarisation methods [10], [11] excel at extracting global dependencies and multi-hop relationships between video frames. The spatiotemporal vision transformer (STVT) model [11] achieved SOTA performance on the SumMe and TVSum datasets by training on both inter-frame and intra-frame information. We will refer to this line of work as ‘generic video summarisation’.
In contrast to generic summarisation, Sharghi et al. [12] introduce ‘query-focused video summarisation’, where user preferences are incorporated into the summarisation process in the form of text queries. However, these queries were limited in scope and vocabulary. With the advancements in LLMs and, later, multi-modal foundation models (of which ViFMs are a subset), open-ended natural language queries became possible, enabling more sophisticated and contextually relevant video summarisation. We refer to these summaries as ‘query-driven’ summaries.
ViFMs for Video Summarisation: ViFMs learn a general-purpose representation for various video understanding tasks, by leveraging large-scale datasets. A detailed survey of ViFMs can be found in [13]. These models are typically evaluated on diverse benchmark datasets, e.g., MVBench [14].
Video summarisation is not a common task for ViFMs. Therefore, we explore video understanding tasks capable of generating video skims, storyboards, or textual outputs based on a video and a natural language query.
Madan et al. [13] identify video content understanding and descriptive understanding among video understanding tasks. Video content understanding tasks can entail: abstract understanding (e.g., classification, retrieval), temporal understanding (e.g., action localization), and spatio-temporal understanding (e.g., object tracking, segmentation). Descriptive understanding tasks, such as VQA and captioning, focus on generating a textual description of the video content.
In terms of generating summaries as visual outputs, retrieval, which involves finding video segments containing specific actions, objects, or scenes, is most closely related to our work. LanguageBind [15] achieves SOTA performance in zero-shot video-text retrieval, using language as a binding modality to establish semantic connections between textual descriptions and corresponding visual content.
While both tasks under descriptive video understanding can generate textual summaries, models that excel in VQA are preferred for their ability to provide summaries for subjective, creative, and logical questions. VideoChat2 [14] excelled in challenging temporal video understanding tasks on the MVBench [14] benchmark, outperforming SOTA ViFMs by over 15%.
An important sub-field of video summarisation is egocentric summarisation, reviewed by Plizzari et al. [16]. This survey emphasises the importance of egocentric video question answering (VQA) as a key enabler for a wide range of assistive applications centred on HRI. It further identifies key open challenges in egocentric VQA, including the use of unrestricted query concepts to reflect real-life scenarios, application to outdoor environments, extension to modalities beyond text, handling long videos, the limited capability of vision-language models to adapt to egocentric data, and evaluation in interaction with users. Our study addresses these key challenges.
Foundation Models for Robot Communication: Firoozi et al. [17] provide an extensive survey on how foundation models improve robot capabilities. However, the use of foundation models for robots communicating with humans for intermittent supervision remains a crucial yet underexplored aspect. Tellex et al. [18] reviewed the use of natural language in robotics, finding that most research focuses on robots following human instructions [3]. There is limited research on robot-to-human communication or two-way communication, which our study addresses. Specifically, we investigate unprompted generic summarisation (robot-to-human communication) and interactive ‘query-driven summarisation’ (two-way communication).
Continual and lifelong robot learning, enabling robots to continuously learn and adapt to new information and environments with minimal human intervention, is another research direction leveraging foundation models [5], [19], [20], [21]. The emphasis is on efficient representation and retrieval of long-duration data. For example, Dechant et al. [5] use an LLM for generic and query-driven robot action summarisation, training a single model to both summarise and answer questions. Our work differs from this line of work by proposing a general framework that allows the integration of any existing ViFM without modifying model architectures or dataset-specific fine-tuning. While prior work commonly uses a static validation set, we emphasise human-in-the-loop evaluation and include visual modalities in addition to text.
Framework Design
There are two main challenges in adopting ViFMs for generating robot summaries for humans. First, how should a summary be generated? A generic summary generation process may require less user effort, but the generic summary may not capture relevant information for a specific task. On the other hand, explicit user queries may provide more relevant summaries, but may require additional user expertise, time and effort. Second, how should the summarised information be presented to the user? The presentation modality may influence the user's ability to verify the information and appropriately calibrate their trust in the summary.
We developed a framework to provide users with generic or query-driven summaries using ViFMs, in the form of a short summary video (video skim), V; a storyboard of key images, S; or a language-based text summary, T. The proposed framework is shown in Fig. 2.
Proposed framework for generating generic and query-driven summaries in the form of storyboard, video or text. There are 4 main steps:
A. Generic Summary Pipeline
A generic summary is generated without considering any input from the user. In the preprocessing step, the original long video is read at its original frame rate. Each frame is then encoded with ResNet18 [22] and fed into the relevance generation process to produce frame-level importance scores. Frame-level importance is generated by STVT [11]. By combining temporal inter-frame correlations and spatial intra-frame attention, STVT captures important temporal and spatial content. The model was trained to learn frame importance from human-created summaries on the TVSum dataset [9]. While the videos used to train STVT range from
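A minimal sketch of the frame-encoding step above is given below, extracting per-frame ResNet18 features with torchvision; the 224x224 resize, ImageNet normalisation, and pooled 512-d output are illustrative assumptions rather than the exact settings of our implementation.

```python
# Sketch of per-frame feature extraction with ResNet18 (torchvision).
# The resize/normalisation values are standard ImageNet defaults, used here
# for illustration only.
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier -> 512-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def encode_frames(frames):
    """frames: list of HxWx3 uint8 RGB arrays -> (N, 512) feature tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)
```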
During the selection process, important frames are selected to create a generic storyboard while important segments are selected to create a generic video skim. To generate the storyboard, a greedy approach is followed, where the frames selected (
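The greedy selection can be sketched as follows, using a hypothetical criterion for illustration: frames are taken in order of decreasing importance while a minimum temporal separation keeps the storyboard diverse.

```python
def greedy_storyboard(importance, k, min_gap):
    """Select up to k frame indices greedily by importance.

    importance: per-frame importance scores (list of floats).
    min_gap: minimum index separation between selected frames, used here as a
    simple diversity constraint (a hypothetical choice for illustration).
    """
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    selected = []
    for idx in order:
        if all(abs(idx - s) >= min_gap for s in selected):
            selected.append(idx)
        if len(selected) == k:
            break
    return sorted(selected)
```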
To generate a summary video, first, the frame-level sequence is transformed to a segment-level sequence using Kernel-based Temporal Segmentation (KTS) [23]. We use a cosine kernel change point detection algorithm, where the number of change points is not specified in the optimization algorithm. KTS receives frame level features, generated in the same way as for the storyboard modality. We constrain each selected segment to a minimum duration of
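Although our implementation uses KTS [23], a similar cosine-kernel change-point segmentation can be sketched with the open-source ruptures library as a stand-in; the penalty value and minimum segment size below are illustrative assumptions.

```python
# Cosine-kernel change-point detection over frame features, using ruptures as
# a stand-in for KTS [23]. Penalty and min_size are illustrative assumptions.
import numpy as np
import ruptures as rpt

def segment_video(frame_features: np.ndarray, min_size: int, penalty: float):
    """frame_features: (N, d) array -> list of (start, end) frame index pairs."""
    algo = rpt.KernelCPD(kernel="cosine", min_size=min_size).fit(frame_features)
    change_points = algo.predict(pen=penalty)  # number of segments not fixed a priori
    starts = [0] + change_points[:-1]
    return list(zip(starts, change_points))
```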
A generic text description is produced by VideoChat2 [14] based on the generic video summary. The model receives a generic query: it is given context about the robot's environment and task, and asked to describe the video in detail. Implementation details and parameter values are given under the G column in Table I.
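A hypothetical example of such a generic query (not the exact prompt used in the study) is shown below.

```python
# Hypothetical example of the generic query given to VideoChat2; the exact
# wording used in the study is not reproduced here.
generic_query = (
    "The following video was recorded by the front camera of an autonomous "
    "robot exploring an underground environment during a search-and-rescue "
    "mission. Describe in detail what the robot encountered, including any "
    "notable objects and events."
)
```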
B. Query-Driven Summary Pipeline
The main difference between the generic summary and the query-driven summary is the additional input of a user query, provided as text. Users are free to query however they wish to understand what the robot has been doing in their absence, going beyond the limited set of query options adopted in previous research [24].
In the preprocessing step, the long video is either broken down into frames (for storyboard) or segments (for video and text). The original video can be read at its original frame rate or a lower frame rate, depending on the availability of computational resources.
Both the text-based user query and the frames/segments are embedded using LanguageBind [15]. To generate relevance scores, the dot product between the query embedding and frame/segment-level embedding is used. It is important that both the query embedding and frame/segment embedding are in the same embedding space for the dot product to generate a valid similarity score.
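A minimal sketch of this relevance-scoring step is given below, operating on embeddings assumed to come from the LanguageBind text and video encoders; the L2 normalisation is an illustrative choice that makes the dot product behave like a cosine similarity.

```python
import numpy as np

def relevance_scores(query_emb: np.ndarray, segment_embs: np.ndarray) -> np.ndarray:
    """query_emb: (d,) query embedding; segment_embs: (n, d) frame/segment
    embeddings, both assumed to lie in the same (LanguageBind) embedding space.
    Returns an (n,) vector of relevance scores."""
    # L2-normalise so the dot product behaves like cosine similarity
    # (the normalisation is an illustrative assumption; the key requirement is
    # a shared embedding space).
    q = query_emb / np.linalg.norm(query_emb)
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    return s @ q
```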
Selecting the frames (
The top
C. User Study Design
With this framework in place to generate summaries to inform a user about what the robot has been doing in their absence, we designed a user study to evaluate whether users could benefit from such a summarisation framework. We consider a search and rescue mission undertaken by a fleet of robots in a complex underground environment (tunnel, urban, and cave) where they explore the previously unknown environment, searching for specific objects [1]. After extended autonomous exploration, a human operator is asked to identify key objects and events encountered by the robots during the operator's absence by either reviewing the raw video data, or summaries of the data generated by the proposed framework.
The user study is developed as a mixed-model design to investigate the effects of using summaries by ViFMs on task performance and usability. The user study was approved by the CSIRO research ethics committee. A total of
(left): Study procedure and (sub-)tasks. (right): Sample user study interface for query-driven storyboard summary. When the user enters a query into the ‘User Query’ box, selected images are displayed below the box based on the raw video given at bottom left. Questions are given on the right side.
Participants were randomly assigned to either the generic summarisation pipeline
The control task, C, consisted of only the raw video without any AI-generated summary. Users were allowed to watch the long raw video at the original or
The questions were designed with varying difficulty to assess the user's ability to identify relevant information in the video content with or without the aid of different types of summaries. Each task contains two sub-tasks with different types of questions: object-related and event-related, i.e., question types
We examine the following hypotheses in this user study:
H1: Distilling data improves: H1a accuracy, and H1b time, for retrieving relevant information as opposed to providing raw data (task score of G/Q > C and task time of G/Q < C).
H2: The magnitude of the improvement in: H2a accuracy, and H2b time, depends on the summary type (G/Q) and modality (V/S/T).
D. Implementation
1) Dataset
The videos (RGB data stream) used for the study were collected from a fleet of robots exploring tunnels and urban undergrounds while completing a search and rescue mission [1]. Front-facing egocentric videos of two robots are used: the quadruped robot Spot and the All-terrain Robot (ATR), shown in Fig. 4 (left). Each video is 40 min long and is read at 15 fps. An example frame from the videos is shown in Fig. 4 (right).
2) Intermittent Supervision Tasks
Each task (C, V, S, T) includes
The
3) Computational Considerations
Generic summaries were pre-generated, while query-driven summaries were generated in real time with the backend models running on an H100 GPU. As such, the computational requirements of the two pipelines were different. The models chosen under each pipeline also contribute to the difference in computational requirements. None of the models used for the implementation were fine-tuned on the test data.
In the ‘preprocessing’ stage, the Q pipeline used a lower frame rate to manage memory requirements when running in real time, as both ViFMs (LanguageBind and VideoChat2) are loaded into memory simultaneously.
4) Equalising Information Across Conditions and Modalities
To ensure that a similar amount of information is provided across conditions, the duration of video and text information, and the number of storyboard images was equalised. The threshold
Second, the information content across modalities should be balanced. Balancing information density between video and storyboard proved challenging: while video offers detailed information, it can be overwhelming; a storyboard, though less detailed, provides a concise overview. We prioritised relevance in video selection (top-k/knapsack algorithms; see the sketch below), whereas for the storyboard we employed a greedy approach to ensure diversity. To generate the text summary, in both pipelines VideoChat2 [14] receives the already-processed summary video, from which only 100 frames are considered due to resource limitations.
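As a sketch of the knapsack-based video selection, segments can be chosen to maximise total relevance under a total summary-duration budget; the one-second discretisation below is an illustrative assumption.

```python
def knapsack_select(segments, budget_s):
    """0/1 knapsack over video segments.

    segments: list of (relevance, duration_s) tuples; budget_s: total summary
    duration budget in seconds. Returns indices of the selected segments.
    Durations are discretised to whole seconds for the DP table.
    """
    n = len(segments)
    W = int(budget_s)
    dp = [[0.0] * (W + 1) for _ in range(n + 1)]
    for i, (value, dur) in enumerate(segments, start=1):
        w = max(1, int(round(dur)))
        for cap in range(W + 1):
            dp[i][cap] = dp[i - 1][cap]
            if w <= cap:
                dp[i][cap] = max(dp[i][cap], dp[i - 1][cap - w] + value)
    # Backtrack to recover the chosen segment indices.
    chosen, cap = [], W
    for i in range(n, 0, -1):
        if dp[i][cap] != dp[i - 1][cap]:
            chosen.append(i - 1)
            cap -= max(1, int(round(segments[i - 1][1])))
    return sorted(chosen)

# Example: knapsack_select([(0.9, 12.0), (0.4, 30.0), (0.7, 8.0)], budget_s=20)
# returns [0, 2], filling the 20 s budget with the two most relevant segments.
```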
Results Analysis and Discussion
The analysis investigates the two hypotheses: H1 (a summary brings benefits) and H2 (summary type and modality matter). The first considers the predictor/independent variable ‘summary type’ (G, Q, ‘No Summary’ = C), while the second excludes the C data to compare modalities in the GvsQ conditions. Where relevant, question type
When fitting models to explain the relationship between dependent and independent variables, the fitting formula is iteratively refined by initially including all possible interactions (two-way, and three-way where relevant), then progressively eliminating insignificant ones until only statistically significant interactions remain. We first tested whether presentation order affected the outcomes; no significant effects were detected for either outcome. We run
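This model-refinement procedure can be illustrated with statsmodels on synthetic placeholder data; the column names and values below are hypothetical and not the study data.

```python
# Illustration of iterative model refinement with statsmodels; the data are
# random placeholders, not the study data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "summary_type":  rng.choice(["C", "G", "Q"], size=n),
    "question_type": rng.choice(["object", "event"], size=n),
    "correct":       rng.integers(0, 2, size=n),  # binary task score
})

# Start with all two-way interactions ...
full = smf.logit("correct ~ C(summary_type) * C(question_type)", data=df).fit(disp=0)
# (in practice, full.pvalues is inspected to decide which terms to drop)
# ... then drop non-significant interactions and refit the reduced model.
reduced = smf.logit("correct ~ C(summary_type) + C(question_type)", data=df).fit(disp=0)
print(reduced.summary())
```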
Importance of Distilling Information: We first analyse (H1): distilling information improves accuracy (H1a) and efficiency (H1b) as opposed to providing raw data. The baseline (intercept) for s is
a) Accuracy: On average, users answered 43.3%, 36.3%, and 56.7% of questions correctly for the C, G, and Q summary types, respectively.
To test Hypothesis H1a, we ran a logistic regression with ‘summary type’ and ‘question type’ predictors. Both summary type Q (
These findings suggest that only query-based information distillation enhances user accuracy, partially supporting the hypothesis H1a: score of
b) Efficiency: To test Hypothesis H1b, we ran an ordinary least squares (OLS) regression on time with ‘summary type’ as the predictor. Summary type Q increased the predicted users' time (
Importance of GvsQ and Modality: We next test the hypotheses H2: the magnitude of the improvement in accuracy (H2a) and time (H2b) depends on the summary type GvsQ and presentation modality (V/S/T). The baseline for s is
c) Accuracy: To test Hypothesis H2a, we ran a logistic regression with ‘summary type’, ‘question type’ and ‘modality’ as predictors. Summary type Q (
These results support the hypothesis H2a, emphasising that allowing users to actively shape the information they receive leads to better accuracy than merely providing them generic summaries.
The choice of modality can significantly influence task accuracy, with S potentially offering advantages over both V and T in certain contexts (type of question matters - see below). A possible cause of the significantly decreased accuracy in T may be the lack of visual information for the user to verify the summary.
The interaction effects between modality and question type revealed complex patterns. While storyboards generally improved performance, they were less effective for event-related questions compared to object-related questions: unlike a video, static images cannot show the unfolding of an event.
d) Efficiency: To test Hypothesis H2b, we ran an OLS regression on time with ‘summary type’ and ‘modality’ as predictors. Summary type Q increased the predicted users' time (
Regarding interaction modality, the results indicate that storyboards and text may offer a more efficient means of information acquisition than video summaries which require sequential viewing. It is only with S that the accuracy and time are both improved: with V and T, we observed a trade-off between accuracy and time.
Recall that the summaries were generated live in the Q condition. The model generation time was included in our measurement of task time. On average, generating a summary takes 15.5 s, 9.43 s, and 23.2 s, with standard deviations of 0.246 s, 0.298 s, and 0.724 s, for the V, S, and T modalities respectively. Generating the text summaries takes the longest, as they require the summary video to be generated first. This is followed by the video summary, as the segments need to be concatenated to create a shorter video, whereas the storyboard can be shown right after the selection process. Even though it takes longer to generate T summaries, they are still more time-efficient than V in total task time.
Cause of Errors and Task Time: We analysed the source of errors by comparing the model outputs, user answers, and ground truth answers. The models demonstrated an average 63.0% accuracy rate. Without any AI assistance, users demonstrated an overall accuracy of 43.3%. With assistance, the overall task accuracy of the users was 46.5%, but varied across summarisation type and modality (see Table II). Note that the quality of the model response is impacted by the user query, and that in generating their answer, users can rely solely on the model's recommendations, or refer to the raw video, leading to a complex interplay between user queries, model responses, and user interpretations.1
We also performed an analysis of task time by adjusting for the time to generate query summaries. This resulted in the impact of summary type Q on time being non-significant under both H1b and H2b. However, we do not exclude the model generation time in our regression analysis reported in Table II, as users had the opportunity to read the questions or watch the raw video while the query-based models were generating summaries, thus excluding generation time would provide an unfair advantage to the query-based models.
Post-hoc Analysis:
1) User Confidence
While not directly related to the initial hypotheses, another variable of interest is the confidence of the users. Confidence was rated on a
The data was organised in the same way as the hypothesis testing. Summary type G (
2) Usability
Users gave a usability score for each task by answering the standard SUS questionnaire [25], on which we ran an ordinal logistic regression. We detected no significant effects of summary type or modality on the usability score.
3) Familiarity
The pre-study survey consisted of 8 custom questions to capture users' familiarity with robots and video summarisation models, and their trust in such models. A logistic regression was run on task performance (s) with the individual familiarity questions as predictors. To keep the model simple, only one-way interactions were considered. We did not find familiarity to have a significant effect on the score.
4) Question-Level Analysis
We further analysed the questions with the highest and lowest scores. The lowest score for V was counting the blue barrels, where
The question with the lowest score for the S task was ‘Did you knock down a door while trying to pass through it?’. Even though an image of the fallen door was included in the 24 images provided under G summary,
5) Post-Study Survey Answers
During the post-study survey, the users were asked if the provided summaries were helpful in their task completion. Participants in the Q group unanimously reported that AI tools helped them, compared to
Main Findings: The study demonstrates that query-driven summaries significantly improve retrieval accuracy compared to raw data; however, this advantage comes with the trade-off of increased time spent on the task. Among the modalities, storyboards are particularly effective, improving accuracy and reducing time compared to video and text summaries. However, storyboards are less effective for event-related questions, where other modalities might be more suitable. Our work demonstrates ViFMs as a promising tool for intermittent robot supervision via query-driven summaries. Further, our findings on participants' performance difference between object and event questions highlight the room for improvement in ViFMs' capabilities in object localization and counting.
Limitations: In this study, only a subset of SOTA ViFMs was tested in a context-specific search-and-rescue scenario. New models are continuously being developed that can achieve higher accuracies on benchmark datasets. However, our user study evaluation highlights that the success of ViFMs for intermittent robot supervision in a human-robot team is influenced by a complex interaction between environment, user, and model. Improving model accuracy alone is likely insufficient. Errors may arise due to the quality of the query, the quality of model responses, the interpretation of the model response, or the inherent difficulty of the task or environment. Thus, further investigation is required to understand how to adopt ViFMs effectively to support intermittent robot supervision.
Our study allowed users to enter their queries free-form. Providing fixed templates that generate structured prompts to the model could potentially improve accuracy, but may be less suited for gathering new information unknown at the time these templates were designed.
In our study, summarising across different modalities had the potential for information imbalance. Despite the potential imbalance of information between video and storyboard, our results suggest that a well-curated storyboard can be more effective than a higher-information content summary video.
Conclusion and Future Work
We investigated the efficacy of ViFMs in generating robot summaries for intermittent supervision. Query-driven summaries significantly improved task accuracy compared to generic summaries and raw video, albeit at the cost of increased task completion time. The modality of summary presentation also played a crucial role, with storyboards potentially offering a balance between accuracy and efficiency compared to video and text formats. The lack of correlation between user familiarity and performance suggests that well-designed summaries can effectively bridge knowledge gaps for users with varying levels of expertise.
While the study demonstrates the potential of query-driven summaries by ViFMs for improving robot-to-human and two-way communication, it also exposes critical constraints: model limitations, particularly in detecting less-prominent environmental features; modality dependencies, particularly for events; and the time-accuracy trade-off of interactive summarisation. By addressing key future research questions, such as understanding and modelling implicit human knowledge structures, enhancing human interpretation of robot perception through simultaneous multi-modal summaries, and creating computationally efficient yet semantically rich summarisation techniques, we can progress towards more adaptive and intelligent human-robot communication.
ACKNOWLEDGMENT
The authors would like to thank Pavan Sikka, CSIRO, for his valuable insights and contributions to this work.