Weakly-Supervised Action Segmentation and Unseen Error Detection in Anomalous Instructional Videos


Abstract:

We present a novel method for weakly-supervised action segmentation and unseen error detection in anomalous instructional videos. In the absence of an appropriate dataset for this task, we introduce the Anomalous Toy Assembly (ATA) dataset, which comprises 1152 untrimmed videos of 32 participants assembling three different toys, recorded from four different viewpoints. The training set comprises 27 participants who assemble toys in an expected and consistent manner, while the test and validation sets comprise 5 participants who display sequential anomalies in their task. We introduce a weakly-labeled segmentation algorithm that generalizes the constrained Viterbi algorithm and identifies potential anomalous moments based on the difference between future anticipation and current recognition results. The proposed method is not restricted to the training transcripts during testing, allowing for the inference of anomalous action sequences while maintaining real-time performance. Based on these segmentation results, we also introduce a baseline for detecting pre-defined human errors and benchmark results on the ATA dataset. Experiments on the ATA and CSV datasets show that our method outperforms the state of the art in segmenting anomalous videos under both online and offline conditions.
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France


1. Introduction

One of the challenges in human-machine interaction is the automatic vision-based understanding of human actions in instructional videos. These videos depict a series of low-level actions that collectively accomplish a top-level task, such as preparing a meal or assembling an object. However, labeling every frame of these videos is arduous, requiring significant manual effort to annotate the start and end times of each action segment. Consequently, there has been a surge of research interest in developing weakly-supervised methods for learning actions. In particular, such methods aim to overcome the challenge of weakly-labeled instructional videos, where only the ordered sequence of action labels (the transcript) is provided, without any information on the duration of each action.
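To make the weakly-labeled setting concrete, the standard transcript-constrained Viterbi alignment (which the paper's method generalizes) can be sketched as follows. This is a minimal illustrative implementation, not the authors' algorithm: `constrained_viterbi`, its arguments, and the toy scores below are all hypothetical. Given frame-wise class log-scores and an ordered transcript, it finds the monotonic frame-to-transcript assignment with the highest total score, yielding per-frame labels without any ground-truth segment boundaries.

```python
import numpy as np

def constrained_viterbi(log_probs, transcript):
    """Align T frames to an ordered transcript of K action labels.

    log_probs: (T, C) array of frame-wise class log-scores (hypothetical input).
    transcript: ordered list of K class indices, K <= T.
    Returns a length-T list of per-frame class labels consistent with the
    transcript order.
    """
    T, _ = log_probs.shape
    K = len(transcript)
    NEG = -np.inf
    dp = np.full((T, K), NEG)          # dp[t, k]: best score with frame t on step k
    back = np.zeros((T, K), dtype=int)  # 0 = stayed on step k, 1 = advanced from k-1
    dp[0, 0] = log_probs[0, transcript[0]]
    for t in range(1, T):
        for k in range(min(t + 1, K)):  # step k needs at least k+1 frames
            stay = dp[t - 1, k]
            adv = dp[t - 1, k - 1] if k > 0 else NEG
            if adv > stay:
                dp[t, k], back[t, k] = adv, 1
            else:
                dp[t, k], back[t, k] = stay, 0
            dp[t, k] += log_probs[t, transcript[k]]
    # Backtrack from the last frame, which must sit on the last transcript step.
    labels = [0] * T
    k = K - 1
    for t in range(T - 1, -1, -1):
        labels[t] = transcript[k]
        k -= back[t, k]
    return labels

# Toy example: 4 frames, 2 classes, transcript [action 0, action 1].
scores = np.log(np.array([[0.9, 0.1],
                          [0.8, 0.2],
                          [0.2, 0.8],
                          [0.1, 0.9]]))
print(constrained_viterbi(scores, [0, 1]))  # → [0, 0, 1, 1]
```

Because the alignment is forced to follow the training transcript, this baseline cannot represent out-of-order (anomalous) action sequences at test time, which is precisely the restriction the paper's generalization removes.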

