Self-Sufficient Framework for Continuous Sign Language Recognition

Abstract:

The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses key issues of sign language recognition. These include the need for complex multi-scale features, such as the hands, face, and mouth, for understanding sign language, and the absence of frame-level annotations. To this end, we propose (1) Divide and Focus Convolution (DFConv), which extracts both manual and non-manual features without the need for additional networks or annotations, and (2) Dense Pseudo-Label Refinement (DPLR), which propagates non-spiky frame-level pseudo-labels by combining the ground-truth gloss sequence labels with the predicted sequence. We demonstrate that our model achieves state-of-the-art performance among RGB-based methods on the large-scale CSLR benchmarks PHOENIX-2014 and PHOENIX-2014-T, while showing comparable results with better efficiency than approaches that use multi-modality or extra annotations.
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023
Conference Location: Rhodes Island, Greece

1. INTRODUCTION

The Continuous Sign Language Recognition (CSLR) task aims to recognise a gloss1 sequence in a sign language video [1], [2], [3]. To capture the meaning of the sign expressions from a signer, recent works obtain manual and non-manual expressions by fusing RGB with other modalities such as depth [4], infrared maps [5] and optical flow [6], or by explicitly extracting multi-cue features [2], [7], [8], [9] or human keypoints [10] using off-the-shelf detectors. However, using such extra components introduces bottlenecks in both the training and inference processes. In addition, most CSLR datasets only have sentence-level gloss labels without frame- or gloss-level labels [2], [11], [12]. To overcome the insufficient annotations, the Connectionist Temporal Classification (CTC) [13] loss has traditionally been adopted to consider all possible underlying alignments between the input and target sequence. However, using the CTC loss without true frame-level supervision produces temporally spiky attention, which can make the model fail to localise important temporal segments [14].
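To make the role of CTC concrete: it marginalises over every frame-level alignment that collapses (repeats merged, blanks removed) to the target gloss sequence, which is exactly why no frame-level labels are needed, and also why the supervision signal per frame is indirect. Below is a minimal, self-contained sketch of the standard CTC forward algorithm in pure Python (not the paper's code; function and variable names are our own):

```python
import math

def ctc_forward(log_probs, target, blank=0):
    """Sum the log-probability over all frame-level alignments that
    collapse to `target` (the CTC marginalisation).

    log_probs: T x V nested list of per-frame log-probabilities.
    target:    list of label ids, blanks excluded.
    """
    # Extended target interleaved with blanks: [^, g1, ^, g2, ^, ...]
    ext = [blank]
    for g in target:
        ext += [g, blank]
    S, T = len(ext), len(log_probs)
    NEG = float("-inf")

    def logadd(a, b):
        # Numerically stable log(exp(a) + exp(b)).
        if a == NEG:
            return b
        if b == NEG:
            return a
        m = max(a, b)
        return m + math.log1p(math.exp(min(a, b) - m))

    # alpha[s]: log-prob of all prefixes ending at ext[s] at the current frame.
    alpha = [NEG] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            a = alpha[s]                      # stay on the same symbol
            if s >= 1:
                a = logadd(a, alpha[s - 1])   # advance by one
            # Skip over a blank only between two *different* labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    # Valid endings: final label or the trailing blank.
    return logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG)
```

Because the loss only constrains the sum over alignments, gradient descent is free to concentrate probability mass on a few frames per gloss, which is the "spiky" behaviour the paper's DPLR module is designed to counteract with dense pseudo-labels.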
