This paper presents a multi-modal dataset that contains rich transcriptions of spoken conversations. As diverse multi-modal and multi-task models emerge, there is a growing need for multi-modal training and evaluation datasets accompanied by rich metadata. However, no universal dataset addresses these requirements across the diverse tasks, partly due to the cost of annotation. To over...
The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR). To accomplish this, we utilise a two-pathway SlowFast network, where each pathway operates at distinct temporal resolutions to separately capture spatial (hand shapes, facial expressions) and dynamic (movements) information. In addition, we introduce two distinct ...
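For illustration only, below is a minimal sketch of such a two-pathway design, assuming PyTorch; the module name (TwoPathwayEncoder), channel widths, and sampling strides are illustrative assumptions rather than the paper's actual configuration. The slow pathway reads sparsely sampled frames to retain spatial detail, while the fast pathway reads densely sampled frames to capture movement.

```python
# Sketch of a two-pathway SlowFast-style encoder for sign language video (assumed
# configuration, not the paper's). Slow path: few frames, wide channels (spatial
# detail such as hand shapes and facial expressions). Fast path: many frames,
# lightweight channels (movement dynamics).
import torch
import torch.nn as nn

class TwoPathwayEncoder(nn.Module):
    def __init__(self, slow_stride=8, fast_stride=2):
        super().__init__()
        self.slow_stride = slow_stride  # sparse temporal sampling -> spatial pathway
        self.fast_stride = fast_stride  # dense temporal sampling  -> dynamic pathway
        # Slow pathway: wider channels, no temporal kernel.
        self.slow = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3))
        # Fast pathway: lightweight channels, temporal kernel to model motion.
        self.fast = nn.Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))

    def forward(self, video):  # video: (B, 3, T, H, W)
        slow_feat = self.slow(video[:, :, ::self.slow_stride])  # spatial features
        fast_feat = self.fast(video[:, :, ::self.fast_stride])  # dynamic features
        return slow_feat, fast_feat

x = torch.randn(2, 3, 64, 224, 224)            # two clips of 64 RGB frames
slow_f, fast_f = TwoPathwayEncoder()(x)
print(slow_f.shape, fast_f.shape)
```

The asymmetry (wide, slow stream versus lightweight, fast stream) mirrors the usual SlowFast design choice; in a CSLR pipeline the two streams would typically be fused before a temporal decoder.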
The objective of this work is to extract the target speaker’s voice from a mixture of voices using visual cues. Existing works on audio-visual speech separation have demonstrated promising intelligibility, but maintaining naturalness remains challenging. To address this issue, we propose AVDiffuSS, an audio-visual speech separation model based on a diffusion mechanism known ...
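As a rough illustration of conditional diffusion sampling for this task (not the specific mechanism named in the abstract, whose description is truncated above), the sketch below runs a DDPM-style reverse process conditioned on the audio mixture and the target speaker's visual embedding; the denoiser network, its call signature, and the noise schedule are assumed placeholders.

```python
# Illustrative conditional diffusion sampling for target-speaker separation.
# `denoiser` is an assumed noise-prediction network conditioned on the mixture
# spectrogram and a visual embedding of the target speaker.
import torch

@torch.no_grad()
def separate(denoiser, mixture_spec, visual_emb, steps=50):
    betas = torch.linspace(1e-4, 0.02, steps, device=mixture_spec.device)  # toy schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(mixture_spec)           # start from pure Gaussian noise
    for t in reversed(range(steps)):
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        # Predict the noise component given the mixture and visual conditioning.
        eps = denoiser(x, t_batch, mixture=mixture_spec, visual=visual_emb)
        # Standard DDPM posterior mean for x_{t-1}.
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject sampling noise
    return x  # estimated spectrogram of the target speaker's speech
```

The key point the sketch conveys is that the visual stream conditions every denoising step, steering generation toward the target speaker's voice rather than the mixture.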
The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses key issues of sign language recognition. These include the need for complex multi-scale features, such as hands, face, and mouth, for understanding, and the absence of frame-level annotations. To this end, we propose (1) Divide and Focus Convolution (DFConv), which extracts both ma...