1. Introduction
Real-world events consist of diverse multisensory signals and their complex interactions with one another. In-the-wild videos of real-life events and moments capture a rich set of modalities and the intricate relationships among them. Leveraging such multisensory information is therefore essential for better video understanding, yet its diversity and complexity make this challenging. For instance, even when audio and visual signals are congruent, the ways in which they relate to each other differ across events. Events thus exhibit different characteristics (e.g., uni-modal types such as vision-only and audio-only, and multi-modal types such as continuous, instant, and rhythmic, as shown in Figure 1), which we refer to as event types. That is, understanding video content requires properly handling these diverse and complex associations and relationships. Surprisingly, however, this has been largely overlooked in prior audio-visual recognition research.