I. Introduction
A classic problem in multimedia representation and understanding is the semantic gap problem [1]: the wide representational gap between the audiovisual signals that compose multimedia objects and the concepts those signals represent. For instance, the dominant color and movement trajectory of a given set of pixels in a video clip, which are low-level characteristics of the clip, usually provide little information about what that set of pixels means, at least not to computers. But recent developments in artificial intelligence (AI) are changing that.