I. Introduction
With the recent boom of smartphones, digital cameras, and social media sites (e.g., Flickr, YouTube, and Facebook), it has become convenient for people to capture and share social media data online, which facilitates information generation, sharing, and propagation. As a result, a popular event happening around us or elsewhere in the world can spread very quickly, and a substantial number of events with multi-modal content (e.g., images, videos, and texts) are available on the Internet. Most of the social event data uploaded by users relates to specific topics, and it is time-consuming to identify or cluster it manually. Therefore, automatically understanding social events from massive social media data is important and helps users and governments better browse, search, and monitor social events. However, this goal is difficult to achieve because the events are numerous, complex, and diverse, which makes it hard to mine effective information for social event understanding. For example, for the social event “Kate and William wedding”, videos may contain images of Kate and William together on the wedding day in an official setting (such as in the church or waving at the crowd from the balcony).