I. Introduction
Multimodal image representation has proven to be a powerful approach for image databases: it exploits not only low-level visual features but also text annotations. Websites such as Flickr encourage visitors to tag images with keywords as they view them, and these community-based annotation efforts have been shown to be useful for image classification. In the past, different media were isolated and analyzed separately; images had one representation, audio had another, and the two were rarely analyzed together. More and more researchers now recognize that documents of interest are multimedia rather than mono-media, and should therefore be represented in an integrated fashion so that various mathematical techniques for discovering latent semantics can be applied [1]. Following this idea, images have been represented not only by low-level features such as color, texture, or shape, but also by semantic features, which can improve the results of image clustering and retrieval [2].