I. Introduction
Large-scale unstructured data is becoming one of the major focal points of data management and information retrieval research [2]–[7], [12]–[15] because of its many attractive properties. For example, online media (e.g., Web blogs, Twitter, Facebook, and news feeds) have very broad coverage and are instantly updated, which makes them an attractive large-scale dataset containing a wealth of information not immediately available from other sources. In addition, many Web sources export only text, even if they store data internally in some other form. Lastly, much enterprise information, such as employee evaluations, internal documents, and PowerPoint presentations, is primarily text.