1 Introduction
Extracting textual information from natural images is a challenging problem with many practical applications. Unlike character recognition for scanned documents, recognizing text in unconstrained images is complicated by a wide range of variations in backgrounds, textures, fonts, and lighting conditions. As a result, many text detection and recognition systems rely on cleverly hand-engineered features [5], [4], [14] to represent the underlying data. Sophisticated models such as conditional random fields [11], [19] or pictorial structures [18] are also often required to combine the raw detection/recognition outputs into a complete system.