Abstract—We present a novel approach to search and retrieval from document image collections without explicit recognition. Existing recognition-free approaches such as word-spotting do not scale to arbitrarily large vocabularies and document image collections. We put forth a framework that overcomes three limitations of word-spotting: i) retrieving word images that were not labeled during indexing, ii) querying and retrieving morphological variations of words, and iii) scaling retrieval to large collections. We propose a character n-gram spotting framework in which word images are treated as a bag of visual n-grams. The character n-grams are represented in a visual-feature space and indexed for quick retrieval. In the retrieval phase, the query word is expanded into its constituent n-grams, which are used to query the previously built index. A ranking mechanism then combines the retrieval results from the multiple lists, one per n-gram. The approach is demonstrated on a sizeable collection of English and Malayalam books, where the retrieval system achieves a promising mean average precision (AP) of 0.64.

Keywords: Word-Spotting, Character n-Grams, Recognition-free
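The retrieval phase described above (expand the query into n-grams, look each one up in the index, then merge the per-n-gram result lists) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `char_ngrams` and `fuse_ranked_lists`, the choice of bigrams and trigrams, the toy index keyed by text n-grams (a stand-in for an index over visual features), and the score-summing fusion rule, which is not necessarily the ranking mechanism the paper proposes.

```python
from collections import defaultdict


def char_ngrams(word, n_values=(2, 3)):
    """Return the character n-grams of `word` for each n in `n_values`."""
    grams = []
    for n in n_values:
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def fuse_ranked_lists(query, index, n_values=(2, 3)):
    """Combine the per-n-gram result lists into one ranking.

    `index` maps an n-gram to a list of (word_image_id, similarity) pairs.
    Fusion here is a plain sum of similarities (an illustrative assumption).
    """
    scores = defaultdict(float)
    for gram in char_ngrams(query, n_values):
        for image_id, similarity in index.get(gram, []):
            scores[image_id] += similarity
    # Highest combined score first; ties broken by image id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy index: in the real system the keys would be visual n-grams.
toy_index = {
    "ca": [("img1", 0.9), ("img2", 0.4)],
    "at": [("img1", 0.8), ("img3", 0.7)],
    "cat": [("img1", 0.95)],
}

ranking = fuse_ranked_lists("cat", toy_index)
ranked_ids = [image_id for image_id, _ in ranking]
```

Because the index is keyed by n-grams rather than whole words, a word image can be retrieved even if that exact word was never labeled at indexing time, provided enough of its n-grams match.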