Character n-gram Spotting in Document Images
By Sudha Praveen M, Pramod Sankar K and C. V. Jawahar

Abstract

In this paper, we present a novel approach to search and retrieval from document image collections without explicit recognition. Existing recognition-free approaches such as word-spotting cannot scale to arbitrarily large vocabularies and document image collections. We put forth a framework that overcomes three issues of word-spotting: i) retrieving word images not labeled during indexing, ii) allowing query and retrieval of morphological variations of words, and iii) scaling retrieval to large collections. We propose a character n-gram spotting framework, where word images are considered as a bag of visual n-grams. The character n-grams are represented in a visual-feature space and indexed for quick retrieval. In the retrieval phase, the query word is expanded into its constituent n-grams, which are used to query the previously built index. A ranking mechanism is proposed that combines the retrieval results from the multiple lists corresponding to each n-gram. The approach is demonstrated on a sizeable collection of English and Malayalam books. With a mean AP of 0.64, the performance of the retrieval system was found to be very promising.

Keywords: Word-Spotting, Character n-Grams, Recognition-free
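The retrieval phase described in the abstract (expand the query into n-grams, look each one up in the index, merge the per-n-gram result lists) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration, not the authors' method: the n-gram sizes (2 and 3), the reciprocal-rank fusion rule, the function names, and the toy index, which maps text n-grams directly to ranked word-image identifiers instead of using visual features.

```python
from collections import defaultdict


def char_ngrams(word, n_values=(2, 3)):
    """Return the character n-grams of `word` for each n in n_values.

    The choice of n = 2 and 3 is an assumption; the abstract does not
    state which n-gram sizes are used.
    """
    grams = []
    for n in n_values:
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def fuse_rankings(ranked_lists):
    """Combine ranked lists by summing reciprocal ranks.

    This fusion rule is a stand-in chosen for illustration; the paper's
    actual ranking mechanism is not described in the abstract.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


def search(query, index, n_values=(2, 3)):
    """Expand the query into n-grams and fuse the lists found in the index."""
    lists = [index[g] for g in char_ngrams(query, n_values) if g in index]
    return fuse_rankings(lists)


# Hypothetical toy index: n-gram -> ranked word-image identifiers.
toy_index = {
    "ca": ["img_cat", "img_car"],
    "at": ["img_cat", "img_bat"],
    "cat": ["img_cat"],
}

results = search("cat", toy_index)
print(results)
```

In this toy setup, a word image that appears in the lists of several query n-grams accumulates a higher score than one that matches only a single n-gram, which is the general idea behind combining multiple per-n-gram lists.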

Year: 2014
OAI identifier: oai:CiteSeerX.psu:10.1.1.417.7653
Provided by: CiteSeerX
Download PDF:
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://cvit.iiit.ac.in/papers/... (external link)