Cairo Microsoft Innovation Center

By Walid Magdy, Smart Village and Bldg B

Abstract

With massive book digitization efforts underway, there is a need to develop effective book retrieval strategies. This paper explores the relative contribution of different parts of digitized and OCR'ed books to effective retrieval. The examined parts are the entire content of books, book headings, book titles, and table of contents entries. Results show that indexing only the headers and titles of books is nearly as effective as indexing their entire contents. These results indicate that certain portions of books, specifically titles and headers, are more valuable than others. This is akin to web search, where hypertext and page titles are more valuable to index than the rest of the webpage. In addition, a combination-of-evidence approach improves retrieval effectiveness further compared to using any single portion of the book in isolation.
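
The combination-of-evidence idea in the abstract can be illustrated with a minimal sketch. It assumes a simple weighted sum of per-field term-match scores. The abstract does not say how the paper combines evidence, so the field names, the weights in `FIELD_WEIGHTS`, and the scoring function are all hypothetical and are not the authors' method.

```python
# Hypothetical field weights; the abstract does not specify how evidence is combined.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "toc": 1.5, "content": 1.0}


def field_score(query, text):
    """Fraction of tokens in `text` that match a query term (0.0 for empty text)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    terms = set(query.lower().split())
    hits = sum(1 for token in tokens if token in terms)
    return hits / len(tokens)


def combined_score(query, book, weights=FIELD_WEIGHTS):
    """Weighted sum of per-field scores; fields missing from `book` contribute nothing."""
    return sum(
        weight * field_score(query, book.get(field, ""))
        for field, weight in weights.items()
    )


def rank(query, books):
    """Return book ids ordered by descending combined score."""
    return sorted(
        books,
        key=lambda book_id: combined_score(query, books[book_id]),
        reverse=True,
    )
```

Under these assumed weights, a book whose title matches the query outranks one that mentions a query term only once in its body text, which mirrors the abstract's observation that titles and headers carry more retrieval value than the remaining content.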

Topics: H.3.1 [Content Analysis and Indexing]: Indexing methods. General Terms: Algorithms, Measurement, Performance, Experimentation. Keywords: Book search, OCR retrieval
Year: 2011
OAI identifier: oai:CiteSeerX.psu:10.1.1.187.7039
Provided by: CiteSeerX
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://research.microsoft.com/... (external link)