Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System

By Ethan Miller, Dan Shen, Junli Liu and Charles Nicholas

Abstract

Information retrieval has become increasingly important with the rapid growth of all kinds of information, yet few suitable systems are available. This paper presents a few approaches that enable large-scale information retrieval for the TELLTALE system. TELLTALE is a dynamic hypertext information retrieval environment. Using several techniques based on n-grams (n-character sequences of text), it provides full-text search for text corpora that may be garbled by OCR (Optical Character Recognition) or transmission errors and that may contain multiple languages. It can find documents similar to a 1 KB query in 1 GB of text data in 45 seconds. This remarkable performance is achieved by integrating new data structures and gamma compression into the TELLTALE framework. This paper also compares several query methods, such as TF/IDF and incremental similarity, with the original technique of centroid subtraction. The new similarity techniques give better performance but less accuracy.
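The abstract names n-gram profiles and centroid subtraction without showing how they fit together. The following is a minimal sketch of the general idea, not TELLTALE's implementation: the function names, the choice of n = 5, and the use of a plain cosine on centroid-subtracted frequency vectors are all assumptions made for illustration.

```python
from math import sqrt


def ngram_profile(text, n=5):
    """Return normalized frequencies of overlapping character n-grams.

    n=5 is an assumed default, not a value taken from the paper.
    """
    text = " ".join(text.lower().split())  # fold case, collapse whitespace
    counts = {}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def centroid(profiles):
    """Average a list of profiles gram by gram (hypothetical helper)."""
    acc = {}
    for profile in profiles:
        for gram, freq in profile.items():
            acc[gram] = acc.get(gram, 0.0) + freq
    return {g: f / len(profiles) for g, f in acc.items()}


def similarity(a, b, mu):
    """Cosine of two profiles after subtracting the collection centroid mu."""
    grams = set(a) | set(b) | set(mu)
    da = {g: a.get(g, 0.0) - mu.get(g, 0.0) for g in grams}
    db = {g: b.get(g, 0.0) - mu.get(g, 0.0) for g in grams}
    dot = sum(da[g] * db[g] for g in grams)
    norm_a = sqrt(sum(v * v for v in da.values()))
    norm_b = sqrt(sum(v * v for v in db.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Usage: the second document differs from the first by one garbled character.
docs = ["the quick brown fox", "the quick brown fax", "completely unrelated words here"]
profiles = [ngram_profile(d) for d in docs]
mu = centroid(profiles)
near = similarity(profiles[0], profiles[1], mu)
far = similarity(profiles[0], profiles[2], mu)
```

Because a single corrupted character changes only the few n-grams that overlap it, the garbled document still shares most of its profile with the original, which is why n-gram methods tolerate OCR and transmission errors.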

Year: 2000
OAI identifier: oai:CiteSeerX.psu:10.1.1.18.7734
Provided by: CiteSeerX
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://www.soe.ucsc.edu/~elm/P... (external link)