Lexical Chains and Sliding Locality Windows in Content-based Text Similarity Detection

By Thade Nahnsen, Ozlem Uzuner and Boris Katz

Abstract

We present a system to determine content similarity of documents. More specifically, our goal is to identify book chapters that are translations of the same original chapter; this task requires identification of not only the different topics in the documents but also the particular flow of these topics. We experiment with different representations employing n-grams of lexical chains and test these representations on a corpus of approximately 1000 chapters gathered from books with multiple parallel translations. Our representations include the cosine similarity of attribute vectors of n-grams of lexical chains, the cosine similarity of tf*idf-weighted keywords, and the cosine similarity of unweighted lexical chains (unigrams of lexical chains), as well as multiplicative combinations of the similarity measures produced by these approaches. Our results identify four-grams of unordered lexical chains as a particularly useful representation for text similarity evaluation.
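
The representations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each document has already been reduced to a sequence of lexical chain labels (the chain-building step is not shown), and the function names unordered_ngrams, cosine, and combined_similarity are hypothetical. It counts unordered n-grams of chain labels in a sliding window, compares two documents by cosine similarity, and multiplies several similarity scores together.

```python
from collections import Counter
from math import sqrt


def unordered_ngrams(chain_ids, n):
    """Count n-grams over a sequence of chain labels.

    Assumption: chain_ids is a list of lexical chain labels in document
    order. Sorting each window makes the n-gram unordered, so the same
    labels in a different order map to the same key.
    """
    counts = Counter()
    for i in range(len(chain_ids) - n + 1):
        window = tuple(sorted(chain_ids[i:i + n]))
        counts[window] += 1
    return counts


def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def combined_similarity(scores):
    """Multiplicative combination of several similarity scores."""
    product = 1.0
    for s in scores:
        product *= s
    return product


# Hypothetical chain label sequences for two chapters.
chapter_a = ["war", "family", "travel", "war", "love"]
chapter_b = ["family", "war", "travel", "love", "war"]

bigram_score = cosine(unordered_ngrams(chapter_a, 2), unordered_ngrams(chapter_b, 2))
unigram_score = cosine(unordered_ngrams(chapter_a, 1), unordered_ngrams(chapter_b, 1))
overall = combined_similarity([bigram_score, unigram_score])
```

With n set to 4 the same function yields the unordered four-gram representation the abstract singles out. A tf*idf-weighted keyword score, if available, could be passed to combined_similarity alongside the chain-based scores.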

Topics: AI, Natural Language Processing, N-grams, Text Similarity, Lexical Chains
Year: 2005
OAI identifier: oai:dspace.mit.edu:1721.1/30546
Provided by: DSpace@MIT
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://hdl.handle.net/1721.1/3... (external link)