Skip to main content
Article thumbnail
Location of Repository

An Annotated Dataset for Extracting Definitions and Hypernyms from the Web

By Roberto Navigli, Paola Velardi, Juana Maria Ruiz-martínez and Facultad De Informàtica

Abstract

This paper presents and analyzes an annotated corpus of definitions, created to train an algorithm for the automatic extraction of definitions and hypernyms from Web documents. As an additional resource, we also include a corpus of non-definitions with syntactic patterns similar to those of definition sentences, e.g.: “An android is a robot ” vs. “Snowcap is unmistakable”. Domain and style independence is obtained thanks to the annotation of a sample of the Wikipedia corpus and to a novel pattern generalization algorithm based on wordclass lattices (WCL). A lattice is a directed acyclic graph (DAG), a subclass of nondeterministic finite state automata (NFA). The lattice structure has the purpose of preserving the salient differences among distinct sequences, while eliminating redundant information. The WCL algorithm will be integrated into an improved version of the GlossExtractor Web application (Velardi et al., 2008). This paper is mostly concerned with a description of the corpus, the annotation strategy, and a linguistic analysis of the data. A summary of the WCL algorithm is also provided for the sake of completeness.

Year: 2011
OAI identifier: oai:CiteSeerX.psu:10.1.1.180.5958
Provided by: CiteSeerX
Download PDF:
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://www.lrec-conf.org/proce... (external link)
  • Suggested articles


    To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.