A New Algorithm for Document Aboutness

Abstract

This thesis investigates the document aboutness task and proposes the design, implementation, and evaluation of a system that identifies the main focus of a text by detecting the entities, drawn from Wikipedia, that are salient to its discourse. To design this system we employ several Natural Language Processing tools, such as an entity annotator, a text summarizer, and a dependency parser. Using these tools we derive a large set of features, on which we build a binary classifier that distinguishes salient from non-salient entities. The efficiency and effectiveness of the resulting system are evaluated through a large-scale experiment on the well-known annotated New York Times dataset.
