ABSTRACT Representing Documents with Named Entities for


Several information organization, access, and filtering systems can benefit from different kind of document representations than those used in traditional Information Retrieval (IR). Topic Detection and Tracking (TDT) is an example of such an application. In this paper we demonstrate that named entities serve as better choices of units for document representation over all words. In order to test this hypothesis we study the effect of words-based and entity-based representations on Story Link Detection (SLD)- a core task in TDT research. The experiments on TDT corpora show that entity-based representations give significant improvements for SLD. We also propose a mechanism to expand the set of named entities used for document representation, which enhances the performance in some cases. We then take a step further and analyze the limitations of using only named entities for the document representation. Our studies and experiments indicate that adding additional topical terms can help in addressing such limitations

Similar works

Full text



Last time updated on 22/10/2014

This paper was published in CiteSeerX.

Having an issue?

Is data on this page outdated, violates copyrights or anything else? Report the problem now and we will take corresponding actions after reviewing your request.