Advanced Document Description, a Sequential Approach

Doucet, Antoine

thesis

Authors: Antoine Doucet
Publication date: 1 January 2005
Publisher: Helsingfors universitet

Abstract

To process documents efficiently, information systems need simple document models that can be handled in fewer operations. This problem of document representation is not trivial. For decades, researchers have tried to combine relevant document representations with efficient processing. Documents are commonly represented by vectors in which each dimension corresponds to a word of the document. This approach is termed “bag of words”, as it entirely ignores the relative positions of words. One natural improvement over this representation is the extraction and use of cohesive word sequences. In this dissertation, we consider the problem of the extraction, selection and exploitation of word sequences, with a particular focus on the applicability of our work to domain-independent document collections written in any language.
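
As a rough illustration of the contrast drawn in the abstract, the following minimal Python sketch compares a bag-of-words count with a count of adjacent word pairs. The function names `bag_of_words` and `word_bigrams` are hypothetical, whitespace tokenization is an assumption, and adjacent pairs are used only as the simplest stand-in for a word sequence. This is not the extraction method of the dissertation.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences; the relative positions of words are discarded."""
    return Counter(text.lower().split())


def word_bigrams(text):
    """Count adjacent word pairs (an illustrative stand-in for word sequences)."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


a = "dog bites man"
b = "man bites dog"

# The two sentences contain the same words, so their bags of words are equal.
print(bag_of_words(a) == bag_of_words(b))

# Their adjacent word pairs differ, so a sequence-based view separates them.
print(word_bigrams(a) == word_bigrams(b))
```

The two sentences are indistinguishable under the bag-of-words view but not once word order is taken into account, which is the kind of information a sequence-based description can retain.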

Available Versions

Helsingin yliopiston digitaalinen arkisto

oai:helda.helsinki.fi:10138/21...

Last time updated on 30/08/2013