Abstract. We address in this report the problem of representing formally textual data. First, this problem is replaced in the context of automatic text processing. Then, the weaknesses of the basic document representation, i.e. the bag-of-words representation, are explained and some state-ofthe-art methods claiming to overcome these weaknesses are reviewed. Moreover we propose a novel graphical model, the Theme Topic Mixture Model, which also claims to do so, in addition of giving a probabilistic framework in which documents are considered. 2 IDIAP–RR 03-7
To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.