
N-Poisson Document

By Eugene L. Margulis

Abstract

This paper reports a study of the validity of the Multiple Poisson (nP) model of word distribution in document collections. An nP distribution is a mixture of n Poisson distributions with different means. We describe a practical algorithm for determining whether a given word is distributed according to an nP distribution and for computing the distribution parameters. The algorithm was applied to every word in four different document collections. Over 70% of frequently occurring words and terms were found to behave according to nP distributions. The results indicate that the proportion of nP words depends on the collection size, the document length and the frequency of the individual words. Most of the nP words recognised are distributed according to a mixture of relatively few single Poisson distributions (two, three or four). There is an indication that the number of single Poisson components in the mixture depends on the collection frequency of words.
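
As an illustration of the definition above, the following minimal sketch evaluates the probability that a word occurs k times in a document under an nP mixture. It is not the paper's fitting algorithm, only the standard mixture formula. The function names, the mixing weights and the example means are assumptions chosen for illustration and do not come from the paper.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k occurrences under a single Poisson with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def np_pmf(k, weights, means):
    """Probability of k occurrences under a mixture of Poisson components.

    weights: mixing proportions, one per component, summing to 1 (hypothetical values)
    means:   Poisson mean of each component (hypothetical values)
    """
    if len(weights) != len(means):
        raise ValueError("weights and means must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing weights must sum to 1")
    return sum(w * poisson_pmf(k, m) for w, m in zip(weights, means))

# Hypothetical 2P word: most documents mention it rarely, a minority mention it often.
example_weights = [0.7, 0.3]
example_means = [0.2, 3.0]
probabilities = [np_pmf(k, example_weights, example_means) for k in range(5)]
```

With a single component the mixture reduces to an ordinary Poisson distribution, so a 1P word is just the classical single-Poisson case.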

Year: 2011
OAI identifier: oai:CiteSeerX.psu:10.1.1.193.5792
Provided by: CiteSeerX