The distributed genome hypothesis states that the set of genes in a
population of bacteria is distributed over all individuals that belong to a
given taxon. It implies that certain genes can be gained and lost from
generation to generation. We take the random genealogy given by a Kingman
coalescent and superimpose events of gene gain and loss along its ancestral
lines. Gene gains occur at a constant rate along ancestral lines. We assume
that gained genes have never been present in the population before. Gene losses
occur at a rate proportional to the number of genes present along the ancestral
line. In this infinitely many genes model we derive moments for several
statistics within a sample: the average number of genes per individual, the
average number of genes differing between individuals, the number of
incongruent pairs of genes, the total number of different genes in the sample
and the gene frequency spectrum. We demonstrate that the model gives a
reasonable fit with gene frequency data from marine cyanobacteria.

Comment: Published in the Annals of Applied Probability
(http://www.imstat.org/aap/) by the Institute of Mathematical Statistics
(http://www.imstat.org) at http://dx.doi.org/10.1214/09-AAP657
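
The model summarized in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: the function names and the parameters `gain` (rate of gene gain per ancestral line) and `loss` (rate of loss per gene present) are hypothetical and do not follow the paper's notation, and time is measured in coalescent units. As a simplifying assumption, the gene set at the root is drawn from the stationary Poisson distribution with mean `gain / loss` of the gain-loss process along a single line. Genes gained and lost again within one branch are never observed in the sample, so only the genes surviving to the end of each branch are drawn.

```python
import itertools
import math
import random
from collections import Counter


def poisson(rng, mean):
    """Draw a Poisson(mean) variate by counting unit-rate arrivals in [0, mean)."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def kingman_tree(n, rng):
    """Simulate a Kingman coalescent for n samples (leaves are nodes 0..n-1)."""
    height = {i: 0.0 for i in range(n)}
    children = {i: [] for i in range(n)}
    active = list(range(n))
    t, next_id = 0.0, n
    while len(active) > 1:
        k = len(active)
        # With k lineages, each of the k*(k-1)/2 pairs coalesces at rate 1.
        t += rng.expovariate(k * (k - 1) / 2)
        a, b = rng.sample(active, 2)
        active.remove(a)
        active.remove(b)
        children[next_id] = [a, b]
        height[next_id] = t
        active.append(next_id)
        next_id += 1
    return children, height, active[0]


def simulate_gene_sets(n, gain, loss, rng):
    """Return the gene sets of n sampled individuals (assumed parametrization)."""
    children, height, root = kingman_tree(n, rng)
    fresh = itertools.count()  # every gained gene is new: infinitely many genes
    # Assumption: the root is in the stationary state, Poisson(gain / loss).
    genes = {root: {next(fresh) for _ in range(poisson(rng, gain / loss))}}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            length = height[node] - height[child]
            keep = math.exp(-loss * length)  # survival probability of one gene
            inherited = {g for g in genes[node] if rng.random() < keep}
            # Genes gained on this branch that are still present at its end.
            n_new = poisson(rng, gain / loss * (1.0 - keep))
            genes[child] = inherited | {next(fresh) for _ in range(n_new)}
            stack.append(child)
    return [genes[i] for i in range(n)]


def gene_frequency_spectrum(gene_sets):
    """Entry k-1 is the number of genes carried by exactly k individuals."""
    carriers = Counter(g for s in gene_sets for g in s)
    spectrum = Counter(carriers.values())
    return [spectrum.get(k, 0) for k in range(1, len(gene_sets) + 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    sample = simulate_gene_sets(n=8, gain=20.0, loss=2.0, rng=rng)
    print("genes per individual:", [len(s) for s in sample])
    print("total different genes:", len(set().union(*sample)))
    print("gene frequency spectrum:", gene_frequency_spectrum(sample))
```

Averaging these statistics over many independent runs gives Monte Carlo estimates that could be compared with the analytical moments; under the parametrization assumed here, the average number of genes per individual should be close to `gain / loss`.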