The differences between DNA-sequences within a population are the basis to
infer the ancestral relationship of the individuals. Within the classical
infinitely many sites model, it is possible to estimate the mutation rate based
on the site frequency spectrum, which consists of the numbers
$C_1, \ldots, C_{n-1}$, where $n$ is the sample size and $C_s$ is the number of site
mutations (Single Nucleotide Polymorphisms, SNPs) that are seen in exactly $s$
genomes. Classical results can be used to compare the observed site frequency
spectrum with its neutral expectation, $E[C_s] = \theta_2/s$, where $\theta_2$
is the scaled site mutation rate. In this paper, we will relax the assumption
of the infinitely many sites model that all individuals only carry homologous
genetic material. In particular, it is now well known that bacterial genomes
have the ability to gain and lose genes, such that every single genome is a
mosaic of genes, and genes are present and absent in a random fashion, giving
rise to the dispensable genome. While this presence and absence has been
modeled under neutral evolution within the infinitely many genes model in
previous papers, we link presence and absence of genes with the numbers of site
mutations seen within each gene. In this work, we derive a formula for the
expectation of the joint gene and site frequency spectrum $G_{k,s}$, which denotes
the number of mutated sites occurring in exactly $s$ gene sequences while the
corresponding gene is present in exactly $k$ individuals. We show that standard
estimators of $\theta_2$ for dispensable genes are biased and that the site
frequency spectrum for dispensable genes differs from the classical result.

Comment: 24 pages, 8 figures
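
As an illustration of the classical setting only, the following minimal sketch computes a site frequency spectrum from a 0/1 alignment and applies Watterson's estimator, which follows from summing the neutral expectation $E[C_s] = \theta_2/s$ over $s = 1, \ldots, n-1$. It does not implement the joint gene and site frequency spectrum derived in this work. The function names and the toy alignment are assumptions made for the example; all genes are assumed present in every sampled genome, which is exactly the assumption that fails for dispensable genes.

```python
from fractions import Fraction


def site_frequency_spectrum(haplotypes):
    """Return [C_1, ..., C_{n-1}] for n equal-length rows of 0/1 alleles.

    A 1 marks the derived (mutant) allele. Sites where the derived allele
    is carried by no genome or by all n genomes are not polymorphic and
    are skipped.
    """
    n = len(haplotypes)
    counts = [0] * (n - 1)
    for column in zip(*haplotypes):  # iterate over sites
        s = sum(column)              # number of genomes carrying the mutation
        if 0 < s < n:
            counts[s - 1] += 1
    return counts


def expected_sfs(theta, n):
    """Neutral expectation E[C_s] = theta / s for s = 1, ..., n-1."""
    return [theta / s for s in range(1, n)]


def watterson_estimate(sfs):
    """Classical Watterson estimator: segregating sites / H_{n-1}."""
    n = len(sfs) + 1
    harmonic = sum(Fraction(1, s) for s in range(1, n))
    return Fraction(sum(sfs)) / harmonic


# Hypothetical toy alignment: 4 genomes (rows), 5 sites (columns).
sample = [
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
]

sfs = site_frequency_spectrum(sample)
theta_hat = watterson_estimate(sfs)
print(sfs, theta_hat)
```

Exact fractions are used so that the estimate can be compared directly with the expectation without floating-point noise. For a gene present in only $k < n$ genomes, this calculation is the kind of standard estimator that the abstract reports to be biased.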