68 research outputs found
Properties of neutrality tests based on allele frequency spectrum
Population geneticists need statistical tools that can accept or reject the
neutral Wright-Fisher model with high power. A number of statistical tests
have been developed to detect specific deviations from the null frequency
spectrum in different directions (e.g., Tajima's D, Fu and Li's F and D tests,
Fay and Wu's H).
Recently, a general framework was proposed to generate all neutrality tests
that are linear functions of the frequency spectrum. In this framework, a
family of optimal tests was developed to have almost maximum power against a
specific alternative evolutionary scenario. Following these developments, in
this paper we provide a thorough discussion of linear and nonlinear neutrality
tests. First, we present the general framework for linear tests and emphasize
the importance of the property of scalability with the sample size (that is,
the results of the tests should not depend on the sample size), which, if
missing, can lead to errors in data interpretation. The motivation and
structure of linear optimal tests are discussed. In a further generalization,
we develop a general framework for nonlinear neutrality tests and we derive
nonlinear optimal tests for polynomials of any degree in the frequency
spectrum.
Comment: 42 pages, 3 figures, elsarticl
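To make "linear function of the frequency spectrum" concrete, here is a minimal sketch that is not taken from the paper. It computes three standard estimators of θ as weighted sums of the unfolded spectrum and forms Tajima's D with the usual normalization constants from Tajima (1989). The function names and the example spectra are illustrative assumptions.

```python
from math import sqrt


def harmonic(n, power=1):
    """Sum of 1/i**power for i = 1 .. n-1."""
    return sum(1.0 / i**power for i in range(1, n))


def theta_estimators(sfs):
    """Return (theta_W, theta_pi, theta_H) from an unfolded SFS.

    sfs[i-1] is the number of sites whose derived allele appears i times
    in a sample of n = len(sfs) + 1 sequences. Each estimator is a
    weighted sum of the spectrum, i.e. linear in it.
    """
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2.0
    theta_w = sum(sfs) / harmonic(n)  # Watterson
    theta_pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / pairs
    theta_h = sum(i * i * x for i, x in enumerate(sfs, start=1)) / pairs
    return theta_w, theta_pi, theta_h


def tajimas_d(sfs):
    """Tajima's D: (theta_pi - theta_W) divided by its estimated standard deviation."""
    n = len(sfs) + 1
    s = sum(sfs)
    if s == 0:
        return float("nan")  # D is undefined without segregating sites
    a1, a2 = harmonic(n), harmonic(n, 2)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    theta_w, theta_pi, _ = theta_estimators(sfs)
    return (theta_pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))


# Expected neutral spectrum (theta / i) for theta = 6 and n = 7 sequences.
neutral_sfs = [6.0 / i for i in range(1, 7)]
# Hypothetical spectrum made only of singletons (excess of rare variants).
singleton_sfs = [10, 0, 0, 0, 0, 0]
```

On the expected neutral spectrum all three estimators return the same θ, so the numerator of D vanishes; a spectrum dominated by singletons gives a negative D.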
Mlcoalsim: Multilocus Coalescent Simulations
Coalescent theory is a powerful tool for population geneticists as well as molecular biologists interested in understanding the patterns and levels of DNA variation. Using coalescent Monte Carlo simulations it is possible to obtain the empirical distributions for a number of statistics across a wide range of evolutionary models; these distributions can be used to test evolutionary hypotheses using experimental data. The mlcoalsim application presented here (based on a version of the ms program, Hudson, 2002) adds important new features to improve methodology (uncertainty and conditional methods for mutation and recombination), models (including strong positive selection, finite sites and heterogeneity in mutation and recombination rates) and analyses (calculating a number of statistics used in population genetics and P-values for observed data). One of the most important features of mlcoalsim is the analysis of multilocus data in linked and independent regions. In summary, mlcoalsim is an integrated software application aimed at researchers interested in molecular evolution. mlcoalsim is written in ANSI C and is available at: http://www.ub.es/softevol/mlcoalsim
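The general idea of obtaining an empirical null distribution by coalescent Monte Carlo simulation can be sketched as follows. This is a deliberately simplified illustration, not the algorithm or interface of mlcoalsim or ms: it assumes a single neutral locus without recombination, uses the number of segregating sites S as the only statistic, and all function names and parameter values are hypothetical.

```python
import random


def simulate_segregating_sites(n, theta, rng):
    """Draw S for one neutral, non-recombining locus.

    Time is measured in units of 2N generations, so k lineages coalesce
    at rate k(k-1)/2 and mutations fall on the tree at rate theta/2 per
    unit of branch length (theta = 4*N*mu).
    """
    total_length = 0.0
    for k in range(n, 1, -1):
        # k lineages persist for an exponential waiting time.
        total_length += k * rng.expovariate(k * (k - 1) / 2.0)
    # Poisson draw: count exponential inter-arrival times within the tree length.
    count = 0
    position = rng.expovariate(theta / 2.0)
    while position < total_length:
        count += 1
        position += rng.expovariate(theta / 2.0)
    return count


def empirical_p_value(observed, simulated):
    """One-sided P-value: fraction of simulated values <= observed."""
    return sum(1 for s in simulated if s <= observed) / len(simulated)


rng = random.Random(1)
# Null distribution of S for n = 10 sequences and theta = 5 (made-up values).
null_draws = [simulate_segregating_sites(10, 5.0, rng) for _ in range(5000)]
# Probability of seeing 4 or fewer segregating sites under this null model.
p_low = empirical_p_value(4, null_draws)
```

The mean of the simulated values should be close to θ times the harmonic sum over 1/i for i = 1 to n-1, which gives a quick sanity check on the simulator.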
Decomposing the site frequency spectrum: the impact of tree topology on neutrality tests
We investigate the dependence of the site frequency spectrum (SFS) on the
topological structure of genealogical trees. We show that basic population
genetic statistics - for instance estimators of θ or neutrality tests
such as Tajima's D - can be decomposed into components of waiting times
between coalescent events and of tree topology. Our results clarify the
relative impact of the two components on these statistics. We provide a
rigorous interpretation of positive or negative values of an important class of
neutrality tests in terms of the underlying tree shape. In particular, we show
that values of Tajima's D and Fay and Wu's H depend in a direct way on a
peculiar measure of tree balance which is mostly determined by the root balance
of the tree. We present a new test for selection in the same class as Fay and
Wu's H and discuss its interpretation and power. Finally, we determine the
trees corresponding to extreme expected values of these neutrality tests and
present formulae for these extreme values as a function of sample size and
number of segregating sites.
Comment: 23 pages, 8 figure
A generalized Watterson estimator for next-generation sequencing : from trios to autopolyploids
Several variations of the Watterson estimator of variability for Next Generation Sequencing (NGS) data have been proposed in the literature. We present a unified framework for generalized Watterson estimators based on Maximum Composite Likelihood, which encompasses most of the existing estimators. We propose this class of unbiased estimators as generalized Watterson estimators for a large class of NGS data, including pools and trios. We also discuss the relation with the estimators proposed in the literature and show that they admit two equivalent but seemingly different forms, deriving a set of combinatorial identities as a byproduct. Finally, we give a detailed treatment of Watterson estimators for single or multiple autopolyploid individuals
The expected neutral frequency spectrum of linked sites
We present an exact, closed expression for the expected neutral Site
Frequency Spectrum for two neutral sites, 2-SFS, without recombination. This
spectrum is the immediate extension of the well-known single site
neutral SFS. Similar formulae are also provided for the case of the expected
SFS of sites that are linked to a focal neutral mutation of known frequency.
Formulae for finite samples are obtained by coalescent methods and remarkably
simple expressions are derived for the SFS of a large population, which are
also solutions of the multi-allelic Kolmogorov equations. Besides the general
interest of these new spectra, they relate to interesting biological cases such
as structural variants and introgressions. As an example, we present the
expected neutral frequency spectrum of regions with a chromosomal inversion.
Comment: 26 pages, 5 figure
The Site Frequency/Dosage Spectrum of Autopolyploid Populations
The Site Frequency Spectrum (SFS) and the heterozygosity of allelic variants are among the most important summary statistics for population genetic analysis of diploid organisms. We discuss the generalization of these statistics to populations of autopolyploid organisms in terms of the joint Site Frequency/Dosage Spectrum and its expected value for autopolyploid populations that follow the standard neutral model. Based on these results, we present estimators of nucleotide variability from High-Throughput Sequencing (HTS) data of autopolyploids and discuss potential issues related to sequencing errors and variant calling. We use these estimators to generalize Tajima's D and other SFS-based neutrality tests to HTS data from autopolyploid organisms. Finally, we discuss how these approaches fail when the number of individuals is small. In fact, in autopolyploids there are many possible deviations from the Hardy–Weinberg equilibrium, each reflected in a different shape of the individual dosage distribution. The SFS from small samples is often dominated by the shape of these deviations of the dosage distribution from its Hardy–Weinberg expectations
PopGenome: an efficient Swiss army knife for population genomic analyses in R
Although many computer programs can perform population genetics calculations, they are typically limited in the analyses and data input formats they offer; few applications can process the large data sets produced by whole-genome resequencing projects. Furthermore, there is no coherent framework for the easy integration of new statistics into existing pipelines, hindering the development and application of new population genetics and genomics approaches. Here, we present PopGenome, a population genomics package for the R software environment (a de facto standard for statistical analyses). PopGenome can efficiently process genome-scale data as well as large sets of individual loci. It reads DNA alignments and single-nucleotide polymorphism (SNP) data sets in most common formats, including those used by the HapMap, 1000 human genomes, and 1001 Arabidopsis genomes projects. PopGenome also reads associated annotation files in GFF format, enabling users to easily define regions or classify SNPs based on their annotation; all analyses can also be applied to sliding windows. PopGenome offers a wide range of population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination. PopGenome is linked to Hudson's MS and Ewing's MSMS programs to assess statistical significance based on coalescent simulations. PopGenome's integration in R facilitates effortless and reproducible downstream analyses as well as the production of publication-quality graphics. Developers can easily incorporate new analysis methods into the PopGenome framework. PopGenome and R are freely available from CRAN for all major operating systems under the GNU General Public License
Approaching long genomic regions and large recombination rates with msParSm as an alternative to MaCS
The msParSm application is an evolution of msPar, the parallel version of the coalescent simulation program ms. It removes the limitation on simulating long stretches of DNA sequence with large recombination rates, without compromising the accuracy of the standard coalescent. This work introduces msParSm, describes its significant performance improvements over msPar and the details of its shared-memory parallelization, and shows how it can achieve execution times better than, or at least similar to, those of MaCS. Two case studies with different mutation rates were analyzed, one approximating the human average and the other approximating the Drosophila melanogaster average. Source code is available at https://github.com/cmontemuino/msparsm
The identification of runs of homozygosity gives a focus on the genetic diversity and the adaptation of the "Charolais de Cuba" cattle
Inbreeding and effective population size (Ne) are fundamental indicators for the management and conservation of genetic diversity in populations. Genomic inbreeding gives accurate estimates of inbreeding, and the Ne determines the rate of the loss of genetic variation. The objective of this work was to study the distribution of runs of homozygosity (ROHs) in order to estimate genomic inbreeding (FROH) and effective population size using 38,789 Single Nucleotide Polymorphisms (SNPs) from the Illumina Bovine 50K BeadChip in 86 samples from populations of Charolais de Cuba (n = 40) cattle and to compare this information with French (n = 20) and British Charolais (n = 26) populations. In the Cuban, French, and British Charolais populations, the average estimated genomic inbreeding values using the FROH statistics were 5.7%, 3.4%, and 4%, respectively. The dispersion measured by the coefficient of variation was high at 43.9%, 37.0%, and 54.2%, respectively. The effective population size experienced a very similar decline during the last century in Charolais de Cuba (from 139 to 23 individuals), in French Charolais (from 142 to 12), and in British Charolais (from 145 to 14) over the last ~20 generations. However, the high variability found in the ROH indicators and FROH reveals an opportunity for maintaining the genetic diversity of this breed with an adequate mating strategy, which can be favored with the use of molecular markers. Moreover, the detected ROHs were compared to previous results obtained on the detection of signatures of selection in the same breed. Some of the observed signatures were confirmed by the ROHs, emphasizing the process of adaptation to tropical climate experienced by the Charolais de Cuba population.
Other funding: CERCA Programme/Generalitat de Catalunya.
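The abstract does not spell out how FROH is computed. A common definition, assumed here, is the summed length of an individual's ROH segments divided by the length of the autosomal genome covered by the markers. The sketch below applies that definition and then computes the coefficient of variation across individuals. The function names, segment lengths, and genome length are made-up values for illustration, not data from the study.

```python
from statistics import mean, stdev


def f_roh(roh_lengths_bp, genome_length_bp):
    """Genomic inbreeding: fraction of the genome lying inside ROH segments."""
    return sum(roh_lengths_bp) / genome_length_bp


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical autosomal genome length covered by the SNP panel, in base pairs.
GENOME_BP = 2_500_000_000

# Hypothetical ROH segment lengths (bp) for three animals.
individuals = {
    "animal_1": [40_000_000, 35_000_000],
    "animal_2": [50_000_000, 75_000_000],
    "animal_3": [100_000_000],
}

f_values = {name: f_roh(segments, GENOME_BP) for name, segments in individuals.items()}
cv_percent = coefficient_of_variation(list(f_values.values()))
```

With these invented inputs the three animals have FROH values of 3%, 5%, and 4%, and the coefficient of variation is 25%.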