Shrinkage Effect in Ancestral Maximum Likelihood
Ancestral maximum likelihood (AML) is a method that simultaneously
reconstructs a phylogenetic tree and ancestral sequences from extant data
(sequences at the leaves). The tree and ancestral sequences maximize the
probability of observing the given data under a Markov model of sequence
evolution, in which branch lengths are also optimized but constrained to take
the same value on any edge across all sequence sites. AML differs from the more
usual form of maximum likelihood (ML) in phylogenetics because ML averages over
all possible ancestral sequences. ML has long been known to be statistically
consistent -- that is, it converges on the correct tree with probability
approaching 1 as the sequence length grows. However, the statistical
consistency of AML has not been formally determined, despite informal remarks
in a literature that dates back 20 years. In this short note we prove a general
result that implies that AML is statistically inconsistent. In particular we
show that AML can `shrink' short edges in a tree, resulting in a tree that has
no internal resolution as the sequence length grows. Our results apply to any
number of taxa
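The contrast between the two criteria described in this abstract can be written out as follows. This is only a sketch in notation assumed here, not taken from the paper: D_i denotes the observed leaf states at site i, a_i the unobserved ancestral states at that site, T the tree topology, t the branch lengths shared across all k sites.

```latex
% D_i: observed leaf states at site i; a_i: ancestral states at site i
% T: tree topology; t: branch lengths shared across all k sites
\begin{align*}
\text{ML:}\quad  & \max_{T,\,t}\; \prod_{i=1}^{k} \sum_{a_i} P(D_i, a_i \mid T, t) \\
\text{AML:}\quad & \max_{T,\,t,\,a_1,\dots,a_k}\; \prod_{i=1}^{k} P(D_i, a_i \mid T, t)
\end{align*}
```

ML sums over the ancestral states, whereas AML treats them as additional quantities to be optimized together with the tree and branch lengths.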
Improving population-specific allele frequency estimates by adapting supplemental data: an empirical Bayes approach
Estimation of the allele frequency at genetic markers is a key ingredient in
biological and biomedical research, such as studies of human genetic variation
or of the genetic etiology of heritable traits. As genetic data becomes
increasingly available, investigators face a dilemma: when should data from
other studies and population subgroups be pooled with the primary data? Pooling
additional samples will generally reduce the variance of the frequency
estimates; however, used inappropriately, pooled estimates can be severely
biased due to population stratification. Because of this potential bias, most
investigators avoid pooling, even for samples with the same ethnic background
and residing on the same continent. Here, we propose an empirical Bayes
approach for estimating allele frequencies of single nucleotide polymorphisms.
This procedure adaptively incorporates genotypes from related samples, so that
more similar samples have a greater influence on the estimates. In every
example we have considered, our estimator achieves a mean squared error (MSE)
that is smaller than that of either pooling or not pooling, and sometimes substantially
improves over both extremes. The bias introduced is small, as is shown by a
simulation study that is carefully matched to a real data example. Our method
is particularly useful when small groups of individuals are genotyped at a
large number of markers, a situation we are likely to encounter in a
genome-wide association study. Comment: Published at http://dx.doi.org/10.1214/07-AOAS121 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
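The trade-off between pooling and not pooling can be illustrated with a simple conjugate shrinkage estimate. The sketch below is not the estimator proposed in the paper: it assumes a Beta prior centred on the supplemental sample's allele frequency, and the function name and the `prior_strength` parameter are hypothetical. In the paper's approach the amount of borrowing is adapted to the data, whereas here it is set by hand.

```python
def shrunken_allele_frequency(primary_count, primary_total,
                              supp_count, supp_total, prior_strength):
    """Posterior mean allele frequency for the primary sample.

    Assumes a Beta prior centred on the supplemental frequency, worth
    `prior_strength` pseudo-observations (a hypothetical tuning parameter).
    """
    if primary_total <= 0 or supp_total <= 0:
        raise ValueError("sample sizes must be positive")
    if prior_strength < 0:
        raise ValueError("prior_strength must be non-negative")
    supp_freq = supp_count / supp_total
    # Beta(alpha, beta) prior with mean supp_freq and total weight prior_strength.
    alpha = prior_strength * supp_freq
    # Posterior mean of a Beta-binomial model: (x + alpha) / (n + alpha + beta).
    return (primary_count + alpha) / (primary_total + prior_strength)


# Illustrative numbers only: 8 of 40 alleles in the primary sample,
# 300 of 1000 in the supplemental sample.
no_pooling = shrunken_allele_frequency(8, 40, 300, 1000, prior_strength=0)
partial = shrunken_allele_frequency(8, 40, 300, 1000, prior_strength=40)
full_pooling = shrunken_allele_frequency(8, 40, 300, 1000, prior_strength=1000)
```

Setting `prior_strength` to 0 reproduces the unpooled estimate, and setting it to the supplemental sample size reproduces full pooling; intermediate values fall between the two extremes.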
Survival analysis of DNA mutation motifs with penalized proportional hazards
Antibodies, an essential part of our immune system, develop through an
intricate process to bind a wide array of pathogens. This process involves
randomly mutating DNA sequences encoding these antibodies to find variants with
improved binding, though mutations are not distributed uniformly across
sequence sites. Immunologists observe this nonuniformity to be consistent with
"mutation motifs", which are short DNA subsequences that affect how likely a
given site is to experience a mutation. Quantifying the effect of motifs on
mutation rates is challenging: a large number of possible motifs makes this
statistical problem high dimensional, while the unobserved history of the
mutation process leads to a nontrivial missing data problem. We introduce an
$\ell_1$-penalized proportional hazards model to infer mutation motifs and
their effects. In order to estimate model parameters, our method uses a Monte
Carlo EM algorithm to marginalize over the unknown ordering of mutations. We
show that our method performs better on simulated data compared to current
methods and leads to more parsimonious models. The application of proportional
hazards to mutation processes is, to our knowledge, novel and formalizes the
current methods in a statistical framework that can be easily extended to
analyze the effect of other biological features on mutation rates
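One way to see why a penalized fit can yield more parsimonious models is the soft-thresholding operator, which is the proximal operator of a lasso-type (L1) penalty. The sketch below assumes that penalty type and is not the paper's Monte Carlo EM procedure; the motif names and coefficient values are made up purely for illustration.

```python
def soft_threshold(value, penalty):
    """Proximal operator of penalty * |x|: shrink toward zero, clip at zero."""
    if penalty < 0:
        raise ValueError("penalty must be non-negative")
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def sparsify(coefficients, penalty):
    """Apply soft-thresholding to every coefficient in a name -> value mapping."""
    return {name: soft_threshold(v, penalty) for name, v in coefficients.items()}


# Hypothetical motif effects (log hazard ratios); values are illustrative only.
raw_effects = {"motif_A": 2.0, "motif_B": -0.25, "motif_C": -2.0}
sparse_effects = sparsify(raw_effects, penalty=0.5)
```

Coefficients whose magnitude does not exceed the penalty are set exactly to zero, so the corresponding motifs drop out of the model, while larger effects are kept but shrunk toward zero.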
Evolution of the eyes of vipers with and without infrared-sensing pit organs
We examined lens and brille transmittance, photoreceptors, visual pigments, and visual opsin gene sequences of viperid snakes with and without infrared-sensing pit organs. Ocular media transmittance is high in both groups. Contrary to previous reports, small as well as large single cones occur in pit vipers. Non-pit vipers differ from pit vipers in having a two-tiered retina, but few taxa have been examined for this poorly understood feature. All vipers sampled express rh1, sws1 and lws visual opsin genes. Opsin spectral tuning varies but not in accordance with the presence/absence of pit organs, and not always as predicted from gene sequences. The visual opsin genes were generally under purifying selection, with positive selection at spectral tuning amino acids in RH1 and SWS1 opsins, and at retinal pocket stabilization sites in RH1 or LWS (and without substantial differences between pit and non-pit vipers). Lack of evidence for sensory trade-off between viperid eyes (in the aspects examined) and pit organs might be explained by the high degree of neural integration of vision and infrared detection; the latter representing an elaboration of an existing sense with addition of a novel sense organ, rather than involving the evolution of a wholly novel sensory system
Inferring dynamic genetic networks with low order independencies
In this paper, we propose a novel inference method for dynamic genetic
networks that can cope with a number of time measurements n
much smaller than the number of genes p. The approach is based on the concept
of low order conditional dependence graphs, which we extend here to the case of
Dynamic Bayesian Networks. Most of our results are based on the theory of
graphical models associated with the Directed Acyclic Graphs (DAGs). In this
way, we define a minimal DAG G which describes exactly the full order
conditional dependencies given the past of the process. Then, to cope with the
large p and small n estimation case, we propose to approximate DAG G by
considering low order conditional independencies. We introduce partial qth
order conditional dependence DAGs G(q) and analyze their probabilistic
properties. In general, DAGs G(q) differ from DAG G but still reflect relevant
dependence facts for sparse networks such as genetic networks. By using this
approximation, we set out a non-Bayesian inference method and demonstrate the
effectiveness of this approach on both simulated and real data analysis. The
inference procedure is implemented in the R package 'G1DBN' freely available
from the CRAN archive
Bayesian estimation of species divergence times using correlated quantitative characters
Discrete morphological data have been widely used to study species evolution, but the use of quantitative (or continuous) morphological characters is less common. Here, we implement a Bayesian method to estimate species divergence times using quantitative characters. Quantitative character evolution is modelled using Brownian diffusion with character correlation and character variation within populations. Through simulations, we demonstrate that ignoring the population variation (or population “noise”) and the correlation among characters leads to biased estimates of divergence times and rate, especially if the correlation and population noise are high. We apply our new method to the analysis of quantitative characters (cranium landmarks) and molecular data from carnivoran mammals. Our results show that time estimates are affected by whether the correlations and population noise are accounted for or ignored in the analysis. The estimates are also affected by the type of data analysed, with analyses of morphological characters only, molecular data only, or a combination of both; showing noticeable differences among the time estimates. Rate variation of morphological characters among the carnivoran species appears to be very high, with Bayesian model selection indicating that the independent-rates model fits the morphological data better than the autocorrelated-rates model. We suggest that using morphological continuous characters, together with molecular data, can bring a new perspective to the study of species evolution. Our new model is implemented in the MCMCtree computer program for Bayesian inference of divergence times
Polyploidy breaks speciation barriers in Australian burrowing frogs Neobatrachus
Polyploidy has played an important role in evolution across the tree of life but it is still unclear how polyploid lineages may persist after their initial formation. While both common and well-studied in plants, polyploidy is rare in animals and generally less understood. The Australian burrowing frog genus Neobatrachus is comprised of six diploid and three polyploid species and offers a powerful animal polyploid model system. We generated exome-capture sequence data from 87 individuals representing all nine species of Neobatrachus to investigate species-level relationships, the origin and inheritance mode of polyploid species, and the population genomic effects of polyploidy on genus-wide demography. We describe rapid speciation of diploid Neobatrachus species and show that the three independently originated polyploid species have tetrasomic or mixed inheritance. We document higher genetic diversity in tetraploids, resulting from widespread gene flow between the tetraploids, asymmetric inter-ploidy gene flow directed from sympatric diploids to tetraploids, and isolation of diploid species from each other. We also constructed models of ecologically suitable areas for each species to investigate the impact of climate on differing ploidy levels. These models suggest substantial change in suitable areas compared to past climate, which correspond to population genomic estimates of demographic histories. We propose that Neobatrachus diploids may be suffering the early genomic impacts of climate-induced habitat loss, while tetraploids appear to be avoiding this fate, possibly due to widespread gene flow. Finally, we demonstrate that Neobatrachus is an attractive model to study the effects of ploidy on the evolution of adaptation in animals
Bayesian Statistical Methods for Genetic Association Studies with Case-Control and Cohort Design
Large-scale genetic association studies are carried out with the hope of discovering single
nucleotide polymorphisms involved in the etiology of complex diseases. We propose a
coalescent-based model for association mapping which potentially increases the power to
detect disease-susceptibility variants in genetic association studies with case-control and cohort
design. The approach uses Bayesian partition modelling to cluster haplotypes with
similar disease risks by exploiting evolutionary information. We focus on candidate gene
regions and we split the chromosomal region of interest into sub-regions or windows of high
linkage disequilibrium (LD) therein assuming a perfect phylogeny. The haplotype space is
then partitioned into disjoint clusters within which the phenotype-haplotype association is
assumed to be the same. The novelty of our approach consists in the fact that the distance
used for clustering haplotypes has an evolutionary interpretation, as haplotypes are clustered
according to the time to their most recent common mutation. Our approach is fully
Bayesian and we develop Markov Chain Monte Carlo algorithms to sample efficiently over
the space of possible partitions. We have also developed a Bayesian survival regression model
for high-dimension and small sample size settings. We provide a Bayesian variable selection
procedure and shrinkage tool by imposing shrinkage priors on the regression coefficients. We
have developed a computationally efficient optimization algorithm to explore the posterior
surface and find the maximum a posteriori estimates of the regression coefficients. We compare
the performance of the proposed methods in simulation studies and using real datasets
to both single-marker analyses and recently proposed multi-marker methods and show that
our methods perform similarly in localizing the causal allele while yielding lower false positive
rates. Moreover, our methods offer computational advantages over other multi-marker
approaches