A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity
Ortholog detection (OD) is a critical step for comparative genomic analysis
of protein-coding sequences. In this paper, we begin with a comprehensive
comparison of four popular, methodologically diverse OD methods: MultiParanoid,
Blat, Multiz, and OMA. In head-to-head comparisons, these methods are shown to
significantly outperform one another 12-30% of the time. This high
complementarity motivates the presentation of the first tool for integrating
methodologically diverse OD methods. We term this program MOSAIC, or Multiple
Orthologous Sequence Analysis and Integration by Cluster optimization. Relative
to component and competing methods, we demonstrate that MOSAIC more than
quintuples the number of alignments for which all species are present, while
simultaneously maintaining or improving functional-, phylogenetic-, and
sequence identity-based measures of ortholog quality. Further, we demonstrate
that this improvement in alignment quality yields 40-280% more confidently
aligned sites. Combined, these factors translate to higher estimated levels of
overall conservation, while at the same time allowing for the detection of up
to 180% more positively selected sites. MOSAIC is available as a Python package.
MOSAIC alignments, source code, and full documentation are available at
http://pythonhosted.org/bio-MOSAIC
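To make the idea of integrating complementary ortholog calls concrete, here is a minimal sketch in Python. It is not MOSAIC's actual algorithm or API: the function names, the data layout, and the use of percent identity to a reference sequence as the sole selection criterion are all assumptions made for illustration.

```python
def percent_identity(a, b):
    """Percent identity of two aligned sequences, skipping gapped columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def integrate_calls(reference, proposals):
    """Pick, per species, the proposed ortholog most similar to the reference.

    proposals maps a (hypothetical) method name to a dict of
    species -> aligned candidate sequence.
    Returns a dict of species -> (method, sequence).
    """
    best = {}
    for method, calls in proposals.items():
        for species, seq in calls.items():
            score = percent_identity(reference, seq)
            if species not in best or score > best[species][0]:
                best[species] = (score, method, seq)
    return {sp: (method, seq) for sp, (_, method, seq) in best.items()}


# Toy data: each method alone covers two species; together they cover three.
reference = "MKTAYIAK"
proposals = {
    "method_a": {"chimp": "MKTAYIAK", "mouse": "MKSAYLAR"},
    "method_b": {"mouse": "MKTAYIAR", "dog": "MKTA--AK"},
}
chosen = integrate_calls(reference, proposals)
```

In this toy case the combined result contains all three species, and the mouse sequence is taken from whichever method proposed the closer match, which mirrors the complementarity argument made in the abstract.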
Robust forward simulations of recurrent hitchhiking
Evolutionary forces shape patterns of genetic diversity within populations
and contribute to phenotypic variation. In particular, recurrent positive
selection has attracted significant interest in both theoretical and empirical
studies. However, most existing theoretical models of recurrent positive
selection cannot easily incorporate realistic confounding effects such as
interference between selected sites, arbitrary selection schemes, and
complicated demographic processes. It is possible to quantify the effects of
arbitrarily complex evolutionary models by performing forward population
genetic simulations, but forward simulations can be computationally prohibitive
for large population sizes. A common approach for overcoming these
computational limitations is rescaling of the most computationally expensive
parameters, especially population size. Here, we show that ad hoc approaches to
parameter rescaling under the recurrent hitchhiking model do not always provide
sufficiently accurate dynamics, potentially skewing patterns of diversity in
simulated DNA sequences. We derive an extension of the recurrent hitchhiking
model that is appropriate for strong selection in small population sizes, and
use it to develop a method for parameter rescaling that provides the best
possible computational performance for a given error tolerance. We perform a
detailed theoretical analysis of the robustness of rescaling across the
parameter space. Finally, we apply our rescaling algorithms to parameters that
were previously inferred for Drosophila, and discuss practical considerations
such as interference between selected sites.
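The kind of ad hoc rescaling that the abstract cautions against can be sketched as follows. This is the conventional textbook recipe (shrink the population size by a factor and multiply per-generation rates by the same factor so that products such as N·s, N·μ, and N·r are unchanged), not the authors' improved method. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    pop_size: int      # N, number of individuals
    sel_coeff: float   # s, selection coefficient
    mut_rate: float    # mu, per-site mutation rate per generation
    rec_rate: float    # r, per-site recombination rate per generation


def naive_rescale(params, factor):
    """Shrink N by `factor` and scale s, mu, r up to keep N*s, N*mu, N*r fixed."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    new_n = max(1, round(params.pop_size / factor))
    q = params.pop_size / new_n  # effective factor after rounding N
    return SimParams(
        pop_size=new_n,
        sel_coeff=params.sel_coeff * q,
        mut_rate=params.mut_rate * q,
        rec_rate=params.rec_rate * q,
    )


original = SimParams(pop_size=1_000_000, sel_coeff=0.001,
                     mut_rate=1e-9, rec_rate=1e-8)
scaled = naive_rescale(original, factor=100)
```

Here the rescaled selection coefficient becomes 0.1, which is no longer small. That is the regime of strong selection in small populations in which the abstract reports that naive rescaling can distort patterns of diversity.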
Diffusion Approximations for Demographic Inference: DaDi
Models of demographic history (population sizes, migration rates, and divergence times) inferred from genetic data complement archeology and serve as null models in genome scans for selection. Most current inference methods are computationally limited to considering simple models or non-recombining data. We introduce a method based on a diffusion approximation to the joint frequency spectrum of genetic variation between populations. Our implementation, DaDi, can model up to three interacting populations and scales well to genome-wide data. We have applied DaDi to human data from Africa, Europe, and East Asia, building the most complex statistically well-characterized model of human migration out of Africa to date.
Population Genetics of Rare Variants and Complex Diseases
Identifying drivers of complex traits from the noisy signals of genetic
variation obtained from high throughput genome sequencing technologies is a
central challenge faced by human geneticists today. We hypothesize that the
variants involved in complex diseases are likely to exhibit non-neutral
evolutionary signatures. Uncovering the evolutionary history of all variants is
therefore of intrinsic interest for complex disease research. However, doing so
necessitates the simultaneous elucidation of the targets of natural selection
and population-specific demographic history. Here we characterize the action of
natural selection operating across complex disease categories, and use
population genetic simulations to evaluate the expected patterns of genetic
variation in large samples. We focus on populations that have experienced
historical bottlenecks followed by explosive growth (consistent with most human
populations), and describe the differences between evolutionarily deleterious
mutations and those that are neutral. Genes associated with several complex
disease categories exhibit stronger signatures of purifying selection than
non-disease genes. In addition, loci identified through genome-wide association
studies of complex traits also exhibit signatures consistent with being in
regions recurrently targeted by purifying selection. Through simulations, we
show that population bottlenecks and rapid growth enable deleterious rare
variants to persist at low frequencies just as long as neutral variants, but
deleterious low-frequency and common variants tend to be much younger than neutral
variants. This has resulted in a large proportion of modern-day rare alleles
that have a deleterious effect on function, and that potentially contribute to
disease susceptibility.
Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data
Demographic models built from genetic data play important roles in illuminating prehistorical events and serving as null models in genome scans for selection. We introduce an inference method based on the joint frequency spectrum of genetic variants within and between populations. For candidate models we numerically compute the expected spectrum using a diffusion approximation to the one-locus, two-allele Wright-Fisher process, involving up to three simultaneous populations. Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage to estimate uncertainties for parameters and significance values for hypothesis tests. Our method can also incorporate selection on single sites, predicting the joint distribution of selected alleles among populations experiencing a bevy of evolutionary forces, including expansions, contractions, migrations, and admixture. We model human expansion out of Africa and the settlement of the New World, using 5 Mb of noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) by the Environmental Genome Project. We infer divergence between West African and Eurasian populations 140 thousand years ago (95% confidence interval: 40–270 kya). This is earlier than other genetic studies, in part because we incorporate migration. We estimate the European (CEU) and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17–43 kya), long after archeological evidence places modern humans in Europe. Finally, we estimate divergence between East Asians (CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3–26.9 kya), and our analysis yields no evidence for subsequent migration. 
Furthermore, combining our demographic model with a previously estimated distribution of selective effects among newly arising amino acid mutations accurately predicts the frequency spectrum of nonsynonymous variants across three continental populations (YRI, CHB, CEU).
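The two ingredients named in this abstract, a joint frequency spectrum and a composite likelihood, can be illustrated with a small Python sketch. It assumes each SNP is summarized by its derived-allele count in two populations and that spectrum entries are treated as independent Poisson counts. The function names are hypothetical and this is not the authors' implementation; in particular, the expected spectra below are made-up numbers rather than the output of a diffusion approximation.

```python
import math
from collections import Counter


def joint_sfs(derived_counts):
    """Tally SNPs by (derived count in pop 1, derived count in pop 2)."""
    return Counter(derived_counts)


def poisson_composite_loglik(observed, expected):
    """Sum of independent Poisson log-probabilities over the expected cells.

    expected maps each cell to a strictly positive mean; observed cells that
    are absent from `expected` are ignored in this simplified version.
    """
    loglik = 0.0
    for cell, mean in expected.items():
        k = observed.get(cell, 0)
        loglik += k * math.log(mean) - mean - math.lgamma(k + 1)
    return loglik


# Toy data: four SNPs with derived-allele counts in two populations.
snps = [(1, 0), (1, 0), (0, 1), (2, 1)]
sfs = joint_sfs(snps)

# Two made-up candidate models; the first matches the data more closely.
model_a = {(1, 0): 2.0, (0, 1): 1.0, (2, 1): 1.0}
model_b = {(1, 0): 0.5, (0, 1): 3.0, (2, 1): 1.0}
ll_a = poisson_composite_loglik(sfs, model_a)
ll_b = poisson_composite_loglik(sfs, model_b)
```

Because linked sites are not independent, a likelihood of this form is only a composite likelihood, which is why the abstract relies on bootstraps that incorporate linkage to obtain parameter uncertainties.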
Ancestry-Dependent Enrichment of Deleterious Homozygotes in Runs of Homozygosity.
Runs of homozygosity (ROH) are important genomic features that manifest when an individual inherits two haplotypes that are identical by descent. Their length distributions are informative about population history, and their genomic locations are useful for mapping recessive loci contributing to both Mendelian and complex disease risk. We have previously shown that ROH, and especially long ROH that are likely the result of recent parental relatedness, are enriched for homozygous deleterious coding variation in a worldwide sample of outbred individuals. However, the distribution of ROH in admixed populations and their relationship to deleterious homozygous genotypes is understudied. Here we analyze whole-genome sequencing data from 1,441 unrelated individuals from self-identified African American, Puerto Rican, and Mexican American populations. These populations are three-way admixed between European, African, and Native American ancestries and provide an opportunity to study the distribution of deleterious alleles partitioned by local ancestry and ROH. We recapitulate previous findings that long ROH are enriched for deleterious variation genome-wide. We then partition by local ancestry and show that deleterious homozygotes arise at a higher rate when ROH overlap African ancestry segments than when they overlap European or Native American ancestry segments of the genome. These results suggest that, while ROH on any haplotype background are associated with an inflation of deleterious homozygous variation, African haplotype backgrounds may play a particularly important role in the genetic architecture of complex diseases for admixed individuals, highlighting the need for further study of these populations.
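As a rough illustration of what a run of homozygosity is at the data level, the following Python sketch scans one individual's genotypes, coded as the number of alternate alleles (0, 1, or 2), for maximal stretches of consecutive homozygous sites. It is a deliberately simplified assumption-laden toy: the function name and the `min_snps` threshold are hypothetical, and real ROH callers also account for physical distance, genotyping error, and allele frequencies, none of which is modeled here.

```python
def find_roh(genotypes, min_snps=3):
    """Return half-open (start, end) index ranges of homozygous runs.

    genotypes: sequence of alternate-allele counts per site (0, 1, or 2).
    A run is kept only if it spans at least `min_snps` consecutive sites.
    """
    runs = []
    start = None
    for i, g in enumerate(genotypes):
        if g in (0, 2):          # homozygous reference or homozygous alternate
            if start is None:
                start = i
        else:                    # a heterozygous site ends the current run
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None
    # Close a run that extends to the last site.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes)))
    return runs


genotypes = [0, 2, 2, 1, 0, 0, 1, 2, 2, 2, 0]
roh = find_roh(genotypes)
```

With segments like these in hand, one could then count homozygous deleterious genotypes inside and outside the runs, and further split the counts by local ancestry, which is the comparison the abstract describes.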
A Population Genetics-Phylogenetics Approach to Inferring Natural Selection in Coding Sequences
Through an analysis of polymorphism within and divergence between species, we can hope to learn about the distribution of selective effects of mutations in the genome, changes in the fitness landscape that occur over time, and the location of sites involved in key adaptations that distinguish modern-day species. We introduce a novel method for the analysis of variation in selection pressures within and between species, spatially along the genome and temporally between lineages. We model codon evolution explicitly using a joint population genetics-phylogenetics approach that we developed for the construction of multiallelic models with mutation, selection, and drift. Our approach has the advantage of performing direct inference on coding sequences, inferring ancestral states probabilistically, utilizing allele frequency information, and generalizing to multiple species. We use a Bayesian sliding window model for intragenic variation in selection coefficients that efficiently combines information across sites and captures spatial clustering within the genome. To demonstrate the utility of the method, we infer selective pressures acting in Drosophila melanogaster and D. simulans from polymorphism and divergence data for 100 X-linked coding regions.