Accurate reconstruction of insertion-deletion histories by statistical phylogenetics
The Multiple Sequence Alignment (MSA) is a computational abstraction that
represents a partial summary either of indel history, or of structural
similarity. Taking the former view (indel history), it is possible to use
formal automata theory to generalize the phylogenetic likelihood framework for
finite substitution models (Dayhoff's probability matrices and Felsenstein's
pruning algorithm) to arbitrary-length sequences. In this paper, we report
results of a simulation-based benchmark of several methods for reconstruction
of indel history. The methods tested include a relatively new algorithm for
statistical marginalization of MSAs that sums over a stochastically-sampled
ensemble of the most probable evolutionary histories. For mammalian
evolutionary parameters on several different trees, the single most likely
history sampled by our algorithm appears less biased than histories
reconstructed by other MSA methods. The algorithm can also be used for
alignment-free inference, where the MSA is explicitly summed out of the
analysis. As an illustration of our method, we discuss reconstruction of the
evolutionary histories of human protein-coding genes. Comment: 28 pages, 15 figures. arXiv admin note: text overlap with
arXiv:1103.434
A Unifying Model of Genome Evolution Under Parsimony
We present a data structure called a history graph that offers a practical
basis for the analysis of genome evolution. It conceptually simplifies the
study of parsimonious evolutionary histories by representing both substitutions
and double cut and join (DCJ) rearrangements in the presence of duplications.
The problem of constructing parsimonious history graphs thus subsumes related
maximum parsimony problems in the fields of phylogenetic reconstruction and
genome rearrangement. We show that tractable functions can be used to define
upper and lower bounds on the minimum number of substitutions and DCJ
rearrangements needed to explain any history graph. These bounds become tight
for a special type of unambiguous history graph called an ancestral variation
graph (AVG), which constrains in its combinatorial structure the number of
operations required. We finally demonstrate that for a given history graph,
a finite set of AVGs describes all of its parsimonious interpretations, and this
set can be explored with a few sampling moves. Comment: 52 pages, 24 figures
Accurate Detection of Recombinant Breakpoints in Whole-Genome Alignments
We propose a novel method for detecting sites of molecular recombination in multiple alignments. Our approach is a compromise between previous extremes of computationally prohibitive but mathematically rigorous methods and imprecise heuristic methods. Using a combined algorithm for estimating tree structure and hidden Markov model parameters, our program detects changes in phylogenetic tree topology over a multiple sequence alignment. We evaluate our method on benchmark datasets from previous studies on two recombinant pathogens, Neisseria and HIV-1, as well as simulated data. We show that we can detect not only recombinant regions of vastly different sizes but also the location of breakpoints with great accuracy. We show that our method does well at inferring recombination breakpoints while remaining practical for larger datasets. In all cases, we confirm the breakpoint predictions of previous studies, and in many cases we offer novel predictions.
Classification of HIV-1 Sequences Using Profile Hidden Markov Models
Accurate classification of HIV-1 subtypes is essential for studying the dynamic spatial distribution pattern of HIV-1 subtypes and also for developing effective methods of treatment that can be targeted to attack specific subtypes. We propose a classification method based on profile hidden Markov models that can accurately identify an unknown strain. We show that a standard method that relies on the construction of a positive training set only, to capture unique features associated with a particular subtype, can accurately classify sequences belonging to all subtypes except B and D. We point out the drawbacks of the standard method, namely an arbitrary choice of threshold to distinguish between true positives and true negatives, and the inability to discriminate between closely related subtypes. We then propose an improved classification method based on construction of a positive as well as a negative training set to improve discriminating ability between closely related subtypes like B and D. Finally, we show how the improved method can be used to accurately determine the subtype composition of Common Recombinant Forms of the virus that are made up of two or more subtypes. Our method provides a simple and highly accurate alternative to other classification methods and will be useful in accurately annotating newly sequenced HIV-1 strains.
Evaluation of methods for detecting conversion events in gene clusters
Background: Gene clusters are genetically important, but their analysis poses significant computational challenges. One of the major reasons for these difficulties is gene conversion among the duplicated regions of the cluster, which can obscure their true relationships. Many computational methods for detecting gene conversion events have been released, but their performance has not been assessed for wide deployment in evolutionary history studies due to a lack of accurate evaluation methods. Results: We designed a new method that simulates gene cluster evolution, including large-scale events of duplication, deletion, and conversion as well as small mutations. We used this simulation data to evaluate several different programs for detecting gene conversion events. Conclusions: Our evaluation identifies strengths and weaknesses of several methods for detecting gene conversion, which can contribute to more accurate analysis of gene cluster evolution.
AAV ancestral reconstruction library enables selection of broadly infectious viral variants
Adeno-associated virus (AAV) vectors have achieved clinical efficacy in treating several diseases. Enhanced vectors are required to extend these landmark successes to other indications, however, and protein engineering approaches may provide the necessary vector improvements to address such unmet medical needs. To generate new capsid variants with potentially enhanced infectious properties, and to gain insights into AAV's evolutionary history, we computationally designed and experimentally constructed a putative ancestral AAV library. Combinatorial variations at 32 amino acid sites were introduced to account for uncertainty in their identities. We then analyzed the evolutionary flexibility of these residues, the majority of which have not been previously studied, by subjecting the library to iterative selection on a representative cell line panel. The resulting variants exhibited transduction efficiencies comparable to the most efficient extant serotypes, and in general ancestral libraries were broadly infectious across the cell line panel, indicating that they favored promiscuity over specificity. Interestingly, putative ancestral AAVs were more thermostable than modern serotypes and did not utilize sialic acids, galactose, or heparan sulfate proteoglycans for cellular entry. Finally, variants mediated 19–31 fold higher gene expression in muscle compared to AAV1, a clinically utilized serotype for muscle delivery, highlighting their promise for gene therapy.