High-performance algorithm engineering for computational phylogenetics

Author: Bader David A.
Moret Bernard M. E.
Warnow Tandy
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 12/12/2006

Infoscience - École polytechnique fédérale de Lausanne

Accurate reconstruction of insertion-deletion histories by statistical phylogenetics

Author: A Heger
A Löytynoja
A Siepel
AG Clark
AM Moses
Art F. Y. Poon
B Knudsen
B Paten
B Rannala
Benedict Paten
C Lee
C Strope
DG Higgins
EF Moore
FA Matsen
FR Kschischang
G Lunter
Gerton Lunter
I Holmes
I Miklós
Ian Holmes
J Felsenstein
JD Thompson
JL Thorne
JS Pedersen
K Katoh
K Liu
KM Wong
KS Pollard
L Gomez-Valero
L Zhu
M Larkin
M Mohri
MA Suchard
N de la Chaux
O Kamneva
O Westesson
Oscar Westesson
P Markova-Raina
R Mills
RA Cartwright
RC Edgar
RK Bradley
S Nelesen
S Saccone
S Sinha
T Beissbarth
X Qu
Z Wang
Z Yang
Z Zhang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2012

The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes. Comment: 28 pages, 15 figures. arXiv admin note: text overlap with arXiv:1103.434
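
The abstract above builds on Felsenstein's pruning algorithm for finite substitution models. As a rough illustration only, the following minimal sketch computes a single-site likelihood on a small tree. It assumes the Jukes-Cantor (JC69) nucleotide model rather than the Dayhoff matrices named in the abstract, and the tree encoding and function names are hypothetical choices made for this example, not the paper's implementation.

```python
import math

BASES = "ACGT"


def jc69_prob(same: bool, t: float) -> float:
    """JC69 transition probability over a branch of length t
    (expected substitutions per site). Assumed model for this sketch."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Return the likelihood of the data below `node` for each state at `node`.

    Hypothetical encoding: a leaf is a one-character string such as "A";
    an internal node is a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        # A leaf is compatible only with its observed base.
        return [1.0 if base == node else 0.0 for base in BASES]
    result = [1.0] * len(BASES)
    for child, t in node:
        child_l = partial_likelihoods(child)
        for i in range(len(BASES)):
            # Sum over the child's possible states, weighted by P(i -> j | t).
            result[i] *= sum(
                jc69_prob(i == j, t) * child_l[j] for j in range(len(BASES))
            )
    return result


def site_likelihood(root) -> float:
    """Average the root partial likelihoods over a uniform root distribution."""
    return sum(0.25 * value for value in partial_likelihoods(root))


# Example: leaf "A" joined to a cherry ("A", "G") through an internal branch.
tree = [("A", 0.1), ([("A", 0.2), ("G", 0.3)], 0.05)]
print(site_likelihood(tree))
```

Each node is visited once, so the cost grows linearly with the number of nodes instead of exponentially with the number of unobserved ancestral states.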

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Oxford University Research Archive

FigShare

Computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires

Author: Afzal
Ahmed
Albert
Amanna
Andrews
Angermueller
Apeltsin
Arden
Atchley
Avnir
Barabási
Barak
Bashford-Rogers
Bastian
Baum
Becattini
Ben-Hamo
Berger
Betz
Boc
Bolen
Bolkhovskaya
Bolotin
Bouckaert
Boyd
Breden
Brown
Burnet
Bürckert
Calis
Castro
Chang
Chao
Chen
Ching
Cobey
Collins
Corcoran
Covacu
Csardi
Cui
Dash
de Bourcy
DeKosky
DeWitt
Dziubianau
Elhanati
Ellebedy
Emerson
Felsenstein
Friedensohn
Gadala-Maria
Galson
Geering
Georgiou
Ghraichy
Giribet
Giudicelli
Glanville
Good
Granato
Greiff
Grigaityte
Guindon
Gupta
Hagberg
Halliley
Hammarlund
Heather
Hershberg
Hochreiter
Hoehn
Horns
Howie
Iversen
Jackson
Janeway
Jiang
Johnston
Jost
Jurtz
Kaplinsky
Kendall
Khavrutskii
Kidd
Kidera
Kirik
Konishi
Kumar
Landsverk
Larkin
Lavinder
Laydon
Lee
Lewitus
Li
Lindeman
Lindner
Liu
Love
Lozupone
Madi
Malissen
Mamoshina
Mangul
Manz
Martin
Meng
Miho
Mora
Morisita
Murugan
Nazarov
Nouri
Oakes
Ostmeyer
Paradis
Parameswaran
Parola
Pinheiro
Pollok
Ralph
Ravn
Reddy
Rempala
Revell
Rizzetto
Robinson
Ronquist
Roybal
Rubelt
Safonova
Schliep
Schramm
Schwab
Shannon
Sheng
Shlemov
Shugay
Snir
Stamatakis
Stern
Strauli
Stubbington
Sun
Sun Cinelli
Swofford
Thomas
Tickotsky
Tonegawa
Torkamani
Trepel
VanDuijn
Venturi
Vieira
Vita
Wang
Wardemann
Warren
Watson
Weinstein
Wine
Wu
Yaari
Yang
Yeap
Yermanos
Yokota
Yu
Zhu
Publication venue: 'Frontiers Media SA'
Publication date: 29/11/2017

The adaptive immune system recognizes antigens via an immense array of antigen-binding antibodies and T-cell receptors, the immune repertoire. The interrogation of immune repertoires is of high relevance for understanding the adaptive immune response in disease and infection (e.g., autoimmunity, cancer, HIV). Adaptive immune receptor repertoire sequencing (AIRR-seq) has driven the quantitative and molecular-level profiling of immune repertoires, thereby revealing the high-dimensional complexity of the immune receptor sequence landscape. Several methods for the computational and statistical analysis of large-scale AIRR-seq data have been developed to resolve immune repertoire complexity in order to understand the dynamics of adaptive immunity. Here, we review the current research on (i) diversity, (ii) clustering and network, (iii) phylogenetic and (iv) machine learning methods applied to dissect, quantify and compare the architecture, evolution, and specificity of immune repertoires. We summarize outstanding questions in computational immunology and propose future directions for systems immunology towards coupling AIRR-seq with the computational discovery of immunotherapeutics, vaccines, and immunodiagnostics. Comment: 27 pages, 2 figures
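
The abstract above lists diversity methods first among the approaches it reviews. As one possible illustration, the sketch below computes the Shannon entropy of a repertoire from a list of clonotype labels. The choice of Shannon entropy, the function name, and the placeholder labels are assumptions made for this example; the abstract does not say which diversity measures the review covers.

```python
import math
from collections import Counter


def shannon_diversity(clonotypes):
    """Shannon entropy (natural log) of clonotype frequencies.

    `clonotypes` is a hypothetical input: one label per sequenced receptor,
    for example a CDR3 string or a clone identifier.
    """
    counts = Counter(clonotypes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("repertoire is empty")
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Placeholder labels, not real receptor sequences.
repertoire = ["clone_1", "clone_1", "clone_2", "clone_3"]
entropy = shannon_diversity(repertoire)
print(entropy, math.exp(entropy))  # entropy and its effective number of clones
```

Exponentiating the entropy gives an effective number of equally abundant clones, which is often easier to compare across samples than the raw entropy.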

arXiv.org e-Print Archive

Repository for Publications and Research Data

Crossref

NORA - Norwegian Open Research Archives

Probabilistic Graphical Model Representation in Phylogenetics

Author: Boussau Bastien
Heath Tracy A.
Huelsenbeck John P.
Höhna Sebastian
Landis Michael J.
Ronquist Fredrik
Publication date: 09/12/2013

Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (1) reproducibility of an analysis, (2) model development and (3) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and non-specialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies. Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well suited to teaching statistical models, facilitating communication among phylogeneticists, and developing generic software for simulation and statistical inference. Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis-Hastings or Gibbs sampling of the posterior distribution.
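
The "core idea" stated in the abstract above, breaking a complex model into conditionally independent distributions, is usually written as the standard factorization of a directed graphical model. The notation below is the textbook form and is assumed for this example rather than taken from the paper: x_1, ..., x_n are the model's variables and pa(x_i) denotes the parents of x_i in the graph.

```latex
p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)
```

Each factor depends only on a node and its parents, which is what allows a generic sampler or likelihood calculator to evaluate a model one node at a time.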

arXiv.org e-Print Archive

KU ScholarWorks

PubMed Central

An Introduction to Programming for Bioscientists: A Python-based Primer

Author: Ekmekci Berk
McAnany Charles E.
Mura Cameron
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 17/05/2016

Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in the biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language's usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. 
Starting with basic concepts, such as that of a 'variable', the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences. Comment: 65 pages total, including 45 pages text, 3 figures, 4 tables, numerous exercises, and 19 pages of Supporting Information; currently in press at PLOS Computational Biology
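
The Hamming distance mentioned in the abstract above counts the positions at which two equal-length sequences differ. A minimal sketch follows, without the graphical user interface; the function name and the example sequences are illustrative choices made here and are not taken from the primer.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count positions at which two equal-length sequences differ.

    Comparison is case-insensitive, so "acgt" and "ACGT" have distance 0.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))


# The two sequences differ at the third and sixth positions.
print(hamming_distance("GATTACA", "GACTATA"))  # prints 2
```

Raising an error on unequal lengths is a deliberate choice in this sketch: the Hamming distance is undefined in that case, and sequences containing insertions or deletions need an alignment-based distance instead.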

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

FigShare

DM-PhyClus: A Bayesian phylogenetic algorithm for infectious disease transmission cluster inference

Author: Brenner Bluma
Labbe Aurélie
Roger Michel
Stephens David A.
Villandré Luc
Publication date: 08/08/2017

Background. Conventional phylogenetic clustering approaches rely on arbitrary cutpoints applied a posteriori to phylogenetic estimates. Although in practice, Bayesian and bootstrap-based clustering tend to lead to similar estimates, they often produce conflicting measures of confidence in clusters. The current study proposes a new Bayesian phylogenetic clustering algorithm, which we refer to as DM-PhyClus, that identifies sets of sequences resulting from quick transmission chains, thus yielding easily interpretable clusters, without using any ad hoc distance or confidence requirement. Results. Simulations reveal that DM-PhyClus can outperform conventional clustering methods, as well as the Gap procedure, a pure distance-based algorithm, in terms of mean cluster recovery. We apply DM-PhyClus to a sample of real HIV-1 sequences, producing a set of clusters whose inference is in line with the conclusions of a previous thorough analysis. Conclusions. DM-PhyClus, by eliminating the need for cutpoints and producing sensible inference for cluster configurations, can facilitate transmission cluster detection. Future efforts to reduce incidence of infectious diseases, like HIV-1, will need reliable estimates of transmission clusters. It follows that algorithms like DM-PhyClus could serve to better inform public health strategies.

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare

Reconstructing phylogeny from RNA secondary structure via simulated evolution

Author: Fischer Will
Geard Nicholas
Publication date: 01/01/2002

DNA sequences of genes encoding functional RNA molecules (e.g., ribosomal RNAs) are commonly used in phylogenetics (i.e., to infer evolutionary history). Trees derived from ribosomal RNA (rRNA) sequences, however, are inconsistent with other molecular data in investigations of deep branches in the tree of life. Since many of the functional constraints on the gene products (i.e., RNA molecules) relate to three-dimensional structure, rather than their actual sequences, accumulated mutations in the gene sequences may obscure phylogenetic signal over very large evolutionary time-scales. Variation in structure, however, may be suitable for phylogenetic inference even under extreme sequence divergence. To evaluate qualitatively the manner in which structural evolution relates to sequence change, we simulated the evolution of RNA sequences under various constraints on structural change.

Southampton (e-Prints Soton)

Consistency and convergence rate of phylogenetic inference via regularization

Author: Dinh Vu
Ho Lam Si Tung
Matsen IV Frederick A.
Suchard Marc A.
Publication date: 05/01/2018

It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct "gene tree." Although the gene tree may deviate from the "species tree" due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a non-standard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is "adaptive fast converging," meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree. Comment: 34 pages, 5 figures. To appear in The Annals of Statistics
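
The penalized estimator described in the abstract above can be written schematically as follows. The symbols are assumptions introduced for this sketch: \ell_k(T) is the log-likelihood of gene tree T given k sites, T_0 is the species (guide) tree, d_BHV is the Billera-Holmes-Vogtmann geodesic distance, and \lambda_k > 0 is a regularization weight. The 1/k normalization and the dependence of \lambda_k on k are illustrative and may differ from the paper's exact formulation.

```latex
\hat{T}_k \;=\; \operatorname*{arg\,max}_{T}
\left[ \frac{1}{k}\,\ell_k(T) \;-\; \lambda_k \, d_{\mathrm{BHV}}(T, T_0)^{2} \right]
```

With \lambda_k = 0 this reduces to ordinary maximum likelihood; larger values pull the estimate toward the guide tree when the sequence data alone are uninformative.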

arXiv.org e-Print Archive

eScholarship - University of California

Using Avida to test the effects of natural selection on phylogenetic reconstruction methods

Author: Hagstrom George I.
Hang Dehua H.
Ofria Charles
Torng Eric
Publication venue: 'MIT Press - Journals'
Publication date: 01/01/2004

Phylogenetic trees group organisms by their ancestral relationships. There are a number of distinct algorithms used to reconstruct these trees from molecular sequence data, but different methods sometimes give conflicting results. Since there are few precisely known phylogenies, simulations are typically used to test the quality of reconstruction algorithms. These simulations randomly evolve strings of symbols to produce a tree, and then the algorithms are run with the tree leaves as inputs. Here we use Avida to test two widely used reconstruction methods, which gives us the chance to observe the effect of natural selection on tree reconstruction. We find that if the organisms undergo natural selection between branch points, the methods will be successful even on very large time scales. However, these algorithms often falter when selection is absent.
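
The conventional simulation setup described in the abstract above, randomly evolving strings of symbols down a tree and passing the leaves to a reconstruction method, can be sketched as follows. This is a neutral model with no selection and is not Avida; the balanced binary tree, the per-site mutation rate, and all function names are assumptions made for this example.

```python
import random

ALPHABET = "ACGT"


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each symbol by a different random symbol with probability `rate`."""
    return "".join(
        rng.choice([s for s in ALPHABET if s != ch]) if rng.random() < rate else ch
        for ch in seq
    )


def evolve(seq: str, depth: int, rate: float, rng: random.Random) -> list:
    """Evolve `seq` down a balanced binary tree and return the leaf sequences."""
    if depth == 0:
        return [seq]
    leaves = []
    for _ in range(2):  # each lineage splits into two mutated descendants
        leaves.extend(evolve(mutate(seq, rate, rng), depth - 1, rate, rng))
    return leaves


rng = random.Random(0)  # fixed seed so the run is repeatable
root = "".join(rng.choice(ALPHABET) for _ in range(30))
leaves = evolve(root, depth=3, rate=0.05, rng=rng)
for leaf in leaves:
    print(leaf)
```

Because the true tree is known by construction (adjacent leaves in the returned list are siblings), the output of a reconstruction method run on `leaves` can be scored against it.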

CiteSeerX

Caltech Authors

On the role of metaheuristic optimization in bioinformatics

Author: Benito Sergio
Calvet Laura
Juan Angel A
Prados Ferran
Publication venue: 'Royal College of Obstetricians & Gynaecologists (RCOG)'
Publication date: 01/01/2022

Metaheuristic algorithms are employed to solve complex and large-scale optimization problems in many different fields, from transportation and smart cities to finance. This paper discusses how metaheuristic algorithms are being applied to solve different optimization problems in the area of bioinformatics. While the text provides references to many optimization problems in the area, it focuses on those that have attracted more interest from the optimization community. Among the problems analyzed, the paper discusses in more detail molecular docking, protein structure prediction, phylogenetic inference, and different string problems. In addition, references to other relevant optimization problems are also given, including those related to medical imaging or gene selection for classification. From the previous analysis, the paper generates insights on research opportunities for the Operations Research and Computer Science communities in the field of bioinformatics.

UCL Discovery

RiuNet