Search CORE

Directory of Open Access Journals

Springer - Publisher Connector

Caltech Authors

BEAST: Bayesian evolutionary analysis by sampling trees

Author: Drummond Alexei J.
Rambaut Andrew
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2007
Field of study

Abstract Background The evolutionary analysis of molecular sequence variation is a statistical enterprise. This is reflected in the increased use of probabilistic models for phylogenetic inference, multiple sequence alignment, and molecular population genetics. Here we present BEAST: a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree. A large number of popular stochastic models of sequence evolution are provided and tree-based models suitable for both within- and between-species sequence data are implemented. Results BEAST version 1.4.6 consists of 81000 lines of Java source code, 779 classes and 81 packages. It provides models for DNA and protein sequence evolution, highly parametric coalescent analysis, relaxed clock phylogenetics, non-contemporaneous sequence data, statistical alignment and a wide range of options for prior distributions. BEAST source code is object-oriented, modular in design and freely available at <url>http://beast-mcmc.googlecode.com/</url> under the GNU LGPL license. Conclusion BEAST is a powerful and flexible evolutionary analysis package for molecular sequence variation. It also provides a resource for the further development of new models and statistical methods of evolutionary analysis.</p

Directory of Open Access Journals

Edinburgh Research Explorer

Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein

Author: Han Yanjun
Rigollet Philippe
Stepaniants George
Publication venue
Publication date: 22/11/2023
Field of study

Feature alignment methods are used in many scientific disciplines for data pooling, annotation, and comparison. As an instance of a permutation learning problem, feature alignment presents significant statistical and computational challenges. In this work, we propose the covariance alignment model to study and compare various alignment methods and establish a minimax lower bound for covariance alignment that has a non-standard dimension scaling because of the presence of a nuisance parameter. This lower bound is in fact minimax optimal and is achieved by a natural quasi MLE. However, this estimator involves a search over all permutations which is computationally infeasible even when the problem has moderate size. To overcome this limitation, we show that the celebrated Gromov-Wasserstein algorithm from optimal transport which is more amenable to fast implementation even on large-scale problems is also minimax optimal. These results give the first statistical justification for the deployment of the Gromov-Wasserstein algorithm in practice.Comment: 41 pages, 2 figure

Jefferson Digital Commons

iSeqQC: a tool for expression-based quality control in RNA sequencing.

Author: Ertel Adam
Feldman George
Fortina Paolo
Kumar Gaurav
Kupper Joan
Publication venue: Jefferson Digital Commons
Publication date: 13/02/2020
Field of study

BACKGROUND: Quality Control in any high-throughput sequencing technology is a critical step, which if overlooked can compromise an experiment and the resulting conclusions. A number of methods exist to identify biases during sequencing or alignment, yet not many tools exist to interpret biases due to outliers. RESULTS: Hence, we developed iSeqQC, an expression-based QC tool that detects outliers either produced due to variable laboratory conditions or due to dissimilarity within a phenotypic group. iSeqQC implements various statistical approaches including unsupervised clustering, agglomerative hierarchical clustering and correlation coefficients to provide insight into outliers. It can be utilized through command-line (Github: https://github.com/gkumar09/iSeqQC) or web-interface (http://cancerwebpa.jefferson.edu/iSeqQC). A local shiny installation can also be obtained from github (https://github.com/gkumar09/iSeqQC). CONCLUSION: iSeqQC is a fast, light-weight, expression-based QC tool that detects outliers by implementing various statistical approaches

Fiber-Flux Diffusion Density for White Matter Tracts Analysis: Application to Mild Anomalies Localization in Contact Sports Players

Author: C Stamile
E Garyfallidis
E Garyfallidis
FC Yeh
J Mårtensson
J Sethian
JB Colby
JD Yeatman
L Heimer
L O’Donnell
L Younes
LJ O’Donnell
LJ O’Donnell
M Khatami
MB Hulkower
ME Shenton
MK Chung
N Charon
P Basser
PA Yushkevich
S Durrleman
S Mori
SM Smith
TB Sebastian
Y Benjamini
Publication venue
Publication date: 18/09/2017
Field of study

We present the concept of fiber-flux density for locally quantifying white matter (WM) fiber bundles. By combining scalar diffusivity measures (e.g., fractional anisotropy) with fiber-flux measurements, we define new local descriptors called Fiber-Flux Diffusion Density (FFDD) vectors. Applying each descriptor throughout fiber bundles allows along-tract coupling of a specific diffusion measure with geometrical properties, such as fiber orientation and coherence. A key step in the proposed framework is the construction of an FFDD dissimilarity measure for sub-voxel alignment of fiber bundles, based on the fast marching method (FMM). The obtained aligned WM tract-profiles enable meaningful inter-subject comparisons and group-wise statistical analysis. We demonstrate our method using two different datasets of contact sports players. Along-tract pairwise comparison as well as group-wise analysis, with respect to non-player healthy controls, reveal significant and spatially-consistent FFDD anomalies. Comparing our method with along-tract FA analysis shows improved sensitivity to subtle structural anomalies in football players over standard FA measurements

Probabilistic sequence alignments: realistic models with efficient algorithms

Author: D. Poland
E. Yeramian
Edouard Debonneuil
Edouard Yeramian
R. Durbin
S. Miyazawa
U. Mückstein
Publication venue: 'American Physical Society (APS)'
Publication date: 09/06/2006
Field of study

Alignment algorithms usually rely on simplified models of gaps for computational efficiency. Based on an isomorphism between alignments and physical helix-coil models, we show in statistical mechanics that alignments with realistic laws for gaps can be computed with fast algorithms. Improved performances of probabilistic alignments with realistic models of gaps are illustrated. Probabilistic and optimization formulations are compared, with potential implications in many fields and perspectives for computationally efficient extensions to Markov models with realistic long-range interactions

HAL-Pasteur

Alignment-free Genomic Analysis via a Big Data Spark Platform

Author: Cattaneo Giuseppe
Giancarlo Raffaele
Palini Francesco
Petrillo Umberto Ferraro
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2021
Field of study

Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well established alternative to two and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent Literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for Alignment-free genomic analysis. It supports natively eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprises novel aspects of interest. Namely, (a) a considerable effort of distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. About this, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE

Archivio della ricerca- Università di Roma La Sapienza

Accelerated Profile HMM Searches

Author: A Jacob
A Krogh
A Milosavljević
A Wozniak
AA Schäffer
B Rekapalli
C Camacho
DR Horn
EK Freyhult
EM Gertz
G Chukkapalli
GA Price
J Landman
JP Walters
JP Walters
K Karplus
LR Rabiner
LS Johnson
M Farrar
M Madera
R Durbin
RD Finn
RP Maddimsetty
S Derrien
S Hunter
S Johnson
Sean R. Eddy
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SJ Melnikoff
SR Eddy
T Oliver
T Rognes
T Rognes
TF Smith
V Chaudhary
V Sachdeva
William R. Pearson
WN Grundy
WR Pearson
Y Sun
Y Sun
YK Yu
Publication venue: Public Library of Science
Publication date: 01/01/2011
Field of study

Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches

Public Library of Science (PLOS)

CiteSeerX

Directory of Open Access Journals

Protein structure database search and evolutionary classification

Author: Tung Chi-Hua
Yang Jinn-Moon
Publication venue: Oxford University Press
Publication date: 02/08/2006
Field of study

As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at

Empirical distribution of k-word matches in biological sequences

Author: Burden
Conover
Conrad J. Burden
Forêt
Gumbel
Kantorovitz
Lippert
Susan R. Wilson
Sylvain Forêt
Waterman
Publication venue: 'Elsevier BV'
Publication date: 14/03/2008
Field of study

This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D_2 statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D_2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D_2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D_2 uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D_2 for ranges of parameters most frequently encountered in the study of biological sequences.Comment: 23 pages, 10 figure

ResearchOnline@JCU