
    Fast Statistical Alignment

    We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/

    BEAST: Bayesian evolutionary analysis by sampling trees

    Background: The evolutionary analysis of molecular sequence variation is a statistical enterprise. This is reflected in the increased use of probabilistic models for phylogenetic inference, multiple sequence alignment, and molecular population genetics. Here we present BEAST: a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree. A large number of popular stochastic models of sequence evolution are provided and tree-based models suitable for both within- and between-species sequence data are implemented. Results: BEAST version 1.4.6 consists of 81000 lines of Java source code, 779 classes and 81 packages. It provides models for DNA and protein sequence evolution, highly parametric coalescent analysis, relaxed clock phylogenetics, non-contemporaneous sequence data, statistical alignment and a wide range of options for prior distributions. BEAST source code is object-oriented, modular in design and freely available at http://beast-mcmc.googlecode.com/ under the GNU LGPL license. Conclusion: BEAST is a powerful and flexible evolutionary analysis package for molecular sequence variation. It also provides a resource for the further development of new models and statistical methods of evolutionary analysis.

    Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein

    Feature alignment methods are used in many scientific disciplines for data pooling, annotation, and comparison. As an instance of a permutation learning problem, feature alignment presents significant statistical and computational challenges. In this work, we propose the covariance alignment model to study and compare various alignment methods and establish a minimax lower bound for covariance alignment that has a non-standard dimension scaling because of the presence of a nuisance parameter. This lower bound is in fact minimax optimal and is achieved by a natural quasi MLE. However, this estimator involves a search over all permutations, which is computationally infeasible even when the problem has moderate size. To overcome this limitation, we show that the celebrated Gromov-Wasserstein algorithm from optimal transport, which is more amenable to fast implementation even on large-scale problems, is also minimax optimal. These results give the first statistical justification for the deployment of the Gromov-Wasserstein algorithm in practice. Comment: 41 pages, 2 figures

    iSeqQC: a tool for expression-based quality control in RNA sequencing.

    BACKGROUND: Quality control in any high-throughput sequencing technology is a critical step, which if overlooked can compromise an experiment and the resulting conclusions. A number of methods exist to identify biases during sequencing or alignment, yet not many tools exist to interpret biases due to outliers. RESULTS: Hence, we developed iSeqQC, an expression-based QC tool that detects outliers produced either by variable laboratory conditions or by dissimilarity within a phenotypic group. iSeqQC implements various statistical approaches including unsupervised clustering, agglomerative hierarchical clustering and correlation coefficients to provide insight into outliers. It can be utilized through the command line (GitHub: https://github.com/gkumar09/iSeqQC) or a web interface (http://cancerwebpa.jefferson.edu/iSeqQC). A local Shiny installation can also be obtained from GitHub (https://github.com/gkumar09/iSeqQC). CONCLUSION: iSeqQC is a fast, light-weight, expression-based QC tool that detects outliers by implementing various statistical approaches.

    Fiber-Flux Diffusion Density for White Matter Tracts Analysis: Application to Mild Anomalies Localization in Contact Sports Players

    We present the concept of fiber-flux density for locally quantifying white matter (WM) fiber bundles. By combining scalar diffusivity measures (e.g., fractional anisotropy) with fiber-flux measurements, we define new local descriptors called Fiber-Flux Diffusion Density (FFDD) vectors. Applying each descriptor throughout fiber bundles allows along-tract coupling of a specific diffusion measure with geometrical properties, such as fiber orientation and coherence. A key step in the proposed framework is the construction of an FFDD dissimilarity measure for sub-voxel alignment of fiber bundles, based on the fast marching method (FMM). The obtained aligned WM tract-profiles enable meaningful inter-subject comparisons and group-wise statistical analysis. We demonstrate our method using two different datasets of contact sports players. Along-tract pairwise comparison as well as group-wise analysis, with respect to non-player healthy controls, reveal significant and spatially-consistent FFDD anomalies. Comparing our method with along-tract FA analysis shows improved sensitivity to subtle structural anomalies in football players over standard FA measurements.

    Probabilistic sequence alignments: realistic models with efficient algorithms

    Alignment algorithms usually rely on simplified models of gaps for computational efficiency. Based on an isomorphism between alignments and physical helix-coil models, we show in statistical mechanics that alignments with realistic laws for gaps can be computed with fast algorithms. Improved performance of probabilistic alignments with realistic models of gaps is illustrated. Probabilistic and optimization formulations are compared, with potential implications in many fields and perspectives for computationally efficient extensions to Markov models with realistic long-range interactions.

    Alignment-free Genomic Analysis via a Big Data Spark Platform

    Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in Computational Biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprise novel aspects of interest. Namely, (a) a considerable effort on distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. On this last point, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.

    Accelerated Profile HMM Searches

    Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.

    Protein structure database search and evolutionary classification

    As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at

    Empirical distribution of k-word matches in biological sequences

    This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D_2 statistic. The advantages of this statistic over alignment-based methods are, firstly, that it does not assume that homologous segments are contiguous and, secondly, that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D_2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D_2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D_2 uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D_2 for ranges of parameters most frequently encountered in the study of biological sequences. Comment: 23 pages, 10 figures
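The D_2 statistic described in the abstract above can be illustrated with a minimal sketch. It assumes the common definition of D_2 as the number of matching pairs of overlapping k-words, one from each sequence. The function names `kmer_counts` and `d2_statistic` are hypothetical and do not come from the paper.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    # A sequence shorter than k has no k-words, so the range is empty.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_statistic(seq_a: str, seq_b: str, k: int) -> int:
    """Number of k-word matches between seq_a and seq_b.

    Assumed definition: each pair (position in seq_a, position in seq_b)
    whose k-words are identical counts once, so a word occurring m times
    in seq_a and n times in seq_b contributes m * n.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    print(d2_statistic("ACGTACGT", "CGTA", 2))
```

Counting the words of each sequence once keeps the work roughly linear in the combined sequence length, which is in line with the speed advantage the abstract attributes to the method.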