Fast Statistical Alignment
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/
BEAST: Bayesian evolutionary analysis by sampling trees
Background: The evolutionary analysis of molecular sequence variation is a statistical enterprise. This is reflected in the increased use of probabilistic models for phylogenetic inference, multiple sequence alignment, and molecular population genetics. Here we present BEAST: a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree. A large number of popular stochastic models of sequence evolution are provided and tree-based models suitable for both within- and between-species sequence data are implemented.
Results: BEAST version 1.4.6 consists of 81000 lines of Java source code, 779 classes and 81 packages. It provides models for DNA and protein sequence evolution, highly parametric coalescent analysis, relaxed clock phylogenetics, non-contemporaneous sequence data, statistical alignment and a wide range of options for prior distributions. BEAST source code is object-oriented, modular in design and freely available at http://beast-mcmc.googlecode.com/ under the GNU LGPL license.
Conclusion: BEAST is a powerful and flexible evolutionary analysis package for molecular sequence variation. It also provides a resource for the further development of new models and statistical methods of evolutionary analysis.
Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein
Feature alignment methods are used in many scientific disciplines for data
pooling, annotation, and comparison. As an instance of a permutation learning
problem, feature alignment presents significant statistical and computational
challenges. In this work, we propose the covariance alignment model to study
and compare various alignment methods and establish a minimax lower bound for
covariance alignment that has a non-standard dimension scaling because of the
presence of a nuisance parameter. This lower bound is in fact minimax optimal
and is achieved by a natural quasi MLE. However, this estimator involves a
search over all permutations which is computationally infeasible even when the
problem has moderate size. To overcome this limitation, we show that the
celebrated Gromov-Wasserstein algorithm from optimal transport which is more
amenable to fast implementation even on large-scale problems is also minimax
optimal. These results give the first statistical justification for the
deployment of the Gromov-Wasserstein algorithm in practice.
Comment: 41 pages, 2 figures
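The computational obstacle mentioned in this abstract, a search over all permutations, can be illustrated with a small sketch. The code below is not the paper's quasi MLE: it assumes a plain least-squares (Frobenius) criterion as a stand-in, and the function name `align_covariances` is hypothetical. It only shows why exhaustive search scales as n! and becomes impractical beyond a handful of features.

```python
from itertools import permutations


def align_covariances(a, b):
    """Brute-force feature alignment of two covariance matrices.

    Assumption: we score a permutation by the squared Frobenius distance
    between `a` and the permuted `b`. This is an illustrative criterion,
    not the estimator analysed in the paper.
    """
    n = len(a)
    best_perm, best_cost = None, float("inf")
    # Enumerates all n! permutations, hence infeasible for moderate n.
    for perm in permutations(range(n)):
        cost = sum(
            (a[i][j] - b[perm[i]][perm[j]]) ** 2
            for i in range(n)
            for j in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Toy example: b is a with its features relabelled.
a = [[2.0, 0.5, 0.1],
     [0.5, 1.0, 0.3],
     [0.1, 0.3, 3.0]]
b = [[1.0, 0.3, 0.5],
     [0.3, 3.0, 0.1],
     [0.5, 0.1, 2.0]]
perm, cost = align_covariances(a, b)
```

In this toy case feature `i` of `a` is matched to feature `perm[i]` of `b`. With 3 features there are only 6 candidates, but 15 features already give more than 10^12.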
iSeqQC: a tool for expression-based quality control in RNA sequencing.
BACKGROUND: Quality Control in any high-throughput sequencing technology is a critical step, which if overlooked can compromise an experiment and the resulting conclusions. A number of methods exist to identify biases during sequencing or alignment, yet not many tools exist to interpret biases due to outliers.
RESULTS: Hence, we developed iSeqQC, an expression-based QC tool that detects outliers either produced due to variable laboratory conditions or due to dissimilarity within a phenotypic group. iSeqQC implements various statistical approaches including unsupervised clustering, agglomerative hierarchical clustering and correlation coefficients to provide insight into outliers. It can be utilized through command-line (GitHub: https://github.com/gkumar09/iSeqQC) or web-interface (http://cancerwebpa.jefferson.edu/iSeqQC). A local Shiny installation can also be obtained from GitHub (https://github.com/gkumar09/iSeqQC).
CONCLUSION: iSeqQC is a fast, light-weight, expression-based QC tool that detects outliers by implementing various statistical approaches.
Fiber-Flux Diffusion Density for White Matter Tracts Analysis: Application to Mild Anomalies Localization in Contact Sports Players
We present the concept of fiber-flux density for locally quantifying white
matter (WM) fiber bundles. By combining scalar diffusivity measures (e.g.,
fractional anisotropy) with fiber-flux measurements, we define new local
descriptors called Fiber-Flux Diffusion Density (FFDD) vectors. Applying each
descriptor throughout fiber bundles allows along-tract coupling of a specific
diffusion measure with geometrical properties, such as fiber orientation and
coherence. A key step in the proposed framework is the construction of an FFDD
dissimilarity measure for sub-voxel alignment of fiber bundles, based on the
fast marching method (FMM). The obtained aligned WM tract-profiles enable
meaningful inter-subject comparisons and group-wise statistical analysis. We
demonstrate our method using two different datasets of contact sports players.
Along-tract pairwise comparison as well as group-wise analysis, with respect to
non-player healthy controls, reveal significant and spatially-consistent FFDD
anomalies. Comparing our method with along-tract FA analysis shows improved
sensitivity to subtle structural anomalies in football players over standard FA
measurements
Probabilistic sequence alignments: realistic models with efficient algorithms
Alignment algorithms usually rely on simplified models of gaps for
computational efficiency. Based on an isomorphism between alignments and
physical helix-coil models, we show in statistical mechanics that alignments
with realistic laws for gaps can be computed with fast algorithms. Improved
performances of probabilistic alignments with realistic models of gaps are
illustrated. Probabilistic and optimization formulations are compared, with
potential implications in many fields and perspectives for computationally
efficient extensions to Markov models with realistic long-range interactions
Alignment-free Genomic Analysis via a Big Data Spark Platform
Motivation: Alignment-free distance and similarity functions (AF functions,
for short) are a well established alternative to two and multiple sequence
alignments for many genomic, metagenomic and epigenomic tasks. Due to
data-intensive applications, the computation of AF functions is a Big Data
problem, with the recent literature indicating that the development of fast and
scalable algorithms computing AF functions is a high-priority task. Somewhat
surprisingly, despite the increasing popularity of Big Data technologies in
Computational Biology, the development of a Big Data platform for those tasks
has not been pursued, possibly due to its complexity. Results: We fill this
important gap by introducing FADE, the first extensible, efficient and scalable
Spark platform for alignment-free genomic analysis. It natively supports
eighteen of the best performing AF functions coming out of a recent hallmark
benchmarking study. FADE's development and potential impact comprise several
novel aspects of interest: (a) a considerable engineering effort on distributed
algorithms, the most tangible result being a much faster execution time of
reference methods like MASH and FSWM; (b) a software design that makes FADE
user-friendly and easily extendable by Spark non-specialists; (c) its ability
to support data- and compute-intensive tasks. On this last point, we provide a novel
and much needed analysis of how informative and robust AF functions are, in
terms of the statistical significance of their output. Our findings naturally
extend the ones of the highly regarded benchmarking study, since the functions
that can really be used are reduced to a handful of the eighteen included in
FADE
Accelerated Profile HMM Searches
Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches
Protein structure database search and evolutionary classification
As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at
Empirical distribution of k-word matches in biological sequences
This study focuses on an alignment-free sequence comparison method: the
number of words of length k shared between two sequences, also known as the D_2
statistic. The advantages of the use of this statistic over alignment-based
methods are firstly that it does not assume that homologous segments are
contiguous, and secondly that the algorithm is computationally extremely fast,
the runtime being proportional to the size of the sequence under scrutiny.
Existing applications of the D_2 statistic include the clustering of related
sequences in large EST databases such as the STACK database. Such applications
have typically relied on heuristics without any statistical basis. Rigorous
statistical characterisations of the distribution of D_2 have subsequently been
undertaken, but have focussed on the distribution's asymptotic behaviour,
leaving the distribution of D_2 uncharacterised for most practical cases. The
work presented here bridges these two worlds to give usable approximations of
the distribution of D_2 for ranges of parameters most frequently encountered in
the study of biological sequences.
Comment: 23 pages, 10 figures
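As an illustration of the D_2 statistic described in this abstract, here is a minimal sketch that counts k-word matches between two sequences. It assumes the common definition of D_2 as the sum, over all k-words, of the product of that word's counts in each sequence. The function names `kmer_counts` and `d2` are hypothetical and not taken from the paper.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a, b, k):
    """Number of k-word matches between sequences a and b.

    Each pair of positions (i, j) where a[i:i+k] == b[j:j+k] counts once,
    which equals the sum over words of count_a(word) * count_b(word).
    """
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # Counter returns 0 for words absent from b, so no key check is needed.
    return sum(n * counts_b[word] for word, n in counts_a.items())


same = d2("ACGT", "ACGT", 2)      # shared words: AC, CG, GT
repeats = d2("AAAA", "AAA", 2)    # AA occurs 3 times and 2 times
disjoint = d2("ACGT", "TTTT", 2)  # no shared 2-words
```

With hash-based counting, the work is a single pass over each sequence, which matches the abstract's point that runtime grows in proportion to sequence size. No assumption is made that matching words are contiguous or in the same order.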