Multiple sequence alignment based on set covers
We introduce a new heuristic for the multiple alignment of a set of
sequences. The heuristic is based on a set cover of the residue alphabet of the
sequences, and also on the determination of a significant set of blocks
comprising subsequences of the sequences to be aligned. These blocks are
obtained with the aid of a new data structure, called a suffix-set tree, which
is constructed from the input sequences with the guidance of the
residue-alphabet set cover and generalizes the well-known suffix tree of the
sequence set. We provide performance results on selected BAliBASE amino-acid
sequences and compare them with those yielded by some prominent approaches.
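To make the notion of a set cover of the residue alphabet concrete, here is a minimal sketch of the textbook greedy set-cover procedure applied to a toy alphabet. It is not the heuristic from the abstract above: the function name `greedy_set_cover`, the eight-letter alphabet, and the residue groups are all hypothetical choices made for illustration.

```python
def greedy_set_cover(universe, subsets):
    """Pick subsets greedily until every element of `universe` is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Take the subset that covers the most still-uncovered elements.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the given subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Hypothetical residue groups over a toy alphabet (illustration only).
alphabet = set("AVLIDEKR")
groups = {
    "hydrophobic": set("AVLI"),
    "acidic": set("DE"),
    "basic": set("KR"),
    "charged": set("DEKR"),
}
cover = greedy_set_cover(alphabet, groups)
print(cover)
```

Unlike a partition, a set cover allows the chosen groups to overlap, so a residue may belong to more than one group.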
GibbsCluster: unsupervised clustering and alignment of peptide sequences
Receptor interactions with short linear peptide fragments (ligands) are at the base of many biological signaling processes. Conserved and information-rich amino acid patterns, commonly called sequence motifs, shape and regulate these interactions. Because of the properties of a receptor-ligand system or of the assay used to interrogate it, experimental data often contain multiple sequence motifs. GibbsCluster is a powerful tool for unsupervised motif discovery because it can simultaneously cluster and align peptide data. The GibbsCluster 2.0 presented here is an improved version incorporating insertions and deletions to account for variations in motif length in the peptide input. In basic terms, the program takes as input a set of peptide sequences and clusters them into meaningful groups. It returns the optimal number of clusters it identified, together with the sequence alignment and sequence motif characterizing each cluster. Several parameters are available to customize cluster analysis, including adjustable penalties for small clusters and overlapping groups and a trash cluster to remove outliers. As an example application, we used the server to deconvolute multiple specificities in large-scale peptidome data generated by mass spectrometry.
Affiliation: Andreatta, Massimo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Investigaciones Biotecnológicas "Dr. Raúl Alfonsín" (sede Chascomús). Universidad Nacional de San Martín; Argentina
Affiliation: Alvarez, Bruno. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Investigaciones Biotecnológicas "Dr. Raúl Alfonsín" (sede Chascomús). Universidad Nacional de San Martín; Argentina
Affiliation: Nielsen, Morten. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Investigaciones Biotecnológicas "Dr. Raúl Alfonsín" (sede Chascomús). Universidad Nacional de San Martín; Argentina. Technical University of Denmark; Denmark
MAVID: Constrained ancestral alignment of multiple sequences
We describe a new global multiple alignment program capable of aligning a
large number of genomic regions. Our progressive alignment approach
incorporates the following ideas: maximum-likelihood inference of ancestral
sequences, automatic guide-tree construction, protein-based anchoring of
ab-initio gene predictions, and constraints derived from a global homology map
of the sequences. We have implemented these ideas in the MAVID program, which
is able to accurately align multiple genomic regions up to megabases long.
MAVID is able to effectively align divergent sequences, as well as incomplete
unfinished sequences. We demonstrate the capabilities of the program on the
benchmark CFTR region, which consists of 1.8 Mb of human sequence and 20
orthologous regions in marsupials, birds, fish, and mammals. Finally, we
describe two large MAVID alignments: an alignment of all the available HIV
genomes and a multiple alignment of the entire human, mouse and rat genomes.
Hybrid modeling, HMM/NN architectures, and protein applications
We describe a hybrid modeling approach where the parameters of a model are calculated and modulated by another model, typically a neural network (NN), to avoid both overfitting and underfitting. We develop the approach for the case of Hidden Markov Models (HMMs), by deriving a class of hybrid HMM/NN architectures. These architectures can be trained with unified algorithms that blend HMM dynamic programming with NN backpropagation. In the case of complex data, mixtures of HMMs or modulated HMMs must be used. NNs can then be applied both to the parameters of each single HMM, and to the switching or modulation of the models, as a function of input or context. Hybrid HMM/NN architectures provide a flexible NN parameterization for the control of model structure and complexity. At the same time, they can capture distributions that, in practice, are inaccessible to single HMMs. The HMM/NN hybrid approach is tested, in its simplest form, by constructing a model of the immunoglobulin protein family. A hybrid model is trained, and a multiple alignment derived, with less than a fourth of the number of parameters used with previous single HMMs.
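The "HMM dynamic programming" mentioned in the abstract above can be illustrated with the standard forward algorithm, which sums over all hidden state paths to obtain the likelihood of an observation sequence. The sketch below covers only that plain HMM component, not the hybrid HMM/NN architecture; the two-state model and its probabilities are made-up toy values.

```python
def forward(obs, start, trans, emit):
    """Likelihood of `obs` under a discrete HMM, via the forward recursion.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    n = len(start)
    # Initialisation: joint probability of the first symbol and each state.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    # Recursion: extend every partial path by one observation at a time.
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    # Termination: sum over the final hidden state.
    return sum(alpha)


# Toy two-state, two-symbol model (hypothetical numbers).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward([0, 1, 0], start, trans, emit)
print(likelihood)
```

In a hybrid model of the kind the abstract describes, quantities such as `emit` and `trans` would be produced by a neural network rather than stored as fixed tables.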
A new procedure to analyze RNA non-branching structures
RNA structure prediction and structural motif analysis are challenging tasks in the investigation of RNA function. We propose a novel procedure to detect structural motifs shared between two RNAs (a reference and a target). In particular, we developed two core modules: (i) nbRSSP_extractor, to assign a unique structure to the reference RNA encoded by a set of non-branching structures; (ii) SSD_finder, to detect structural motifs that the target RNA shares with the reference, by means of a new score function that rewards the relative distance of the target non-branching structures compared to the reference ones. We integrated these algorithms with already existing software to reach a coherent pipeline able to perform the following two main tasks: prediction of RNA structures (integration of RNALfold and nbRSSP_extractor) and search for chains of matches (integration of Structator and SSD_finder).
Evolutionary distances in the twilight zone -- a rational kernel approach
Phylogenetic tree reconstruction is traditionally based on multiple sequence
alignments (MSAs) and heavily depends on the validity of this information
bottleneck. With increasing sequence divergence, the quality of MSAs decays
quickly. Alignment-free methods, on the other hand, are based on abstract
string comparisons and avoid potential alignment problems. However, in general
they are not biologically motivated and ignore our knowledge about the
evolution of sequences. Thus, it is still a major open question how to define
an evolutionary distance metric between divergent sequences that makes use of
indel information and known substitution models without the need for a multiple
alignment. Here we propose a new evolutionary distance metric to close this
gap. It uses finite-state transducers to create a biologically motivated
similarity score which models substitutions and indels, and does not depend on
a multiple sequence alignment. The sequence similarity score is defined in
analogy to pairwise alignments and additionally has the positive semi-definite
property. We describe its derivation and show in simulation studies and
real-world examples that it is more accurate in reconstructing phylogenies than
competing methods. The result is a new and accurate way of determining
evolutionary distances in and beyond the twilight zone of sequence alignments
that is suitable for large datasets.
Comment: to appear in PLoS ONE
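The abstract above says its similarity score is defined in analogy to pairwise alignments. For orientation, the following sketch computes a classical global pairwise alignment score with substitution and indel (gap) penalties by dynamic programming. It is not the transducer-based kernel proposed in the paper, and the scoring values (match 1, mismatch -1, gap -1) are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + sub,  # substitution or match
                prev[j] + gap,      # gap in b (a[i-1] unpaired)
                curr[j - 1] + gap,  # gap in a (b[j-1] unpaired)
            ))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

This keeps only two rows of the dynamic programming table, so memory grows with the length of the shorter dimension rather than with the product of both lengths.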
Incorporating molecular data in fungal systematics: a guide for aspiring researchers
The last twenty years have witnessed molecular data emerge as a primary
research instrument in most branches of mycology. Fungal systematics, taxonomy,
and ecology have all seen tremendous progress and have undergone rapid,
far-reaching changes as disciplines in the wake of continual improvement in DNA
sequencing technology. A taxonomic study that draws from molecular data
involves a long series of steps, ranging from taxon sampling through the
various laboratory procedures and data analysis to the publication process. All
steps are important and influence the results and the way they are perceived by
the scientific community. The present paper provides a reflective overview of
all major steps in such a project, with the aim of assisting research students
about to begin their first study using DNA-based methods. We also take the
opportunity to discuss the role of taxonomy in biology and the life sciences in
general in the light of molecular data. While the best way to learn molecular
methods is to work side by side with someone experienced, we hope that the
present paper will serve to lower the learning threshold for the reader.
Comment: Submitted to Current Research in Environmental and Applied Mycology - comments most welcome