Pairwise alignment incorporating dipeptide covariation
Motivation: Standard algorithms for pairwise protein sequence alignment make
the simplifying assumption that amino acid substitutions at neighboring sites
are uncorrelated. This assumption allows implementation of fast algorithms for
pairwise sequence alignment, but it ignores information that could conceivably
increase the power of remote homolog detection. We examine the validity of this
assumption by constructing extended substitution matrices that encapsulate the
observed correlations between neighboring sites, by developing an efficient and
rigorous algorithm for pairwise protein sequence alignment that incorporates
these local substitution correlations, and by assessing the ability of this
algorithm to detect remote homologies. Results: Our analysis indicates that
local correlations between substitutions are not strong on average.
Furthermore, incorporating local substitution correlations into pairwise
alignment did not lead to a statistically significant improvement in remote
homology detection. Therefore, the standard assumption that individual residues
within protein sequences evolve independently of neighboring positions appears
to be an efficient and appropriate approximation.
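
To make the independence assumption concrete, here is a minimal sketch of a standard global alignment score (Needleman-Wunsch style) in which each aligned column contributes its own term and neighboring columns never interact. The function name and the toy scoring values (match, mismatch, linear gap cost) are illustrative assumptions, not the extended dipeptide matrices or the algorithm described in the abstract.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score under per-site (independent) scoring.

    The scoring values are toy assumptions for illustration only.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # The substitution term depends only on the current pair of
            # residues: this is the independence assumption in code form.
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # a[i-1] against a gap
                curr[j - 1] + gap,  # b[j-1] against a gap
            )
        prev = curr
    return prev[len(b)]
```

A dipeptide-aware variant would let the substitution term depend on the neighboring column as well, which is what enlarges the state the recurrence has to track.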
Computational identification and analysis of noncoding RNAs - Unearthing the buried treasures in the genome
The central dogma of molecular biology states that genetic information flows from DNA to RNA to protein. This dogma has exerted a substantial influence on our understanding of the genetic activities in cells. Under this influence, the prevailing assumption until the recent past was that genes are basically repositories for protein coding information, and proteins are responsible for most of the important biological functions in all cells. Meanwhile, the importance of RNAs remained rather obscure, and RNA was mainly viewed as a passive intermediary that bridges the gap between DNA and protein. Except for classic examples such as tRNAs (transfer RNAs) and rRNAs (ribosomal RNAs), functional noncoding RNAs were considered to be rare.
However, this view has experienced a dramatic change during the last decade, as systematic screening of various genomes identified myriads of noncoding RNAs (ncRNAs), which are RNA molecules that function without being translated into proteins [11], [40]. It has been realized that many ncRNAs play important roles in various biological processes. As RNAs can interact with other RNAs and DNAs in a sequence-specific manner, they are especially useful in tasks that require highly specific nucleotide recognition [11]. Good examples are the miRNAs (microRNAs) that regulate gene expression by targeting mRNAs (messenger RNAs) [4], [20], and the siRNAs (small interfering RNAs) that take part in the RNAi (RNA interference) pathways for gene silencing [29], [30]. Recent developments show that ncRNAs are extensively involved in many gene regulatory mechanisms [14], [17].
The roles of ncRNAs known to this day are truly diverse. These include transcription and translation control, chromosome replication, RNA processing and modification, and protein degradation and translocation [40], just to name a few. These days, it is even claimed that ncRNAs dominate the genomic output of higher organisms such as mammals, and it is being suggested that the greater portion of their genome (which does not encode proteins) is dedicated to the control and regulation of cell development [27]. As more and more evidence piles up, greater attention is being paid to ncRNAs, which were neglected for a long time. Researchers have begun to realize that the vast majority of the genome that was regarded as “junk,” mainly because it was not well understood, may indeed hold the key to the best kept secrets in life, such as the mechanism of alternative splicing, the control of epigenetic variations and so forth [27]. The complete range and extent of the role of ncRNAs are not so obvious at this point, but it is certain that a comprehensive understanding of cellular processes is not possible without understanding the functions of ncRNAs [47].
Efficient seeding techniques for protein similarity search
We apply the concept of subset seeds proposed in [1] to similarity search in
protein sequences. The main question studied is the design of efficient seed
alphabets to construct seeds with optimal sensitivity/selectivity trade-offs.
We propose several different design methods and use them to construct several
alphabets. We then perform an analysis of seeds built over those alphabets and
compare them with the standard Blastp seeding method [2,3], as well as with the
family of vector seeds proposed in [4]. While the formalism of subset seeds is
less expressive (but less costly to implement) than the accumulative principle
used in Blastp and vector seeds, our seeds show a similar or even better
performance than Blastp on Bernoulli models of proteins compatible with the
common BLOSUM62 matrix.
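
As a rough illustration of the idea, the sketch below treats each seed letter as a partition of the amino acid alphabet, so that two k-mers form a hit when, at every seed position, both residues fall into the same group. The three seed letters (`EXACT`, `COARSE`, `ANY`), their groupings, and the function names are hypothetical choices made for this example; they are not the alphabets designed in the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical seed letters: each one is a partition of the 20 amino acids.
EXACT = [{aa} for aa in AMINO_ACIDS]           # residues must be identical
COARSE = [set("ILMV"), set("FWY"), set("KRH"), set("DE"), set("NQ"),
          set("ST"), set("AG"), {"C"}, {"P"}]  # assumed similarity groups
ANY = [set(AMINO_ACIDS)]                       # any pair of residues matches


def group_index(partition, residue):
    """Return the index of the group in `partition` that contains `residue`."""
    for idx, group in enumerate(partition):
        if residue in group:
            return idx
    raise ValueError(f"unknown residue: {residue!r}")


def seed_key(seed, kmer):
    """Map a k-mer to a hashable key: one group index per seed position."""
    return tuple(group_index(p, r) for p, r in zip(seed, kmer))


def find_hits(seed, query, target):
    """Return (query_pos, target_pos) pairs where the seed matches."""
    k = len(seed)
    # Index the target once; k-mers with equal keys are seed hits.
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(seed_key(seed, target[j:j + k]), []).append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(seed_key(seed, query[i:i + k]), []):
            hits.append((i, j))
    return hits
```

For example, with `seed = [EXACT, COARSE, ANY]`, the k-mers `AIK` and `ALW` form a hit because `I` and `L` share a group and the last position accepts any pair. Because each position reduces to a group index, hits can be found with a plain hash lookup, which is the implementation-cost advantage the abstract alludes to.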
Identification of functionally related enzymes by learning-to-rank methods
Enzyme sequences and structures are routinely used in the biological sciences
as queries to search for functionally related enzymes in online databases. To
this end, one usually starts from some notion of similarity, comparing two
enzymes by looking for correspondences in their sequences, structures or
surfaces. For a given query, the search operation results in a ranking of the
enzymes in the database, from very similar to dissimilar enzymes, while
information about the biological function of annotated database enzymes is
ignored.
In this work we show that rankings of that kind can be substantially improved
by applying kernel-based learning algorithms. This approach enables the
detection of statistical dependencies between similarities of the active cleft
and the biological function of annotated enzymes. This is in contrast to
search-based approaches, which do not take annotated training data into
account. Similarity measures based on the active cleft are known to outperform
sequence-based or structure-based measures under certain conditions. We
consider the Enzyme Commission (EC) classification hierarchy for obtaining
annotated enzymes during the training phase. The results of a set of sizeable
experiments indicate a consistent and significant improvement for a set of
similarity measures that exploit information about small cavities in the
surface of enzymes.
Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties: Extended Version
Although computationally aligning sequences is a crucial step in the vast
majority of comparative genomics studies, our understanding of alignment biases
still needs to be improved. To infer true structural or homologous regions,
computational alignments need further evaluation. It has been shown that the
accuracy of aligned positions can drop substantially, in particular around gaps.
Here we focus on re-evaluation of score-based alignments with affine gap
penalty costs. We exploit their relationships with pair hidden Markov models
and develop efficient algorithms by which to identify gaps which are
significant in terms of length and multiplicity. We evaluate our statistics
with respect to the well-established structural alignments from SABmark and
find that indel reliability substantially increases with their significance, in
particular in worst-case twilight zone alignments. This indicates that our
statistics can reliably complement other methods, which mostly focus on the
reliability of match positions.
Comment: 17 pages, 7 figures
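
For readers unfamiliar with affine gap penalties, the following is a minimal sketch of the classic three-state dynamic program (often attributed to Gotoh) for global alignment, in which a gap of length k costs `gap_open + (k - 1) * gap_extend`. The three matrices play a role analogous to the match and two insert states of a pair hidden Markov model, which is the relationship the abstract refers to. The function name and all scoring values are assumptions for illustration; the gap significance statistics themselves are not reproduced here.

```python
NEG_INF = float("-inf")


def affine_alignment_score(a, b, match=2, mismatch=-1,
                           gap_open=-3, gap_extend=-1):
    """Best global alignment score with affine gap costs (toy parameters)."""
    n, m = len(a), len(b)
    # M: alignment ends in an aligned pair of residues.
    # X: alignment ends with a residue of `a` against a gap.
    # Y: alignment ends with a residue of `b` against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1],
                          Y[i - 1][j - 1]) + sub
            # Opening a gap is charged once; extending it is cheaper.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these toy parameters, aligning `ACGT` against `AT` yields two matches and one gap of length two, so the length of the gap matters less than the fact that it was opened at all.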