New decoding algorithms for Hidden Markov Models using distance measures on labellings
BACKGROUND: Existing hidden Markov model decoding algorithms do not focus on approximately identifying the sequence feature boundaries. RESULTS: We give a set of algorithms to compute the conditional probability of all labellings "near" a reference labelling λ for a sequence y, for a variety of definitions of "near". In addition, we give optimization algorithms to find the best labelling for a sequence in the robust sense of having all of its feature boundaries nearly correct. Natural problems in this domain are NP-hard to optimize. For membrane proteins, our algorithms find the approximate topology of such proteins with comparable success to existing programs, while being substantially more accurate in estimating the positions of transmembrane helix boundaries. CONCLUSION: More robust HMM decoding may allow for better analysis of sequence features, in reasonable runtimes.
On the Treewidth of Dynamic Graphs
Dynamic graph theory is a novel, growing area that deals with graphs that
change over time and is of great utility in modelling modern wireless, mobile
and dynamic environments. As a graph evolves, possibly arbitrarily, it is
challenging to identify the graph properties that can be preserved over time
and understand their respective computability.
In this paper we are concerned with the treewidth of dynamic graphs. We focus
on metatheorems, which allow the generation of a series of results based on
general properties of classes of structures. In graph theory two major
metatheorems on treewidth provide complexity classifications by employing
structural graph measures and finite model theory. Courcelle's Theorem gives a
general tractability result for problems expressible in monadic second order
logic on graphs of bounded treewidth, and Frick & Grohe demonstrate a similar
result for first order logic and graphs of bounded local treewidth.
We extend these theorems by showing that dynamic graphs of bounded (local)
treewidth where the length of time over which the graph evolves and is observed
is finite and bounded can be modelled in such a way that the (local) treewidth
of the underlying graph is maintained. We show the application of these results
to problems in dynamic graph theory and dynamic extensions to static problems.
In addition we demonstrate that certain widely used dynamic graph classes
naturally have bounded local treewidth
Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST
BACKGROUND: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. RESULTS: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. CONCLUSION: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms
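As an illustration of what "translated in all six frames" means, the following is a minimal sketch and not the TBLASTN implementation. It assumes the standard genetic code and uppercase input containing only A, C, G and T; the function names and the frame labels (`+1` to `-3`) are hypothetical choices made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Translate three reading frames on each strand (hypothetical labels +1..-3)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

A protein query would then be aligned against each of the six translated strings rather than against the nucleotide sequence itself.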
RNA motif search with data-driven element ordering
BACKGROUND: In this paper, we study the problem of RNA motif search in long genomic sequences. This approach uses a combination of sequence and structure constraints to uncover new distant homologs of known functional RNAs. The problem is NP-hard and is traditionally solved by backtracking algorithms. RESULTS: We have designed a new algorithm for RNA motif search and implemented a new motif search tool RNArobo. The tool enhances the RNAbob descriptor language, allowing insertions in helices, which enables better characterization of ribozymes and aptamers. A typical RNA motif consists of multiple elements and the running time of the algorithm is highly dependent on their ordering. By approaching the element ordering problem in a principled way, we demonstrate more than 100-fold speedup of the search for complex motifs compared to previously published tools. CONCLUSIONS: We have developed a new method for RNA motif search that allows for a significant speedup of the search of complex motifs that include pseudoknots. Such speed improvements are crucial at a time when the rate of DNA sequencing outpaces growth in computing. RNArobo is available at http://compbio.fmph.uniba.sk/rnarobo. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1074-x) contains supplementary material, which is available to authorized users
Optimizing Multiple Spaced Seeds for Homology Search
Abstract. Optimized spaced seeds improve sensitivity and specificity in local homology search [1]. Recently, several authors [2-4] have shown that multiple seeds can have better sensitivity and specificity than single seeds. We describe a linear programming-based algorithm to optimize a set of seeds. Our algorithm offers a performance guarantee: the sensitivity of a chosen seed set is at least 70% of what can be achieved, in most reasonable models of homologous sequences. Our method achieves performance comparable to that of a greedy algorithm, but our work gives this area a mathematical foundation
The Most Probable Annotation Problem in HMMs and Its Application to Bioinformatics
Hidden Markov models (HMMs) are often used for biological sequence annotation. Each sequence feature is represented by a collection of states with the same label. In annotating a new sequence, we seek the sequence of labels that has the highest probability. Computing this most probable annotation was shown to be NP-hard by Lyngsø and Pedersen [15]. We improve their result by showing that the problem is NP-hard for a specific HMM, and present efficient algorithms to compute the most probable annotation for a large class of HMMs, including abstractions of models previously used for transmembrane protein topology prediction and coding region detection. We also present a small experiment showing that the maximum probability annotation is more accurate than the labeling that results from simpler heuristics.
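The difference between the most probable state path and the most probable annotation can be seen on a toy model. The sketch below is a brute-force illustration, exponential in the sequence length, and is not one of the efficient algorithms the abstract refers to. The three-state HMM, its probabilities, and all names are assumptions invented for this example.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy HMM: state "a" has label X, states "b1" and "b2" share label Y.
states = ["a", "b1", "b2"]
label = {"a": "X", "b1": "Y", "b2": "Y"}
start = {"a": 0.4, "b1": 0.3, "b2": 0.3}
# Each state only transitions to itself, which keeps the example easy to check.
trans = {s: {t: 1.0 if s == t else 0.0 for t in states} for s in states}
# Every state emits the symbols "0" and "1" with equal probability.
emit = {s: {"0": 0.5, "1": 0.5} for s in states}


def path_probability(path, obs):
    """Joint probability of a state path and the observed sequence."""
    p = start[path[0]] * emit[path[0]][obs[0]]
    for prev, cur, sym in zip(path, path[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][sym]
    return p


def decode_by_enumeration(obs):
    """Return (labels of the best single path, most probable annotation)."""
    best_path, best_p = None, -1.0
    annotation_p = defaultdict(float)
    for path in product(states, repeat=len(obs)):
        p = path_probability(path, obs)
        if p > best_p:
            best_path, best_p = path, p
        # An annotation's probability sums over all paths with the same labels.
        annotation_p["".join(label[s] for s in path)] += p
    best_annotation = max(annotation_p, key=annotation_p.get)
    return "".join(label[s] for s in best_path), best_annotation


viterbi_labels, best_annotation = decode_by_enumeration("01")
print(viterbi_labels, best_annotation)
```

In this toy model the single best path stays in state `a`, but the two `Y` paths together carry more probability, so the two decodings disagree.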
Amino Acid Classification and Hash Seeds for Homology Search
Spaced seeds have been extensively studied in the homology search field. A spaced seed can be regarded as a very special type of hash function on k-mers, where two k-mers have the same hash value if and only if they are identical at the w (w < k) positions designated by the seed. Spaced seeds have substantially increased homology search sensitivity. It is then natural to ask whether there is a better hash function (called a hash seed) that provides better sensitivity than the spaced seed. We study this question in this paper. We propose a strategy to classify amino acids, which leads to a better hash seed. Our results raise a new question about how to design the best hash seed
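The view of a spaced seed as a hash function on k-mers can be sketched in a few lines. This is a minimal illustration; the seed pattern `1101` and the function name are assumptions made for the example, not taken from the paper. A `1` marks a position that must match and a `0` marks a position that is ignored.

```python
def spaced_seed_hash(kmer: str, seed: str) -> str:
    """Keep only the characters of kmer at the '1' positions of seed."""
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length")
    return "".join(c for c, s in zip(kmer, seed) if s == "1")


seed = "1101"  # k = 4 positions in total, w = 3 of them designated by the seed

h1 = spaced_seed_hash("ACGT", seed)
h2 = spaced_seed_hash("ACTT", seed)  # differs from ACGT only at the ignored position
h3 = spaced_seed_hash("ACGA", seed)  # differs from ACGT at a designated position

print(h1 == h2)  # the two k-mers collide, so they count as a seed hit
print(h1 == h3)  # these do not collide
```

A hash seed in the sense of the abstract would replace this projection with a more general function of the k-mer, for example one that first maps each amino acid to a class.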
Quality of Algorithms for Sequence Comparison
Pairwise sequence alignment is the basic method for comparative analysis of proteins and nucleic acids. When studying the results of an alignment program, one has to consider two questions: (1) did the program find all the interesting similarities (“sensitivity”), and (2) are all the similarities it found interesting (“selectivity”)? To answer them, one has to specify which alignments are considered interesting. Analogous questions can be asked of each obtained alignment: (3) what fraction of the aligned positions is aligned correctly (“confidence”), and (4) does the alignment contain all pairs of corresponding positions of the compared sequences (“accuracy”)? Naturally, the answers depend on the definition of the correct alignment. The presentation addresses these two pairs of questions, which are extremely important in interpreting the results of sequence comparison.
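Questions (3) and (4) can be made concrete once a reference alignment is fixed. The sketch below assumes that an alignment is represented as a set of aligned position pairs `(i, j)`. It reads "confidence" as the fraction of predicted pairs that appear in the reference, and "accuracy" as the fraction of reference pairs that were recovered. Both the representation and the function name are assumptions made for this example.

```python
def confidence_and_accuracy(predicted, reference):
    """Compare two alignments given as collections of (i, j) position pairs."""
    predicted, reference = set(predicted), set(reference)
    correct = predicted & reference
    # Confidence: share of the predicted pairs that are correct.
    confidence = len(correct) / len(predicted) if predicted else 0.0
    # Accuracy: share of the reference pairs that the prediction contains.
    accuracy = len(correct) / len(reference) if reference else 0.0
    return confidence, accuracy


# Hypothetical alignments: the prediction drifts off the diagonal after two pairs.
predicted = [(0, 0), (1, 1), (2, 3), (3, 4)]
reference = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

conf, acc = confidence_and_accuracy(predicted, reference)
print(conf, acc)
```

Here two of the four predicted pairs are correct and two of the five reference pairs are recovered, so the two measures differ even though they share a numerator.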