14 research outputs found

    RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

    Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns of match and don't-care positions are used as a filter, so that only the word positions corresponding to the match positions of the patterns are considered. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set, as introduced by Ilie and Ilie, is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets we had previously used. Finally, we used rasbhari to generate patterns for short-read classification with CLARK-S; here too, sensitivity improved compared to the program's default patterns. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de
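The pattern-based filtering described in this abstract can be made concrete with a small sketch. The snippet below extracts spaced words under a single binary pattern and counts matches between two sequences; the pattern and sequences are invented for illustration, and rasbhari's actual contribution is optimizing a whole set of such patterns, not this extraction step:

```python
from collections import Counter

# A binary pattern marks match (1) and don't-care (0) positions;
# only the characters at match positions form the "spaced word".
def spaced_words(seq, pattern):
    """Yield the spaced word at every position where the pattern fits."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[start + i] for i in match_pos)

def match_count(s1, s2, pattern):
    """Number of spaced-word matches between two sequences, over all position pairs."""
    c1 = Counter(spaced_words(s1, pattern))
    c2 = Counter(spaced_words(s2, pattern))
    return sum(c1[w] * c2[w] for w in c1)

pattern = "1101"  # hypothetical pattern; rasbhari would optimize a set of these
print(match_count("ACGTACGT", "ACGTTCGT", pattern))  # prints 3
```

The variance of this match count over random evolutionarily related sequence pairs is the quantity the abstract relates to overlap complexity.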

    Computing exact P-values for DNA motifs

    Motivation: Many heuristic algorithms have been designed to approximate the P-values of DNA motifs described by position weight matrices, in order to evaluate their statistical significance. These approximations often deviate from the true P-value by orders of magnitude, so exact P-value computation is needed for ranking motifs. Furthermore, and surprisingly, the complexity of the problem has been unknown. Results: We show the problem to be NP-hard and present MotifRank, software based on dynamic programming, to calculate exact P-values of motifs. We define the exact P-value on a general and more precise model. Asymptotically, MotifRank is faster than the best existing exact P-value algorithm, and it is practical. Our experiments clearly demonstrate that MotifRank significantly improves on the accuracy of existing approximation algorithms.
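MotifRank itself is not reproduced here, but the flavor of exact P-value computation by dynamic programming can be shown on a simplified model: an integer-scored position weight matrix under an i.i.d. background, where the DP carries the exact probability of every attainable score forward one column at a time. The matrix, background and threshold below are invented for the example:

```python
from collections import defaultdict

def exact_pvalue(pwm, background, threshold):
    """Exact P(score >= threshold) for an integer-scored PWM under an
    i.i.d. background, by dynamic programming over score distributions."""
    dist = {0: 1.0}                      # score -> probability, before any column
    for column in pwm:                   # column holds scores for A, C, G, T
        nxt = defaultdict(float)
        for score, p in dist.items():
            for base, s in enumerate(column):
                nxt[score + s] += p * background[base]
        dist = nxt
    return sum(p for score, p in dist.items() if score >= threshold)

# Hypothetical 2-column motif with integer scores for A, C, G, T:
pwm = [[2, -1, -1, 0], [0, 2, -1, -1]]
bg = [0.25, 0.25, 0.25, 0.25]
print(exact_pvalue(pwm, bg, 4))  # only A then C reaches 4, so P = 1/16 = 0.0625
```

The state space here is the set of reachable scores, which is what keeps the DP polynomial for fixed score granularity; the NP-hardness result in the abstract concerns the general problem.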

    Improvements on Seeding Based Protein Sequence Similarity Search

    The primary goal of bioinformatics is to deepen our understanding of the biology of organisms. Computational, statistical and mathematical theories and techniques have been developed for the formal and practical problems that serve this goal. For the past three decades, the primary application of bioinformatics has been biological data analysis, and DNA or protein sequence similarity search is perhaps its most common, yet vitally important, task. Sequence similarity search is the process of finding optimal sequence alignments. On the theoretical level, the problem is complex; in practice, searching a biological database for similar sequences is one of the most basic tasks today, and traditional quadratic-time solutions become a challenge due to the size of the database. Seeding (or filtration) based approaches, which trade sensitivity for speed, are a popular choice. A seeding-based approach usually has two phases: the first is referred to as hit generation, and the second as hit extension. In this thesis, two improvements on seeding-based protein sequence similarity search are presented. First, for hit generation, a new seeding idea, namely spaced k-mer neighbors, is presented, together with effective algorithms to find a good set of spaced k-mer neighbors. Second, a new method, namely HexFilter, is proposed to reduce the number of hit extensions while achieving better selectivity, and HexFilters with optimized configurations are demonstrated.
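The two-phase structure described above can be sketched in a few lines. The toy hit generator below indexes all k-mers of one sequence and reports matching positions in the other; it uses exact k-mers only, whereas spaced k-mer neighbors and HexFilter are refinements layered on this basic scheme (the sequences are made up for the example):

```python
from collections import defaultdict

def generate_hits(query, target, k=3):
    """Phase 1 of a seeding-based search: report (query_pos, target_pos)
    pairs whose k-mers match exactly; phase 2 would extend these hits."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits.append((i, j))
    return hits

print(generate_hits("MKVLAT", "AKVLGT", k=3))  # "KVL" occurs at query 1, target 1
```

Every reported hit triggers an extension attempt, which is why filters that discard unpromising hits before extension, such as HexFilter, pay off.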

    A Fast Algorithm for Computing Highly Sensitive Multiple Spaced Seeds

    The main goal of homology search is to find similar segments, or local alignments, between two DNA or protein sequences. Since the dynamic programming algorithm of Smith-Waterman is too slow, heuristic methods have been designed to achieve both efficiency and accuracy. Seed-based methods were made well known by their use in BLAST, the most widely used software program in biological applications. The seed of BLAST trades sensitivity for speed, and spaced seeds were introduced in PatternHunter to achieve both. Several seeds are better than one, and near-perfect sensitivity can be obtained while maintaining the speed. Therefore, multiple spaced seeds quickly became the state of the art in similarity search, being employed by many software programs. However, the quality of these seeds is crucial, and computing optimal multiple spaced seeds is NP-hard. All but one of the existing heuristic algorithms for computing good seeds are exponential. Our work has two main goals. First, we engineer the only existing polynomial-time heuristic algorithm to compute better seeds than any other program, while running orders of magnitude faster. Second, we estimate its performance by comparing its seeds with the optimal seeds in a few practical cases. In order to make the computation feasible, a very fast implementation of the sensitivity function is provided.
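The sensitivity function mentioned above can be stated concretely: under the standard Bernoulli model, it is the probability that a random match/mismatch region of a given length contains at least one offset where every match position of the seed lands on a match. The brute-force enumeration below makes the definition explicit; it is feasible only for short regions, which is exactly why the fast (dynamic programming) implementations this thesis relies on exist. Seed and parameters are illustrative:

```python
from itertools import product

def sensitivity(seed, region_len, p):
    """P(seed hits somewhere in a random 0/1 region), by full enumeration.
    seed: string of '1' (must match) and '0' (don't care); p: P(a position matches)."""
    need = [i for i, c in enumerate(seed) if c == "1"]
    total = 0.0
    for region in product((0, 1), repeat=region_len):
        hit = any(all(region[s + i] for i in need)
                  for s in range(region_len - len(seed) + 1))
        if hit:
            ones = sum(region)
            total += p ** ones * (1 - p) ** (region_len - ones)
    return total

# Sensitivity grows with region length (more chances for a hit):
print(sensitivity("1101", 8, 0.7), sensitivity("1101", 12, 0.7))
```

The enumeration is exponential in the region length; the DP formulations condition on the recent match/mismatch suffix instead, which is what makes optimizing over many candidate seeds practical.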

    Improvements in the Accuracy of Pairwise Genomic Alignment

    Pairwise sequence alignment is a fundamental problem in bioinformatics with wide applicability. This thesis presents three new algorithms for this well-studied problem. First, we present a new algorithm, RDA, which aligns sequences in small segments rather than by individual bases. Then, we present two algorithms for aligning long genomic sequences: CAPE, a pairwise global aligner, and FEAST, a pairwise local aligner. RDA produces interesting alignments that can be substantially different in structure from traditional alignments, and it is better than traditional alignment at homology detection. However, its main drawback is a very slow run time, and although it produces alignments with different structure, it is not clear whether the differences have practical value in genomic research. Our main success comes from our local aligner, FEAST. We describe two main improvements: a new, more descriptive model of evolution, and a new local extension algorithm that considers all possible evolutionary histories rather than only the most likely one. The new model of evolution provides improved alignment accuracy and substantially improved parameter training; in particular, we produce a new parameter set for aligning human and mouse sequences that properly describes both regions of weak similarity and regions of strong similarity. The second result is our new extension algorithm. Depending on heuristic settings, it can provide more sensitivity than existing extension algorithms, more specificity, or a combination of the two. Comparing against CAPE, our global aligner, we find that the sensitivity gain from the local extension algorithm is so substantial that FEAST outperforms CAPE on sequences with 0.9 or more expected substitutions per site. CAPE itself gives improved sensitivity for sequences with 0.7 or more expected substitutions per site, but at a great run-time cost. FEAST and our local extension algorithm improve on this too: their run time is only slightly slower than existing local alignment algorithms, and asymptotically the same.
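The thesis's extension algorithm is more elaborate than anything shown here, but the standard ungapped X-drop extension that such local extension algorithms build on can be sketched briefly. Scores and the drop threshold below are illustrative defaults, not the thesis's parameters:

```python
def xdrop_extend(s1, s2, i, j, match=1, mismatch=-1, xdrop=3):
    """Extend an ungapped alignment rightward from (i, j), stopping when the
    running score falls more than `xdrop` below the best score seen so far."""
    score = best = 0
    best_len = 0
    length = 0
    while i + length < len(s1) and j + length < len(s2):
        score += match if s1[i + length] == s2[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > xdrop:
            break
    return best, best_len

print(xdrop_extend("ACGTACGT", "ACGTTCGT", 0, 0))  # survives one mismatch: (6, 8)
```

The trade-off the abstract describes lives in this loop: a larger drop threshold (or a richer scoring model) keeps more extensions alive, buying sensitivity at the cost of run time.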

    Análise e compressão de sequências genómicas (Analysis and compression of genomic sequences)

    PhD thesis in Informatics. Genetics is nowadays probably the most inspiring source for the study and development of information and coding theory. Efficient compression algorithms are essential to optimise the storage and communication of genomic data. Genomic data compression is a particular case of data compression. The entropy of DNA sequences is high, but variable: at the intra-genomic level it is higher in coding regions and lower in non-coding regions, and at the inter-genomic level it is higher in prokaryotes and lower in eukaryotes. Entropy reduction is achieved by coding the repetitive regions of DNA more efficiently. Repetitive regions are mainly composed of inexact patterns, whose errors arise from biological processes and DNA dynamics, including mutations, deletions, insertions and gaps; exact patterns are less relevant and generally appear as tandem repeats. DNA redundancy also has statistical and probabilistic manifestations. These redundancies are the main source of compression gains: large repeats lend themselves to dictionary-based substitutional compression, whereas statistical and probabilistic regularities allow the succession of symbols (bases) to be partially modeled and predicted, with statistical compressors capitalizing on that potential. At maximum entropy, DNA costs 2 bits per base to encode. On average, the best available compressors designed for the specificities of DNA reach 1.7 bits/base, a compression rate of only 15%, which illustrates the inherent difficulty. The work presented here is a framework for the analysis and compression of DNA sequences, whose main application is DNALight.
DNALight is a hybrid solution for genomic data compression based on the cooperative integration of complementary methodologies, each absorbing a different type of redundancy present in nucleotide chains. Compression is not possible without analysis: it is thorough analysis that yields the resources with which entropy can be reduced. For the analysis of DNA sequences, innovative algorithms were developed for exact pattern matching (GRASPm) and approximate pattern discovery (SimSearch), with performance that clearly surpasses the state of the art. These algorithms drive the first phase of DNALight, which exploits the most representative patterns for substitutional compression based on a dictionary of exact and approximate repeats. To maximize pattern recollection, the search is exhaustive and multi-level: over the normal 5'-3' sequence, the natural 3'-5' complement, and the two remaining artificial complements. The second phase, which targets the redundancies missed by the first, builds compact probabilistic language models from the less repetitive regions that pass through to this stage. Competing global and local language models generate predictions; accurate or approximate predictions allow more economical codings, since they skew the coding probability model and thereby benefit the arithmetic coder that closes the process. Decompression is similar to compression, with the steps reversed. Experimental results place DNALight in the state of the art of DNA sequence compression, consistently, if modestly, surpassing its predecessors. Programa de Desenvolvimento Educativo para Portugal (PRODEP)
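The statistical half of the approach described above, predicting the next base from its context and paying fewer bits when the prediction is sharp, can be illustrated with a tiny adaptive order-2 context model. The sequence and model order are invented for the example, and a real compressor such as the one described would feed these conditional probabilities into an arithmetic coder rather than just summing ideal code lengths:

```python
from collections import defaultdict
from math import log2

def bits_per_base(seq, order=2, alphabet="ACGT"):
    """Adaptive order-k model: charge each base -log2 of its current conditional
    probability (with add-one smoothing), then update the context's counts."""
    counts = defaultdict(lambda: {b: 1 for b in alphabet})  # add-one smoothing
    bits = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        table = counts[ctx]
        bits += -log2(table[base] / sum(table.values()))
        table[base] += 1
    return bits / (len(seq) - order)

repetitive = "ACGT" * 16
print(bits_per_base(repetitive))  # well below the 2 bits/base entropy ceiling
```

On repetitive input the context quickly becomes predictive and the average cost drops below 2 bits/base, which is the mechanism behind the sub-2-bit figures quoted in the abstract.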