4 research outputs found

    Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

    Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in a bidirectional FM-index over previous ad hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to that of the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner that intelligently combines search in the bidirectional FM-index using our optimal search schemes with in-text verification using dynamic programming will outperform today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing.
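
The idea of a search scheme described above can be illustrated with a small sketch. It is not the MIP from the abstract; it only checks whether a candidate scheme covers every way of distributing up to k Hamming errors over the pieces. The representation of a search as a triple (piece order, lower bounds, upper bounds on the cumulative error count) and the names `covers` and `is_valid_scheme` are assumptions made for illustration.

```python
from itertools import product

def covers(search, errors):
    """Return True if a single search accepts this per-piece error distribution.

    search = (order, lower, upper): 'order' lists piece indices in the order
    they are searched; lower[i] and upper[i] bound the cumulative number of
    errors after the i-th piece in that order (assumed representation).
    """
    order, lower, upper = search
    total = 0
    for step, piece in enumerate(order):
        total += errors[piece]
        if not (lower[step] <= total <= upper[step]):
            return False
    return True

def is_valid_scheme(scheme, num_pieces, max_errors):
    """Check that every distribution of at most max_errors errors is covered."""
    for errors in product(range(max_errors + 1), repeat=num_pieces):
        if sum(errors) > max_errors:
            continue
        if not any(covers(search, errors) for search in scheme):
            return False
    return True

# Pigeonhole-style scheme for 2 pieces and 1 error: start from either piece
# with no error, then allow up to one error in the other piece.
pigeonhole = [
    ((0, 1), (0, 0), (0, 1)),
    ((1, 0), (0, 0), (0, 1)),
]
print(is_valid_scheme(pigeonhole, num_pieces=2, max_errors=1))
```

The check covers only the error bounds; it does not verify that each order can actually be followed by extending a match to the left or right in a bidirectional index.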

    miRkwood: a tool for the reliable identification of microRNAs in plant genomes

    Background: MicroRNAs (miRNAs) play crucial roles in post-transcriptional regulation of eukaryotic gene expression and are involved in many aspects of plant development. Although several prediction tools are available for metazoan genomes, the number of tools dedicated to plants is relatively limited. Results: Here, we present miRkwood, a user-friendly tool for the identification of miRNAs in plant genomes using small RNA sequencing data. Deep-sequencing data of Argonaute-associated small RNAs showed that miRkwood is able to identify a large diversity of plant miRNAs and limits false positive predictions. Moreover, it outperforms current tools such as ShortStack and, unlike ShortStack, provides a quality score that allows users to rank miRNA predictions. Conclusion: miRkwood is a very efficient tool for the annotation of miRNAs in plant genomes. It is available as a web server, as a standalone version, as a Docker image and as a Galaxy tool.

    Approximate search of short patterns with high error rates using the 01⁎0 lossless seeds

    Approximate pattern matching is an important computational problem that has a wide range of applications in computational biology and in information retrieval. However, searching for a short pattern in a text with high error rates (10–20%) under the Levenshtein distance is a task for which few efficient solutions exist. Here we address this problem by introducing a new type of seeds: the 01⁎0 seeds. These seeds are made of two exact parts separated by parts with exactly one error. We show that those seeds are lossless, and we apply them to two filtration algorithms for two popular applications, one where a compressed index is built on the text and another one where the patterns are indexed. We also demonstrate experimentally the advantages of our approach compared to alternative methods implementing other types of seeds. This work opens the way to the design of more efficient and more sensitive text algorithms.
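
The lossless property mentioned above can be checked by brute force in a simplified model. The sketch below assumes that the pattern is split into k + 2 parts for at most k errors and that each error falls in exactly one part; both are assumptions, since the abstract does not state the number of parts. The function names are hypothetical.

```python
from itertools import product

def has_01star0(counts):
    """Return True if the per-part error counts contain a run 0, 1, ..., 1, 0.

    That is: two error-free parts separated by zero or more parts that each
    contain exactly one error.
    """
    n = len(counts)
    for i in range(n):
        if counts[i] != 0:
            continue
        j = i + 1
        while j < n and counts[j] == 1:
            j += 1
        if j < n and counts[j] == 0:
            return True
    return False

def is_lossless(k):
    """Check every distribution of at most k errors over k + 2 parts (assumed split)."""
    parts = k + 2
    return all(
        has_01star0(dist)
        for dist in product(range(k + 1), repeat=parts)
        if sum(dist) <= k
    )

# Exhaustive check for small error counts.
print(all(is_lossless(k) for k in range(1, 5)))
```

The check is purely combinatorial: it says nothing about how errors under the Levenshtein distance are assigned to parts, nor about the filtration algorithms built on the seeds.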

    Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

    The objective of the research in this dissertation is to derive optimal search schemes for approximate string matching using a bidirectional FM-index, and to use them to speed up such searches. This problem arises in many computer science applications. The approximate string matching problem is also central in bioinformatics, where biologists are interested in aligning pieces of DNA back to a genome. Given a text, the search for a given pattern can be accelerated by preprocessing the text, either by constructing a hash table or by indexing the text. Bidirectional indices have opened new possibilities by allowing a search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Prior work tends to use search heuristics but lacks the ability to find the best strategies for using an index to search for a pattern. In this dissertation, we will find the optimal search scheme for the approximate string matching problem on a bidirectional index, assuming that the number of partitions is given. Moreover, we will investigate the computational gain from applying these optimal search schemes to search in a bidirectional FM-index. Intellectual Merit. First, we propose an MIP formulation to find the optimal search scheme for the approximate string matching problem using a bidirectional index under Hamming distance. Second, we demonstrate that our MIP can solve the optimum search scheme problem to optimality in a reasonable amount of time for input parameters of considerable size, and that it converges very quickly to optimal or near-optimal solutions for input parameters of larger size. 
Third, we show that approximate search in a bidirectional FM-index can be performed significantly faster if the optimal schemes obtained from our MIP are used. This is demonstrated based on the number of edges in the search tries as well as the actual running time of in-index search for Illumina DNA sequencing reads (up to 35 times faster than standard backtracking for 3 errors). Although our MIP solutions are for Hamming distance, they perform equally well for edit distance. Fourth, we demonstrate that our optimal search schemes are superior to the best in-index aligners for 2 and 3 errors. To gauge the potential of combining our optimal search schemes with in-text verification, we combine an optimal search scheme with in-text verification for Hamming distance. This experiment halved the running time for reads of size 101 and 125. Furthermore, we showcase the power of our optimal search schemes by demonstrating that, for 1 to 3 errors, approximate string matching of reads of size 40, 101, and 125 performed completely in the index competes in running time with the best full-fledged aligners, which benefit from combining search in the index with in-text verification for edit distance. Moreover, we will relax the assumption of equal-size partitions in our MIP and address the more general form of the approximate string matching problem, where the only assumption is a prespecified number of partitions. We will present an MIP formulation for edit distance and provide an alternative formulation for Hamming distance. Broader Impacts. The results of this research promise a significant increase in the speed of finding approximate occurrences of a pattern in a text. This is an important problem with many applications in bioinformatics and computer science, such as recovering text in signal processing and information retrieval [23]. 
Approximate string matching plays an indisputable role in bioinformatics, where any downstream analysis of genomic data starts with aligning sequenced DNA or RNA reads back to a reference genome. Technologies such as next-generation sequencing have produced a considerable amount of data, leading to increasing demand for fast read aligners that map pieces of DNA to a genome. To solve this central problem, one can consider the genome of any species of interest as the "text" and the sequenced pieces of DNA as the "patterns", and then search for approximate occurrences of a pattern in the text using a full-text index. Some tolerance for errors is required because of mutations in the genome of each individual organism, such as single nucleotide variants (SNVs), as well as errors introduced by sequencing technologies. This broad spectrum of applications indicates the significant impact of this research on many areas of health and life sciences and practice, where discovery, diagnosis, and treatment all depend on genome sequencing.