Indexing large genome collections on a PC

Author: Danek Agnieszka
Deorowicz Sebastian
Grabowski Szymon
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 28/03/2014

Motivation: The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with existing algorithms due to their large memory requirements. Results: We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in the exact and approximate matching models, against a collection of thousand(s) of genomes. Its unique feature is the small index size, which fits in a standard computer with 16-32 GB, or even 8 GB, of RAM for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, exact matching queries are handled in an average time of 39 µs, and queries with up to 3 mismatches in 373 µs, on the test PC with an index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Availability: Software and Supplementary material: http://sun.aei.polsl.pl/mugi

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare

Improved algorithms for string searching problems

Author: Salmela Leena
Publication venue: Teknillinen korkeakoulu
Publication date: 01/01/2009

We present improved, practically efficient algorithms for several string searching problems, where we search for a short string called the pattern in a longer string called the text. We are mainly interested in the online problem, where the text is not preprocessed, but we also present a light indexing approach to speed up exact searching of a single pattern. The new algorithms can be applied, e.g., to many problems in bioinformatics and to other content scanning and filtering problems. In addition to exact string matching, we develop algorithms for several other variations of the string matching problem. We study algorithms for approximate string matching, where a limited number of errors is allowed in the occurrences of the pattern, and parameterized string matching, where a substring of the text matches the pattern if the characters of the substring can be renamed in such a way that the renamed substring matches the pattern exactly. We also consider searching multiple patterns simultaneously and searching weighted patterns, where the weight of a character at a given position reflects the probability of that character occurring at that position. Many of the new algorithms use the backward matching principle, where the characters of the text that are aligned with the pattern are read backward, i.e. from right to left. Another common characteristic of the new algorithms is the use of q-grams, i.e. q consecutive characters are handled as a single character. Many of the new algorithms are bit parallel, i.e. they pack several variables into a single computer word and update all these variables with a single instruction. We show that the q-gram backward string matching algorithms that solve the exact, approximate, or multiple string matching problems are optimal on average. We also show that the q-gram backward string matching algorithm for the parameterized string matching problem is sublinear on average for a class of moderately repetitive patterns.
All the presented algorithms are also shown to be fast in practice when compared to earlier algorithms. We also propose an alphabet sampling technique to speed up exact string matching. We choose a subset of the alphabet and select the corresponding subsequence of the text. String matching is then performed on this reduced subsequence and the found matches are verified in the original text. We show how to choose the sampled alphabet optimally and show that the technique speeds up string matching, especially for moderate to long patterns.
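
The alphabet sampling idea described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`sample`, `find_with_alphabet_sampling`), the use of `str.find` as the underlying matcher, and the fallback for patterns without any sampled character are all assumptions made for illustration, and the sampled alphabet is simply passed in rather than chosen optimally.

```python
def sample(s, sampled):
    """Return the subsequence of s over the sampled alphabet and the
    original position of each kept character."""
    kept, positions = [], []
    for i, ch in enumerate(s):
        if ch in sampled:
            kept.append(ch)
            positions.append(i)
    return "".join(kept), positions


def find_all(text, pattern):
    """Plain exact search, used only as a fallback (hypothetical helper)."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def find_with_alphabet_sampling(text, pattern, sampled):
    """Search the sampled text for the sampled pattern, then verify each
    candidate in the original text."""
    s_pattern, p_pos = sample(pattern, sampled)
    if not s_pattern:
        # Assumption: with no sampled character in the pattern there is
        # nothing to filter on, so fall back to a plain search.
        return find_all(text, pattern)
    s_text, t_pos = sample(text, sampled)
    first = p_pos[0]  # offset of the first sampled character in the pattern
    hits = []
    j = s_text.find(s_pattern)
    while j != -1:
        candidate = t_pos[j] - first  # implied start in the original text
        if candidate >= 0 and text.startswith(pattern, candidate):
            hits.append(candidate)
        j = s_text.find(s_pattern, j + 1)
    return hits
```

In this sketch the sampled text is rebuilt on every call; an indexing variant would presumably store the sampled text and its position list once and reuse them across queries.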

Computing the original eBWT faster, simpler, and with less memory

Author: Boucher Christina
Cenzato Davide
Lipták Zsuzsanna
Rossi Massimiliano
Sciortino Marinella
Publication venue
Publication date: 01/01/2021

Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. However, since its introduction, the term has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6x over the second-best method. The peak memory is at most 2x larger than that of the second-best method. Compared with methods that, like our algorithm, are also able to report suffix array samples, we obtain a 57.1x improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT. Comment: 20 pages, 5 figures, 1 tabl
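
To make the definition referred to in this abstract concrete, here is a naive Python sketch of the original eBWT: all rotations (conjugates) of all input strings are sorted by comparing their infinite repetitions, and the last character of each is output. This is only a quadratic-time illustration of the definition, not the linear-time algorithm or the prefix-free parsing approach of the paper. The names `omega_cmp` and `ebwt` are made up here, and the inputs are assumed to be primitive strings (not powers of a shorter string).

```python
from functools import cmp_to_key


def omega_cmp(u, v):
    """Compare the infinite words u^omega and v^omega.

    By the Fine and Wilf periodicity lemma, comparing prefixes of length
    len(u) + len(v) is enough to decide the order.
    """
    n = len(u) + len(v)
    pu = (u * (n // len(u) + 1))[:n]
    pv = (v * (n // len(v) + 1))[:n]
    return (pu > pv) - (pu < pv)


def ebwt(strings):
    """Naive eBWT: sort every rotation of every string in omega-order and
    concatenate the last characters."""
    conjugates = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    conjugates.sort(key=cmp_to_key(omega_cmp))
    return "".join(c[-1] for c in conjugates)
```

Because the rotations of all strings are pooled before sorting, `ebwt(["abac", "bca"])` and `ebwt(["bca", "abac"])` give the same result, which is the input-order independence the abstract emphasises. For a single string the output is the BWT computed from its rotations without an end-of-string symbol.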

arXiv.org e-Print Archive

Analysis of the Period Recovery Error Bound

Author: Amir Amihood
Boneh Itai
Itzhaki Michael
Kondratovsky Eitan
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 28th Annual European Symposium on Algorithms (ESA 2020)
Publication date: 01/01/2020

Dagstuhl Research Online Publication Server

Fast-Find: A novel computational approach to analyzing combinatorial motifs

Author: Hamady Micah
Knight Rob
Peden Erin
Singh Ravinder
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Many vital biological processes, including transcription and splicing, require a combination of short, degenerate sequence patterns, or motifs, adjacent to defined sequence features. Although these motifs occur frequently by chance, they only have biological meaning within a specific context. Identifying transcripts that contain meaningful combinations of patterns is thus an important problem, which existing tools address poorly. RESULTS: Here we present a new approach, Fast-FIND (Fast-Fully Indexed Nucleotide Database), that uses a relational database to support rapid indexed searches for arbitrary combinations of patterns defined either by sequence or composition. Fast-FIND is easy to implement, takes less than a second to search the entire Drosophila genome sequence for arbitrary patterns adjacent to sites of alternative polyadenylation, and is sufficiently fast to allow sensitivity analysis on the patterns. We have applied this approach to identify transcripts that contain combinations of sequence motifs for RNA-binding proteins that may regulate alternative polyadenylation. CONCLUSION: Fast-FIND provides an efficient way to identify transcripts that are potentially regulated via alternative polyadenylation. We have used it to generate hypotheses about interactions between specific polyadenylation factors, which we will test experimentally.
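
The general idea of answering motif-combination queries with an indexed relational table can be sketched with Python's built-in `sqlite3` module. Everything below is a hypothetical illustration, not Fast-FIND's actual schema or queries: the table `kmer`, the functions `build_index` and `transcripts_with_pair`, and the restriction to exact, fixed-length motifs (rather than degenerate or composition-based patterns) are assumptions made for brevity.

```python
import sqlite3


def build_index(transcripts, k):
    """Store every k-mer of every transcript with its start position.

    transcripts: dict mapping a transcript name to its sequence.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE kmer (transcript TEXT, pos INTEGER, seq TEXT)")
    rows = [
        (name, i, seq[i:i + k])
        for name, seq in transcripts.items()
        for i in range(len(seq) - k + 1)
    ]
    db.executemany("INSERT INTO kmer VALUES (?, ?, ?)", rows)
    # The index lets lookups by motif avoid scanning the whole table.
    db.execute("CREATE INDEX kmer_seq ON kmer (seq, transcript, pos)")
    return db


def transcripts_with_pair(db, motif_a, motif_b, max_gap):
    """Transcripts in which motif_b starts at most max_gap bases
    downstream of the start of motif_a."""
    query = """
        SELECT DISTINCT a.transcript
        FROM kmer AS a
        JOIN kmer AS b ON a.transcript = b.transcript
        WHERE a.seq = ? AND b.seq = ?
          AND b.pos > a.pos AND b.pos - a.pos <= ?
        ORDER BY a.transcript
    """
    return [row[0] for row in db.execute(query, (motif_a, motif_b, max_gap))]
```

Both motifs must have length `k` in this sketch. A combination of patterns becomes a self-join on the indexed table, so changing the motifs or the allowed distance only changes the query parameters, not the stored data.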

Springer - Publisher Connector

Directory of Open Access Journals

eScholarship - University of California