Search CORE

352 research outputs found

Prospects and limitations of full-text index structures in genome analysis

Author: Dawyndt Peter
De Baets Bernard
Fack Veerle
Vyverman Michaël
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012
Field of study

The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

Ghent University Academic Bibliography

PubMed Central

A new method for indexing genomes using on-disk suffix trees

Author: Alex Thomo
Chris Upton
Marina Barsky
Ulrike Stege
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2008
Field of study

We propose a new method to build persistent suffix trees for indexing the genomic data. Our algorithm DiGeST (Disk-Based Genomic Suffix Tree) improves significantly over previous work in reducing the random access to the in-put string and performing only two passes over disk data. DiGeST is based on the two-phase multi-way merge sort paradigm using a concise binary representation of the DNA alphabet. Furthermore, our method scales to larger genomic data than managed before

CiteSeerX

Crossref

K-repeating Substrings: a String-Algorithmic Approach to Privacy-Preserving Publishing of Textual Data

Author: Hasida Koiti
Matsubara Yusuke
Publication venue: Department of Linguistics, Faculty of Arts, Chulalongkorn University
Publication date: 01/01/2014
Field of study

Waseda University Repository

c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches

Author: Bannai Hideo
Inenaga Shunsuke
Kanda Shunsuke
Köppl Dominik
Nakashima Yuto
Takeda Masayuki
Tsuruta Kazuya
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/10/2020
Field of study

Given a dynamic set

K

k

strings of total length

n

whose characters are drawn from an alphabet of size

\sigma

, a keyword dictionary is a data structure built on

K

that provides locate, prefix search, and update operations on

K

. Under the assumption that

\alpha = w / \lg \sigma

characters fit into a single machine word

w

, we propose a keyword dictionary that represents

K

n \lg \sigma + \Theta(k \lg n)

bits of space, supporting all operations in

O(m / \alpha + \lg \alpha)

expected time on an input string of length

m

in the word RAM model. This data structure is underlined with an exhaustive practical evaluation, highlighting the practical usefulness of the proposed data structure, especially for prefix searches - one of the most elementary keyword dictionary operations

arXiv.org e-Print Archive

Crossref

Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools

Author: Bingmann Timo
Publication venue
Publication date: 01/01/2018
Field of study

This dissertation focuses on two fundamental sorting problems: string sorting and suffix sorting. The first part considers parallel string sorting on shared-memory multi-core machines, the second part external memory suffix sorting using the induced sorting principle, and the third part distributed external memory suffix sorting with a new distributed algorithmic big data framework named Thrill.Comment: 396 pages, dissertation, Karlsruher Instituts f\"ur Technologie (2018). arXiv admin note: text overlap with arXiv:1101.3448 by other author

arXiv.org e-Print Archive

KITopen

Efficient repeat finding in sets of strings via suffix arrays

Author: Barenbaum Pablo
Becher Veronica Andrea
Deymonnaz Alejandro
Halsband Melisa
Heiber Pablo Ariel
Publication venue: Discrete Mathematics and Theoretical Computer Science
Publication date: 01/01/2013
Field of study

We consider two repeat finding problems relative to sets of strings: (a) Find the largest substrings that occur in every string of a given set; (b) Find the maximal repeats in a given string that occur in no string of a given set. Our solutions are based on the suffix array construction, requiring O(m) memory, where m is the length of the longest input string, and O(n &log;m) time, where n is the the whole input size (the sum of the length of each string in the input). The most expensive part of our algorithms is the computation of several suffix arrays. We give an implementation and experimental results that evidence the efficiency of our algorithms in practice, even for very large inputs.Fil: Barenbaum, Pablo. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina;Fil: Becher, Veronica Andrea. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina; Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Ciudad Universitaria; Argentina;Fil: Deymonnaz, Alejandro. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina;Fil: Halsband, Melisa. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina;Fil: Heiber, Pablo Ariel. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina; Consejo Nacional de Investigaciones Cientificas y Tecnicas. Oficina de Coordinacion Administrativa Ciudad Universitaria; Argentina

CiteSeerX

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

CONICET Digital

Episciences.org

ALGORITHMS FOR CORRECTING NEXT GENERATION SEQUENCING ERRORS

Author: Fazayeli Farideh
Publication venue: Scholarship@Western
Publication date: 01/01/2011
Field of study

The advent of next generation sequencing technologies (NGS) generated a revolution in biological research. However, in order to use the data they produce, new computational tools are needed. Due to significantly shorter length of the reads and higher per-base error rate, more complicated approaches are employed and still critical problems, such as genome assembly, are not satisfactorily solved. We therefore focus our attention on improving the quality of the NGS data. More precisely, we address the error correction issue. The current methods for correcting errors are not very accurate. In addition, they do not adapt to the data. We proposed a novel tool, HiTEC, to correct errors in NGS data. HiTEC is based on the suffix array data structure accompanied by a statistical analysis. HiTEC’s accuracy is significantly higher than all previous methods. In addition, it is the only tool with the ability of adjusting to the given data set. In addition, HiTEC is time and space efficient

Scholarship@Western

Parallel and scalable combinatorial string algorithms on distributed memory systems

Author: Flick Patrick
Publication venue: Georgia Institute of Technology
Publication date: 29/05/2019
Field of study

Methods for processing and analyzing DNA and genomic data are built upon combinatorial graph and string algorithms. The advent of high-throughput DNA sequencing is enabling the generation of billions of reads per experiment. Classical and sequential algorithms can no longer deal with these growing data sizes - which for the last 10 years have greatly out-paced advances in processor speeds. Processing and analyzing state-of-the-art genomic data sets require the design of scalable and efficient parallel algorithms and the use of large computing clusters. Suffix arrays and trees are fundamental string data structures, which lie at the foundation of many string algorithms, with important applications in text processing, information retrieval, and computational biology. Conversely, the parallel construction of these indices is an actively studied problem. However, prior approaches lacked good worst-case run-time guarantees and exhibit poor scaling and overall performance. In this work, we present our distributed-memory parallel algorithms for indexing large datasets, including algorithms for the distributed construction of suffix arrays, LCP arrays, and suffix trees. We formulate a generalized version of the All-Nearest-Smaller-Values problem, provide an optimal distributed solution, and apply it to the distributed construction of suffix trees - yielding a work-optimal parallel algorithm. Our algorithms for distributed suffix array and suffix tree construction improve the state-of-the-art by simultaneously improving worst-case run-time bounds and achieving superior practical performance. Next, we introduce a novel distributed string index, the Distributed Enhanced Suffix Array (DESA) - based on the suffix and LCP arrays, the DESA consists of these and additional distributed data structures. The DESA is designed to allow efficient pattern search queries in distributed memory while requiring at most O(n/p) memory per process. We present efficient distributed-memory parallel algorithms for querying, as well as for the efficient construction of this distributed index. Finally, we present our work on distributed-memory algorithms for clustering de Bruijn graphs and its application to solving a grand challenge metagenomic dataset.Ph.D

Scholarly Materials And Research @ Georgia Tech