Search CORE

13 research outputs found

New algorithms for exact and approximate text matching

Author: Grabowski Szymon
Publication venue: Lodz University of Technology. Press
Publication date: 01/01/2010
Field of study

Praca przedstawia główne wyniki z tematyki algorytmów tekstowych otrzymane w Katedrze Informatyki Stosowanej w latach 2004-2009. Algorytmy te dotyczą wybranych rozmaitych problemów wyszukiwania dokładnego i przybliżonego, również w intensywnie w ostatnich latach badanym scenariuszu z wykorzystaniem kompresji.This work presents main results in the domain of text algorithms obtained in Computer Engineering Dept. in the years 2004-2009. The algorithms concern various exact and approximate string matching problems, also in the recently actively developed scenario involving compression

Lodz University of Technology Repository

Prospects and limitations of full-text index structures in genome analysis

Author: Dawyndt Peter
De Baets Bernard
Fack Veerle
Vyverman Michaël
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012
Field of study

The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

Ghent University Academic Bibliography

PubMed Central

ALFALFA : fast and accurate mapping of long next generation sequencing reads

Author: Vyverman Michaël
Publication venue: Ghent University. Faculty of Sciences
Publication date: 01/01/2014
Field of study

Ghent University Academic Bibliography

Content-aware compression for big textual data analysis

Author: Dong Dapeng
Publication venue: 'University College Cork'
Publication date: 01/01/2016
Field of study

A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements

Irish Universities

Cork Open Research Archive

A Simple Alphabet-Independent FM-Index

Author: Alejandro Salinger
Gonzalo Navarro
Szymon Grabowski
Veli Mäkinen
Publication venue
Publication date: 01/01/2005
Field of study

We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, σ, present in that structure. On a text of length n with zero-order entropy H0, our index needs O(n(H0 + 1)) bits of space, without any dependence on σ. The average search time for a pattern of length m is O(m(H0 + 1)), under reasonable assumptions. Each position of a text occurrence can be reported in worst case time O((H0 + 1)log n), while any text substring of length L can be retrieved in O((H0 + 1)L) average time in addition to the previous worst case time. Our index provides a relevant space/time tradeoff between existing succinct data structures, with the additional interest of being easy to implement. Our experimental results show that, although not among the most succinct, our index is faster than the others in many aspects, even letting them use significatively more space

CiteSeerX

Repositorio Académico de la Universidad de Chile

A SIMPLE ALPHABET-INDEPENDENT FM-INDEX

Author: Alejandro Salinger
Veli Mäkinen et al.
Publication venue
Publication date
Field of study

CiteSeerX