
    Data structures and algorithms for approximate string matching (Zvi Galil, Raffaele Giancarlo)

    This paper surveys techniques for designing efficient sequential and parallel approximate string matching algorithms. Special attention is given to methods for constructing data structures that efficiently support the primitive operations needed in approximate string matching.
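
    As a point of reference for the problem the survey addresses, the sketch below shows the textbook dynamic-programming approach to approximate matching under edit distance (a Python illustration, not taken from the survey itself): it reports every text position where the pattern occurs with at most a given number of edits.

```python
# Minimal sketch (not from the survey): the classic dynamic-programming
# approach to approximate string matching under edit distance, reporting
# every position in the text where the pattern ends with at most
# max_errors edits.  The survey covers far more efficient sequential and
# parallel methods; this baseline only illustrates the problem.

def approximate_matches(pattern: str, text: str, max_errors: int):
    m = len(pattern)
    # col[i] = edit distance between pattern[:i] and the best-matching
    # suffix of the text scanned so far (one column of the DP matrix).
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may start anywhere in the text
        for i in range(1, m + 1):
            cur = min(
                col[i] + 1,        # skip the text character
                col[i - 1] + 1,    # skip a pattern character
                prev_diag + (0 if pattern[i - 1] == ch else 1),  # match / substitution
            )
            prev_diag, col[i] = col[i], cur
        if col[m] <= max_errors:
            ends.append(j)  # pattern ends here with <= max_errors edits
    return ends

print(approximate_matches("survey", "this srvey surveys strings", 1))
```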

    Double String Tandem Repeats

    A tandem repeat is an occurrence of two adjacent identical substrings. In this paper, we introduce the notion of a double string, which consists of two parallel strings, and we study the problem of locating all tandem repeats in a double string. The problem introduced here has applications beyond actual double strings, as we illustrate by solving two different problems with the algorithm for the double string tandem repeats problem. The first problem is that of finding all corner-sharing tandems in a 2-dimensional text, defined by Apostolico and Brimkov. The second problem is that of finding all scaled tandem repeats in a 1-dimensional text, where a scaled tandem repeat is defined as a string UU′ such that U′ is a discrete scale of U. In addition to the algorithms for exact tandem repeats, we also present algorithms that solve the problem in the inexact sense, allowing up to k mismatches. We believe that this framework will open a new perspective for other problems in the future.
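
    For readers unfamiliar with the terminology, the following brute-force Python sketch (not the paper's algorithm) enumerates tandem repeats UU in an ordinary single string; the paper's contribution lies in the richer setting of double strings, corner-sharing 2D tandems, scaled tandems, and up to k mismatches, handled far more efficiently.

```python
# Minimal brute-force sketch (not the paper's algorithm): enumerate all
# tandem repeats UU in an ordinary (single) string.  The paper studies
# double strings, corner-sharing 2D tandems and scaled tandem repeats,
# and also allows up to k mismatches.

def tandem_repeats(s: str):
    """Yield (start, period) for every occurrence of a square UU in s."""
    n = len(s)
    for period in range(1, n // 2 + 1):              # |U| = period
        for start in range(0, n - 2 * period + 1):
            if s[start:start + period] == s[start + period:start + 2 * period]:
                yield start, period

print(list(tandem_repeats("abaabaa")))  # finds "aa" twice and "abaaba", "baabaa"
```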

    Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic

    Countless variants of Lempel-Ziv compression are widely used in many real-life applications. This paper is concerned with a natural modification of the classical pattern matching problem inspired by the popularity of such compression methods: given an uncompressed pattern s[1..m] and a Lempel-Ziv representation of a string t[1..N], does s occur in t? Farach and Thorup gave a randomized O(n log^2(N/n) + m) time solution for this problem, where n is the size of the compressed representation of t. We improve their result by developing a faster and fully deterministic O(n log(N/n) + m) time algorithm with the same space complexity. Note that for highly compressible texts, log(N/n) might be of order n, so for such inputs the improvement is very significant. A (tiny) fragment of our method can be used to give an asymptotically optimal solution for the substring hashing problem considered by Farach and Muthukrishnan.
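
    To make the problem statement concrete, the toy Python sketch below decodes a small LZ77-style factorization of (offset, length, next character) triples and then searches the decompressed text. This naive baseline costs O(N) time and space in the uncompressed length, which is exactly what the compressed-matching algorithms discussed above avoid; the factorization format here is an illustrative assumption, not the exact representation used in the paper.

```python
# Illustrative sketch only: a toy LZ77-style factorization given as
# (offset, length, next_char) triples, decoded naively and then scanned
# for the pattern.  This baseline needs O(N) time and space in the
# uncompressed length N; the point of compressed pattern matching is to
# answer the query in time depending on the compressed size n instead.

def lz77_decode(factors):
    out = []
    for offset, length, nxt in factors:
        start = len(out) - offset
        for k in range(length):          # copies may overlap the output
            out.append(out[start + k])
        if nxt is not None:
            out.append(nxt)
    return "".join(out)

def occurs(pattern, factors):
    return pattern in lz77_decode(factors)

# "abababba": literal 'a', literal 'b', copy 4 chars from offset 2 then 'b', literal 'a'
factors = [(0, 0, "a"), (0, 0, "b"), (2, 4, "b"), (0, 0, "a")]
print(lz77_decode(factors), occurs("abba", factors))
```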

    Longest Common Extensions in Sublinear Space

    The longest common extension problem (LCE problem) is to construct a data structure for an input string T of length n that supports LCE(i,j) queries. Such a query returns the length of the longest common prefix of the suffixes starting at positions i and j in T. This classic problem has a well-known solution that uses O(n) space and O(1) query time. In this paper we show that for any trade-off parameter 1 ≤ τ ≤ n, the problem can be solved in O(n/τ) space and O(τ) query time. This significantly improves the previously best known time-space trade-offs, and almost matches the best known time-space product lower bound.
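
    To illustrate the query semantics, the Python sketch below answers LCE(i,j) by direct character comparison; it uses constant extra space but Θ(n) worst-case time, whereas the data structure described above answers the same query in O(τ) time using O(n/τ) space.

```python
# Sketch of the query semantics only: LCE(i, j) is the length of the
# longest common prefix of the suffixes T[i..] and T[j..].  This naive
# scan uses O(1) extra space but Theta(n) worst-case time; the paper's
# structure answers it in O(tau) time with O(n/tau) space.

def lce_naive(T: str, i: int, j: int) -> int:
    k = 0
    while i + k < len(T) and j + k < len(T) and T[i + k] == T[j + k]:
        k += 1
    return k

T = "abracadabra"
print(lce_naive(T, 0, 7))  # suffixes "abracadabra" and "abra" share "abra" -> 4
```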

    Space-Efficient Clustering of Metagenomic DNA Fragments (Tilatehokas metagenomisten DNA-fragmenttien ryhmittely)

    The collection of all genomes in an environment is called the metagenome of the environment. In the past 15 years, high-throughput sequencing has made it feasible to sequence entire environments at once for the first time in history, which has resulted in a variety of interesting new algorithmic problems. This thesis focuses on the basic problem of clustering the reads from an environment according to which species, or more generally, which taxonomic unit they originate from. In this work, we identify and formalize two fundamental string processing tasks useful in clustering metagenomic read sets. We solve the two problems with space efficiency in mind using the recently developed bidirectional Burrows-Wheeler index. The algorithms were implemented in a way that makes parallel processing possible. Our tool is experimentally shown to give good results on simple simulated datasets, and to be about ten times faster and more space-efficient than two recently published metagenome clustering tools.
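
    As background, the sketch below computes the plain (unidirectional) Burrows-Wheeler transform by naive rotation sorting; it only illustrates the transform underlying the bidirectional index mentioned above and is far from the space-efficient construction an actual tool would use.

```python
# Minimal sketch of the (unidirectional) Burrows-Wheeler transform that
# underlies the bidirectional BWT index used in the thesis.  Sorting all
# rotations takes O(n^2 log n) time here; real tools build the transform
# via suffix arrays in (near-)linear time and far less space.

def bwt(text: str, terminator: str = "$") -> str:
    s = text + terminator                      # unique end-of-string marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("ACGTACGT"))  # last column of the sorted rotation matrix
```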

    Efficient Computation of Maximal Exact Matches Between Genomic Sequences

    Sequence alignment is one of the most established methods in the field of bioinformatics, crucial for determining similarities between sequences in tasks ranging from finding genes to predicting functions. The computation of Maximal Exact Matches (MEMs) plays a fundamental part in some algorithms for sequence alignment. MEMs between a reference genome and a query genome are often used as seeds in a genome aligner to increase its efficiency. MEM computation is a time-consuming step in the sequence alignment process, and improving the performance of this step significantly speeds up the whole alignment. As of today, there are many programs available for MEM computation, from algorithms based on full-text indexes, like essaMEM, to more effective ones, such as E-MEM, copMEM and bfMEM. However, none of the available programs for computing MEMs is able to work with highly related sequences. In this study, we propose E-MEM2, an improved version of the well-known MEM computation software E-MEM. With a trade-off between time and memory, the improved version runs faster than its predecessor, with very large improvements when comparing closely related sequences.
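
    For illustration only (this is not the E-MEM/E-MEM2 method), the brute-force Python sketch below enumerates maximal exact matches between two short sequences directly from the definition: a common substring that cannot be extended to the left or to the right. Practical MEM finders rely on k-mer sampling, hashing, or indexes to handle genome-scale inputs.

```python
# Brute-force sketch of the MEM definition only (not the E-MEM/E-MEM2
# method): a maximal exact match is a common substring of the reference
# and the query that cannot be extended to the left or to the right.
# Real MEM finders use sampling, hashing or indexes at genome scale.

def maximal_exact_matches(ref: str, qry: str, min_len: int):
    mems = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            # left-maximal: the preceding characters must differ (or not exist)
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            length = 0
            while (i + length < len(ref) and j + length < len(qry)
                   and ref[i + length] == qry[j + length]):
                length += 1
            # right-maximality holds because the loop extended as far as
            # possible; report only matches long enough to serve as seeds
            if length >= min_len:
                mems.append((i, j, length))
    return mems

print(maximal_exact_matches("ACGTTGCA", "TTGCAACGT", 4))
```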