Search CORE

16,394 research outputs found

Algorithms for Order-Preserving Matching

Author: Chhabra Tamanna
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016
Field of study

String matching is a widely studied problem in Computer Science. There have been many recent developments in this field. One fascinating problem considered lately is the order-preserving matching (OPM) problem. The task is to find all the substrings in the text which have the same length and relative order as the pattern, where the relative order is the numerical order of the numbers in a string. The problem finds its applications in the areas involving time series or series of numbers. More specifically, it is useful for those who are interested in the relative order of the pattern and not in the pattern itself. For example, it can be used by analysts in a stock market to study movements of prices. In addition to the OPM problem, we also studied its approximate variation. In approximate order-preserving matching, we search for those substrings in the text which have relative order similar to the pattern, i.e., relative order of the pattern matches with at most k mismatches. With respect to applications of order-preserving matching, approximate search is more meaningful than exact search. We developed various advanced solutions for the problem and its variant. Special emphasis was laid on the practical efficiency of the solutions. Particularly, we introduced a simple solution for the OPM problem using filtration. We proved experimentally that our method was effective and faster than the previous solutions for the problem. In addition, we combined the Single Instruction Multiple Data (SIMD) instruction set architecture with filtration to develop competent solutions which were faster than our previous solution. Moreover, we proposed another efficient solution without filtration using the SIMD architecture. We also presented an offline solution based on the FM-index scheme. Furthermore, we proposed practical solutions for the approximate order-preserving matching problem and one of the solutions was the first sublinear solution on average for the problem

Aaltodoc Publication Archive

A Compact Index for Order-Preserving Pattern Matching

Author: Decaroli Gianni
Gagie Travis
Manzini Giovanni
Publication venue
Publication date: 01/01/2017
Field of study

Order-preserving pattern matching was introduced recently but it has already attracted much attention. Given a reference sequence and a pattern, we want to locate all substrings of the reference sequence whose elements have the same relative order as the pattern elements. For this problem we consider the offline version in which we build an index for the reference sequence so that subsequent searches can be completed very efficiently. We propose a space-efficient index that works well in practice despite its lack of good worst-case time bounds. Our solution is based on the new approach of decomposing the indexed sequence into an order component, containing ordering information, and a delta component, containing information on the absolute values. Experiments show that this approach is viable, faster than the available alternatives, and it is the first one offering simultaneously small space usage and fast retrieval.Comment: 16 pages. A preliminary version appeared in the Proc. IEEE Data Compression Conference, DCC 2017, Snowbird, UT, USA, 201

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

Indexing large genome collections on a PC

Author: Danek Agnieszka
Deorowicz Sebastian
Grabowski Szymon
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 28/03/2014
Field of study

Motivation: The availability of thousands of invidual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. Results: We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching model, against a collection of thousand(s) genomes. Its unique feature is the small index size fitting in a standard computer with 16--32\,GB, or even 8\,GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, the exact matching queries are handled in average time of 39\,

\mu

s and with up to 3 mismatches in 373\,

\mu

s on the test PC with the index size of 13.4\,GB. For a smaller index, occupying 7.4\,GB in memory, the respective times grow to 76\,

\mu

s and 917\,

\mu

s. Availability: Software and Suuplementary material: \url{http://sun.aei.polsl.pl/mugi}

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

FigShare

Languages of lossless seeds

Author: Břinda Karel
Publication venue: 'Open Publishing Association'
Publication date: 21/05/2014
Field of study

Several algorithms for similarity search employ seeding techniques to quickly discard very dissimilar regions. In this paper, we study theoretical properties of lossless seeds, i.e., spaced seeds having full sensitivity. We prove that lossless seeds coincide with languages of certain sofic subshifts, hence they can be recognized by finite automata. Moreover, we show that these subshifts are fully given by the number of allowed errors k and the seed margin l. We also show that for a fixed k, optimal seeds must asymptotically satisfy l ~ m^(k/(k+1)).Comment: In Proceedings AFL 2014, arXiv:1405.527

arXiv.org e-Print Archive

Directory of Open Access Journals

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM