
229 research outputs found

A practical index for approximate dictionary matching with few mismatches

Author: Cisłak Aleksander
Grabowski Szymon
Publication date: 11/02/2016

Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times on the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting of q-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low.
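The Dirichlet (pigeonhole) idea named in this abstract can be illustrated with a minimal Python sketch. It assumes each word is cut into k + 1 pieces, so that any word within Hamming distance k of the query must agree with it exactly on at least one piece. The function names `split_pieces`, `build_split_index` and `query_split_index` are hypothetical, and the hash-map layout is only an illustration; it is not the compact C++ structure the authors describe.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_pieces(word, k):
    """Cut word into k + 1 contiguous pieces at positions fixed by its length."""
    n = len(word)
    bounds = [i * n // (k + 1) for i in range(k + 2)]
    return [word[bounds[i]:bounds[i + 1]] for i in range(k + 1)]


def build_split_index(words, k=1):
    """Map (word length, piece number, piece text) to the words containing it."""
    index = defaultdict(list)
    for w in words:
        for i, piece in enumerate(split_pieces(w, k)):
            index[(len(w), i, piece)].append(w)
    return index


def query_split_index(index, query, k=1):
    """Return all indexed words within Hamming distance k of query.

    k must be the same value that was used to build the index.
    """
    candidates = set()
    for i, piece in enumerate(split_pieces(query, k)):
        candidates.update(index.get((len(query), i, piece), []))
    # Candidates share one exact piece; verify the full word.
    return sorted(w for w in candidates if hamming(w, query) <= k)


if __name__ == "__main__":
    idx = build_split_index(["cat", "car", "dog", "cart"], k=1)
    print(query_split_index(idx, "cat", k=1))
```

Only words sharing an exact piece with the query are verified, which is what keeps the candidate set small when few mismatches are allowed.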

arXiv.org e-Print Archive

Crossref

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

Linear pattern matching on sparse suffix trees

Author: Kolpakov Roman
Kucherov Gregory
Starikovskaya Tatiana
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 14/03/2011

Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a sparse suffix tree [KU-96] with appropriately defined suffix links. Assuming, under the standard unit-cost RAM model, that a word can store up to $\log_{\sigma} n$ characters ($\sigma$ being the alphabet size), our index takes $O(n/\log_{\sigma} n)$ space, i.e. the same space as the packed string itself. The resulting pattern matching algorithm runs in time $O(m + r^2 + r \cdot occ)$, where $m$ is the length of the pattern, $r$ is the actual number of characters stored in a word and $occ$ is the number of pattern occurrences.
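As a rough illustration of what packing several characters into one word means, the following Python sketch encodes a string over a small alphabet into fixed-width integers. The function names, the default DNA alphabet and the 64-bit word width are assumptions chosen for the example; the sparse suffix tree index itself is not shown.

```python
def bits_per_char(alphabet):
    """Bits needed to encode one character of the given alphabet."""
    return max(1, (len(alphabet) - 1).bit_length())


def pack(s, alphabet="ACGT", word_bits=64):
    """Pack s into integers holding word_bits // bits_per_char characters each."""
    bits = bits_per_char(alphabet)
    per_word = word_bits // bits
    code = {c: i for i, c in enumerate(alphabet)}
    words = []
    for start in range(0, len(s), per_word):
        w = 0
        for c in s[start:start + per_word]:
            w = (w << bits) | code[c]
        words.append(w)
    return words


def unpack(words, n, alphabet="ACGT", word_bits=64):
    """Inverse of pack; n is the original string length."""
    bits = bits_per_char(alphabet)
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    out = []
    for i, w in enumerate(words):
        count = min(per_word, n - i * per_word)  # last word may be partial
        for j in range(count):
            out.append(alphabet[(w >> (bits * (count - 1 - j))) & mask])
    return "".join(out)


if __name__ == "__main__":
    packed = pack("ACGTACGTAC", word_bits=8)
    print(packed, unpack(packed, 10, word_bits=8))
```

With a 4-letter alphabet and 64-bit words, 32 characters fit in one word, so comparing two words compares 32 characters at once.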

arXiv.org e-Print Archive

CiteSeerX

Crossref

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array

Author: D Okanohara
J Fischer
J Kärkkäinen
J Sirén
JI Munro
JS Vitter
K Sadakane
P Ferragina
R Dementiev
T Beller
T Kasai
U Manber
W Hon
W Szpankowski
Publication date: 01/01/2016

The longest common prefix (LCP) array is a versatile auxiliary data structure in indexed string matching. It can be used to speed up searching using the suffix array (SA) and provides an implicit representation of the topology of an underlying suffix tree. The LCP array of a string of length $n$ can be represented as an array of length $n$ words, or, in the presence of the SA, as a bit vector of $2n$ bits plus asymptotically negligible support data structures. External memory construction algorithms for the LCP array have been proposed, but those proposed so far have a space requirement of $O(n)$ words (i.e. $O(n \log n)$ bits) in external memory. This space requirement is in some practical cases prohibitively expensive. We present an external memory algorithm for constructing the $2n$ bit version of the LCP array which uses $O(n \log \sigma)$ bits of additional space in external memory when given a (compressed) BWT with alphabet size $\sigma$ and a sampled inverse suffix array at sampling rate $O(\log n)$. This is often a significant space gain in practice where $\sigma$ is usually much smaller than $n$ or even constant. We also consider the case of computing succinct LCP arrays for circular strings.
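The $2n$-bit representation mentioned in this abstract can be sketched in memory as follows. The sketch assumes the standard encoding in which the permuted LCP value of text position $i$ is stored by setting bit $2i + \mathrm{PLCP}[i]$; this works because $\mathrm{PLCP}[i+1] \ge \mathrm{PLCP}[i] - 1$, so the set positions strictly increase. The function names are hypothetical, the suffix array is built naively, and nothing here reflects the external memory algorithm of the paper.

```python
def suffix_array(s):
    """Naive suffix array: fine for a demonstration, not for large inputs."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def plcp_array(s, sa):
    """Permuted LCP: plcp[i] = LCP of suffix i with its predecessor in SA order."""
    n = len(s)
    phi = [-1] * n  # phi[i] = start of the suffix preceding suffix i in the SA
    for r in range(1, n):
        phi[sa[r]] = sa[r - 1]
    plcp = [0] * n
    l = 0
    for i in range(n):
        j = phi[i]
        if j == -1:
            l = 0  # lexicographically smallest suffix has no predecessor
        else:
            while i + l < n and j + l < n and s[i + l] == s[j + l]:
                l += 1
        plcp[i] = l
        l = max(l - 1, 0)  # PLCP[i + 1] >= PLCP[i] - 1
    return plcp


def encode_plcp(plcp):
    """Bit vector of length 2n with a 1 at position 2 * i + plcp[i]."""
    bits = [0] * (2 * len(plcp))
    for i, v in enumerate(plcp):
        bits[2 * i + v] = 1
    return bits


def decode_plcp(bits):
    """Recover plcp: the i-th set bit sits at position 2 * i + plcp[i]."""
    ones = [p for p, b in enumerate(bits) if b]
    return [p - 2 * i for i, p in enumerate(ones)]


if __name__ == "__main__":
    text = "banana"
    plcp = plcp_array(text, suffix_array(text))
    print(plcp, encode_plcp(plcp))
```

A practical structure would answer `decode_plcp` lookups with a select query on the bit vector rather than scanning it.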

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Approximate Online Pattern Matching in Sublinear Time

Author: Chakraborty Diptarka
Das Debarati
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 39th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2019)
Publication date: 01/01/2019

Dagstuhl Research Online Publication Server

Ontology-based solutions for interoperability among product lifecycle management systems: A systematic literature review

Author: Fraga Alvaro Luis
Leone Horacio Pascual
Vegetti Maria Marcela
Publication venue: 'Elsevier BV'
Publication date: 01/12/2020

During recent years, globalization has had an impact on the competitive capacity of industries, forcing them to integrate their productive processes with other, geographically distributed, facilities. This requires the information systems that support such processes to interoperate. Significant attention has been paid to the development of ontology-based solutions, which are meant to tackle issues from inconsistency to semantic interoperability and knowledge reusability. This paper looks into how the available technology, models and ontology-based solutions might interact within the manufacturing industry environment to achieve semantic interoperability among industrial information systems. Through a systematic literature review, this paper has aimed to identify the most relevant elements to consider in the development of an ontology-based solution and how these solutions are being deployed in industry. The research analyzed 54 studies in alignment with the specific requirements of our research questions. The most relevant results show that ontology-based solutions can be set up using OWL as the ontology language, Protégé as the ontology modeling tool, Jena as the application programming interface to interact with the built ontology, and different standards from the International Organization for Standardization Technical Committee 184, Subcommittee 4 or 5, to get the foundational concepts, axioms, and relationships to develop the knowledge base. We believe that the findings of this study make an important contribution to practitioners and researchers as they provide useful information about different projects and choices involved in undertaking projects in the field of industrial ontology application.

Affiliations:
Fraga, Alvaro Luis: Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Desarrollo y Diseño. Universidad Tecnológica Nacional. Facultad Regional Santa Fe. Instituto de Desarrollo y Diseño; Argentina
Vegetti, Maria Marcela: Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Desarrollo y Diseño. Universidad Tecnológica Nacional. Facultad Regional Santa Fe. Instituto de Desarrollo y Diseño; Argentina
Leone, Horacio Pascual: Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Desarrollo y Diseño. Universidad Tecnológica Nacional. Facultad Regional Santa Fe. Instituto de Desarrollo y Diseño; Argentina

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

CONICET Digital

A Survey of Combinatorial Methods for Phylogenetic Networks

Author: Bandelt
Baroni
Bokhari
Bordewich
Bryant
Buneman
Casado
Castrucci
Celine Scornavacca
Choy
Clement
Daniel H. Huson
Delwiche
Dessimoz
Disotell
Doolittle
Doyon
Dress
Ebersberger
Excoffier
Grass Phylogeny Working Group
Griffiths
Grünewald
Gusfield
Hein
Holland
Huber
Huson
Jansson
Jin
Kanj
Kivisild
Linder
Maddison
Makarenkov
Morrison
O'Donnell
Parida
Planet
Rieseberg
Rokas
Sneath
Song
Syvanen
Templeton
To
Tofigh
van Iersel
Wagele
Wang
Whidden
Whitfield
Publication venue: Oxford University Press
Publication date: 01/01/2011

The evolutionary history of a set of species is usually described by a rooted phylogenetic tree. Although it is generally undisputed that bifurcating speciation events and descent with modifications are major forces of evolution, there is a growing belief that reticulate events also have a role to play. Phylogenetic networks provide an alternative to phylogenetic trees and may be more suitable for data sets where evolution involves significant amounts of reticulate events, such as hybridization, horizontal gene transfer, or recombination. In this article, we give an introduction to the topic of phylogenetic networks, very briefly describing the fundamental concepts and summarizing some of the most important combinatorial methods that are available for their computation

Faster algorithms for longest common substring

Author: Charalampopoulos P. (Panagiotis)
Kociumaka T. (Tomasz)
Pissis S. (Solon)
Radoszewski J. (Jakub)
Publication date: 01/01/2021

In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an O(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an O(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in O(n log σ / log n) space and read in O(n log σ / log n) time. We show that, in this model, we can compute an LCS in time O(n log σ / √(log n)), which is sublinear in n if σ = 2^{o(√(log n))} (in particular, if σ = O(1)), using optimal space O(n log σ / log n). We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in O(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in O(n log^k n) time for k = O(1) [J. Comput. Biol. 2016]. We show an O(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using O(n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.
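To make the k-mismatch LCS problem statement concrete, here is a minimal quadratic-time Python baseline. It is a simple sliding-window scan over every alignment of S and T, offered only as an assumption-laden illustration of the problem definition; it is not the algorithm of the paper, and the function name `k_mismatch_lcs` is hypothetical.

```python
def k_mismatch_lcs(s, t, k=0):
    """Longest substring of s that occurs in t with at most k mismatches.

    Baseline running in O(len(s) * len(t)) time: for every alignment
    (diagonal) of s against t, slide a window holding at most k mismatches.
    With k = 0 this is the classic longest common substring.
    """
    n, m = len(s), len(t)
    best_len, best_start = 0, 0
    for d in range(-(m - 1), n):
        # Diagonal d pairs s[i0 + x] with t[j0 + x].
        i0, j0 = (d, 0) if d >= 0 else (0, -d)
        length = min(n - i0, m - j0)
        left = 0
        mismatches = 0
        for right in range(length):
            if s[i0 + right] != t[j0 + right]:
                mismatches += 1
            while mismatches > k:
                # Shrink the window from the left until it is valid again.
                if s[i0 + left] != t[j0 + left]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best_len:
                best_len = right - left + 1
                best_start = i0 + left
    return s[best_start:best_start + best_len]


if __name__ == "__main__":
    print(k_mismatch_lcs("banana", "ananas"))
    print(k_mismatch_lcs("abcde", "abxde", k=1))
```

The returned string is taken from S; when several answers of the same length exist, the first one encountered is kept.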

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot

Altered secretion of exosomes by muscle cells: role in ALS pathogenesis

Author: Butler-Brown G
Duddy William
Duguez Stephanie
Gonzales De Aguilar JL
Le Gall Laura
Loeffler JP
Martinat Cecile
Mouly V
Ouandaogo Gisele
Pradat PF
Thorley Matthew
Publication venue: European Academy of Neurology
Publication date: 01/06/2015

Ulster University's Research Portal