15 research outputs found

Fast Searching in Packed Strings

Author: A. Amir
D.E. Knuth
E.W. Myers
G. Navarro
J. Tarhio
K. Fredriksson
R. Baeza-Yates
R.A. Baeza-Yates
R.M. Karp
R.S. Boyer
S. Wu
S.T. Klein
T.A. Welch
V.L. Arlazarov
W. Masek
W. Rytter
Publication date: 01/01/2009

Given strings $P$ and $Q$, the (exact) string matching problem is to find all positions of substrings in $Q$ matching $P$. The classical Knuth-Morris-Pratt algorithm [SIAM J. Comput., 1977] solves the string matching problem in linear time, which is optimal if we can only read one character at a time. However, most strings are stored in a computer in a packed representation with several characters in a single word, giving us the opportunity to read multiple characters simultaneously. In this paper we study the worst-case complexity of string matching on strings given in packed representation. Let $m \leq n$ be the lengths of $P$ and $Q$, respectively, and let $\sigma$ denote the size of the alphabet. On a standard unit-cost word-RAM with logarithmic word size we present an algorithm using time $O\left(\frac{n}{\log_\sigma n} + m + \mathrm{occ}\right)$. Here $\mathrm{occ}$ is the number of occurrences of $P$ in $Q$. For $m = o(n)$ this improves the $O(n)$ bound of the Knuth-Morris-Pratt algorithm. Furthermore, if $m = O(n/\log_\sigma n)$ our algorithm is optimal, since any algorithm must spend at least $\Omega\left(\frac{(n+m)\log \sigma}{\log n} + \mathrm{occ}\right) = \Omega\left(\frac{n}{\log_\sigma n} + \mathrm{occ}\right)$ time to read the input and report all occurrences. The result is obtained by a novel automaton construction based on the Knuth-Morris-Pratt algorithm combined with a new compact representation of subautomata allowing an optimal tabulation-based simulation.

Comment: To appear in Journal of Discrete Algorithms. Special Issue on CPM 200
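
The equality between the two lower-bound expressions in the abstract follows from $m \leq n$ and the change of base $\log_\sigma n = \log n / \log \sigma$. A short derivation, written out here as a reader aid rather than taken from the paper:

```latex
\[
\frac{(n+m)\log\sigma}{\log n} = \frac{n+m}{\log_\sigma n},
\qquad
\frac{n}{\log_\sigma n} \;\le\; \frac{n+m}{\log_\sigma n} \;\le\; \frac{2n}{\log_\sigma n}
\quad (0 \le m \le n),
\]
\[
\text{hence}\quad
\Omega\!\left(\frac{(n+m)\log\sigma}{\log n} + \mathrm{occ}\right)
= \Omega\!\left(\frac{n}{\log_\sigma n} + \mathrm{occ}\right).
\]
```

When $m = O(n/\log_\sigma n)$, the upper bound $O\left(\frac{n}{\log_\sigma n} + m + \mathrm{occ}\right)$ collapses to the same expression, which is why the abstract calls the algorithm optimal in that range.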

arXiv.org e-Print Archive

CiteSeerX

Elsevier - Publisher Connector

Crossref

Online Research Database In Technology

A Knowledge Engineering Approach to Recognizing and Extracting Sequences of Nucleic Acids from Scientific Literature

Author: Crespo del Arco Jose
García Remesal Miguel
Maojo Garcia Victor Manuel
Publication venue: Facultad de Informática (UPM)
Publication date: 01/01/2010

In this paper we present a knowledge engineering approach to automatically recognize and extract genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. These candidates are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences. On this set, we achieved 87.76% precision and 97.70% recall. This method can facilitate different research tasks, including text mining, information extraction, and information retrieval research dealing with large collections of documents containing genetic sequences.
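
As a rough illustration of what a finite-state candidate recognizer might look like, here is a minimal two-state scanner. It is only a sketch under stated assumptions: the alphabet `ACGTU`, the `min_len` threshold, and the function name `candidate_sequences` are hypothetical and are not taken from the paper, whose recognizer is likely more elaborate.

```python
from typing import List

# Assumed nucleotide alphabet (DNA plus RNA); the paper's alphabet may differ.
NUCLEOTIDES = set("ACGTUacgtu")


def candidate_sequences(text: str, min_len: int = 10) -> List[str]:
    """Two-state scanner: either inside a run of nucleotide letters or outside one.

    Returns every maximal run of at least `min_len` letters, upper-cased.
    `min_len` is a hypothetical threshold chosen for illustration.
    """
    candidates: List[str] = []
    current: List[str] = []
    for ch in text + " ":              # trailing sentinel flushes the final run
        if ch in NUCLEOTIDES:          # state INSIDE: extend the current run
            current.append(ch)
        else:                          # state OUTSIDE: close the run, if any
            if len(current) >= min_len:
                candidates.append("".join(current).upper())
            current = []
    return candidates
```

A scanner like this over-generates on purpose; in the approach described above, discarding false positives is left to the later knowledge-based stage.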

Archivo Digital UPM

Bit-parallel search algorithms for long patterns

Author: A. Hume
A.C.-C. Yao
G. Navarro
G. Zhang
H. Peltola
J. Tarhio
K. Fredriksson
L. He
M. Crochemore
M.O. Külekci
R.N. Horspool
T. Lecroq
Publication date: 01/01/2010

Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

Optimal Packed String Matching

Author: Ben-Kiki Oren
Bille Philip
Breslauer Dany
Gasieniec Leszek
Grossi Roberto
Weimann Oren
Publication venue: Schloss Dagstuhl-Leibniz-Zentrum fuer Informati
Publication date: 01/01/2011

In the packed string matching problem, each machine word accommodates α characters, thus an n-character text occupies n/α memory words. We extend the Crochemore-Perrin constant-space O(n)-time string matching algorithm to run in optimal O(n/α) time and even in real-time, achieving a factor α speedup over traditional algorithms that examine each character individually. Our solution can be efficiently implemented, unlike prior theoretical packed string matching work. We adapt the standard RAM model and only use its AC^0 instructions (i.e., no multiplication) plus two specialized AC^0 packed string instructions. The main string-matching instruction is available in commodity processors (i.e., Intel’s SSE4.2 and AVX Advanced String Operations); the other maximal-suffix instruction is only required during pattern preprocessing. In the absence of these two specialized instructions, we propose theoretically-efficient emulation using integer multiplication (not AC^0) and table lookup.
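
To make the packed representation concrete, the sketch below packs a string into words of α characters and compares a short pattern against a text window with a single mask-and-compare. The 2-bit DNA alphabet, the 64-bit word size, and the names `pack` and `find_all` are assumptions made for illustration. This is not the paper's algorithm: it still tries every alignment, so it does not reach O(n/α) time; it only shows that up to α characters can be compared in one word operation.

```python
from typing import List

BITS = 2                      # bits per character (assumed 4-letter DNA alphabet)
WORD = 64                     # assumed machine word size in bits
ALPHA = WORD // BITS          # characters per word, the alpha of the abstract
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(s: str) -> List[int]:
    """Pack s into ceil(len(s) / ALPHA) words, first character in the low bits."""
    words = []
    for start in range(0, len(s), ALPHA):
        w = 0
        for j, ch in enumerate(s[start:start + ALPHA]):
            w |= CODE[ch] << (BITS * j)
        words.append(w)
    return words


def find_all(text: str, pattern: str) -> List[int]:
    """Report all occurrences of a pattern of at most ALPHA characters."""
    m, n = len(pattern), len(text)
    if not 0 < m <= ALPHA:
        raise ValueError("pattern must have between 1 and ALPHA characters")
    t = pack(text) + [0]                  # padding word so windows never run off the end
    p = pack(pattern)[0]
    mask = (1 << (BITS * m)) - 1          # keeps exactly m characters
    hits = []
    for i in range(n - m + 1):
        q, r = divmod(i, ALPHA)
        # Two-word window, shifted so that text position i sits in the low bits.
        window = (t[q] | (t[q + 1] << WORD)) >> (BITS * r)
        if (window & mask) == p:          # one comparison covers m characters
            hits.append(i)
    return hits
```

For example, a 65-character text occupies three words here, and a pattern straddling a word boundary is still matched by the two-word window.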

CiteSeerX

Dagstuhl Research Online Publication Server

Online Research Database In Technology

Transcriptome annotation using tandem SAGE tags

Author: Anthony Boureux
Bertone
Brenner
Carninci
Chen
Cheng
Claverie
Cummins
ENCODE
Eric Rivals
Fabien Pierrat
Florence Ottones
Florence Ruffle
Ge
Horspool
Huttenhofer
Jacques Marti
Johnson
Jorma Tarhio
Jurka
Margulies
Mireille Lejeune
Mockler
Ng
Nielsen
Oscar Pecharromàn Pérez
Piquemal
Quéré
Rinn
Saha
Semon
Shendure
Silva
Tarhio
Thérèse Commes
Velculescu
Virlon
Wheeler
Woelk
Publication venue: Oxford University Press
Publication date: 01/01/2007

Analysis of several million expressed gene signatures (tags) revealed an increasing number of different sequences, largely exceeding that of annotated genes in mammalian genomes. Serial analysis of gene expression (SAGE) can reveal new Poly(A) RNAs transcribed from previously unrecognized chromosomal regions. However, conventional SAGE tags are too short to unambiguously identify unique sites in large genomes. Here, we design a novel strategy with tags anchored on two different restriction sites of cDNAs. New transcripts are then tentatively defined by the two SAGE tags in tandem and by the spanning sequence read on the genome between these tagged sites. Having developed a new algorithm to locate these tag-delimited genomic sequences (TDGS), we first validated its capacity to recognize known genes and its ability to reveal new transcripts with two SAGE libraries built in parallel from a single RNA sample. Our algorithm proves fast enough to apply this strategy at a large scale. We then collected and processed the complete sets of human SAGE tags to predict yet unknown transcripts. A cross-validation with tiling array data shows that 47% of these TDGS overlap transcriptionally active regions. Our method provides a new and complementary approach for complex transcriptome annotation.
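
The idea of a tag-delimited genomic sequence can be sketched naively as follows: find every occurrence of the two tags and report the spans that start at the first tag and end at the second within a maximum length. Everything here is an assumption made for illustration (single strand, exact matching, the `max_span` cutoff, and the names `find_all` and `tag_delimited`); the paper's algorithm is not described in the abstract and is certainly more efficient than this quadratic pairing.

```python
from typing import List, Tuple


def find_all(text: str, tag: str) -> List[int]:
    """All start positions of tag in text (overlapping occurrences included)."""
    hits = []
    i = text.find(tag)
    while i != -1:
        hits.append(i)
        i = text.find(tag, i + 1)
    return hits


def tag_delimited(genome: str, tag5: str, tag3: str,
                  max_span: int) -> List[Tuple[int, int, str]]:
    """Return (start, stop, sequence) for spans running from tag5 through tag3.

    A span qualifies when tag3 begins after tag5 ends and the whole span,
    both tags included, is at most max_span characters long.
    """
    spans = []
    ends = find_all(genome, tag3)
    for s in find_all(genome, tag5):
        for e in ends:
            stop = e + len(tag3)
            if e >= s + len(tag5) and stop - s <= max_span:
                spans.append((s, stop, genome[s:stop]))
    return spans
```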

A method for automatically extracting infectious disease-related primers and probes from the literature

Author: Crespo José
Cuevas Alejandro
de la Calle Guillermo
de la Iglesia Diana
García-Remesal Miguel
Lopez-Alonso Victoria
López-Campos Guillermo
Maojo Víctor
Martin-Sanchez Fernando
Pérez-Rey David
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2010

BACKGROUND: Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to the diagnosis and prescription of infectious diseases. The biological literature is the main information source for empirically validated primer and probe sequences. Therefore, it is becoming increasingly important for researchers to navigate this important information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect the candidate sequences using a set of finite state machine-based recognizers, (3) refine problem sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information. RESULTS: We tested our approach using a test set composed of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The results of the evaluation show that our approach is suitable for automatically extracting DNA sequences, achieving precision/recall rates of 97.98% and 95.77%, respectively. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name. CONCLUSIONS: We believe that the proposed method can facilitate routine tasks for biomedical researchers using molecular methods to diagnose and prescribe different infectious diseases. In addition, the proposed method can be expanded to detect and extract other biological sequences from the literature. 
The extracted information can also be used to readily update available primer/probe databases or to create new databases from scratch. The present work has been funded, in part, by the European Commission through the ACGT integrated project (FP6-2005-IST-026996) and the ACTION-Grid support action (FP7-ICT-2007-2-224176), the Spanish Ministry of Science and Innovation through the OntoMineBase project (ref. TSI2006-13021-C02-01), the ImGraSec project (ref. TIN2007-61768), FIS/AES PS09/00069 and COMBIOMED-RETICS, and the Comunidad de Madrid, Spain.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Springer - Publisher Connector

PubMed Central

REPISALUD

University of Melbourne Institutional Repository

Archivo Digital UPM

Towards optimal packed string matching

Author: Aho
AMD
Apostolico
Arlazarov
Baeza-Yates
Belazzougui
Ben-Kiki
Ben-Nissan
Bille
Boyer
Breslauer
Brodnik
Cole
Commentz-Walter
Crochemore
Czumaj
Césari
Dany Breslauer
Daykin
Duval
Faro
Fich
Fine
Fischer
Fredriksson
Furst
Galil
Goldberg
Gusfield
Gąsieniec
Iliopoulos
Intel
Knuth
Leszek Gąsieniec
Lothaire
Muthukrishnan
Navarro
Oren Ben-Kiki
Oren Weimann
Philip Bille
Roberto Grossi
Rytter
Tarhio
Vishkin
Yao
Publication venue: Elsevier BV
Publication date: 01/01/2014

Dedicated to Professor Gad M. Landau, on the occasion of his 60th birthday.

Keywords: String matching; Word-RAM; Packed strings

In the packed string matching problem, it is assumed that each machine word can accommodate up to α characters, thus an n-character string occupies n/α memory words. The main word-size string-matching instruction wssm is available in contemporary commodity processors. The other word-size maximum-suffix instruction wslm is only required during the pattern pre-processing. Benchmarks show that our solution can be efficiently implemented, unlike some prior theoretical packed string matching work. (b) We also consider the complexity of the packed string matching problem in the classical word-RAM model in the absence of the specialized micro-level instructions wssm and wslm. We propose micro-level algorithms for the theoretically efficient emulation using parallel algorithms techniques to emulate wssm and using the Four-Russians technique to emulate wslm. Surprisingly, our bit-parallel emulation of wssm also leads to a new simplified parallel random access machine string-matching algorithm. As a byproduct to facilitate our results we develop a new algorithm for finding the leftmost (most significant) 1 bits in consecutive non-overlapping blocks of uniform size inside a word. This latter problem is not known to be reducible to finding the rightmost 1, which can be easily solved, since we do not know how to reverse the bits of a word in O(1) time.
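
The asymmetry mentioned at the end of the abstract can be illustrated in a few lines. The rightmost 1 of a word is isolated by the standard `x & -x` trick, whereas the per-block leftmost 1 is shown below only as a straightforward reference loop. The function names, the 64-bit default word size, and the block layout are assumptions for illustration; the loop takes time proportional to the number of blocks and is not the constant-time algorithm the paper develops.

```python
def rightmost_one(x: int) -> int:
    """Isolate the least significant 1 bit of a non-negative integer."""
    return x & -x


def leftmost_ones_per_block(x: int, block: int, word: int = 64) -> int:
    """Keep only the most significant 1 bit inside each block of `block` bits.

    Reference version for checking results: one loop iteration per block,
    so it is NOT constant time.
    """
    out = 0
    block_mask = (1 << block) - 1
    for shift in range(0, word, block):
        chunk = (x >> shift) & block_mask
        if chunk:
            # bit_length() - 1 is the index of the highest set bit in the chunk
            out |= 1 << (shift + chunk.bit_length() - 1)
    return out
```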

CiteSeerX

Crossref

Archivio della Ricerca - Università di Pisa

Online Research Database In Technology

Faster algorithms for longest common substring

Author: Charalampopoulos P. (Panagiotis)
Kociumaka T. (Tomasz)
Pissis S. (Solon)
Radoszewski J. (Jakub)
Publication date: 01/01/2021

In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an O(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an O(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in O(n log σ/log n) space and read in O(n log σ/log n) time. We show that, in this model, we can compute an LCS in time O(n log σ / √{log n}), which is sublinear in n if σ = 2^{o(√{log n})} (in particular, if σ = O(1)), using optimal space O(n log σ/log n). We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in O(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in O(n log^k n) time for k = O(1) [J. Comput. Biol. 2016]. We show an O(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using O(n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.
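
The sublinearity claim for small alphabets can be checked in one line. The following is a reader-side verification rather than part of the paper, and it assumes the stated running time is the bound $n \log \sigma / \sqrt{\log n}$ up to constant factors:

```latex
\[
\sigma = 2^{o(\sqrt{\log n})}
\;\Longrightarrow\;
\log \sigma = o\!\left(\sqrt{\log n}\right)
\;\Longrightarrow\;
\frac{n \log \sigma}{\sqrt{\log n}}
= n \cdot \frac{o\!\left(\sqrt{\log n}\right)}{\sqrt{\log n}}
= o(n).
\]
```

For a constant-size alphabet the bound simplifies further to $n / \sqrt{\log n}$, up to constant factors.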

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot