Indexing labeled sequences

Author: Giraud Mathieu
Rocher Tatiana
Salson Mikaël
Publication venue: 'PeerJ'
Publication date: 01/01/2018

Background: Labels add information to a text, for example functional annotations such as genes on a DNA sequence. V(D)J recombinations are DNA recombinations involving two or three short genes in lymphocytes. Sequencing this short region (500 bp or less) produces labeled sequences and gives insight into the lymphocyte repertoire for onco-hematology or immunology studies. Methods: We present two indexes for a text with non-overlapping labels. They store the text in a Burrows–Wheeler transform (BWT) and a compressed label sequence in a Wavelet Tree. The label sequence is taken in the order of the text (TL-index) or in the order of the BWT (TL-BW-index). Both indexes need space related to the entropy of the labeled text. Results: These indexes allow efficient text–label queries to count and find labeled patterns. The TL-BW-index has an overhead on simple label queries but is very efficient on combined pattern–label queries. We implemented the indexes in C++ and compared them against a baseline solution on pseudo-random as well as on V(D)J labeled texts. Discussion: New indexes such as the ones we propose improve the way we index and query labeled texts such as lymphocyte repertoires for hematological and immunological studies.
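A combined pattern–label count of the kind this abstract describes can be sketched as follows. This is only a toy baseline, not the TL-index or the TL-BW-index: the BWT is built by naive suffix sorting, the label sequence is a plain Python list instead of a Wavelet Tree, and the names `build_index`, `backward_search`, and `count_labeled` are hypothetical. It also assumes that an occurrence "has" a label when its first position carries that label.

```python
def build_index(text, labels):
    """Build a toy BWT index. `labels[i]` is the label of text[i] (or None).

    Assumes '$' does not occur in `text` and sorts before every other symbol.
    """
    t = text + "$"
    # Naive suffix array: fine for a sketch, quadratic or worse on long texts.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bwt = "".join(t[i - 1] for i in sa)  # t[-1] is '$' when i == 0
    # C[c] = number of symbols in t that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(t)):
        C[c] = total
        total += t.count(c)
    return {"sa": sa, "bwt": bwt, "C": C, "labels": labels}


def backward_search(index, pattern):
    """Return the half-open suffix-array range [lo, hi) of a non-empty pattern."""
    bwt, C = index["bwt"], index["C"]
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        # rank(c, i) is computed naively as bwt[:i].count(c).
        lo = C[c] + bwt[:lo].count(c)
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0, 0
    return lo, hi


def count_labeled(index, pattern, label):
    """Count occurrences of `pattern` whose first position carries `label`."""
    lo, hi = backward_search(index, pattern)
    return sum(1 for r in range(lo, hi)
               if index["labels"][index["sa"][r]] == label)


if __name__ == "__main__":
    idx = build_index("ACGTACGA", ["V"] * 4 + ["J"] * 4)
    print(count_labeled(idx, "ACG", "V"), count_labeled(idx, "ACG", "J"))
```

In this sketch the label filter scans every occurrence; the appeal of storing labels in the order of the BWT, as the abstract suggests, is that the occurrences of a pattern form one contiguous range that a rank-capable label structure could count directly.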

Crossref

INRIA a CCSD electronic archive server

Directory of Open Access Journals

HAL Descartes

Hal-Diderot

Regular Languages meet Prefix Sorting

Author: Alanko Jarno
D'Agostino Giovanna
Policriti Alberto
Prezza Nicola
Publication date: 09/07/2019

Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? The main contribution of this paper is to initiate the study of the sub-class of regular languages accepted by an automaton whose states can be prefix-sorted. Starting from the recent notion of Wheeler graph [Gagie et al., TCS 2017], which extends naturally the concept of prefix sorting to labeled graphs, we investigate the properties of Wheeler languages, that is, regular languages admitting an accepting Wheeler finite automaton. Interestingly, we characterize this family as the natural extension of regular languages endowed with the co-lexicographic ordering: when sorted, the strings belonging to a Wheeler language are partitioned into a finite number of co-lexicographic intervals, each formed by elements from a single Myhill-Nerode equivalence class. Moreover: (i) We show that every Wheeler NFA (WNFA) with $n$ states admits an equivalent Wheeler DFA (WDFA) with at most $2n-1-|\Sigma|$ states that can be computed in $O(n^3)$ time. This is in sharp contrast with general NFAs. (ii) We describe a quadratic algorithm to prefix-sort a proper superset of the WDFAs, an $O(n\log n)$-time online algorithm to sort acyclic WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. By contribution (i), our algorithms can also be used to index any WNFA at the moderate price of doubling the automaton's size. (iii) We provide a minimization theorem that characterizes the smallest WDFA recognizing the same language of any input WDFA. The corresponding constructive algorithm runs in optimal linear time in the acyclic case, and in $O(n\log n)$ time in the general case. (iv) We show how to compute the smallest WDFA equivalent to any acyclic DFA in nearly-optimal time.

Comment: added minimization theorems; uploaded submitted version; New version with new results (W-MH theorem, linear determinization), added author: Giovanna D'Agostino

arXiv.org e-Print Archive

Crossref

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

LR characterization of chirotopes of finite planar families of pairwise disjoint convex bodies

Author: A Björner
A Hatcher
A Lascoux
B Sturmfels
D Dekker
DC Kay
FB Kalhoff
G Ringel
GK Francis
H Busemann
H Edelsbrunner
H Hadwiger
HR Salzmann
J Bokowski
J Cantwell
J Folkman
J Goodman
J Groot de
J Stolfi
J-P Doignon
JE Goodman
L Habert
Luc Habert
M Berger
M Drandell
M Haiman
M Novick
Michel Pocchiola
P Edelman
R Cordovil
R Dhandapani
R Stanley
R Wenger
S Felsner
W Holsztynski
WS Massey
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 05/10/2014

We extend the classical LR characterization of chirotopes of finite planar families of points to chirotopes of finite planar families of pairwise disjoint convex bodies: a map $\chi$ on the set of 3-subsets of a finite set $I$ is a chirotope of finite planar families of pairwise disjoint convex bodies if and only if for every 3-, 4-, and 5-subset $J$ of $I$ the restriction of $\chi$ to the set of 3-subsets of $J$ is a chirotope of finite planar families of pairwise disjoint convex bodies. Our main tool is the polarity map, i.e., the map that assigns to a convex body the set of lines missing its interior, from which we derive the key notion of arrangements of double pseudolines, introduced for the first time in this paper.

Comment: 100 pages, 73 figures; accepted manuscript version
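For the classical case of points, which this abstract takes as its starting point, the chirotope can be illustrated with a short sketch: each triple of points is mapped to the sign of its orientation (left turn, right turn, or collinear). The helper names `orientation` and `chirotope` are hypothetical, triples are recorded only for indices in increasing order, and nothing here covers the convex-body case or double pseudoline arrangements.

```python
from itertools import combinations


def orientation(p, q, r):
    """Sign of the turn p -> q -> r: +1 left, -1 right, 0 collinear."""
    det = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (det > 0) - (det < 0)


def chirotope(points):
    """Map each index triple i < j < k to the orientation sign of its points."""
    return {
        (i, j, k): orientation(points[i], points[j], points[k])
        for i, j, k in combinations(range(len(points)), 3)
    }


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (0, 1), (1, 1)]
    for triple, sign in sorted(chirotope(square).items()):
        print(triple, sign)
```

Restricting such a map to the triples of a small subset of indices is just a dictionary filter, which is the operation the local 3-, 4-, and 5-subset condition in the abstract refers to.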

arXiv.org e-Print Archive

Crossref

Suffix Tree of Alignment: An Efficient Index for Similar Data

Author: A. Amir
D. Gusfield
E. Ukkonen
E.M. McCreight
G. Navarro
H.H. Do
J. Ziv
K. Sadakane
M. Crochemore
M. Farach-Colton
P. Bille
R. Grossi
R.A. Baeza-Yates
S. Huang
S. Karlin
S. Kuruppu
V. Levenshtein
V. Mäkinen
Publication date: 01/01/2013

We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings $A$ and $B$ is a compacted trie representing all suffixes in $A$ and $B$. It has $|A|+|B|$ leaves and can be constructed in $O(|A|+|B|)$ time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of $A$ and $B$. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of $A$ and $B$ has $|A| + l_d + l_1$ leaves where $l_d$ is the sum of the lengths of all parts of $B$ different from $A$ and $l_1$ is the sum of the lengths of some common parts of $A$ and $B$. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern $P$ in $O(|P|+occ)$ time where $occ$ is the number of occurrences of $P$ in $A$ and $B$. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires $O(|A| + l_d + l_1 + l_2)$ time where $l_2$ is the sum of the lengths of other common substrings of $A$ and $B$. When the suffix tree of $A$ is already given, it requires $O(l_d + l_1 + l_2)$ time.

Comment: 12 pages
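The baseline that this abstract improves on, an index over all suffixes of both strings, can be sketched with a generalized suffix array in place of a suffix tree. This is a deliberately swapped-in, simpler technique: it stores $|A|+|B|+2$ suffixes, is built by naive sorting, answers queries by binary search rather than in $O(|P|+occ)$ time, and does not exploit the alignment at all. The function names and the separator symbols `#` and `$` are assumptions for illustration.

```python
def build_generalized_suffixes(a, b):
    """Concatenate a and b with separators and sort all suffix start positions.

    Assumes '#' and '$' occur in neither a nor b.
    """
    text = a + "#" + b + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa


def find(text, sa, a_len, pattern):
    """Return (positions in a, positions in b) where `pattern` occurs."""
    m = len(pattern)

    def bound(strict):
        # First rank whose length-m prefix is >= pattern (or > if strict).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    hits = sa[bound(False):bound(True)]
    in_a = sorted(i for i in hits if i < a_len)
    in_b = sorted(i - a_len - 1 for i in hits if i > a_len)
    return in_a, in_b


if __name__ == "__main__":
    a, b = "banana", "bandana"
    text, sa = build_generalized_suffixes(a, b)
    print(find(text, sa, len(a), "ana"))
```

When $A$ and $B$ are nearly identical, most suffixes in this baseline are stored twice in almost the same form, which is the redundancy the suffix tree of alignment is designed to avoid.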

arXiv.org e-Print Archive

CiteSeerX

Crossref

King's Research Portal

The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

Author: Grossi Roberto
Ottaviano Giuseppe
Publication date: 01/01/2012

An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequences of strings, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
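The Wavelet Tree ingredient mentioned in this abstract can be sketched as below. This is a plain, static, pointer-based wavelet tree that treats whole strings as opaque symbols; it is not the Wavelet Trie (there is no Patricia Trie, no prefix query, and no dynamic alphabet), and it uses ordinary Python lists with prefix sums where a real implementation would use succinct bitvectors. The class and method names are assumptions for illustration.

```python
class WaveletTree:
    """Toy wavelet tree supporting access(i) and rank(sym, i)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits = None
        if len(self.alphabet) <= 1:
            return  # leaf: every element here is the same symbol
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        # Bit 0 sends an element to the left child, bit 1 to the right child.
        self.bits = [0 if s in left_syms else 1 for s in seq]
        # ones[i] = number of 1 bits in bits[:i], for constant-time rank.
        self.ones = [0]
        for b in self.bits:
            self.ones.append(self.ones[-1] + b)
        self.left = WaveletTree([s for s in seq if s in left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([s for s in seq if s not in left_syms],
                                 self.alphabet[mid:])

    def rank(self, sym, i):
        """Number of occurrences of sym among the first i elements."""
        if self.bits is None:
            return i if self.alphabet and self.alphabet[0] == sym else 0
        mid = len(self.alphabet) // 2
        if sym in self.alphabet[:mid]:
            return self.left.rank(sym, i - self.ones[i])
        return self.right.rank(sym, self.ones[i])

    def access(self, i):
        """Return the i-th element of the original sequence."""
        if self.bits is None:
            return self.alphabet[0]
        if self.bits[i] == 0:
            return self.left.access(i - self.ones[i])
        return self.right.access(self.ones[i])


if __name__ == "__main__":
    log = ["/home", "/about", "/home", "/cart", "/home"]
    wt = WaveletTree(log)
    print(wt.access(3), wt.rank("/home", 5))
```

Because the alphabet is fixed at construction time, appending a string that was never seen before would require rebuilding this structure, which is the limitation the dynamic alphabet support described in the abstract addresses.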

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

CiNCT: Compression and retrieval for massive vehicular trajectories via relative movement labeling

Author: Ishikawa Yoshiharu
Koide Satoshi
Tadokoro Yukihiro
Xiao Chuan
Publication date: 29/09/2017

In this paper, we present a compressed data structure for moving object trajectories in a road network, which are represented as sequences of road edges. Unlike existing compression methods for trajectories in a network, our method supports pattern matching and decompression from an arbitrary position while retaining high compressibility with theoretical guarantees. Specifically, our method is based on the FM-index, a fast and compact data structure for pattern matching. To enhance the compression, we incorporate the sparsity of road networks into the data structure. In particular, we present the novel concepts of relative movement labeling and PseudoRank, each contributing to significant reductions in data size and query processing time. Our theoretical analysis and experimental studies reveal the advantages of our proposed method as compared to existing trajectory compression methods and FM-index variants.
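The general idea of labeling a move relative to the previous road edge can be illustrated with a toy sketch. The abstract does not spell out the paper's actual labeling rule, so everything here is an assumption: each step is replaced by the index of the next edge within a fixed, ordered successor list of the current edge, and the names `relative_labels`, `decode`, and `successors` are hypothetical. PseudoRank and the FM-index layer are not reproduced.

```python
def relative_labels(trajectory, successors):
    """Encode a trajectory of edge ids as small per-step integers.

    successors[e] is an ordered list of the edges reachable right after e.
    The first edge has no predecessor and is kept as-is.
    """
    labels = [trajectory[0]]
    for prev, cur in zip(trajectory, trajectory[1:]):
        labels.append(successors[prev].index(cur))
    return labels


def decode(labels, successors):
    """Invert relative_labels using the same successor lists."""
    trajectory = [labels[0]]
    for label in labels[1:]:
        trajectory.append(successors[trajectory[-1]][label])
    return trajectory


if __name__ == "__main__":
    successors = {
        "e1": ["e2", "e3"],
        "e2": ["e4"],
        "e3": ["e4", "e5"],
        "e4": ["e1"],
        "e5": [],
    }
    encoded = relative_labels(["e1", "e3", "e5"], successors)
    print(encoded, decode(encoded, successors))
```

Since a road edge typically has only a handful of successors, the labels in this sketch come from a much smaller alphabet than the raw edge ids, which is the kind of sparsity the abstract says is used to improve compression.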

arXiv.org e-Print Archive

Crossref