
Optimal cache-aware suffix selection

Authors: Franceschini Gianni, Grossi Roberto, Muthukrishnan S.
Publication date: 01/01/2009

Given a string $S[1..N]$ and an integer $k$, the suffix selection problem is to determine the $k$th lexicographically smallest amongst the suffixes $S[i..N]$, $1 \leq i \leq N$. We study the suffix selection problem in the cache-aware model that captures the two-level memory inherent in computing systems, for a cache of limited size $M$ and block size $B$. The complexity of interest is the number of block transfers. We present an optimal suffix selection algorithm in the cache-aware model, requiring $\Theta(N/B)$ block transfers, for any string $S$ over an unbounded alphabet (where characters can only be compared), under the common tall-cache assumption (i.e. $M=\Omega(B^{1+\epsilon})$, where $\epsilon<1$). Our algorithm beats the bottleneck bound for permuting an input array to the desired output array, which holds for nearly any nontrivial problem in hierarchical memory models.
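To make the problem statement concrete, here is a minimal Python sketch of suffix selection by brute force. It is not the cache-aware algorithm from the abstract: it simply sorts all suffix start positions, which takes far more than $\Theta(N/B)$ block transfers. The function name `kth_smallest_suffix` is an assumption chosen for illustration.

```python
def kth_smallest_suffix(s, k):
    """Return the 1-based start index i of the k-th lexicographically
    smallest suffix s[i..N]. Naive baseline: sorts every suffix."""
    if not 1 <= k <= len(s):
        raise ValueError("k must satisfy 1 <= k <= len(s)")
    # Sort 0-based start positions by the suffix that begins there.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # Convert back to the 1-based indexing used in the problem statement.
    return order[k - 1] + 1


# The suffixes of "banana" in sorted order start at 6, 4, 2, 1, 5, 3.
print(kth_smallest_suffix("banana", 3))  # prints 2 (the suffix "anana")
```

The sketch only compares characters, matching the unbounded-alphabet setting, but it makes no attempt to control block transfers.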

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server

Archivio della ricerca- Università di Roma La Sapienza

Hal-Diderot

Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

Author: Prezza Nicola
Publication date: 01/01/2020

We consider the problem of encoding a string of length $n$ from an integer alphabet of size $\sigma$ so that access and substring equality queries (that is, determining the equality of any two substrings) can be answered efficiently. Any uniquely-decodable encoding supporting access must take $n\log\sigma + \Theta(\log (n\log\sigma))$ bits. We describe a new data structure matching this lower bound when $\sigma\leq n^{O(1)}$ while supporting both queries in optimal $O(1)$ time. Furthermore, we show that the string can be overwritten in-place with this structure. The redundancy of $\Theta(\log n)$ bits and the constant query time break exponentially a lower bound that is known to hold in the read-only model. Using our new string representation, we obtain the first in-place subquadratic (indeed, even sublinear in some cases) algorithms for several string-processing problems in the restore model: the input string is rewritable and must be restored before the computation terminates. In particular, we describe the first in-place subquadratic Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. With the sole exception of suffix selection, our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first sublinear-time Monte Carlo algorithm for building the sparse suffix tree in compact space. We also show how to derandomize our algorithms using small space. This leads to the first Las Vegas in-place algorithm computing the full LCP array in $O(n\log n)$ time and to the first Las Vegas in-place algorithms solving the sparse suffix sorting and sparse LCP array construction problems in $O(n^{1.5}\sqrt{\log \sigma})$ time. Running times of these Las Vegas algorithms hold in the worst case with high probability.

Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las Vegas algorithm
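As an illustration of what a Monte Carlo substring-equality query looks like, the following Python sketch uses textbook Karp-Rabin prefix fingerprints. It is not the in-place encoding described in the abstract: it stores $O(n)$ extra words rather than $\Theta(\log n)$ bits of redundancy. The class name `SubstringEquality`, the modulus, and the 0-based indexing are all assumptions made for the example.

```python
import random


class SubstringEquality:
    """Karp-Rabin prefix fingerprints (illustrative, not the paper's structure)."""

    MOD = (1 << 61) - 1  # a Mersenne prime, assumed large enough for the sketch

    def __init__(self, text, seed=None):
        rng = random.Random(seed)
        self.base = rng.randrange(256, self.MOD - 1)  # random base
        n = len(text)
        self.prefix = [0] * (n + 1)  # prefix[i] = fingerprint of text[:i]
        self.power = [1] * (n + 1)   # power[i] = base**i mod MOD
        for i, ch in enumerate(text):
            self.prefix[i + 1] = (self.prefix[i] * self.base + ord(ch) + 1) % self.MOD
            self.power[i + 1] = (self.power[i] * self.base) % self.MOD

    def fingerprint(self, i, length):
        """Fingerprint of text[i:i+length] in O(1) time (0-based i)."""
        return (self.prefix[i + length] - self.prefix[i] * self.power[length]) % self.MOD

    def equal(self, i, j, length):
        """True if text[i:i+length] and text[j:j+length] have equal fingerprints.
        Equal substrings always match; unequal ones collide only with tiny probability."""
        return self.fingerprint(i, length) == self.fingerprint(j, length)


q = SubstringEquality("abracadabra", seed=0)
print(q.equal(0, 7, 4))  # "abra" vs "abra"
print(q.equal(0, 1, 3))  # "abr" vs "bra"
```

Each query takes constant time after linear preprocessing. A `True` answer can be wrong with small probability, which is the Monte Carlo behaviour the abstract refers to.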

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Linear-Space Data Structures for Range Mode Query in Arrays

Authors: Durocher Stephane, Morrison Jason
Publication date: 01/01/2011

A mode of a multiset $S$ is an element $a \in S$ of maximum multiplicity; that is, $a$ occurs at least as frequently as any other element in $S$. Given a list $A[1:n]$ of $n$ items, we consider the problem of constructing a data structure that efficiently answers range mode queries on $A$. Each query consists of an input pair of indices $(i, j)$ for which a mode of $A[i:j]$ must be returned. We present an $O(n^{2-2\epsilon})$-space static data structure that supports range mode queries in $O(n^\epsilon)$ time in the worst case, for any fixed $\epsilon \in [0,1/2]$. When $\epsilon = 1/2$, this corresponds to the first linear-space data structure to guarantee $O(\sqrt{n})$ query time. We then describe three additional linear-space data structures that provide $O(k)$, $O(m)$, and $O(|j-i|)$ query time, respectively, where $k$ denotes the number of distinct elements in $A$ and $m$ denotes the frequency of the mode of $A$. Finally, we examine generalizing our data structures to higher dimensions.

Comment: 13 pages, 2 figure
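The query itself can be sketched as a direct scan of the range. The Python example below is a baseline with no preprocessing, not one of the data structures from the abstract. It counts every element of $A[i:j]$ per query, so its cost grows with the range length. The name `range_mode` and the tie-breaking rule (first value to reach the maximum count wins) are assumptions.

```python
from collections import Counter


def range_mode(a, i, j):
    """Return (mode, frequency) of a[i..j], using 1-based inclusive indices
    as in the abstract. Scans the whole range on every query."""
    if not 1 <= i <= j <= len(a):
        raise ValueError("indices must satisfy 1 <= i <= j <= len(a)")
    counts = Counter(a[i - 1:j])
    # max() keeps the first maximal item, so ties go to the value seen first.
    value, freq = max(counts.items(), key=lambda kv: kv[1])
    return value, freq


A = [1, 2, 2, 3, 3, 3, 2]
print(range_mode(A, 4, 6))  # prints (3, 3)
print(range_mode(A, 1, 3))  # prints (2, 2)
```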

arXiv.org e-Print Archive

CiteSeerX

Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification

Authors: Bandyopadhyay Sivaji, Nongmeikapam Kishorjit
Publication venue: Academy and Industry Research Collaboration Center (AIRCC)
Publication date: 10/11/2011

This paper deals with the identification of Multiword Expressions (MWEs) in Manipuri, a highly agglutinative Indian language. Manipuri is listed in the Eighth Schedule of the Indian Constitution. MWEs play an important role in Natural Language Processing (NLP) applications such as Machine Translation, Part of Speech tagging, Information Retrieval, and Question Answering. Feature selection is an important factor in the recognition of Manipuri MWEs using Conditional Random Fields (CRF). The drawbacks of manually selecting appropriate features for running the CRF motivate us to consider a Genetic Algorithm (GA). Using the GA we are able to find the optimal features to run the CRF. We ran fifty generations of feature selection, with three-fold cross-validation as the fitness function. This model achieved a Recall (R) of 64.08%, Precision (P) of 86.84% and F-measure (F) of 73.74%, showing an improvement over CRF-based Manipuri MWE identification without the GA.

Comment: 14 pages, 6 figures, see http://airccse.org/journal/jcsit/1011csit05.pd
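The reported F-measure can be checked against the reported precision and recall. The Python sketch below assumes the balanced F-measure (the harmonic mean of P and R); the abstract does not say which variant was used. The function name `f_measure` is chosen for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f = f_measure(0.8684, 0.6408)  # P and R as reported in the abstract
print(f"{100 * f:.2f}%")       # prints 73.74%
```

Under that assumption the three reported figures are mutually consistent.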

arXiv.org e-Print Archive

Crossref

Authorship attribution in Portuguese using character N-grams

Authors: Baptista Jorge, Markov Ilia, Pichardo-Lagunas Obdulia
Publication venue: Obuda University
Publication date: 01/01/2017

For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus.

Funding: Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)
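For readers unfamiliar with the feature type, a character n-gram profile can be sketched in a few lines of Python. This is only an untyped extraction. The n-gram categories, feature representations and classifiers examined in the paper are not reproduced here, and the function name `char_ngrams` is an assumption.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count every contiguous character n-gram in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A text shorter than n yields an empty profile.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


profile = char_ngrams("banana", 2)
print(profile["an"], profile["na"], profile["ba"])  # prints 2 2 1
```

Counts like these could then be fed to a classifier as a frequency vector, with the value of n treated as a tunable parameter.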

Crossref

Sapientia

Proxy Caching for Video-on-Demand Using Flexible Starting Point Selection

Authors: Li Xiaoling, Muhammad Muhammad, Steinbach Eckehard, Tu Wei
Publication venue: IEEE - Institute of Electrical and Electronics Engineers
Publication date: 01/01/2009

Institute of Transport Research:Publications