
4 research outputs found

String Indexing for Top-$k$ Close Consecutive Occurrences

Author: Bille Philip; Gørtz Inge Li; Pedersen Max Rishøj; Rotenberg Eva; Steiner Teresa Anna
Publication date: 29/09/2020

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give two time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\epsilon\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$, and our second result achieves linear space and query time $O(m+k^{1+\epsilon})$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.

Comment: Fixed typos, minor change
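
To make the problem statement in the abstract above concrete, here is a minimal Python sketch of a naive baseline for SITCCO. It is not the data structure from the paper: it rescans $S$ for every query, so it does not come close to the $O(m+k)$ or $O(m+k^{1+\epsilon})$ query bounds. The function names `find_occurrences` and `top_k_close_consecutive` are hypothetical, positions are assumed to be 0-indexed, and ties in distance are broken by position in $S$, which the abstract does not specify.

```python
import heapq


def find_occurrences(s: str, p: str) -> list[int]:
    """Return all start positions of p in s (overlaps allowed), ascending."""
    if not p:
        raise ValueError("pattern must be non-empty")
    positions = []
    start = s.find(p)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping occurrences are found too.
        start = s.find(p, start + 1)
    return positions


def top_k_close_consecutive(s: str, p: str, k: int) -> list[tuple[int, int]]:
    """Naive baseline: the k consecutive occurrences (i, j) of p in s
    with the smallest distance j - i, in order of increasing distance."""
    occ = find_occurrences(s, p)
    # Neighbouring entries of the sorted occurrence list are exactly the
    # consecutive occurrences: no occurrence of p lies between them.
    pairs = zip(occ, occ[1:])
    return heapq.nsmallest(k, pairs, key=lambda ij: ij[1] - ij[0])
```

For example, in `"abababcab"` the pattern `"ab"` occurs at positions 0, 2, 4 and 7, so the consecutive occurrences are (0, 2), (2, 4) and (4, 7) with distances 2, 2 and 3.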

arXiv.org e-Print Archive

Online Research Database In Technology

Efficient index for retrieving top-k most frequent documents

Author: Boyer; Cormen; Knuth; Manish Patil; McCreight; Rahul Shah; Shih-Bin Wu; Weisstein; Willard; Wing-Kai Hon; Witten
Publication venue: Elsevier BV

Crossref

Efficient Indexing for Structured and Unstructured Data

Author: Patil Manish Madhukar
Publication venue: LSU Digital Commons
Publication date: 01/01/2014

The collection of digital data is growing at an exponential rate. Data originates from a wide range of sources, such as text feeds, biological sequencers, internet traffic over routers, sensors, and many others. To mine intelligent information from these sources, users have to query the data. Indexing techniques aim to reduce query time by preprocessing the data. The diversity of data sources in the real world makes it imperative to develop application-specific indexing solutions based on the data to be queried. Data can be structured (i.e., relational tables) or unstructured (i.e., free text). Moreover, an increasing number of applications need to analyze both kinds of data seamlessly, making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data, and so on. Probabilistic models have recently been proposed for this purpose. These models are also useful for applications where uncertainty is inherent in the data, e.g., sensor networks. This dissertation proposes efficient indexing solutions for several problems that lie at the intersection of databases and information retrieval, such as joining ranked inputs and full-text document search. Other well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, the worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated by thorough experimentation.

Louisiana State University