Succinct Indexable Dictionaries with Applications to Encoding $k$-ary Trees, Prefix Sums and Multisets
We consider the {\it indexable dictionary} problem, which consists of storing
a set $S \subseteq \{0,\ldots,m-1\}$ for some integer $m$, while supporting the
operations of $\Rank(x)$, which returns the number of elements in $S$ that are
less than $x$ if $x \in S$, and -1 otherwise; and $\Select(i)$ which returns
the $i$-th smallest element in $S$. We give a data structure that supports both
operations in O(1) time on the RAM model and requires ${\cal B}(n,m) + o(n) +
O(\lg \lg m)$ bits to store a set of size $n$, where ${\cal B}(n,m) = \ceil{\lg
{m \choose n}}$ is the minimum number of bits required to store any $n$-element
subset from a universe of size $m$. Previous dictionaries taking this space
only supported (yes/no) membership queries in O(1) time. In the cell probe
model we can remove the $O(\lg \lg m)$ additive term in the space bound,
answering a question raised by Fich and Miltersen, and Pagh.
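As a point of reference, the rank/select interface described above can be mimicked (without any of the space guarantees) by a plain sorted list. The sketch below follows the abstract's semantics; the 1-based indexing for select is an assumed convention, since the abstract does not fix one:

```python
from bisect import bisect_left

class IndexableDictionary:
    """Illustrative, non-succinct model of the rank/select interface;
    the paper supports the same operations in B(n,m) + o(n) + O(lg lg m)
    bits, which this sorted-list stand-in makes no attempt to match."""

    def __init__(self, elements):
        self.s = sorted(set(elements))  # the set S, kept in sorted order

    def rank(self, x):
        # Number of elements of S less than x if x is in S, else -1.
        i = bisect_left(self.s, x)
        return i if i < len(self.s) and self.s[i] == x else -1

    def select(self, i):
        # The i-th smallest element of S (1-based; an assumption).
        return self.s[i - 1]
```

For the set {3, 7, 12}, rank(7) is 1 (only 3 is smaller), rank(5) is -1 (5 is not in the set), and select(3) is 12.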
We present extensions and applications of our indexable dictionary data
structure, including:
An information-theoretically optimal representation of a $k$-ary cardinal
tree that supports standard operations in constant time,
A representation of a multiset of size $n$ from $\{0,\ldots,m-1\}$ in
${\cal B}(n,m+n) + o(n)$ bits that supports (appropriate generalizations of) \Rank
and \Select operations in constant time, and
A representation of a sequence of $n$ non-negative integers summing up to $m$
in ${\cal B}(n,m+n) + o(n)$ bits that supports prefix sum queries in constant
time.
Comment: Final version of SODA 2002 paper; supersedes Leicester Tech report
2002/1
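One classical way to see the connection between the last item and indexable dictionaries is to store the partial sums of the sequence: a prefix-sum query then reduces to a select on the stored sums. The sketch below illustrates only this reduction, not the paper's succinct encoding:

```python
from itertools import accumulate

def make_prefix_sum(seq):
    """Answer prefix-sum queries over a sequence of non-negative
    integers.  The partial sums s_1 <= ... <= s_n form a multiset over
    {0,...,m}; the paper stores them succinctly, while this stand-in
    just keeps them in a Python list."""
    sums = list(accumulate(seq))

    def query(i):
        # Sum of the first i elements (1-based), i.e. a select(i)
        # query on the stored partial sums.
        return sums[i - 1]

    return query
```

For the sequence [2, 0, 5, 1], the stored partial sums are [2, 2, 7, 8], so a prefix-sum query for i = 3 returns 7.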
The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space
An indexed sequence of strings is a data structure for storing a string
sequence that supports random access, searching, range counting and analytics
operations, both for exact matches and prefix search. String sequences lie at
the core of column-oriented databases, log processing, and other storage and
query tasks. In these applications each string can appear several times and the
order of the strings in the sequence is relevant. The prefix structure of the
strings is relevant as well: common prefixes are sought in strings to extract
interesting features from the sequence. Moreover, space-efficiency is highly
desirable as it translates directly into higher performance, since more data
can fit in fast memory.
We introduce and study the problem of compressed indexed sequence of strings,
representing indexed sequences of strings in nearly-optimal compressed space,
both in the static and dynamic settings, while preserving provably good
performance for the supported operations.
We present a new data structure for this problem, the Wavelet Trie, which
combines the classical Patricia Trie with the Wavelet Tree, a succinct data
structure for storing a compressed sequence. The resulting Wavelet Trie
smoothly adapts to a sequence of strings that changes over time. It improves on
the state-of-the-art compressed data structures by supporting a dynamic
alphabet (i.e. the set of distinct strings) and prefix queries, both crucial
requirements in the aforementioned applications, and on traditional indexes by
reducing space occupancy to close to the entropy of the sequence.
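The operations an indexed sequence of strings must support can be made concrete with a naive stand-in. The class below answers each query in linear time where the Wavelet Trie answers it in compressed space with provable bounds; the operation names are illustrative, not the paper's API:

```python
class IndexedStringSequence:
    """Naive model of an indexed sequence of strings: random access,
    prefix-range counting, and positional search for exact matches."""

    def __init__(self, strings):
        self.seq = list(strings)  # order and duplicates are preserved

    def access(self, i):
        # The string at position i (0-based).
        return self.seq[i]

    def rank_prefix(self, p, i):
        # Count positions j < i whose string starts with prefix p.
        return sum(1 for s in self.seq[:i] if s.startswith(p))

    def select_exact(self, s, k):
        # Position of the k-th occurrence (1-based) of string s, or -1.
        seen = 0
        for j, t in enumerate(self.seq):
            if t == s:
                seen += 1
                if seen == k:
                    return j
        return -1
```

On the sequence ["ab", "ac", "ab", "b"], rank_prefix("a", 3) is 3 (all of the first three strings share the prefix "a") and select_exact("ab", 2) is 2.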
Dynamic Integer Sets with Optimal Rank, Select, and Predecessor Search
We present a data structure representing a dynamic set S of w-bit integers on
a w-bit word RAM. With |S|=n and w > log n and space O(n), we support the
following standard operations in O(log n / log w) time:
- insert(x) sets S = S + {x}.
- delete(x) sets S = S - {x}.
- predecessor(x) returns max{y in S | y < x}.
- successor(x) returns min{y in S | y >= x}.
- rank(x) returns #{y in S | y < x}.
- select(i) returns y in S with rank(y)=i, if any.
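The semantics of these six operations can be pinned down with a sorted-list stand-in. The sketch below is correct but pays O(n) per update and O(log n) per query, far from the paper's O(log n / log w) word-RAM bound; it is meant only to fix the interface:

```python
from bisect import bisect_left

class DynamicIntegerSet:
    """Sorted-list model of the dynamic set interface above; the paper
    achieves O(log n / log w) time per operation on a w-bit word RAM."""

    def __init__(self):
        self.s = []  # the set S, kept sorted and duplicate-free

    def insert(self, x):
        i = bisect_left(self.s, x)
        if i == len(self.s) or self.s[i] != x:
            self.s.insert(i, x)

    def delete(self, x):
        i = bisect_left(self.s, x)
        if i < len(self.s) and self.s[i] == x:
            self.s.pop(i)

    def predecessor(self, x):
        # max{y in S | y < x}, or None if no such y exists.
        i = bisect_left(self.s, x)
        return self.s[i - 1] if i > 0 else None

    def successor(self, x):
        # min{y in S | y >= x}, or None if no such y exists.
        i = bisect_left(self.s, x)
        return self.s[i] if i < len(self.s) else None

    def rank(self, x):
        # #{y in S | y < x}.
        return bisect_left(self.s, x)

    def select(self, i):
        # y in S with rank(y) = i, or None if out of range.
        return self.s[i] if 0 <= i < len(self.s) else None
```

With S = {2, 5, 9}: rank(5) is 1, select(2) is 9, predecessor(5) is 2, and successor(6) is 9.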
Our O(log n/log w) bound is optimal for dynamic rank and select, matching a
lower bound of Fredman and Saks [STOC'89]. When the word length is large, our
time bound is also optimal for dynamic predecessor, matching a static lower
bound of Beame and Fich [STOC'99] whenever log n/log w=O(log w/loglog w).
Technically, the most interesting aspect of our data structure is that it
supports all the above operations in constant time for sets of size n=w^{O(1)}.
This resolves a main open problem of Ajtai, Komlos, and Fredman [FOCS'83].
Ajtai et al. presented such a data structure in Yao's abstract cell-probe model
with w-bit cells/words, but pointed out that the functions used could not be
implemented. As a partial solution to the problem, Fredman and Willard
[STOC'90] introduced a fusion node that could handle queries in constant time,
but used polynomial time on the updates. We call our small set data structure a
dynamic fusion node as it does both queries and updates in constant time.
Comment: Presented with different formatting in Proceedings of the 55th IEEE
Symposium on Foundations of Computer Science (FOCS), 2014, pp. 166--175. The
new version fixes a bug in one of the bounds stated for predecessor search,
pointed out to me by Djamal Belazzougui.
Applying digital content management to support localisation
The retrieval and presentation of digital content such as that on the World Wide Web (WWW) is a substantial area of research. While recent years have seen huge expansion in the size of web-based archives that can be searched efficiently by commercial search engines, the presentation of potentially relevant content is still limited to ranked document lists represented by simple text snippets or image keyframe surrogates. There is expanding interest in techniques to personalise the presentation of content to improve the richness and effectiveness of the user experience. One of the most significant challenges to achieving this is the increasingly multilingual nature of this data, and the need to provide suitably localised responses to users based on this content. The Digital Content Management (DCM) track of the Centre for Next Generation Localisation (CNGL) is seeking to develop technologies to support advanced personalised access and presentation of information by combining elements from the existing research areas of Adaptive Hypermedia and Information Retrieval. The combination of these technologies is intended to produce significant improvements in the way users access information. We review key features of these technologies and introduce early ideas for how these technologies can support localisation and localised content before concluding with some impressions of future directions in DCM.
Automatic Extraction of Subcategorization from Corpora
We describe a novel technique and implemented system for constructing a
subcategorization dictionary from textual corpora. Each dictionary entry
encodes the relative frequency of occurrence of a comprehensive set of
subcategorization classes for English. An initial experiment, on a sample of 14
verbs which exhibit multiple complementation patterns, demonstrates that the
technique achieves accuracy comparable to previous approaches, which are all
limited to a highly restricted set of subcategorization classes. We also
demonstrate that a subcategorization dictionary built with the system improves
the accuracy of a parser by an appreciable amount.
Comment: 8 pages; requires aclap.sty. To appear in ANLP-9