Streaming Similarity Self-Join
We introduce and study the problem of computing the similarity self-join in a
streaming context (SSSJ), where the input is an unbounded stream of items
arriving continuously. The goal is to find all pairs of items in the stream
whose similarity is greater than a given threshold. The simplest formulation of
the problem requires unbounded memory, and thus, it is intractable. To make the
problem feasible, we introduce the notion of time-dependent similarity: the
similarity of two items decreases with the difference in their arrival time. By
leveraging the properties of this time-dependent similarity function, we design
two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch
(MB), uses existing index-based filtering techniques for the static version of
the problem, and combines them in a pipeline. The second framework, Streaming
(STR), adds time filtering to the existing indexes, and integrates new
time-based bounds deeply in the working of the algorithms. We also introduce a
new indexing technique (L2), which is based on an existing state-of-the-art
indexing technique (L2AP), but is optimized for the streaming case. Extensive
experiments show that the STR algorithm, when instantiated with the L2 index,
is the most scalable option across a wide array of datasets and parameters.
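The role of time-dependent similarity in bounding memory can be illustrated with a minimal sketch. It assumes unit-length vectors, cosine similarity, and an exponential decay factor `exp(-decay * |Δt|)`; the abstract does not state the exact decay function, so that form, and the names `time_dependent_similarity`, `streaming_self_join`, and `decay`, are assumptions made for illustration. The sketch is a brute-force sliding window, not the MB or STR framework and not the L2 index.

```python
import math
from collections import deque

def time_dependent_similarity(x, y, tx, ty, decay):
    """Dot product of unit vectors x and y, damped by exp(-decay * |tx - ty|).

    The exponential form is an assumption; the abstract only says that
    similarity decreases with the difference in arrival time.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return dot * math.exp(-decay * abs(tx - ty))

def streaming_self_join(stream, threshold, decay):
    """Yield (older_id, newer_id, sim) for pairs with sim above the threshold.

    `stream` yields (item_id, timestamp, unit_vector) in non-decreasing
    timestamp order.
    """
    # For unit vectors the dot product is at most 1, so once the time gap
    # exceeds this horizon the damped similarity is below the threshold.
    horizon = math.log(1.0 / threshold) / decay
    window = deque()
    for item_id, t, vec in stream:
        # Evict items that are too old to match anything from now on.
        while window and t - window[0][1] > horizon:
            window.popleft()
        for other_id, other_t, other_vec in window:
            sim = time_dependent_similarity(vec, other_vec, t, other_t, decay)
            if sim > threshold:
                yield (other_id, item_id, sim)
        window.append((item_id, t, vec))
```

Under these assumptions, the window only ever holds items from the last `ln(1/threshold) / decay` time units, which is what makes the streaming problem tractable in this sketch.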
Indexing the Event Calculus with Kd-trees to Monitor Diabetes
Personal Health Systems (PHS) are mobile solutions tailored to monitoring
patients affected by chronic non-communicable diseases. A patient affected by a
chronic disease can generate large amounts of events. Type 1 Diabetic patients
generate several glucose events per day, ranging from at least 6 events per day
(under normal monitoring) to 288 per day when wearing a continuous glucose
monitor (CGM) that samples the blood every 5 minutes for several days. This is
a large number of events to monitor for medical doctors, in particular when
considering that they may have to take decisions concerning adjusting the
treatment, which may impact the life of the patients for a long time. Given the
need to analyse such a large stream of data, doctors need a simple approach
towards physiological time series that allows them to promptly transfer their
knowledge into queries to identify interesting patterns in the data. Achieving
this with current technology is not an easy task, as on one hand it cannot be
expected that medical doctors have the technical knowledge to query databases
and on the other hand these time series include thousands of events, which
requires rethinking the way data is indexed. To tackle the knowledge
representation and efficiency problems, this contribution presents the kd-tree
cached event calculus (CECKD), an event calculus extension for knowledge
engineering of temporal rules capable of handling the many thousands of events
produced by a diabetic patient. CECKD is built to support a graphical interface
for representing monitoring rules for type 1 diabetes. In addition, the paper
evaluates CECKD with respect to the cached event calculus (CEC) to show
how indexing events using kd-trees improves scalability with respect to the
current state of the art.
Comment: 24 pages, preliminary results calculated on an implementation of
CECKD, precursor to a journal paper being submitted in 2017, with further
indexing and results possibilities, put here for reference and chronological
purposes to remember how the idea evolved
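To illustrate why a kd-tree can help with queries over many events, here is a minimal sketch of a kd-tree with an axis-aligned range query. The abstract does not say which dimensions CECKD indexes or how it integrates with the event calculus, so the `(minute, glucose)` event layout and the names `build_kdtree` and `range_query` are hypothetical.

```python
def build_kdtree(points, depth=0):
    """Build a kd-tree over equal-length tuples; returns nested dicts or None."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point splits the space
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def range_query(node, low, high, found=None):
    """Collect every point p with low[i] <= p[i] <= high[i] in all dimensions."""
    if found is None:
        found = []
    if node is None:
        return found
    point, axis = node["point"], node["axis"]
    if all(l <= c <= h for l, c, h in zip(low, point, high)):
        found.append(point)
    # Descend only into subtrees that can overlap the query box.
    if low[axis] <= point[axis]:
        range_query(node["left"], low, high, found)
    if point[axis] <= high[axis]:
        range_query(node["right"], low, high, found)
    return found

# Hypothetical events: (minute of day, glucose reading).
events = [(0, 5.5), (5, 6.1), (10, 12.3), (15, 13.0), (20, 7.2)]
tree = build_kdtree(events)
high_readings = range_query(tree, (0, 10.0), (20, float("inf")))
```

A query such as "readings of at least 10.0 in the first 20 minutes" then prunes whole subtrees instead of scanning every event.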
Pattern Matching in Multiple Streams
We investigate the problem of deterministic pattern matching in multiple
streams. In this model, one symbol arrives at a time and is associated with one
of s streaming texts. The task at each time step is to report if there is a new
match between a fixed pattern of length m and a newly updated stream. As is
usual in the streaming context, the goal is to use as little space as possible
while still reporting matches quickly. We give almost matching upper and lower
space bounds for three distinct pattern matching problems. For exact matching
we show that the problem can be solved in constant time per arriving symbol and
O(m+s) words of space. For the k-mismatch and k-difference problems we give
O(k) time solutions that require O(m+ks) words of space. In all three cases we
also give space lower bounds which show our methods are optimal up to a single
logarithmic factor. Finally we set out a number of open problems related to
this new model for pattern matching.
Comment: 13 pages, 1 figure
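The multiple-stream model for exact matching can be illustrated with a classic Knuth-Morris-Pratt automaton shared across streams. This is not the algorithm from the paper: KMP gives amortized, not worst-case, constant time per symbol. It does keep one failure table of size m plus one integer of state per stream, which is in the spirit of the O(m+s) space bound. The class name `MultiStreamMatcher` and its `feed` method are hypothetical.

```python
def build_failure(pattern):
    """KMP failure table: fail[i] is the longest proper border of pattern[:i+1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

class MultiStreamMatcher:
    """Match one fixed pattern against several interleaved streams."""

    def __init__(self, pattern, num_streams):
        if not pattern:
            raise ValueError("pattern must be non-empty")
        self.pattern = pattern
        self.fail = build_failure(pattern)   # shared by all streams
        self.state = [0] * num_streams       # one matched-prefix length per stream

    def feed(self, stream_id, symbol):
        """Append symbol to the given stream; return True if a match ends here."""
        k = self.state[stream_id]
        if k == len(self.pattern):
            # A match ended at the previous symbol; fall back to its border.
            k = self.fail[k - 1]
        while k and symbol != self.pattern[k]:
            k = self.fail[k - 1]
        if symbol == self.pattern[k]:
            k += 1
        self.state[stream_id] = k
        return k == len(self.pattern)
```

Each arriving symbol touches only the state of its own stream, so streams can be interleaved arbitrarily.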
On Optimally Partitioning Variable-Byte Codes
The ubiquitous Variable-Byte encoding is one of the fastest compressed
representations for integer sequences. However, its compression ratio is usually
not competitive with other more sophisticated encoders, especially when the
integers to be compressed are small, which is the typical case for inverted
indexes. This paper shows that the compression ratio of Variable-Byte can be
improved by 2x by adopting a partitioned representation of the inverted lists.
This makes Variable-Byte surprisingly competitive in space with the best
bit-aligned encoders, hence disproving the folklore belief that Variable-Byte
is space-inefficient for inverted index compression. Despite the significant
space savings, we show that our optimization almost comes for free, given that:
we introduce an optimal partitioning algorithm that does not affect indexing
time because of its linear-time complexity; we show that the query processing
speed of Variable-Byte is preserved, with an extensive experimental analysis
and comparison with several other state-of-the-art encoders.
Comment: Published in IEEE Transactions on Knowledge and Data Engineering
(TKDE), 15 April 201
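For readers unfamiliar with plain Variable-Byte, a minimal sketch of one common convention follows: 7 payload bits per byte, least significant group first, with the high bit set when more bytes follow. Other byte orders and flag conventions exist, and the paper's partitioned representation and optimal partitioning algorithm are not shown here. The function names `vbyte_encode` and `vbyte_decode` are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order group first."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("Variable-Byte encodes non-negative integers only")
        while n >= 128:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        out.append(n)                      # high bit clear: last byte of this integer
    return bytes(out)

def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode back into a list of integers."""
    numbers = []
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                     # continuation: next group is 7 bits higher
        else:
            numbers.append(value)
            value = 0
            shift = 0
    return numbers
```

Every integer costs at least one full byte under this scheme, even a gap of 1, which is one way to see why the plain encoding is not space-competitive when the integers are small.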