Efficient Multi-way Theta-Join Processing Using MapReduce
Multi-way Theta-join queries are powerful in describing complex relations and
are therefore widely employed in practice. However, existing solutions from
traditional distributed and parallel databases for multi-way Theta-join queries
cannot be easily extended to fit a shared-nothing distributed computing
paradigm, which is proven to be able to support OLAP applications over immense
data volumes. In this work, we study the problem of efficient processing of
multi-way Theta-join queries using MapReduce from a cost-effective perspective.
Although there have been some works using the (key,value) pair-based
programming model to support join operations, efficient processing of multi-way
Theta-join queries has never been fully explored. The substantial challenge
lies in, given a number of processing units (that can run Map or Reduce tasks),
mapping a multi-way Theta-join query to a number of MapReduce jobs and having
them executed in a well scheduled sequence, such that the total processing time
span is minimized. Our solution mainly includes two parts: 1) cost metrics for
both a single MapReduce job and a number of MapReduce jobs executed in a certain
order; 2) the efficient execution of a chain-typed Theta-join with only one
MapReduce job. Compared with the query evaluation strategy proposed in [23]
and the widely adopted Pig Latin and Hive SQL solutions, our method achieves
significant improvement of the join processing efficiency. Comment: VLDB201
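To make the idea of evaluating a chain-typed Theta-join in a single map/reduce round concrete, here is a minimal single-process sketch. It is not the algorithm from the abstract above: it assumes a generic grid-partitioning scheme in which each relation owns one axis of a k x k x k grid of reducers, and the function name, the round-robin partitioning, and the parameter `k` are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def chain_theta_join_one_round(R, S, T, theta_rs, theta_st, k=2):
    """Simulate R JOIN S JOIN T in one map/reduce round on a k*k*k reducer grid.

    Assumption (not from the abstract): tuples are partitioned round-robin
    along the axis owned by their relation and replicated along the other two.
    """
    buckets = defaultdict(lambda: ([], [], []))

    # Map phase: every tuple is sent to all reducers sharing its coordinate.
    for idx, r in enumerate(R):
        for j, l in product(range(k), repeat=2):
            buckets[(idx % k, j, l)][0].append(r)
    for idx, s in enumerate(S):
        for i, l in product(range(k), repeat=2):
            buckets[(i, idx % k, l)][1].append(s)
    for idx, t in enumerate(T):
        for i, j in product(range(k), repeat=2):
            buckets[(i, j, idx % k)][2].append(t)

    # Reduce phase: each reducer joins its local fragments with the theta
    # conditions. Any (r, s, t) triple meets at exactly one reducer, so the
    # union of the reducer outputs contains no duplicates.
    out = []
    for rs, ss, ts in buckets.values():
        for r in rs:
            for s in ss:
                if not theta_rs(r, s):
                    continue
                for t in ts:
                    if theta_st(s, t):
                        out.append((r, s, t))
    return out


if __name__ == "__main__":
    result = chain_theta_join_one_round(
        [1, 5], [3, 7], [4, 8],
        theta_rs=lambda r, s: r < s,
        theta_st=lambda s, t: s < t,
    )
    print(sorted(result))
```

The replication along two axes is what a cost model would have to weigh against the savings of avoiding a second job; a larger `k` spreads the reduce work but multiplies the data shipped in the map phase.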
Scenario-Based Query Processing for Video-Surveillance Archives
Cataloged from PDF version of article. Automated video surveillance has emerged as a trendy application domain in recent years, and accessing the semantic content of surveillance video has become a challenging research area. The results of a considerable amount of research dealing with automated access to video surveillance have appeared in the literature; however, significant semantic gaps in event models and content-based access to surveillance video remain. In this paper, we propose a scenario-based query-processing system for video surveillance archives. In our system, a scenario is specified as a sequence of event predicates that can be enriched with object-based low-level features and directional predicates. We introduce an inverted tracking scheme, which effectively tracks the moving objects and enables view-based addressing of the scene. Our query-processing system also supports inverse querying and view-based querying, for after-the-fact activity analysis. We propose a specific surveillance query language to express the supported query types in a scenario-based manner. We also present a visual query-specification interface devised to facilitate the query-specification process. We have conducted performance experiments to show that our query-processing technique has a high expressive power and satisfactory retrieval accuracy in video surveillance. (C) 2009 Elsevier Ltd. All rights reserved
Div-BLAST: Diversification of sequence search results
Sequence similarity tools, such as BLAST, seek sequences most similar to a query from a database of
sequences. They return results significantly similar to the query sequence and that are typically highly
similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach,
where the initial results guide the user to new searches. However, diversity has not yet been considered an
integral component of sequence search tools for this discipline. Some redundancy can be avoided by
introducing non-redundancy during database construction, but it is not feasible to dynamically set a level
of non-redundancy tailored to a query sequence. We introduce the problem of diverse search and browsing
in sequence databases that produce non-redundant results optimized for any given query. We define
diversity measures for sequences and propose methods to obtain diverse results extracted from current
sequence similarity search tools. We also propose a new measure to evaluate the diversity of a set of
sequences that is returned as a result of a sequence similarity query. We evaluate the effectiveness of the
proposed methods in post-processing BLAST and PSI-BLAST results. We also assess the functional
diversity of the returned results based on available Gene Ontology annotations. Additionally, we include a
comparison with a current redundancy elimination tool, CD-HIT. Our experiments show that the proposed
methods are able to achieve more diverse yet significant result sets compared to static non-redundancy
approaches. In both sequence-based and functional diversity evaluation, the proposed diversification
methods significantly outperform original BLAST results and other baselines. A web-based tool
implementing the proposed methods, Div-BLAST, can be accessed at cedar.cs.bilkent.edu.tr/Div-BLAS
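As an illustration of post-processing a ranked hit list for diversity, the sketch below applies a generic greedy trade-off between relevance and redundancy. It is not the Div-BLAST method itself: the k-mer Jaccard similarity, the `trade_off` weight, and the assumption that hit scores are already normalized to [0, 1] are all choices made only for this example.

```python
def kmer_set(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversify(hits, top_n, trade_off=0.5, k=3):
    """Greedily pick top_n hits, penalizing similarity to already chosen hits.

    hits: list of (sequence, score) pairs, score assumed to lie in [0, 1].
    trade_off: 1.0 ranks purely by score, 0.0 purely by novelty.
    """
    remaining = list(hits)
    selected = []
    while remaining and len(selected) < top_n:

        def gain(hit):
            seq, score = hit
            # Redundancy is the highest similarity to any hit chosen so far.
            redundancy = max(
                (jaccard(kmer_set(seq, k), kmer_set(s, k)) for s, _ in selected),
                default=0.0,
            )
            return trade_off * score - (1.0 - trade_off) * redundancy

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    hits = [("ACGTACGT", 1.0), ("ACGTACGA", 0.95), ("TTGGCCAA", 0.6)]
    for seq, score in diversify(hits, top_n=2):
        print(seq, score)
```

In this toy input the second-best hit is nearly identical to the best one, so the greedy step prefers the lower-scoring but dissimilar third hit.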
The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space
An indexed sequence of strings is a data structure for storing a string
sequence that supports random access, searching, range counting and analytics
operations, both for exact matches and prefix search. String sequences lie at
the core of column-oriented databases, log processing, and other storage and
query tasks. In these applications each string can appear several times and the
order of the strings in the sequence is relevant. The prefix structure of the
strings is relevant as well: common prefixes are sought in strings to extract
interesting features from the sequence. Moreover, space-efficiency is highly
desirable as it translates directly into higher performance, since more data
can fit in fast memory.
We introduce and study the problem of compressed indexed sequence of strings,
representing indexed sequences of strings in nearly-optimal compressed space,
both in the static and dynamic settings, while preserving provably good
performance for the supported operations.
We present a new data structure for this problem, the Wavelet Trie, which
combines the classical Patricia Trie with the Wavelet Tree, a succinct data
structure for storing a compressed sequence. The resulting Wavelet Trie
smoothly adapts to a sequence of strings that changes over time. It improves on
the state-of-the-art compressed data structures by supporting a dynamic
alphabet (i.e. the set of distinct strings) and prefix queries, both crucial
requirements in the aforementioned applications, and on traditional indexes by
reducing space occupancy to close to the entropy of the sequence.
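For readers unfamiliar with the Wavelet Tree component, the following is a minimal sketch of a plain, static, pointer-based wavelet tree over a sequence of strings, supporting random access and rank (counting occurrences in a prefix of the sequence). It is only a sketch under simplifying assumptions: it uses uncompressed Python lists for the bitvectors, a fixed alphabet, and no Patricia Trie, so it shows neither the compression, the dynamic alphabet, nor the prefix queries of the Wavelet Trie. The class and method names are hypothetical.

```python
class WaveletTree:
    """Static wavelet tree over a sequence of comparable symbols."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            lower = set(self.alphabet[:mid])
            # Bit i says whether seq[i] belongs to the upper half of the alphabet.
            self.bits = [0 if s in lower else 1 for s in seq]
            # ones[i] = number of 1-bits in bits[:i], for constant-time bit rank.
            self.ones = [0]
            for b in self.bits:
                self.ones.append(self.ones[-1] + b)
            self.left = WaveletTree([s for s in seq if s in lower],
                                    self.alphabet[:mid])
            self.right = WaveletTree([s for s in seq if s not in lower],
                                     self.alphabet[mid:])

    def access(self, i):
        """Return the i-th element of the original sequence."""
        node = self
        while len(node.alphabet) > 1:
            if node.bits[i]:
                i, node = node.ones[i], node.right
            else:
                i, node = i - node.ones[i], node.left
        return node.alphabet[0]

    def rank(self, sym, i):
        """Number of occurrences of sym among the first i elements."""
        node = self
        while len(node.alphabet) > 1:
            mid = len(node.alphabet) // 2
            if sym < node.alphabet[mid]:
                i, node = i - node.ones[i], node.left
            else:
                i, node = node.ones[i], node.right
        return i if node.alphabet and node.alphabet[0] == sym else 0


if __name__ == "__main__":
    log = ["a.com", "b.org", "a.com", "c.net", "b.org", "a.com"]
    wt = WaveletTree(log)
    print(wt.access(3), wt.rank("a.com", len(log)))
```

Both operations walk one root-to-leaf path, so their cost depends on the logarithm of the alphabet size rather than on the sequence length.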
Degree Sequence Bound for Join Cardinality Estimation
Recent work has demonstrated the catastrophic effects of poor cardinality
estimates on query processing time. In particular, underestimating query
cardinality can result in overly optimistic query plans which take orders of
magnitude longer to complete than one generated with the true cardinality.
Cardinality bounding avoids this pitfall by computing a strict upper bound on
the query's output size using statistics about the database such as table sizes
and degrees, i.e. value frequencies. In this paper, we extend this line of work
by proving a novel bound called the Degree Sequence Bound which takes into
account the full degree sequences and the max tuple multiplicity. This bound
improves upon previous work incorporating degree constraints which focused on
the maximum degree rather than the degree sequence. Further, we describe how to
practically compute this bound using a learned approximation of the true degree
sequences.
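The difference between bounding with the maximum degree and bounding with the full degree sequence can be seen in the simplest case, a single equi-join of two tables on one column. The sketch below covers only that special case and uses exact degree sequences rather than a learned approximation; the function names are hypothetical. Pairing the two degree sequences in descending order gives an upper bound on the join size by the rearrangement inequality.

```python
from collections import Counter


def degree_sequence(column):
    """Value frequencies of a join column, sorted in descending order."""
    return sorted(Counter(column).values(), reverse=True)


def degree_sequence_bound(deg_r, deg_s):
    """Upper bound for a two-table equi-join: match largest degrees together."""
    pairs = zip(sorted(deg_r, reverse=True), sorted(deg_s, reverse=True))
    return sum(a * b for a, b in pairs)


def max_degree_bound(deg_r, deg_s):
    """Coarser bound that uses only table sizes and maximum degrees."""
    size_r, size_s = sum(deg_r), sum(deg_s)
    return min(size_r * max(deg_s, default=0), size_s * max(deg_r, default=0))


def true_join_size(col_r, col_s):
    """Exact output size of the equi-join on the two columns."""
    count_r, count_s = Counter(col_r), Counter(col_s)
    return sum(count_r[v] * count_s[v] for v in count_r)


if __name__ == "__main__":
    r_x = [1, 1, 1, 2, 3]
    s_x = [1, 2, 2, 2, 3]
    deg_r, deg_s = degree_sequence(r_x), degree_sequence(s_x)
    print(true_join_size(r_x, s_x),
          degree_sequence_bound(deg_r, deg_s),
          max_degree_bound(deg_r, deg_s))
```

In this example the heavy values of the two columns do not coincide, so the true size falls below the degree sequence bound, which in turn is tighter than the maximum-degree bound.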
DyREx: Dynamic Query Representation for Extractive Question Answering
Extractive question answering (ExQA) is an essential task for Natural
Language Processing. The dominant approach to ExQA is one that represents the
input sequence tokens (question and passage) with a pre-trained transformer,
then uses two learned query vectors to compute distributions over the start and
end answer span positions. These query vectors lack the context of the inputs,
which can be a bottleneck for the model performance. To address this problem,
we propose DyREx, a generalization of the vanilla approach
where we dynamically compute query vectors given the input, using an attention
mechanism through transformer layers. Empirical observations demonstrate that
our approach consistently improves the performance over the standard one. The
code and accompanying files for running the experiments are available at
https://github.com/urchade/DyReX. Comment: Accepted at "2nd Workshop on Efficient Natural Language and Speech
Processing (ENLSP-II)" @ NeurIPS 202
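The contrast between a static query vector and an input-dependent one can be sketched in a few lines of plain Python. This is a toy illustration, not the DyREx architecture: it assumes token representations are given as small lists of floats, and it replaces the transformer layers with a single attention step that has no learned projections. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def position_distribution(tokens, query):
    """Distribution over token positions: softmax of query-token dot products."""
    return softmax([dot(query, h) for h in tokens])


def dynamic_query(tokens, query):
    """Make the query input-dependent with one scaled dot-product attention step.

    Simplification (assumption): a residual update with the attended context,
    without the learned projections or stacked layers of a real transformer.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, h) / scale for h in tokens])
    context = [sum(w * h[d] for w, h in zip(weights, tokens))
               for d in range(len(query))]
    return [q + c for q, c in zip(query, context)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token representations
    start_query = [1.0, 0.0]                       # stands in for a learned vector
    print(position_distribution(tokens, start_query))
    print(position_distribution(tokens, dynamic_query(tokens, start_query)))
```

In the vanilla setting the same `start_query` is used for every input; in the dynamic variant the vector used for scoring changes with the passage, which is the property the abstract argues for.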