Making Queries Tractable on Big Data with Preprocessing
A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data off-line, so that queries in the class can be subsequently evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of Π-tractable queries, denoted by ΠT⁰Q, to characterize classes of queries that can be answered in parallel poly-logarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are Π-tractable and are feasible on big data. (3) We also study a set ΠTQ of query classes that can be effectively converted to Π-tractable queries by re-factorizing their data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for ΠTQ. (5) We also show that ΠT⁰Q ⊂ P unless P = NC, i.e., the set ΠT⁰Q of all Π-tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, ΠTQ = P, i.e., all PTIME query classes can be made Π-tractable via proper re-factorizations. This work is a step towards understanding the tractability of queries in the context of big data.
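The preprocessing idea can be illustrated with a toy example that is not from the paper: a linear scan answers membership queries in O(n), but after a one-off PTIME sort the same queries take O(log n). This is only an analogy for the off-line/on-line split the abstract describes, not the paper's actual construction.

```python
from bisect import bisect_left

def preprocess(records):
    """Off-line step, done once in PTIME: sort the data, O(n log n)."""
    return sorted(records)

def query(index, key):
    """On-line step: answer a membership query in O(log n)
    via binary search over the preprocessed index."""
    i = bisect_left(index, key)
    return i < len(index) and index[i] == key
```

The expensive work is paid once; every subsequent query is exponentially cheaper than a scan, which is the trade-off the paper formalizes.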
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
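A minimal sketch of the coarse-then-fine search strategy the abstract alludes to, written from the description rather than the authors' code: cluster the data into covering balls of radius `cluster_radius` (the number of balls tracks the metric entropy), then answer a range query by scanning only clusters whose center passes a triangle-inequality test. All function names here are illustrative.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_coarse_index(points, cluster_radius):
    """Greedy covering: assign each point to the first center within
    cluster_radius, opening a new center when none is close enough."""
    centers, clusters = [], []
    for p in points:
        for i, c in enumerate(centers):
            if dist(p, c) <= cluster_radius:
                clusters[i].append(p)
                break
        else:
            centers.append(p)
            clusters.append([p])
    return centers, clusters

def range_search(query, radius, centers, clusters, cluster_radius):
    """Coarse stage: keep clusters whose center lies within
    radius + cluster_radius; by the triangle inequality no true hit
    can be discarded. Fine stage: exact scan of surviving clusters."""
    hits = []
    for c, members in zip(centers, clusters):
        if dist(query, c) <= radius + cluster_radius:
            hits.extend(p for p in members if dist(query, p) <= radius)
    return hits
```

When the data has low fractal dimension, few clusters survive the coarse filter, so the fine stage touches only a small fraction of the dataset.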
Towards Analytics Aware Ontology Based Access to Static and Streaming Data (Extended Version)
Real-time analytics that requires integration and aggregation of
heterogeneous and distributed streaming and static data is a typical task in
many industrial scenarios such as diagnostics of turbines in Siemens. The
ontology-based data access (OBDA) approach has great potential to facilitate
such tasks; however, it has a number of limitations in dealing with analytics
that restrict its use in important industrial applications. Based on our
experience with Siemens, we argue that in order to overcome those limitations
OBDA should be extended to become analytics, source, and cost aware. In this
work we propose such an extension. In particular, we propose an ontology,
mapping, and query language for OBDA, where aggregate and other analytical
functions are first-class citizens. Moreover, we develop query optimisation
techniques that allow analytical tasks to be processed efficiently over static
and streaming data. We implement our approach in a system and evaluate it with
Siemens turbine data.
Parameter Compilation
In resolving instances of a computational problem, if multiple instances of
interest share a feature in common, it may be fruitful to compile this feature
into a format that allows for more efficient resolution, even if the
compilation is relatively expensive. In this article, we introduce a formal
framework for classifying problems according to their compilability. The basic
object in our framework is that of a parameterized problem, which here is a
language along with a parameterization---a map which provides, for each
instance, a so-called parameter on which compilation may be performed. Our
framework is positioned within the paradigm of parameterized complexity, and
our notions are relatable to established concepts in the theory of
parameterized complexity. Indeed, we view our framework as playing a unifying
role, integrating together parameterized complexity and compilability theory.