Database Learning: Toward a Database that Becomes Smarter Every Time
In today's databases, previous query answers rarely help in answering future
queries. For the first time, to the best of our knowledge, we change this
paradigm in an approximate query processing (AQP) context. We make the
following observation: the answer to each query reveals some degree of
knowledge about the answer to another query because their answers stem from the
same underlying distribution that has produced the entire dataset. Exploiting
and refining this knowledge should allow us to answer queries more
analytically, rather than by reading enormous amounts of raw data. Also,
processing more queries should continuously enhance our knowledge of the
underlying distribution, and hence lead to increasingly faster response times
for future queries.
We call this novel idea---learning from past query answers---Database
Learning. We exploit the principle of maximum entropy to produce answers, which
are in expectation guaranteed to be more accurate than existing sample-based
approximations. Empowered by this idea, we build a query engine on top of Spark
SQL, called Verdict. We conduct extensive experiments on real-world query
traces from a large customer of a major database vendor. Our results
demonstrate that Verdict supports 73.7% of these queries, speeding them up by
up to 23.0x for the same accuracy level compared to existing AQP systems.
Comment: This manuscript is an extended report of the work published in ACM
SIGMOD conference 201
Holistic Influence Maximization: Combining Scalability and Efficiency with Opinion-Aware Models
The steady growth of graph data from social networks has resulted in
widespread research into solutions to the influence maximization
problem. In this paper, we propose a holistic solution to the influence
maximization (IM) problem. (1) We introduce an opinion-cum-interaction (OI)
model that closely mirrors real-world scenarios. Under the OI model, we
introduce a novel problem of Maximizing the Effective Opinion (MEO) of
influenced users. We prove that the MEO problem is NP-hard and cannot be
approximated within a constant ratio unless P=NP. (2) We propose a heuristic
algorithm OSIM to efficiently solve the MEO problem. To better explain the OSIM
heuristic, we first introduce EaSyIM - the opinion-oblivious version of OSIM, a
scalable algorithm capable of running within practical compute times on
commodity hardware. In addition to serving as a fundamental building block for
OSIM, EaSyIM is capable of addressing the scalability aspects of the IM
problem, namely memory consumption and running time, as well.
Empirically, our algorithms keep the deviation in the spread within 5% of the
best known methods in the literature. In
addition, our experiments show that both OSIM and EaSyIM are effective,
efficient, scalable and significantly enhance the ability to analyze real
datasets.
Comment: ACM SIGMOD Conference 2016, 18 pages, 29 figures
Building Confidential and Efficient Query Services in the Cloud with RASP Data Perturbation
With the wide deployment of public cloud computing infrastructures, using
clouds to host data query services has become an appealing solution because of
its advantages in scalability and cost savings. However, some data might be so
sensitive that the data owner does not want to move them to the cloud unless
data confidentiality and query privacy are guaranteed. On the other hand, a
secured query service should still provide efficient query processing and
significantly reduce the in-house workload to fully realize the benefits of
cloud computing. We propose the RASP data perturbation method to provide secure
and efficient range query and kNN query services for protected data in the
cloud. The RASP data perturbation method combines order preserving encryption,
dimensionality expansion, random noise injection, and random projection, to
provide strong resilience to attacks on the perturbed data and queries. It also
preserves multidimensional ranges, which allows existing indexing techniques to
be applied to speed up range query processing. The kNN-R algorithm is designed
to work with the RASP range query algorithm to process the kNN queries. We have
carefully analyzed the attacks on data and queries under a precisely defined
threat model and realistic security assumptions. Extensive experiments have
been conducted to show the advantages of this approach in efficiency and
security.
Comment: 18 pages, to appear in IEEE TKDE, accepted in December 201
PQL: A Declarative Query Language over Dynamic Biological Schemata
We introduce the PQL query language, used in the GeneSeek genetic data integration project. PQL incorporates many features of query languages for semi-structured data. To these we add the ability to express metadata constraints such as intended semantics and database curation approach. These constraints guide the dynamic generation of potential query plans. This allows a single query to remain relevant even when source and mediated schemas are continually evolving, as is often the case in data integration.
Post-processing of association rules.
In this paper, we situate and motivate the need for a post-processing phase after the association rule mining algorithm when it is plugged into the knowledge discovery in databases process. Major research effort has already been devoted to optimising the initially proposed mining algorithms. Extracting the most interesting knowledge nuggets from the standard output of these algorithms is a considerable challenge, since running them commonly yields a vast number of association rules. The sheer multitude of generated rules often clouds the perception of the interpreters. Properly assessing the usefulness of the generated output requires dealing effectively with different forms of data redundancy and with data that are plainly uninteresting. To that end, we give a tentative overview of some of the main post-processing tasks, taking into account the efforts already reported in the literature.
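To make the idea of post-processing concrete, here is a minimal Python sketch of two filters that are commonly applied to mined rules: dropping rules whose lift is too low to be interesting, and dropping rules made redundant by a more general rule. The abstract does not name the specific tasks the paper covers, so the `Rule` type, the `min_lift` threshold, and the redundancy criterion below are all assumptions chosen for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one mined association rule."""
    antecedent: frozenset
    consequent: frozenset
    confidence: float  # P(consequent | antecedent)
    lift: float        # confidence / P(consequent)


def drop_uninteresting(rules, min_lift=1.1):
    """Keep rules whose lift reaches the (assumed) threshold."""
    return [r for r in rules if r.lift >= min_lift]


def drop_redundant(rules):
    """Drop a rule if a strictly more general rule (proper-subset antecedent,
    same consequent) has at least the same confidence."""
    kept = []
    for r in rules:
        redundant = any(
            o.consequent == r.consequent
            and o.antecedent < r.antecedent  # proper subset
            and o.confidence >= r.confidence
            for o in rules
        )
        if not redundant:
            kept.append(r)
    return kept


def post_process(rules, min_lift=1.1):
    """Apply the interestingness filter, then the redundancy filter."""
    return drop_redundant(drop_uninteresting(rules, min_lift))


# Made-up example rules, for illustration only.
sample_rules = [
    Rule(frozenset({"bread"}), frozenset({"butter"}), 0.80, 1.6),
    Rule(frozenset({"bread", "milk"}), frozenset({"butter"}), 0.75, 1.5),
    Rule(frozenset({"bread", "jam"}), frozenset({"butter"}), 0.95, 1.9),
    Rule(frozenset({"milk"}), frozenset({"bread"}), 0.50, 1.0),
]
kept_rules = post_process(sample_rules)
```

On the made-up rules above, the `{milk} -> {bread}` rule falls below the lift threshold, and `{bread, milk} -> {butter}` is discarded because the more general `{bread} -> {butter}` is at least as confident. The quadratic scan in `drop_redundant` is acceptable for a sketch; a real rule set of the size the abstract describes would likely need an index keyed by consequent.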