Efficient Scalable Accurate Regression Queries in In-DBMS Analytics
Recent trends aim to incorporate advanced data analytics capabilities within DBMSs. Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves a lot to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and associated regression query processing algorithms, which are efficient, scalable and accurate. We focus on predicting the answers to two key query types that reveal dependencies between the values of different attributes: (i) mean-value queries and (ii) multivariate linear regression queries, both within specific data subspaces defined based on the values of other attributes. Our algorithms achieve many orders of magnitude improvement in query processing efficiency and near-perfect approximations of the underlying relationships among data attributes.
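To make the two query types concrete, the following is a minimal sketch of how their exact answers could be computed over a data subspace. It is not the predictive model the abstract proposes; it shows only the exact computation that such a model would approximate. The function names, the attribute names (`region`, `x`, `y`) and the toy data are hypothetical, and the regression is reduced to a single predictor for brevity, whereas the abstract refers to multivariate regression.

```python
from statistics import mean


def select_subspace(rows, attr, low, high):
    """Return the rows whose `attr` value lies in the closed range [low, high]."""
    return [r for r in rows if low <= r[attr] <= high]


def mean_value_query(rows, target):
    """Exact mean of the `target` attribute over the given rows."""
    return mean(r[target] for r in rows)


def regression_query(rows, x, y):
    """Exact least-squares fit y = slope * x + intercept (one predictor only)."""
    xs = [r[x] for r in rows]
    ys = [r[y] for r in rows]
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical toy table: y depends linearly on x; `region` defines the subspace.
rows = [{"region": float(i % 5), "x": float(i), "y": 2.0 * i + 1.0} for i in range(20)]

subspace = select_subspace(rows, "region", 1.0, 2.0)
avg_y = mean_value_query(subspace, "y")
slope, intercept = regression_query(subspace, "x", "y")
print(len(subspace), avg_y, slope, intercept)
```

Both functions scan every selected row, which is the cost that grows with data size and that a predictive approach aims to avoid.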
Query-driven learning for predictive analytics of data subspace cardinality
Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., because of privacy issues. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodates the well-known selection queries: multi-dimensional range and distance-nearest neighbors (radius) queries. Our function estimation model: (i) quantizes the vectorial query space, by learning the analysts’ access patterns over a data space, (ii) associates query vectors with their corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers a performance that is superior to that of data-driven approaches.
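Steps (i) to (iii) above can be illustrated with a small sketch. It is an assumption-laden simplification, not the authors' model: it swaps in a simple online vector quantizer with a distance threshold (the hypothetical `vigilance` parameter), stores a running mean cardinality per prototype, and predicts from the nearest prototype. The change detection of step (iv), based on optimal stopping, is omitted. A radius query is encoded here as a vector (center coordinates followed by the radius), which is also an assumption.

```python
import math


class QueryDrivenCardinalityEstimator:
    """Toy estimator that learns only from (query vector, cardinality) pairs."""

    def __init__(self, vigilance):
        self.vigilance = vigilance  # hypothetical threshold for creating a new prototype
        self.prototypes = []        # each entry: [vector, mean_cardinality, count]

    def _nearest(self, query):
        best, best_dist = None, math.inf
        for proto in self.prototypes:
            dist = math.dist(proto[0], query)
            if dist < best_dist:
                best, best_dist = proto, dist
        return best, best_dist

    def observe(self, query, cardinality):
        """Quantize the query space and associate prototypes with cardinalities."""
        proto, dist = self._nearest(query)
        if proto is None or dist > self.vigilance:
            self.prototypes.append([list(query), float(cardinality), 1])
            return
        proto[2] += 1
        rate = 1.0 / proto[2]
        # Move the prototype toward the query and update its running mean cardinality.
        proto[0] = [w + rate * (x - w) for w, x in zip(proto[0], query)]
        proto[1] += rate * (cardinality - proto[1])

    def predict(self, query):
        """Predict the cardinality of an unseen query from its most similar prototype."""
        proto, _ = self._nearest(query)
        if proto is None:
            raise ValueError("no queries observed yet")
        return proto[1]


estimator = QueryDrivenCardinalityEstimator(vigilance=1.0)
estimator.observe((0.0, 0.0, 1.0), 10)
estimator.observe((0.1, 0.0, 1.0), 12)
estimator.observe((5.0, 5.0, 1.0), 100)
print(estimator.predict((0.2, 0.1, 1.0)), estimator.predict((4.8, 5.0, 1.0)))
```

Note that prediction never touches the underlying data, only past queries and their answers, which is what makes the approach usable when data access is costly or restricted.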
Performance and scalability of indexed subgraph query processing methods
Graph data management systems have become very popular as graphs are the natural data model for many applications. One of the main problems addressed by these systems is subgraph query processing; i.e., given a query graph, return all graphs that contain the query. The naive method for processing such queries is to perform a subgraph isomorphism test against each graph in the dataset. This obviously does not scale, as subgraph isomorphism is NP-Complete. Thus, many indexing methods have been proposed to reduce the number of candidate graphs that have to undergo the subgraph isomorphism test. In this paper, we identify a set of key factors (parameters) that influence the performance of related methods: namely, the number of nodes per graph, the graph density, the number of distinct labels, the number of graphs in the dataset, and the query graph size. We then conduct comprehensive and systematic experiments that analyze the sensitivity of the various methods to the values of the key parameters. Our aims are twofold: first, to derive conclusions about the algorithms’ relative performance, and, second, to stress-test all algorithms, deriving insights as to their scalability, and highlight how both performance and scalability depend on the above factors. We choose six well-established indexing methods, namely Grapes, CT-Index, GraphGrepSX, gIndex, Tree+∆, and gCode, as representative approaches of the overall design space, including the most recent and best performing methods. We report on their index construction time and index size, and on query processing performance in terms of time and false positive ratio. We employ both real and synthetic datasets. Specifically, four real datasets of different characteristics are used: AIDS, PDBS, PCM, and PPI. In addition, we generate a large number of synthetic graph datasets, enabling us to systematically study the algorithms’ performance and scalability versus the aforementioned key parameters.
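The filter-then-verify pattern described above can be sketched as follows. This is a toy illustration under stated assumptions, not an implementation of any of the six indexing methods: the filter is a plain label-count check standing in for an index, the verification is a brute-force backtracking test for a label-preserving, non-induced subgraph match, and the `Graph` class and the dataset are hypothetical.

```python
from collections import Counter


class Graph:
    """Undirected graph with one label per node."""

    def __init__(self, labels, edges):
        self.labels = dict(labels)
        self.adj = {v: set() for v in self.labels}
        for u, v in edges:
            self.adj[u].add(v)
            self.adj[v].add(u)


def passes_filter(query, graph):
    """Cheap necessary condition: the graph has enough nodes of every query label."""
    need = Counter(query.labels.values())
    have = Counter(graph.labels.values())
    return all(have[label] >= count for label, count in need.items())


def contains(query, graph):
    """Brute-force backtracking test: does `graph` contain `query` as a subgraph?"""
    q_nodes = list(query.labels)

    def extend(mapping):
        if len(mapping) == len(q_nodes):
            return True
        u = q_nodes[len(mapping)]
        used = set(mapping.values())
        for v in graph.labels:
            if v in used or graph.labels[v] != query.labels[u]:
                continue
            # Every already-mapped neighbour of u must map to a neighbour of v.
            if all(mapping[w] in graph.adj[v] for w in query.adj[u] if w in mapping):
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})


def subgraph_query(query, dataset):
    """Return (candidates after filtering, verified answers)."""
    candidates = [name for name, g in dataset.items() if passes_filter(query, g)]
    answers = [name for name in candidates if contains(query, dataset[name])]
    return candidates, answers


query = Graph({0: "C", 1: "O"}, [(0, 1)])
dataset = {
    "g1": Graph({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)]),
    "g2": Graph({0: "C", 1: "O", 2: "N"}, [(0, 2), (1, 2)]),
    "g3": Graph({0: "N", 1: "N"}, [(0, 1)]),
}
candidates, answers = subgraph_query(query, dataset)
false_positives = [name for name in candidates if name not in answers]
print(candidates, answers, false_positives)
```

Here `g2` survives the filter but fails verification, so it counts as a false positive; the share of such candidates is one way to read the false positive ratio the abstract reports on.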
Scalable aggregation predictive analytics: a query-driven machine learning approach
We introduce a predictive modeling solution that provides high quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology is based on historical queries and their answers to accurately predict ad-hoc queries’ answers. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression & associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high quality prediction results. The significance of our contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers a performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, evaluating its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark’s COUNT method.
Conceiving "network governance": The potential of the concepts of governmentality and normalization
Machine Unlearning in Learned Databases: An Experimental Analysis
Machine learning models based on neural networks (NNs) are enjoying
ever-increasing attention in the DB community. However, an important issue has
been largely overlooked, namely the challenge of dealing with the highly
dynamic nature of DBs, where data updates are fundamental, highly-frequent
operations. Although some recent research has addressed the issues of
maintaining updated NN models in the presence of new data insertions, the
effects of data deletions (a.k.a., "machine unlearning") remain a blind spot.
With this work, for the first time to our knowledge, we pose and answer the
following key questions: What is the effect of unlearning algorithms on
NN-based DB models? How do these effects translate to effects on downstream DB
tasks, such as selectivity estimation (SE), approximate query processing (AQP),
data generation (DG), and upstream tasks like data classification (DC)? What
metrics should we use to assess the impact and efficacy of unlearning
algorithms in learned DBs? Is the problem of machine unlearning in DBs
different from that of machine learning in DBs in the face of data insertions?
Is the problem of machine unlearning for DBs different from unlearning in the
ML literature? What are the overhead and efficiency of unlearning algorithms?
What is the sensitivity of unlearning on batching delete operations? If we have
a suitable unlearning algorithm, can we combine it with an algorithm handling
data insertions en route to solving the general adaptability/updatability
requirement in learned DBs in the face of both data inserts and deletes? We
answer these questions using a comprehensive set of experiments, various
unlearning algorithms, a variety of downstream DB tasks, and an upstream task
(DC), each with different NNs, and using a variety of metrics on a variety of
real datasets, making this also a first key step towards a benchmark for
learned DB unlearning.
Comment: Accepted as a conference paper at SIGMOD 202
Towards Unbounded Machine Unlearning
Deep machine unlearning is the problem of 'removing' from a trained neural
network a subset of its training set. This problem is very timely and has many
applications, including the key tasks of removing biases (RB), resolving
confusion (RC) (caused by mislabelled data in trained models), as well as
allowing users to exercise their 'right to be forgotten' to protect User
Privacy (UP). This paper is the first, to our knowledge, to study unlearning
for different applications (RB, RC, UP), with the view that each has its own
desiderata, definitions for 'forgetting' and associated metrics for forget
quality. For UP, we propose a novel adaptation of a strong Membership Inference
Attack for unlearning. We also propose SCRUB, a novel unlearning algorithm,
which is the only method that is consistently a top performer for forget
quality across the different application-dependent metrics for RB, RC, and UP.
At the same time, SCRUB is also consistently a top performer on metrics that
measure model utility (i.e. accuracy on retained data and generalization), and
is more efficient than previous work. The above are substantiated through a
comprehensive empirical evaluation against previous state-of-the-art
- …