Cuttlefish: A Lightweight Primitive for Adaptive Query Processing
Modern data processing applications execute increasingly sophisticated
analysis that requires operations beyond traditional relational algebra. As a
result, operators in query plans grow in diversity and complexity. Designing
query optimizer rules and cost models to choose physical operators for all of
these novel logical operators is impractical. To address this challenge, we
develop Cuttlefish, a new primitive for adaptively processing online query
plans that explores candidate physical operator instances during query
execution and exploits the fastest ones using multi-armed bandit reinforcement
learning techniques. We prototype Cuttlefish in Apache Spark and adaptively
choose operators for image convolution, regular expression matching, and
relational joins. Our experiments show Cuttlefish-based adaptive convolution
and regular expression operators can reach 72-99% of the throughput of an
all-knowing oracle that always selects the optimal algorithm, even when
individual physical operators are up to 105x slower than the optimal.
Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x
compared with Spark SQL's query optimizer.
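The explore/exploit loop the abstract describes can be sketched with a simple epsilon-greedy bandit over candidate physical operators. This is a minimal illustrative sketch, not Cuttlefish's actual implementation; the operator names and throughput numbers below are hypothetical.

```python
import random

def epsilon_greedy_operator(stats, epsilon=0.1):
    """Pick a physical operator: explore with probability epsilon,
    otherwise exploit the highest observed mean throughput."""
    if random.random() < epsilon or not any(n for n, _ in stats.values()):
        return random.choice(list(stats))
    return max(stats, key=lambda op: stats[op][1] / stats[op][0]
               if stats[op][0] else 0.0)

def record(stats, op, throughput):
    n, total = stats[op]
    stats[op] = (n + 1, total + throughput)

# Hypothetical candidate operators for one logical operator (e.g. convolution),
# tracked as (pull count, cumulative throughput).
operators = {"fft_conv": (0, 0.0), "loop_conv": (0, 0.0), "separable_conv": (0, 0.0)}
true_rates = {"fft_conv": 9.0, "loop_conv": 1.0, "separable_conv": 5.0}

random.seed(0)
for _ in range(500):  # one bandit decision per batch of tuples
    op = epsilon_greedy_operator(operators, epsilon=0.1)
    record(operators, op, random.gauss(true_rates[op], 0.5))

best = max(operators, key=lambda op: operators[op][1] / operators[op][0]
           if operators[op][0] else float("-inf"))
```

After a few hundred batches the scheduler has concentrated most pulls on the fastest operator while still paying a small exploration cost, which is the behavior the 72-99%-of-oracle result quantifies.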
Asynchronous Complex Analytics in a Distributed Dataflow Architecture
Scalable distributed dataflow systems have recently experienced widespread
adoption, with commodity dataflow engines such as Hadoop and Spark, and even
commodity SQL engines routinely supporting increasingly sophisticated analytics
tasks (e.g., support vector machines, logistic regression, collaborative
filtering). However, these systems' synchronous (often Bulk Synchronous
Parallel) dataflow execution model is at odds with an increasingly important
trend in the machine learning community: the use of asynchrony via shared,
mutable state (i.e., data races) in convex programming tasks, which has---in a
single-node context---delivered noteworthy empirical performance gains and
inspired new research into asynchronous algorithms. In this work, we attempt to
bridge this gap by evaluating the use of lightweight, asynchronous state
transfer within a commodity dataflow engine. Specifically, we investigate the
use of asynchronous sideways information passing (ASIP) that presents
single-stage parallel iterators with a Volcano-like intra-operator iterator
that can be used for asynchronous information passing. We port two synchronous
convex programming algorithms, stochastic gradient descent and the alternating
direction method of multipliers (ADMM), to use ASIPs. We evaluate an
implementation of ASIPs within Apache Spark that exhibits considerable
speedups as well as a rich set of performance trade-offs in the use of these
asynchronous algorithms.
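The asynchrony-via-shared-mutable-state idea the abstract refers to can be illustrated with a toy Hogwild-style stochastic gradient descent, where worker threads update a shared parameter without locks. This is a single-node sketch of the general technique, not the ASIP mechanism itself; the 1-D least-squares problem is invented for illustration.

```python
import threading
import random

# Shared, mutable parameter updated without locks (data races tolerated),
# in the spirit of asynchronous convex programming.
w = [0.0]

def sgd_worker(data, steps, lr):
    for _ in range(steps):
        x, y = random.choice(data)
        grad = 2 * (w[0] * x - y) * x   # d/dw of (w*x - y)^2
        w[0] -= lr * grad               # racy in-place update

random.seed(1)
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]   # true w = 3
threads = [threading.Thread(target=sgd_worker, args=(data, 2000, 0.01))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite the races, the iterates still converge for this convex objective, which is the empirical observation that motivated asynchronous execution models in the first place.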
Chasing Similarity: Distribution-aware Aggregation Scheduling (Extended Version)
Parallel aggregation is a ubiquitous operation in data analytics that is
expressed as GROUP BY in SQL, reduce in Hadoop, or segment in TensorFlow.
Parallel aggregation starts with an optional local pre-aggregation step and
then repartitions the intermediate result across the network. While local
pre-aggregation works well for low-cardinality aggregations, the network
communication cost remains significant for high-cardinality aggregations even
after local pre-aggregation. The problem is that the repartition-based
algorithm for high-cardinality aggregation does not fully utilize the network.
In this work, we first formulate a mathematical model that captures the
performance of parallel aggregation. We prove that finding optimal aggregation
plans from a known data distribution is NP-hard, assuming the Small Set
Expansion conjecture. We propose GRASP, a GReedy Aggregation Scheduling
Protocol that decomposes parallel aggregation into phases. GRASP is
distribution-aware as it aggregates the most similar partitions in each phase
to reduce the transmitted data size in subsequent phases. In addition, GRASP
takes the available network bandwidth into account when scheduling aggregations
in each phase to maximize network utilization. The experimental evaluation on
real data shows that GRASP outperforms repartition-based aggregation by 3.5x
and LOOM by 2.0x.
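The distribution-aware phase scheduling can be sketched as a greedy loop that repeatedly merges the two most similar partitions (here by Jaccard similarity of their group-by key sets), since overlapping keys combine and shrink the data shipped in later phases. A minimal sketch under assumed inputs, not the actual GRASP protocol:

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def greedy_aggregation_phases(partitions):
    """Repeatedly merge the most similar pair of key partitions,
    one phase at a time, tracking keys transmitted over the network."""
    parts = [set(p) for p in partitions]
    transmitted = 0
    while len(parts) > 1:
        i, j = max(combinations(range(len(parts)), 2),
                   key=lambda ij: jaccard(parts[ij[0]], parts[ij[1]]))
        transmitted += min(len(parts[i]), len(parts[j]))  # ship the smaller side
        merged = parts[i] | parts[j]
        parts = [p for k, p in enumerate(parts) if k not in (i, j)] + [merged]
    return parts[0], transmitted

# Hypothetical group-by key sets held by four workers.
keys, cost = greedy_aggregation_phases([{1, 2, 3}, {1, 2, 4},
                                        {7, 8, 9}, {8, 9, 10}])
```

Merging similar partitions first ({1,2,3} with {1,2,4}, {7,8,9} with {8,9,10}) keeps the intermediate results small before the final cross-cluster merge.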
NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud
There is an increasing interest in executing complex analyses over large
graphs, many of which require processing a large number of multi-hop
neighborhoods or subgraphs. Examples include ego network analysis, motif
counting, personalized recommendations, and others. These tasks are not well
served by existing vertex-centric graph processing frameworks, where user
programs are only able to directly access the state of a single vertex. This
paper introduces NSCALE, a novel end-to-end graph processing framework that
enables the distributed execution of complex subgraph-centric analytics over
large-scale graphs in the cloud. NSCALE enables users to write programs at the
level of subgraphs rather than at the level of vertices. Unlike most previous
graph processing frameworks, which apply the user program to the entire graph,
NSCALE allows users to declaratively specify subgraphs of interest. Our
framework includes a novel graph extraction and packing (GEP) module that
utilizes a cost-based optimizer to partition and pack the subgraphs of interest
into memory on as few machines as possible. The distributed execution engine
then takes over and runs the user program in parallel, while respecting the
scope of the various subgraphs. Our experimental results show
orders-of-magnitude improvements in performance and drastic reductions in the
cost of analytics compared to vertex-centric approaches.
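The packing goal of the GEP module (fit the subgraphs of interest into memory on as few machines as possible) can be sketched with first-fit-decreasing bin packing. This is a stand-in heuristic under an assumed uniform memory budget, not NSCALE's cost-based optimizer; the subgraph names and sizes are hypothetical.

```python
def pack_subgraphs(subgraph_sizes, machine_capacity):
    """First-fit-decreasing: place subgraphs, largest memory footprint
    first, on the first machine with enough remaining capacity."""
    machines = []      # remaining capacity per machine
    assignment = {}
    for sg, size in sorted(subgraph_sizes.items(), key=lambda kv: -kv[1]):
        for m, free in enumerate(machines):
            if free >= size:
                machines[m] -= size
                assignment[sg] = m
                break
        else:                                   # no machine fits: open a new one
            machines.append(machine_capacity - size)
            assignment[sg] = len(machines) - 1
    return assignment, len(machines)

assign, n_machines = pack_subgraphs(
    {"ego_a": 6, "ego_b": 5, "ego_c": 4, "ego_d": 3, "ego_e": 2},
    machine_capacity=10)
```

Five subgraphs totalling 20 units fit on two 10-unit machines here; a naive one-subgraph-per-machine placement would use five.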
Differential Privacy Techniques for Cyber Physical Systems: A Survey
Modern cyber physical systems (CPSs) are widely used in our daily lives
because of developments in information and communication technologies (ICT).
With the proliferation of CPSs, the security and privacy threats associated
with these systems are also increasing. Passive attacks are used by intruders
to gain access to private information in CPSs. To make CPS data more secure,
privacy preservation strategies such as encryption and k-anonymity have been
presented in the past. However, with advances in CPS architecture, these
techniques also need certain modifications. Meanwhile, differential privacy
has emerged as an efficient technique to protect CPS data privacy. In this
paper, we present a comprehensive survey of differential privacy techniques
for CPSs. In particular, we survey the application and implementation of
differential privacy in four major application areas of CPSs: energy systems,
transportation systems, healthcare and medical systems, and the industrial
Internet of Things (IIoT). Furthermore, we present open issues, challenges,
and future research directions for differential privacy techniques for CPSs.
This survey can serve as a basis for the development of modern differential
privacy techniques to address various problems and data privacy scenarios of
CPSs.
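The core differential privacy primitive the survey builds on is the Laplace mechanism: release a query answer plus noise scaled to sensitivity/epsilon. A minimal stdlib-only sketch; the smart-meter query is an invented example in the spirit of the survey's energy-systems application area.

```python
import random
import math

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale) using only the stdlib.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Release a count with epsilon-differential privacy via the
    Laplace mechanism (noise scale = sensitivity / epsilon)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
# Hypothetical query: how many households exceeded a usage threshold?
noisy = [dp_count(120, epsilon=1.0, rng=rng) for _ in range(20000)]
avg = sum(noisy) / len(noisy)
```

Each individual release hides any single household's contribution, while the noise is unbiased, so aggregates over many releases remain close to the true count.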
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
Few prior works study deep learning on point sets. PointNet by Qi et al. is a
pioneer in this direction. However, by design PointNet does not capture local
structures induced by the metric space points live in, limiting its ability to
recognize fine-grained patterns and generalizability to complex scenes. In this
work, we introduce a hierarchical neural network that applies PointNet
recursively on a nested partitioning of the input point set. By exploiting
metric space distances, our network is able to learn local features with
increasing contextual scales. With further observation that point sets are
usually sampled with varying densities, which results in greatly decreased
performance for networks trained on uniform densities, we propose novel set
learning layers to adaptively combine features from multiple scales.
Experiments show that our network called PointNet++ is able to learn deep point
set features efficiently and robustly. In particular, results significantly
better than state-of-the-art have been obtained on challenging benchmarks of 3D
point clouds.
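The nested partitioning PointNet++ applies recursively starts from centroids chosen by farthest point sampling, which can be sketched in a few lines. This is a generic FPS sketch on an invented toy point set, not the paper's code.

```python
def farthest_point_sampling(points, k):
    """Greedy FPS: repeatedly pick the point farthest from the set chosen
    so far, yielding well-spread centroids for local neighborhood grouping."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    chosen = [0]                                  # arbitrary seed point
    dist = [d2(p, points[0]) for p in points]     # distance to chosen set
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(dist[i], d2(points[i], points[nxt]))
                for i in range(len(points))]
    return chosen

pts = [(0, 0, 0), (0.1, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 0)]
idx = farthest_point_sampling(pts, 3)
```

Because FPS maximizes the minimum pairwise distance greedily, the chosen centroids cover the point set more evenly than uniform random sampling, which matters when densities vary across the scene.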
Privacy in Deep Learning: A Survey
The ever-growing advances of deep learning in many areas including vision,
recommendation systems, natural language processing, etc., have led to the
adoption of Deep Neural Networks (DNNs) in production systems. The availability
of large datasets and high computational power are the main contributors to
these advances. The datasets are usually crowdsourced and may contain sensitive
information. This poses serious privacy concerns as this data can be misused or
leaked through various vulnerabilities. Even if the cloud provider and the
communication link is trusted, there are still threats of inference attacks
where an attacker could speculate properties of the data used for training, or
find the underlying model architecture and parameters. In this survey, we
review the privacy concerns brought by deep learning, and the mitigating
techniques introduced to tackle these issues. We also show that there is a gap
in the literature regarding test-time inference privacy, and propose possible
future research directions.
PointHop: An Explainable Machine Learning Method for Point Cloud Classification
An explainable machine learning method for point cloud classification, called
the PointHop method, is proposed in this work. The PointHop method consists of
two stages: 1) local-to-global attribute building through iterative one-hop
information exchange, and 2) classification and ensembles. In the attribute
building stage, we address the problem of unordered point cloud data using a
space partitioning procedure and developing a robust descriptor that
characterizes the relationship between a point and its one-hop neighbor in a
PointHop unit. When we put multiple PointHop units in cascade, the attributes
of a point will grow by taking its relationship with one-hop neighbor points
into account iteratively. Furthermore, to control the rapid dimension growth of
the attribute vector associated with a point, we use the Saab transform to
reduce the attribute dimension in each PointHop unit. In the classification and
ensemble stage, we feed the feature vector obtained from multiple PointHop
units to a classifier. We explore ensemble methods to further improve the
classification performance. It is shown by experimental results
that the PointHop method offers classification performance that is comparable
with state-of-the-art methods while demanding much lower training complexity.
An Empirical Comparison of Big Graph Frameworks in the Context of Network Analysis
Complex networks are relational data sets commonly represented as graphs. The
analysis of their intricate structure is relevant to many areas of science and
commerce, and data sets may reach sizes that require distributed storage and
processing. We describe and compare programming models for distributed
computing with a focus on graph algorithms for large-scale complex network
analysis. Four frameworks - GraphLab, Apache Giraph, Giraph++ and Apache Flink
- are used to implement algorithms for the representative problems Connected
Components, Community Detection, PageRank and Clustering Coefficients. The
implementations are executed on a computer cluster to evaluate the frameworks'
suitability in practice and to compare their performance to that of the
single-machine, shared-memory parallel network analysis package NetworKit. Out
of the distributed frameworks, GraphLab and Apache Giraph generally show the
best performance. In our experiments a cluster of eight computers running
Apache Giraph enables the analysis of a network with about 2 billion edges,
which is too large for a single machine of the same type. However, for networks
that fit into memory of one machine, the performance of the shared-memory
parallel implementation is far better than the distributed ones. The study
provides experimental evidence for selecting the appropriate framework
depending on the task and data volume.
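PageRank, one of the four representative problems benchmarked above, reduces to a short power iteration; a generic single-machine sketch on a tiny invented graph (not any framework's implementation):

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank over an adjacency list {node: [out-neighbors]}."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1 - d) / n for v in adj}       # teleport term
        for v, outs in adj.items():
            if outs:
                share = rank[v] / len(outs)
                for u in outs:
                    nxt[u] += d * share
            else:                                  # dangling node: spread uniformly
                for u in adj:
                    nxt[u] += d * rank[v] / n
        rank = nxt
    return rank

r = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

In vertex-centric frameworks such as Giraph the inner loop becomes a per-vertex compute function exchanging messages per superstep, which is exactly the synchronous model the study compares against shared-memory execution.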
Query-driven Frequent Co-occurring Term Extraction over Relational Data using MapReduce
In this paper we study how to efficiently compute frequent co-occurring
terms (FCT) in the results of a keyword query in parallel using
the popular MapReduce framework. Taking as input a keyword query q and an
integer k, an FCT query reports the k terms that are not in q, but appear most
frequently in the results of the keyword query q over multiple joined
relations. The returned terms of FCT search can be used to do query expansion
and query refinement for traditional keyword search. Unlike single-platform
FCT search, our proposed approach efficiently answers an FCT query in the
MapReduce paradigm, running in parallel without pre-computing the results of
the original keyword query. We produce the final FCT search results with two
MapReduce jobs: the first extracts statistical information from the data, and
the second calculates the total frequency of each term from the output of the
first job. Across the two MapReduce jobs, we balance the load of the mappers
and the computation of the reducers as much as possible.
Analytical and experimental evaluations demonstrate the efficiency and
scalability of our proposed approach using TPC-H benchmark datasets with
different sizes.
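The two-job pipeline can be sketched with a single-machine MapReduce simulation: job 1 counts non-query terms, job 2 ranks them and keeps the top k. A toy sketch with invented data, not the paper's Hadoop implementation:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Minimal single-machine simulation of one MapReduce job."""
    groups = defaultdict(list)
    for rec in records:                    # map phase
        for key, val in mapper(rec):
            groups[key].append(val)        # shuffle: group values by key
    return [reducer(k, vs) for k, vs in sorted(groups.items())]  # reduce

def fct(result_tuples, query_terms, k):
    # Job 1: emit (term, 1) for each non-query term and sum per term.
    job1 = run_mapreduce(
        result_tuples,
        mapper=lambda text: [(t, 1) for t in text.split() if t not in query_terms],
        reducer=lambda term, ones: (term, sum(ones)),
    )
    # Job 2: gather all (term, count) pairs and keep the k most frequent.
    job2 = run_mapreduce(
        job1,
        mapper=lambda tc: [("topk", tc)],
        reducer=lambda _, tcs: sorted(tcs, key=lambda tc: (-tc[1], tc[0]))[:k],
    )
    return job2[0]

rows = ["db query engine", "query engine spark", "engine tuning"]
top = fct(rows, query_terms={"query"}, k=2)
```

Job 1 carries the heavy per-record scan and can be load-balanced across mappers, while job 2 only touches the much smaller per-term aggregates, mirroring the division of work described above.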