Cuttlefish: A Lightweight Primitive for Adaptive Query Processing
Modern data processing applications execute increasingly sophisticated
analysis that requires operations beyond traditional relational algebra. As a
result, operators in query plans grow in diversity and complexity. Designing
query optimizer rules and cost models to choose physical operators for all of
these novel logical operators is impractical. To address this challenge, we
develop Cuttlefish, a new primitive for adaptively processing online query
plans that explores candidate physical operator instances during query
execution and exploits the fastest ones using multi-armed bandit reinforcement
learning techniques. We prototype Cuttlefish in Apache Spark and adaptively
choose operators for image convolution, regular expression matching, and
relational joins. Our experiments show Cuttlefish-based adaptive convolution
and regular expression operators can reach 72-99% of the throughput of an
all-knowing oracle that always selects the optimal algorithm, even when
individual physical operators are up to 105x slower than the optimal.
Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x
compared with Spark SQL's query optimizer.
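The explore/exploit loop the abstract describes can be sketched with a simple epsilon-greedy bandit over candidate physical operators. This is a minimal illustrative sketch, not Cuttlefish's actual implementation; the operator names and throughput numbers below are hypothetical.

```python
import random

def epsilon_greedy_operator(stats, epsilon=0.1):
    """Pick a physical operator: explore with probability epsilon,
    otherwise exploit the highest observed mean throughput."""
    if random.random() < epsilon or not any(n for n, _ in stats.values()):
        return random.choice(list(stats))
    return max(stats, key=lambda op: stats[op][1] / stats[op][0]
               if stats[op][0] else 0.0)

def record(stats, op, throughput):
    n, total = stats[op]
    stats[op] = (n + 1, total + throughput)

# Hypothetical candidate operators for one logical operator (e.g. convolution),
# tracked as (pull count, cumulative throughput).
operators = {"fft_conv": (0, 0.0), "loop_conv": (0, 0.0), "separable_conv": (0, 0.0)}
true_rates = {"fft_conv": 9.0, "loop_conv": 1.0, "separable_conv": 5.0}

random.seed(0)
for _ in range(500):  # one bandit decision per batch of tuples
    op = epsilon_greedy_operator(operators, epsilon=0.1)
    record(operators, op, random.gauss(true_rates[op], 0.5))

best = max(operators, key=lambda op: operators[op][1] / operators[op][0]
           if operators[op][0] else float("-inf"))
```

After a few hundred batches the scheduler has concentrated most pulls on the fastest operator while still paying a small exploration cost, which is the behavior the 72-99%-of-oracle result quantifies.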
Asynchronous Complex Analytics in a Distributed Dataflow Architecture
Scalable distributed dataflow systems have recently experienced widespread
adoption, with commodity dataflow engines such as Hadoop and Spark, and even
commodity SQL engines routinely supporting increasingly sophisticated analytics
tasks (e.g., support vector machines, logistic regression, collaborative
filtering). However, these systems' synchronous (often Bulk Synchronous
Parallel) dataflow execution model is at odds with an increasingly important
trend in the machine learning community: the use of asynchrony via shared,
mutable state (i.e., data races) in convex programming tasks, which has---in a
single-node context---delivered noteworthy empirical performance gains and
inspired new research into asynchronous algorithms. In this work, we attempt to
bridge this gap by evaluating the use of lightweight, asynchronous state
transfer within a commodity dataflow engine. Specifically, we investigate the
use of asynchronous sideways information passing (ASIP) that presents
single-stage parallel iterators with a Volcano-like intra-operator iterator
that can be used for asynchronous information passing. We port two synchronous
convex programming algorithms, stochastic gradient descent and the alternating
direction method of multipliers (ADMM), to use ASIPs. We evaluate an
implementation of ASIPs within Apache Spark that exhibits considerable
speedups as well as a rich set of performance trade-offs in the use of these
asynchronous algorithms.
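The asynchrony-via-shared-mutable-state idea the abstract refers to can be illustrated with a toy Hogwild-style stochastic gradient descent, where worker threads update a shared parameter without locks. This is a single-node sketch of the general technique, not the ASIP mechanism itself; the 1-D least-squares problem is invented for illustration.

```python
import threading
import random

# Shared, mutable parameter updated without locks (data races tolerated),
# in the spirit of asynchronous convex programming.
w = [0.0]

def sgd_worker(data, steps, lr):
    for _ in range(steps):
        x, y = random.choice(data)
        grad = 2 * (w[0] * x - y) * x   # d/dw of (w*x - y)^2
        w[0] -= lr * grad               # racy in-place update

random.seed(1)
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]   # true w = 3
threads = [threading.Thread(target=sgd_worker, args=(data, 2000, 0.01))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Despite the races, the iterates still converge for this convex objective, which is the empirical observation that motivated asynchronous execution models in the first place.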
Chasing Similarity: Distribution-aware Aggregation Scheduling (Extended Version)
Parallel aggregation is a ubiquitous operation in data analytics that is
expressed as GROUP BY in SQL, reduce in Hadoop, or segment in TensorFlow.
Parallel aggregation starts with an optional local pre-aggregation step and
then repartitions the intermediate result across the network. While local
pre-aggregation works well for low-cardinality aggregations, the network
communication cost remains significant for high-cardinality aggregations even
after local pre-aggregation. The problem is that the repartition-based
algorithm for high-cardinality aggregation does not fully utilize the network.
In this work, we first formulate a mathematical model that captures the
performance of parallel aggregation. We prove that finding optimal aggregation
plans from a known data distribution is NP-hard, assuming the Small Set
Expansion conjecture. We propose GRASP, a GReedy Aggregation Scheduling
Protocol that decomposes parallel aggregation into phases. GRASP is
distribution-aware as it aggregates the most similar partitions in each phase
to reduce the transmitted data size in subsequent phases. In addition, GRASP
takes the available network bandwidth into account when scheduling aggregations
in each phase to maximize network utilization. The experimental evaluation on
real data shows that GRASP outperforms repartition-based aggregation by 3.5x
and LOOM by 2.0x.
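The distribution-aware phase scheduling can be sketched as a greedy loop that repeatedly merges the two most similar partitions (here by Jaccard similarity of their group-by key sets), since overlapping keys combine and shrink the data shipped in later phases. A minimal sketch under assumed inputs, not the actual GRASP protocol:

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def greedy_aggregation_phases(partitions):
    """Repeatedly merge the most similar pair of key partitions,
    one phase at a time, tracking keys transmitted over the network."""
    parts = [set(p) for p in partitions]
    transmitted = 0
    while len(parts) > 1:
        i, j = max(combinations(range(len(parts)), 2),
                   key=lambda ij: jaccard(parts[ij[0]], parts[ij[1]]))
        transmitted += min(len(parts[i]), len(parts[j]))  # ship the smaller side
        merged = parts[i] | parts[j]
        parts = [p for k, p in enumerate(parts) if k not in (i, j)] + [merged]
    return parts[0], transmitted

# Hypothetical group-by key sets held by four workers.
keys, cost = greedy_aggregation_phases([{1, 2, 3}, {1, 2, 4},
                                        {7, 8, 9}, {8, 9, 10}])
```

Merging similar partitions first ({1,2,3} with {1,2,4}, {7,8,9} with {8,9,10}) keeps the intermediate results small before the final cross-cluster merge.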
NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud
There is an increasing interest in executing complex analyses over large
graphs, many of which require processing a large number of multi-hop
neighborhoods or subgraphs. Examples include ego network analysis, motif
counting, personalized recommendations, and others. These tasks are not well
served by existing vertex-centric graph processing frameworks, where user
programs are only able to directly access the state of a single vertex. This
paper introduces NSCALE, a novel end-to-end graph processing framework that
enables the distributed execution of complex subgraph-centric analytics over
large-scale graphs in the cloud. NSCALE enables users to write programs at the
level of subgraphs rather than at the level of vertices. Unlike most previous
graph processing frameworks, which apply the user program to the entire graph,
NSCALE allows users to declaratively specify subgraphs of interest. Our
framework includes a novel graph extraction and packing (GEP) module that
utilizes a cost-based optimizer to partition and pack the subgraphs of interest
into memory on as few machines as possible. The distributed execution engine
then takes over and runs the user program in parallel, while respecting the
scope of the various subgraphs. Our experimental results show
orders-of-magnitude improvements in performance and drastic reductions in the
cost of analytics compared to vertex-centric approaches.
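The packing goal of the GEP module (fit the subgraphs of interest into memory on as few machines as possible) can be sketched with first-fit-decreasing bin packing. This is a stand-in heuristic under an assumed uniform memory budget, not NSCALE's cost-based optimizer; the subgraph names and sizes are hypothetical.

```python
def pack_subgraphs(subgraph_sizes, machine_capacity):
    """First-fit-decreasing: place subgraphs, largest memory footprint
    first, on the first machine with enough remaining capacity."""
    machines = []      # remaining capacity per machine
    assignment = {}
    for sg, size in sorted(subgraph_sizes.items(), key=lambda kv: -kv[1]):
        for m, free in enumerate(machines):
            if free >= size:
                machines[m] -= size
                assignment[sg] = m
                break
        else:                                   # no machine fits: open a new one
            machines.append(machine_capacity - size)
            assignment[sg] = len(machines) - 1
    return assignment, len(machines)

assign, n_machines = pack_subgraphs(
    {"ego_a": 6, "ego_b": 5, "ego_c": 4, "ego_d": 3, "ego_e": 2},
    machine_capacity=10)
```

Five subgraphs totalling 20 units fit on two 10-unit machines here; a naive one-subgraph-per-machine placement would use five.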
Differential Privacy Techniques for Cyber Physical Systems: A Survey
Modern cyber physical systems (CPSs) are widely used in our daily lives
because of developments in information and communication technologies (ICT).
With the proliferation of CPSs, the security and privacy threats associated
with these systems are also increasing. Passive attacks are used by intruders
to gain access to private information in CPSs. To make CPS data more secure,
privacy preservation strategies such as encryption and k-anonymity have been
presented in the past. However, with advances in CPS architecture, these
techniques also need certain modifications. Meanwhile, differential privacy
has emerged as an efficient technique to protect CPS data privacy. In this
paper, we present a comprehensive survey of differential privacy techniques
for CPSs. In particular, we survey the application and implementation of
differential privacy in four major application areas of CPSs: energy systems,
transportation systems, healthcare and medical systems, and the industrial
Internet of Things (IIoT). Furthermore, we present open issues, challenges,
and future research directions for differential privacy techniques for CPSs.
This survey can serve as a basis for the development of modern differential
privacy techniques to address various problems and data privacy scenarios of
CPSs.
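The core differential privacy primitive the survey builds on is the Laplace mechanism: release a query answer plus noise scaled to sensitivity/epsilon. A minimal stdlib-only sketch; the smart-meter query is an invented example in the spirit of the survey's energy-systems application area.

```python
import random
import math

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale) using only the stdlib.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Release a count with epsilon-differential privacy via the
    Laplace mechanism (noise scale = sensitivity / epsilon)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
# Hypothetical query: how many households exceeded a usage threshold?
noisy = [dp_count(120, epsilon=1.0, rng=rng) for _ in range(20000)]
avg = sum(noisy) / len(noisy)
```

Each individual release hides any single household's contribution, while the noise is unbiased, so aggregates over many releases remain close to the true count.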
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
Few prior works study deep learning on point sets. PointNet by Qi et al. is a
pioneer in this direction. However, by design PointNet does not capture local
structures induced by the metric space points live in, limiting its ability to
recognize fine-grained patterns and generalizability to complex scenes. In this
work, we introduce a hierarchical neural network that applies PointNet
recursively on a nested partitioning of the input point set. By exploiting
metric space distances, our network is able to learn local features with
increasing contextual scales. With further observation that point sets are
usually sampled with varying densities, which results in greatly decreased
performance for networks trained on uniform densities, we propose novel set
learning layers to adaptively combine features from multiple scales.
Experiments show that our network called PointNet++ is able to learn deep point
set features efficiently and robustly. In particular, results significantly
better than state-of-the-art have been obtained on challenging benchmarks of 3D
point clouds.
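The nested partitioning PointNet++ applies recursively starts from centroids chosen by farthest point sampling, which can be sketched in a few lines. This is a generic FPS sketch on an invented toy point set, not the paper's code.

```python
def farthest_point_sampling(points, k):
    """Greedy FPS: repeatedly pick the point farthest from the set chosen
    so far, yielding well-spread centroids for local neighborhood grouping."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    chosen = [0]                                  # arbitrary seed point
    dist = [d2(p, points[0]) for p in points]     # distance to chosen set
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(dist[i], d2(points[i], points[nxt]))
                for i in range(len(points))]
    return chosen

pts = [(0, 0, 0), (0.1, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 0)]
idx = farthest_point_sampling(pts, 3)
```

Because FPS maximizes the minimum pairwise distance greedily, the chosen centroids cover the point set more evenly than uniform random sampling, which matters when densities vary across the scene.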
Privacy in Deep Learning: A Survey
The ever-growing advances of deep learning in many areas including vision,
recommendation systems, natural language processing, etc., have led to the
adoption of Deep Neural Networks (DNNs) in production systems. The availability
of large datasets and high computational power are the main contributors to
these advances. The datasets are usually crowdsourced and may contain sensitive
information. This poses serious privacy concerns as this data can be misused or
leaked through various vulnerabilities. Even if the cloud provider and the
communication link is trusted, there are still threats of inference attacks
where an attacker could speculate properties of the data used for training, or
find the underlying model architecture and parameters. In this survey, we
review the privacy concerns brought by deep learning, and the mitigating
techniques introduced to tackle these issues. We also show that there is a gap
in the literature regarding test-time inference privacy, and propose possible
future research directions.
PointHop: An Explainable Machine Learning Method for Point Cloud Classification
An explainable machine learning method for point cloud classification, called
the PointHop method, is proposed in this work. The PointHop method consists of
two stages: 1) local-to-global attribute building through iterative one-hop
information exchange, and 2) classification and ensembles. In the attribute
building stage, we address the problem of unordered point cloud data using a
space partitioning procedure and developing a robust descriptor that
characterizes the relationship between a point and its one-hop neighbor in a
PointHop unit. When we put multiple PointHop units in cascade, the attributes
of a point will grow by taking its relationship with one-hop neighbor points
into account iteratively. Furthermore, to control the rapid dimension growth of
the attribute vector associated with a point, we use the Saab transform to
reduce the attribute dimension in each PointHop unit. In the classification and
ensemble stage, we feed the feature vector obtained from multiple PointHop
units to a classifier. We explore ensemble methods to further improve the
classification performance. It is shown by experimental results
that the PointHop method offers classification performance that is comparable
with state-of-the-art methods while demanding much lower training complexity.
An Empirical Comparison of Big Graph Frameworks in the Context of Network Analysis
Complex networks are relational data sets commonly represented as graphs. The
analysis of their intricate structure is relevant to many areas of science and
commerce, and data sets may reach sizes that require distributed storage and
processing. We describe and compare programming models for distributed
computing with a focus on graph algorithms for large-scale complex network
analysis. Four frameworks - GraphLab, Apache Giraph, Giraph++ and Apache Flink
- are used to implement algorithms for the representative problems Connected
Components, Community Detection, PageRank and Clustering Coefficients. The
implementations are executed on a computer cluster to evaluate the frameworks'
suitability in practice and to compare their performance to that of the
single-machine, shared-memory parallel network analysis package NetworKit. Out
of the distributed frameworks, GraphLab and Apache Giraph generally show the
best performance. In our experiments a cluster of eight computers running
Apache Giraph enables the analysis of a network with about 2 billion edges,
which is too large for a single machine of the same type. However, for networks
that fit into memory of one machine, the performance of the shared-memory
parallel implementation is far better than the distributed ones. The study
provides experimental evidence for selecting the appropriate framework
depending on the task and data volume.
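PageRank, one of the four representative problems benchmarked above, reduces to a short power iteration; a generic single-machine sketch on a tiny invented graph (not any framework's implementation):

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank over an adjacency list {node: [out-neighbors]}."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1 - d) / n for v in adj}       # teleport term
        for v, outs in adj.items():
            if outs:
                share = rank[v] / len(outs)
                for u in outs:
                    nxt[u] += d * share
            else:                                  # dangling node: spread uniformly
                for u in adj:
                    nxt[u] += d * rank[v] / n
        rank = nxt
    return rank

r = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

In vertex-centric frameworks such as Giraph the inner loop becomes a per-vertex compute function exchanging messages per superstep, which is exactly the synchronous model the study compares against shared-memory execution.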
Query-driven Frequent Co-occurring Term Extraction over Relational Data using MapReduce
In this paper we study how to efficiently compute frequent co-occurring
terms (FCT) in the results of a keyword query in parallel using
the popular MapReduce framework. Taking as input a keyword query q and an
integer k, an FCT query reports the k terms that are not in q, but appear most
frequently in the results of the keyword query q over multiple joined
relations. The returned terms of FCT search can be used to do query expansion
and query refinement for traditional keyword search. Unlike single-platform
FCT search, our proposed approach efficiently answers an FCT query in the
MapReduce paradigm, running in parallel without pre-computing the results of
the original keyword query. We produce the final FCT search results with two
MapReduce jobs: the first extracts statistical information from the data, and
the second calculates the total frequency of each term from the output of the
first job. Across the two MapReduce jobs, we balance the load of the mappers
and the computation of the reducers as much as possible.
Analytical and experimental evaluations demonstrate the efficiency and
scalability of our proposed approach using TPC-H benchmark datasets with
different sizes.
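The two-job pipeline can be sketched with a single-machine MapReduce simulation: job 1 counts non-query terms, job 2 ranks them and keeps the top k. A toy sketch with invented data, not the paper's Hadoop implementation:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Minimal single-machine simulation of one MapReduce job."""
    groups = defaultdict(list)
    for rec in records:                    # map phase
        for key, val in mapper(rec):
            groups[key].append(val)        # shuffle: group values by key
    return [reducer(k, vs) for k, vs in sorted(groups.items())]  # reduce

def fct(result_tuples, query_terms, k):
    # Job 1: emit (term, 1) for each non-query term and sum per term.
    job1 = run_mapreduce(
        result_tuples,
        mapper=lambda text: [(t, 1) for t in text.split() if t not in query_terms],
        reducer=lambda term, ones: (term, sum(ones)),
    )
    # Job 2: gather all (term, count) pairs and keep the k most frequent.
    job2 = run_mapreduce(
        job1,
        mapper=lambda tc: [("topk", tc)],
        reducer=lambda _, tcs: sorted(tcs, key=lambda tc: (-tc[1], tc[0]))[:k],
    )
    return job2[0]

rows = ["db query engine", "query engine spark", "engine tuning"]
top = fct(rows, query_terms={"query"}, k=2)
```

Job 1 carries the heavy per-record scan and can be load-balanced across mappers, while job 2 only touches the much smaller per-term aggregates, mirroring the division of work described above.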