
368 research outputs found

Asymptotically Optimal Approximation Algorithms for Coflow Scheduling

Author: Ahuja R. K.
Al-Fares M.
Al-Fares Mohammad
Peis B.
Zaharia M.
Zhao Y.
Publication date: 08/03/2018

Many modern datacenter applications involve large-scale computations composed of multiple data flows that need to be completed over a shared set of distributed resources. Such a computation completes when all of its flows complete. A useful abstraction for modeling such scenarios is a coflow, which is a collection of flows (e.g., tasks, packets, data transmissions) that all share the same performance goal. In this paper, we present the first approximation algorithms for scheduling coflows over general network topologies with the objective of minimizing total weighted completion time. We consider two different models for coflows based on the nature of individual flows: circuits and packets. We design constant-factor polynomial-time approximation algorithms for scheduling packet-based coflows with or without given flow paths, and circuit-based coflows with given flow paths. Furthermore, we give an $O(\log n/\log \log n)$-approximation polynomial-time algorithm for scheduling circuit-based coflows where flow paths are not given (here $n$ is the number of network edges). We obtain our results by developing a general framework for coflow schedules, based on interval-indexed linear programs, which may extend to other coflow models and objective functions and may also yield improved approximation bounds for specific network scenarios. We also present an experimental evaluation of our approach for circuit-based coflows that shows a performance improvement of at least 22% on average over competing heuristics.
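The objective named in this abstract can be illustrated with a minimal sketch. It assumes the per-flow finish times under some schedule are already known; the `Coflow` class, its field names, and the numbers are hypothetical and are not taken from the paper, and the scheduling algorithm itself is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Coflow:
    # Hypothetical container: a weight and the finish time of each flow
    # under some already-chosen schedule.
    weight: float
    flow_completion_times: List[float]

    def completion_time(self) -> float:
        # A coflow completes only when its last flow completes.
        return max(self.flow_completion_times)


def total_weighted_completion_time(coflows: List[Coflow]) -> float:
    # Objective from the abstract: sum over coflows of weight * completion time.
    return sum(c.weight * c.completion_time() for c in coflows)


# Made-up example data.
coflows = [
    Coflow(weight=2.0, flow_completion_times=[3.0, 5.0, 4.0]),
    Coflow(weight=1.0, flow_completion_times=[7.0, 6.0]),
]
objective = total_weighted_completion_time(coflows)
print(objective)  # 2*5 + 1*7 = 17.0
```

Because each coflow contributes its slowest flow, finishing most of a coflow's flows early does not help the objective unless the last one also finishes early.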

arXiv.org e-Print Archive

Crossref

GraphSE$^2$: An Encrypted Graph Database for Privacy-Preserving Social Search

Author: Beaver D.
Chi Y.
Papadimitriou A.
Poddar R.
Slee M.
Xie D.
Yao A.C.
Zaharia M.
Zhang Y.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 15/05/2019

In this paper, we propose GraphSE$^2$, an encrypted graph database for online social network services to address massive data breaches. GraphSE$^2$ preserves the functionality of social search, a key enabler for quality social network services, where social search queries are conducted on a large-scale social graph while performing set and computational operations on user-generated content. To enable efficient privacy-preserving social search, GraphSE$^2$ provides an encrypted structural data model to facilitate parallel and encrypted graph data access. It is also designed to decompose complex social search queries into atomic operations and realise them via interchangeable protocols in a fast and scalable manner. We build GraphSE$^2$ with various queries supported in the Facebook graph search engine and implement a full-fledged prototype. Extensive evaluations on Azure Cloud demonstrate that GraphSE$^2$ is practical for querying a social graph with a million users.

Comment: This is the full version of our AsiaCCS paper "GraphSE$^2$: An Encrypted Graph Database for Privacy-Preserving Social Search". It includes the security proof of the proposed scheme. If you want to cite our work, please cite the conference version of i

arXiv.org e-Print Archive

Crossref

LINVIEW: Incremental View Maintenance for Complex Analytical Queries

Author: Abadi D.
Arasu A.
Deng L.
Grama A.
Kamvar S.
Kraska T.
McSherry F.
Motwani R.
Press W.
Seeger M.
Stonebraker M.
Venkataraman S.
Whaley C.
Zaharia M.
Zhang Y.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 09/05/2014

Many analytics tasks and machine learning problems can be naturally expressed by iterative linear algebra programs. In this paper, we study the incremental view maintenance problem for such complex analytical queries. We develop a framework, called LINVIEW, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change. As a consequence, our techniques make incremental view maintenance of linear algebra practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our evaluation demonstrates the efficiency of LINVIEW in generating parallel incremental programs that outperform re-evaluation techniques by more than an order of magnitude.

Comment: 14 pages, SIGMO
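One way to read the idea of keeping deltas in factored form is the following sketch. It is an illustration under stated assumptions, not LINVIEW's implementation. The view is assumed to be $V = A A$, and the input change is assumed to be rank one, $A' = A + u v^T$. Expanding the product gives $A'A' = V + (Au)v^T + u(v^T A) + (v^T u)\,u v^T$, so the view can be refreshed with matrix-vector products only. All function names are hypothetical.

```python
def matmul(X, Y):
    # Plain O(n^3) matrix product, used for the full re-evaluation baseline.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def incremental_square(A, V, u, v):
    # Given V = A*A and a rank-1 change A' = A + u v^T, return A'*A'
    # using only matrix-vector products and outer products, i.e. O(n^2) work.
    n = len(A)
    Au = [sum(A[i][k] * u[k] for k in range(n)) for i in range(n)]  # A u
    vA = [sum(v[k] * A[k][j] for k in range(n)) for j in range(n)]  # v^T A
    vu = sum(v[k] * u[k] for k in range(n))                         # v^T u
    return [[V[i][j] + Au[i] * v[j] + u[i] * vA[j] + vu * u[i] * v[j]
             for j in range(n)] for i in range(n)]


# Made-up example: add 1 to entry A[0][1], expressed as u v^T.
A = [[1, 2], [3, 4]]
u = [1, 0]
v = [0, 1]
V = matmul(A, A)

A_new = [[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
V_incremental = incremental_square(A, V, u, v)
V_recomputed = matmul(A_new, A_new)
print(V_incremental == V_recomputed)
```

The single-entry change touches every entry of the view here, which is the avalanche effect the abstract describes; the factored delta keeps the cost of applying it below that of recomputing the product.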

arXiv.org e-Print Archive

Crossref

Low Latency Geo-distributed Data Analytics

Author: Agarwal S.
Ananthanarayanan G.
Boutin E.
Corbett J. C.
Rabkin A.
Sitaraman R.
Venkataraman S.
Vulimiri A.
Zaharia M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/12/2015

Low latency analytics on geographically distributed datasets (across datacenters, edge clusters) is an upcoming and increasingly important challenge. The dominant approach of aggregating all the data to a single datacenter significantly hurts the timeliness of analytics. At the same time, running queries over geo-distributed inputs using the current intra-DC analytics frameworks also leads to high query response times because these frameworks cannot cope with the relatively low and variable capacity of WAN links. We present Iridium, a system for low latency geo-distributed analytics. Iridium achieves low query response times by optimizing placement of both data and tasks of the queries. The joint data and task placement optimization, however, is intractable. Therefore, Iridium uses an online heuristic to redistribute datasets among the sites prior to queries’ arrivals, and places the tasks to reduce network bottlenecks during the query’s execution. Finally, it also contains a knob to budget WAN usage. Evaluation across eight worldwide EC2 regions using production queries shows that Iridium speeds up queries by 3× to 19× and lowers WAN usage by 15% to 64% compared to existing baselines.

CiteSeerX

Crossref

LDM: Lineage-Aware Data Management in Multi-tier Storage Systems

Author: Anton Spivak
C Ré
J Dean
JN Khasnabish
M Grund
M Zaharia
R Balasubramonian
R Bose
Sangwhan Moon
W Lu
Yingyi Bu
Publication venue: Iowa State University Digital Repository
Publication date: 01/02/2019

We design and develop LDM, a novel data management solution that caters to the needs of applications exhibiting the lineage property, i.e., applications in which the current writes are future reads. In such a class of applications, slow writes significantly hurt the overall performance of jobs, i.e., current writes determine the fate of the next reads. We believe that in a large-scale shared production cluster, the issues associated with data management can be mitigated at a much higher layer in the hierarchy of the I/O path, even before data access requests are made. In contrast to current data management solutions, which are mostly reactive and/or based on heuristics, LDM is both deterministic and proactive. We develop block-graphs, which enable LDM to capture the complete time-based data-task dependency associations and use them to perform life-cycle management through tiering of data blocks. LDM amalgamates information from the entire data center ecosystem, from the application code to file system mappings, the compute and storage device topology, etc., to make oracle-like deterministic data management decisions. In trace-driven experiments, LDM achieves a 29–52% reduction in overall data center workload execution time. Moreover, deploying LDM with extensive pre-processing creates efficient data consumption pipelines, which also reduces write and read delays significantly.

Digital Repository @ Iowa State University (ISU)

Crossref

Scipedia

Computable bounds in fork-join queueing systems

Author: Billingsley P.
Boxma O.
Duffield N.
Gibbens R. J.
Jiang Y.
Kesidis G.
Lebrecht A. S.
White T.
Zaharia M.
Publication venue: ACM
Publication date: 15/06/2015

In a Fork-Join (FJ) queueing system an upstream fork station splits incoming jobs into $N$ tasks to be further processed by $N$ parallel servers, each with its own queue; the response time of one job is determined, at a downstream join station, by the maximum of the corresponding tasks' response times. This queueing system is useful for modelling multi-service systems subject to synchronization constraints, such as MapReduce clusters or multipath routing. Despite their apparent simplicity, FJ systems are hard to analyze. This paper provides the first computable stochastic bounds on the waiting and response time distributions in FJ systems. We consider four practical scenarios by combining 1a) renewal and 1b) non-renewal arrivals, and 2a) non-blocking and 2b) blocking servers. In the case of non-blocking servers we prove that delays scale as $O(\log N)$, a law which is known for first moments under renewal input only. In the case of blocking servers, we prove that the same factor of $\log N$ dictates the stability region of the system. Simulation results indicate that our bounds are tight, especially at high utilizations, in all four scenarios. A remarkable insight gained from our results is that, at moderate to high utilizations, multipath routing 'makes sense' from a queueing perspective for two paths only, i.e., response times drop the most when $N = 2$; the technical explanation is that the resequencing (delay) price starts to quickly dominate the tempting gain due to multipath transmissions.
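A small simulation can give intuition for why a $\log N$ factor appears when a job waits for the slowest of $N$ tasks. The sketch below is deliberately simplified and is not the paper's model or bound: it ignores queueing entirely and assumes each task takes an independent exponential time with rate 1. Under that assumption the mean of the maximum is the harmonic number $H_N = \sum_{k=1}^{N} 1/k$, which grows like $\ln N$. The function names and parameter defaults are hypothetical.

```python
import math
import random


def harmonic(n):
    # H_n = 1 + 1/2 + ... + 1/n, the exact mean of the maximum of
    # n independent Exp(1) random variables.
    return sum(1.0 / k for k in range(1, n + 1))


def simulated_mean_max(n, trials=20000, rate=1.0, seed=0):
    # Monte Carlo estimate of E[max of n i.i.d. exponential task times].
    # No queueing is modelled: every task starts at time zero.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(rate) for _ in range(n))
    return total / trials


for n in (1, 2, 8, 32):
    # Columns: N, simulated mean of the max, H_N, ln N.
    print(n, round(simulated_mean_max(n), 3),
          round(harmonic(n), 3), round(math.log(n), 3))
```

The gap between $H_N$ and $\ln N$ stays bounded as $N$ grows, so in this toy setting the join delay scales logarithmically in the number of parallel servers.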

Crossref

Warwick Research Archives Portal Repository

How to Win a Hot Dog Eating Contest: Distributed Incremental View Maintenance with Batch Updates

Author: Abadi D.
Gupta A.
Kornacker M.
Motwani R.
Peng D.
Ramakrishnan R.
Zaharia M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 25/04/2016

In the quest for valuable information, modern big data applications continuously monitor streams of data. These applications demand low-latency stream processing even when faced with high volume and velocity of incoming changes and the user’s desire to ask complex queries. In this paper, we study low-latency incremental computation of complex SQL queries in both local and distributed streaming environments. We develop a technique for the efficient incrementalization of queries with nested aggregates for batch updates. We identify the cases in which batch processing can boost the performance of incremental view maintenance but also demonstrate that tuple-at-a-time processing can often achieve better performance in local mode. Batch updates are essential for enabling distributed incremental view maintenance and amortizing the cost of network communication and synchronization. We show how to derive incremental programs optimized for running on large-scale processing platforms. Our implementation of distributed incremental view maintenance can process tens of millions of tuples with few-second latency using hundreds of nodes.

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Magnetic reconnection with Sweet-Parker characteristics in two-dimensional laboratory plasmas

Author: Biskamp D.
Braginskii S. I.
Hantao Ji
Kulsrud R. M.
Masaaki Yamada
Russell Kulsrud
Scott Hsu
Sorin Zaharia
Troy Carter
Publication venue: 'AIP Publishing'
Publication date: 01/01/1999

Magnetic reconnection has been experimentally studied in a well-controlled, two-dimensional laboratory magnetohydrodynamic plasma. The observations are found to be both qualitatively and quantitatively consistent with a generalized Sweet-Parker model which incorporates compressibility, downstream pressure, and the effective resistivity. The latter is significantly enhanced over its classical values in the collisionless limit. This generalized Sweet-Parker model also applies to the case in which a unidirectional, sizable third magnetic component is present.

Crossref

UNT Digital Library

Algorithms for crime prediction in smart cities through data mining

Author: A Viloria
C Silverstein
F Erlandsson
H-W Kang
K Kianmehr
LGA Alves
M Hahsler
M Hall
M Zaharia
R-E Fan
V Amelec
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

The concentration of police resources in conflict zones contributes to the reduction of crime in the region and the optimization of those resources. This paper presents the use of regression techniques to predict the number of criminal acts in Colombian municipalities. To this end, a dataset was generated by merging the data from the Guardia Civil with public data on the demographic structure and voting trends in the municipalities. The best regressor obtained (Random Forests) achieves an RRSE (Root Relative Squared Error) of 40.12% and opens the way to incorporating other types of public data with greater predictive power. In addition, M5Rules were used to interpret the results.
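For readers unfamiliar with the metric quoted in this abstract, the sketch below computes RRSE using its usual definition: the square root of the model's squared error divided by the squared error of a baseline that always predicts the mean of the actual values. The abstract does not spell the formula out, so the definition is assumed here, and the data and the function name are made up for illustration.

```python
import math


def rrse(actual, predicted):
    # Root Relative Squared Error:
    # sqrt( sum (p_i - a_i)^2 / sum (a_i - mean(a))^2 ).
    # 0 means perfect predictions; 1 (100%) means no better than
    # always predicting the mean of the actual values.
    mean_actual = sum(actual) / len(actual)
    squared_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    baseline_error = sum((a - mean_actual) ** 2 for a in actual)
    return math.sqrt(squared_error / baseline_error)


# Hypothetical counts for four municipalities, not data from the paper.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
score = rrse(actual, predicted)
print(f"RRSE = {score:.2%}")
```

Read this way, the reported 40.12% means the model's root squared error is about 40% of what the predict-the-mean baseline would incur on the same data.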

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Digital CUC