Worst-Case Optimal Algorithms for Parallel Query Processing

Author: Beame Paul
Koutris Paraschos
Suciu Dan
Publication date: 01/01/2016

In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with $p$ servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis of the communication cost. The goal is to find worst-case optimal parallel algorithms, similar to the work of [18] for sequential algorithms. We first show that for a single round we can obtain an optimal worst-case algorithm. The optimal load for a conjunctive query $q$ when all relations have size equal to $M$ is $O(M/p^{1/\psi^*})$, where $\psi^*$ is a new query-related quantity called the edge quasi-packing number, which is different from both the edge packing number and edge cover number of the query hypergraph. For multiple rounds, we present algorithms that are optimal for several classes of queries. Finally, we show a surprising connection to the external memory model, which allows us to translate parallel algorithms to external memory algorithms. This technique allows us to recover (within a polylogarithmic factor) several recent results on the I/O complexity for computing join queries, and also obtain optimal algorithms for other classes of queries.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Comparing MapReduce and pipeline implementations for counting triangles

Author: Pasarella Sánchez Ana Edelmira
Vidal Serodio Maria Esther
Zoltan Cristina
Publication date: 01/01/2016

A common way to define a parallel solution is to apply the Divide & Conquer paradigm so that each processor acts on its own data and the processors are scheduled in a parallel fashion. MapReduce is a programming model that follows this paradigm, and allows for the definition of efficient solutions by both decomposing a problem into steps on subsets of the input data and combining the results of each step to produce final results. Although it is used to implement a wide variety of computational problems, MapReduce performance can be negatively affected whenever the replication factor grows or the size of the input is larger than the resources available at each processor. In this paper we show an alternative approach to implement the Divide & Conquer paradigm, named pipeline. The main features of pipeline are illustrated on a parallel implementation of the well-known problem of counting triangles in a graph. This problem is especially interesting when the input graph either does not fit in memory or is dynamically generated. To evaluate the properties of pipeline, a dynamic pipeline of processes and an ad-hoc version of MapReduce are implemented in the language Go, exploiting its ability to deal with channels and spawned processes. An empirical evaluation is conducted on graphs of different sizes and densities. Observed results suggest that pipeline allows for an efficient solution of the problem of counting triangles in a graph, particularly in dense and large graphs, drastically reducing the execution time with respect to the MapReduce implementation. Peer Reviewed. Postprint (published version).
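For orientation, the triangle-counting problem named in the abstract above can be stated in a few lines. The following is a minimal sequential sketch in Python, assuming an undirected simple graph given as an edge list with comparable vertex labels. It is neither the authors' Go pipeline nor their MapReduce version, and the function name `count_triangles` is hypothetical.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs.

    Illustrative baseline only: assumes the whole graph fits in memory,
    which is exactly the case the abstract says is *not* the interesting one.
    """
    # Build adjacency sets, ignoring self-loops; duplicate edges collapse.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Count common neighbours w with w > v, so that each
                # triangle is counted once, in the order u < v < w.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count


if __name__ == "__main__":
    # One triangle (0, 1, 2) plus a pendant edge (2, 3).
    print(count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)]))
```

A parallel variant would partition the outer loop over `u` among workers; how the paper's dynamic pipeline does this is not described in the abstract.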

UPCommons. Portal del coneixement obert de la UPC

MapReduce vs. pipelining counting triangles

Author: Pasarella Sánchez Ana Edelmira
Vidal Serodio Maria Esther
Zoltan Torres Ana Cristina
Publication venue: CEUR-WS.org
Publication date: 01/01/2016

In this paper we follow an alternative approach, named pipeline, to obtain a parallel implementation of the well-known problem of counting triangles in a graph. This problem is especially interesting when the input graph either does not fit in memory or is dynamically generated. To be concrete, we implement a dynamic pipeline of processes and an ad-hoc MapReduce version using the language Go. We exploit the ability of the Go language to deal with channels and spawned processes. An empirical evaluation is conducted on graphs of different sizes and densities. Observed results suggest that pipeline allows for an efficient solution of the problem of counting triangles in a graph, particularly in dense and large graphs, drastically reducing the execution time with respect to the MapReduce implementation. Peer Reviewed. Postprint (published version).

UPCommons. Portal del coneixement obert de la UPC

Enumerating Subgraphs of Constant Sizes in External Memory

Author: Deng Shiyuan
Silvestri Francesco
Tao Yufei
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 26th International Conference on Database Theory (ICDT 2023)
Publication date: 01/01/2023

We present an indivisible I/O-efficient algorithm for subgraph enumeration, where the objective is to list all the subgraphs of a massive graph $G := (V, E)$ that are isomorphic to a pattern graph $Q$ having $k = O(1)$ vertices. Our algorithm performs $O\left(\frac{|E|^{k/2}}{M^{k/2-1} B} \log_{M/B}\frac{|E|}{B} + \frac{|E|^{\rho}}{M^{\rho-1} B}\right)$ I/Os with high probability, where $\rho$ is the fractional edge covering number of $Q$ (it always holds $\rho \geq k/2$, regardless of $Q$), $M$ is the number of words in (internal) memory, and $B$ is the number of words in a disk block. Our solution is optimal in the class of indivisible algorithms for all pattern graphs with $\rho > k/2$. When $\rho = k/2$, our algorithm is still optimal as long as $M/B \geq (|E|/B)^{\epsilon}$ for any constant $\epsilon > 0$.
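To get a feel for how the two terms of the I/O bound above compare, the expression can be evaluated numerically. The sketch below writes `rho` for the fractional edge covering number and ignores the constant hidden by the O-notation, so the outputs are only relative magnitudes. The function name `io_bound_terms` and the sample parameters are assumptions made for illustration.

```python
import math


def io_bound_terms(num_edges, M, B, k, rho):
    """Return the two terms of the stated I/O bound, without hidden constants.

    num_edges: |E|; M: words of internal memory; B: words per disk block;
    k: number of pattern vertices; rho: fractional edge covering number.
    """
    if rho < k / 2:
        raise ValueError("rho must be at least k/2")
    if not (num_edges > B and M > B > 0):
        raise ValueError("expected |E| > B and M > B > 0")
    half = k / 2
    # First term: |E|^{k/2} / (M^{k/2 - 1} B) * log_{M/B}(|E| / B)
    first = num_edges ** half / (M ** (half - 1) * B) * math.log(num_edges / B, M / B)
    # Second term: |E|^rho / (M^{rho - 1} B)
    second = num_edges ** rho / (M ** (rho - 1) * B)
    return first, second


if __name__ == "__main__":
    # Triangle pattern: k = 3 and rho = 3/2, so rho = k/2 and the two terms
    # differ only by the logarithmic factor.
    first, second = io_bound_terms(2**20, 2**10, 2**4, k=3, rho=1.5)
    print(first / second)
```

For the triangle pattern the ratio of the two terms is exactly the logarithmic factor, which is the case ($\rho = k/2$) where the abstract attaches an extra condition to optimality.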

Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università di Padova

A Simple Parallel Algorithm for Natural Joins on Binary Relations

Author: Tao Yufei
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 23rd International Conference on Database Theory (ICDT 2020)
Publication date: 01/01/2020

Dagstuhl Research Online Publication Server

Fully dynamic evaluation for conjunctive queries with free access patterns

Author: Zhang Haozhe
Publication date: 03/07/2023

We study the problem of answering conjunctive queries with free access patterns under updates. A free access pattern is a partition of the free variables of the query into input and output. The query returns tuples over the output variables given a tuple of values over the input variables. We introduce a fully dynamic evaluation approach for such queries. It is fully dynamic in the sense that it supports both inserts and deletes of tuples to the input relations. Our approach computes a data structure that supports the enumeration of the output tuples and maintains it under single-tuple updates to the input data. We also give a syntactic characterization of those queries that admit constant time per single-tuple update and whose output tuples can be enumerated with constant delay given an input tuple. Finally, for triangle and hierarchical queries with free access patterns, we chart the complexity trade-offs between the preprocessing time, update time and enumeration delay for such queries. The trade-offs are strongly or weakly Pareto optimal for triangle and a class of hierarchical queries. Their optimality is predicated on the Online Boolean Matrix-Vector Multiplication conjecture.
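The notion of a free access pattern can be made concrete with a toy case. The sketch below maintains the single-atom query Q(B | A) :- R(A, B), with input variable A and output variable B, under single-tuple inserts and deletes. It is an illustrative assumption, not the data structure from the work above; the class name `AccessPatternIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class AccessPatternIndex:
    """Toy index for Q(B | A) :- R(A, B) under single-tuple updates.

    Stores, for every input value a, the output values b together with
    their multiplicities so that deletes of duplicates work correctly.
    """

    def __init__(self):
        self._outputs = defaultdict(dict)  # a -> {b: multiplicity}

    def insert(self, a, b):
        # Expected constant time per insert (hash-table update).
        bucket = self._outputs[a]
        bucket[b] = bucket.get(b, 0) + 1

    def delete(self, a, b):
        # Expected constant time per delete; rejects tuples not present.
        bucket = self._outputs.get(a)
        if not bucket or b not in bucket:
            raise KeyError((a, b))
        bucket[b] -= 1
        if bucket[b] == 0:
            del bucket[b]
        if not bucket:
            del self._outputs[a]

    def enumerate(self, a):
        # Yields each distinct output value for the given input value.
        yield from self._outputs.get(a, {})


if __name__ == "__main__":
    idx = AccessPatternIndex()
    idx.insert(1, "x")
    idx.insert(1, "y")
    idx.delete(1, "x")
    print(list(idx.enumerate(1)))
```

Queries with joins need considerably more machinery than one hash map per input value; this only fixes the vocabulary of input variables, output variables, and single-tuple updates.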

Oxford University Research Archive

A Near-Optimal Parallel Algorithm for Joining Binary Relations

Author: Ketsman Bas
Suciu Dan
Tao Yufei
Publication date: 19/10/2021

We present a constant-round algorithm in the massively parallel computation (MPC) model for evaluating a natural join where every input relation has two attributes. Our algorithm achieves a load of $\tilde{O}(m/p^{1/\rho})$, where $m$ is the total size of the input relations, $p$ is the number of machines, $\rho$ is the join's fractional edge covering number, and $\tilde{O}(.)$ hides a polylogarithmic factor. The load matches a known lower bound up to a polylogarithmic factor. At the core of the proposed algorithm is a new theorem (which we name the "isolated cartesian product theorem") that provides fresh insight into the problem's mathematical structure. Our result implies that the subgraph enumeration problem, where the goal is to report all the occurrences of a constant-sized subgraph pattern, can be settled optimally (up to a polylogarithmic factor) in the MPC model.

Comment: Short versions of this article appeared in PODS'17 and ICDT'20. The article is under submission to a journal. The red sentences are highlighted for the journal's reviewer.
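As a worked instance of the load bound in the abstract above, consider the triangle join $R(A,B) \bowtie S(B,C) \bowtie T(A,C)$. This example is ours, chosen for illustration and not taken from the abstract. Assigning weight $1/2$ to each of the three relations covers every attribute with total weight 1, and no fractional cover is cheaper, so:

```latex
\[
\rho = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = \tfrac{3}{2},
\qquad
\tilde{O}\!\left(\frac{m}{p^{1/\rho}}\right)
  = \tilde{O}\!\left(\frac{m}{p^{2/3}}\right).
\]
% For example, with p = 64 machines, p^{2/3} = 16, so the load is about m/16
% up to the polylogarithmic factor hidden by \tilde{O}.
```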

arXiv.org e-Print Archive

Episciences.org

Directory of Open Access Journals

Comparing MapReduce and pipeline implementations for counting triangles

Author: Pasarella Sánchez Ana Edelmira
Vidal Maria-Esther
Zoltan Torres Ana Cristina
Publication venue: Open Publishing Association
Publication date: 01/01/2017

A common method to define a parallel solution for a computational problem consists in finding a way to use the Divide and Conquer paradigm so that each processor acts on its own data and the processors are scheduled in a parallel fashion. MapReduce is a programming model that follows this paradigm, and allows for the definition of efficient solutions by both decomposing a problem into steps on subsets of the input data and combining the results of each step to produce final results. Although it is used to implement a wide variety of computational problems, MapReduce performance can be negatively affected whenever the replication factor grows or the size of the input is larger than the resources available at each processor. In this paper we show an alternative approach to implement the Divide and Conquer paradigm, named dynamic pipeline. The main features of dynamic pipelines are illustrated on a parallel implementation of the well-known problem of counting triangles in a graph. This problem is especially interesting when the input graph either does not fit in memory or is dynamically generated. To evaluate the properties of pipeline, a dynamic pipeline of processes and an ad-hoc version of MapReduce are implemented in the language Go, exploiting its ability to deal with channels and spawned processes. An empirical evaluation is conducted on graphs of different topologies, sizes, and densities. Observed results suggest that dynamic pipelines allow for an efficient implementation of the problem of counting triangles in a graph, particularly in dense and large graphs, drastically reducing the execution time with respect to the MapReduce implementation. Peer Reviewed. Postprint (published version).

arXiv.org e-Print Archive

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Directory of Open Access Journals

A Near-Optimal Parallel Algorithm for Joining Binary Relations

Author: Bas Ketsman
Dan Suciu
Yufei Tao
Publication venue: Logical Methods in Computer Science e.V.
Publication date: 01/05/2022

We present a constant-round algorithm in the massively parallel computation (MPC) model for evaluating a natural join where every input relation has two attributes. Our algorithm achieves a load of $\tilde{O}(m/p^{1/\rho})$, where $m$ is the total size of the input relations, $p$ is the number of machines, $\rho$ is the join's fractional edge covering number, and $\tilde{O}(.)$ hides a polylogarithmic factor. The load matches a known lower bound up to a polylogarithmic factor. At the core of the proposed algorithm is a new theorem (which we name the "isolated cartesian product theorem") that provides fresh insight into the problem's mathematical structure. Our result implies that the subgraph enumeration problem, where the goal is to report all the occurrences of a constant-sized subgraph pattern, can be settled optimally (up to a polylogarithmic factor) in the MPC model.

Directory of Open Access Journals

Tight Distributed Listing of Cliques

Author: Censor-Hillel Keren
Chang Yi-Jun
Le Gall François
Leitersdorf Dean
Publication date: 14/11/2020

Much progress has recently been made in understanding the complexity landscape of subgraph finding problems in the CONGEST model of distributed computing. However, so far, very few tight bounds are known in this area. For triangle (i.e., 3-clique) listing, an optimal $\tilde{O}(n^{1/3})$-round distributed algorithm has been constructed by Chang et al. [SODA 2019, PODC 2019]. Recent works of Eden et al. [DISC 2019] and of Censor-Hillel et al. [PODC 2020] have shown sublinear algorithms for $K_p$-listing, for each $p \geq 4$, but still leaving a significant gap between the upper bounds and the known lower bounds of the problem. In this paper, we completely close this gap. We show that for each $p \geq 4$, there is an $\tilde{O}(n^{1 - 2/p})$-round distributed algorithm that lists all $p$-cliques $K_p$ in the communication network. Our algorithm is optimal up to a polylogarithmic factor, due to the $\tilde{\Omega}(n^{1 - 2/p})$-round lower bound of Fischer et al. [SPAA 2018], which holds even in the CONGESTED CLIQUE model. Together with the triangle-listing algorithm by Chang et al. [SODA 2019, PODC 2019], our result thus shows that the round complexity of $K_p$-listing, for all $p$, is the same in both the CONGEST and CONGESTED CLIQUE models, at $\tilde{\Theta}(n^{1 - 2/p})$ rounds. For $p=4$, our result additionally matches the $\tilde{\Omega}(n^{1/2})$ lower bound for $K_4$-detection by Czumaj and Konrad [DISC 2018], implying that the round complexities for detection and listing of $K_4$ are equivalent in the CONGEST model.

Comment: 21 pages. To appear in SODA 202
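To see how the bounds quoted in the abstract above fit together, it helps to evaluate the exponent $1 - 2/p$ for small clique sizes. This is a plain arithmetic check, not a result beyond what the abstract states:

```latex
\[
n^{1-2/p}\Big|_{p=3} = n^{1/3}, \qquad
n^{1-2/p}\Big|_{p=4} = n^{1/2}, \qquad
n^{1-2/p}\Big|_{p=5} = n^{3/5}.
\]
% p = 3 reproduces the triangle-listing bound \tilde{O}(n^{1/3}),
% p = 4 reproduces the K_4 bound n^{1/2}, and the exponent tends to 1
% as p grows, so the round complexity approaches linear in n.
```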

arXiv.org e-Print Archive