479 research outputs found
Parallel sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP programming
We evaluate optimized parallel sparse matrix-vector operations for two
representative application areas on widespread multicore-based cluster
configurations. First the single-socket baseline performance is analyzed and
modeled with respect to basic architectural properties of standard multicore
chips. Going beyond the single node, parallel sparse matrix-vector operations
often suffer from an unfavorable communication to computation ratio. Starting
from the observation that nonblocking MPI is not able to hide communication
cost using standard MPI implementations, we demonstrate that explicit overlap
of communication and computation can be achieved by using a dedicated
communication thread, which may run on a virtual core. We compare our approach
to pure MPI and the widely used "vector-like" hybrid programming strategy.Comment: 12 pages, 6 figure
Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm as well as the
recently discovered direction optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.Comment: arXiv admin note: text overlap with arXiv:1104.451
Peregrine: A Pattern-Aware Graph Mining System
Graph mining workloads aim to extract structural properties of a graph by
exploring its subgraph structures. General purpose graph mining systems provide
a generic runtime to explore subgraph structures of interest with the help of
user-defined functions that guide the overall exploration process. However, the
state-of-the-art graph mining systems remain largely oblivious to the shape (or
pattern) of the subgraphs that they mine. This causes them to: (a) explore
unnecessary subgraphs; (b) perform expensive computations on the explored
subgraphs; and, (c) hold intermediate partial subgraphs in memory; all of which
affect their overall performance. Furthermore, their programming models are
often tied to their underlying exploration strategies, which makes it difficult
for domain users to express complex mining tasks.
In this paper, we develop Peregrine, a pattern-aware graph mining system that
directly explores the subgraphs of interest while avoiding exploration of
unnecessary subgraphs, and simultaneously bypassing expensive computations
throughout the mining process. We design a pattern-based programming model that
treats "graph patterns" as first class constructs and enables Peregrine to
extract the semantics of patterns, which it uses to guide its exploration. Our
evaluation shows that Peregrine outperforms state-of-the-art distributed and
single machine graph mining systems, and scales to complex mining tasks on
larger graphs, while retaining simplicity and expressivity with its
"pattern-first" programming approach.Comment: This is the full version of the paper appearing in the European
Conference on Computer Systems (EuroSys), 202
Scalable and fault-tolerant data stream processing on multi-core architectures
With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state.
While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures.
Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them to a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.Open Acces
- …