BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures
We introduce BriskStream, an in-memory data stream processing system (DSPS)
specifically designed for modern shared-memory multicore architectures.
BriskStream's key contribution is an execution plan optimization paradigm,
namely RLAS, which takes relative-location (i.e., NUMA distance) of each pair
of producer-consumer operators into consideration. We propose a branch and
bound based approach with three heuristics to resolve the resulting nontrivial
optimization problem. The experimental evaluations demonstrate that BriskStream
yields much higher throughput and better scalability than existing DSPSs on
multi-core architectures when processing different types of workloads.
Comment: To appear in SIGMOD'1
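To make the idea of a branch-and-bound search over operator placements concrete, here is a minimal sketch. It is not the RLAS algorithm or its heuristics: the function name `place_operators`, the cost model (traffic multiplied by socket distance), and the per-socket capacity limit are all assumptions made for illustration.

```python
def place_operators(n_ops, edges, distance, capacity):
    """Toy branch-and-bound placement of operators onto NUMA sockets.

    edges    : list of (producer, consumer, traffic) tuples (hypothetical model)
    distance : distance[a][b] is the cost per unit of traffic between sockets
    capacity : maximum number of operators per socket (assumed constraint)
    """
    n_sockets = len(distance)
    best = {"cost": float("inf"), "plan": None}

    def partial_cost(plan):
        # Only edges with both endpoints already placed contribute.
        placed = len(plan)
        return sum(
            traffic * distance[plan[p]][plan[c]]
            for p, c, traffic in edges
            if p < placed and c < placed
        )

    def search(plan, load):
        cost = partial_cost(plan)
        # Bound: costs are non-negative, so a partial plan can only get worse.
        if cost >= best["cost"]:
            return
        if len(plan) == n_ops:
            best["cost"], best["plan"] = cost, tuple(plan)
            return
        # Branch: try every socket that still has room for the next operator.
        for socket in range(n_sockets):
            if load[socket] < capacity:
                load[socket] += 1
                plan.append(socket)
                search(plan, load)
                plan.pop()
                load[socket] -= 1

    search([], [0] * n_sockets)
    return best["plan"], best["cost"]


# A four-operator pipeline on two sockets, two operators per socket.
pipeline = [(0, 1, 10), (1, 2, 1), (2, 3, 10)]
plan, cost = place_operators(4, pipeline, [[0, 1], [1, 0]], capacity=2)
print(plan, cost)
```

In this toy instance the search keeps the two heavy producer-consumer pairs on the same socket and lets only the light edge cross sockets.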
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can
generate many parallel memory requests at a time. The processing of these
parallel requests in the DRAM controller greatly affects the memory
interference delay experienced by running tasks on the platform. In this paper,
we model a modern COTS multicore system which has a nonblocking last-level
cache (LLC) and a DRAM controller that prioritizes reads over writes. To
minimize interference, we focus on LLC and DRAM bank partitioned systems. Based
on the model, we propose an analysis that computes a safe upper bound for the
worst-case memory interference delay. We validated our analysis on a real COTS
multicore platform with a set of carefully designed synthetic benchmarks as
well as SPEC2006 benchmarks. Evaluation results show that our analysis more
accurately captures the worst-case memory interference delay and provides safer
upper bounds than a recently proposed analysis, which significantly
underestimates the delay.
Comment: Technical Repor
High Performance Direct Gravitational N-body Simulations on Graphics Processing Units
We present the results of gravitational direct N-body simulations using the
commercial graphics processing units (GPU) NVIDIA Quadro FX1400 and GeForce
8800GTX, and compare the results with GRAPE-6Af special purpose hardware. The
force evaluation of the N-body problem was implemented in Cg using the GPU
directly to speed up the calculations. The integration of the equations of
motion, running on the host computer, was implemented in C using the 4th
order predictor-corrector Hermite integrator with block time steps. We find
that for a large number of particles (N ≳ 10^4) modern graphics
processing units offer an attractive low cost alternative to GRAPE special
purpose hardware. A modern GPU continues to give a relatively flat scaling with
the number of particles, comparable to that of the GRAPE. Using the same time
step criterion the total energy of the N-body system was conserved better
than to one in on the GPU, which is only about an order of magnitude
worse than obtained with GRAPE. For N ≳ 10^6 the GeForce 8800GTX was about
20 times faster than the host computer. Though still about an order of
magnitude slower than GRAPE, modern GPUs outperform GRAPE in their low cost,
long mean time between failures and the much larger onboard memory; the
GRAPE-6Af holds at most 256k particles whereas the GeForce 8800GTX can hold 9
million particles in memory.
Comment: Submitted to New Astronom
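The force evaluation in a direct N-body code is a sum over all particle pairs. The sketch below shows that O(N^2) sum in plain Python. It is not the authors' Cg kernel and omits the Hermite integrator entirely; the function name `accelerations`, the gravitational constant `G=1.0`, and the softening length `eps` are assumptions for illustration.

```python
import math


def accelerations(pos, mass, G=1.0, eps=0.0):
    """Direct-summation gravitational accelerations.

    pos  : list of [x, y, z] positions
    mass : list of particle masses
    eps  : softening length (assumed parameter, 0.0 means unsoftened)
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                # a_i += G * m_j * (r_j - r_i) / |r_j - r_i|^3
                acc[i][k] += G * mass[j] * dx[k] * inv_r3
    return acc


# Two bodies one unit apart with masses 1 and 2.
acc = accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 2.0])
print(acc)
```

Each particle's inner loop is independent of the others, which is the property that makes this sum a natural fit for GPU hardware.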
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The task graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler, PaRSEC, and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.
Comment: Heterogeneity in Computing Workshop (2014
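Handing a task graph to a generic runtime means the runtime decides the traversal order, subject to the dependencies. The sketch below shows the simplest such traversal, a ready-queue (Kahn-style) topological schedule. It does not use the PaRSEC or StarPU APIs; the function name `schedule` and the task names are hypothetical.

```python
from collections import deque


def schedule(tasks, deps):
    """Return an execution order that respects task dependencies.

    tasks : list of task names
    deps  : dict mapping a task to the list of tasks it depends on
    """
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t in tasks:
        for d in deps.get(t, ()):
            dependents[d].append(t)

    # Tasks with no unmet dependencies are ready to run.
    ready = deque(t for t in tasks if remaining[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Completing a task may release the tasks that depend on it.
        for successor in dependents[task]:
            remaining[successor] -= 1
            if remaining[successor] == 0:
                ready.append(successor)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical factorization-style graph: one panel feeds two solves,
# which both feed a trailing update.
tasks = ["factor_A", "solve_B", "solve_C", "update_D"]
deps = {
    "solve_B": ["factor_A"],
    "solve_C": ["factor_A"],
    "update_D": ["solve_B", "solve_C"],
}
print(schedule(tasks, deps))
```

A real runtime would dispatch every ready task to an available CPU core or accelerator concurrently rather than emitting a single serial order.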