30,969 research outputs found
On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution
Technology trends are making the cost of data movement increasingly dominant,
both in terms of energy and time, over the cost of performing arithmetic
operations in computer systems. The fundamental ratio of aggregate data
movement bandwidth to the total computational power (also referred to the
machine balance parameter) in parallel computer systems is decreasing. It is
there- fore of considerable importance to characterize the inherent data
movement requirements of parallel algorithms, so that the minimal architectural
balance parameters required to support it on future systems can be well
understood. In this paper, we develop an extension of the well-known red-blue
pebble game to develop lower bounds on the data movement complexity for the
parallel execution of computational directed acyclic graphs (CDAGs) on parallel
systems. We model multi-node multi-core parallel systems, with the total
physical memory distributed across the nodes (that are connected through some
interconnection network) and in a multi-level shared cache hierarchy for
processors within a node. We also develop new techniques for lower bound
characterization of non-homogeneous CDAGs. We demonstrate the use of the
methodology by analyzing the CDAGs of several numerical algorithms, to develop
lower bounds on data movement for their parallel execution
High Performance Algorithms for Counting Collisions and Pairwise Interactions
The problem of counting collisions or interactions is common in areas as
computer graphics and scientific simulations. Since it is a major bottleneck in
applications of these areas, a lot of research has been carried out on such
subject, mainly focused on techniques that allow calculations to be performed
within pruned sets of objects. This paper focuses on how interaction
calculation (such as collisions) within these sets can be done more efficiently
than existing approaches. Two algorithms are proposed: a sequential algorithm
that has linear complexity at the cost of high memory usage; and a parallel
algorithm, mathematically proved to be correct, that manages to use GPU
resources more efficiently than existing approaches. The proposed and existing
algorithms were implemented, and experiments show a speedup of 21.7 for the
sequential algorithm (on small problem size), and 1.12 for the parallel
proposal (large problem size). By improving interaction calculation, this work
contributes to research areas that promote interconnection in the modern world,
such as computer graphics and robotics.Comment: Accepted in ICCS 2019 and published in Springer's LNCS series.
Supplementary content at https://mjsaldanha.com/articles/1-hpc-ssp
On Characterizing the Data Access Complexity of Programs
Technology trends will cause data movement to account for the majority of
energy expenditure and execution time on emerging computers. Therefore,
computational complexity will no longer be a sufficient metric for comparing
algorithms, and a fundamental characterization of data access complexity will
be increasingly important. The problem of developing lower bounds for data
access complexity has been modeled using the formalism of Hong & Kung's
red/blue pebble game for computational directed acyclic graphs (CDAGs).
However, previously developed approaches to lower bounds analysis for the
red/blue pebble game are very limited in effectiveness when applied to CDAGs of
real programs, with computations comprised of multiple sub-computations with
differing DAG structure. We address this problem by developing an approach for
effectively composing lower bounds based on graph decomposition. We also
develop a static analysis algorithm to derive the asymptotic data-access lower
bounds of programs, as a function of the problem size and cache size
I/O-optimal algorithms on grid graphs
Given a graph of which the n vertices form a regular two-dimensional grid,
and in which each (possibly weighted and/or directed) edge connects a vertex to
one of its eight neighbours, the following can be done in O(scan(n)) I/Os,
provided M = Omega(B^2): computation of shortest paths with non-negative edge
weights from a single source, breadth-first traversal, computation of a minimum
spanning tree, topological sorting, time-forward processing (if the input is a
plane graph), and an Euler tour (if the input graph is a tree). The
minimum-spanning tree algorithm is cache-oblivious. The best previously
published algorithms for these problems need Theta(sort(n)) I/Os. Estimates of
the actual I/O volume show that the new algorithms may often be very efficient
in practice.Comment: 12 pages' extended abstract plus 12 pages' appendix with details,
proofs and calculations. Has not been published in and is currently not under
review of any conference or journa
Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models
The upcoming many-core architectures require software developers to exploit
concurrency to utilize available computational power. Today's high-level
language virtual machines (VMs), which are a cornerstone of software
development, do not provide sufficient abstraction for concurrency concepts. We
analyze concrete and abstract concurrency models and identify the challenges
they impose for VMs. To provide sufficient concurrency support in VMs, we
propose to integrate concurrency operations into VM instruction sets.
Since there will always be VMs optimized for special purposes, our goal is to
develop a methodology to design instruction sets with concurrency support.
Therefore, we also propose a list of trade-offs that have to be investigated to
advise the design of such instruction sets.
As a first experiment, we implemented one instruction set extension for
shared memory and one for non-shared memory concurrency. From our experimental
results, we derived a list of requirements for a full-grown experimental
environment for further research
A Concurrent Language with a Uniform Treatment of Regions and Locks
A challenge for programming language research is to design and implement
multi-threaded low-level languages providing static guarantees for memory
safety and freedom from data races. Towards this goal, we present a concurrent
language employing safe region-based memory management and hierarchical locking
of regions. Both regions and locks are treated uniformly, and the language
supports ownership transfer, early deallocation of regions and early release of
locks in a safe manner
- …