646 research outputs found
Task-based Augmented Contour Trees with Fibonacci Heaps
This paper presents a new algorithm for the fast, shared memory, multi-core
computation of augmented contour trees on triangulations. In contrast to most
existing parallel algorithms our technique computes augmented trees, enabling
the full extent of contour tree based applications including data segmentation.
Our approach completely revisits the traditional, sequential contour tree
algorithm to re-formulate all the steps of the computation as a set of
independent local tasks. This includes a new computation procedure based on
Fibonacci heaps for the join and split trees, two intermediate data structures
used to compute the contour tree, whose constructions are efficiently carried
out concurrently thanks to the dynamic scheduling of task parallelism. We also
introduce a new parallel algorithm for the combination of these two trees into
the output global contour tree. Overall, this results in superior time
performance in practice, both in sequential and in parallel thanks to the
OpenMP task runtime. We report performance numbers that compare our approach to
reference sequential and multi-threaded implementations for the computation of
augmented merge and contour trees. These experiments demonstrate the run-time
efficiency of our approach and its scalability on common workstations. We
demonstrate the utility of our approach in data segmentation applications
Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high bytes per lattice update case of variable coefficients. Our thread
groups concept provides a controllable trade-off between concurrency and memory
usage, shifting the pressure between the memory interface and the CPU. We
present performance results on a contemporary Intel processor
Submicron Systems Architecture Project: Semiannual Technical Report
No abstract available
Adaptive structured parallelism
Algorithmic skeletons abstract commonly-used patterns of parallel computation, communication, and interaction. Parallel programs are expressed by interweaving parameterised skeletons analogously to the way in which structured sequential programs are developed, using well-defined constructs. Skeletons provide top-down design composition and control inheritance throughout the program structure. Based on the algorithmic skeleton concept, structured parallelism provides a high-level parallel programming technique which
allows the conceptual description of parallel programs whilst fostering platform independence and algorithm abstraction. By decoupling the algorithm
specification from machine-dependent structural considerations, structured parallelism allows programmers to code programs regardless of how the computation and communications will be executed in the system platform.Meanwhile, large non-dedicated multiprocessing systems have long posed
a challenge to known distributed systems programming techniques as a result
of the inherent heterogeneity and dynamism of their resources. Scant research
has been devoted to the use of structural information provided by skeletons
in adaptively improving program performance, based on resource utilisation.
This thesis presents a methodology to improve skeletal parallel programming
in heterogeneous distributed systems by introducing adaptivity through resource awareness. As we hypothesise that a skeletal program should be able
to adapt to the dynamic resource conditions over time using its structural forecasting information, we have developed ASPara: Adaptive Structured Parallelism. ASPara is a generic methodology to incorporate structural information at compilation into a parallel program, which will help it to adapt at
execution
Tiled Algorithms for Matrix Computations on Multicore Architectures
The current computer architecture has moved towards the multi/many-core
structure. However, the algorithms in the current sequential dense numerical
linear algebra libraries (e.g. LAPACK) do not parallelize well on
multi/many-core architectures. A new family of algorithms, the tile algorithms,
has recently been introduced to circumvent this problem. Previous research has
shown that it is possible to write efficient and scalable tile algorithms for
performing a Cholesky factorization, a (pseudo) LU factorization, and a QR
factorization. The goal of this thesis is to study tiled algorithms in a
multi/many-core setting and to provide new algorithms which exploit the current
architecture to improve performance relative to current state-of-the-art
libraries while maintaining the stability and robustness of these libraries.Comment: PhD Thesis, 2012 http://math.ucdenver.ed
Virtualizing Data Parallel Systems for Portability, Productivity, and Performance.
Computer systems equipped with graphics processing units (GPUs) have become increasingly common over the last decade. In order to utilize the highly data parallel architecture of GPUs for general purpose applications, new programming models such as OpenCL and CUDA were introduced, showing that data parallel kernels on GPUs can achieve speedups by several orders of magnitude. With this success, applications from a variety of domains have been converted to use several complicated OpenCL/CUDA data parallel kernels to benefit from data parallel systems. Simultaneously, the software industry has experienced a massive growth in the amount of data to process, demanding more powerful workhorses for data parallel computation. Consequently, additional parallel computing devices such as extra GPUs and co-processors are attached to the system, expecting more performance and capability to process larger data.
However, these programming models expose hardware details to programmers, such as the number of computing devices, interconnects, and physical memory size of each device. This degrades productivity in the software development process as programmers must manually split the workload with regard to hardware characteristics. This process is tedious and prone to errors, and most importantly, it is hard to maximize the performance at compile time as programmers do not know the runtime behaviors that can affect the performance such as input size and device availability. Therefore, applications lack portability as they may fail to run due to limited physical memory or experience suboptimal performance across different systems.
To cope with these challenges, this thesis proposes a dynamic compiler framework that provides the OpenCL applications with an abstraction layer for physical devices. This abstraction layer virtualizes physical devices and memory sub-systems, and transparently orchestrates the execution of multiple data parallel kernels on multiple computing devices. The framework significantly improves productivity as it provides hardware portability, allowing programmers to write an OpenCL program without being concerned of the target devices. Our framework also maximizes performance by balancing the data parallel workload considering factors like kernel dependencies, device performance variation on workloads of different sizes, the data transfer cost over the interconnect between devices, and physical memory limits on each device.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113361/1/jhaeng_1.pd
Managing Overheads in Asynchronous Many-Task Runtime Systems
Asynchronous Many-Task (AMT) runtime systems are based on the idea of dividing an algorithm into small units of work, known as tasks. The runtime system is then responsible for scheduling and executing these tasks in an efficient manner by taking into account the resources provided to it and the associated data dependencies between the tasks. One of the primary challenges faced by AMTs is managing such fine-grained parallelism and the overheads associated with creating, scheduling and executing tasks. This work develops methodologies for assessing and managing overheads associated with fine-grained task execution in HPX, our exemplar Asynchronous Many-Task runtime system. Known optimization techniques, viz. active message coalescing, task inlining and parallel loop iteration chunking are applied to HPX. Active message coalescing, where messages bound to the same destination are aggregated into a single message, is presented as a solution to minimize overheads associated with fine-grained communications. Methodologies and metrics for analyzing fine-grained communication overheads are developed. The metrics identified and implemented in this research aid in evaluating network efficiency by giving us an intrinsic view of the underlying network overhead that would be difficult to measure using conventional methods. Task inlining, a method that allows runtime systems to manage the overheads introduced by a large number of tasks by merging tasks together into one thread of execution, is presented as a technique for minimizing fine-grained task overheads. A runtime policy that dynamically decides whether to inline a task is developed and evaluated on different processor architectures. A methodology to derive a largely machine independent constant that allows controlling task granularity is developed. Finally, the machine independent constant derived in the context of task inlining is applied to chunking of parallel loop iterations, which confirms its applicability to reduce overheads, in the context of finding the optimal chunk size of the combined loop iterations
HPCCP/CAS Workshop Proceedings 1998
This publication is a collection of extended abstracts of presentations given at the HPCCP/CAS (High Performance Computing and Communications Program/Computational Aerosciences Project) Workshop held on August 24-26, 1998, at NASA Ames Research Center, Moffett Field, California. The objective of the Workshop was to bring together the aerospace high performance computing community, consisting of airframe and propulsion companies, independent software vendors, university researchers, and government scientists and engineers. The Workshop was sponsored by the HPCCP Office at NASA Ames Research Center. The Workshop consisted of over 40 presentations, including an overview of NASA's High Performance Computing and Communications Program and the Computational Aerosciences Project; ten sessions of papers representative of the high performance computing research conducted within the Program by the aerospace industry, academia, NASA, and other government laboratories; two panel sessions; and a special presentation by Mr. James Bailey
- …