111 research outputs found
UPC++: A high-performance communication framework for asynchronous computation
UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC). We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x. UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications
UniTi: Unified composition and time for multi-domain model-based design
To apply model-based design to embedded systems that interface with the physical world, including simulation and verification, current tools fall short. They must provide mathematical (model) definitions that stay close to the specification of the system. They must allow multiple domains, such as the continuous-time, discrete-time and dataflow domain, in a single model including well-defined interaction. They must support model transformations for refining a model during development. And most importantly, they must accurately include and simulate different notions of time in the model. UniTi is a model-based design flow and modelling and simulation environment that delivers on all these aspects. It is based on components that are signal transformations, and therefore mathematical functions. However, in each domain the representation of a signal differs. As components have the same structure in each domain, we can use unified composition operators to represent multiple domains in a single model. Furthermore, this composition provides a unified perspective on time in the domains, even though we differentiate between different notions of time. Time becomes a local property of the model, allowing us to represent and simulate time transformations such as time delays exactly without losing efficiency. Finally, model transformations are defined for such components, which are used for refining and developing the model and which are guided by the design steps in the design flow. We will formally define the domains, composition operators and transformations of UniTi and verify the approach with a case study on a phased array beamforming system
Scaling of Distributed Multi-Simulations on Multi-Core Clusters
International audienceDACCOSIM is a multi-simulation environment for continuous time systems, relying on FMI standard, making easy the design of a multi-simulation graph, and specially developed for multi-core PC clusters, in order to achieve speedup and size up. However, the distribution of the simulation graph remains complex and is still the responsibility of the simulation developer. This paper introduces DACCOSIM parallel and distributed architecture, and our strategies to achieve efficient multi-simulation graph distribution on multi-core clusters. Some performance experiments on two clusters, running up to 81 simulation components (FMU) and using up to 16 multi-core computing nodes, are shown. Performances measured on our faster cluster exhibit a good scalability, but some limitations of current DACCOSIM implementation are discussed
Recommended from our members
A Parallel Direct Method for Finite Element Electromagnetic Computations Based on Domain Decomposition
High performance parallel computing and direct (factorization-based) solution methods have been the two main trends in electromagnetic computations in recent years. When time-harmonic (frequency-domain) Maxwell\u27s equation are directly discretized with the Finite Element Method (FEM) or other Partial Differential Equation (PDE) methods, the resulting linear system of equations is sparse and indefinite, thus harder to efficiently factorize serially or in parallel than alternative methods e.g. integral equation solutions, that result in dense linear systems. State-of-the-art sparse matrix direct solvers such as MUMPS and PARDISO don\u27t scale favorably, have low parallel efficiency and high memory footprint. This work introduces a new class of sparse direct solvers based on domain decomposition method, termed Direct Domain Decomposition Method (D3M), which is reliable, memory efficient, and offers very good parallel scalability for arbitrary 3D FEM problems.
Unlike recent trends in approximate/low-rank solvers, this method focuses on `numerically exact\u27 solution methods as they are more reliable for complex `real-life\u27 models. The proposed method leverages physical insights at every stage of the development through a new symmetric domain decomposition method (DDM) with one set of Lagrange multipliers. Applying a special regularization scheme at the interfaces, either artificial loss or gain is introduced to each domain to eliminate non-physical internal resonances. A block-wise recursive algorithm based on Takahashi relationship is proposed for the efficient computation of discrete Dirichlet-to-Neumann (DtN) map to reduce the volumetric problem from all domains into an auxiliary surfacial problem defined on the domain interfaces only. Numerical results show up to 50% run-time saving in DtN map computation using the proposed block-wise recursive algorithm compared to alternative approaches. The auxiliary unknowns on the domain interfaces form a considerably (approximately an order of magnitude) smaller block-wise sparse matrix, which is efficiently factorized using a customized block LDL factorization with restricted pivoting to ensure stability.
The parallelization of the proposed D3M is realized based on Directed Acyclic Graph (DAG). Recent advances in parallel dense direct solvers, have shifted toward parallel implementation that rely on DAG scheduling to achieve highly efficient asynchronous parallel execution. However, adaptation of such schemes to sparse matrices is harder and often impractical. In D3M, computation of each domain\u27s discrete DtN map ``embarrassingly parallel\u27\u27, whereas the customized block LDLT is suitable for a block directed acyclic graph (B-DAG) task scheduling, similar to that used in dense matrix parallel direct solvers. In this approach, computations are represented as a sequence of small tasks that operate on domains of DDM or dense matrix blocks of the reduced matrix. These tasks can be statically scheduled for parallel execution using their DAG dependencies and weights that depend on estimates of computation and communication costs.
Comparisons with state-of-the-art exact direct solvers on electrically large problems suggest up to 20% better parallel efficiency, 30% - 3X less memory and slightly faster in runtime, while maintaining the same accuracy
Parameterized and multi-level tiled loop generation
Department Head: L. Darrell Whitley.2010 Summer.Includes bibliographical references.Tiling is a loop transformation that decomposes computations into a set of smaller computation blocks. The transformation has been proven to be useful for many high-level program optimizations, such as data locality optimization and exploiting coarse-grained parallelism, and crucial for architecture with limited resources, such as embedded systems, GPUs, and the Cell architecture. Data locality and parallelism will continue to serve as major vehicles for achieving high performance on modern architecture in multi-core era. In parameterized tiling the size of blocks is not fixed at compile time but remains a symbolic constant so that it can be selected/changed even at runtime. Parameterized tiled loops facilitate iterative and runtime optimizations, such as iterative compilation, auto-tuning and dynamic program adaption. In this dissertation we present a collection of techniques for generating parameterized and multi-level tiled loops from affine control loops and their parallelization. The tiled loop generation problem even for perfectly nested loops has been believed to have an exponential time complexity due to the heavy machinery like Fourier-Motzkin elimination. Disproving this decade-long belief, we provide a simple technique for generating tiled loop nests even from imperfectly nested loops. Our technique for perfectly nested loops consists of only syntactic processing that is applied only once and independently to each loop bound. Our approach to imperfectly nested loops is composed of a direct extension of the tiled code generation technique for perfectly nested loops and three simple optimizations on the resulting parameterized tiled loops. The generation as well as the optimizations are achieved only with purely syntactic processing, hence loop generation time remains negligible. We also present three schemes for multi-level tiling where tiling is applied more than once. All the schemes are scalable with respect to the number of tiling levels and can be combined to achieve better performance. To facilitate parallelization of parameterized tiled loops, we generate outermost tile-loops that are perfectly nested. We also provide a technique for statically restructuring parameterized tiled loops to the wavefront scheduling on shared memory system. Because the formulation of parameterized tiling does not fit into the well established polyhedral framework, such static restructuring has been a great challenge. However, we achieve this limited restructuring through a syntactic processing without any sophisticated machinery
LBM and SPH Scalability Using Task-based Programming
Computational Fluid Dynamics encompasses a great variety of numerical approaches that approximate solutions to the
Navier-Stokes equations, which generally describe the movements of viscous
uid substances. While the objectives of these
approaches are to capture related physical phenomena, the details of di erent methods lend them to particular classes of
problems, and scalable solutions are important to a large range of scienti c and engineering applications. In this paper,
we investigate the practical scalability of two proxy applications that are made to recreate the essential performance
characteristics of Lattice-Boltzmann Methods (LBM) and Smoothed Particle Hydrodyamics (SPH), using the former to
simulate the formation of vortices resulting from sustained, laminar ow, and the latter to simulate violent free surface ows without a mesh. The di ering scalability properties of these methods suggest di erent designs and programming
methods in order to exploit extreme scale computing platforms. In particular, we investigate implementations that enable
the use of task-based programming constructs, which have received attention in recent years as a means of enabling
improved parallel scalability by relaxing the synchronization requirements of classical, bulk-synchronous execution that
both LBM and SPH simulations exemplify. We nd that suitable adaptations of the central data structures suggest that
scalable LBM performance can be improved by tasking constructs in situations that are determined by an appropriate
match between the input problem and the platform's performance characteristics. This suggests an adaptive scheme to
identify and select the highest performing implementation at program initialization. The SPH implementation admits a
substantial performance gain by partitioning the physical domain into a greater number of independent tasks than the
number of participating processors, but its performance remains dependent on a powerful node architecture to support
conventional SMP workloads, suggesting that further algorithmic improvements beyond the bene ts of task programming
are required to make it a strong candidate for exascale computing
Doctor of Philosophy
dissertationInteractive editing and manipulation of digital media is a fundamental component in digital content creation. One media in particular, digital imagery, has seen a recent increase in popularity of its large or even massive image formats. Unfortunately, current systems and techniques are rarely concerned with scalability or usability with these large images. Moreover, processing massive (or even large) imagery is assumed to be an off-line, automatic process, although many problems associated with these datasets require human intervention for high quality results. This dissertation details how to design interactive image techniques that scale. In particular, massive imagery is typically constructed as a seamless mosaic of many smaller images. The focus of this work is the creation of new technologies to enable user interaction in the formation of these large mosaics. While an interactive system for all stages of the mosaic creation pipeline is a long-term research goal, this dissertation concentrates on the last phase of the mosaic creation pipeline - the composition of registered images into a seamless composite. The work detailed in this dissertation provides the technologies to fully realize interactive editing in mosaic composition on image collections ranging from the very small to massive in scale
- …