111 research outputs found

    An approach to locality-conscious load balancing and transparent memory hierarchy management with a global-address-space parallel programming model

    UPC++: A high-performance communication framework for asynchronous computation

    UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC). We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x. UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications

    UniTi: Unified composition and time for multi-domain model-based design

    To apply model-based design to embedded systems that interface with the physical world, including simulation and verification, current tools fall short. They must provide mathematical (model) definitions that stay close to the specification of the system. They must allow multiple domains, such as the continuous-time, discrete-time and dataflow domain, in a single model including well-defined interaction. They must support model transformations for refining a model during development. And most importantly, they must accurately include and simulate different notions of time in the model. UniTi is a model-based design flow and modelling and simulation environment that delivers on all these aspects. It is based on components that are signal transformations, and therefore mathematical functions. However, in each domain the representation of a signal differs. As components have the same structure in each domain, we can use unified composition operators to represent multiple domains in a single model. Furthermore, this composition provides a unified perspective on time in the domains, even though we differentiate between different notions of time. Time becomes a local property of the model, allowing us to represent and simulate time transformations such as time delays exactly without losing efficiency. Finally, model transformations are defined for such components, which are used for refining and developing the model and which are guided by the design steps in the design flow. We will formally define the domains, composition operators and transformations of UniTi and verify the approach with a case study on a phased array beamforming system

    Scaling of Distributed Multi-Simulations on Multi-Core Clusters

    DACCOSIM is a multi-simulation environment for continuous-time systems that relies on the FMI standard, eases the design of a multi-simulation graph, and was developed specifically for multi-core PC clusters in order to achieve speedup and size-up. However, distributing the simulation graph remains complex and is still the responsibility of the simulation developer. This paper introduces the DACCOSIM parallel and distributed architecture and our strategies for achieving efficient multi-simulation graph distribution on multi-core clusters. We present performance experiments on two clusters, running up to 81 simulation components (FMUs) and using up to 16 multi-core computing nodes. Performance measured on our faster cluster exhibits good scalability, but some limitations of the current DACCOSIM implementation are discussed.

    Parameterized and multi-level tiled loop generation

    Department Head: L. Darrell Whitley. 2010 Summer. Includes bibliographical references.
    Tiling is a loop transformation that decomposes computations into a set of smaller computation blocks. The transformation has proven useful for many high-level program optimizations, such as data locality optimization and exploiting coarse-grained parallelism, and crucial for architectures with limited resources, such as embedded systems, GPUs, and the Cell architecture. Data locality and parallelism will continue to serve as major vehicles for achieving high performance on modern architectures in the multi-core era. In parameterized tiling the size of blocks is not fixed at compile time but remains a symbolic constant so that it can be selected or changed even at runtime. Parameterized tiled loops facilitate iterative and runtime optimizations, such as iterative compilation, auto-tuning and dynamic program adaptation. In this dissertation we present a collection of techniques for generating parameterized and multi-level tiled loops from affine control loops and for parallelizing them. The tiled loop generation problem, even for perfectly nested loops, has been believed to have exponential time complexity due to heavy machinery such as Fourier-Motzkin elimination. Disproving this decade-long belief, we provide a simple technique for generating tiled loop nests even from imperfectly nested loops. Our technique for perfectly nested loops consists of only syntactic processing that is applied only once and independently to each loop bound. Our approach to imperfectly nested loops is composed of a direct extension of the tiled code generation technique for perfectly nested loops and three simple optimizations on the resulting parameterized tiled loops. The generation as well as the optimizations are achieved with purely syntactic processing, hence loop generation time remains negligible.
We also present three schemes for multi-level tiling, where tiling is applied more than once. All the schemes are scalable with respect to the number of tiling levels and can be combined to achieve better performance. To facilitate parallelization of parameterized tiled loops, we generate outermost tile-loops that are perfectly nested. We also provide a technique for statically restructuring parameterized tiled loops into a wavefront schedule on shared-memory systems. Because the formulation of parameterized tiling does not fit into the well-established polyhedral framework, such static restructuring has been a great challenge. However, we achieve this limited restructuring through syntactic processing without any sophisticated machinery.
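
To make the idea of parameterized tiling concrete, here is a minimal sketch, assuming a plain rectangular two-dimensional iteration space rather than the affine loop nests the dissertation handles. The function name `tiled_iteration_order` and its parameters are hypothetical and do not come from the dissertation. The point it illustrates is that the tile sizes are ordinary runtime values, so they can be changed without regenerating the loop nest.

```python
def tiled_iteration_order(n, m, tile_i, tile_j):
    """Visit every (i, j) of an n x m rectangle tile by tile.

    tile_i and tile_j are runtime parameters, not compile-time constants.
    Illustrative only: a rectangular space, not a general affine loop nest.
    """
    order = []
    # Tile-loops: enumerate the origin of each tile.
    for ti in range(0, n, tile_i):
        for tj in range(0, m, tile_j):
            # Point-loops: enumerate the points inside the tile,
            # clamped so partial tiles at the boundary stay in range.
            for i in range(ti, min(ti + tile_i, n)):
                for j in range(tj, min(tj + tile_j, m)):
                    order.append((i, j))
    return order
```

Calling the function with different tile sizes, for example `tiled_iteration_order(5, 7, 2, 3)` and `tiled_iteration_order(5, 7, 4, 4)`, visits the same set of points in a different order, which is the property that auto-tuning of tile sizes relies on.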

    LBM and SPH Scalability Using Task-based Programming

    Computational Fluid Dynamics encompasses a great variety of numerical approaches that approximate solutions to the Navier-Stokes equations, which generally describe the movements of viscous fluid substances. While the objectives of these approaches are to capture related physical phenomena, the details of different methods lend them to particular classes of problems, and scalable solutions are important to a large range of scientific and engineering applications. In this paper, we investigate the practical scalability of two proxy applications that are made to recreate the essential performance characteristics of Lattice-Boltzmann Methods (LBM) and Smoothed Particle Hydrodynamics (SPH), using the former to simulate the formation of vortices resulting from sustained, laminar flow, and the latter to simulate violent free surface flows without a mesh. The differing scalability properties of these methods suggest different designs and programming methods in order to exploit extreme scale computing platforms. In particular, we investigate implementations that enable the use of task-based programming constructs, which have received attention in recent years as a means of enabling improved parallel scalability by relaxing the synchronization requirements of classical, bulk-synchronous execution that both LBM and SPH simulations exemplify. We find that suitable adaptations of the central data structures suggest that scalable LBM performance can be improved by tasking constructs in situations that are determined by an appropriate match between the input problem and the platform's performance characteristics. This suggests an adaptive scheme to identify and select the highest performing implementation at program initialization.
The SPH implementation admits a substantial performance gain by partitioning the physical domain into a greater number of independent tasks than the number of participating processors, but its performance remains dependent on a powerful node architecture to support conventional SMP workloads, suggesting that further algorithmic improvements beyond the benefits of task programming are required to make it a strong candidate for exascale computing.
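
The over-decomposition described for SPH, creating more independent tasks than there are processors, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the domain is a one-dimensional range of cells, the per-task work is a placeholder sum, and the names `split_domain`, `process_subdomain`, and `run_overdecomposed` are hypothetical. It uses Python's standard `concurrent.futures` thread pool in place of the tasking runtime the paper evaluates.

```python
from concurrent.futures import ThreadPoolExecutor


def split_domain(num_cells, num_tasks):
    """Split cells 0..num_cells-1 into num_tasks contiguous, near-equal ranges."""
    base, extra = divmod(num_cells, num_tasks)
    ranges, start = [], 0
    for t in range(num_tasks):
        size = base + (1 if t < extra else 0)  # first `extra` ranges get one more cell
        ranges.append((start, start + size))
        start += size
    return ranges


def process_subdomain(bounds):
    """Placeholder for per-subdomain work (assumption: a simple sum of cell indices)."""
    lo, hi = bounds
    return sum(range(lo, hi))


def run_overdecomposed(num_cells, num_workers, tasks_per_worker):
    """Create num_workers * tasks_per_worker tasks and run them on num_workers threads."""
    subdomains = split_domain(num_cells, num_workers * tasks_per_worker)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(process_subdomain, subdomains))
```

With `tasks_per_worker` greater than one, a worker that finishes a cheap subdomain can pick up another instead of idling, which is the load-balancing effect that motivates using more tasks than processors.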

    Doctor of Philosophy

    Dissertation. Interactive editing and manipulation of digital media is a fundamental component of digital content creation. One medium in particular, digital imagery, has recently seen large or even massive image formats grow in popularity. Unfortunately, current systems and techniques are rarely concerned with scalability or usability for these large images. Moreover, processing massive (or even large) imagery is assumed to be an off-line, automatic process, although many problems associated with these datasets require human intervention for high-quality results. This dissertation details how to design interactive image techniques that scale. In particular, massive imagery is typically constructed as a seamless mosaic of many smaller images. The focus of this work is the creation of new technologies to enable user interaction in the formation of these large mosaics. While an interactive system for all stages of the mosaic creation pipeline is a long-term research goal, this dissertation concentrates on the last phase of the mosaic creation pipeline - the composition of registered images into a seamless composite. The work detailed in this dissertation provides the technologies to fully realize interactive editing in mosaic composition on image collections ranging from the very small to the massive in scale.