2,667 research outputs found

    Scalable data abstractions for distributed parallel computations

    Get PDF
    The ability to express a program as a hierarchical composition of parts is an essential tool in managing the complexity of software and a key abstraction this provides is to separate the representation of data from the computation. Many current parallel programming models use a shared memory model to provide data abstraction but this doesn't scale well with large numbers of cores due to non-determinism and access latency. This paper proposes a simple programming model that allows scalable parallel programs to be expressed with distributed representations of data and it provides the programmer with the flexibility to employ shared or distributed styles of data-parallelism where applicable. It is capable of an efficient implementation, and with the provision of a small set of primitive capabilities in the hardware, it can be compiled to operate directly on the hardware, in the same way stack-based allocation operates for subroutines in sequential machines

    High-performance and hardware-aware computing: proceedings of the second International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC\u2711), San Antonio, Texas, USA, February 2011 ; (in conjunction with HPCA-17)

    Get PDF
    High-performance system architectures are increasingly exploiting heterogeneity. The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach

    Improving the scalability of parallel N-body applications with an event driven constraint based execution model

    Full text link
    The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find improved load balancing during runtime and automatic parallelism discovery improving efficiency using the advanced semantics for Exascale computing.Comment: 11 figure

    A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence

    Full text link
    A hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP for shared memory parallelism is presented. The work is motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence on emerging petascale, high core-count, massively parallel processing systems. The hybrid implementation derives from and augments a well-tested scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new picture for the domain decomposition of the pseudospectral grids, which is helpful in understanding, among other things, the 3D transpose of the global data that is necessary for the parallel fast Fourier transforms that are the central component of the numerical discretizations. Details of the hybrid implementation are provided, and performance tests illustrate the utility of the method. It is shown that the hybrid scheme achieves near ideal scalability up to ~20000 compute cores with a maximum mean efficiency of 83%. Data are presented that demonstrate how to choose the optimal number of MPI processes and OpenMP threads in order to optimize code performance on two different platforms.Comment: Submitted to Parallel Computin

    Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems

    Get PDF
    This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems

    A Parallel Mesh-Adaptive Framework for Hyperbolic Conservation Laws

    Full text link
    We report on the development of a computational framework for the parallel, mesh-adaptive solution of systems of hyperbolic conservation laws like the time-dependent Euler equations in compressible gas dynamics or Magneto-Hydrodynamics (MHD) and similar models in plasma physics. Local mesh refinement is realized by the recursive bisection of grid blocks along each spatial dimension, implemented numerical schemes include standard finite-differences as well as shock-capturing central schemes, both in connection with Runge-Kutta type integrators. Parallel execution is achieved through a configurable hybrid of POSIX-multi-threading and MPI-distribution with dynamic load balancing. One- two- and three-dimensional test computations for the Euler equations have been carried out and show good parallel scaling behavior. The Racoon framework is currently used to study the formation of singularities in plasmas and fluids.Comment: late submissio

    Data Provenance and Management in Radio Astronomy: A Stream Computing Approach

    Get PDF
    New approaches for data provenance and data management (DPDM) are required for mega science projects like the Square Kilometer Array, characterized by extremely large data volume and intense data rates, therefore demanding innovative and highly efficient computational paradigms. In this context, we explore a stream-computing approach with the emphasis on the use of accelerators. In particular, we make use of a new generation of high performance stream-based parallelization middleware known as InfoSphere Streams. Its viability for managing and ensuring interoperability and integrity of signal processing data pipelines is demonstrated in radio astronomy. IBM InfoSphere Streams embraces the stream-computing paradigm. It is a shift from conventional data mining techniques (involving analysis of existing data from databases) towards real-time analytic processing. We discuss using InfoSphere Streams for effective DPDM in radio astronomy and propose a way in which InfoSphere Streams can be utilized for large antennae arrays. We present a case-study: the InfoSphere Streams implementation of an autocorrelating spectrometer, and using this example we discuss the advantages of the stream-computing approach and the utilization of hardware accelerators

    Building real-time embedded applications on QduinoMC: a web-connected 3D printer case study

    Full text link
    Single Board Computers (SBCs) are now emerging with multiple cores, ADCs, GPIOs, PWM channels, integrated graphics, and several serial bus interfaces. The low power consumption, small form factor and I/O interface capabilities of SBCs with sensors and actuators makes them ideal in embedded and real-time applications. However, most SBCs run non-realtime operating systems based on Linux and Windows, and do not provide a user-friendly API for application development. This paper presents QduinoMC, a multicore extension to the popular Arduino programming environment, which runs on the Quest real-time operating system. QduinoMC is an extension of our earlier single-core, real-time, multithreaded Qduino API. We show the utility of QduinoMC by applying it to a specific application: a web-connected 3D printer. This differs from existing 3D printers, which run relatively simple firmware and lack operating system support to spool multiple jobs, or interoperate with other devices (e.g., in a print farm). We show how QduinoMC empowers devices with the capabilities to run new services without impacting their timing guarantees. While it is possible to modify existing operating systems to provide suitable timing guarantees, the effort to do so is cumbersome and does not provide the ease of programming afforded by QduinoMC.http://www.cs.bu.edu/fac/richwest/papers/rtas_2017.pdfAccepted manuscrip
    corecore