
    A domain-specific high-level programming model

    Nowadays, computing hardware continues to move toward more parallelism and more heterogeneity to obtain more computing power. From personal computers to supercomputers, several levels of parallelism can be found, expressed by the interconnection of multi-core processors and many-core accelerators. Computing software, in turn, needs to adapt to this trend, and programmers can use parallel programming models (PPMs) to fulfil this difficult task. Various PPMs are available, based on tasks, directives, or low-level languages and libraries. These offer different levels of abstraction from the architecture, each with its own syntax. However, one way to offer an efficient PPM with a higher level of abstraction while preserving performance is to restrict it to a specific domain and adapt it to a family of applications. In the present study, we propose a high-level PPM specific to digital signal processing applications. It is based on data-flow graph models of computation and a dynamic runtime model of execution (StarPU). We show how the user can easily express a digital signal processing application and take advantage of task, data, and graph parallelism in the implementation, to enhance performance on targeted heterogeneous clusters composed of CPUs and different accelerators (e.g., GPU, Xeon Phi).
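    To make the task/data-flow execution model concrete, below is a minimal sketch using the StarPU C API named in the abstract as the runtime. The codelet, the trivial "scale" kernel, and the buffer size are placeholder assumptions, not the paper's actual DSP model; each submitted task becomes a node of the data-flow graph, and StarPU infers dependencies from the declared data accesses.

```c
/* Minimal StarPU task-graph sketch (placeholder kernel, not the paper's DSP model). */
#include <starpu.h>
#include <stdint.h>

/* CPU implementation of a trivial "gain" stage: scale every sample. */
static void scale_cpu(void *buffers[], void *cl_arg)
{
    float *x = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    for (unsigned i = 0; i < n; i++)
        x[i] *= 2.0f;
    (void)cl_arg;
}

static struct starpu_codelet scale_cl = {
    .cpu_funcs = { scale_cpu },   /* a .cuda_funcs entry would add a GPU variant */
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float signal[1024] = { 0 };
    starpu_data_handle_t h;

    if (starpu_init(NULL) != 0)
        return 1;

    /* Register the buffer; StarPU then tracks its copies across devices. */
    starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                (uintptr_t)signal, 1024, sizeof(float));

    /* Each submitted task is a graph node; dependencies follow from
       the declared data accesses (read-write on h here). */
    for (int stage = 0; stage < 4; stage++)
        starpu_task_insert(&scale_cl, STARPU_RW, h, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(h);
    starpu_shutdown();
    return 0;
}
```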

    APPLICATION OF SECRETARY ALGORITHM TO DYNAMIC LOAD BALANCING IN USER-SPACE ON MULTICORE SYSTEMS

    In recent years, multicore processors have become so prevalent in many types of systems that they are now widely used even in commodity hardware for a wide range of applications. Although multicore processors are clearly a popular hardware solution to problems that were not possible with traditional single-core processors, taking advantage of them is inevitably met by software challenges. As Amdahl’s law puts it, the performance gain is limited by the percentage of the software that cannot be run in parallel on multiple cores. Even when an application is “embarrassingly” parallelized by a careful design of algorithm and implementation, load balancing of tasks across different cores is a very important and critical aspect of utilizing a multicore system as close to its fullest potential as possible. In this paper, we investigate how a solution to a cardinal payoff variant of the secretary problem can be applied to a proactive, decentralized, dynamic load balancing technique in user-space to assist single program, multiple data (SPMD) applications in a multiprogrammed environment, so that all tasks can make roughly equal progress distributed over all cores. We examine how this method compares with the default Linux load balancer in terms of scalability and predictability. Our experiments show promising results: our technique outperforms the default Linux scheduler by an average 40% speedup in a multiprogrammed environment, with less time variance among multiple executions.
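    The abstract does not spell out the balancing protocol, so the sketch below only illustrates the threshold rule typical of cardinal-payoff secretary strategies: observe roughly the first √n candidates, then commit to the first one that beats everything seen so far. Applying it to choosing a migration target core, with a toy sample_load() probe, is an assumption for illustration, not the paper's exact algorithm.

```c
/* Threshold-rule sketch for the cardinal-payoff secretary problem, applied
 * to picking a migration target among candidate cores.  The load metric and
 * the toy sample_load() probe are illustrative stand-ins, not the paper's
 * implementation. */
#include <math.h>
#include <stdio.h>

#define N_CORES 16

/* Stand-in for a real probe of per-core load (lower is better). */
static double sample_load(int core)
{
    static const double toy_loads[N_CORES] = {
        0.8, 0.6, 0.9, 0.7, 0.5, 0.95, 0.3, 0.85,
        0.4, 0.75, 0.65, 0.2, 0.9, 0.55, 0.35, 0.7
    };
    return toy_loads[core];
}

/* Inspect cores one at a time; after observing the first ceil(sqrt(n))
 * candidates, commit to the first core whose load beats everything seen so
 * far.  Falls back to the last core if none does. */
static int pick_target_core(int n_cores)
{
    int cutoff = (int)ceil(sqrt((double)n_cores));
    double best_seen = INFINITY;
    for (int c = 0; c < n_cores; c++) {
        double load = sample_load(c);
        if (c < cutoff) {               /* observation phase: never commit */
            if (load < best_seen)
                best_seen = load;
        } else if (load <= best_seen) { /* selection phase: first winner */
            return c;
        }
    }
    return n_cores - 1;
}

int main(void)
{
    printf("migrate to core %d\n", pick_target_core(N_CORES));
    return 0;
}
```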

    Portable performance on heterogeneous architectures

    Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of programs across all types of resources in the machine. To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that allows the space of possible implementations to be searched at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures. United States. Dept. of Energy (Award DE-SC0005288); United States. Defense Advanced Research Projects Agency (Award HR0011-10-9-0009); National Science Foundation (U.S.) (Award CCF-0632997).
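    As a rough illustration of empirical autotuning over a space of algorithmic choices (not the actual framework or its search strategy, which the abstract leaves unspecified), the sketch below times each candidate variant of a routine on the installed machine and remembers the fastest; the two toy variants and the problem size are hypothetical.

```c
/* Minimal empirical-autotuning sketch: time each candidate variant of a
 * routine on this machine and remember the fastest.  The two toy variants
 * and the problem size are illustrative, not the paper's framework. */
#include <stdio.h>
#include <time.h>

#define N (1 << 20)

typedef void (*variant_fn)(float *x, int n);

/* Candidate 1: straightforward loop. */
static void scale_simple(float *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= 2.0f;
}

/* Candidate 2: manually unrolled loop (a GPU-offload wrapper could be a
 * third entry in a real choice space). */
static void scale_unrolled(float *x, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i] *= 2.0f; x[i+1] *= 2.0f; x[i+2] *= 2.0f; x[i+3] *= 2.0f;
    }
    for (; i < n; i++)
        x[i] *= 2.0f;
}

static double time_variant(variant_fn f, float *x, int n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    f(x, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    static float data[N];
    variant_fn variants[] = { scale_simple, scale_unrolled };
    int best = 0;
    double best_t = time_variant(variants[0], data, N);

    /* Install-time search: try every variant, keep the fastest index. */
    for (int i = 1; i < 2; i++) {
        double t = time_variant(variants[i], data, N);
        if (t < best_t) { best_t = t; best = i; }
    }
    printf("selected variant %d (%.3f ms)\n", best, best_t * 1e3);
    return 0;
}
```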

    A Model-based Design Framework for Application-specific Heterogeneous Systems

    The increasing heterogeneity of computing systems enables higher performance and power efficiency. However, these improvements come at the cost of increasing the overall complexity of designing such systems. These complexities include constructing implementations for various types of processors, setting up and configuring communication protocols, and efficiently scheduling the computational work. The process for developing such systems is iterative and time consuming, with no well-defined performance goal. Current performance estimation approaches use source code implementations that require experienced developers and time to produce. We present a framework to aid in the design of heterogeneous systems and the performance tuning of applications. Our framework supports system construction: integrating custom hardware accelerators with existing cores into processors, integrating processors into cohesive systems, and mapping computations to processors to achieve overall application performance and efficient hardware usage. It also facilitates effective design space exploration using processor models (for both existing and future processors) that do not require source code implementations to estimate performance. We evaluate our framework using a variety of applications, implementing them in systems ranging from low power embedded systems-on-chip (SoCs) to high performance systems built from commercial-off-the-shelf (COTS) components. We show how the design process is improved, reducing the number of design iterations and unnecessary source code development, ultimately leading to higher-performing, more efficient systems.
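    The abstract does not describe the processor models themselves; as a generic illustration of source-free performance estimation for design space exploration, the sketch below uses a roofline-style bound in which a candidate processor is characterised only by peak throughput and memory bandwidth, and a kernel only by its operation and byte counts. All names and numbers are assumptions, not the framework's actual models.

```c
/* Roofline-style estimate: a generic illustration of source-free performance
 * modelling for design-space exploration (not the framework's actual model).
 * A candidate processor is characterized only by peak compute throughput and
 * memory bandwidth; a kernel only by its operation and byte counts. */
#include <stdio.h>

struct proc_model {
    const char *name;
    double peak_gflops;   /* peak compute throughput, GFLOP/s */
    double mem_gbps;      /* sustained memory bandwidth, GB/s  */
};

struct kernel_model {
    double gflops;        /* total floating-point work, GFLOP */
    double gbytes;        /* total data moved, GB             */
};

/* Estimated runtime: whichever resource the kernel saturates dominates. */
static double estimate_seconds(struct proc_model p, struct kernel_model k)
{
    double compute_t = k.gflops / p.peak_gflops;
    double memory_t  = k.gbytes / p.mem_gbps;
    return compute_t > memory_t ? compute_t : memory_t;
}

int main(void)
{
    /* Illustrative candidates for exploration; all numbers are made up. */
    struct proc_model candidates[] = {
        { "embedded CPU", 20.0,  10.0  },
        { "GPU",          900.0, 200.0 },
        { "FPGA accel.",  150.0, 40.0  },
    };
    struct kernel_model fir = { 4.0, 2.0 };  /* hypothetical FIR workload */

    for (int i = 0; i < 3; i++)
        printf("%-12s %.4f s\n", candidates[i].name,
               estimate_seconds(candidates[i], fir));
    return 0;
}
```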

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016. The PhD Symposium was a very good opportunity for young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful, and the assessment made by the PhD students was very good. It also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing, aiming at cross-fertilization among HPC, large-scale distributed systems, and big data management; at training; and at bringing together disparate researchers working across these different areas, providing a meeting ground to exchange ideas, identify synergies, and pursue common activities in research topics such as sustainable software solutions (applications and the system software stack), data management, energy efficiency, and resilience. European Cooperation in Science and Technology. COST.

    Portable OpenCL Out-of-Order Execution Framework for Heterogeneous Platforms

    Heterogeneous computing has become a viable option for seeking computing performance, alongside conventional homogeneous multi-/single-processor approaches. The advantage of heterogeneity is the possibility to choose the best device on the platform for distinct workloads in the application, to gain performance and/or to lower power consumption. The drawback of heterogeneity is the increased complexity of applications, all the way from the programming models to the different instruction sets and architectures of the devices. OpenCL is the first standard for programming heterogeneous platforms. OpenCL offers a uniform Application Program Interface (API) and a device platform abstraction that allows all different types of devices to be programmed in the same platform-portable way. OpenCL has been widely adopted by major software and chip manufacturers and is increasing in popularity. Using OpenCL requires an implementation of the standard. One such implementation is the POrtable Computing Language (pocl), an open source project launched at Tampere University of Technology. pocl aims at easy portability to different devices. One goal of pocl is improved performance portability, using a kernel compiler that is able to adapt to the different parallel hardware resources of the devices. This thesis describes an out-of-order execution framework for pocl. This work offers a flexible and simple API for efficient offloading of computation to the devices, and for synchronising computation between the main application and other devices on the platform. The focus of this thesis is on task-level operation, with fast task launching and efficient exploitation of available task-level parallelism. An interest has emerged in using OpenCL as a middleware for other parallel programming models. These programming models might be highly task parallel, and the size of the tasks might be much smaller than in nominal OpenCL use cases, which tend to focus on data parallelism. In the runtime implementation of the proposed framework, the focus was on minimising task-scheduling overheads in order to improve scalability for such programming models.
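    To illustrate the kind of task-level, out-of-order offloading the framework targets, the sketch below uses standard OpenCL host API calls: two independent kernels are enqueued to an out-of-order queue and a third is made to wait on them via events. The kernels, context, and problem size are placeholders, and this is generic OpenCL usage rather than pocl's internal API; on OpenCL 1.x the queue would instead be created with clCreateCommandQueue and the same out-of-order flag.

```c
/* Sketch of OpenCL task-level offloading with an out-of-order command queue:
 * two independent kernels may run concurrently; a third waits on both via
 * events.  Kernel creation and program build are omitted; names are
 * placeholders. */
#include <CL/cl.h>

/* 'ctx', 'dev', and the three kernels are assumed to be set up elsewhere. */
void launch_task_graph(cl_context ctx, cl_device_id dev,
                       cl_kernel k_a, cl_kernel k_b, cl_kernel k_sum)
{
    cl_int err;
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
    };
    cl_command_queue q =
        clCreateCommandQueueWithProperties(ctx, dev, props, &err);

    size_t gsize = 1024;
    cl_event ev[2];

    /* Two independent tasks: the runtime may execute them in any order. */
    clEnqueueNDRangeKernel(q, k_a, 1, NULL, &gsize, NULL, 0, NULL, &ev[0]);
    clEnqueueNDRangeKernel(q, k_b, 1, NULL, &gsize, NULL, 0, NULL, &ev[1]);

    /* A dependent task: the event wait list expresses the task-graph edge. */
    clEnqueueNDRangeKernel(q, k_sum, 1, NULL, &gsize, NULL, 2, ev, NULL);

    clFinish(q);
    clReleaseEvent(ev[0]);
    clReleaseEvent(ev[1]);
    clReleaseCommandQueue(q);
}
```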

    Mapping parallelism to heterogeneous processors

    Most embedded devices are based on heterogeneous Multiprocessor Systems-on-Chip (MPSoCs). These contain a variety of processors such as CPUs, micro-controllers, DSPs, GPUs and specialised accelerators. The heterogeneity of these systems helps in achieving good performance and energy efficiency but makes programming inherently difficult. There is no single programming language or runtime to program such platforms. This thesis makes three contributions to these problems. First, it presents a framework that allows code in Single Program Multiple Data (SPMD) form to be mapped to a heterogeneous platform. The mapping space is explored, and it is shown that the best mapping depends on the metric used. Next, a compiler framework is presented which bridges the gap between the high-level programming model of OpenMP and the heterogeneous resources of MPSoCs. It takes OpenMP programs and generates code which runs on all processors. It delivers programming ease while exploiting heterogeneous resources. Finally, a compiler-based approach to runtime power management for heterogeneous cores is presented. Given an externally provided power budget, the approach generates heterogeneous, partitioned code that attempts to give the best performance within that budget.
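    As a generic illustration of the gap the OpenMP contribution bridges (not the thesis's actual generated code), the sketch below splits one data-parallel loop between an accelerator and the host cores using standard OpenMP target offloading; the 70/30 split and the problem size are arbitrary assumptions.

```c
/* Generic illustration of partitioning one data-parallel loop between host
 * CPU cores and an accelerator with standard OpenMP; the 70/30 split and
 * problem size are arbitrary, and this is not the thesis's generated code. */
#include <omp.h>

#define N 1000000

void vec_scale(float *x)
{
    int split = (int)(N * 0.7);   /* assumed static partition: 70% to device */

    /* Offload the first chunk to the accelerator without blocking the host. */
    #pragma omp target teams distribute parallel for \
            map(tofrom: x[0:split]) nowait
    for (int i = 0; i < split; i++)
        x[i] *= 2.0f;

    /* Remaining iterations run on the host cores in the meantime. */
    #pragma omp parallel for
    for (int i = split; i < N; i++)
        x[i] *= 2.0f;

    #pragma omp taskwait   /* wait for the deferred target task to finish */
}
```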