9,343 research outputs found

    A Domain Specific Approach to High Performance Heterogeneous Computing

    Full text link
    Users of heterogeneous computing systems face two problems: first, understanding the trade-off relationships between the observable characteristics of their applications, such as latency and quality of the result; and second, exploiting knowledge of these characteristics to allocate work to distributed computing platforms efficiently. A domain-specific approach addresses both of these problems. By considering a subset of operations or functions, models of the observable characteristics, or domain metrics, may be formulated in advance and populated at run-time for task instances. These metric models can then be used to express the allocation of work as a constrained integer program, which can be solved using heuristics, machine learning, or Mixed Integer Linear Programming (MILP) frameworks. These claims are illustrated using the example domain of derivatives pricing in computational finance, with the domain metrics of workload latency (makespan) and pricing accuracy. For a large, varied workload of 128 Black-Scholes and Heston model-based option pricing tasks, running on a diverse array of 16 multicore CPU, GPU and FPGA platforms, predictions made by models of both makespan and accuracy are generally within 10% of the observed run-time performance. When these models are used as inputs to machine learning and MILP-based workload allocation approaches, latency improvements of up to 24 and 270 times, respectively, over the heuristic approach are seen.
    Comment: 14 pages, preprint draft, minor revision
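    As a concrete illustration of how such an allocation can be phrased as a constrained integer program, the sketch below uses the PuLP Python library to minimise makespan over a toy task set. The task names, platform names and latency figures are hypothetical placeholders standing in for the paper's metric-model predictions, not its actual formulation.

        # Hypothetical MILP allocation sketch (PuLP). latency[t][p] stands in
        # for a metric-model prediction of task t's latency on platform p.
        import pulp

        tasks = ["bs0", "bs1", "heston0"]        # option-pricing task instances
        platforms = ["cpu0", "gpu0", "fpga0"]    # heterogeneous devices
        latency = {
            "bs0":     {"cpu0": 0.8, "gpu0": 0.2, "fpga0": 0.3},
            "bs1":     {"cpu0": 0.9, "gpu0": 0.2, "fpga0": 0.3},
            "heston0": {"cpu0": 2.5, "gpu0": 0.7, "fpga0": 0.4},
        }

        prob = pulp.LpProblem("workload_allocation", pulp.LpMinimize)
        # x[t][p] = 1 iff task t is assigned to platform p
        x = pulp.LpVariable.dicts("x", (tasks, platforms), cat=pulp.LpBinary)
        makespan = pulp.LpVariable("makespan", lowBound=0)

        prob += makespan                         # objective: minimise makespan
        for t in tasks:                          # each task runs exactly once
            prob += pulp.lpSum(x[t][p] for p in platforms) == 1
        for p in platforms:                      # per-platform load <= makespan
            prob += pulp.lpSum(latency[t][p] * x[t][p] for t in tasks) <= makespan

        prob.solve()
        for t in tasks:
            print(t, "->", next(p for p in platforms if pulp.value(x[t][p]) > 0.5))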

    A Classification and Survey of Computer System Performance Evaluation Techniques

    Get PDF
    Classification and survey of computer system performance evaluation techniques.

    Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study

    Full text link
    Manycores are consolidating in the HPC community as a way of improving performance while maintaining power efficiency. Knights Landing is the recently released second generation of the Intel Xeon Phi architecture. While optimizing applications on CPUs, GPUs and first-generation Xeon Phis has been studied extensively in recent years, the new features of Knights Landing processors require revisiting programming and optimization techniques for these devices. In this work, we select the Floyd-Warshall algorithm as a representative case study of graph and memory-bound applications. Starting from the default serial version, we show how data-, thread- and compiler-level optimizations help the parallel implementation reach 338 GFLOPS.
    Comment: Computer Science - CACIC 2017. Springer Communications in Computer and Information Science, vol 79
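    For readers unfamiliar with the blocked scheme, the sketch below shows the classic three-phase blocked Floyd-Warshall structure in plain Python/NumPy. It is only a structural illustration under an assumed square blocking; the paper's 338 GFLOPS figure comes from the data-, thread- and compiler-level optimizations on KNL, which are not reproduced here.

        import numpy as np

        def fw_block(C, A, B):
            # In-place update C = min(C, A[:, k] + B[k, :]) over the pivot
            # indices k of the current block: A holds paths into the pivot
            # block, B holds paths out of it.
            for k in range(A.shape[1]):
                C[:] = np.minimum(C, A[:, k:k+1] + B[k:k+1, :])

        def blocked_floyd_warshall(D, b):
            # All-pairs shortest paths on an n x n distance matrix D,
            # with n assumed divisible by the block size b.
            n = D.shape[0]
            for K in range(0, n, b):
                kk = slice(K, K + b)
                fw_block(D[kk, kk], D[kk, kk], D[kk, kk])      # phase 1: diagonal
                for I in range(0, n, b):
                    if I == K:
                        continue
                    ii = slice(I, I + b)
                    fw_block(D[ii, kk], D[ii, kk], D[kk, kk])  # phase 2: pivot column
                    fw_block(D[kk, ii], D[kk, kk], D[kk, ii])  # phase 2: pivot row
                for I in range(0, n, b):                       # phase 3: all
                    for J in range(0, n, b):                   # independent blocks
                        if I == K or J == K:
                            continue
                        ii, jj = slice(I, I + b), slice(J, J + b)
                        fw_block(D[ii, jj], D[ii, kk], D[kk, jj])

    Phase 3 carries almost all of the floating-point work and has no intra-phase dependencies, which is what makes the blocked formulation attractive for wide, many-threaded processors such as KNL.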

    Validation of a fault-tolerant multiprocessor: Baseline experiments and workload implementation

    Get PDF
    In the future, aircraft must employ highly reliable multiprocessors in order to achieve flight safety. Such computers must be experimentally validated before they are deployed. This project outlines a methodology for validating reliable multiprocessors. The methodology begins with baseline experiments, each of which tests a single phenomenon. As experiments progress, tools for performance testing are developed. The methodology is applied, in part, to the Fault-Tolerant Multiprocessor (FTMP) at NASA Langley's AIRLAB facility, with experiments designed to evaluate the fault-free performance of the system. The results of interrupt baseline experiments performed on FTMP are presented. Interrupt-causing exception conditions were tested; several were found to have unimplemented interrupt-handling software, while one had an unimplemented interrupt vector. A synthetic workload model for real-time multiprocessors is then developed as an application-level performance analysis tool. Details of the workload implementation and calibration are presented. Both the experimental methodology and the synthetic workload model are general enough to be applicable to reliable multiprocessors besides FTMP.
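    The following is a deliberately small sketch of what such a synthetic workload can look like: a task whose compute and wait phases are tunable knobs, calibrated against a wall-clock target before being replicated across processors. The function names and parameters are illustrative assumptions, not the report's actual model.

        import time

        def synthetic_task(compute_iters, wait_s):
            # One frame of synthetic work: a calibrated busy loop plus a
            # delay standing in for I/O or inter-processor communication.
            acc = 0.0
            for i in range(compute_iters):
                acc += (i % 7) * 1e-9
            time.sleep(wait_s)
            return acc

        def calibrate(target_s, wait_s, trial_iters=100_000):
            # Scale the busy-loop count so one frame takes roughly target_s.
            start = time.perf_counter()
            synthetic_task(trial_iters, 0.0)
            per_iter = (time.perf_counter() - start) / trial_iters
            return max(1, int((target_s - wait_s) / per_iter))

        iters = calibrate(target_s=0.010, wait_s=0.002)   # 10 ms frame, 2 ms wait
        print("calibrated to", iters, "iterations per frame")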

    Fault-free validation of a fault-tolerant multiprocessor: Baseline experiments and workload implementation

    Get PDF
    In the future, aircraft employing active control technology must use highly reliable multiprocessors in order to achieve flight safety. Such computers must be experimentally validated before they are deployed. This project outlines a methodology for the fault-free validation of reliable multiprocessors. The methodology begins with baseline experiments, each of which tests a single phenomenon. As experiments progress, tools for performance testing are developed. This report presents the results of interrupt baseline experiments performed on the Fault-Tolerant Multiprocessor (FTMP) at NASA Langley's AIRLAB. Interrupt-causing exception conditions were tested; several were found to have unimplemented interrupt-handling software, while one had an unimplemented interrupt vector. A synthetic workload model for real-time multiprocessors is then developed as an application-level performance analysis tool. Details of the workload implementation and calibration are presented. Both the experimental methodology and the synthetic workload model are general enough to be applicable to reliable multiprocessors besides FTMP.

    POSTER: Exploiting asymmetric multi-core processors with flexible system software

    Get PDF
    Energy efficiency has become the main challenge for high performance computing (HPC). The use of mobile asymmetric multi-core architectures to build future multi-core systems is one approach to saving energy while maintaining high performance. However, it is not yet known whether such systems are ready to handle parallel applications. This paper fills this gap by evaluating emerging parallel applications on an asymmetric multi-core. We make use of the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture. We conclude that these applications are not mature enough to run on such systems, as they suffer from load imbalance. Furthermore, we explore the behaviour of dynamic scheduling solutions at either the Operating System (OS) or the runtime level. Comparing these approaches shows that the most efficient scheduling takes place at the runtime level, pointing future research towards such solutions.
    This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and by the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 610402 and from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement number 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).
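    As a point of reference for the OS-level end of that design space, the sketch below pins work to one cluster of a big.LITTLE chip from user space on Linux. The core numbering (0-3 LITTLE, 4-7 big) is an assumption; the real cluster layout varies by SoC, and runtime-level schedulers make this decision dynamically rather than statically as shown here.

        import os
        import threading

        LITTLE_CORES = {0, 1, 2, 3}   # assumed LITTLE cluster (SoC-dependent)
        BIG_CORES    = {4, 5, 6, 7}   # assumed big cluster (SoC-dependent)

        def run_pinned(work, cores):
            # Run `work` in a thread restricted to the given cores
            # (Linux-only; pid 0 means the calling thread).
            def body():
                os.sched_setaffinity(0, cores)
                work()
            t = threading.Thread(target=body)
            t.start()
            return t

        heavy = run_pinned(lambda: sum(i * i for i in range(10**7)), BIG_CORES)
        light = run_pinned(lambda: sum(range(10**6)), LITTLE_CORES)
        heavy.join(); light.join()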

    Adaptive Transactional Memories: Performance and Energy Consumption Tradeoffs

    Get PDF
    Energy efficiency is becoming a pressing issue, especially in large data centers, where it entails a non-negligible management cost, an increased probability of hardware faults, and a significant environmental footprint. In this paper, we study how Software Transactional Memories (STM) can provide benefits for both power saving and overall application execution performance. This is related to the fact that encapsulating shared-data accesses within transactions gives the STM middleware the freedom both to ensure consistency and to reduce actual data contention, the latter having been shown to affect the overall power needed to complete an application's execution. We have selected a set of self-adaptive extensions to existing STM middlewares (namely, TinySTM and R-STM) to show how self-adapting computation can better capture the actual degree of parallelism and/or logical contention on shared data, further enhancing the intrinsic benefits provided by STM. Of course, this benefit comes at a cost: the execution time required by the proposed approaches to tune the execution parameters for reducing power consumption and enhancing execution performance. Nevertheless, the results provided here show that adaptivity is a strictly necessary requirement for reducing energy consumption in STM systems: without it, it is not possible to reach any acceptable level of energy efficiency at all.
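    The self-adaptive extensions the paper evaluates are internal to TinySTM and R-STM, but the underlying control loop is simple to state. The hedged sketch below shows a generic hill-climbing controller that re-measures committed throughput and nudges the active thread count toward the best-performing setting; measure_throughput is an assumed callback, not either middleware's API.

        def adapt_concurrency(measure_throughput, min_threads=1,
                              max_threads=16, rounds=20):
            # measure_throughput(n) is assumed to run the STM workload
            # briefly with n threads and return committed transactions/sec.
            n, step = max_threads // 2, 1
            best = measure_throughput(n)
            for _ in range(rounds):
                candidate = min(max(n + step, min_threads), max_threads)
                t = measure_throughput(candidate)
                if t > best:          # keep moving in the profitable direction
                    n, best = candidate, t
                else:                 # reverse direction when throughput drops
                    step = -step
            return n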