25,031 research outputs found

    Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models

    Get PDF
    The upcoming many-core architectures require software developers to exploit concurrency to utilize available computational power. Today's high-level language virtual machines (VMs), which are a cornerstone of software development, do not provide sufficient abstraction for concurrency concepts. We analyze concrete and abstract concurrency models and identify the challenges they impose for VMs. To provide sufficient concurrency support in VMs, we propose to integrate concurrency operations into VM instruction sets. Since there will always be VMs optimized for special purposes, our goal is to develop a methodology to design instruction sets with concurrency support. Therefore, we also propose a list of trade-offs that have to be investigated to advise the design of such instruction sets. As a first experiment, we implemented one instruction set extension for shared memory and one for non-shared memory concurrency. From our experimental results, we derived a list of requirements for a full-grown experimental environment for further research

    RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors

    Get PDF
    Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors. This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance

    Structured Parallel Programming and Cache Coherence in Multicore Architectures

    Get PDF
    It is clear that multicore processors have become the building blocks of today’s high-performance computing platforms. The advent of massively parallel single-chip microprocessors further emphasizes the gap that exists between parallel architectures and parallel programming maturity. Our research group, starting from the experiences on distributed and shared memory multiprocessor, was one of the first to propose a Structured Parallel Programming approach to bridge this gap. In this scenario, one of the biggest problems is that an application’s performance is often affected by the sharing pattern of data and its impact on Cache Coherence. Currently multicore platforms rely on hardware or automatic cache coherence techniques that allow programmers to develop programs without taking into account the problem. It is well known that standard coherency protocols are inefficient for certain data communication patterns and these inefficiencies will be amplified by the increased core number and the complex memory hierarchies. Following a structured parallelism approach, our methodology to attack these problems is based on two interrelated issues: structured parallelism paradigms and cost models (or performance models). Evaluating the performance of a program, although widely studied, is still an open problem in the research community and, notably, specific cost models to de- scribe multicores are missing. For this reason in this thesis, we define an abstract model for cache coherent architectures, which is able to capture the essential elements and the qualitative behaviors of multicore-based systems. Furthermore, we show how this abstract model combined with well known performance modelling techniques, such as analytical modelling (e.g., queueing models and stochastic process algebras) or simulations, provide an application- and architecture-dependent cost model to predict structured parallel applications performances. Starting out from the behavior and performance predictability of structured parallelism schemes, in this thesis we address the issue of cache coherence in multicore architectures, following an algorithm-dependent approach, a particular kind of software cache coherence solution characterized by explicit cache management strategies, which are specific of the algorithm to be executed. Notably, we ensure parallel correctness by exploiting architecture-specific mechanisms and by defining proper data structures in order to “emulate” cache coherence solutions in an efficient way for each computation. Algorithm-dependent cache coherence can be efficiently implemented at the support level of structured parallelism paradigms, with absolute transparency with respect to the application programmer. Moreover, by using the cost model, in this thesis we study and compare different algorithm-dependent implementations, such as those based on automatic cache coherence with respect to an original, non-automatic and lock-free solution based on interprocessor communications. Notably, with this latter implementation, in some cases, we are able to reduce the number of memory accesses, cache transfers and synchronizations and increasing computation parallelism with respect to the use of automatic cache coherence. Current architectures do not usually allow disabling automatic cache coherence. However, the emergence of many-core architectures somewhat changed the scenario, so that some architectures, such as the Tilera TilePro64, allow to control and disable the automatic cache coherence facilities. For this reason, in this thesis we finally apply our methodology to TilePro64 platform in order provide a further validation of the results obtained by our cost model

    TaskPoint: sampled simulation of task-based programs

    Get PDF
    Sampled simulation is a mature technique for reducing simulation time of single-threaded programs, but it is not directly applicable to simulation of multi-threaded architectures. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to system noise and variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals we employ a novel fast-forward mechanism for dynamically scheduled programs. We evaluate the proposed technique on a set of 19 task-based parallel benchmarks and two different architectures. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 64 simulated threads by an average factor of 19.1 at an average error of 1.8% and a maximum error of 15.0%.This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Mont-Blanc project (EU-FP7-610402 and EU-H2020-671697). M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas is supported by the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the EUFP7 (contract 2013BP B 00243). T.Grass has been partially supported by the AGAUR of the Generalitat de Catalunya (grant 2013FI B 0058).Peer ReviewedPostprint (author's final draft

    DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives

    Full text link
    We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance over hardware architecture. We evaluate results on CPUs and GPUs for an image segmentation problem. Compared to a serial baseline, we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare our performance to a reference, OpenMP-based algorithm, and find speedups of up to 7X (CPU).Comment: LDAV 2018, October 201
    • …
    corecore