5,349 research outputs found
RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors
Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors.
This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance
Optimization of Discrete-parameter Multiprocessor Systems using a Novel Ergodic Interpolation Technique
Modern multi-core systems have a large number of design parameters, most of
which are discrete-valued, and this number is likely to keep increasing as chip
complexity rises. Further, the accurate evaluation of a potential design choice
is computationally expensive because it requires detailed cycle-accurate system
simulation. If the discrete parameter space can be embedded into a larger
continuous parameter space, then continuous space techniques can, in principle,
be applied to the system optimization problem. Such continuous space techniques
often scale well with the number of parameters.
We propose a novel technique for embedding the discrete parameter space into
an extended continuous space so that continuous space techniques can be applied
to the embedded problem using cycle accurate simulation for evaluating the
objective function. This embedding is implemented using simulation-based
ergodic interpolation, which, unlike spatial interpolation, produces the
interpolated value within a single simulation run irrespective of the number of
parameters. We have implemented this interpolation scheme in a cycle-based
system simulator. In a characterization study, we observe that the interpolated
performance curves are continuous, piece-wise smooth, and have low statistical
error. We use the ergodic interpolation-based approach to solve a large
multi-core design optimization problem with 31 design parameters. Our results
indicate that continuous space optimization using ergodic interpolation-based
embedding can be a viable approach for large multi-core design optimization
problems.Comment: A short version of this paper will be published in the proceedings of
IEEE MASCOTS 2015 conferenc
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code
STATISTICAL MACHINE LEARNING BASED MODELING FRAMEWORK FOR DESIGN SPACE EXPLORATION AND RUN-TIME CROSS-STACK ENERGY OPTIMIZATION FOR MANY-CORE PROCESSORS
The complexity of many-core processors continues to grow as a larger number of heterogeneous cores are integrated on a single chip. Such systems-on-chip contains computing structures ranging from complex out-of-order cores, simple in-order cores, digital signal processors (DSPs), graphic processing units (GPUs), application specific processors, hardware accelerators, I/O subsystems, network-on-chip interconnects, and large caches arranged in complex hierarchies. While the industry focus is on putting higher number of cores on a single chip, the key challenge is to optimally architect these many-core processors such that performance, energy and area constraints are satisfied. The traditional approach to processor design through extensive cycle accurate simulations are ill-suited for designing many-core processors due to the large microarchitecture design space that must be explored. Additionally it is hard to optimize such complex processors and the applications that run on them statically at design time such that performance and energy constraints are met under dynamically changing operating conditions.
The dissertation establishes statistical machine learning based modeling framework that enables the efficient design and operation of many-core processors that meets performance, energy and area constraints. We apply the proposed framework to rapidly design the microarchitecture of a many-core processor for multimedia, computer graphics rendering, finance, and data mining applications derived from the Parsec benchmark. We further demonstrate the application of the framework in the joint run-time adaptation of both the application and microarchitecture such that energy availability
constraints are met
- …