45 research outputs found
Multicore Performance Prediction with MPET : Using Scalability Characteristics for Statistical Cross-Architecture Prediction
Multicore processors serve as target platforms in a broad variety of applications ranging from high-performance computing to embedded mobile computing and automotive applications. But, the required parallel programming opens up a huge design space of parallelization strategies each with potential bottlenecks. Therefore, an early estimation of an application’s performance is a desirable development tool. However, out-of-order execution, superscalar instruction pipelines, as well as communication costs and (shared-) cache effects essentially influence the performance of parallel programs. While offering low modeling effort and good simulation speed, current approximate analytic models provide moderate prediction results so far. Virtual prototyping requires a time-consuming simulation, but produces better accuracy. Furthermore, even existing statistical methods often require detailed knowledge of the hardware for characterization. In this work, we present a concept called Multicore Performance Evaluation Tool (MPET) and its evaluation for a statistical approach for performance prediction based on abstract runtime parameters, which describe an application’s scalability behavior and can be extracted from profiles without user input. These scalability parameters not only include information on the interference of software demands and hardware capabilities, but indicate bottlenecks as well. Depending on the database setup, we achieve a competitive accuracy of 20% mean prediction error (11% median), which we also demonstrate in a case study
Simulation Environment with Customized RISC-V Instructions for Logic-in-Memory Architectures
Nowadays, various memory-hungry applications like machine learning algorithms
are knocking "the memory wall". Toward this, emerging memories featuring
computational capacity are foreseen as a promising solution that performs data
process inside the memory itself, so-called computation-in-memory, while
eliminating the need for costly data movement. Recent research shows that
utilizing the custom extension of RISC-V instruction set architecture to
support computation-in-memory operations is effective. To evaluate the
applicability of such methods further, this work enhances the standard GNU
binary utilities to generate RISC-V executables with Logic-in-Memory (LiM)
operations and develop a new gem5 simulation environment, which simulates the
entire system (CPU, peripherals, etc.) in a cycle-accurate manner together with
a user-defined LiM module integrated into the system. This work provides a
modular testbed for the research community to evaluate potential LiM solutions
and co-designs between hardware and software
Recommended from our members
Scalable Emulation of Heterogeneous Systems
The breakdown of Dennard's transistor scaling has driven computing systems toward application-specific accelerators, which can provide orders-of-magnitude improvements in performance and energy efficiency over general-purpose processors.
To enable the radical departures from conventional approaches that heterogeneous systems entail, research infrastructure must be able to model processors, memory and accelerators, as well as system-level changes---such as operating system or instruction set architecture (ISA) innovations---that might be needed to realize the accelerators' potential. Unfortunately, existing simulation tools that can support such system-level research are limited by the lack of fast, scalable machine emulators to drive execution.
To fill this need, in this dissertation we first present a novel machine emulator design based on dynamic binary translation that makes the following improvements over the state of the art: it scales on multicore hosts while remaining memory efficient, correctly handles cross-ISA differences in atomic instruction semantics, leverages the host floating point (FP) unit to speed up FP emulation without sacrificing correctness, and can be efficiently instrumented to---among other possible uses---drive the execution of a full-system, cross-ISA simulator with support for accelerators.
We then demonstrate the utility of machine emulation for studying heterogeneous systems by leveraging it to make two additional contributions. First, we quantify the trade-offs in different coupling models for on-chip accelerators. Second, we present a technique to reuse the private memories of on-chip accelerators when they are otherwise inactive to expand the system's last-level cache, thereby reducing the opportunity cost of the accelerators' integration