RPPM: Rapid Performance Prediction of Multithreaded workloads on multicore processors
Analytical performance modeling is a useful complement to detailed cycle-level simulation for quickly exploring the design space in an early design stage. Mechanistic analytical modeling is particularly attractive because it provides deep insight and does not require the expensive offline profiling that empirical modeling does. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors.
This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict performance across a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance.
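To illustrate the flavor of mechanistic analytical modeling (not RPPM's actual equations), the sketch below estimates execution cycles from a microarchitecture-independent profile plus a few core parameters in interval-model style: a base dispatch term plus miss-event penalties. All field names and penalty values are invented for illustration.

```python
# Hedged sketch of an interval-style mechanistic performance model.
# Profile fields and penalty values are hypothetical, not RPPM's real inputs.

def predict_cycles(profile, core):
    """Estimate execution cycles from a microarchitecture-independent profile."""
    # Base term: instructions flow through at the core's dispatch width.
    base = profile["instructions"] / core["dispatch_width"]
    # Miss events stall the pipeline; each contributes a fixed penalty.
    penalty = (
        profile["branch_mispredictions"] * core["mispredict_penalty"]
        + profile["llc_misses"] * core["memory_latency"]
    )
    return base + penalty

profile = {"instructions": 1_000_000,
           "branch_mispredictions": 5_000,
           "llc_misses": 2_000}
core = {"dispatch_width": 4, "mispredict_penalty": 15, "memory_latency": 200}
print(predict_cycles(profile, core))  # 250000 + 75000 + 400000 = 725000.0
```

The same profile can be re-evaluated against many `core` parameter sets, which is what makes the one-time profiling attractive for design space exploration.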
Pushing the Limits of Machine Design: Automated CPU Design with AI
Design activity -- constructing an artifact description satisfying given goals and constraints -- distinguishes humanity from other animals and traditional machines, and endowing machines with design abilities at the human level or beyond has been a long-term pursuit. Though machines have already demonstrated their abilities in designing new materials, proteins, and computer programs with advanced artificial intelligence (AI) techniques, the search space for designing such objects is relatively small, and thus "Can machines design like humans?" remains an open question. To explore the boundary of machine design, here we present a new AI approach to automatically design a central processing unit (CPU), the brain of a computer and one of the most intricate devices humanity has ever designed. This approach generates the circuit logic of the CPU design, represented by a graph structure called a Binary Speculation Diagram (BSD), from only external input-output observations instead of formal program code. During the generation of the BSD, Monte Carlo-based expansion and the distance of Boolean functions are used to guarantee accuracy and efficiency, respectively. By efficiently exploring a search space of unprecedented size 10^{10^{540}}, the largest of any machine-designed object to the best of our knowledge, and thus pushing the limits of machine design, our approach generates an industrial-scale RISC-V CPU within only 5 hours. The taped-out CPU successfully runs the Linux operating system and performs comparably to the human-designed Intel 80486SX CPU. In addition to learning the world's first CPU from input-output observations alone, which may reform the semiconductor industry by significantly reducing the design cycle, our approach even autonomously discovers human knowledge of the von Neumann architecture.
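The abstract does not define its "distance of Boolean functions", but one standard candidate is the normalized Hamming distance between truth tables, i.e. the fraction of inputs on which two functions disagree. The sketch below illustrates that assumed notion; it is not claimed to be the paper's exact metric.

```python
# Hedged illustration of a Boolean-function distance: the fraction of the
# 2^n input vectors on which two functions disagree (normalized Hamming
# distance between truth tables). This is an assumed definition for
# illustration, not necessarily the one used for BSD generation.

from itertools import product

def truth_table(f, n):
    """Evaluate f on all 2^n Boolean input vectors, in lexicographic order."""
    return [f(*bits) for bits in product((0, 1), repeat=n)]

def boolean_distance(f, g, n):
    """Fraction of inputs where f and g differ; 0.0 means equivalent."""
    tf, tg = truth_table(f, n), truth_table(g, n)
    return sum(a != b for a, b in zip(tf, tg)) / len(tf)

xor = lambda a, b: a ^ b
orf = lambda a, b: a | b
print(boolean_distance(xor, orf, 2))  # differ only on input (1, 1) -> 0.25
```

A distance of zero certifies functional equivalence over the enumerated inputs, which suggests how such a metric could steer a search toward candidate circuits that match observed input-output behavior.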
Memory Hierarchy Hardware-Software Co-design in Embedded Systems
The memory hierarchy is the main bottleneck in modern computer systems, as the gap between the speed of the processor and the memory continues to grow. The situation in embedded systems is even worse: the memory hierarchy consumes a large amount of chip area and energy, which are precious resources in embedded systems. Moreover, embedded systems have multiple design objectives, such as performance, energy consumption, and area.
Customizing the memory hierarchy for specific applications is a very important way to take full advantage of limited resources and maximize performance. However, traditional custom memory hierarchy design methodologies are phase-ordered: they separate application optimization from memory hierarchy architecture design, which tends to result in locally optimal solutions. In traditional hardware-software co-design methodologies, much of the work has focused on utilizing reconfigurable logic to partition the computation; using reconfigurable logic for memory hierarchy design, however, has seldom been addressed.
In this paper, we propose a new framework for designing memory hierarchies for embedded systems. The framework takes advantage of flexible reconfigurable logic to customize the memory hierarchy for specific applications. It combines application optimization and memory hierarchy design to obtain a globally optimal solution. Using the framework, we performed a case study to design a new software-controlled instruction memory that showed promising potential.
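The multi-objective nature of the problem (performance, energy, area) can be sketched as a small design-space exploration that keeps only Pareto-optimal cache configurations. The cost models below are invented toy proxies, not the paper's framework.

```python
# Hedged sketch of multi-objective design-space exploration over cache
# configurations. All cost models are invented toy proxies for illustration.

from itertools import product

def cost(size_kb, assoc):
    """Toy objectives: bigger/more associative caches miss less but cost more."""
    miss_rate = 0.10 / (size_kb * assoc) ** 0.5
    cycles = 1_000_000 * (1 + miss_rate * 100)   # performance proxy
    energy = size_kb * assoc * 0.5               # energy proxy (arbitrary units)
    area = size_kb * 0.1                         # area proxy (arbitrary units)
    return (cycles, energy, area)

def pareto(points):
    """Keep points not dominated (no other point at least as good everywhere)."""
    return [p for p in points
            if not any(all(q[i] <= p[i] for i in range(3)) and q != p
                       for q in points)]

configs = list(product([8, 16, 32, 64], [1, 2, 4]))  # size (KB) x associativity
front = pareto([cost(*c) for c in configs])
```

Phase-ordered flows would fix the application first and then pick one point; a co-design flow searches the joint space, which is why it can reach configurations a phase-ordered flow misses.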
X86_64 vs Aarch64 Performance Validation with COTSon
In this study, we provide a set of architectural parameters for the HPLabs COTSon simulator that can be used to model existing processors, such as the Intel i7700 (X86_64 architecture) and the ARM A53 (Aarch64 architecture). We carry out an initial validation by comparing execution times while performing weak scaling of the architecture, in the case of two common benchmarks. We use the Recursive Fibonacci and Matrix Multiplication benchmarks for simplicity. By using the simulator, we can then further study the sensitivity of the architecture and determine which features matter most to performance. Our goal here is to verify that the COTSon simulator can be used to model both the X86_64 and Aarch64 architectures. This validation study enables us to analyze the bottlenecks and desirable microarchitectural features of modern architectures.
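The two benchmark kernels named in the abstract are simple enough to sketch directly; the actual workloads run under COTSon were presumably compiled native binaries, and the problem sizes here are arbitrary.

```python
# Hedged sketch of the two benchmark kernels named in the abstract.
# The real COTSon workloads were presumably compiled binaries; sizes are arbitrary.

def fib(n):
    """Recursive Fibonacci: stresses call overhead and branch prediction."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def matmul(a, b):
    """Naive dense matrix multiplication: stresses the memory hierarchy."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

print(fib(10))                                     # 55
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The pairing is deliberate: Fibonacci is compute- and control-bound while matrix multiplication is memory-bound, so the two together exercise complementary parts of a simulated core.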
Virtualisation of FPGA-Resources for Concurrent User Designs Employing Partial Dynamic Reconfiguration
Reconfigurable hardware in a cloud environment is a power-efficient way to increase the processing power of future data centers beyond today's maximum.
This work enhances an existing framework to support concurrent users on a virtualized, reconfigurable FPGA resource. The FPGAs provide a flexible, fast, and very efficient platform to which users have access through a simple cloud-based interface.
Fast partial reconfiguration is achieved through the ICAP combined with a PCIe connection, with a combination of custom and TCL scripts controlling the tool flow.
This allows a user space on an FPGA to be reconfigured in a few milliseconds while providing a simple single-action interface to the user.
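The virtualization idea can be sketched as a manager that assigns each user a free reconfigurable region and loads their partial bitstream into it. Every class and method name below is hypothetical; the real framework drives the ICAP over PCIe via custom and TCL scripts, which is stubbed out here.

```python
# Hedged sketch of FPGA virtualization for concurrent users. All names are
# hypothetical; the real partial-reconfiguration path (ICAP over PCIe,
# driven by TCL scripts) is stubbed out as a placeholder.

class FPGAVirtualizer:
    def __init__(self, num_regions):
        self.free = list(range(num_regions))  # unoccupied user regions
        self.owner = {}                       # region -> user

    def deploy(self, user, bitstream):
        """Single-action interface: claim a free region and reconfigure it."""
        if not self.free:
            raise RuntimeError("no free reconfigurable region")
        region = self.free.pop(0)
        self.owner[region] = user
        self._partial_reconfigure(region, bitstream)
        return region

    def release(self, region):
        """Return a region to the free pool when the user design is done."""
        del self.owner[region]
        self.free.append(region)

    def _partial_reconfigure(self, region, bitstream):
        # Placeholder for the ICAP-over-PCIe write of the partial bitstream.
        pass

v = FPGAVirtualizer(num_regions=2)
print(v.deploy("alice", b"...partial bitstream..."))  # 0
```

Partial dynamic reconfiguration is what makes this safe to do concurrently: rewriting one user region leaves the other regions, and the static shell around them, running untouched.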
Status and Future Perspectives for Lattice Gauge Theory Calculations to the Exascale and Beyond
In this and a set of companion whitepapers, the USQCD Collaboration lays out a program of science and computing for lattice gauge theory. These whitepapers describe how calculations using lattice QCD (and other gauge theories) can aid the interpretation of ongoing and upcoming experiments in particle and nuclear physics, as well as inspire new ones.