8 research outputs found

    NPS: A Framework for Accurate Program Sampling Using Graph Neural Network

    Full text link
    With the end of Moore's Law, there is a growing demand for rapid architectural innovations in modern processors, such as RISC-V custom extensions, to continue performance scaling. Program sampling is a crucial step in microprocessor design, as it selects representative simulation points for workload simulation. While SimPoint has been the de-facto approach for decades, its limited expressiveness with Basic Block Vector (BBV) requires time-consuming human tuning, often taking months, which impedes fast innovation and agile hardware development. This paper introduces Neural Program Sampling (NPS), a novel framework that learns execution embeddings using dynamic snapshots of a Graph Neural Network. NPS deploys AssemblyNet for embedding generation, leveraging an application's code structures and runtime states. AssemblyNet serves as NPS's graph model and neural architecture, capturing a program's behavior in aspects such as data computation, code path, and data flow. AssemblyNet is trained with a data prefetch task that predicts consecutive memory addresses. In the experiments, NPS outperforms SimPoint by up to 63%, reducing the average error by 38%. Additionally, NPS demonstrates strong robustness with increased accuracy, reducing the expensive accuracy tuning overhead. Furthermore, NPS shows higher accuracy and generality than the state-of-the-art GNN approach in code behavior learning, enabling the generation of high-quality execution embeddings

    Empirical performance assessment using soft-core processors on reconfigurable hardware

    Full text link
    Simulation has been the de facto standard method for per-formance evaluation of newly proposed ideas in computer architecture for many years. While simulation allows for theoretically arbitrary delity (at least to the level of cycle accuracy) as well as the ability to monitor the architecture without perturbing the execution itself, it suers from low eective delity and long execution times. We (and others) have advocated the use of empirical ex-perimentation on recongurable hardware for computer ar-chitecture performance assessment. In this paper, we de-scribe an empirical performance assessment subsystem im-plemented in recongurable hardware and illustrate its use. Results are presented that demonstrate the need for the types of performance assessment that recongurable hard-ware can provide

    SimFlex: Statistical Sampling of Computer System Simulation

    Get PDF
    Timing-accurate full-system multiprocessor simulations can take years because of architecture and application complexity. Statistical sampling makes simulation-based studies feasible by providing ten-thousand-fold reductions in simulation runtime and enabling thousand-way simulation parallelis

    Improving the Simulation Environment for Computer Architecture

    Get PDF
    This work presents the efforts to improve the simulation environment for computer architecture research through two major contributions: The addition of a three level cache hierarchy and implementation of a statistical sampling simulation framework. Full-system and micro-architectural simulation are the primary and most reliable research tools that the computer architecture community has. However, keeping the simulator up to date with the latest industry products is a challenging task, causing a growing time gap between the release of new commercial products and the implementation of their models in the simulators. Another problem architects have to deal with is the performance gap; the time spent on simulating one instruction is several orders of magnitude bigger than the time the real hardware takes to execute the same instruction. This leads to prohibitively long simulation times that, due to the always efficiency-focused industry trend, is also to be increased. As processors get more complex, so do the simulators. The performance improvement achieved by real hardware changes is too small compared to the overhead induced into the simulator while trying to replicate those same changes. Although a third level (L3) cache hierarchy is a common feature in current processors and its benefits in performance have been known for decades, currently, it is not supported in most full-system simulators. A modern full system simulator was extended to include a third level cache and experiments show that for the PARSEC benchmarks, the performance of the system with L3 is ≈ 30% better than the baseline. On the other hand the implementation of statistical sampling simulation allows a greater improvement in simulation performance while statistics theory guarantees that the subset of instructions executed are a representative sample of the benchmark behaviour. The experiments show a measured CPI error of less than 2.5% while achieving simulation time speed-ups of around 3X

    Scalable reconfigurable computing leveraging latency-insensitive channels

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (p. 190-197).Traditionally, FPGAs have been confined to the limited role of small, low-volume ASIC replacements and as circuit emulators. However, continued Moore's law scaling has given FPGAs new life as accelerators for applications that map well to fine-grained parallel substrates. Examples of such applications include processor modelling, compression, and digital signal processing. Although FPGAs continue to increase in size, some interesting designs still fail to fit in to a single FPGA. Many tools exist that partition RTL descriptions across FPGAs. Unfortunately, existing tools have low performance due to the inefficiency of maintaining the cycle-by-cycle behavior of RTL among discrete FPGAs. These tools are unsuitable for use in FPGA program acceleration, as the purpose of an accelerator is to make applications run faster. This thesis presents latency-insensitive channels, a language-level mechanism by which programmers express points in their their design at which the cycle-by-cycle behavior of the design may be modified by the compiler. By decoupling the timing of portions of the RTL from the high-level function of the program, designs may be mapped to multiple FPGAs without suffering the performance degradation observed in existing tools. This thesis demonstrates, using a diverse set of large designs, that FPGA programs described in terms of latency-insensitive channels obtain significant gains in design feasibility, compilation time, and run-time when mapped to multiple FPGAs.by Kermin Elliott Fleming, Jr.Ph.D
    corecore