12 research outputs found

    An Event-Triggered Programmable Prefetcher for Irregular Workloads

    Many modern workloads compute on large amounts of data, often with irregular memory accesses. Current architectures perform poorly for these workloads, as existing prefetching techniques cannot capture the memory access patterns; these applications end up heavily memory-bound as a result. Although a number of techniques exist to explicitly configure a prefetcher with traversal patterns, gaining significant speedups, they do not generalise beyond their target data structures. Instead, we propose an event-triggered programmable prefetcher combining the flexibility of a general-purpose computational unit with an event-based programming model, along with compiler techniques to automatically generate events from the original source code with annotations. This allows more complex fetching decisions to be made, without needing to stall when intermediate results are required. Using our programmable prefetching system, combined with small prefetch kernels extracted from applications, we achieve an average 3.0x speedup in simulation for a variety of graph, database and HPC workloads.
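The event-based model described in this abstract can be illustrated with a toy sketch. This is not the authors' hardware or compiler output: the function name `run_prefetch_events`, the event names `load_B` and `fill_B`, and the `lookahead` parameter are all hypothetical, chosen only to show how one event (a load of `B[i]`) can trigger a kernel whose prefetch result later triggers a second kernel for the indirect access `A[B[i]]`, without the main loop stalling on the intermediate value.

```python
from collections import deque


def run_prefetch_events(B, n_A, lookahead=2):
    """Toy event loop for the indirect access pattern A[B[i]].

    Returns the indices of A that the (hypothetical) prefetch kernels
    requested, in the order they were issued.
    """
    prefetched_A = []
    events = deque()
    for i in range(len(B)):
        # Event 1: the main core loads B[i]; a kernel reacts to it.
        events.append(("load_B", i))
        while events:
            kind, idx = events.popleft()
            if kind == "load_B":
                j = idx + lookahead
                if j < len(B):
                    # Kernel prefetches B[j]; its arrival is a new event.
                    events.append(("fill_B", j))
            elif kind == "fill_B":
                # Event 2: B[idx] has arrived, so its value can be used
                # to prefetch the dependent element A[B[idx]].
                target = B[idx]
                if 0 <= target < n_A:
                    prefetched_A.append(target)
    return prefetched_A
```

In this sketch, calling `run_prefetch_events([3, 0, 2, 1], n_A=4)` issues prefetches for the `A` indices stored in `B[2]` and `B[3]`, since those are the entries that lie `lookahead` positions ahead of the first two loads.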

    Fast Key-Value Lookups with Node Tracker

    Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing linked data structure traversals. They also have many branches that are hard to predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, without a method for eliminating the branch mispredictions only a small fraction of the performance benefit is achieved through prefetching alone. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. Finally, we evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit and 18.8× as a pre-execution unit with BOS. With 24 cores of the latter version, we achieve a speedup of 203× and 11× over the single-core and 24-core baselines, respectively.

    Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

    Maximizing the level of parallelism in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to tackle this problem. The compiler is able to detect the data dependencies in an application and is able to analyze the specific sections of code for parallelization potential. However, all of these techniques provided with a compiler are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and producing desired application scalability. One solution to address this challenge is the use of runtime methods. This strategy can be implemented by deferring a certain amount of code analysis to runtime. In this research, we improve the performance of parallel applications generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results of the research were evaluated using an Airfoil application which showed a 40-50% improvement in parallel performance. Comment: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017).

    Preliminary multicore architecture for Introspective Computing

    Thesis (S.M.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. By Jonathan M. Eastep. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 243-245). This thesis creates a framework for Introspective Computing. Introspective Computing is a computing paradigm characterized by self-aware software. Self-aware software systems use hardware mechanisms to observe an application's execution so that they may adapt execution to improve performance, reduce power consumption, or balance user-defined fitness criteria over time-varying conditions in a system environment. We dub our framework Partner Cores. The Partner Cores framework builds upon tiled multicore architectures [11, 10, 25, 9], closely coupling cores such that one may be used to observe and optimize execution in another. Partner cores incrementally collect and analyze execution traces from code cores then exploit knowledge of the hardware to optimize execution. This thesis develops a tiled architecture for the Partner Cores framework that we dub Evolve. Evolve provides a versatile substrate upon which software may coordinate core partnerships and various forms of parallelism. To do so, Evolve augments a basic tiled architecture with introspection hardware and programmable functional units. Partner Cores software systems on the Evolve hardware may follow the style of helper threading [13, 12, 6] or utilize the programmable functional units in each core to evolve application-specific coprocessor engines. This thesis work develops two Partner Cores software systems: the Dynamic Partner-Assisted Branch Predictor and the Introspective L2 Memory System (IL2). The branch predictor employs a partner core as a coprocessor engine for general dynamic branch prediction in a corresponding code core. The IL2 retasks the silicon resources of partner cores as banks of an on-chip, distributed, software L2 cache for code cores. The IL2 employs aggressive, application-specific prefetchers for minimizing cache miss penalties and DRAM power consumption. Our results and future work show that the branch predictor is able to sustain prediction for code core branch frequencies as high as one every 7 instructions with no degradation in accuracy; updated prediction directions are available in a low minimum of 20-21 instructions. For the IL2, we develop a pixel block prefetcher for the image data structure used in a JPEG encoder benchmark and show that a 50% improvement in absolute performance is attainable.

    AMC: Advanced Multi-accelerator Controller

    The rapid advancement, use of diverse architectural features and introduction of High Level Synthesis (HLS) tools in FPGA technology have enhanced the capacity of data-level parallelism on a chip. A generic FPGA based HLS multi-accelerator system requires a microprocessor (master core) that manages memory and schedules accelerators. In a real environment, such HLS multi-accelerator systems fall short of ideal performance due to memory bandwidth issues. Thus, a system demands a memory manager and a scheduler that improves performance by managing and scheduling the multi-accelerator’s memory access patterns efficiently. In this article, we propose the integration of an intelligent memory system and efficient scheduler in the HLS-based multi-accelerator environment called Advanced Multi-accelerator Controller (AMC). The AMC system is evaluated with memory intensive accelerators and High Performance Computing (HPC) applications, and is implemented and tested on a Xilinx Virtex-5 ML505 evaluation FPGA board. The performance of the system is compared against the microprocessor-based systems that have been integrated with the operating system. Results show that the AMC based HLS multi-accelerator system achieves speedups of 10.4x and 7x compared to the MicroBlaze and Intel Core based HLS multi-accelerator systems, respectively. Peer Reviewed. Postprint (author’s final draft).

    Temporal Streaming of Shared Memory

    Coherent read misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. We propose Temporal Streaming, to eliminate coherent read misses by streaming data to a processor in advance of the corresponding memory accesses. Temporal streaming dynamically identifies address sequences to be streamed by exploiting two common phenomena in shared-memory access patterns: (1) temporal address correlation: groups of shared addresses tend to be accessed together and in the same order, and (2) temporal stream locality: recently-accessed address streams are likely to recur. We present a practical design for temporal streaming. We evaluate our design using a combination of trace-driven and cycle-accurate full-system simulation of a cache-coherent distributed shared-memory system. We show that temporal streaming can eliminate 98% of coherent read misses in scientific applications, and between 43% and 60% in database and web server workloads. Our design yields speedups of 1.07 to 3.29 in scientific applications, and 1.06 to 1.21 in commercial workloads.
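The two phenomena named in this abstract can be made concrete with a minimal sketch. It is a toy model, not the paper's design: it assumes a single processor, an unbounded miss log and stream buffer, and a hypothetical `depth` parameter for how many addresses to stream after each miss. On every miss it looks up where the same address last appeared in the log and streams the addresses that followed it.

```python
def simulate_temporal_streaming(misses, depth=4):
    """Count how many misses in a trace would be covered by streaming.

    misses: iterable of miss addresses in program order.
    depth:  number of addresses streamed after each recurring miss
            (an assumed parameter, not taken from the paper).
    """
    history = []      # ordered log of every miss address seen so far
    last_pos = {}     # address -> index of its most recent entry in history
    streamed = set()  # addresses already streamed ahead of use
    covered = 0
    for addr in misses:
        if addr in streamed:
            # The data arrived before the access: this miss is eliminated.
            covered += 1
            streamed.discard(addr)
        prev = last_pos.get(addr)
        if prev is not None:
            # Temporal address correlation: replay what followed addr last time.
            streamed.update(history[prev + 1 : prev + 1 + depth])
        last_pos[addr] = len(history)
        history.append(addr)
    return covered
```

For the repeating trace `[1, 2, 3, 4, 1, 2, 3, 4]`, the second access to address 1 is still a miss, but it triggers streaming of 2, 3 and 4, so the remaining three accesses are covered. A trace with no recurring addresses is never covered in this model.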

    Symbiotic Subordinate Threading (SST)

    Integration of multiple processor cores on a single die, relatively constant die sizes, increasing memory latencies, and emerging new applications create new challenges and opportunities for processor architects. How can one build a multi-core processor that provides high single-thread performance while enabling high throughput through multi-programming? Conventional approaches for high single-thread performance use a large instruction window for memory latency tolerance, which requires large and complex cores. However, to be able to integrate more cores on the same die for high throughput, cores must be simpler and smaller. We present an architecture that obtains high performance for single-threaded applications in a multi-core environment, while using simpler cores to meet the high throughput requirement. Our scheme, called Symbiotic Subordinate Threading (SST), achieves the benefits of a large instruction window by utilizing otherwise idle cores to run dynamically constructed subordinate threads (a.k.a. helper threads) for the individual threads running on the active cores. In our proposed execution paradigm, the subordinate thread fetches and pre-processes instruction streams and retires processed instructions into a buffer for the main thread to consume. The subordinate thread executes a smaller version of the program executed by the main thread. As a result, it runs far ahead to warm up the data caches and fix branch mispredictions for the main thread. In-flight instructions are present in the subordinate thread, the buffer, and the main thread, forming a very large effective instruction window for single-thread out-of-order execution. Moreover, using a simple technique of identifying the subordinate thread's non-speculative results, the main thread can integrate those results directly into its state without having to execute their corresponding instructions. In this way, the main thread is sped up because it also executes a smaller version of the program, and the total number of instructions executed is minimized, thereby achieving an efficient utilization of the hardware resources. The proposed SST architecture does not require large register files, issue queues, load/store queues, or reorder buffers. In addition, it incurs only minor hardware additions/changes. Experimental results show remarkable latency-hiding capabilities of the proposed SST architecture, outperforming existing architectures that share similar high-level microarchitecture.

    A novel access pattern-based multi-core memory architecture

    Increasingly, High-Performance Computing (HPC) applications run on heterogeneous multi-core platforms. The basic reason for the growing popularity of these architectures is their low power consumption and high-throughput-oriented nature. However, this throughput imposes a requirement that data also be supplied to the multi-core system at high throughput. This makes efficient management of on-chip and off-chip memory data transfers necessary, which is a significant challenge. Complex regular and irregular memory data transfer patterns are becoming widely dominant for a range of application domains including scientific, image and signal processing. Data accesses can be arranged in independent patterns that an efficient memory management can exploit. The software based approaches using general purpose caches and on-chip memories are beneficial to some extent. However, the task of efficient data management for throughput-oriented devices could be improved by providing hardware mechanisms that exploit the knowledge of access patterns in memory management and scheduling of accesses for a heterogeneous multi-core architecture. The focus of this thesis is to present architectural explorations for a novel access pattern-based multi-core memory architecture. In general, the thesis covers four main aspects of the memory system: i) Uni-core Memory System for Regular Data Pattern; ii) Multi-core Memory System for Regular Data Pattern; iii) Uni-core Memory System for Irregular Data Pattern; and iv) Multi-core Memory System for Irregular Data Pattern.

    Compiler and Runtime Optimization Techniques for Implementation Scalable Parallel Applications

    The compiler is able to detect the data dependencies in an application and is able to analyze the specific sections of code for parallelization potential. However, all of these techniques provided by a compiler are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and desired application scalability. These compiler techniques should consider both the static information gathered at compile time and dynamic analysis captured at runtime about the system to generate a safe parallel application. On the other hand, runtime information is often speculative. Solely relying on it doesn't guarantee maximal parallel performance. So collecting information at compile time could significantly improve the performance of runtime techniques. This research achieves that goal by introducing new techniques for both the compiler and the runtime system that enable them to cooperate with each other and utilize both static and dynamic analysis information to maximize application parallel performance. In the proposed framework, a compiler can implement dynamic runtime methods in its parallelization optimizations and a runtime system can apply static information in the implementation of its parallelization methods. The proposed techniques are able to use high-level programming abstractions and machine learning to relieve the programmer of difficult and tedious decisions that can significantly affect program behavior and performance.