106 research outputs found

    Using an Adaptive HPC Runtime System to Reconfigure the Cache Hierarchy

    Full text link
    The cache hierarchy often consumes a large portion of a processor’s energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations associated with other methods that reconfigure the caches: predicting the application’s future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application’s pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to find the best configuration in parallel quickly. Our experiments using cycle-level simulations indicate that 67 % of the cache energy can be saved with only a 2.4 % performance penalty on average. Moreover, we demonstrate that, for some applica-tions, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30 % and save 75 % of the cache energy. I

    FPGA-based Query Acceleration for Non-relational Databases

    Get PDF
    Database management systems are an integral part of today’s everyday life. Trends like smart applications, the internet of things, and business and social networks require applications to deal efficiently with data in various data models close to the underlying domain. Therefore, non-relational database systems provide a wide variety of database models, like graphs and documents. However, current non-relational database systems face performance challenges due to the end of Dennard scaling and therefore performance scaling of CPUs. In the meanwhile, FPGAs have gained traction as accelerators for data management. Our goal is to tackle the performance challenges of non-relational database systems with FPGA acceleration and, at the same time, address design challenges of FPGA acceleration itself. Therefore, we split this thesis up into two main lines of work: graph processing and flexible data processing. Because of the lacking benchmark practices for graph processing accelerators, we propose GraphSim. GraphSim is able to reproduce runtimes of these accelerators based on a memory access model of the approach. Through this simulation environment, we extract three performance-critical accelerator properties: asynchronous graph processing, compressed graph data structure, and multi-channel memory. Since these accelerator properties have not been combined in one system, we propose GraphScale. GraphScale is the first scalable, asynchronous graph processing accelerator working on a compressed graph and outperforms all state-of-the-art graph processing accelerators. Focusing on accelerator flexibility, we propose PipeJSON as the first FPGA-based JSON parser for arbitrary JSON documents. PipeJSON is able to achieve parsing at line-speed, outperforming the fastest, vectorized parsers for CPUs. Lastly, we propose the subgraph query processing accelerator GraphMatch which outperforms state-of-the-art CPU systems for subgraph query processing and is able to flexibly switch queries during runtime in a matter of clock cycles

    An FPGA implementation of an investigative many-core processor, Fynbos : in support of a Fortran autoparallelising software pipeline

    Get PDF
    Includes bibliographical references.In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near threshold voltage operation. Performance is there- fore dependant on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather not to the programmer or to an application specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core exists. This work implements a many-core architecture to match it. Prototyping the design on an FPGA, it is possible to examine the real world performance of the compiler-architecture system to a greater degree than simulation only would allow. Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but to also significantly under perform relative to current competing architectures. This failing is apportioned to taking the need for simple hardware too far, and an inability to implement static scheduling mitigating tactics due to lack of support for such in the compiler

    A Reconfigurable Processor for Heterogeneous Multi-Core Architectures

    Get PDF
    A reconfigurable processor is a general-purpose processor coupled with an FPGA-like reconfigurable fabric. By deploying application-specific accelerators, performance for a wide range of applications can be improved with such a system. In this work concepts are designed for the use of reconfigurable processors in multi-tasking scenarios and as part of multi-core systems

    Towards Closing the Programmability-Efficiency Gap using Software-Defined Hardware

    Full text link
    The past decade has seen the breakdown of two important trends in the computing industry: Moore’s law, an observation that the number of transistors in a chip roughly doubles every eighteen months, and Dennard scaling, that enabled the use of these transistors within a constant power budget. This has caused a surge in domain-specific accelerators, i.e. specialized hardware that deliver significantly better energy efficiency than general-purpose processors, such as CPUs. While the performance and efficiency of such accelerators are highly desirable, the fast pace of algorithmic innovation and non-recurring engineering costs have deterred their widespread use, since they are only programmable across a narrow set of applications. This has engendered a programmability-efficiency gap across contemporary platforms. A practical solution that can close this gap is thus lucrative and is likely to engender broad impact in both academic research and the industry. This dissertation proposes such a solution with a reconfigurable Software-Defined Hardware (SDH) system that morphs parts of the hardware on-the-fly to tailor to the requirements of each application phase. This system is designed to deliver near-accelerator-level efficiency across a broad set of applications, while retaining CPU-like programmability. The dissertation first presents a fixed-function solution to accelerate sparse matrix multiplication, which forms the basis of many applications in graph analytics and scientific computing. The solution consists of a tiled hardware architecture, co-designed with the outer product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate each phase of the algorithm. A proof-of-concept is then presented in the form of a prototyped 40 nm Complimentary Metal-Oxide Semiconductor (CMOS) chip that demonstrates energy efficiency and performance per die area improvements of 12.6x and 17.1x over a high-end CPU, and serves as a stepping stone towards a full SDH system. The next piece of the dissertation enhances the proposed hardware with reconfigurability of the dataflow and resource sharing modes, in order to extend acceleration support to a set of common parallelizable workloads. This reconfigurability lends the system the ability to cater to discrete data access and compute patterns, such as workloads with extensive data sharing and reuse, workloads with limited reuse and streaming access patterns, among others. Moreover, this system incorporates commercial cores and a prototyped software stack for CPU-level programmability. The proposed system is evaluated on a diverse set of compute-bound and memory-bound kernels that compose applications in the domains of graph analytics, machine learning, image and language processing. The evaluation shows average performance and energy-efficiency gains of 5.0x and 18.4x over the CPU. The final part of the dissertation proposes a runtime control framework that uses low-cost monitoring of hardware performance counters to predict the next best configuration and reconfigure the hardware, upon detecting a change in phase or nature of data within the application. In comparison to prior work, this contribution targets multicore CGRAs, uses low-overhead decision tree based predictive models, and incorporates reconfiguration cost-awareness into its policies. Compared to the best-average static (non-reconfiguring) configuration, the dynamically reconfigurable system achieves a 1.6x improvement in performance-per-Watt in the Energy-Efficient mode of operation, or the same performance with 23% lower energy in the Power-Performance mode, for SpMM across a suite of real-world inputs. The proposed reconfiguration mechanism itself outperforms the state-of-the-art approach for dynamic runtime control by up to 2.9x in terms of energy-efficiency.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169859/1/subh_1.pd

    Towards adaptive balanced computing (ABC) using reconfigurable functional caches (RFCs)

    Get PDF
    The general-purpose computing processor performs a wide range of functions. Although the performance of general-purpose processors has been steadily increasing, certain software technologies like multimedia and digital signal processing applications demand ever more computing power. Reconfigurable computing has emerged to combine the versatility of general-purpose processors with the customization ability of ASICs. The basic premise of reconfigurability is to provide better performance and higher computing density than fixed configuration processors. Most of the research in reconfigurable computing is dedicated to on-chip functional logic. If computing resources are adaptable to the computing requirement, the maximum performance can be achieved. To overcome the gap between processor and memory technology, the size of on-chip cache memory has been consistently increasing. The larger cache memory capacity, though beneficial in general, does not guarantee a higher performance for all the applications as they may not utilize all of the cache efficiently. To utilize on-chip resources effectively and to accelerate the performance of multimedia applications specifically, we propose a new architecture---Adaptive Balanced Computing (ABC). ABC uses dynamic resource configuration of on-chip cache memory by integrating Reconfigurable Functional Caches (RFC). RFC can work as a conventional cache or as a specialized computing unit when necessary. In order to convert a cache memory to a computing unit, we include additional logic to embed multi-bit output LUTs into the cache structure. We add the reconfigurability of cache memory to a conventional processor with minimal modification to the load/store microarchitecture and with minimal compiler assistance. ABC architecture utilizes resources more efficiently by reconfiguring the cache memory to computing units dynamically. The area penalty for this reconfiguration is about 50--60% of the memory cell cache array-only area with faster cache access time. In a base array cache (parallel decoding caches), the area penalty is 10--20% of the data array with 1--2% increase in the cache access time. However, we save 27% for FIR and 44% for DCT/IDCT in area with respect to memory cell array cache and about 80% for both applications with respect to base array cache if we were to implement all these units separately (such as ASICs). The simulations with multimedia and DSP applications (DCT/IDCT and FIR/IIR) show that the resource configuration with the RFC speedups ranging from 1.04X to 3.94X in overall applications and from 2.61X to 27.4X in the core computations. The simulations with various parameters indicate that the impact of reconfiguration can be minimized if an appropriate cache organization is selected

    RISPP: A Run-time Adaptive Reconfigurable Embedded Processor

    Get PDF
    This Ph.D. thesis describes a new approach for adaptive processors using a reconfigurable fabric (embedded FPGA) to implement application-specific accelerators. A novel modular Special Instruction composition is presented along with a run-time system that exploits the provided adaptivity. The approach was simulated and prototyped using and FPGA. Comparisons with state-of-the-art appl.-specific and reconf. processors demonstrate significant improvements according the performance and efficiency

    The Design of a System Architecture for Mobile Multimedia Computers

    Get PDF
    This chapter discusses the system architecture of a portable computer, called Mobile Digital Companion, which provides support for handling multimedia applications energy efficiently. Because battery life is limited and battery weight is an important factor for the size and the weight of the Mobile Digital Companion, energy management plays a crucial role in the architecture. As the Companion must remain usable in a variety of environments, it has to be flexible and adaptable to various operating conditions. The Mobile Digital Companion has an unconventional architecture that saves energy by using system decomposition at different levels of the architecture and exploits locality of reference with dedicated, optimised modules. The approach is based on dedicated functionality and the extensive use of energy reduction techniques at all levels of system design. The system has an architecture with a general-purpose processor accompanied by a set of heterogeneous autonomous programmable modules, each providing an energy efficient implementation of dedicated tasks. A reconfigurable internal communication network switch exploits locality of reference and eliminates wasteful data copies

    Performance and Memory Space Optimizations for Embedded Systems

    Get PDF
    Embedded systems have three common principles: real-time performance, low power consumption, and low price (limited hardware). Embedded computers use chip multiprocessors (CMPs) to meet these expectations. However, one of the major problems is lack of efficient software support for CMPs; in particular, automated code parallelizers are needed. The aim of this study is to explore various ways to increase performance, as well as reducing resource usage and energy consumption for embedded systems. We use code restructuring, loop scheduling, data transformation, code and data placement, and scratch-pad memory (SPM) management as our tools in different embedded system scenarios. The majority of our work is focused on loop scheduling. Main contributions of our work are: We propose a memory saving strategy that exploits the value locality in array data by storing arrays in a compressed form. Based on the compressed forms of the input arrays, our approach automatically determines the compressed forms of the output arrays and also automatically restructures the code. We propose and evaluate a compiler-directed code scheduling scheme, which considers both parallelism and data locality. It analyzes the code using a locality parallelism graph representation, and assigns the nodes of this graph to processors.We also introduce an Integer Linear Programming based formulation of the scheduling problem. We propose a compiler-based SPM conscious loop scheduling strategy for array/loop based embedded applications. The method is to distribute loop iterations across parallel processors in an SPM-conscious manner. The compiler identifies potential SPM hits and misses, and distributes loop iterations such that the processors have close execution times. We present an SPM management technique using Markov chain based data access. We propose a compiler directed integrated code and data placement scheme for 2-D mesh based CMP architectures. Using a Code-Data Affinity Graph (CDAG) to represent the relationship between loop iterations and array data, it assigns the sets of loop iterations to processing cores and sets of data blocks to on-chip memories. We present a memory bank aware dynamic loop scheduling scheme for array intensive applications.The goal is to minimize the number of memory banks needed for executing the group of loop iterations
    • …
    corecore