Search CORE

128 research outputs found

Multiple Supply Voltages aware Energy-efficient High-level Synthesis for HDR Architectures

Author: 阿部晋矢
Publication venue: 戸川, 望
Publication date: 06/02/2012
Field of study

Compiler and Architecture Design for Coarse-Grained Programmable Accelerators

Author
Publication venue
Publication date: 01/01/2015
Field of study

abstract: The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density increases significantly by technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, increasing energy-efficiency must be addressed at all design levels from circuit level to application and algorithm levels. At architectural level, one promising approach is to populate the system with hardware accelerators each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy-efficiency and programmability. Due to intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism. Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consists of a number of word-level functional units. Motivated by promising characteristics of software programmable accelerators, the potentials of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and CGRA compiler. First, the design and implementation of a CGRA and its instruction set is presented. This design is then modeled in a cycle accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize CGRA scarce resources very well to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRADissertation/ThesisDoctoral Dissertation Computer Science 201

ASU Digital Repository

Recommended from our members

Scalable hardware memory disambiguation

Author: Sethumadhavan Lakshminarasimhan, 1978-
Publication venue
Publication date: 01/12/2007
Field of study

This dissertation deals with one of the long-standing problems in Computer Architecture – the problem of memory disambiguation. Microprocessors typically reorder memory instructions during execution to improve concurrency. Such microprocessors use hardware memory structures for memory disambiguation, known as LoadStore Queues (LSQs), to ensure that memory instruction dependences are satisfied even when the memory instructions execute out-of-order. A typical LSQ implementation (circa 2006) holds all in-flight memory instructions in a physically centralized LSQ and performs a fully associative search on all buffered instructions to ensure that memory dependences are satisfied. These LSQ implementations do not scale because they use large, fully associative structures, which are known to be slow and power hungry. The increasing trend towards distributed microarchitectures further exacerbates these problems. As on-chip wire delays increase and high-performance processors become necessarily distributed, centralized structures such as the LSQ can limit scalability. This dissertation describes techniques to create scalable LSQs in both centralized and distributed microarchitectures. The problems and solutions described in this thesis are motivated and validated by real system designs. The dissertation starts with a description of the partitioned primary memory system of the TRIPS processor, of which the LSQ is an important component, and then through a series of optimizations describes how the power, area, and centralization problems of the LSQ can be solved with minor performance losses (if at all) even for large number of in flight memory instructions. The four solutions described in this dissertation — partitioning, filtering, late binding and efficient overflow management — enable power-, area-efficient, distributed and scalable LSQs, which in turn enable aggressive large-window processors capable of simultaneously executing thousands of instructions. To mitigate the power problem, we replaced the power-hungry, fully associative search with a power-efficient hash table lookup using a simple address-based Bloom filter. Bloom filters are probabilistic data structures used for testing set membership and can be used to quickly check if an instruction with the same data address is likely to be found in the LSQ without performing the associative search. Bloom filters typically eliminate more than 80% of the associative searches and they are highly effective because in most programs, it is uncommon for loads and stores to have the same data address and be in execution simultaneously. To rectify the area problem, we observe the fact that only a small fraction of all memory instructions are dependent, that only such dependent instructions need to be buffered in the LSQ, and that these instructions need to be in the LSQ only for certain parts of the pipelined execution. We propose two mechanisms to exploit these observations. The first mechanism, area filtering, is a hardware mechanism that couples Bloom filters and dependence predictors to dynamically identify and buffer only those instructions which are likely to be dependent. The second mechanism, late binding, reduces the occupancy and hence size of the LSQ. Both of these optimizations allows the number of LSQ slots to be reduced by up to one-half compared to a traditional organization without any performance degradation. Finally, we describe a new decentralized LSQ design for handling LSQ structural hazards in distributed microarchitectures. Decentralization of LSQs, and to a large extent distributed microarchitectures with memory speculation, has proved to be impractical because of the high performance penalties associated with the mechanisms for dealing with hazards. To solve this problem, we applied classic flow-control techniques from interconnection networks for handling resource con- flicts. The first method, memory-side buffering, buffers the overflowing instructions in a separate buffer near the LSQs. The second scheme, execution-side NACKing, sends the overflowing instruction back to the issue window from which it is later re-issued. The third scheme, network buffering, uses the buffers in the interconnection network between the execution units and memory to hold instructions when the LSQ is full, and uses virtual channel flow control to avoid deadlocks. The network buffering scheme is the most robust of all the overflow schemes and shows less than 1% performance degradation due to overflows for a subset of SPEC CPU 2000 and EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS processor. The techniques proposed in this dissertation are independent, architectureneutral and their cumulative benefits result in LSQs that can be partitioned at a fine granularity and have low design complexity. Each of these partitions selectively buffers only memory instructions with true dependences and can be closely coupled with the execution units thus minimizing power, area, and latency. Such LSQ designs with near-ideal characteristics are well suited for microarchitectures with thousands of instructions in-flight and may enable even more aggressive microarchitectures in the future.Computer Science

Texas ScholarWorks

빠른 성능조건 만족을 위한 임계경로를 고려하는 상위 수준 합성

Author: 이석현
Publication venue: 서울대학교 대학원
Publication date: 01/02/2014
Field of study

학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2014. 2. 최기영.Rapid advancement of process technology enables designers to integrate various functions onto a single chip and to realize diverse requirements of customers, but productivity of system designers has improved too slowly to make optimal design in time-to-market. Since designing at higher levels of abstraction reduces the number of design instances to be considered to acquire an optimal design, it improves quality of system as well as reduces design time and cost. High-level synthesis, which maps behavioral description models to register-transfer models, can improve design productivity drastically, and thus, it has been one of the important issues in electronic system level design. Centralized controllers commonly used in high-level synthesis often require long wires and cause high load capacitance, and that is why critical paths typically occur on paths from controllers to data registers instead of paths from data registers to data registers. However, conventional high-level synthesis has focused on delays within a datapath, making it difficult to solve the timing closure problem during physical synthesis. This thesis presents hardware architecture with a distributed controller, which makes the timing closure problem much easier. A novel critical-path-aware high-level synthesis flow is also presented for synthesizing such hardware through datapath partitioning, register binding, and controller optimization. We explore the design space related to the number of partitions, which is an important design parameter for target architecture. According to our experiments, the proposed approach reduces the critical path delay excluding FUs by 29.3% and that including FUs by 10.0%, with 2.2% area overhead on average compared to centralized controller architecture. We also propose two approaches, clock gating and register constrained flow, to alleviate high peak current problem which is caused by the proposed approach. These approaches suppress the peak current overhead to keep it less than 3.6%.Chapter 1 Introduction 1 Chapter 2 Background 7 2.1 High-level Synthesis 7 2.2 Subtasks of High-level Synthesis 8 2.2.1 Operation Scheduling and FU Binding 8 2.2.2 Register Binding 10 2.2.3 Controller Synthesis 11 2.2.4 Functional Pipelining Technique for High-level Synthesis 11 2.3 Centralized Controller Architecture 12 2.4 Design Closure Problem in High-level Synthesis 15 2.5 Thesis Contribution 18 Chapter 3 Target Architecture and Overall flow 21 3.1 Target Architecture 21 3.2 Overall flow 23 Chapter 4 Critical-Path-Aware Datapath Partitioning 27 4.1 Introduction 27 4.2 Problem Formulation 30 4.3 Proposed Algorithm 32 4.4 Exploring Design Space for the Number of Partitions 36 Chapter 5 Critical-Path-Aware Register Binding 39 5.1 Introduction 39 5.2 Problem Formulation 40 5.3 Proposed Algorithm 43 Chapter 6 Critical-Path-Aware Controller Optimization 49 6.1 Introduction 49 6.2 Problem Formulation 50 6.3 Proposed Algorithm 55 Chapter 7 Evaluation 63 7.1 Experimental Setup 63 7.2 Design Parameters and Computation Time 66 7.3 Analysis Critical Path Delay on Distributed Controller Architecture 68 7.4 Analysis of Performance and Area 70 7.5 Energy Consumption 78 7.6 Analysis on Register Overhead 80 7.6.1 Clock Gating Approach 81 7.6.2 Register Constrained Approach 84 7.6.3 Combined Approach 86 7.7 Join to Conventional Optimization Techniques on HLS 87 7.8 Comparison with DRFM Binding Approach 87 Chapter 8 Conclusion and Future Work 89 8.1 Summary 89 8.2 Future Work 90 Bibliography 93 Abstract in Korean 103Docto

SNU Open Repository and Archive

High-level automation of custom hardware design for high-performance computing

Author: Papakonstantinou Alexandros
Publication venue
Publication date: 01/12/2012
Field of study

This dissertation focuses on efficient generation of custom processors from high-level language descriptions. Our work exploits compiler-based optimizations and transformations in tandem with high-level synthesis (HLS) to build high-performance custom processors. The goal is to offer a common multiplatform high-abstraction programming interface for heterogeneous compute systems where the benefits of custom reconfigurable (or fixed) processors can be exploited by the application developers. The research presented in this dissertation supports the following thesis: In an increasingly heterogeneous compute environment it is important to leverage the compute capabilities of each heterogeneous processor efficiently. In the case of FPGA and ASIC accelerators this can be achieved through HLS-based flows that (i) extract parallelism at coarser than basic block granularities, (ii) leverage common high-level parallel programming languages, and (iii) employ high-level source-to-source transformations to generate high-throughput custom processors. First, we propose a novel HLS flow that extracts instruction level parallelism beyond the boundary of basic blocks from C code. Subsequently, we describe FCUDA, an HLS-based framework for mapping fine-grained and coarse-grained parallelism from parallel CUDA kernels onto spatial parallelism. FCUDA provides a common programming model for acceleration on heterogeneous devices (i.e. GPUs and FPGAs). Moreover, the FCUDA framework balances multilevel granularity parallelism synthesis using efficient techniques that leverage fast and accurate estimation models (i.e. do not rely on lengthy physical implementation tools). Finally, we describe an advanced source-to-source transformation framework for throughput-driven parallelism synthesis (TDPS), which appropriately restructures CUDA kernel code to maximize throughput on FPGA devices. We have integrated the TDPS framework into the FCUDA flow to enable automatic performance porting of CUDA kernels designed for the GPU architecture onto the FPGA architecture

Illinois Digital Environment for Access to Learning and Scholarship Repository

Performance Aspects of Synthesizable Computing Systems

Author: Schleuniger Pascal
Publication venue: Technical University of Denmark
Publication date: 01/01/2014
Field of study

Online Research Database In Technology

Multicore architecture optimizations for HPC applications

Author: Milić Uglješa
Publication venue: Universitat Politècnica de Catalunya
Publication date: 22/11/2017
Field of study

From single-core CPUs to detachable compute accelerators, supercomputers made a tremendous progress by using available transistors on chip and specializing hardware for a given type of computation. Today, compute nodes used in HPC employ multi-core CPUs tailored for serial execution and multiple accelerators (many-core devices or GPUs) for throughput computing. However, designing next-generation HPC system requires not only the performance improvement but also better energy efficiency. Current trend of reaching exascale level of computation asks for at least an order of magnitude increase in both of these metrics. This thesis explores HPC-specific optimizations in order to make better utilization of the available transistors and to improve performance by transparently executing parallel code across multiple GPU accelerators. First, we analyze several HPC benchmark suites, compare them against typical desktop applications, and identify the differences which advocate for proper core tailoring. Moreover, within the HPC applications, we evaluate serial and parallel code sections separately, resulting in an Asymmetric Chip Multiprocessor (ACMP) design with one core optimized for single-thread performance and many lean cores for parallel execution. Our results presented here suggests downsizing of core front-end structures providing an HPC-tailored lean core which saves 16% of the core area and 7% of power, without performance loss. Further improving an ACMP design, we identify that multiple lean cores run the same code during parallel regions. This motivated us to evaluate the idea where lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections presented in this thesis show additional 11% area savings with a 5% energy reduction at no performance cost. These area and power savings might be attractive for many-core accelerators either for increasing the performance per area and power unit, or adding additional cores and thus improving the performance for the same hardware budget. Finally, in this thesis we study the effects of future NUMA accelerators comprised of multiple GPU devices. Reaching the limits of a single-GPU die size, next-generation GPU compute accelerators will likely embrace multi-socket designs increasing the core count and memory bandwidth. However, maintaining the UMA behavior of a single-GPU in multi-GPU systems without code rewriting stands as a challenge. We investigate multi-socket NUMA GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 sockets designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for performance scaling of future compute nodes used in HPC.Empezando por CPUs de un solo procesador, y pasando por aceleradores discretos, los supercomputadores han avanzado enormemente utilizando todos los transistores disponibles en el chip, y especializando los diseños para cada tipo de cálculo. Actualmente, los nodos de cálculo de un sistema de Computación de Altas Prestaciones (CAP) utilizan CPUs de múltiples procesadores, optimizados para el cálculo serial de instrucciones, y múltiples aceleradores (aceleradores gráficos, o many-core), optimizados para el cálculo paralelo. El diseño de un sistema CAP de nueva generación requiere no solo mejorar el rendimiento de cálculo, sino también mejorar la eficiencia energética. La siguiente generación de sistemas requiere mejorar un orden de magnitud en ambas métricas simultáneamente. Esta tesis doctoral explora optimizaciones específicas para sistemas CAP para hacer un mejor uso de los transistores, y para mejorar las prestaciones de forma transparente ejecutando las aplicaciones en múltiples aceleradores en paralelo. Primero, analizamos varios conjuntos de aplicaciones CAP, y las comparamos con aplicaciones para servidores y escritorio, identificando las principales diferencias que nos indican cómo ajustar la arquitectura para CAP. En las aplicaciones CAP, también analizamos la parte secuencial del código y la parte paralela de forma separada, . El resultado de este análisis nos lleva a proponer una arquitectura multiprocesador asimétrica (ACMP) , con un procesador optimizado para el código secuencial, y múltiples procesadores, más pequeños, optimizados para el procesamiento paralelo. Nuestros resultados muestran que reducir el tamaño de las estructuras del front-end (fetch, y predicción de saltos) en los procesadores paralelos nos proporciona un 16% extra de área en el chip, y una reducción de consumo del 7%. Como mejora a nuestra arquitectura ACMP, proponemos explotar el hecho de que todos los procesadores paralelos ejecutan el mismo código al mismo tiempo. Evaluamos una propuesta en que los procesadores paralelos comparten la caché de instrucciones, con la intención de que uno de ellos precargue las instrucciones para los demás procesadores (prefetching), sin aumentar la latencia media de acceso. Nuestra exploración de los distintos parámetros determina que el punto óptimo requiere una interconexión de alto ancho de banda para acceder a la caché compartida, y el uso de unos pocos line buffers para mantener el ancho de banda y la latencia necesarios. Nuestras proyecciones muestran un ahorro adicional del 11% en área y el 5% en energía, sin impacto en el rendimiento. Estos ahorros de área y energía permiten a un multiprocesador incrementar la eficiencia energética, o aumentar el rendimiento añadiendo procesador adicionales. Por último, estudiamos el efecto de usar múltiples aceleradores (GPU) en una arquitectura con tiempo de acceso a memoria no uniforme (NUMA). Una vez alcanzado el límite de número de transistores y tamaño máximo por chip, la siguiente generación de aceleradores deberá utilizar múltiples chips para aumentar el número de procesadores y el ancho de banda de acceso a memoria. Sin embargo, es muy difícil mantener la ilusión de un tiempo de acceso a memoria uniforme en un sistema multi-GPU sin reescribir el código de la aplicación. Nuestra investigación sobre sistemas multi-GPU muestra retos significativos en el diseño de la interconexión entre las GPU y la jerarquía de memorias cache. Nuestros resultados muestran que se puede explotar el comportamiento en fases de las aplicaciones para optimizar la configuración de la interconexión y las cachés de forma dinámica, minimizando el impacto de la arquitectura NUMA. Nuestro diseño mejora el rendimiento de un sistema con una única GPU en 1.5x, 2.3x y 3.2x (el 89%, 84%, y 76% del máximo teórico) usando 2, 4, y 8 GPUs en paralelo. Siendo su implementación posible hoy en dia, los nodos de cálculo con múltiples aceleradores son una alternativa atractiva para futuros sistemas CAP.Postprint (published version

UPCommons. Portal del coneixement obert de la UPC

Multicore architecture optimizations for HPC applications

Author: Milić Uglješa
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2017
Field of study

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Phantom redundancy: a register transfer level technique for gracefully degradable data path synthesis

Author: B. Iyer
I. Koren
R. Karri
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

Characterization and Avoidance of Critical Pipeline Structures in Aggressive Superscalar Processors

Author: Sassone Peter G.
Publication venue: Georgia Institute of Technology
Publication date: 20/07/2005
Field of study

In recent years, with only small fractions of modern processors now accessible in a single cycle, computer architects constantly fight against propagation issues across the die. Unfortunately this trend continues to shift inward, and now the even most internal features of the pipeline are designed around communication, not computation. To address the inward creep of this constraint, this work focuses on the characterization of communication within the pipeline itself, architectural techniques to avoid it when possible, and layout co-design for early detection of problems. I present work in creating a novel detection tool for common case operand movement which can rapidly characterize an applications dataflow patterns. The results produced are suitable for exploitation as a small number of patterns can describe a significant portion of modern applications. Work on dynamic dependence collapsing takes the observations from the pattern results and shows how certain groups of operations can be dynamically grouped, avoiding unnecessary communication between individual instructions. This technique also amplifies the efficiency of pipeline data structures such as the reorder buffer, increasing both IPC and frequency. I also identify the same sets of collapsible instructions at compile time, producing the same benefits with minimal hardware complexity. This technique is also done in a backward compatible manner as the groups are exposed by simple reordering of the binarys instructions. I present aggressive pipelining approaches for these resources which avoids the critical timing often presumed necessary in aggressive superscalar processors. As these structures are designed for the worst case, pipelining them can produce greater frequency benefit than IPC loss. I also use the observation that the dynamic issue order for instructions in aggressive superscalar processors is predictable. Thus, a hardware mechanism is introduced for caching the wakeup order for groups of instructions efficiently. These wakeup vectors are then used to speculatively schedule instructions, avoiding the dynamic scheduling when it is not necessary. Finally, I present a novel approach to fast and high-quality chip layout. By allowing architects to quickly evaluate what if scenarios during early high-level design, chip designs are less likely to encounter implementation problems later in the process.Ph.D.Committee Chair: Scott Wills; Committee Member: David Schimmel; Committee Member: Gabriel Loh; Committee Member: Hsien-Hsin Lee; Committee Member: Yorai Ward

Scholarly Materials And Research @ Georgia Tech