128 research outputs found

    Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments

    In this work, we adapt a reconfigurable computer system based on FPGA technologies to OpenCL programming environments. The reconfigurable system is part of a compute prototype of the MANGO European project that includes 96 FPGAs. To optimize its use and obtain maximum performance, it is essential to adapt it to heterogeneous-system programming environments such as OpenCL, which simplify its programming and use. This work carries out all the activities needed to correctly implement the software and hardware layers required for its use in OpenCL, and evaluates the performance obtained and the flexibility offered by the resulting solution. The work was performed during a five-month internship at the Universitat Politècnica de València (UPV), linked to an agreement between UPV and UniNa (Università degli Studi di Napoli Federico II).
    Russo, D. (2020). Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments. http://hdl.handle.net/10251/150393 (TFG)

    Graphics Processing Unit-Based Computer-Aided Design Algorithms for Electronic Design Automation

    Electronic design automation (EDA) tools are a specific set of software that plays an important role in modern integrated circuit (IC) design. These tools automate the stages of the IC design process. Two of these stages are the focus of this research: floorplanning and global routing. Specifically, the goal of this study is to parallelize these two tools so that their execution time can be significantly shortened on modern multi-core and graphics processing unit (GPU) architectures. The GPU is a massively parallel architecture that enables thousands of independent threads to execute concurrently. Although a small set of EDA tools can benefit from GPU acceleration, most algorithms in this field are designed with the single-core paradigm in mind. Floorplanning and global routing algorithms are among the latter, and it is difficult to obtain any speedup for them on the GPU because of their inherently sequential nature. This work parallelizes the floorplanning and global routing algorithms through a novel approach and achieves significant speedups for both tools on GPU hardware. Specifically, with a complete overhaul of solution space and design space exploration, a GPU-based floorplanning algorithm achieves a 4-166× speedup while producing similar or improved solutions compared with the sequential algorithm. The GPU-based global routing algorithm is shown to achieve significant speedup against existing state-of-the-art routers while delivering competitive solution quality. Importantly, this parallel model for global routing produces a stable solution that is independent of the level of parallelism. In summary, this research shows that, through a design paradigm overhaul, sequential algorithms can also benefit from a massively parallel architecture. The findings of this study have a positive impact on the efficiency and design quality of the modern EDA design flow.

    Hierarchical and Nesting Approaches for the Facility Layout Problem

    The Facility Layout Problem (FLP) seeks to determine the dimensions, coordinates and arrangement of rectangular departments within a given facility. The goal is to minimize the cost of inter-department flow. It has several real-world applications, including the design of manufacturing and warehousing facilities and electronic chips. Despite being studied for several decades, the FLP is still very difficult to solve for facilities with thirty or more departments. Thus, many heuristic approaches have been developed to solve the problem in a reasonable time. One such approach tackles the problem in two stages. In the first, some decision, usually the relative positioning of the departments, is fixed. In the second, an easier restricted problem is solved. This thesis explores hierarchical and nesting approaches for the FLP in an attempt to leverage the fact that smaller instances of the FLP can be solved to optimality relatively quickly. The goal is to find ways in which the FLP can be decomposed into several smaller problems and recombined to form a high-quality solution to the original problem. Hierarchical approaches use clustering or related methods to generate a tree where the leaves are the original departments and the root is the facility. The intermediate nodes are super-departments within an overall layout. A new hierarchical approach for the FLP is presented which performs layouts down this tree in a manner that controls dead-space and generates high-quality solutions. The approach provides solutions competitive with the best-known solutions on benchmark instances from the literature, with up to 8% improvement. The success of the hierarchical approach provided the motivation for a new formulation that nests departments within super-departments. The resulting formulation is even more difficult to solve directly than the original FLP; however, it is suitable for a two-stage solution approach. The first stage determines the assignment of departments to super-departments and the relative positioning of the super-departments. In the second stage, the remainder of the formulation is solved. The approach is found to provide better solutions than the hierarchical approach. Solutions are found with up to 14% improvement over the best-known solutions from the literature.
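    The objective named in this abstract, the cost of inter-department flow, can be illustrated with a minimal sketch. It assumes the cost is the flow between each pair of departments multiplied by the rectilinear distance between their centers, which is a common choice but is not stated in the abstract. The function name `flow_cost` and the sample data are hypothetical.

```python
from itertools import combinations


def flow_cost(centers, flow):
    """Sum of flow[i][j] * rectilinear distance between department centers.

    Assumption: distance is measured center to center with the
    rectilinear (Manhattan) metric; the abstract does not specify this.
    """
    total = 0.0
    for i, j in combinations(range(len(centers)), 2):
        (xi, yi), (xj, yj) = centers[i], centers[j]
        total += flow[i][j] * (abs(xi - xj) + abs(yi - yj))
    return total


# Hypothetical layout: three department centers and a symmetric flow matrix.
centers = [(1.0, 1.0), (4.0, 1.0), (1.0, 5.0)]
flow = [
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
]
cost = flow_cost(centers, flow)
print(cost)
```

    A two-stage heuristic of the kind described would fix the relative positions first and then minimize a cost of this shape over the remaining dimensions and coordinates.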

    Automatic synthesis and optimization of chip multiprocessors

    Microprocessor technology has experienced enormous growth during the last decades. Rapid downscaling of CMOS technology has led to higher operating frequencies and performance densities, bringing with it the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm for improving the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores.
    CMP architects must take many complex design decisions. Only a few of them are:
    - What should be the ratio between the core and cache areas on a chip?
    - Which core architectures to select?
    - How many cache levels should the memory subsystem have?
    - Which interconnect topologies provide efficient on-chip communication?
    These and many other aspects create a complex multidimensional space for architectural exploration. Design automation tools become essential to make the architectural exploration feasible under hard time-to-market constraints. The exploration methods have to be efficient and scalable to handle future-generation on-chip architectures with hundreds or thousands of cores.
    Furthermore, once a CMP has been fabricated, the need arises to deploy the many-core processor efficiently. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to guarantee full use of the benefits brought by many-core technology. These techniques have to consider the peculiarities of modern architectures, such as the availability of enhanced power-saving techniques and the presence of complex memory hierarchies.
    This thesis has several objectives. The first objective is to elaborate methods for efficient analytical modeling and architectural design space exploration of CMPs. The efficiency is achieved by using analytical models instead of simulation, and by replacing exhaustive exploration with an intelligent search strategy. Additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document.
    The second objective is to propose a scalable algorithm for mapping tasks onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems. This contribution is explained in Chapter 6.
    Finally, the third objective is to address on-chip interconnect design and exploration by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions. Chapter 7 describes the proposed model.
    The presented methods have been thoroughly tested experimentally and the results are described in this dissertation. At the end of the document, several possible directions for future research are proposed.

    Energy-aware synthesis for networks on chip architectures

    The Network on Chip (NoC) paradigm was introduced as a scalable communication infrastructure for future System-on-Chip applications. Designing application-specific customized communication architectures is critical for obtaining low-power, high-performance solutions. Two significant design automation problems are the creation of an optimized configuration given the application requirements, and the implementation of this on-chip network. Automating the design of on-chip networks requires models for estimating area and energy, algorithms to effectively explore the design space, and network component libraries and tools to generate the hardware description. Chip architects are faced with managing a wide range of customization options for individual components, routers and topology. As energy is of paramount importance, the effectiveness of any custom NoC generation approach lies in the availability of good energy models to effectively explore the design space. This thesis describes a complete NoC synthesis flow, called NoCGEN, for creating energy-efficient custom NoC architectures. Three major automation problems are addressed: custom topology generation, energy modeling and generation. An iterative algorithm is proposed to generate application-specific point-to-point and packet-switched networks. The algorithm explores the design space for efficient topologies using characterized models and a system-level floorplanner for evaluating placement and wire energy. Prior to our contribution, building an energy model required careful analysis of transistor or gate implementations. To alleviate the burden, an automated linear regression-based methodology is proposed to rapidly extract energy models for many router designs. The resulting models are cycle-accurate with low complexity, are found to be within 10% of gate-level energy simulations, and execute several orders of magnitude faster than gate-level simulations. A hardware description of the custom topology is generated using a parameterizable library and custom HDL generator. Fully reusable and scalable network components (switches, crossbars, arbiters, routing algorithms) are described using a template approach and are used to compose arbitrary topologies. A methodology for building and composing routers and topologies using a template engine is described. The entire flow is implemented as several demonstrable, extensible tools with powerful visualization functionality. Several experiments are performed to demonstrate the design space exploration capabilities and compare the flow against a competing min-cut topology generation algorithm.
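    The linear regression-based energy modeling mentioned in this abstract can be illustrated with a minimal sketch. It is not the NoCGEN methodology itself: it assumes a single predictor (flits handled by a router in a cycle) and fits energy = idle + per_flit * flits by ordinary least squares. The sample data and the names `fit_line`, `per_flit` and `idle` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: flits routed per cycle and the energy
# (arbitrary units) that a gate-level simulation would report for that cycle.
flits_per_cycle = [0, 1, 2, 3, 4]
energy_per_cycle = [5.0, 7.0, 9.0, 11.0, 13.0]

per_flit, idle = fit_line(flits_per_cycle, energy_per_cycle)


def estimate_energy(flits):
    """Predict the energy of one cycle from the fitted model."""
    return idle + per_flit * flits


print(per_flit, idle, estimate_energy(2))
```

    Once fitted, a model of this kind is evaluated with one multiply and one add per cycle, which is why such models can run far faster than the gate-level simulation they were fitted against.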

    Understanding the thermal implications of multicore architectures

    Multicore architectures are becoming the main design paradigm for current and future processors. The main reason is that multicore designs provide an effective way of overcoming instruction-level parallelism (ILP) limitations by exploiting thread-level parallelism (TLP). In addition, it is a power and complexity-effective way of taking advantage of the huge number of transistors that can be integrated on a chip. On the other hand, today's higher than ever power densities have made temperature one of the main limitations of microprocessor evolution. Thermal management in multicore architectures is a fairly new area. Some works have addressed dynamic thermal management in bi/quad-core architectures. This work provides insight and explores different alternatives for thermal management in multicore architectures with 16 cores. Schemes employing both energy reduction and activity migration are explored and improvements for thread migration schemes are proposed.
    Peer reviewed. Postprint (published version).

    Regular Datapaths on Field-Programmable Gate Arrays

    Field-Programmable Gate Arrays (FPGAs) are a recent kind of programmable logic device. They allow the implementation of integrated digital electronic circuits without requiring the complex optical, chemical and mechanical processes used in conventional chip fabrication. FPGAs can be embedded in traditional system design flows to perform prototyping and emulation tasks. In addition, they enable novel applications such as configurable computers whose hardware adapts dynamically to a specific problem. The growing chip capacity now allows even the implementation of CPUs and DSPs on single FPGAs. However, current design automation tools trace their roots to times of very limited FPGA sizes and are primarily optimized for the area-efficient implementation of random glue logic. The wide bit-slice datapaths common to CPUs and DSPs are processed only with reduced performance. This thesis presents Structured Design Implementation (SDI), a suite of specialized tools coordinated by a common strategy, which aims to efficiently map even larger regular datapaths to FPGAs. In all steps, regularity is preserved whenever possible, or restored after disruptive operations were required. The circuits are composed from a library of parametrizable modules providing a variety of logical, arithmetical and storage functions, which can be adapted to current requirements through parameters such as bit widths and data types. For each module, multiple target FPGA-specific implementation alternatives may be generated in both gate-level netlist and layout views. A floorplanner based on a genetic algorithm is then used to simultaneously choose an actual implementation from the set of alternatives for each module, and to arrange the selected module implementations in a linear placement. The floorplanning operation optimizes for short routing delays, high routability, and fit into the target FPGA.

    Efficient approaches in interconnect-driven floorplanning.

    Lai Tsz Wai. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 123-129). Abstracts in English and Chinese.
    Chapter 1: Introduction
        1.1 VLSI Design Cycle
        1.2 Physical Design Cycle
        1.3 Floorplanning
            1.3.1 Types of Floorplan and Floorplan Representations
            1.3.2 Interconnect-driven Floorplanning
        1.4 Motivations and Contributions
        1.5 Organization of this Thesis
    Chapter 2: Literature Review on Floorplan Representation
        2.1 Slicing Floorplan Representation
            2.1.1 Normalized Polish Expression
        2.2 Non-slicing Floorplan Representations
            2.2.1 Sequence Pair (SP)
            2.2.2 Bounded-sliceline Grid (BSG)
            2.2.3 O-tree
            2.2.4 B*-tree
        2.3 Mosaic Floorplan Representations
            2.3.1 Corner Block List (CBL)
            2.3.2 Twin Binary Trees (TBT)
            2.3.3 Twin Binary Sequences (TBS)
        2.4 Summary
    Chapter 3: Literature Review on Interconnect Optimization in Floorplanning
        3.1 Wirelength Estimation
        3.2 Congestion Optimization
            3.2.1 Integrated Floorplanning and Interconnect Planning
            3.2.2 Multi-layer Global Wiring Planning (GWP)
            3.2.3 Estimating Routing Congestion using Probabilistic Analysis
            3.2.4 Congestion Minimization During Placement
            3.2.5 Modelling and Minimization of Routing Congestion
        3.3 Buffer Planning
            3.3.1 Buffer Clustering with Feasible Region
            3.3.2 Routability-driven Repeater Clustering Algorithm with Iterative Deletion
            3.3.3 Planning Buffer Locations by Network Flow
            3.3.4 Buffer Planning using Integer Multicommodity Flow
            3.3.5 Buffer Planning Problem using Tile Graph
            3.3.6 Probabilistic Analysis for Buffer Block Planning
            3.3.7 Fast Buffer Planning and Congestion Optimization
        3.4 Summary
    Chapter 4: Congestion Evaluation: Wire Density Model
        4.1 Introduction
        4.2 Overview of Our Floorplanner
        4.3 Wire Density Model
            4.3.1 Computation of Ni
            4.3.2 Computation of Pi
            4.3.3 Usage of Mirror TBT
        4.4 Implementation
            4.4.1 Efficient Calculation of Ni
            4.4.2 Solving the LCA Problem Efficiently
            4.4.3 Cost Function
            4.4.4 Complexity
        4.5 Experimental Results
        4.6 Conclusion
    Chapter 5: Buffer Planning: Simple Buffer Planning Method
        5.1 Introduction
        5.2 Variable Interval Buffer Insertion Constraint
        5.3 Overview of Our Floorplanner
        5.4 Buffer Planning
            5.4.1 Feasible Grids
            5.4.2 Table Look-up Approach
        5.5 Implementation
            5.5.1 Building the Look-up Tables
            5.5.2 An Example of Look-up Table Construction
            5.5.3 A Faster Approach for Building the Look-up Tables
            5.5.4 An Example of the Faster Look-up Table Construction
            5.5.5 I/O Pin Locations
            5.5.6 Cost Function
            5.5.7 Complexity
        5.6 Experimental Results
            5.6.1 Selected Value for A
            5.6.2 Performance of Our Floorplanner
        5.7 Conclusion
    Chapter 6: Conclusion
    Chapter A: An Efficient Algorithm for the Least Common Ancestor Problem
    Bibliography

    Managing HBM’s bandwidth in Multi-Die FPGAs using Overlay NoCs

    HBM bandwidth distribution and utilization on a multi-die FPGA such as the Xilinx Alveo U280 can be improved by using overlay Networks-on-Chip (NoCs). HBM on the Alveo U280 offers 8 GB of memory capacity with a theoretical maximum bandwidth of 460 GB/s, but all thirty-two HBM ports are exposed to the FPGA fabric in only one die. As a result, processing elements assigned to other dies must use the scarce and hard-to-use Super Long Lines (SLLs) to access the HBM's bandwidth. Furthermore, the HBM is divided internally into thirty-two smaller memories called pseudo channels. They are connected by a hardened but flawed crossbar, which enables global accesses from any of the HBM ports but introduces several bottlenecks that degrade the achievable throughput when the entire memory space is used. An overlay hybrid NoC combining the features of Hoplite and Butterfly Fat Tree (BFT) NoCs offers a high-frequency solution for distributing the HBM's bandwidth across all three dies, as well as overcoming the throughput bottleneck introduced by the internal crossbar. The hybrid NoC combines multiple high-frequency ring NoCs for inter-die communication and Butterfly Fat Tree NoCs for intra-die communication. In addition, the routing capability of the NoC can be modified to supplant the HBM's internal crossbar for global accesses. We demonstrate this on the Xilinx Alveo U280 using synthetic benchmarks and two application-based benchmarks: dense matrix-matrix multiplication (DMM) and sparse matrix-vector multiplication (SpMV). Our experiments show that NoCs can improve throughput utilization by as much as 8.6× for single-flit global accesses, 1.7× for multi-flit global accesses with burst length 16, and 1.4× for the SpMV benchmark.
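    The figures quoted in this abstract imply a per-port budget that is easy to work out. The sketch below assumes that capacity and theoretical bandwidth are split evenly across the thirty-two pseudo channels, which the abstract does not state explicitly. All constant and variable names are illustrative.

```python
# Figures taken from the abstract.
TOTAL_BANDWIDTH_GBPS = 460.0   # theoretical maximum, GB/s
TOTAL_CAPACITY_GB = 8.0        # HBM capacity, GB
NUM_PSEUDO_CHANNELS = 32       # one HBM port per pseudo channel

# Assumption: an even split across pseudo channels.
bandwidth_per_channel = TOTAL_BANDWIDTH_GBPS / NUM_PSEUDO_CHANNELS
capacity_per_channel_mb = TOTAL_CAPACITY_GB * 1024 / NUM_PSEUDO_CHANNELS

print(f"{bandwidth_per_channel} GB/s per pseudo channel")
print(f"{capacity_per_channel_mb} MB per pseudo channel")
```

    Under that assumption each port peaks at 14.375 GB/s over a 256 MB region, so a processing element that needs more bandwidth or a larger address range has to reach several pseudo channels, which is where the crossbar bottleneck described above comes into play.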