128 research outputs found
Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments
[EN] In this work, we adapt a reconfigurable computer system based on FPGA
technologies to OpenCL programming environments. The reconfigurable system
is part of a compute prototype of the MANGO European project that includes 96
FPGAs. To optimize the use and to obtain its maximum performance, it is essential to adapt it to heterogeneous systems programming environments such as
OpenCL, which simplifies its programming. In this work, all the necessary activities for correct implementation of the software and hardware layer required for
its use in OpenCL will be carried out, as well as an evaluation of the performance
obtained and the flexibility offered by the solution provided.
This work has been performed during an internship of 5 months. The internship is linked to an agreement between UPV and UniNa (Università degli Studi
di Napoli Federico II).[ES] En este trabajo se va a realizar la adaptación de un sistema reconfigurable de
cómputo basado en tecnologías de FPGAs hacia entornos de programación en
OpenCL. El sistema reconfigurable forma parte de un prototipo de cálculo del
proyecto Europeo MANGO que incluye 96 FPGAs. Con el fin de optimizar el
uso y de obtener sus máximas prestaciones, se hace imprescindible una adaptación a entornos de programación de sistemas heterogéneos como OpenCL, lo cual
simplifica su programación y uso. En este trabajo se realizarán todas las actividades necesarias para una correcta implementación de la capa software y hardware
necesaria para su uso en OpenCL así como una evaluación de las prestaciones
obtenidas y de la flexibilidad ofrecida por la solución aportada.
Este trabajo se ha llevado a término durante una estancia de cinco meses en
la Universitat Politécnica de Valéncia. Esta estancia está vinculada a un acuerdo
entre la Universitat Politécnica de Valéncia y la Università degli Studi di Napoli
Federico IIRusso, D. (2020). Adaptation of High Performance and High Capacity Reconfigurable Systems to OpenCL Programming Environments. http://hdl.handle.net/10251/150393TFG
Graphics Processing Unit-Based Computer-Aided Design Algorithms for Electronic Design Automation
The electronic design automation (EDA) tools are a specific set of software that play important roles in modern integrated circuit (IC) design. These software automate the design processes of IC with various stages. Among these stages, two important EDA design tools are the focus of this research: floorplanning and global routing. Specifically, the goal of this study is to parallelize these two tools such that their execution time can be significantly shortened on modern multi-core and graphics processing unit (GPU) architectures. The GPU hardware is a massively parallel architecture, enabling thousands of independent threads to execute concurrently. Although a small set of EDA tools can benefit from using GPU to accelerate their speed, most algorithms in this field are designed with the single-core paradigm in mind. The floorplanning and global routing algorithms are among the latter, and difficult to render any speedup on the GPU due to their inherent sequential nature.
This work parallelizes the floorplanning and global routing algorithm through a novel approach and results in significant speedups for both tools implemented on the GPU hardware. Specifically, with a complete overhaul of solution space and design space exploration, a GPU-based floorplanning algorithm is able to render 4-166X speedup, while achieving similar or improved solutions compared with the sequential algorithm. The GPU-based global routing algorithm is shown to achieve significant speedup against existing state-of-the-art routers, while delivering competitive solution quality. Importantly, this parallel model for global routing renders a stable solution that is independent from the level of parallelism. In summary, this research has shown that through a design paradigm overhaul, sequential algorithms can also benefit from the massively parallel architecture. The findings of this study have a positive impact on the efficiency and design quality of modern EDA design flow
Hierarchical and Nesting Approaches for the Facility Layout Problem
The Facility Layout Problem (FLP) seeks to determine the dimensions, coordinates and arrangement of rectangular departments within a given facility. The goal is to minimize the cost of inter-department flow. It has several real-world applications, including the design of manufacturing and warehousing facilities and electronic chips. Despite being studied for several decades, the FLP is still very difficult to solve for facilities with thirty or more departments. Thus, many heuristic approaches have been developed to solve the problem in a reasonable time. One such approach tackles the problem in two stages. In the first, some decision, usually the relative positioning of the departments, is fixed. In the second, an easier restricted problem is solved.
This thesis explores hierarchical and nesting approaches for the FLP in an attempt to leverage the fact that smaller instances of the FLP can be solved to optimality relatively quickly. The goal is to find ways in which the FLP can be decomposed into several smaller problems and recombined to form a high-quality solution to the original problem.
Hierarchical approaches use clustering or related methods to generate a tree where the leaves are the original departments and the root is the facility. The intermediate nodes are super-departments within an overall layout. A new hierarchical approach for the FLP is presented which performs layouts down this tree in a manner that controls dead-space and generates high-quality solutions. The approach provides solutions competitive with the best-known solutions on benchmark instances from the literature, with up to 8% improvement.
The success of the hierarchical approach provided the motivation for a new formulation that nests departments within super-departments. The resulting formulation is even more difficult to solve directly than the original FLP; however, it is suitable for a two-stage solution approach. The first stage determines the assignment of departments to super-departments and the relative positioning of the super-departments. In the second stage, the remainder of the formulation is solved. The approach is found to provide better solutions than the hierarchical approach. Solutions are found with up to 14% improvement over the best-known solutions from the literature
Automatic synthesis and optimization of chip multiprocessors
The microprocessor technology has experienced an enormous growth during the last decades. Rapid downscale of the CMOS technology has led to higher operating frequencies and performance densities, facing the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm to improve the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores.CMP architects are challenged to take many complex design decisions. Only a few of them are:- What should be the ratio between the core and cache areas on a chip?- Which core architectures to select?- How many cache levels should the memory subsystem have?- Which interconnect topologies provide efficient on-chip communication?These and many other aspects create a complex multidimensional space for architectural exploration. Design Automation tools become essential to make the architectural exploration feasible under the hard time-to-market constraints. The exploration methods have to be efficient and scalable to handle future generation on-chip architectures with hundreds or thousands of cores.Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to guarantee the full usage of the benefits brought by the many-core technology. These techniques have to consider the peculiarities of the modern architectures, such as availability of enhanced power saving techniques and presence of complex memory hierarchies.This thesis has several objectives. The first objective is to elaborate the methods for efficient analytical modeling and architectural design space exploration of CMPs. The efficiency is achieved by using analytical models instead of simulation, and replacing the exhaustive exploration with an intelligent search strategy. Additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document.The second objective of this work is to propose a scalable task mapping algorithm onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems. This contribution is explained in Chapter 6 of this document.Finally, the third objective of this thesis is to address the issues of the on-chip interconnect design and exploration, by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of the on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions. Chapter 7 describes the proposed model.The presented methods have been thoroughly tested experimentally and the results are described in this dissertation. At the end of the document several possible directions for the future research are proposed
Energy-aware synthesis for networks on chip architectures
The Network on Chip (NoC) paradigm was introduced as a scalable communication infrastructure for future System-on-Chip applications. Designing application specific customized communication architectures is critical for obtaining low power, high performance solutions. Two significant design automation problems are the creation of an optimized configuration, given application requirement the implementation of this on-chip network. Automating the design of on-chip networks requires models for estimating area and energy, algorithms to effectively explore the design space and network component libraries and tools to generate the hardware description. Chip architects are faced with managing a wide range of customization options for individual components, routers and topology. As energy is of paramount importance, the effectiveness of any custom NoC generation approach lies in the availability of good energy models to effectively explore the design space. This thesis describes a complete NoC synthesis flow, called NoCGEN, for creating energy-efficient custom NoC architectures. Three major automation problems are addressed: custom topology generation, energy modeling and generation. An iterative algorithm is proposed to generate application specific point-to-point and packet-switched networks. The algorithm explores the design space for efficient topologies using characterized models and a system-level floorplanner for evaluating placement and wire-energy. Prior to our contribution, building an energy model required careful analysis of transistor or gate implementations. To alleviate the burden, an automated linear regression-based methodology is proposed to rapidly extract energy models for many router designs. The resulting models are cycle accurate with low-complexity and found to be within 10% of gate-level energy simulations, and execute several orders of magnitude faster than gate-level simulations. A hardware description of the custom topology is generated using a parameterizable library and custom HDL generator. Fully reusable and scalable network components (switches, crossbars, arbiters, routing algorithms) are described using a template approach and are used to compose arbitrary topologies. A methodology for building and composing routers and topologies using a template engine is described. The entire flow is implemented as several demonstrable extensible tools with powerful visualization functionality. Several experiments are performed to demonstrate the design space exploration capabilities and compare it against a competing min-cut topology generation algorithm
Understanding the thermal implications of multicore architectures
Multicore architectures are becoming the main design paradigm for current and future processors. The main reason is that multicore designs provide an effective way of overcoming instruction-level parallelism (ILP) limitations by exploiting thread-level parallelism (TLP). In addition, it is a power and complexity-effective way of taking advantage of the huge number of transistors that can be integrated on a chip. On the other hand, today's higher than ever power densities have made temperature one of the main limitations of microprocessor evolution. Thermal management in multicore architectures is a fairly new area. Some works have addressed dynamic thermal management in bi/quad-core architectures. This work provides insight and explores different alternatives for thermal management in multicore architectures with 16 cores. Schemes employing both energy reduction and activity migration are explored and improvements for thread migration schemes are proposed.Peer ReviewedPostprint (published version
Regular Datapaths on Field-Programmable Gate Arrays
Field-Programmable Gate Arrays (FPGAs) are a recent kind of programmable logic device. They allow the implementation of integrated digital electronic circuits without requiring the complex optical, chemical and mechanical processes used in a conventional chip fabrication. FPGAs can be embedded in traditional system designflows to perform prototyping and emulation tasks. In addition, they also enable novel applications such as configurable computers with hardware dynamically adaptable to a specific problem. The growing chip capacity now allows even the implementation of CPUs and DSPs on single FPGAs. However, current design automation tools trace their roots to times of very limited FPGA sizes, and are primarily optimized for the implementation of random glue logic. The wide datapaths common to CPUs and DSPs are only processed with reduced performance. This thesis presents Structured Design Implementation (SDI), a suite of specialized tools coordinated by a common strategy, which aims to efficiently map even larger regular datapaths to FPGAs. In all steps, regularity is preserved whenever possible, or restored after disruptive operations were required. The circuits are composed from parametrizable modules providing a variety of logical, arithmetical and storage functions. For each module, multiple target FPGA-specific implementation alternatives may be generated in both gatelevel netlist and layout views. A floorplanner based on a genetic algorithm is then used to simultaneously choose an actual implementation from the set of alternatives for each module, and to arrange the selected module implementations in a linear placement. The floorplanning operation optimizes for short routing delays, high routability, and fit into the target FPGA.Field-Programmable Gate-Arrays (FPGAs) sind eine noch junge Art von programmierbaren Logikbausteinen. Sie erlauben die Implementierung von integrierten Digitalschaltungen ohne die komplizierten optischen, chemischen und mechanischen Prozesse, die normalerweise für die Chipfertigung erforderlich sind. FPGAs können im Rahmen konventioneller Entwurfsmethoden zu Emulationszwecken und Prototyp-Aufbauten herangezogen werden. Sie erlauben aber auch völlig neue Anwendungen wie rekonfigurierbare Computer, deren Hardware dynamisch an ein spezielles Problem angepaßt werden kann. Die gewachsene Chip-Kapazität erlaubt nun sogar die Implementierung von CPUs und digitalen Signalprozessoren (DSPs) auf einem einzelnen FPGA. Die Leistungsfähigkeit der entstandenen Schaltungen wird jedoch durch die zur Zeit erhältlichen CAD-Werkzeuge limitiert, da diese noch auf stark beschränkte FPGA-Größen ausgerichtet sind und primär der platzsparenden Verarbeitung unregelmäßiger Logik dienen. Die breiten Datenpfade in Bit-Slice-Struktur, die den Kern vieler CPUs und DSPs darstellen, werden nur suboptimal behandelt. Diese Arbeit stellt Structured Design Implementation (SDI) vor, ein System von spezialisierten CAD-Werkzeugen, die auch größere reguläre Datenpfade effizient auf FPGAs abbilden. In allen Verarbeitungsschritten wird dabei die bestehende Regularität soweit wie möglich erhalten oder nach regularitätsvernichtenden Operationen wiederhergestellt. Zur Schaltungseingabe steht eine Bibliothek von allgemeinen Modulen aus den Bereichen Logik, Arithmetik und Speicherung bereit. Diese können durch Belegung verschiedener Parameter wie Bit-Breiten und Datentypen an aktuelle Anforderungen angepaßt werden
Efficient approaches in interconnect-driven floorplanning.
Lai Tsz Wai.Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.Includes bibliographical references (leaves 123-129).Abstracts in English and Chinese.Chapter 1 --- Introduction --- p.1Chapter 1.1 --- VLSI Design Cycle --- p.2Chapter 1.2 --- Physical Design Cycle --- p.4Chapter 1.3 --- Floorplanning --- p.7Chapter 1.3.1 --- Types of Floorplan and Floorplan Representations --- p.11Chapter 1.3.2 --- Interconnect-driven Floorplanning --- p.13Chapter 1.4 --- Motivations and Contributions --- p.17Chapter 1.5 --- Organization of this Thesis --- p.18Chapter 2 --- Literature Review on Floorplan Representation --- p.20Chapter 2.1 --- Slicing Floorplan Representation --- p.20Chapter 2.1.1 --- Normalized Polish Expression --- p.20Chapter 2.2 --- Non-slicing Floorplan Representations --- p.21Chapter 2.2.1 --- Sequence Pair (SP) --- p.21Chapter 2.2.2 --- Bounded-sliceline Grid (BSG) --- p.23Chapter 2.2.3 --- O-tree --- p.25Chapter 2.2.4 --- B*-tree --- p.26Chapter 2.3 --- Mosaic Floorplan Representations --- p.28Chapter 2.3.1 --- Corner Block List (CBL) --- p.28Chapter 2.3.2 --- Twin Binary Trees (TBT) --- p.31Chapter 2.3.3 --- Twin Binary Sequences (TBS) --- p.32Chapter 2.4 --- Summary --- p.34Chapter 3 --- Literature Review on Interconnect Optimization in Floorplan- ning --- p.37Chapter 3.1 --- Wirelength Estimation --- p.37Chapter 3.2 --- Congestion Optimization --- p.38Chapter 3.2.1 --- Integrated Floorplanning and Interconnect Planning --- p.41Chapter 3.2.2 --- Multi-layer Global Wiring Planning (GWP) --- p.43Chapter 3.2.3 --- Estimating Routing Congestion using Probabilistic Anal- ysis --- p.44Chapter 3.2.4 --- Congestion Minimization During Placement --- p.46Chapter 3.2.5 --- Modelling and Minimization of Routing Congestion --- p.48Chapter 3.3 --- Buffer Planning --- p.49Chapter 3.3.1 --- Buffer Clustering with Feasible Region --- p.51Chapter 3.3.2 --- Routability-driven Repeater Clustering Algorithm with Iterative Deletion --- p.55Chapter 3.3.3 --- Planning Buffer Locations by Network Flow --- p.58Chapter 3.3.4 --- Buffer Planning using Integer Multicommodity Flow --- p.60Chapter 3.3.5 --- Buffer Planning Problem using Tile Graph --- p.60Chapter 3.3.6 --- Probabilistic Analysis for Buffer Block Planning --- p.62Chapter 3.3.7 --- Fast Buffer Planning and Congestion Optimization --- p.63Chapter 3.4 --- Summary --- p.66Chapter 4 --- Congestion Evaluation: Wire Density Model --- p.68Chapter 4.1 --- Introduction --- p.68Chapter 4.2 --- Overview of Our Floorplanner --- p.70Chapter 4.3 --- Wire Density Model --- p.71Chapter 4.3.1 --- Computation of Ni --- p.72Chapter 4.3.2 --- Computation of Pi --- p.74Chapter 4.3.3 --- Usage of Mirror TBT --- p.76Chapter 4.4 --- Implementation --- p.76Chapter 4.4.1 --- Efficient Calculation of Ni --- p.76Chapter 4.4.2 --- Solving the LCA Problem Efficiently --- p.81Chapter 4.4.3 --- Cost Function --- p.81Chapter 4.4.4 --- Complexity --- p.81Chapter 4.5 --- Experimental Results --- p.82Chapter 4.6 --- Conclusion --- p.83Chapter 5 --- Buffer Planning: Simple Buffer Planning Method --- p.85Chapter 5.1 --- Introduction --- p.85Chapter 5.2 --- Variable Interval Buffer Insertion Constraint --- p.87Chapter 5.3 --- Overview of Our Floorplanner --- p.88Chapter 5.4 --- Buffer Planning --- p.89Chapter 5.4.1 --- Feasible Grids --- p.89Chapter 5.4.2 --- Table Look-up Approach --- p.89Chapter 5.5 --- Implementation --- p.91Chapter 5.5.1 --- Building the Look-up Tables --- p.91Chapter 5.5.2 --- An Example of Look-up Table Construction --- p.94Chapter 5.5.3 --- A Faster Approach for Building the Look-up Tables --- p.101Chapter 5.5.4 --- An Example of the Faster Look-up Table Construction --- p.105Chapter 5.5.5 --- I/O Pin Locations --- p.106Chapter 5.5.6 --- Cost Function --- p.110Chapter 5.5.7 --- Complexity --- p.111Chapter 5.6 --- Experimental Results --- p.112Chapter 5.6.1 --- Selected Value for A --- p.112Chapter 5.6.2 --- Performance of Our Floorplanner --- p.113Chapter 5.7 --- Conclusion --- p.116Chapter 6 --- Conclusion --- p.118Chapter A --- An Efficient Algorithm for the Least Common Ancestor Prob- lem --- p.120Bibliography --- p.12
Recommended from our members
Cross-Layer Pathfinding for Off-Chip Interconnects
Off-chip interconnects for integrated circuits (ICs) today induce a diverse design space, spanning many different applications that require transmission of data at various bandwidths, latencies and link lengths. Off-chip interconnect design solutions are also variously sensitive to system performance, power and cost metrics, while also having a strong impact on these metrics. The costs associated with off-chip interconnects include die area, package (PKG) and printed circuit board (PCB) area, technology and bill of materials (BOM). Choices made regarding off-chip interconnects are fundamental to product definition, architecture, design implementation and technology enablement. Given their cross-layer impact, it is imperative that a cross-layer approach be employed to architect and analyze off-chip interconnects up front, so that a top-down design flow can comprehend the cross-layer impacts and correctly assess the system performance, power and cost tradeoffs for off-chip interconnects. Chip architects are not exposed to all the tradeoffs at the physical and circuit implementation or technology layers, and often lack the tools to accurately assess off-chip interconnects. Furthermore, the collaterals needed for a detailed analysis are often lacking when the chip is architected; these include circuit design and layout, PKG and PCB layout, and physical floorplan and implementation. To address the need for a framework that enables architects to assess the system-level impact of off-chip interconnects, this thesis presents power-area-timing (PAT) models for off-chip interconnects, optimization and planning tools with the appropriate abstraction using these PAT models, and die/PKG/PCB co-design methods that help expose the off-chip interconnect cross-layer metrics to the die/PKG/PCB design flows. Together, these models, tools and methods enable cross-layer optimization that allows for a top-down definition and exploration of the design space and helps converge on the correct off-chip interconnect implementation and technology choice. The tools presented cover off-chip memory interfaces for mobile and server products, silicon photonic interfaces, 2.5D silicon interposers and 3D through-silicon vias (TSVs). The goal of the cross-layer framework is to assess the key metrics of the interconnect (such as timing, latency, active/idle/sleep power, and area/cost) at an appropriate level of abstraction by being able to do this across layers of the design flow. In additional to signal interconnect, this thesis also explores the need for such cross-layer pathfinding for power distribution networks (PDN), where the system-on-chip (SoC) floorplan and pinmap must be optimized before the collateral layouts for PDN analysis are ready. Altogether, the developed cross-layer pathfinding methodology for off-chip interconnects enables more rapid and thorough exploration of a vast design space of off-chip parallel and serial links, inter-die and inter-chiplet links and silicon photonics. Such exploration will pave the way for off-chip interconnect technology enablement that is optimized for system needs. The basis of the framework can be extended to cover other interconnect technology as well, since it fundamentally relates to system-level metrics that are common to all off-chip interconnects
Managing HBM’s bandwidth in Multi-Die FPGAs using Overlay NoCs
We can improve HBM bandwidth distribution and utilization on a multi-die FPGA like
Xilinx Alveo U280 by using Overlay Network-on-Chips (NoCs). HBM in Xilinx Alveo U280
offers 8GBs of memory capacity with a theoretical maximum bandwidth of 460 GBps, but
all the thirty-two HBM ports in Xilinx Alveo U280 are exposed to the FPGA fabric in
only one die. As a result, processing elements assigned to other dies must use the scarcely
available and challenging to use Super Long Lines (SLL) to access the HBM’s bandwidth.
Furthermore, HBM is fractured internally into thirty-two smaller memories called pseudo
channels. They are connected together by a hardened and flawed cross-bar, which enables
global accesses from any of the HBM ports, but introduces several throughput bottlenecks,
degrading the achievable throughput when the entire memory space is used.
An Overlay Hybrid NoC combining the features of Hoplite and Butterfly Fat Trees
(BFT) NoC offers a high-frequency solution for distributing HBM’s bandwidth across all
three dies, as well as overcoming the throughput bottleneck introduced by the internal
cross-bar. The Hybrid NoC combines multiple high-frequency Ring NoCs for inter-die
communication and Butterfly Fat tree NoCs for intra-die communication. In addition, the
routing capability of the NoC can be modified to supplant the HBM’s internal cross-bar
for global accesses. We demonstrate this in Xilinx Alveo 280 using synthetic benchmarks
and two application-based benchmarks, Dense matrix-matrix multiplication (DMM) and
Sparse Matrix-Vector multiplication (SPMV). Our experiments show that NoCs can improve throughput utilization by as much as ×8.6 for single-flit global accesses,×1.7 for
multi-flit global accesses with burst length 16, and as much as ×1.4 for SpMV benchmark
- …