551 research outputs found
Empowering parallel computing with field programmable gate arrays
After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements
Exploration and Design of High Performance Variation Tolerant On-Chip Interconnects
Siirretty Doriast
Doctor of Philosophy
dissertationCommunication surpasses computation as the power and performance bottleneck in forthcoming exascale processors. Scaling has made transistors cheap, but on-chip wires have grown more expensive, both in terms of latency as well as energy. Therefore, the need for low energy, high performance interconnects is highly pronounced, especially for long distance communication. In this work, we examine two aspects of the global signaling problem. The first part of the thesis focuses on a high bandwidth asynchronous signaling protocol for long distance communication. Asynchrony among intellectual property (IP) cores on a chip has become necessary in a System on Chip (SoC) environment. Traditional asynchronous handshaking protocol suffers from loss of throughput due to the added latency of sending the acknowledge signal back to the sender. We demonstrate a method that supports end-to-end communication across links with arbitrarily large latency, without limiting the bandwidth, so long as line variation can be reliably controlled. We also evaluate the energy and latency improvements as a result of the design choices made available by this protocol. The use of transmission lines as a physical interconnect medium shows promise for deep submicron technologies. In our evaluations, we notice a lower energy footprint, as well as vastly reduced wire latency for transmission line interconnects. We approach this problem from two sides. Using field solvers, we investigate the physical design choices to determine the optimal way to implement these lines for a given back-end-of-line (BEOL) stack. We also approach the problem from a system designer's viewpoint, looking at ways to optimize the lines for different performance targets. This work analyzes the advantages and pitfalls of implementing asynchronous channel protocols for communication over long distances. Finally, the innovations resulting from this work are applied to a network-on-chip design example and the resulting power-performance benefits are reported
Exploiting level sensitive latches in wire pipelining
The present research presents procedures for exploitation of level sensitive latches in wire pipelining. The user gives a Steiner tree, having a signal source and set of destination or sinks, and the location in rectangular plane, capacitive load and required arrival time at each of the destinations. The user also defines a library of non-clocked (buffer) elements and clocked elements (flip-flop and latch), also known as synchronous elements. The first procedure performs concurrent repeater and synchronous element insertion in a bottom-up manner to find the minimum latency that may be achieved between the source and the destinations. The second procedure takes additional input (required latency) for each destination, derived from previous procedure, and finds the repeater and synchronous element assignments for all internal nodes of the Steiner tree, which minimize overall area used. These procedures utilize the latency and area advantages of latch based pipelining over flip-flop based pipelining. The second procedure suggests two methods to tackle the challenges that exist in a latch based design. The deferred delay padding technique is introduced, which removes the short path violations for latches with minimal extra cost
The development and performance evaluation of PIF logic functional blocks
In the deep submicron range of integrated circuit design, interconnects and not the logical gates are causing the performance bottleneck. The number of available transistors increase by a factor of 2 every technology node but interconnects do not scale with devices, devices scale down faster and thus the present designs need to be scalable and reusable. Pipeline interconnect free (PIF) logic methodology has a potential to solve these current design problems. In PIF logic methodology the global interconnects are replaced by a chain of logical gates. PIF logic uses only one type of gate which can be connected only to the adjacent eight gates making the gate and the interconnect modeling easier. In order to migrate from one technology node to other, just one PIF cell needs to be redesigned. The PIF cell in new technology node can replace the present cells thus making PIF logic based circuits fully reusable. This thesis implements PIF design methodology to develop two libraries consisting of combinational and sequential functional blocks such as adder, shift registers, multiplexers, decoders and encoders. The performance of these functional blocks is compared with the standard cell implementation with respect to the quality metrics (power dissipation, propagation delay and layout area)
Low Power Standard Cell Library Design For Application Specific Integrated Circuit
With the expansion of portable and wireless electronics product in the current
market demand, the focus of designing VLSI system has shifted from high speed to
low power domain. This requires chip designers to minimize power consumption at
all design level such as system, algorithm, architecture, circuit and technology.
The objective of this work is to develop low power CMOS standard cell
library to be used in application specific integrated circuit (ASIC) design flow. The
design methodology focuses on all aspect of circuit design: transistor size, logic
style, layout style, cell topology, and circuit design for minimum power
consumption. The standard cell library is targeted for general-purpose application,
especially in microprocessor design. For rapid design implementation, the library is
designed to be used together with the commercial logic synthesis and automatic cell
placement and routing tools. Results show that the microprocessor targeted to the
low power library gives 44% power saving compared to the conventional library,
with both designs operate at the same clock frequency of 50MHz
Recommended from our members
Architecting SkyBridge-CMOS
As the scaling of CMOS approaches fundamental limits, revolutionary technology beyond the end of CMOS roadmap is essential to continue the progress and miniaturization of integrated circuits. Recent research efforts in 3-D circuit integration explore pathways of continuing the scaling by co-designing for device, circuit, connectivity, heat and manufacturing challenges in a 3-D fabric-centric manner. SkyBridge fabric is one such approach that addresses fine-grained integration in 3-D, achieves orders of magnitude benefits over projected scaled 2-D CMOS, and provides a pathway for continuing scaling beyond 2-D CMOS.
However, SkyBridge fabric utilizes only single type transistors in order to reduce manufacture complexity, which limits its circuit implementation to dynamic logic. This design choice introduces multiple challenges for SkyBridge such as high switching power consumption, susceptibility to noise, and increased complexity for clocking. In this thesis we propose a new 3-D fabric, similar in mindset to SkyBridge, but with static logic circuit implementation in order to mitigate the afore-mentioned challenges. We present an integrated framework to realize static circuits with vertical nanowires, and co-design it across all layers spanning fundamental fabric structures to large circuits. The new fabric, named as SkyBridge-CMOS, introduces new technology, structures and circuit designs to meet the additional requirements for implementing static circuits. One of the critical challenges addressed here is integrating both n-type and p-type nanowires. Molecular bonding process allows precise control between different doping regions, and novel fabric components are proposed to achieve 3-D routing between various doping regions.
Core fabric components are designed, optimized and modeled with their physical level information taken into account. Based on these basic structures we design and evaluate various logic gates, arithmetic circuits and SRAM in terms of power, area footprint and delay. A comprehensive evaluation methodology spanning material/device level to circuit level is followed. Benchmarking against 16nm 2-D CMOS shows significant improvement of up to 50X in area footprint and 9.3X in total power efficiency for low power applications, and 3X in throughput for high performance applications. Also, better noise resilience and better power efficiency can be guaranteed when compared with original SkyBridge fabrics
Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip
The sustained demand for faster, more powerful chips has been met by the
availability of chip manufacturing processes allowing for the integration of increasing
numbers of computation units onto a single die. The resulting outcome,
especially in the embedded domain, has often been called SYSTEM-ON-CHIP
(SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC).
MPSoC design brings to the foreground a large number of challenges, one of
the most prominent of which is the design of the chip interconnection. With a
number of on-chip blocks presently ranging in the tens, and quickly approaching
the hundreds, the novel issue of how to best provide on-chip communication
resources is clearly felt.
NETWORKS-ON-CHIPS (NoCs) are the most comprehensive and scalable
answer to this design concern. By bringing large-scale networking concepts to
the on-chip domain, they guarantee a structured answer to present and future
communication requirements. The point-to-point connection and packet switching
paradigms they involve are also of great help in minimizing wiring overhead
and physical routing issues. However, as with any technology of recent inception,
NoC design is still an evolving discipline. Several main areas of interest
require deep investigation for NoCs to become viable solutions:
• The design of the NoC architecture needs to strike the best tradeoff among
performance, features and the tight area and power constraints of the onchip
domain.
• Simulation and verification infrastructure must be put in place to explore,
validate and optimize the NoC performance.
• NoCs offer a huge design space, thanks to their extreme customizability in
terms of topology and architectural parameters. Design tools are needed
to prune this space and pick the best solutions.
• Even more so given their global, distributed nature, it is essential to evaluate
the physical implementation of NoCs to evaluate their suitability for
next-generation designs and their area and power costs.
This dissertation performs a design space exploration of network-on-chip architectures,
in order to point-out the trade-offs associated with the design of
each individual network building blocks and with the design of network topology
overall. The design space exploration is preceded by a comparative analysis
of state-of-the-art interconnect fabrics with themselves and with early networkon-
chip prototypes. The ultimate objective is to point out the key advantages
that NoC realizations provide with respect to state-of-the-art communication
infrastructures and to point out the challenges that lie ahead in order to make
this new interconnect technology come true. Among these latter, technologyrelated
challenges are emerging that call for dedicated design techniques at all
levels of the design hierarchy. In particular, leakage power dissipation, containment
of process variations and of their effects. The achievement of the above
objectives was enabled by means of a NoC simulation environment for cycleaccurate
modelling and simulation and by means of a back-end facility for the
study of NoC physical implementation effects. Overall, all the results provided
by this work have been validated on actual silicon layout
Design and modelling of variability tolerant on-chip communication structures for future high performance system on chip designs
The incessant technology scaling has enabled the integration of functionally complex System-on-Chip (SoC) designs with a large number of heterogeneous systems on a single chip. The processing elements on these chips are integrated through on-chip communication structures which provide the infrastructure necessary for the exchange of data and control signals, while meeting the strenuous physical and design constraints. The use of vast amounts of on chip communications will be central to future designs where variability is an inherent characteristic. For this reason, in this thesis we investigate the performance and variability tolerance of typical on-chip communication structures. Understanding of the relationship between variability and communication is paramount for the designers; i.e. to devise new methods and techniques for designing performance and power efficient communication circuits in the forefront of challenges presented by deep sub-micron (DSM) technologies.
The initial part of this work investigates the impact of device variability due to Random Dopant Fluctuations (RDF) on the timing characteristics of basic communication elements. The characterization data so obtained can be used to estimate the performance and failure probability of simple links through the methodology proposed in this work. For the Statistical Static Timing Analysis (SSTA) of larger circuits, a method for accurate estimation of the probability density functions of different circuit parameters is proposed. Moreover, its significance on pipelined circuits is highlighted. Power and area are one of the most important design metrics for any integrated circuit (IC) design. This thesis emphasises the consideration of communication reliability while optimizing for power and area. A methodology has been proposed for the simultaneous optimization of performance, area, power and delay variability for a repeater inserted interconnect. Similarly for multi-bit parallel links, bandwidth driven optimizations have also been performed. Power and area efficient semi-serial links, less vulnerable to delay variations than the corresponding fully parallel links are introduced. Furthermore, due to technology scaling, the coupling noise between the link lines has become an important issue. With ever decreasing supply voltages, and the corresponding reduction in noise margins, severe challenges are introduced for performing timing verification in the presence of variability. For this reason an accurate model for crosstalk noise in an interconnection as a function of time and skew is introduced in this work. This model can be used for the identification of skew condition that gives maximum delay noise, and also for efficient design verification
- …