
    Data path analysis for dynamic circuit specialisation

    Dynamic Circuit Specialisation (DCS) is a method that exploits the reconfigurability of modern FPGAs to specialise FPGA circuits at run-time. Currently, it is only explored as part of Register-Transfer Level (RTL) design. However, at RTL, a large part of the design is already locked in, so maximally exploiting the opportunities of DCS could require a costly redesign. It would therefore be valuable to have insight into the opportunities for DCS from a higher abstraction level. Moreover, the general trend in FPGA design is to work at higher abstraction levels and let tools translate this higher-level description to RTL. This paper presents the first profiler that, based on the high-level description of an application, estimates the benefits of an implementation using DCS. This allows a designer to determine much earlier in the design cycle whether or not DCS would be worthwhile. The high-level profiling methodology was implemented and tested on a set of PID designs.

    Design of Ultrafast All-Optical Pseudo Binary Random Sequence Generator, 4-bit Multiplier and Divider using 2 x 2 Silicon Micro-ring Resonators

    All-optical devices are essential for next-generation ultrafast, ultralow-power and ultrahigh-bandwidth information processing systems. Silicon microring resonators (SiMRR) provide a versatile platform for all-optical switching and CMOS-compatible computing, with the added advantages of high Q-factor, tunability, compactness, cascadability and scalability. A detailed theoretical analysis of ultrafast all-optical switching in 2 x 2 SiMRRs has been carried out, incorporating the effects of two-photon-absorption-induced free-carrier injection and the thermo-optic effect. The results have been used to design simple and compact all-optical 3-bit and 4-bit pseudo-random binary sequence generators and the first reported designs of an all-optical 4 x 4-bit multiplier and divider. The designs have been optimised for low-power, ultrafast operation with high modulation depth, enabling logic operations at 45 Gbps. (Comment: 13 pages, 4 figures. Submitted to the journal Optik for publication.)
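
The pseudo-random binary sequence at the heart of such a generator is conventionally produced by a linear-feedback shift register (LFSR). As a point of reference for the all-optical design, here is a minimal electrical-domain sketch of a 4-bit Fibonacci LFSR; the taps and seed are chosen for illustration and are not taken from the paper:

```python
def prbs4(seed=0b1000, length=15):
    """Generate a PRBS from a 4-bit Fibonacci LFSR.

    Taps at bit positions 4 and 3 (polynomial x^4 + x^3 + 1) give a
    maximal-length sequence of 2**4 - 1 = 15 bits before repeating.
    Seed and tap choice here are illustrative, not the paper's design.
    """
    state = seed & 0xF
    out = []
    for _ in range(length):
        out.append(state & 1)                    # tap the least significant bit
        fb = ((state >> 3) ^ (state >> 2)) & 1   # XOR feedback from bits 4 and 3
        state = ((state << 1) | fb) & 0xF        # shift left, insert feedback
    return out
```

Any non-zero seed yields the same maximal-length cycle of 15 states, which is why such sequences look random over short windows yet repeat deterministically.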

    Using BDD-based decomposition for automatic error correction of combinatorial circuits

    Boolean equivalence checking has turned out to be a powerful method for verifying combinatorial circuits and has been widely accepted in both academia and industry. In this paper, we present a method for localising and correcting errors in combinatorial circuits for which equivalence checking has failed. Our approach is general and does not assume any error model. Working directly on BDDs, the approach is well suited for integration into commonly used equivalence checkers. Since circuits can be corrected fully automatically, our approach can save considerable debugging time and therefore speed up the whole design cycle. We have implemented a prototype verification tool and evaluated our method with the Berkeley benchmark circuits. In addition, we have applied it successfully to a real-life example taken from [DrFe96].
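
To make "equivalence checking has failed" concrete, the sketch below compares two combinational circuits by brute-force input enumeration and returns a failing input vector as a counterexample. This is only an illustration of the problem setting; the paper's method works directly on BDDs, which scale far beyond enumeration:

```python
from itertools import product

def check_equivalence(spec, impl, n_inputs):
    """Exhaustively compare two combinational circuits.

    A brute-force stand-in for BDD-based equivalence checking:
    `spec` and `impl` are functions taking a tuple of n_inputs bits.
    Returns (True, None) if equivalent, else (False, counterexample).
    """
    for assignment in product((0, 1), repeat=n_inputs):
        if spec(assignment) != impl(assignment):
            return False, assignment   # failing input vector for debugging
    return True, None

# Specification: 2-input XOR; buggy implementation: OR (wrong gate type).
spec = lambda x: x[0] ^ x[1]
buggy = lambda x: x[0] | x[1]
ok, cex = check_equivalence(spec, buggy, 2)   # cex is the input (1, 1)
```

The counterexample is the starting point for the localisation step: it pins down which input patterns expose the fault, after which a correction can be searched for.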

    Configurable data center switch architectures

    In this thesis, we explore alternative architectures for implementing configurable data center switches, along with the advantages that such switches can provide. Our first contribution centers on determining switch architectures that can be implemented on a Field Programmable Gate Array (FPGA) to provide configurable switching protocols. In the process, we identify a gap in the availability of frameworks to realistically evaluate the performance of switch architectures in data centers and contribute a simulation framework that relies on realistic data center traffic patterns. Our framework is then used to evaluate the performance of currently existing as well as newly proposed FPGA-amenable switch designs. Through collaborative work with Meng and Papaphilippou, we establish that only small- to medium-range switches can be implemented on today's FPGAs. Our second contribution is a novel switch architecture that integrates a custom in-network hardware accelerator with a generic switch to accelerate Deep Neural Network training applications in data centers. Our proposed accelerator architecture is prototyped on an FPGA, and a scalability study is conducted to demonstrate the trade-offs of an FPGA implementation when compared to an ASIC implementation. In addition to the hardware prototype, we contribute a lightweight load-balancing and congestion control protocol that leverages the unique communication patterns of ML data-parallel jobs to enable fair sharing of network resources across different jobs. Our large-scale simulations demonstrate the ability of our novel switch architecture and lightweight congestion control protocol to both accelerate the training time of machine learning jobs by up to 1.34x and benefit other latency-sensitive applications by reducing their 99th-percentile completion time by up to 4.5x.
As for our final contribution, we identify the main requirements of in-network applications and propose a Network-on-Chip (NoC)-based architecture for supporting a heterogeneous set of applications. Observing the lack of tools to support such research, we provide a tool that can be used to evaluate NoC-based switch architectures.

    Dynamic voltage and frequency scaling with multi-clock distribution systems on SPARC core

    The current implementation of dynamic voltage and frequency scaling (DVS and DFS) in microprocessors is based on a single clock domain per core. In architectures that adopt Instruction Level Parallelism (ILP), multiple execution units may exist and operate concurrently. Performing DVS and DFS on such cores may result in low utilisation and power efficiency. In this thesis, a methodology that implements DVFS with multi-clock distribution systems (DCS) is applied to a processor core to achieve higher throughput and better power efficiency. DCS replaces the core's single clock distribution tree with a multi-clock domain system which, along with dynamic voltage and frequency scaling, creates multiple clock-voltage domains. DCS implements a self-timed interface between the different domains to maintain functionality and ensure data integrity. DCS was implemented on a SPARC core of the UltraSPARC T1 architecture and synthesised targeting a TSMC 120 nm process technology. Two clock domains were used on the SPARC core. The maximum achieved speedup relative to the original core was 1.6x. The power consumed by DCS was 0.173 mW, compared to the core's total power of ~10 W.
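
The power benefit of splitting a core into multiple clock-voltage domains follows from the classic dynamic power model P = alpha * C * V^2 * f: a domain that can run slower can also run at a lower voltage, and power falls quadratically with voltage. A back-of-the-envelope sketch with purely illustrative numbers, not the thesis's measurements:

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """Classic dynamic power model: P = alpha * C * V^2 * f.
    All parameter values used below are illustrative, not from the thesis."""
    return activity * c_eff * vdd**2 * freq

# Single-domain core: everything runs at worst-case voltage and frequency.
single = dynamic_power(c_eff=1e-9, vdd=1.2, freq=1.0e9)        # 1.44 W

# Two clock-voltage domains (each with half the switched capacitance):
# the critical unit keeps full speed, the slow unit drops to 0.9 V / 500 MHz.
fast = dynamic_power(c_eff=0.5e-9, vdd=1.2, freq=1.0e9)        # 0.72 W
slow = dynamic_power(c_eff=0.5e-9, vdd=0.9, freq=0.5e9)        # ~0.20 W
saving = 1.0 - (fast + slow) / single                          # ~36% lower power
```

The quadratic voltage term is what makes per-domain scaling worthwhile even when only part of the core can slow down.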

    Design of Stochastic Computing Architectures using Integrated Optics

    Approximate computing (AC) is an emerging computing approach that allows trading off energy efficiency against computing accuracy. It targets error-resilient applications, such as image processing, where energy consumption is of major concern. Stochastic computing (SC) is an approximate computing paradigm that leads to energy-efficient designs with reduced hardware complexity. In this approach, data is represented as probabilities in bit-stream format. The main drawback of this computing paradigm is the intrinsically serial processing of bit streams, which negatively impacts the processing time. Nanophotonics technology is characterised by high bandwidth and high signal propagation speed, which has the potential to support the electrical domain in computations and speed up the processing rate. The major issue in optical computing (OC) remains the large size of silicon photonics devices, which impacts design scalability. In this thesis, we propose, for the first time, an optical stochastic computing (OSC) approach, where we aim to design SC architectures using integrated optics. For this purpose, we propose a methodology that provides libraries for optical processing and interfaces, e.g., a bit stream generator. We design all-optical gates for the computation and develop transmission models for the architectures. The methodology allows for design space exploration of technological and system-level parameters to optimise design performance, i.e., energy efficiency, computing accuracy, and latency, for the targeted application. This exploration leads to multiple design options that satisfy different design requirements for the selected application. The optical processing libraries include the design of a polynomial architecture that can execute any arbitrary single-input function. We explore the design parameters by implementing a Gamma correction application for image processing.
Results show that tolerating a 4.5x increase in error leads to a 47x energy saving and a 16x faster processing speed. We propose a reconfigurable polynomial architecture that adapts the design order at run-time. The design allows the execution of high-order polynomial functions for better accuracy, or of multiple low-order functions to increase throughput and energy efficiency. Finally, we propose the design of combinational filters. The purpose is to investigate the design of cascaded-gate architectures using photonic crystal (PhC) nanocavities. We use this device to design a Sobel edge detection filter for image processing. The resulting architecture shows 0.85 nJ/pixel energy consumption and 51.2 ns/pixel processing time. The optical interface libraries include the design of different architectures of stochastic number generators (SNG), either electrical-optical or all-optical, to generate the bit streams. We compare these SNGs in terms of computing accuracy and energy efficiency. The results show that all implementations can reach the same level of computing accuracy. Moreover, using an all-optical SNG to design a fully optical 8-bit adder results in a 98% reduction in hardware complexity and 70% energy saving compared to a conventional optical design.
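
In stochastic computing, a value in [0, 1] is encoded as the probability of a 1 in a bit stream, so multiplication reduces to a single AND gate per bit pair; the price is the serial bit-stream processing the abstract identifies as the main drawback, since accuracy improves only with stream length. A minimal electrical-domain sketch of the encoding (the thesis implements the gates optically):

```python
import random

def to_stream(p, n, rng):
    """Encode probability p as an n-bit stochastic stream:
    each bit is 1 with probability p (unipolar encoding)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(a, b):
    """Multiply two stochastic streams with one AND per bit pair:
    P(a_i AND b_i) = P(a_i) * P(b_i) for independent streams."""
    return [x & y for x, y in zip(a, b)]

def value(stream):
    """Decode a stream back to a probability: the fraction of 1s."""
    return sum(stream) / len(stream)

rng = random.Random(42)
n = 4096                       # longer streams: lower variance, more latency
s1 = to_stream(0.5, n, rng)
s2 = to_stream(0.5, n, rng)
prod = value(sc_multiply(s1, s2))   # approximately 0.5 * 0.5 = 0.25
```

The stream length n is the knob behind the accuracy/energy/latency trade-off the thesis explores: halving the acceptable accuracy lets the streams shrink, which is where the energy and speed gains come from.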

    Identification of dynamic circuit specialization opportunities in RTL code

    Dynamic Circuit Specialisation (DCS) optimises a Field-Programmable Gate Array (FPGA) design by assuming a set of its input signals is constant for a reasonable amount of time, leading to a smaller and faster FPGA circuit. When the signals actually change, a new circuit is loaded into the FPGA through runtime reconfiguration. The signals the design is specialised for are called parameters. For certain designs, parameters can be selected so that the DCS implementation is both smaller and faster than the original implementation. However, DCS also introduces an overhead that is difficult for the designer to take into account, making it hard to determine whether a design is improved by DCS or not. This article presents extensive results on a profiling methodology that analyses Register-Transfer Level (RTL) implementations of applications to check whether DCS would be beneficial. It proposes using functional density as a measure for the area efficiency of an implementation, as this measure captures both the overhead and the gains of a DCS implementation. The first step of the methodology is to analyse the dynamic behaviour of signals in the design to find good parameter candidates; the overhead of DCS is highly dependent on this dynamic behaviour. A second stage calculates the functional density for each candidate and compares it to the functional density of the original design. The profiling methodology resulted in three implementations of a profiling tool, the DCS-RTL profiler. The execution time, accuracy, and quality of each implementation are assessed based on data from 10 RTL designs. All designs, except for the two 16-bit adaptable Finite Impulse Response (FIR) filters, are analysed in 1 hour or less.