240 research outputs found

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    Get PDF
    The sustained demand for faster, more powerful chips has been met by the availability of chip manufacturing processes allowing for the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called SYSTEM-ON-CHIP (SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC). MPSoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With a number of on-chip blocks presently ranging in the tens, and quickly approaching the hundreds, the novel issue of how to best provide on-chip communication resources is clearly felt. NETWORKS-ON-CHIPS (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet switching paradigms they involve are also of great help in minimizing wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions: • The design of the NoC architecture needs to strike the best tradeoff among performance, features and the tight area and power constraints of the onchip domain. • Simulation and verification infrastructure must be put in place to explore, validate and optimize the NoC performance. • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions. • Even more so given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to evaluate their suitability for next-generation designs and their area and power costs. This dissertation performs a design space exploration of network-on-chip architectures, in order to point-out the trade-offs associated with the design of each individual network building blocks and with the design of network topology overall. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with themselves and with early networkon- chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures and to point out the challenges that lie ahead in order to make this new interconnect technology come true. Among these latter, technologyrelated challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy. In particular, leakage power dissipation, containment of process variations and of their effects. The achievement of the above objectives was enabled by means of a NoC simulation environment for cycleaccurate modelling and simulation and by means of a back-end facility for the study of NoC physical implementation effects. Overall, all the results provided by this work have been validated on actual silicon layout

    Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine

    Get PDF
    Deep neural networks have achieved impressive results in computer vision and machine learning. Unfortunately, state-of-the-art networks are extremely compute and memory intensive which makes them unsuitable for mW-devices such as IoT end-nodes. Aggressive quantization of these networks dramatically reduces the computation and memory footprint. Binary-weight neural networks (BWNs) follow this trend, pushing weight quantization to the limit. Hardware accelerators for BWNs presented up to now have focused on core efficiency, disregarding I/O bandwidth and system-level efficiency that are crucial for deployment of accelerators in ultra-low power devices. We present Hyperdrive: a BWN accelerator dramatically reducing the I/O bandwidth exploiting a novel binary-weight streaming approach, which can be used for arbitrarily sized convolutional neural network architecture and input resolution by exploiting the natural scalability of the compute units both at chip-level and system-level by arranging Hyperdrive chips systolically in a 2D mesh while processing the entire feature map together in parallel. Hyperdrive achieves 4.3 TOp/s/W system-level efficiency (i.e., including I/Os)---3.1x higher than state-of-the-art BWN accelerators, even if its core uses resource-intensive FP16 arithmetic for increased robustness

    Techniques of Energy-Efficient VLSI Chip Design for High-Performance Computing

    Get PDF
    How to implement quality computing with the limited power budget is the key factor to move very large scale integration (VLSI) chip design forward. This work introduces various techniques of low power VLSI design used for state of art computing. From the viewpoint of power supply, conventional in-chip voltage regulators based on analog blocks bring the large overhead of both power and area to computational chips. Motivated by this, a digital based switchable pin method to dynamically regulate power at low circuit cost has been proposed to make computing to be executed with a stable voltage supply. For one of the widely used and time consuming arithmetic units, multiplier, its operation in logarithmic domain shows an advantageous performance compared to that in binary domain considering computation latency, power and area. However, the introduced conversion error reduces the reliability of the following computation (e.g. multiplication and division.). In this work, a fast calibration method suppressing the conversion error and its VLSI implementation are proposed. The proposed logarithmic converter can be supplied by dc power to achieve fast conversion and clocked power to reduce the power dissipated during conversion. Going out of traditional computation methods and widely used static logic, neuron-like cell is also studied in this work. Using multiple input floating gate (MIFG) metal-oxide semiconductor field-effect transistor (MOSFET) based logic, a 32-bit, 16-operation arithmetic logic unit (ALU) with zipped decoding and a feedback loop is designed. The proposed ALU can reduce the switching power and has a strong driven-in capability due to coupling capacitors compared to static logic based ALU. Besides, recent neural computations bring serious challenges to digital VLSI implementation due to overload matrix multiplications and non-linear functions. An analog VLSI design which is compatible to external digital environment is proposed for the network of long short-term memory (LSTM). The entire analog based network computes much faster and has higher energy efficiency than the digital one

    IDPAL – A Partially-Adiabatic Energy-Efficient Logic Family: Theory and Applications to Secure Computing

    Get PDF
    Low-power circuits and issues associated with them have gained a significant amount of attention in recent years due to the boom in portable electronic devices. Historically, low-power operation relied heavily on technology scaling and reduced operating voltage, however this trend has been slowing down recently due to the increased power density on chips. This dissertation introduces a new very-low power partially-adiabatic logic family called Input-Decoupled Partially-Adiabatic Logic (IDPAL) with applications in low-power circuits. Experimental results show that IDPAL reduces energy usage by 79% compared to equivalent CMOS implementations and by 25% when compared to the best adiabatic implementation. Experiments ranging from a simple buffer/inverter up to a 32-bit multiplier are explored and result in consistent energy savings, showing that IDPAL could be a viable candidate for a low-power circuit implementation. This work also shows an application of IDPAL to secure low-power circuits against power analysis attacks. It is often assumed that encryption algorithms are perfectly secure against attacks, however, most times attacks using side channels on the hardware implementation of an encryption operation are not investigated. Power analysis attacks are a subset of side channel attacks and can be implemented by measuring the power used by a circuit during an encryption operation in order to obtain secret information from the circuit under attack. Most of the previously proposed solutions for power analysis attacks use a large amount of power and are unsuitable for a low-power application. The almost-equal energy consumption for any given input in an IDPAL circuit suggests that this logic family is a good candidate for securing low-power circuits again power analysis attacks. Experimental results ranging from small circuits to large multipliers are performed and the power-analysis attack resistance of IDPAL is investigated. Results show that IDPAL circuits are not only low-power but also the most secure against power analysis attacks when compared to other adiabatic low-power circuits. Finally, a hybrid adiabatic-CMOS microprocessor design is presented. The proposed microprocessor uses IDPAL for the implementation of circuits with high switching activity (e.g. ALU) and CMOS logic for other circuits (e.g. memory, controller). An adiabatic-CMOS interface for transforming adiabatic signals to square-wave signals is presented and issues associated with a hybrid implementation and their solutions are also discussed

    Testing of Level Shifters in Multiple Voltage Designs

    No full text
    The use of multiple voltages for different cores is becoming a widely accepted technique for efficient power management. Level shifters are used as interfaces between voltage domains. Through extensive transistor level simulations of resistive open, bridging and resistive short faults, we have classified the testing of level shifters into PASSIVE and ACTIVE modes. We examine if high test coverage can be achieved in the PASSIVE mode. We consider resistive opens and shorts and show that, for testing purposes, consideration of purely digital fault effects is sufficient. Thus conventional digital DfT can be employed to test level shifters. In all cases, we conclude that using sets of single supply voltages for testing is sufficient

    Semiconductor-technology exploration : getting the most out of the MOST

    Get PDF

    Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine

    Get PDF
    Deep neural networks have achieved impressive results in computer vision and machine learning. Unfortunately, state-of-the-art networks are extremely compute and memory intensive, which makes them unsuitable for mW-devices such as loT end-nodes. Aggressive quantization of these networks dramatically reduces the computation and memory footprint. Binary-weight neural networks (BWNs) follow this trend, pushing weight quantization to the limit. Hardware accelerators for BWNs presented up to now have focused on core efficiency, disregarding I/O bandwidth, and system-level efficiency that are crucial for the deployment of accelerators in ultra-low power devices. We present Hyperdrive: a BWN accelerator dramatically reducing the I/O bandwidth exploiting a novel binary-weight streaming approach, which can he used for an arbitrarily sized convolutional neural network architecture and input resolution by exploiting the natural scalability of the compute units both at chip-level and system-level by arranging Hyperdrive chips systolically in a 2D mesh while processing the entire feature map together in parallel. Hyperdrive achieves 4.3 TOp/s/W system-level efficiency (i.e., including I/Os)-3.1 x higher than state-of-the-art BWN accelerators, even if its core uses resource-intensive FP16 arithmetic for increased robustness

    Hybrid FPGA: Architecture and Interface

    No full text
    Hybrid FPGAs (Field Programmable Gate Arrays) are composed of general-purpose logic resources with different granularities, together with domain-specific coarse-grained units. This thesis proposes a novel hybrid FPGA architecture with embedded coarse-grained Floating Point Units (FPUs) to improve the floating point capability of FPGAs. Based on the proposed hybrid FPGA architecture, we examine three aspects to optimise the speed and area for domain-specific applications. First, we examine the interface between large coarse-grained embedded blocks (EBs) and fine-grained elements in hybrid FPGAs. The interface includes parameters for varying: (1) aspect ratio of EBs, (2) position of the EBs in the FPGA, (3) I/O pins arrangement of EBs, (4) interconnect flexibility of EBs, and (5) location of additional embedded elements such as memory. Second, we examine the interconnect structure for hybrid FPGAs. We investigate how large and highdensity EBs affect the routing demand for hybrid FPGAs over a set of domain-specific applications. We then propose three routing optimisation methods to meet the additional routing demand introduced by large EBs: (1) identifying the best separation distance between EBs, (2) adding routing switches on EBs to increase routing flexibility, and (3) introducing wider channel width near the edge of EBs. We study and compare the trade-offs in delay, area and routability of these three optimisation methods. Finally, we employ common subgraph extraction to determine the number of floating point adders/subtractors, multipliers and wordblocks in the FPUs. The wordblocks include registers and can implement fixed point operations. We study the area, speed and utilisation trade-offs of the selected FPU subgraphs in a set of floating point benchmark circuits. We develop an optimised coarse-grained FPU, taking into account both architectural and system-level issues. Furthermore, we investigate the trade-offs between granularities and performance by composing small FPUs into a large FPU. The results of this thesis would help design a domain-specific hybrid FPGA to meet user requirements, by optimising for speed, area or a combination of speed and area
    • …
    corecore