107 research outputs found

    Adaptive Latency Insensitive Protocols

    Latency-insensitive design copes with the excessive delays typical of global wires in current and future IC technologies. It achieves this by encapsulating synchronous logic blocks in wrappers that communicate through a latency-insensitive protocol (LIP) over pipelined interconnects. Previously proposed solutions suffer either from an excessive throughput penalty or from a lack of generality. This article presents an adaptive LIP that outperforms previous static implementations, as demonstrated on two relevant case studies, a microprocessor and an MPEG encoder, whose components we made insensitive to the latencies of their interconnections through a newly developed wrapper. We also present an informal exposition of the theoretical basis of adaptive LIPs, as well as implementation details.
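    The handshake at the heart of a latency-insensitive protocol can be captured by a simple stalling rule: the wrapped synchronous block (the "pearl") fires only when every input channel carries valid data and no output channel asserts backpressure; otherwise its shell stalls it, emits void tokens downstream, and stops the upstream producers. The Python sketch below illustrates that rule; the Channel/Shell names and structure are illustrative assumptions, not the article's wrapper, and an adaptive LIP would further refine when and how stalls are propagated.

        # Minimal sketch of the stall rule a latency-insensitive shell enforces
        # around a synchronous block ("pearl"). Names and structure are
        # illustrative assumptions, not taken from the article.

        class Channel:
            """Point-to-point link carrying either a data item or a void token."""
            def __init__(self):
                self.data = None      # None models a void token
                self.stop = False     # backpressure asserted by the consumer

        class Shell:
            def __init__(self, pearl, inputs, outputs):
                self.pearl = pearl    # callable: tuple of inputs -> tuple of outputs
                self.inputs = inputs  # list of Channel
                self.outputs = outputs

            def clock_step(self):
                all_valid = all(ch.data is not None for ch in self.inputs)
                none_stopped = not any(ch.stop for ch in self.outputs)
                fire = all_valid and none_stopped
                if fire:
                    results = self.pearl(tuple(ch.data for ch in self.inputs))
                    for ch, value in zip(self.outputs, results):
                        ch.data = value
                    for ch in self.inputs:
                        ch.data = None        # input tokens consumed
                else:
                    for ch in self.outputs:
                        ch.data = None        # emit void tokens downstream
                for ch in self.inputs:
                    ch.stop = not fire        # backpressure upstream while stalled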

    Doctor of Philosophy

    Portable electronic devices will be limited by the available energy of existing battery chemistries for the foreseeable future. However, the system-on-chips (SoCs) used in these devices are expected to offer more functionality and longer battery life. A difficult problem in SoC design is providing energy-efficient communication between its components while maintaining the required performance. This dissertation introduces a novel energy-efficient network-on-chip (NoC) communication architecture. A NoC is used within complex SoCs due to its superior performance, energy usage, modularity, and scalability over traditional bus and point-to-point methods of connecting SoC components. This is the first academic research that combines asynchronous NoC circuits, a focus on energy-efficient design, and a software framework to customize a NoC for a particular SoC. Its key contribution is demonstrating that a simple, asynchronous NoC concept is a good match for low-power devices and is a fruitful area for additional investigation. The proposed NoC is energy-efficient in several ways: simple switch and arbitration logic, low port radix, latch-based router buffering, a topology with the minimum number of 3-port routers, and the asynchronous advantages of zero dynamic power consumption while idle and the lack of a clock tree. The tool framework developed for this work uses novel methods to optimize the topology and router floorplan based on simulated annealing and force-directed movement. It studies link pipelining techniques that yield improved throughput in an energy-efficient manner. A simulator is automatically generated for each customized NoC, and its traffic generators use a self-similar message distribution, as opposed to Poisson, to better match application behavior. Compared to a conventional synchronous NoC, this design is superior, achieving comparable message latency with half the energy.
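    One concrete piece of the simulation flow mentioned above is the choice of a self-similar rather than Poisson message distribution. A common way to approximate self-similar injection is to draw inter-arrival gaps from a heavy-tailed distribution such as Pareto, which yields the bursty, long-range-dependent traffic observed in real applications. The sketch below contrasts the two sources; the parameter values are illustrative assumptions, not those used in the dissertation.

        # Poisson vs. heavy-tailed (Pareto) message injection, as a rough sketch
        # of why self-similar traffic generators behave differently from Poisson
        # ones. Parameters are illustrative, not the dissertation's settings.

        import random

        def poisson_arrivals(rate, n):
            """Exponential inter-arrival gaps give a Poisson arrival process."""
            t, times = 0.0, []
            for _ in range(n):
                t += random.expovariate(rate)
                times.append(t)
            return times

        def heavy_tailed_arrivals(alpha, xmin, n):
            """Pareto gaps with 1 < alpha < 2 produce bursty, long-range-dependent
            traffic that better matches application behavior."""
            t, times = 0.0, []
            for _ in range(n):
                t += random.paretovariate(alpha) * xmin
                times.append(t)
            return times

        if __name__ == "__main__":
            print(poisson_arrivals(rate=1.0, n=5))
            print(heavy_tailed_arrivals(alpha=1.5, xmin=0.5, n=5))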

    Smart technologies for effective reconfiguration: the FASTER approach

    Current and future computing systems increasingly require that their functionality stays flexible after the system is operational, in order to cope with changing user requirements and improvements in system features, e.g., changing protocols and data-coding standards, evolving demands for support of different user applications, and newly emerging applications in communication, computing, and consumer electronics. Extending the functionality and the lifetime of products therefore requires the addition of new functionality to track and satisfy customers' needs as well as market and technology trends. Many contemporary products incorporate hardware accelerators alongside software for reasons of performance and power efficiency. While adaptivity of software is straightforward, adapting the hardware to changing requirements is a challenging problem requiring delicate solutions. The FASTER (Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration) project aims at introducing a complete methodology that allows designers to easily implement a system specification on a platform comprising a general-purpose processor combined with multiple accelerators running on an FPGA, taking a high-level description as input and fully exploiting, both at design time and at run time, the capabilities of partial dynamic reconfiguration. The goal is that, for selected application domains, the FASTER toolchain will reduce the design and verification time of complex reconfigurable systems while providing novel verification features that are not available in existing tool flows.

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    The sustained demand for faster, more powerful chips has been met by chip manufacturing processes that allow the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called a system-on-chip (SoC) or multi-processor system-on-chip (MPSoC). MPSoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With the number of on-chip blocks presently in the tens and quickly approaching the hundreds, the issue of how best to provide on-chip communication resources is clearly felt. Networks-on-chip (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet-switching paradigms they involve also help minimize wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions:
    • The NoC architecture must strike the best trade-off among performance, features, and the tight area and power constraints of the on-chip domain.
    • Simulation and verification infrastructure must be put in place to explore, validate, and optimize NoC performance.
    • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions.
    • Given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to assess their suitability for next-generation designs and their area and power costs.
    This dissertation performs a design space exploration of network-on-chip architectures in order to point out the trade-offs associated with the design of each individual network building block and with the design of the overall network topology. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with one another and with early network-on-chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures, and the challenges that must be addressed to make this new interconnect technology a reality. Among the latter, technology-related challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy, in particular leakage power dissipation and the containment of process variations and their effects. The above objectives were achieved by means of a NoC simulation environment for cycle-accurate modelling and simulation, and by means of a back-end facility for the study of NoC physical implementation effects. All the results provided by this work have been validated on actual silicon layouts.

    Characterization and Avoidance of Critical Pipeline Structures in Aggressive Superscalar Processors

    In recent years, with only small fractions of modern processors now accessible in a single cycle, computer architects constantly fight against signal propagation delays across the die. Unfortunately, this trend continues to shift inward, and now even the most internal features of the pipeline are designed around communication, not computation. To address the inward creep of this constraint, this work focuses on the characterization of communication within the pipeline itself, architectural techniques to avoid it when possible, and layout co-design for early detection of problems. I present a novel detection tool for common-case operand movement which can rapidly characterize an application's dataflow patterns. The results are suitable for exploitation, as a small number of patterns can describe a significant portion of modern applications. Work on dynamic dependence collapsing takes the observations from the pattern results and shows how certain groups of operations can be dynamically grouped, avoiding unnecessary communication between individual instructions. This technique also amplifies the efficiency of pipeline data structures such as the reorder buffer, increasing both IPC and frequency. I also identify the same sets of collapsible instructions at compile time, producing the same benefits with minimal hardware complexity. This is done in a backward-compatible manner, as the groups are exposed by simple reordering of the binary's instructions. I present aggressive pipelining approaches for these resources which avoid the critical timing often presumed necessary in aggressive superscalar processors. As these structures are designed for the worst case, pipelining them can produce a greater frequency benefit than IPC loss. I also use the observation that the dynamic issue order for instructions in aggressive superscalar processors is predictable. Thus, a hardware mechanism is introduced for efficiently caching the wakeup order for groups of instructions. These wakeup vectors are then used to speculatively schedule instructions, avoiding dynamic scheduling when it is not necessary. Finally, I present a novel approach to fast and high-quality chip layout. By allowing architects to quickly evaluate what-if scenarios during early high-level design, chip designs are less likely to encounter implementation problems later in the process.
    Ph.D. Committee Chair: Scott Wills; Committee Members: David Schimmel, Gabriel Loh, Hsien-Hsin Lee, Yorai Ward
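    As a rough illustration of the dependence-collapsing idea described above, the sketch below walks a dataflow graph and fuses chains in which a producer's result has exactly one consumer, so the pair can be scheduled as one group and the producer-to-consumer communication disappears. The data layout and heuristic are assumptions for illustration, not the dissertation's actual detector or grouping policy.

        # Illustrative sketch: find collapsible chains in a dataflow graph.
        # A producer whose result feeds exactly one consumer can be fused with
        # that consumer, removing one instruction-to-instruction communication.
        # This is a generic heuristic, not the dissertation's detector.

        def collapsible_chains(consumers):
            """consumers maps an instruction id to the ids that read its result."""
            chains, visited = [], set()
            for inst in consumers:
                if inst in visited:
                    continue
                chain, cur = [inst], inst
                # Extend while the current result has a single, unvisited consumer.
                while len(consumers.get(cur, [])) == 1:
                    nxt = consumers[cur][0]
                    if nxt in visited:
                        break
                    chain.append(nxt)
                    visited.add(nxt)
                    cur = nxt
                visited.add(inst)
                if len(chain) > 1:
                    chains.append(chain)
            return chains

        # Example: i0 feeds only i1, and i1 feeds only i2, so all three collapse.
        print(collapsible_chains({"i0": ["i1"], "i1": ["i2"], "i2": []}))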

    Cross-layer design of thermally-aware 2.5D systems

    Over the past decade, CMOS technology scaling has slowed down. To sustain the historic performance improvement predicted by Moore's Law, in the mid-2000s the computing industry moved to manycore systems and the exploitation of parallelism. The on-chip power densities of manycore systems, however, continued to increase after the breakdown of Dennard scaling. This leads to the 'dark silicon' problem, whereby not all cores can operate at the highest frequency or be turned on simultaneously due to thermal constraints. As a result, we have not been able to take full advantage of the parallelism in manycore systems. One of the 'More than Moore' approaches being explored to address this problem is the integration of diverse functional components onto a substrate using 2.5D integration technology. 2.5D integration provides opportunities to exploit chiplet placement flexibility to address the dark silicon problem and mitigate the thermal stress of today's high-performance systems. These opportunities can be leveraged to improve the overall performance of manycore heterogeneous computing systems. Broadly, this thesis aims at designing thermally-aware 2.5D systems. More specifically, to address the dark silicon problem of manycore systems, we first propose a single-layer thermally-aware chiplet organization methodology for homogeneous 2.5D systems. The key idea is to strategically insert spacing between the chiplets of a 2.5D manycore system to lower the operating temperature, and thus reclaim dark silicon by allowing more active cores and/or a higher operating frequency under a temperature threshold. We investigate the manufacturing cost and thermal behavior of 2.5D systems, then formulate and solve an optimization problem that jointly maximizes performance and minimizes manufacturing cost. We then enhance our methodology by incorporating a cross-layer co-optimization approach that jointly maximizes performance and minimizes manufacturing cost and operating temperature across the logical, physical, and circuit layers. We propose a novel gas-station link design that enables pipelining in passive interposers. We then extend our thermally-aware optimization methodology to the network routing and chiplet placement of heterogeneous 2.5D systems, which consist of central processing unit (CPU) chiplets, graphics processing unit (GPU) chiplets, accelerator chiplets, and/or memory stacks, jointly minimizing the total wirelength and the system temperature. Our enhanced methodology increases the thermal design power budget and thereby improves the thermally-constrained performance of the system.
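    The spacing trade-off at the core of the methodology above can be pictured with a toy model: wider inter-chiplet spacing lowers peak temperature, which reclaims dark silicon, but enlarges the interposer and raises manufacturing cost. The sketch below sweeps candidate spacings and picks the one that maximizes a weighted performance-minus-cost objective under a temperature limit; the temperature and cost models and all constants are stand-in assumptions, not the calibrated models used in the thesis.

        # Toy sketch of the thermally-aware spacing trade-off: more spacing ->
        # lower peak temperature (more thermal headroom, more active cores) but
        # higher interposer manufacturing cost. Models are stand-in assumptions.

        def peak_temp(spacing_mm, base=95.0, drop_per_mm=6.0, floor=60.0):
            return max(floor, base - drop_per_mm * spacing_mm)

        def interposer_cost(spacing_mm, base=1.0, cost_per_mm=0.15):
            return base + cost_per_mm * spacing_mm

        def performance(spacing_mm, t_limit=85.0):
            # Headroom below the thermal limit stands in for reclaimed cores.
            return max(0.0, t_limit - peak_temp(spacing_mm))

        def best_spacing(candidates, t_limit=85.0, w_perf=1.0, w_cost=2.0):
            feasible = [s for s in candidates if peak_temp(s) <= t_limit]
            return max(feasible,
                       key=lambda s: w_perf * performance(s, t_limit)
                                     - w_cost * interposer_cost(s))

        print(best_spacing([0.0, 0.5, 1.0, 2.0, 4.0, 8.0]))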

    Methodologies and Toolflows for the Predictable Design of Reliable and Low-Power NoCs

    There is today an unmistakable need to evolve design methodologies and tool flows for network-on-chip based embedded systems. In particular, meeting low-power requirements is a more urgent challenge than ever. Modern circuits feature billions of transistors, and neither power management techniques nor battery capacity can keep pace with the ever-higher integration capability of digital devices. Moreover, power concerns come together with modern nanoscale silicon technology design issues. On one hand, system failure rates are expected to increase exponentially at every technology node if integrated circuit wear-out failure mechanisms are not compensated for; yet error detection and/or correction mechanisms have a non-negligible impact on network power. On the other hand, to meet stringent time-to-market deadlines, the design cycle of such a distributed and heterogeneous architecture must not be prolonged by unnecessary design iterations. Overall, there is a clear need to better discriminate among reliability strategies and interconnect topology solutions upfront, by ranking designs based on power metrics. In this thesis, we tackle this challenge by proposing power-aware design technologies. Finally, we consider the most aggressive and disruptive methodology for embedded systems with ultra-low power constraints: migrating the basic NoC building blocks to an asynchronous (or clockless) design style. We address this challenge by delivering a standard-cell design methodology and mainstream CAD tool flows, thereby partially relaxing the requirement of using asynchronous blocks only as hard macros.

    Mocarabe: High-Performance Time-Multiplexed Overlays for FPGAs

    Coarse-grained reconfigurable array (CGRA) overlays can improve dataflow kernel throughput by an order of magnitude over Vivado HLS on the Xilinx Alveo U280. This is possible with a combination of carefully floorplanned, high-frequency designs (645-768 MHz Torus, 788-856 MHz Mesh, 583-746 MHz BFT) and a scalable, communication-aware compiler. Our CGRA architecture provides configurable processing element (PE) functionality, backed by a configurable number of communication channels to match application demands. Compared to recent FPGA overlays like the 4×4 ADRES and HyCUBE implementations in CGRA-ME, our design operates at a clock frequency up to 3.4× faster, while scaling to an orders-of-magnitude larger array size of 19×69 on the Xilinx Alveo U280. We propose a novel topology-agnostic placer that formulates CGRA placement as an integer linear programming (ILP) problem. It optimizes placement regardless of topology, and even for non-linear objective functions, by using pre-computed placement costs as inputs to the ILP formulation. Using the ILP placer reduces placement quadratic wirelength by up to 37% compared to the commonly used simulated annealing approach, but increases runtime from less than a minute to hours. Our communication-aware compiler targets HLS objectives such as initiation interval (II) and minimizes communication cost using an ILP formulation. Unlike SDC schedulers in FPGA HLS tools, we treat data movement as a first-class citizen by encoding the space and time resources of the communication network in the ILP formulation. Given the same constraints on operational resources as Vivado HLS, we retain our target II and achieve up to 9.2× higher frequency. We compare the Torus and Mesh topologies and show that Mesh has lower latency per unit area than Torus on the same benchmarks.
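    The placement formulation described above can be sketched as a plain assignment ILP: binary variables map each operation to a site, every operation is placed exactly once, each site holds at most one operation, and the objective sums pre-computed per-(operation, site) costs, so the ILP stays linear even when the underlying objective (e.g., quadratic wirelength) is not. The example below uses the PuLP library and toy data as assumptions; the actual Mocarabe formulation, cost model, and constraints differ.

        # Assignment-style ILP placement sketch in the spirit described above.
        # Costs are pre-computed per (operation, site) pair, keeping the ILP
        # linear. PuLP and the toy data are assumptions for illustration only.

        from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, value

        ops = ["add0", "mul0", "mul1"]
        sites = ["pe_0_0", "pe_0_1", "pe_1_0", "pe_1_1"]

        # Illustrative pre-computed placement cost for each (op, site) pair.
        cost = {(o, s): (i + 1) * (j + 2) % 7 + 1
                for i, o in enumerate(ops) for j, s in enumerate(sites)}

        prob = LpProblem("cgra_placement", LpMinimize)
        x = {(o, s): LpVariable(f"x_{o}_{s}", cat=LpBinary)
             for o in ops for s in sites}

        prob += lpSum(cost[o, s] * x[o, s] for o in ops for s in sites)
        for o in ops:                      # each operation placed exactly once
            prob += lpSum(x[o, s] for s in sites) == 1
        for s in sites:                    # each site holds at most one operation
            prob += lpSum(x[o, s] for o in ops) <= 1

        prob.solve()
        print({o: s for o in ops for s in sites if value(x[o, s]) == 1})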