7,492 research outputs found

    FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software

    Full text link
    A deep-learning inference accelerator is synthesized from a C-language software program parallelized with Pthreads. The software implementation uses the well-known producer/consumer model with parallel threads interconnected by FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into parallel FPGA hardware, translating software parallelism into spatial parallelism. A complete system is generated where convolution, pooling and padding are realized in the synthesized accelerator, with remaining tasks executing on an embedded ARM processor. The accelerator incorporates reduced precision, and a novel approach for zero-weight-skipping in convolution. On a mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective GOPS

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Get PDF
    The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    Get PDF
    The sustained demand for faster, more powerful chips has been met by the availability of chip manufacturing processes allowing for the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called SYSTEM-ON-CHIP (SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC). MPSoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With a number of on-chip blocks presently ranging in the tens, and quickly approaching the hundreds, the novel issue of how to best provide on-chip communication resources is clearly felt. NETWORKS-ON-CHIPS (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet switching paradigms they involve are also of great help in minimizing wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions: ‱ The design of the NoC architecture needs to strike the best tradeoff among performance, features and the tight area and power constraints of the onchip domain. ‱ Simulation and verification infrastructure must be put in place to explore, validate and optimize the NoC performance. ‱ NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions. ‱ Even more so given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to evaluate their suitability for next-generation designs and their area and power costs. This dissertation performs a design space exploration of network-on-chip architectures, in order to point-out the trade-offs associated with the design of each individual network building blocks and with the design of network topology overall. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with themselves and with early networkon- chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures and to point out the challenges that lie ahead in order to make this new interconnect technology come true. Among these latter, technologyrelated challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy. In particular, leakage power dissipation, containment of process variations and of their effects. The achievement of the above objectives was enabled by means of a NoC simulation environment for cycleaccurate modelling and simulation and by means of a back-end facility for the study of NoC physical implementation effects. Overall, all the results provided by this work have been validated on actual silicon layout

    High-level synthesis of VLSI circuits

    Get PDF

    Digital implementation of the cellular sensor-computers

    Get PDF
    Two different kinds of cellular sensor-processor architectures are used nowadays in various applications. The first is the traditional sensor-processor architecture, where the sensor and the processor arrays are mapped into each other. The second is the foveal architecture, in which a small active fovea is navigating in a large sensor array. This second architecture is introduced and compared here. Both of these architectures can be implemented with analog and digital processor arrays. The efficiency of the different implementation types, depending on the used CMOS technology, is analyzed. It turned out, that the finer the technology is, the better to use digital implementation rather than analog

    Exploring resource/performance trade-offs for streaming applications on embedded multiprocessors

    Get PDF
    Embedded system design is challenged by the gap between the ever-increasing customer demands and the limited resource budgets. The tough competition demands ever-shortening time-to-market and product lifecycles. To solve or, at least to alleviate, the aforementioned issues, designers and manufacturers need model-based quantitative analysis techniques for early design-space exploration to study trade-offs of different implementation candidates. Moreover, modern embedded applications, especially the streaming applications addressed in this thesis, face more and more dynamic input contents, and the platforms that they are running on are more flexible and allow runtime configuration. Quantitative analysis techniques for embedded system design have to be able to handle such dynamic adaptable systems. This thesis has the following contributions: - A resource-aware extension to the Synchronous Dataflow (SDF) model of computation. - Trade-off analysis techniques, both in the time-domain and in the iterationdomain (i.e., on an SDF iteration basis), with support for resource sharing. - Bottleneck-driven design-space exploration techniques for resource-aware SDF. - A game-theoretic approach to controller synthesis, guaranteeing performance under dynamic input. As a first contribution, we propose a new model, as an extension of static synchronous dataflow graphs (SDF) that allows the explicit modeling of resources with consistency checking. The model is called resource-aware SDF (RASDF). The extension enables us to investigate resource sharing and to explore different scheduling options (ways to allocate the resources to the different tasks) using state-space exploration techniques. Consistent SDF and RASDF graphs have the property that an execution occurs in so-called iterations. An iteration typically corresponds to the processing of a meaningful piece of data, and it returns the graph to its initial state. On multiprocessor platforms, iterations may be executed in a pipelined fashion, which makes performance analysis challenging. As the second contribution, this thesis develops trade-off analysis techniques for RASDF, both in the time-domain and in the iteration-domain (i.e., on an SDF iteration basis), to dimension resources on platforms. The time-domain analysis allows interleaving of different iterations, but the size of the explored state space grows quickly. The iteration-based technique trades the potential of interleaving of iterations for a compact size of the iteration state space. An efficient bottleneck-driven designspace exploration technique for streaming applications, the third main contribution in this thesis, is derived from analysis of the critical cycle of the state space, to reveal bottleneck resources that are limiting the throughput. All techniques are based on state-based exploration. They enable system designers to tailor their platform to the required applications, based on their own specific performance requirements. Pruning techniques for efficient exploration of the state space have been developed. Pareto dominance in terms of performance and resource usage is used for exact pruning, and approximation techniques are used for heuristic pruning. Finally, the thesis investigates dynamic scheduling techniques to respond to dynamic changes in input streams. The fourth contribution in this thesis is a game-theoretic approach to tackle controller synthesis to select the appropriate schedules in response to dynamic inputs from the environment. The approach transforms the explored iteration state space of a scenario- and resource-aware SDF (SARA SDF) graph to a bipartite game graph, and maps the controller synthesis problem to the problem of finding a winning positional strategy in a classical mean payoff game. A winning strategy of the game can be used to synthesize the controller of schedules for the system that is guaranteed to satisfy the throughput requirement given by the designer

    Reducing dynamic power consumption in next generation DS-CDMA mobile communication receivers

    Get PDF
    Reduction of the power consumption in portable wireless receivers is important for cellular systems, including UMTS and IMT2000. This paper explores the architectural design-space and methodologies for reducing the dynamic power dissipation in the Direct Sequence Code Division Multiple Access (DS-CDMA) downlink RAKE receiver. At the algorithm level, we investigate the tradeoffs of reduced precision and arithmetic complexity on the receiver performance. We then present and analyse two architectures for implementing the reference and reduced complexity receivers, with respect to dynamic power dissipation. The combined effect of reduced precision and complexity reduction leads to a 37.44% power savings.Nokia CorporationTexas Instruments Inc.National Science Foundatio
    • 

    corecore