
    The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture

    One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low-latency, high-bandwidth communication while preserving scalability. Besides off-chip communication between processors, recent multi-tile (i.e. multi-core) architectures face the challenge of providing an efficient on-chip interconnection network between processor tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communication with a uniform RDMA-style API, over a multi-dimensional direct network with a (possibly) hybrid topology. Comment: 8 pages, 11 figures, submitted to Hot Interconnect 200
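
    The abstract above centers on a uniform RDMA-style API for on-chip and off-chip inter-tile transfers. A minimal C sketch of what such an interface could look like follows; the dnp_* names, the handle type, and the coordinate layout are illustrative assumptions, not the actual DNP IP Library interface.

        /* Hypothetical RDMA-style inter-tile API; names and types are assumptions. */
        #include <stddef.h>
        #include <stdint.h>

        typedef struct { uint32_t x, y, z; } dnp_coord_t;  /* tile coordinate in a (hypothetical) 3D direct network */
        typedef struct { int id; } dnp_handle_t;           /* completion handle for asynchronous transfers          */

        /* One-sided write: copy 'len' bytes from local 'src' into the remote tile's
         * memory at 'remote_addr', without involving the remote CPU. */
        dnp_handle_t dnp_put(dnp_coord_t dest, uint64_t remote_addr,
                             const void *src, size_t len);

        /* One-sided read: fetch 'len' bytes from the remote tile into local 'dst'. */
        dnp_handle_t dnp_get(dnp_coord_t src_tile, uint64_t remote_addr,
                             void *dst, size_t len);

        /* Block until the transfer identified by 'h' has completed. */
        void dnp_wait(dnp_handle_t h);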

    Fabric defect detection using the wavelet transform in an ARM processor

    Small devices used in our day life are constructed with powerful architectures that can be used for industrial applications when requiring portability and communication facilities. We present in this paper an example of the use of an embedded system, the Zeus epic 520 single board computer, for defect detection in textiles using image processing. We implement the Haar wavelet transform using the embedded visual C++ 4.0 compiler for Windows CE 5. The algorithm was tested for defect detection using images of fabrics with five types of defects. An average of 95% in terms of correct defect detection was obtained, achieving a similar performance than using processors with float point arithmetic calculations
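
    The detector is built on the Haar wavelet transform. The plain-C sketch below of one 2D decomposition level illustrates the general technique under a common unnormalized averaging/differencing convention; it is not the authors' embedded Visual C++ implementation.

        /* Single-level 2D Haar decomposition of a w x h grayscale image (w, h even).
         * Produces the LL / LH / HL / HH sub-bands; defect candidates show up as
         * large coefficients in the detail sub-bands. */
        #include <stdlib.h>

        static void haar_rows(float *img, int w, int h, float *tmp) {
            for (int y = 0; y < h; ++y) {
                const float *row = img + y * w;
                for (int x = 0; x < w / 2; ++x) {
                    float a = row[2 * x], b = row[2 * x + 1];
                    tmp[x]         = (a + b) * 0.5f;   /* approximation (low-pass) */
                    tmp[w / 2 + x] = (a - b) * 0.5f;   /* detail (high-pass)       */
                }
                for (int x = 0; x < w; ++x) img[y * w + x] = tmp[x];
            }
        }

        static void haar_cols(float *img, int w, int h, float *tmp) {
            for (int x = 0; x < w; ++x) {
                for (int y = 0; y < h / 2; ++y) {
                    float a = img[(2 * y) * w + x], b = img[(2 * y + 1) * w + x];
                    tmp[y]         = (a + b) * 0.5f;
                    tmp[h / 2 + y] = (a - b) * 0.5f;
                }
                for (int y = 0; y < h; ++y) img[y * w + x] = tmp[y];
            }
        }

        void haar2d_level(float *img, int w, int h) {
            int n = (w > h ? w : h);
            float *tmp = malloc((size_t)n * sizeof *tmp);
            if (!tmp) return;
            haar_rows(img, w, h, tmp);
            haar_cols(img, w, h, tmp);
            free(tmp);
        }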

    Design space exploration tools for the ByoRISC configurable processor family

    In this paper, the ByoRISC (Build your own RISC) configurable application-specific instruction-set processor (ASIP) family is presented. ByoRISCs, as vendor-independent cores, provide extensive architectural parameters over a baseline processor, which can be customized by application-specific hardware extensions (ASHEs). Such extensions realize multi-input multi-output (MIMO) custom instructions with local state and load/store accesses to the data memory. ByoRISCs incorporate a true multi-port register file, zero-overhead custom instruction decoding, and scalable data forwarding mechanisms. Given these design decisions, ByoRISCs provide a unique combination of features that allow their use as architectural testbeds and the seamless and rapid development of new high-performance ASIPs. The performance characteristics of ByoRISCs, implemented as vendor-independent cores, have been evaluated for both ASIC and FPGA implementations, and it is shown that they provide a viable solution for FPGA-based system-on-a-chip design. A case study of an image processing pipeline is also presented to highlight the process of utilizing a ByoRISC custom processor. A peak performance speedup of up to 8.5× can be observed, whereas an average performance speedup of 4.4× on Xilinx Virtex-4 targets is achieved. In addition, ByoRISC outperforms an experimental VLIW architecture named VEX, even in its 16-wide configuration, for a number of data-intensive application kernels. Comment: 12 pages, 14 figures, 7 tables. Unpublished paper on ByoRISC, an extensible RISC with MIMO CIs that can outperform most mid-range VLIWs. Unfortunately Prof. Jorg Henkel destroyed the potential of this submission by using immoral tactics (neglecting his conflict of interest, changing reviewers accepting the paper, and requesting additions impossible within the average lifetime of an Earthling).
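
    As an illustration of the kind of multi-input multi-output (MIMO) custom instruction an ASHE could implement, consider the hypothetical kernel below (not taken from the paper): four register inputs and two register outputs, which a conventional 2-read/1-write ISA cannot express as a single instruction but a multi-port register file can feed directly.

        /* Hypothetical MIMO custom-instruction candidate: a complex multiplication
         * with 4 register inputs and 2 register outputs. In a ByoRISC-style flow
         * the whole body could be collapsed into one application-specific hardware
         * extension (ASHE) fed by the multi-port register file. */
        #include <stdint.h>

        static inline void cmul(int32_t ar, int32_t ai, int32_t br, int32_t bi,
                                int32_t *out_r, int32_t *out_i) {
            *out_r = ar * br - ai * bi;   /* real part      */
            *out_i = ar * bi + ai * br;   /* imaginary part */
        }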

    Coarse-grained reconfigurable array architectures

    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops and execute them more efficiently. This chapter discusses the basic principles of CGRAs and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support, and for the manual fine-tuning of source code.
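
    A generic example of the kind of compute-dense inner loop a CGRA is meant to accelerate (not a kernel from the chapter) is sketched below; the loop body is what gets software-pipelined and mapped onto the array of functional units, while the surrounding control code stays on a conventional core.

        /* N-tap FIR filter in Q15 fixed point: a typical CGRA target loop. */
        void fir(const short *x, const short *c, short *y, int n, int taps) {
            for (int i = 0; i + taps <= n; ++i) {      /* loop mapped onto the CGRA  */
                int acc = 0;
                for (int t = 0; t < taps; ++t)         /* inner product spread over  */
                    acc += x[i + t] * c[t];            /* the array's functional units */
                y[i] = (short)(acc >> 15);
            }
        }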

    GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs

    In recent years, architectures combining a reconfigurable fabric and a general-purpose processor on a single chip have become increasingly popular. Such hybrid architectures allow extending embedded software with application-specific hardware accelerators to improve performance and/or energy efficiency. Aiding system designers and programmers in handling the complexity of the required hardware/software (HW/SW) partitioning process is an important issue. Current methods are often restricted to bare-metal systems or to subsets of mainstream programming languages, or they require special coding guidelines, e.g., via annotations. These restrictions still represent a high entry barrier for the wider community of programmers that new hybrid architectures are intended for. In this paper we revisit HW/SW partitioning and present a seamless programming flow for unrestricted, legacy C code. It consists of a retargetable GCC plugin that automatically identifies code sections for hardware acceleration and generates code accordingly. The proposed workflow was evaluated on the Xilinx Zynq platform using unmodified code from an embedded benchmark suite. Comment: Presented at Second International Workshop on FPGAs for Software Programmers (FSP 2015) (arXiv:1508.06320)
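
    A conceptual C sketch of what the result of such automated HW/SW partitioning can look like is given below; accel_available, accel_sad_16x16, and the kernel itself are invented placeholders, not the plugin's actual generated code.

        /* Original legacy kernel, kept unchanged as the software fallback. */
        void sad_16x16_sw(const unsigned char *a, const unsigned char *b, unsigned *result) {
            unsigned s = 0;
            for (int i = 0; i < 256; ++i)
                s += (unsigned)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
            *result = s;
        }

        /* Generated accelerator stub and availability check (placeholders). */
        extern int  accel_available(void);
        extern void accel_sad_16x16(const unsigned char *a, const unsigned char *b,
                                    unsigned *result);

        /* Call site redirected by the partitioning flow: offload when the FPGA
         * bitstream is loaded, otherwise run the unmodified software path. */
        void sad_16x16(const unsigned char *a, const unsigned char *b, unsigned *result) {
            if (accel_available())
                accel_sad_16x16(a, b, result);
            else
                sad_16x16_sw(a, b, result);
        }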

    Generating and evaluating application-specific hardware extensions

    Modern platform-based design involves the application-specific extension of embedded processors to fit customer requirements. To accomplish this task, the possibilities offered by recent custom/extensible processors for tuning their instruction set and microarchitecture to the applications of interest have to be exploited. A significant factor often determining the success of this process is the automation available in application analysis and custom instruction generation. In this paper we present YARDstick, a design automation tool for custom processor development flows that focuses on generating and evaluating application-specific hardware extensions. YARDstick is a building block for ASIP development, integrating application analysis, custom instruction generation and selection with user-defined compiler intermediate representations. In a YARDstick-enabled environment, practical issues in traditional ASIP design are confronted efficiently; the exploration infrastructure is liberated from compiler and simulator idiosyncrasies, since the ASIP designer is empowered with the freedom of specifying the target architectures of choice and adding new implementations of analyses and custom instruction generation/selection methods. To illustrate the capabilities of the YARDstick approach, we present interesting exploration scenarios: quantifying the effect of machine-dependent compiler optimizations and of the selection of the target architecture, in terms of operation set and memory model, on custom instruction generation/selection under different input/output constraints. Comment: 11 pages, 15 figures, 5 tables. An unpublished journal paper presenting the YARDstick custom instruction generation environment.
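
    To make the input/output-constraint trade-off concrete, the hypothetical fragment below (not from the paper) shows a dataflow subgraph whose selection as a single custom instruction depends on the register-file port budget.

        /* Candidate subgraph for custom-instruction selection: 4 inputs, 1 output,
         * 3 primitive operations. Under a 4-input/1-output constraint the whole
         * expression can become one custom instruction; under a 2-input/1-output
         * constraint no multi-operation pattern covering it is legal, so it stays
         * as three base instructions. */
        #include <stdint.h>

        int32_t candidate(int32_t a, int32_t b, int32_t c, int32_t d) {
            return (a + b) * c + d;
        }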

    Fingerprint Match in Box

    We open source Fingerprint Match in Box, a complete end-to-end fingerprint recognition system embedded within a 4-inch cube. Match in Box stands in contrast to a typical bulky and expensive proprietary fingerprint recognition system, which requires sending a fingerprint image to an external host for processing and subsequent spoof detection and matching. In particular, Match in Box is a first-of-its-kind, portable, low-cost, and easy-to-assemble fingerprint reader with an enrollment database embedded within the reader's memory and an open-source fingerprint spoof detector, feature extractor, and matcher all running on the reader's internal vision processing unit (VPU). An onboard touch screen and rechargeable battery pack make this device extremely portable and ideal for applying both fingerprint authentication (1:1 comparison) and fingerprint identification (1:N search) to applications (vaccination tracking, food and benefit distribution programs, human trafficking prevention) in rural communities, especially in developing countries. We also show that Match in Box is suited for capturing neonate fingerprints due to its high-resolution (1900 ppi) cameras.
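
    For readers unfamiliar with the 1:1 vs. 1:N terminology, the sketch below illustrates the difference; matcher_score and the threshold are generic placeholders, not the system's actual open-source matcher.

        #include <stddef.h>

        /* Placeholder similarity score between a probe and an enrolled template. */
        float matcher_score(const void *probe, const void *gallery_template);

        /* Authentication (1:1): accept if the claimed identity's template matches. */
        int authenticate(const void *probe, const void *claimed_template, float thr) {
            return matcher_score(probe, claimed_template) >= thr;
        }

        /* Identification (1:N): return the index of the best-scoring enrollee above
         * the threshold, or -1 if nobody in the on-device database matches. */
        int identify(const void *probe, const void *const *db, size_t n, float thr) {
            int best = -1;
            float best_s = thr;
            for (size_t i = 0; i < n; ++i) {
                float s = matcher_score(probe, db[i]);
                if (s >= best_s) { best_s = s; best = (int)i; }
            }
            return best;
        }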

    A High-level EDA Environment for the Automatic Insertion of HD-BIST Structures

    This paper presents a high-level EDA environment based on Hierarchical Distributed BIST (HD-BIST), a flexible and reusable approach for solving BIST scheduling issues in System-on-Chip applications. HD-BIST allows activating and controlling different BISTed blocks at different levels of the hierarchy, with minimal overhead in terms of area and test time. Besides the hardware layer, the authors present the HD-BIST application layer, where a simple modeling language and a prototypical EDA tool demonstrate the effectiveness of automating HD-BIST insertion in the test strategy definition of a complex System-on-Chip.

    ThreadPoolComposer - An Open-Source FPGA Toolchain for Software Developers

    This extended abstract presents ThreadPoolComposer, a high-level synthesis-based development framework and meta-toolchain that provides a uniform programming interface for FPGAs, portable across multiple platforms. Comment: Presented at Second International Workshop on FPGAs for Software Programmers (FSP 2015) (arXiv:1508.06320)

    Efficient Realization of Givens Rotation through Algorithm-Architecture Co-design for Acceleration of QR Factorization

    We present an efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better performance in terms of Gflops/watt over state-of-the-art realizations on multicore processors and General Purpose Graphics Processing Units (GPGPUs). GGR is an improvement over the classical Givens Rotation (GR) operation that can annihilate multiple elements of rows and columns of an input matrix simultaneously. GGR requires 33% fewer multiplications than GR. For custom implementation of GGR, we identify macro operations in GGR and realize them on a Reconfigurable Data-path (RDP) tightly coupled to the pipeline of a Processing Element (PE). In the PE, GGR attains a speed-up of 1.1x over the Modified Householder Transform (MHT) presented in the literature. For parallel realization of GGR, we use REDEFINE, a scalable massively parallel Coarse-grained Reconfigurable Architecture, and show that the speed-up attained is commensurate with the hardware resources in REDEFINE. GGR also outperforms General Matrix Multiplication (gemm) by 10% in terms of Gflops/watt, which is counter-intuitive.
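
    For context, the classical Givens rotation that GGR generalizes is sketched below in plain C; this is a textbook formulation, not the RDP/REDEFINE implementation, and GGR differs in that a single generalized rotation can annihilate several sub-diagonal entries at once.

        #include <math.h>

        /* Choose c and s so that the rotation zeroes xj while mapping xi to r. */
        void givens(double xi, double xj, double *c, double *s) {
            if (xj == 0.0) { *c = 1.0; *s = 0.0; return; }
            double r = hypot(xi, xj);   /* sqrt(xi^2 + xj^2), overflow-safe        */
            *c = xi / r;
            *s = xj / r;                /* [c s; -s c] * [xi; xj] = [r; 0]          */
        }

        /* Apply the rotation to rows i and j of an m x n row-major matrix A;
         * repeating this down a column annihilates its sub-diagonal entries
         * during QR factorization. */
        void apply_givens(double *A, int n, int i, int j, double c, double s) {
            for (int k = 0; k < n; ++k) {
                double ai = A[i * n + k], aj = A[j * n + k];
                A[i * n + k] =  c * ai + s * aj;
                A[j * n + k] = -s * ai + c * aj;
            }
        }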