The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture
One of the most demanding challenges for the designers of parallel computing
architectures is to deliver an efficient network infrastructure providing
low-latency, high-bandwidth communication while preserving scalability. Besides
off-chip communications between processors, recent multi-tile (i.e. multi-core)
architectures face the challenge of providing an efficient on-chip
interconnection network between processor tiles. In this paper, we present a configurable and
scalable architecture, based on our Distributed Network Processor (DNP) IP
Library, targeting systems ranging from single MPSoCs to massive HPC platforms.
The DNP provides inter-tile services for both on-chip and off-chip
communications with a uniform RDMA-style API, over a multi-dimensional direct
network with a (possibly) hybrid topology. Comment: 8 pages, 11 figures, submitted to Hot Interconnect 200
Fabric defect detection using the wavelet transform in an ARM processor
Small devices used in our daily lives are built on powerful architectures that can serve industrial applications requiring portability and communication facilities. In this paper we present an example of the use of an embedded system, the Zeus epic 520 single-board computer, for defect detection in textiles using image processing. We implement the Haar wavelet transform using the embedded Visual C++ 4.0 compiler for Windows CE 5. The algorithm was tested for defect detection using images of fabrics with five types of defects. An average of 95% correct defect detection was obtained, achieving performance similar to that of processors with floating-point arithmetic.
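The screening idea in the abstract above (the detail bands of a Haar decomposition light up where the weave pattern breaks) can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the single-level transform, the mean-absolute-energy score, and the toy 4x4 images are all assumptions introduced here.

```python
# Sketch: single-level 2-D Haar transform and a detail-energy defect score.
# Toy data and thresholds are illustrative, not the paper's values.

def haar2d(img):
    """One level of the 2-D Haar transform on a list-of-lists image
    with even dimensions. Returns the (LL, LH, HL, HH) sub-bands."""
    rows, cols = len(img), len(img[0])
    # Row pass: averages / differences of horizontal pixel pairs.
    lo = [[(r[2*j] + r[2*j+1]) / 2 for j in range(cols // 2)] for r in img]
    hi = [[(r[2*j] - r[2*j+1]) / 2 for j in range(cols // 2)] for r in img]
    def col_pass(m):
        avg = [[(m[2*i][j] + m[2*i+1][j]) / 2 for j in range(len(m[0]))]
               for i in range(rows // 2)]
        dif = [[(m[2*i][j] - m[2*i+1][j]) / 2 for j in range(len(m[0]))]
               for i in range(rows // 2)]
        return avg, dif
    LL, LH = col_pass(lo)
    HL, HH = col_pass(hi)
    return LL, LH, HL, HH

def defect_score(img):
    """Mean absolute detail energy: uniform weave -> low, defects -> high."""
    _, LH, HL, HH = haar2d(img)
    cells = [abs(v) for band in (LH, HL, HH) for row in band for v in row]
    return sum(cells) / len(cells)

# Toy 4x4 "fabric": uniform texture vs. one with a single bright defect.
plain  = [[10, 10, 10, 10]] * 4
broken = [[10, 10, 10, 10],
          [10, 90, 10, 10],
          [10, 10, 10, 10],
          [10, 10, 10, 10]]
print(defect_score(plain), defect_score(broken))  # prints: 0.0 5.0
```

Thresholding this score against a value calibrated on defect-free samples yields a simple pass/fail classifier, which is plausible on a fixed-point ARM target since the Haar filters need only additions and halvings.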
Design space exploration tools for the ByoRISC configurable processor family
In this paper, the ByoRISC (Build your own RISC) configurable
application-specific instruction-set processor (ASIP) family is presented.
ByoRISCs, as vendor-independent cores, provide extensive architectural
parameters over a baseline processor, which can be customized by
application-specific hardware extensions (ASHEs). Such extensions realize
multi-input multi-output (MIMO) custom instructions with local state and
load/store accesses to the data memory. ByoRISCs incorporate a true multi-port
register file, zero-overhead custom instruction decoding, and scalable data
forwarding mechanisms. Given these design decisions, ByoRISCs provide a unique
combination of features that allow their use as architectural testbeds and the
seamless and rapid development of new high-performance ASIPs.
The performance characteristics of ByoRISCs, implemented as
vendor-independent cores, have been evaluated for both ASIC and FPGA
implementations, and it is shown that they provide a viable solution for
FPGA-based system-on-a-chip design. A case study of an image processing
pipeline is also presented to highlight the process of utilizing a ByoRISC
custom processor. A peak performance speedup of up to 8.5 can be
observed, whereas an average performance speedup of 4.4 on Xilinx
Virtex-4 targets is achieved. In addition, ByoRISC outperforms an experimental
VLIW architecture named VEX even in its 16-wide configuration for a number of
data-intensive application kernels. Comment: 12 pages, 14 figures, 7 tables. Unpublished paper on ByoRISC, an
extensible RISC with MIMO CIs that can outperform most mid-range VLIWs.
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code.
GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs
In recent years, architectures combining a reconfigurable fabric and a
general purpose processor on a single chip became increasingly popular. Such
hybrid architectures allow extending embedded software with application
specific hardware accelerators to improve performance and/or energy efficiency.
Aiding system designers and programmers at handling the complexity of the
required process of hardware/software (HW/SW) partitioning is an important
issue. Current methods are often restricted, either to bare-metal systems, to
subsets of mainstream programming languages, or require special coding
guidelines, e.g., via annotations. These restrictions still represent a high
entry barrier for the wider community of programmers that new hybrid
architectures are intended for. In this paper we revisit HW/SW partitioning and
present a seamless programming flow for unrestricted, legacy C code. It
consists of a retargetable GCC plugin that automatically identifies code
sections for hardware acceleration and generates code accordingly. The proposed
workflow was evaluated on the Xilinx Zynq platform using unmodified code from
an embedded benchmark suite. Comment: Presented at Second International Workshop on FPGAs for Software
Programmers (FSP 2015) (arXiv:1508.06320).
Generating and evaluating application-specific hardware extensions
Modern platform-based design involves the application-specific extension of
embedded processors to fit customer requirements. To accomplish this task, the
possibilities offered by recent custom/extensible processors for tuning their
instruction set and microarchitecture to the applications of interest have to
be exploited. A significant factor often determining the success of this
process is the automation available in application analysis and custom
instruction generation.
In this paper we present YARDstick, a design automation tool for custom
processor development flows that focuses on generating and evaluating
application-specific hardware extensions. YARDstick is a building block for
ASIP development, integrating application analysis, custom instruction
generation and selection with user-defined compiler intermediate
representations. In a YARDstick-enabled environment, practical issues in
traditional ASIP design are confronted efficiently; the exploration
infrastructure is liberated from compiler and simulator idiosyncrasies, since
the ASIP designer is empowered with the freedom of specifying the target
architectures of choice and adding new implementations of analyses and custom
instruction generation/selection methods. To illustrate the capabilities of the
YARDstick approach, we present interesting exploration scenarios: quantifying
the effect of machine-dependent compiler optimizations and the selection of the
target architecture in terms of operation set and memory model on custom
instruction generation/selection under different input/output constraints. Comment: 11 pages, 15 figures, 5 tables. An unpublished journal paper
presenting the YARDstick custom instruction generation environment.
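One core step the abstract describes, enumerating custom-instruction candidates under input/output constraints, can be sketched on a toy data-flow graph. Everything below (the DFG, the 4-input/2-output port limits, the exhaustive enumeration with convexity and connectivity checks omitted) is a hypothetical illustration, not YARDstick's actual algorithm.

```python
# Sketch: candidate custom-instruction enumeration in a basic block's
# data-flow graph (DFG) under I/O port constraints. Toy example only.

from itertools import combinations

# Toy DFG: node -> list of operand sources (names not in the dict are
# external inputs to the block).
dfg = {
    "t1": ["a", "b"],     # t1 = a + b
    "t2": ["t1", "c"],    # t2 = t1 * c
    "t3": ["t1", "d"],    # t3 = t1 - d
    "t4": ["t2", "t3"],   # t4 = t2 ^ t3
}

def io_counts(nodes):
    """Inputs: values the candidate consumes from outside itself.
    Outputs: candidate values used outside it, or final block results."""
    nodes = set(nodes)
    ins = {src for n in nodes for src in dfg[n] if src not in nodes}
    outs = {n for n in nodes
            if any(n in dfg[m] for m in dfg if m not in nodes)
            or not any(n in dfg[m] for m in dfg)}
    return len(ins), len(outs)

def candidates(max_in=4, max_out=2):
    """All operation subsets meeting the port constraints (convexity and
    connectivity checks omitted for brevity)."""
    ops = list(dfg)
    found = []
    for k in range(1, len(ops) + 1):
        for sub in combinations(ops, k):
            i, o = io_counts(sub)
            if i <= max_in and o <= max_out:
                found.append(set(sub))
    return found

# Under a 4-in/2-out constraint the whole block {t1,t2,t3,t4} qualifies
# (inputs {a,b,c,d}, single output t4), i.e. one MIMO custom instruction.
```

A real flow would then rank surviving candidates by estimated cycle savings versus added hardware cost before selection, which is the generation/selection split the abstract refers to.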
Fingerprint Match in Box
We open source fingerprint Match in Box, a complete end-to-end fingerprint
recognition system embedded within a 4 inch cube. Match in Box stands in
contrast to a typical bulky and expensive proprietary fingerprint recognition
system which requires sending a fingerprint image to an external host for
processing and subsequent spoof detection and matching. In particular, Match in
Box is a first-of-its-kind, portable, low-cost, and easy-to-assemble fingerprint
reader with an enrollment database embedded within the reader's memory and open
source fingerprint spoof detector, feature extractor, and matcher all running
on the reader's internal vision processing unit (VPU). An onboard touch screen
and rechargeable battery pack make this device extremely portable and ideal for
applying both fingerprint authentication (1:1 comparison) and fingerprint
identification (1:N search) to applications (vaccination tracking, food and
benefit distribution programs, human trafficking prevention) in rural
communities, especially in developing countries. We also show that Match in Box
is suited for capturing neonate fingerprints due to its high-resolution (1900 ppi) cameras.
A High-level EDA Environment for the Automatic Insertion of HD-BIST Structures
This paper presents a high-level EDA environment based on Hierarchical Distributed BIST (HD-BIST), a flexible and reusable approach to solving BIST scheduling issues in System-on-Chip applications. HD-BIST allows activating and controlling different BISTed blocks at different levels of the hierarchy, with minimal overhead in terms of area and test time. Besides the hardware layer, the authors present the HD-BIST application layer, where a simple modeling language and a prototypical EDA tool demonstrate the effectiveness of automating HD-BIST insertion in the test strategy definition of a complex System-on-Chip.
ThreadPoolComposer - An Open-Source FPGA Toolchain for Software Developers
This extended abstract presents ThreadPoolComposer, a high-level
synthesis-based development framework and meta-toolchain that provides a
uniform programming interface for FPGAs portable across multiple platforms. Comment: Presented at Second International Workshop on FPGAs for Software
Programmers (FSP 2015) (arXiv:1508.06320).
Efficient Realization of Givens Rotation through Algorithm-Architecture Co-design for Acceleration of QR Factorization
We present efficient realization of Generalized Givens Rotation (GGR) based
QR factorization that achieves 3-100x better performance in terms of
Gflops/watt over state-of-the-art realizations on multicore, and General
Purpose Graphics Processing Units (GPGPUs). GGR is an improvement over
classical Givens Rotation (GR) operation that can annihilate multiple elements
of rows and columns of an input matrix simultaneously. GGR requires 33% fewer
multiplications than GR. For a custom implementation of GGR, we identify
macro operations in GGR and realize them on a Reconfigurable Data-path (RDP)
tightly coupled to pipeline of a Processing Element (PE). In PE, GGR attains
speed-up of 1.1x over Modified Householder Transform (MHT) presented in the
literature. For parallel realization of GGR, we use REDEFINE, a scalable
massively parallel Coarse-grained Reconfigurable Architecture, and show that
the speed-up attained is commensurate with the hardware resources in REDEFINE.
GGR also outperforms General Matrix Multiplication (gemm) by 10% in terms of
Gflops/watt, which is counter-intuitive.
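As background for the abstract above, classical Givens rotation QR, the baseline that GGR generalizes by annihilating multiple elements at once, can be sketched as follows. This is an illustrative pure-Python sketch of the textbook algorithm, not the paper's GGR or its REDEFINE mapping.

```python
# Sketch: QR factorization via classical Givens rotations. Each rotation
# zeroes one below-diagonal element; GGR's improvement is annihilating
# several such elements per step.

import math

def givens(a, b):
    """Return (c, s) so that [c s; -s c] @ [a; b] = [r; 0]."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def qr_givens(A):
    """Factor A (list of lists, m >= n) as A = Q @ R; returns (Q, R)."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for j in range(n):
        for i in range(m - 1, j, -1):       # zero entries below R[j][j]
            c, s = givens(R[i-1][j], R[i][j])
            for k in range(n):              # rotate rows i-1, i of R
                t = c * R[i-1][k] + s * R[i][k]
                R[i][k] = -s * R[i-1][k] + c * R[i][k]
                R[i-1][k] = t
            for k in range(m):              # accumulate Q = Q @ G^T
                t = c * Q[k][i-1] + s * Q[k][i]
                Q[k][i] = -s * Q[k][i-1] + c * Q[k][i]
                Q[k][i-1] = t
    return Q, R
```

Each rotation touches only two rows, which is why Givens-based QR maps well to pipelined custom datapaths such as the RDP-coupled PE described in the abstract; a generalized rotation amortizes the row updates across several annihilated elements.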