
    FPGA-based module for SURF extraction

    We present a complete hardware and software solution for an FPGA-based embedded computer vision module capable of running the SURF image feature extraction algorithm. Aside from image analysis, the module embeds a Linux distribution that allows it to run programs tailored to particular applications. The module is based on a Virtex-5 FXT FPGA, which features powerful configurable logic and an embedded PowerPC processor. We describe the module hardware as well as the custom FPGA image processing cores that implement the algorithm's most computationally expensive process, interest point detection. The module's overall performance is evaluated and compared to CPU- and GPU-based solutions. Results show that the embedded module achieves distinctiveness comparable to the SURF software implementation running on a standard CPU while being faster and consuming significantly less power and space. It thus makes the SURF algorithm usable in applications with power and space constraints, such as autonomous navigation of small mobile robots.
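The abstract above singles out interest point detection as SURF's most expensive step. In the published SURF algorithm that step relies on box filters evaluated over an integral image, which is what makes it attractive for hardware. The following is a minimal software sketch of that building block only, not the module's FPGA cores; the function names `integral_image` and `box_sum` are hypothetical and chosen for illustration.

```python
def integral_image(img):
    """Build a summed-area table with an extra zero row and column.

    ii[r][c] holds the sum of all pixels in img[0:r][0:c].
    """
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0  # running sum of the current row
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1][c0:c1] using four table lookups."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    table = integral_image(image)
    print(box_sum(table, 1, 1, 3, 3))  # 5 + 6 + 8 + 9 = 28
```

Because every box sum costs four lookups regardless of box size, filter responses at all scales take constant time per pixel, which is one reason the detection stage maps well onto pipelined logic.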

    NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features

    While the GPGPU paradigm is widely recognized as an effective approach to high performance computing, its adoption in low-latency, real-time systems is still in its early stages. Although GPUs typically show deterministic latency when executing computational kernels once data is available in their internal memories, assessing the real-time features of a standard GPGPU system requires careful characterization of all subsystems along the data stream path. The networking subsystem proves to be the most critical one in terms of both the absolute value and the fluctuations of its response latency. Our envisioned solution to this issue is NaNet, an FPGA-based PCIe Network Interface Card (NIC) design featuring a configurable and extensible set of network channels with direct access through GPUDirect to NVIDIA Fermi/Kepler GPU memories. The NaNet design currently supports both standard channels, GbE (1000BASE-T) and 10GbE (10GBASE-R), and custom channels, 34 Gbps APElink and 2.5 Gbps deterministic-latency KM3link, and its modularity allows for straightforward inclusion of other link technologies. To avoid host OS intervention on the data stream and remove a possible source of jitter, the design includes a network/transport layer offload module with cycle-accurate, upper-bounded latency, supporting UDP, KM3link Time Division Multiplexing and APElink protocols. After a description of the NaNet architecture and its latency/bandwidth characterization for all supported links, two real-world use cases are presented: the GPU-based low-level trigger for the RICH detector in the NA62 experiment at CERN and the on-/off-shore data link for the KM3 underwater neutrino telescope.

    Deep Learning-Based Multiple Object Visual Tracking on Embedded System for IoT and Mobile Edge Computing Applications

    The compute and memory demands of state-of-the-art deep learning methods are still a shortcoming that must be addressed to make them useful at IoT end-nodes. In particular, recent results depict a hopeful prospect for image processing using Convolutional Neural Networks (CNNs), but the gap between software and hardware implementations is already considerable for IoT and mobile edge computing applications due to their high power consumption. This proposal performs low-power, real-time, deep learning-based multiple object visual tracking on an NVIDIA Jetson TX2 development kit. The kit includes a camera and wireless connectivity, and it is battery powered for mobile and outdoor applications. A collection of representative sequences captured with the on-board camera, the dETRUSC video dataset, is used to exemplify the performance of the proposed algorithm and to facilitate benchmarking. The results in terms of power consumption and frame rate demonstrate the feasibility of deep learning algorithms on embedded platforms, although more effort on joint algorithm and hardware design of CNNs is needed. Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

    A general framework for efficient FPGA implementation of matrix product

    Original article can be found at: http://www.medjcn.com/ Copyright Softmotor Limited. High performance systems are required by developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Field-Programmable Gate Arrays (FPGAs) have been proposed as viable building blocks for constructing high performance systems at an economical price. Given the importance and widespread use of matrix algorithms in scientific computing applications, they seem ideal candidates to exploit the advantages offered by FPGAs. In this paper, a system for generating matrix algorithm cores is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, spanning three matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general purpose or an application-specific core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles, while the second is a parallel floating-point matrix multiplier designed for 3D affine transformations. Peer reviewed.
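Colour space conversion, as mentioned in the abstract above, reduces to multiplying each pixel by a fixed 3x3 matrix and adding an offset. The sketch below illustrates that matrix product in software. It is an assumption-laden example: the abstract does not say which colour spaces the core converts between, so the full-range RGB to YCbCr coefficients used by JFIF are assumed here, and the code uses plain multiply-accumulate rather than the distributed arithmetic the paper describes. The names `RGB_TO_YCBCR`, `OFFSET` and `convert_pixel` are hypothetical.

```python
# Assumed example: full-range RGB -> YCbCr (JFIF / BT.601 coefficients).
RGB_TO_YCBCR = [
    [0.299, 0.587, 0.114],          # Y
    [-0.168736, -0.331264, 0.5],    # Cb
    [0.5, -0.418688, -0.081312],    # Cr
]
OFFSET = [0.0, 128.0, 128.0]


def convert_pixel(rgb, matrix=RGB_TO_YCBCR, offset=OFFSET):
    """Return matrix * rgb + offset for one pixel (three dot products)."""
    return [
        sum(coeff * value for coeff, value in zip(row, rgb)) + off
        for row, off in zip(matrix, offset)
    ]


if __name__ == "__main__":
    for pixel in ([255, 255, 255], [0, 0, 0], [255, 0, 0]):
        y, cb, cr = convert_pixel(pixel)
        print(pixel, round(y, 2), round(cb, 2), round(cr, 2))
```

Since the matrix is constant, each output channel is a sum of three products with known coefficients, which is the property that lets a hardware design replace multipliers with precomputed lookup tables.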

    DeSyRe: on-Demand System Reliability

    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-free and fault-free system would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. To reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.