1,633 research outputs found

    Hardware acceleration of the trace transform for vision applications

    Get PDF
    Computer Vision is a rapidly developing field in which machines process visual data to extract meaningful information. Digitised images in their pixels and bits serve no purpose of their own. It is only by interpreting the data, and extracting higher level information that a scene can be understood. The algorithms that enable this process are often complex, and data-intensive, limiting the processing rate when implemented in software. Hardware-accelerated implementations provide a significant performance boost that can enable real- time processing. The Trace Transform is a newly proposed algorithm that has been proven effective in image categorisation and recognition tasks. It is flexibly defined allowing the mathematical details to be tailored to the target application. However, it is highly computationally intensive, which limits its applications. Modern heterogeneous FPGAs provide an ideal platform for accelerating the Trace transform for real-time performance, while also allowing an element of flexibility, which highly suits the generality of the Trace transform. This thesis details the implementation of an extensible Trace transform architecture for vision applications, before extending this architecture to a full flexible platform suited to the exploration of Trace transform applications. As part of the work presented, a general set of architectures for large-windowed median and weighted median filters are presented as required for a number of Trace transform implementations. Finally an acceleration of Pseudo 2-Dimensional Hidden Markov Model decoding, usable in a person detection system, is presented. Such a system can be used to extract frames of interest from a video sequence, to be subsequently processed by the Trace transform. All these architectures emphasise the need for considered, platform-driven design in achieving maximum performance through hardware acceleration

    Dataplane Specialization for High-performance OpenFlow Software Switching

    Get PDF
    OpenFlow is an amazingly expressive dataplane program- ming language, but this expressiveness comes at a severe performance price as switches must do excessive packet clas- sification in the fast path. The prevalent OpenFlow software switch architecture is therefore built on flow caching, but this imposes intricate limitations on the workloads that can be supported efficiently and may even open the door to mali- cious cache overflow attacks. In this paper we argue that in- stead of enforcing the same universal flow cache semantics to all OpenFlow applications and optimize for the common case, a switch should rather automatically specialize its dat- aplane piecemeal with respect to the configured workload. We introduce ES WITCH , a novel switch architecture that uses on-the-fly template-based code generation to compile any OpenFlow pipeline into efficient machine code, which can then be readily used as fast path. We present a proof- of-concept prototype and we demonstrate on illustrative use cases that ES WITCH yields a simpler architecture, superior packet processing speed, improved latency and CPU scala- bility, and predictable performance. Our prototype can eas- ily scale beyond 100 Gbps on a single Intel blade even with complex OpenFlow pipelines

    Analog Content-Addressable Memory from Complementary FeFETs

    Full text link
    To address the increasing computational demands of artificial intelligence (AI) and big data, compute-in-memory (CIM) integrates memory and processing units into the same physical location, reducing the time and energy overhead of the system. Despite advancements in non-volatile memory (NVM) for matrix multiplication, other critical data-intensive operations, like parallel search, have been overlooked. Current parallel search architectures, namely content-addressable memory (CAM), often use binary, which restricts density and functionality. We present an analog CAM (ACAM) cell, built on two complementary ferroelectric field-effect transistors (FeFETs), that performs parallel search in the analog domain with over 40 distinct match windows. We then deploy it to calculate similarity between vectors, a building block in the following two machine learning problems. ACAM outperforms ternary CAM (TCAM) when applied to similarity search for few-shot learning on the Omniglot dataset, yielding projected simulation results with improved inference accuracy by 5%, 3x denser memory architecture, and more than 100x faster speed compared to central processing unit (CPU) and graphics processing unit (GPU) per similarity search on scaled CMOS nodes. We also demonstrate 1-step inference on a kernel regression model by combining non-linear kernel computation and matrix multiplication in ACAM, with simulation estimates indicating 1,000x faster inference than CPU and GPU

    High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP

    Get PDF
    This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM’s Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs

    FPGA implementation and performance comparison of a Bayesian face detection system

    Get PDF
    Face detection has primarily been a software-based effort. A hardware-based approach can provide significant speed-up over its software counterpart. Advances in transistor technology have made it possible to produce larger and faster FPGAs at more affordable prices. Through VHDL and synthesis tools it is possible to rapidly develop a hardware-based solution to face detection on an FPGA. This work analyzes and compares the performance of a feature-invariant face detection method implemented in software and an FPGA. The primary components of the face detector were a Bayesian classifier used to segment the image into skin and nonskin pixels, and a direct least square elliptical fitting technique to determine if the skin region\u27s shape has elliptical characteristics similar to a face. The C++ implementation was benchmarked on several high performance workstations, while the VHDL implementation was synthesized for FPGAs from several Xilinx product lines. The face detector used to compare software and hardware performance had a modest correct detection rate of 48.6% and a false alarm rate of 29.7%. The elliptical-shape of the region was determined to be an inaccurate approach for filtering out non-face skin regions. The software-based face detector was capable of detecting faces within images of approximately 378x567 pixels or less at 20 frames per second on Pentium 4 and Pentium D systems. The FPGA-based implementation was capable of faster detection speeds; a speedup of 3.33 was seen on a Spartan 3 and 4.52 on a Virtex 4. The comparison shows that an FPGA-based face detector could provide a significant increase in computational speed
    corecore