490 research outputs found

    Tinsel: a manythread overlay for FPGA clusters

    Get PDF
    Commodity FPGA boards with advanced networking facilities have great potential in the construction of high-performance compute clusters that scale. However, low-level design tools and long synthesis times are major barriers to productivity for application developers. In this paper, we explore the potential of a distributed soft-processor overlay, programmed in software at a high-level of abstraction, to deliver a useful level of performance for FPGA clusters. In particular, we demonstrate the use of hardware multhreading to achieve a fast, space-efficient, high-throughput overlay, and compare a 12-FPGA instance of it (12,288 RISC-V threads) against a conventional Xeon cluster on the problem of distributed graph processing.This work was supported by EPSRC grant EP/N031768/1 (POETS project)

    The Impact of the Multi-core Revolution on Signal Processing

    Get PDF
    This paper analyzes the influence of new multi- core and many-core architectures on Signal Processing. The article covers both the architectural design and the programming models of current general-purpose multi-core processors and graphics processors (GPU), with the goal of identifying their possibilities and impact on signal processing applications

    H-SIMD machine : configurable parallel computing for data-intensive applications

    Get PDF
    This dissertation presents a hierarchical single-instruction multiple-data (H-SLMD) configurable computing architecture to facilitate the efficient execution of data-intensive applications on field-programmable gate arrays (FPGAs). H-SIMD targets data-intensive applications for FPGA-based system designs. The H-SIMD machine is associated with a hierarchical instruction set architecture (HISA) which is developed for each application. The main objectives of this work are to facilitate ease of program development and high performance through ease of scheduling operations and overlapping communications with computations. The H-SIMD machine is composed of the host, FPGA and nano-processor layers. They execute host SIMD instructions (HSIs), FPGA SIMD instructions (FSIs) and nano-processor instructions (NPLs), respectively. A distinction between communication and computation instructions is intended for all the HISA layers. The H-SIMD machine also employs a memory switching scheme to bridge the omnipresent large bandwidth gaps in configurable systems. To showcase the proposed high-performance approach, the conditions to fully overlap communications with computations are investigated for important applications. The building blocks in the H-SLMD machine, such as high-performance and area-efficient register files, are presented in detail. The H-SLMD machine hierarchy is implemented on a host Dell workstation and the Annapolis Wildstar II FPGA board. Significant speedups have been achieved for matrix multiplication (MM), 2-dimensional discrete cosine transform (2D DCT) and 2-dimensional fast Fourier transform (2D FFT) which are used widely in science and engineering. In another FPGA-based programming paradigm, a high-level language (here ANSI C) can be used to program the FPGAs in a mode similar to that of the H-SIMD machine in terms of trying to minimize the effect of overheads. More specifically, a multi-threaded overlapping scheme is proposed to reduce as much as possible, or even completely hide, runtime FPGA reconfiguration overheads. Nevertheless, although the HLL-enabled reconfigurable machine allows software developers to customize FPGA functions easily, special architecture techniques are needed to achieve high-performance without significant penalty on area and clock frequency. Two important high-performance applications, matrix multiplication and image edge detection, are tested on the SRC-6 reconfigurable machine. The implemented algorithms are able to exploit the available data parallelism with independent functional units and application-specific cache support. Relevant performance and design tradeoffs are analyzed

    MLPerf Inference Benchmark

    Full text link
    Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.Comment: ISCA 202

    Hardware accelerated real-time Linux video anonymizer

    Get PDF
    Dissertação de mestrado em Engenharia Eletrónica Industrial e ComputadoresOs Sistemas Embebidos estão presentes atualmente numa variada gama de equipamentos do quotidiano do ser humano. Desde TV-boxes, televisões, routers até ao indispensável telemóvel. O Sistema Operativo Linux, com a sua filosofia de distribuição ”one-size-fits-all” tornou-se uma alternativa viável, fornecendo um vasto suporte de hardware, técnicas de depuração, suporte dos protocolos de comunicação de rede, entre outros serviços, que se tornaram no conjunto standard de requisitos na maioria dos sistemas embebidos atuais. Este sistema operativo torna-se apelativo pela sua filosofia open-source que disponibiliza ao utilizador um vasto conjunto de bibliotecas de software que possibilitam o desenvolvimento num determinado domínio com maior celeridade e facilidade de integração de software complexo. Os algoritmos deMachine Learning são desenvolvidos para a automização de tarefas e estão presentes nas mais variadas tecnologias, desde o sistema de foco de imagem nosmartphone até ao sistema de deteção dos limites de faixa de rodagem de um sistema de condução autónoma. Estes são algoritmos que quando compilados para as plataformas de sistemas embebidos, resultam num esforço de processamento e de consumo de recursos, como o footprint de memória, que na maior parte dos casos supera em larga escala o conjunto de recursos disponíveis para a aplicação do sistema, sendo necessária a implementação de componentes que requerem maior poder de processamento através de elementos de hardware para garantir que as métricas tem porais sejam satisfeitas. Esta dissertação propõe-se, por isso, à criação de um sistema de anonimização de vídeo que adquire, processa e manipula as frames, com o intuito de garantir o anonimato, mesmo na transmissão. A sua implementação inclui técnicas de Deteção de Objectos, fazendo uso da combinação das tecnologias de aceleração por hardware: paralelização e execução em hardware especial izado. É proposta então uma implementação restringida tanto temporalmente como no consumo de recursos ao nível do hardware e software.Embedded Systems are currently present in a wide range of everyday equipment. From TV-boxes, televisions and routers to the indispensable smartphone. Linux Operating System, with its ”one-size-fits-all” distribution philosophy, has become a viable alternative, providing extensive support for hardware, debugging techniques, network com munication protocols, among other functionalities, which have become the standard set of re quirements in most modern embedded systems. This operating system is appealing due to its open-source philosophy, which provides the user with a vast set of software libraries that enable development in a given domain with greater speed and ease the integration of complex software. Machine Learning algorithms are developed to execute tasks autonomously, i.e., without human supervision, and are present in the most varied technologies, from the image focus system on the smartphone to the detection system of the lane limits of an autonomous driving system. These are algorithms that, when compiled for embedded systems platforms, require an ef fort to process and consume resources, such as the memory footprint, which in most cases far outweighs the set of resources available for the application of the system, requiring the imple mentation of components that need greater processing power through elements of hardware to ensure that the time metrics are satisfied. This dissertation proposes the creation of a video anonymization system that acquires, pro cesses, and manipulates the frames, in order to guarantee anonymity, even during the transmis sion. Its implementation includes Object Detection techniques, making use of the combination of hardware acceleration technologies: parallelization and execution in specialized hardware. An implementation is then proposed, restricted both in time and in resource consumption at hardware and software levels

    Performance Aspects of Synthesizable Computing Systems

    Get PDF

    Image Processing Using FPGAs

    Get PDF
    This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state-of-the-art on image processing using FPGAs

    ControlPULP: A RISC-V On-Chip Parallel Power Controller for Many-Core HPC Processors with FPGA-Based Hardware-In-The-Loop Power and Thermal Emulation

    Full text link
    High-Performance Computing (HPC) processors are nowadays integrated Cyber-Physical Systems demanding complex and high-bandwidth closed-loop power and thermal control strategies. To efficiently satisfy real-time multi-input multi-output (MIMO) optimal power requirements, high-end processors integrate an on-die power controller system (PCS). While traditional PCSs are based on a simple microcontroller (MCU)-class core, more scalable and flexible PCS architectures are required to support advanced MIMO control algorithms for managing the ever-increasing number of cores, power states, and process, voltage, and temperature variability. This paper presents ControlPULP, an open-source, HW/SW RISC-V parallel PCS platform consisting of a single-core MCU with fast interrupt handling coupled with a scalable multi-core programmable cluster accelerator and a specialized DMA engine for the parallel acceleration of real-time power management policies. ControlPULP relies on FreeRTOS to schedule a reactive power control firmware (PCF) application layer. We demonstrate ControlPULP in a power management use-case targeting a next-generation 72-core HPC processor. We first show that the multi-core cluster accelerates the PCF, achieving 4.9x speedup compared to single-core execution, enabling more advanced power management algorithms within the control hyper-period at a shallow area overhead, about 0.1% the area of a modern HPC CPU die. We then assess the PCS and PCF by designing an FPGA-based, closed-loop emulation framework that leverages the heterogeneous SoCs paradigm, achieving DVFS tracking with a mean deviation within 3% the plant's thermal design power (TDP) against a software-equivalent model-in-the-loop approach. Finally, we show that the proposed PCF compares favorably with an industry-grade control algorithm under computational-intensive workloads.Comment: 33 pages, 11 figure

    Low-Power and Error-Resilient VLSI Circuits and Systems.

    Full text link
    Efficient low-power operation is critically important for the success of the next-generation signal processing applications. Device and supply voltage have been continuously scaled to meet a more constrained power envelope, but scaling has created resiliency challenges, including increasing timing faults and soft errors. Our research aims at designing low-power and robust circuits and systems for signal processing by drawing circuit, architecture, and algorithm approaches. To gain an insight into the system faults due to supply voltage reduction, we researched the two primary effects that determine the minimum supply voltage (VMIN) in Intel’s tri-gate CMOS technology, namely process variations and gate-dielectric soft breakdown. We determined that voltage scaling increases the timing window that sequential circuits are vulnerable. Thus, we proposed a new hold-time violation metric to define hold-time VMIN, which has been adopted as a new design standard. Device scaling increases soft errors which affect circuit reliability. Through extensive soft error characterization using two 65nm CMOS test chips, we studied the soft error mechanisms and its dependence on supply voltage and clock frequency. This study laid the foundation of the first 65nm DSP chip design for a NASA spaceflight project. To mitigate such random errors, we proposed a new confidence-driven architecture that effectively enhances the error resiliency of deeply scaled CMOS and post-CMOS circuits. Designing low-power resilient systems can effectively leverage application-specific algorithmic approaches. To explore design opportunities in the algorithmic domain, we demonstrate an application-specific detection and decoding processor for multiple-input multiple-output (MIMO) wireless communication. To enhance the receive error rate for a robust wireless communication, we designed a joint detection and decoding technique by enclosing detection and decoding in an iterative loop to enhance both interference cancellation and error reduction. A proof-of-concept chip design was fabricated for the next-generation 4x4 256QAM MIMO systems. Through algorithm-architecture optimizations and low-power circuit techniques, our design achieves significant improvements in throughput, energy efficiency and error rate, paving the way for future developments in this area.PhDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/110323/1/uchchen_1.pd
    corecore