8 research outputs found

    A Scalable High-Bandwidth Architecture for Lossless Compression on FPGAs

    Data compression techniques have been the subject of intense study over the past several decades due to exponential increases in the quantity of data stored and transmitted by computer systems. Compression algorithms are traditionally forced to make tradeoffs between throughput and compression quality (the ratio of original file size to compressed file size). FPGAs represent a compelling substrate for streaming applications such as data compression thanks to their capacity for deep pipelines and custom caching solutions. Unfortunately, data hazards in compression algorithms such as LZ77 inhibit the creation of deep pipelines without sacrificing some amount of compression quality. In this work, we detail a scalable, fully pipelined FPGA accelerator that performs LZ77 compression and static Huffman encoding at rates up to 5.6 GB/s. Furthermore, we explore tradeoffs between compression quality and FPGA area that allow the same throughput at a fraction of the logic utilization in exchange for moderate reductions in compression quality. Compared to recent FPGA compression studies, our emphasis on scalability gives our accelerator a 3.0x advantage in resource utilization at equivalent throughput and compression ratio.
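    The serial dependence the abstract calls a data hazard is easiest to see in software: each LZ77 match decision depends on the history produced by every prior decision. Below is a minimal greedy sketch; the window size, minimum match length, and output format are illustrative choices, not taken from the paper.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative parameters; the paper's hardware uses its own window sizing. */
#define WINDOW  4096  /* bytes of history searched for matches */
#define MIN_LEN 3     /* shortest match worth emitting as (offset, length) */

/* Greedy LZ77. The hazard is visible in the loop structure: the decision
 * at position i depends on every byte already consumed, so decisions
 * cannot be naively overlapped in a deep pipeline. */
static void lz77_compress(const unsigned char *in, size_t n) {
    size_t i = 0;
    while (i < n) {
        size_t best_len = 0, best_off = 0;
        size_t start = (i > WINDOW) ? i - WINDOW : 0;
        for (size_t j = start; j < i; j++) {            /* search history */
            size_t len = 0;
            while (i + len < n && in[j + len] == in[i + len])
                len++;
            if (len > best_len) { best_len = len; best_off = i - j; }
        }
        if (best_len >= MIN_LEN) {
            printf("(%zu,%zu) ", best_off, best_len);   /* back-reference */
            i += best_len;      /* the next decision depends on this one */
        } else {
            printf("'%c' ", in[i]);                     /* literal byte  */
            i++;
        }
    }
}

int main(void) {
    const char *s = "abcabcabcabcxyz";
    lz77_compress((const unsigned char *)s, strlen(s));
    putchar('\n');  /* prints: 'a' 'b' 'c' (3,9) 'x' 'y' 'z' */
    return 0;
}
```

    A deep hardware pipeline must either resolve this dependence at every stage or relax it, for example by matching against a bounded window in parallel and accepting slightly worse matches, which is one way to trade compression quality for area and throughput in the spirit the abstract describes.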

    A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

    Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA). Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We describe a medium-scale deployment of this fabric on a bed of 1632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution or reduces tail latency by 29% at a fixed throughput. In other words, the reconfigurable fabric enables the same throughput using only half the number of servers.
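    The closing claim is simple arithmetic: if each server delivers 1.95x the baseline throughput, a fixed aggregate load needs 1/1.95, or roughly 51%, of the original server count. A sketch of that calculation follows; the load and per-server figures are invented for illustration, and only the 95% gain comes from the abstract.

```c
#include <stdio.h>

int main(void) {
    const double speedup  = 1.95;     /* 95% per-server gain (from the abstract) */
    const double load     = 100000.0; /* aggregate queries/s; invented for illustration */
    const double base_qps = 500.0;    /* per-server baseline; invented for illustration */

    double before = load / base_qps;             /* servers without FPGAs */
    double after  = load / (base_qps * speedup); /* servers with FPGAs    */

    printf("servers before: %.0f\n", before);    /* 200 */
    printf("servers after:  %.0f (%.0f%% of before)\n",
           after, 100.0 * after / before);       /* 103 (51% of before) */
    return 0;
}
```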

    A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications

    With the emergence of accelerator devices such as multicores, graphics-processing units (GPUs), and field-programmable gate arrays (FPGAs), application designers are confronted with the problem of searching a huge design space, one that has been shown to have widely varying performance and energy metrics for different accelerators, different application domains, and different use cases. To address this problem, numerous studies have evaluated specific applications across different accelerators. In this paper, we analyze an important domain of applications, referred to as sliding-window applications, when executing on FPGAs, GPUs, and multicores. For each device, we present optimization strategies and analyze use cases where each device is most effective. The results show that FPGAs can achieve speedups of up to 11x and 57x compared to GPUs and multicores, respectively, while also using orders of magnitude less energy.
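    For readers outside the domain: a sliding-window application computes each output from a small neighborhood of the input, so adjacent outputs reuse most of their inputs. A minimal 2D example is below (a KxK mean filter; the sizes are illustrative and the paper's benchmarks differ). The overlap between adjacent windows is the data reuse that FPGA line buffers and custom caches can capture particularly well.

```c
#include <stdio.h>

#define W 8   /* input width;  illustrative */
#define H 6   /* input height; illustrative */
#define K 3   /* window side;  illustrative */

/* KxK mean filter: each output is computed from a KxK neighborhood of the
 * input. Adjacent windows share K*(K-1) of their K*K pixels; that reuse is
 * what FPGA line buffers, GPU shared memory, and CPU caches each try to
 * capture in their own way. */
static void mean_filter(const float in[H][W], float out[H][W]) {
    for (int y = 0; y + K <= H; y++)
        for (int x = 0; x + K <= W; x++) {
            float sum = 0.0f;
            for (int dy = 0; dy < K; dy++)
                for (int dx = 0; dx < K; dx++)
                    sum += in[y + dy][x + dx];
            out[y][x] = sum / (float)(K * K);
        }
}

int main(void) {
    float in[H][W], out[H][W] = {{0}};
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            in[y][x] = (float)(x + y);
    mean_filter(in, out);
    printf("out[0][0] = %.2f\n", out[0][0]); /* mean of the top-left 3x3: 2.00 */
    return 0;
}
```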

    Accelerating Deep Convolutional Neural Networks Using Specialized Hardware

    Recent breakthroughs in the development of multi-layer convolutional neural networks have led to state-of-the-art improvements in the accuracy of non-trivial recognition tasks such as large-category image classification and automatic speech recognition. Hardware specialization in the form of GPGPUs, FPGAs, and ASICs offers a promising path towards major leaps in processing capability while achieving high energy efficiency. To harness specialization, an effort is underway at Microsoft to accelerate deep convolutional neural networks (CNNs) using servers augmented with FPGAs, similar to the hardware that is being integrated into some of Microsoft's datacenters.
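    For context, the core computation being accelerated is the convolutional layer: a regular loop nest with abundant parallelism and data reuse. Below is a direct, unoptimized sketch; the dimensions and names are illustrative, not Microsoft's design.

```c
#include <stdio.h>

/* Illustrative layer shape; real networks and accelerator layers differ. */
#define CIN  4                 /* input feature maps  */
#define COUT 8                 /* output feature maps */
#define DIM  16                /* input width/height  */
#define K    3                 /* kernel size         */
#define ODIM (DIM - K + 1)     /* output width/height (no padding, stride 1) */

/* Direct convolution: the loop nest a hardware accelerator unrolls and
 * pipelines. Each output element (co, y, x) is an independent dot product,
 * which is the parallelism FPGAs and ASICs exploit. */
static void conv_layer(const float in[CIN][DIM][DIM],
                       const float w[COUT][CIN][K][K],
                       float out[COUT][ODIM][ODIM]) {
    for (int co = 0; co < COUT; co++)
        for (int y = 0; y < ODIM; y++)
            for (int x = 0; x < ODIM; x++) {
                float acc = 0.0f;
                for (int ci = 0; ci < CIN; ci++)
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += w[co][ci][ky][kx] * in[ci][y + ky][x + kx];
                out[co][y][x] = acc;
            }
}

int main(void) {
    static float in[CIN][DIM][DIM], w[COUT][CIN][K][K], out[COUT][ODIM][ODIM];
    in[0][0][0] = 1.0f;      /* trivial test: one nonzero input ... */
    w[0][0][0][0] = 2.0f;    /* ... and one nonzero weight          */
    conv_layer(in, w, out);
    printf("out[0][0][0] = %.1f\n", out[0][0][0]); /* 2.0 */
    return 0;
}
```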

    A reconfigurable fabric for accelerating large-scale datacenter services

    No full text
    Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables. In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution, or, while maintaining equivalent throughput, reduces the tail latency by 29%.
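    The 6x8 torus is concrete enough to sketch. Assuming a row-major numbering of the 48 FPGAs (the coordinate scheme below is an assumption for illustration, not published in the paper), each node's four cabled neighbors follow from modular arithmetic:

```c
#include <stdio.h>

#define TX 8   /* torus width  (the abstract's 6x8 2-D torus) */
#define TY 6   /* torus height */

/* Row-major node numbering; an assumption for illustration. */
static int node_id(int x, int y) { return y * TX + x; }

/* The modular arithmetic is what distinguishes a torus from a mesh:
 * edge nodes wrap around, so every FPGA has exactly four cabled peers. */
static void print_neighbors(int x, int y) {
    printf("node %2d: E=%2d W=%2d S=%2d N=%2d\n",
           node_id(x, y),
           node_id((x + 1) % TX, y),
           node_id((x + TX - 1) % TX, y),
           node_id(x, (y + 1) % TY),
           node_id(x, (y + TY - 1) % TY));
}

int main(void) {
    print_neighbors(0, 0);  /* corner node: wraparound gives W=7, N=40 */
    print_neighbors(3, 2);  /* interior node */
    return 0;
}
```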