569 research outputs found

    Extending the PCIe Interface with Parallel Compression/Decompression Hardware for Energy and Performance Optimization

    Get PDF
    PCIe is a high-performing interface used to move data from a central host PC to an accelerator such as Field Programmable Gate Arrays (FPGA). This interface allows a system to perform fast data transfers in High-Performance Computing (HPC) and provide a performance boost. However, HPC systems normally require large datasets, and in these situations PCIe can become a bottleneck. To address this issue, we propose an open-source hardware compression/decompression system that can be used to adapt with continuously-streamed data with low latency and high throughput. We implement a compressor and decompressor engines on FPGA, scale up with multiple engines working in parallel, and evaluate the energy reduction and performance with different numbers of multiple engines. To alleviate the performance bottleneck in the processor acting as a controller, we propose a hardware scheduler to fairly distribute the datasets among the engines. Our design reduces the transmission time in PCIe, and the results show an energy reduction of up to 48% in the PCIe transfers, thanks to the decrease in the number of bits that have to be transmitted. The overhead in terms of latency is maintained to a minimum and user selectable depending on the tolerances of the intended application

    CPCIe:A Compression-enabled PCIe Core for Energy and Performance Optimization

    Get PDF

    A very high speed lossless compression/decompression chip set

    Get PDF
    A chip is described that will perform lossless compression and decompression using the Rice Algorithm. The chip set is designed to compress and decompress source data in real time for many applications. The encoder is designed to code at 20 M samples/second at MIL specifications. That corresponds to 280 Mbits/second at maximum quantization or approximately 500 Mbits/second under nominal conditions. The decoder is designed to decode at 10 M samples/second at industrial specifications. A wide range of quantization levels is allowed (4...14 bits) and both nearest neighbor prediction and external prediction are supported. When the pre and post processors are bypassed, the chip set performs high speed entropy coding and decoding. This frees the chip set from being tied to one modeling technique or specific application. Both the encoder and decoder are being fabricated in a 1.0 micron CMOS process that has been tested to survive 1 megarad of total radiation dosage. The CMOS chips are small, only 5 mm on a side, and both are estimated to consume less than 1/4 of a Watt of power while operating at maximum frequency

    Robust header compression over IEEE 802 networks

    Get PDF
    Tese de mestrado. Redes e Serviços de Comunicação. Faculdade de Engenharia. Universidade do Porto, INESC Porto. 200

    REDUCING POWER DURING MANUFACTURING TEST USING DIFFERENT ARCHITECTURES

    Get PDF
    Power during manufacturing test can be several times higher than power consumption in functional mode. Excessive power during test can cause IR drop, over-heating, and early aging of the chips. In this dissertation, three different architectures have been introduced to reduce test power in general cases as well as in certain scenarios, including field test. In the first architecture, scan chains are divided into several segments. Every segment needs a control bit to enable capture in a segment when new faults are detectable on that segment for that pattern. Otherwise, the segment should be disabled to reduce capture power. We group the control bits together into one or more control chains. To address the extra pin(s) required to shift data into the control chain(s) and significant post processing in the first architecture, we explored a second architecture. The second architecture stitches the control bits into the chains they control as EECBs (embedded enable capture bits) in between the segments. This allows an ATPG software tool to automatically generate the appropriate EECB values for each pattern to maintain the fault coverage. This also works in the presence of an on-chip decompressor. The last architecture focuses primarily on the self-test of a device in a 3D stacked IC when an existing FPGA in the stack can be programmed as a tester. We show that the energy expended during test is significantly less than would be required using low power patterns fed by an on-chip decompressor for the same very short scan chains

    Packet Compression in GPU Architectures

    Get PDF
    Graphical processing unit (GPU) can support multiple operations in parallel by executing it on multiple thread unit known as warp i.e. multiple threads running the same instruction. Each time miss happens at private cache of Streaming Multiprocessor (SM), the request is migrated over the network to shared L2 cache and then later down to Memory Controller (MC) for supplying memory block. The interconnect delay becomes a bottleneck due to a large number of requests from different SM and multiple replies from the MCs. The compression technique can be used to mitigate the performance bottleneck caused by a large volume of data. In this work, I apply various compression algorithms and propose a new compression scheme, Data Segment Matching (DSM). I apply approximation to the floating-point elements to improve compressibility and develop a prediction model to identify number of approximation bits. I focus on compression techniques to resolve this bottleneck. The evaluations using a cycle accurate simulator show that this scheme improves Instructions per Cycle (IPC) by 12% on an average across various benchmarks with compressibility 50% in integer type benchmarks and 35% in floating-point type benchmarks when the proposed scheme is applied to packet compression in the interconnection network
    corecore