5 research outputs found

    Resource conflict detection in simulation of function unit pipelines

    No full text
    Processor simulators are important parts of processor design toolsets in which they are used to verify and evaluate the properties of the designed processors. While simulating architectures with independent function unit pipelines using simulation techniques that avoid the overhead of instruction bit-string interpretation, such as compiled simulation, the simulation of function unit pipelines can become one of the new bottlenecks for simulation speed. This paper evaluates several resource conflict detection models, commonly used in compiler instruction scheduling, in the context of function unit pipeline simulation. The evaluated models include the conventional reservation table based-model, the dynamic collision matrix model, and an finite state automata (FSA) based model. In addition, an improvement to the simulation initialization time by means of lazy initialization of states in the FSA-based approach is proposed. The resulting model is faster to initialize and provides comparable simulation speed to the actively initialized FSA. Document type: Part of book or chapter of boo

    Instruction Buffer with Limited Control Flow and Loop Nest Support

    Get PDF
    In this work, we present a minimalistic, energy efficient implementation of instruction buffer. We use loop detection and execution trace analysis to find most commonly executed loops in already scheduled application and tailor instruction buffer size to the size of most commonly executed loop(s). In addition to our previous work, we allow buffering of loops with limited control flow (early exit from the loop or early return to the beginning of the loop). We also show how analysis of loop nests can decrease the number of times loop body is copied from memory into the buffer. Our results show that in case of favorable loop nest, we can execute all but initial loop iterations from the instruction buffer, keeping instruction memory in the deselect mode.Peer reviewe

    A 122Mb/s Turbo decoder using a mid-range GPU

    Get PDF
    Parallel implementations of Turbo decoding has been studied extensively. Traditionally, the number of parallel sub-decoders is limited to maintain acceptable code block error rate performance loss caused by the edge effect of code block division. In addition, the sub-decoders require synchronization to exchange information in the iterative process. In this paper, we propose loosening the synchronization between the sub-decoders to achieve higher utilization of parallel processor resources. Our method allows high degree of parallel processor utilization in decoding of a single code block providing a scalable software-based implementation. The proposed implementation is demonstrated using a graphics processing unit. We achieve 122.8Mbps decoding throughput using a medium range GPU, the Nvidia GTX480. This is, to the best of our knowledge, the fastest Turbo decoding throughput achieved with a GPU-based implementation.Peer reviewe

    Turbo Decoding on Tailored OpenCL Processor

    Get PDF
    Turbo coding is commonly used in the current wireless standards such as 3G and 4G. However, due to the high computational requirements, its software-defined implementation is challenging. This paper proposes a static multi-issue exposed datapath processor design tailored for turbo decoding. In order to utilize the parallel processor datapath efficiently without resorting to low level assembly programming, the turbo decoder is implemented using OpenCL, a parallel programming standard for heterogeneous devices. The proposed implementation includes only a small set of Turbo-specific custom operations to accelerate the most critical parts of the algorithm. Most of the computation is performed using general-purpose integer operations. Thus, the processor design can be used as a general-purpose OpenCL accelerator for arbitrary integer workloads as well. The proposed processor design was evaluated both by implementing it using a Xilinx Virtex 6 FPGA and by ASIC synthesis using 130 nm and 40 nm technology libraries. The implementation achieves over 63 Mbps Turbo decoding throughput on a single low-power core. According to the ASIC synthesis, the maximum operating clock frequency is 344 MHz/1 050 MHz (130 nm/40 nm).acceptedVersionPeer reviewe
    corecore