5 research outputs found

Resource conflict detection in simulation of function unit pipelines

Author: Guzma Vladimir
Jääskeläinen Pekka
Takala Jarmo
Publication date: 31/12/2006

Processor simulators are important parts of processor design toolsets, in which they are used to verify and evaluate the properties of the designed processors. While simulating architectures with independent function unit pipelines using simulation techniques that avoid the overhead of instruction bit-string interpretation, such as compiled simulation, the simulation of function unit pipelines can become one of the new bottlenecks for simulation speed. This paper evaluates several resource conflict detection models, commonly used in compiler instruction scheduling, in the context of function unit pipeline simulation. The evaluated models include the conventional reservation-table-based model, the dynamic collision matrix model, and a finite state automaton (FSA) based model. In addition, an improvement to the simulation initialization time by means of lazy initialization of states in the FSA-based approach is proposed. The resulting model is faster to initialize and provides simulation speed comparable to that of the actively initialized FSA.

Document type: Part of book or chapter of book
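
As an illustration of the idea summarized in the abstract above, the following is a minimal sketch of resource conflict detection in which automaton transitions are built lazily, on first use, from reservation tables. It is not the authors' implementation: the operation names, resource names, reservation tables, and the `LazyFSA` class are all hypothetical and chosen only for the example.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Hypothetical reservation tables: for each operation, the resources it
# occupies in each cycle after issue (index 0 is the issue cycle).
RESERVATION_TABLES: Dict[str, Tuple[FrozenSet[str], ...]] = {
    "add": (frozenset({"alu"}),),
    "mul": (frozenset({"mult"}), frozenset({"mult"}), frozenset({"wb"})),
}

# A state lists the busy resources for the current and following cycles.
State = Tuple[FrozenSet[str], ...]
EMPTY: State = ()


def issue(state: State, op: str) -> Optional[State]:
    """Return the state after issuing `op` now, or None on a conflict."""
    table = RESERVATION_TABLES[op]
    merged = []
    for i in range(max(len(state), len(table))):
        busy = state[i] if i < len(state) else frozenset()
        need = table[i] if i < len(table) else frozenset()
        if busy & need:  # the same resource is needed in the same cycle
            return None
        merged.append(busy | need)
    return tuple(merged)


def advance(state: State) -> State:
    """Move one cycle forward: the current cycle's reservations expire."""
    return state[1:]


class LazyFSA:
    """Caches transitions the first time they are needed (lazy init)."""

    def __init__(self) -> None:
        self.issue_cache: Dict[Tuple[State, str], Optional[State]] = {}
        self.advance_cache: Dict[State, State] = {}

    def issue(self, state: State, op: str) -> Optional[State]:
        key = (state, op)
        if key not in self.issue_cache:
            self.issue_cache[key] = issue(state, op)
        return self.issue_cache[key]

    def advance(self, state: State) -> State:
        if state not in self.advance_cache:
            self.advance_cache[state] = advance(state)
        return self.advance_cache[state]
```

In this sketch no states are enumerated up front, so start-up cost is close to zero, and a repeated transition costs one dictionary lookup instead of a fresh reservation-table check. An actively initialized variant would instead enumerate all reachable states before simulation starts.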

Scipedia

Instruction Buffer with Limited Control Flow and Loop Nest Support

Author: Guzma Vladimir
Pitkänen Teemu
Takala Jarmo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011

In this work, we present a minimalistic, energy-efficient implementation of an instruction buffer. We use loop detection and execution trace analysis to find the most commonly executed loops in an already scheduled application and tailor the instruction buffer size to the size of the most commonly executed loop(s). In addition to our previous work, we allow buffering of loops with limited control flow (early exit from the loop or early return to the beginning of the loop). We also show how analysis of loop nests can decrease the number of times the loop body is copied from memory into the buffer. Our results show that, in the case of a favorable loop nest, we can execute all but the initial loop iterations from the instruction buffer, keeping the instruction memory in the deselect mode.

Peer reviewed
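
The trace-driven sizing step mentioned in the abstract above can be sketched roughly as follows. This is only an illustrative assumption about how such an analysis might look, not the method from the paper: the trace format (a list of instruction addresses), the back-edge heuristic, and the function names `loop_profile` and `suggest_buffer_size` are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def loop_profile(trace: List[int]) -> Dict[Tuple[int, int], int]:
    """Count loop back edges in an execution trace of instruction addresses.

    Assumption: any transfer to an address that is not higher than the
    current one is a loop back edge. Real traces would also need to
    filter out calls and returns.
    """
    counts: Counter = Counter()
    for prev, cur in zip(trace, trace[1:]):
        if cur <= prev:
            counts[(cur, prev)] += 1  # loop body spans cur..prev
    return dict(counts)


def suggest_buffer_size(trace: List[int]) -> int:
    """Return the body length of the loop with the most repeated work."""
    profile = loop_profile(trace)
    if not profile:
        return 0

    def weight(item: Tuple[Tuple[int, int], int]) -> int:
        (start, end), taken = item
        return (end - start + 1) * taken  # instructions re-executed

    (start, end), _ = max(profile.items(), key=weight)
    return end - start + 1


# Loop 1..3 is re-entered twice, loop 4..5 once.
example_trace = [0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5, 4, 5, 6]
print(suggest_buffer_size(example_trace))  # prints 3
```

The weighting favors the loop that accounts for the most re-executed instructions. Other criteria, such as the energy cost per buffer entry, could be substituted without changing the structure of the sketch.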

Trepo - Institutional Repository of Tampere University

A 122Mb/s Turbo decoder using a mid-range GPU

Author: Berg Heikki
Canfeng Chen
Guzma Vladimir
Jääskeläinen Pekka
Xianjun Jiao
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2013

Parallel implementations of Turbo decoding have been studied extensively. Traditionally, the number of parallel sub-decoders is limited to maintain acceptable code block error rate performance loss caused by the edge effect of code block division. In addition, the sub-decoders require synchronization to exchange information in the iterative process. In this paper, we propose loosening the synchronization between the sub-decoders to achieve higher utilization of parallel processor resources. Our method allows a high degree of parallel processor utilization in decoding of a single code block, providing a scalable software-based implementation. The proposed implementation is demonstrated using a graphics processing unit. We achieve 122.8 Mbps decoding throughput using a medium-range GPU, the Nvidia GTX480. This is, to the best of our knowledge, the fastest Turbo decoding throughput achieved with a GPU-based implementation.

Peer reviewed

Trepo - Institutional Repository of Tampere University

Turbo Decoding on Tailored OpenCL Processor

Author: Berg Heikki
Esko Otto
Guzma Vladimir
Jääskeläinen Pekka
Kultala Heikki
Takala Jarmo
Xianjun Jiao
Zetterman Tommi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2013

Turbo coding is commonly used in current wireless standards such as 3G and 4G. However, due to the high computational requirements, its software-defined implementation is challenging. This paper proposes a static multi-issue exposed datapath processor design tailored for Turbo decoding. In order to utilize the parallel processor datapath efficiently without resorting to low-level assembly programming, the Turbo decoder is implemented using OpenCL, a parallel programming standard for heterogeneous devices. The proposed implementation includes only a small set of Turbo-specific custom operations to accelerate the most critical parts of the algorithm. Most of the computation is performed using general-purpose integer operations. Thus, the processor design can be used as a general-purpose OpenCL accelerator for arbitrary integer workloads as well. The proposed processor design was evaluated both by implementing it using a Xilinx Virtex 6 FPGA and by ASIC synthesis using 130 nm and 40 nm technology libraries. The implementation achieves over 63 Mbps Turbo decoding throughput on a single low-power core. According to the ASIC synthesis, the maximum operating clock frequency is 344 MHz with the 130 nm library and 1050 MHz with the 40 nm library.

Accepted version. Peer reviewed

Trepo - Institutional Repository of Tampere University