Abstract: This paper reports the results of a comparison between reduced instruction set computing and the transport triggered architecture. Because of the simplicity and efficiency of the transport triggered architecture, its processor requires fewer execution cycles than the OpenRisc processor. This paper also presents a case study on designing an Architecture Definition File for a transport triggered architecture-based design tool, and it shows how the structure of the Architecture Definition File contributes to a high-speed design. In a custom Architecture Definition File, a new function unit is designed to improve processor performance, and the cycle count required to execute the Cyclic Redundancy Check algorithm drops from 5031 to 7.
Introduction
Programming for a transport triggered architecture (TTA) differs from programming for traditional processor structures. Because the custom functional units (FUs) of a TTA are highly flexible, TTA is an excellent way to speed up an application-specific instruction set processor (ASIP) [1]. Moreover, the working principle of a TTA processor goes beyond the commonly supported reduced instruction set computing (RISC)-like operations, so custom functional units do not introduce communication overhead. This paper compares the OpenRisc and TCE toolchains, as representatives of RISC and TTA processors, in terms of cycle count. Later in the paper, the TTA is modified to improve processor performance. Figure 1 shows an overview of these two different schemes.
In this paper, a register transfer level (RTL) simulation using the Icarus Verilog simulator is performed for reference designs on the OpenRisc architecture. The simulated core is a 32-bit scalar RISC with Harvard microarchitecture, a 5-stage single-issue integer pipeline, virtual memory support, and basic DSP capabilities [2]. Figure 1(a) shows an overview of the OpenRisc 1200 core architecture. For the RTL implementation, all the blocks of the OpenRisc 1200 IP core are written in the Verilog Hardware Description Language (HDL) and are published under the GNU License. The test programs are compiled to the Executable and Linkable Format (ELF), which can be executed both in an instruction set simulator (ISS) and in an RTL simulator.
The memory operand wraps around from the maximum effective address, and load/store instructions using these address modes contain a signed 16-bit immediate value that is added to the contents of a general-purpose register specified in the instruction [4]. OpenRisc 1200 implements 32 general-purpose registers (GPRs), each 32 bits wide. The Load/Store Unit (LSU) transfers all data between the GPRs and the CPU's internal bus. In Figure 1(b), the instruction unit implements the basic instruction pipeline by fetching instructions from the memory subsystem, dispatching them to available execution units, and maintaining a state history to ensure a precise execution model. The OpenRisc processor has a five-stage pipeline consisting of the fetch, decode, execute, memory, and write-back stages [4]. Up to five instructions are in progress at any given clock cycle, and each stage of the pipeline performs its task in parallel with all other stages. In this paper, the execution clock cycles are counted for the OpenRisc processor by applying two reference designs, named CRCFast and CRCSlow. The results are discussed in the simulation results section. Figure 1(c) shows the basic architecture definition file of the TTA processor using the TCE tool [5]. The transport buses transfer the operands on which the instructions operate. Here the number of buses is customized so as to reduce cycle counts. In the TTA architecture, it is possible to implement one or more operations in each functional unit (FU), because the FUs of this architecture are internally pipelined. As shown in Figure 1(c), one of the FU ports is called the trigger port. An operation is executed as a side effect of transferring an operand to this port. That means no extra instructions are required for triggering, which further reduces the instruction count. However, the result can be read from the output port only after the time defined by the static latency of the operation.
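The trigger-port behaviour described above can be illustrated with a toy model. This is a minimal sketch, not the TCE simulator: the class `FunctionUnit`, its method names, and the convention that a result becomes readable `latency` cycles after the trigger write are all assumptions made for illustration.

```python
class FunctionUnit:
    """Toy TTA function unit: one operand port, one trigger port, one result port.

    Hypothetical model for illustration only; it does not reproduce TCE internals.
    """

    def __init__(self, operation, latency):
        self.operation = operation   # two-input function, e.g. addition
        self.latency = latency       # static latency in cycles (assumed convention)
        self.operand = 0             # value held in the plain operand port
        self.pending = []            # queue of (ready_cycle, value), models the FU pipeline
        self.result = None           # value currently visible on the result port

    def write_operand(self, value):
        # A move to the operand port only stores the value; nothing executes.
        self.operand = value

    def write_trigger(self, value, cycle):
        # A move to the trigger port starts the operation as a side effect.
        ready = cycle + self.latency
        self.pending.append((ready, self.operation(self.operand, value)))

    def read_result(self, cycle):
        # Results become visible only after their latency has elapsed.
        while self.pending and self.pending[0][0] <= cycle:
            self.result = self.pending.pop(0)[1]
        return self.result


adder = FunctionUnit(lambda a, b: a + b, latency=1)
adder.write_operand(2)            # move 2 -> operand port
adder.write_trigger(3, cycle=0)   # move 3 -> trigger port, starts the addition
early = adder.read_result(cycle=0)
late = adder.read_result(cycle=1)
```

In this sketch, reading the result port in the same cycle as the trigger write returns nothing, while a read one cycle later returns the sum. No separate "execute" instruction appears anywhere; the two data moves are the whole program.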
In TTA, register files (RFs) are part of an interconnection network that is visible to the programmer. TTA templates allow the customization of RFs and FUs, which can substantially improve processor performance. To add custom operations in the TCE tool, the processor architecture and the high-level language (HLL) source code are modified according to the custom operation. This paper shows the performance of these custom operations in terms of cycle count.
Architecture of TTA and OpenRisc processor
The minimal architecture for the TCE tool, known as minimal.adf, is used as the starting-point architecture.
The resources of all architecture definition files (ADFs) are listed in Table I. The cyclic redundancy check (CRC) algorithm uses a "Reflect" function, so an FU named "REFLECTER" is added to minimal.adf [3]. The resulting architecture is named custom.adf. Instead of duplicating the whole ALU, a new architecture named custom_1.adf is formed by adding FUs to custom.adf that perform the logical AND and ADD operations.
In this paper, the response of an efficient CRC architecture known as CRCFast is analyzed with different architecture files, and a new FU named CRCFast for CRC is added to the minimal.adf file to form custom_2.adf.
For the OpenRisc processor, a "cfg" file contains the default configuration and a set of simulation environment settings that resemble the actual hardware. For the RTL simulator, the Verilog files of all IP cores are included using a makefile. Once the environment is configured, the simulator generates the "log" files under the "out" and "run" folders. The minimal architecture of the OpenRisc processor is shown in Table I.
Simulation results
The first part of Table II shows the execution results of the two processors for the CRCFast and CRCSlow reference designs. CRCFast is more efficient than CRCSlow: it takes fewer execution cycles. For the CRCFast reference design, the OpenRisc processor performs 4179 executions, which is more than double the count of the TTA processor. For the OpenRisc processor, the reference design is compiled using the OpenRisc toolchain (or32elf), and a memory image (vmem) is generated. This program image is then used in simulation to fill the RAM. Next, the Verilog RTL sources are checked, compiled, and simulated. So the RISC processor generates all the signals required to execute each operation. In contrast, the operations of the TTA processor are executed as side effects of data transport, when data enters the trigger port of an FU. For that reason, OpenRisc takes more executions than a TTA processor to implement the design. The same holds for the CRCSlow reference design. Therefore, transport triggered architectures replace RISC operations with even simpler actions, namely register-to-register data transports. As a result, the compiler can perform extra optimizations and has far better control over hardware usage.
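The difference between the two reference designs can be sketched as follows. This is an illustrative sketch only: the paper does not list its CRC parameters, so the standard CRC-32 settings (polynomial 0x04C11DB7, initial value and final XOR of 0xFFFFFFFF, reflected data and remainder) are assumed here, and the function names `crc_slow` and `crc_fast` are chosen to mirror the design names rather than taken from the source code used in the experiments.

```python
POLY = 0x04C11DB7      # assumed CRC-32 generator polynomial
MASK = 0xFFFFFFFF      # keep remainders at 32 bits
TOP_BIT = 0x80000000


def reflect(value, nbits):
    """Reverse the lowest nbits bits of value, one bit per loop pass."""
    result = 0
    for bit in range(nbits):
        if value & 1:
            result |= 1 << (nbits - 1 - bit)
        value >>= 1
    return result


def divide_byte(remainder):
    """Run eight steps of modulo-2 division on a 32-bit remainder."""
    for _ in range(8):
        if remainder & TOP_BIT:
            remainder = ((remainder << 1) ^ POLY) & MASK
        else:
            remainder = (remainder << 1) & MASK
    return remainder


def crc_slow(message):
    """Bit-by-bit CRC: eight division steps for every message byte."""
    remainder = 0xFFFFFFFF
    for byte in message:
        remainder ^= reflect(byte, 8) << 24
        remainder = divide_byte(remainder)
    return reflect(remainder, 32) ^ 0xFFFFFFFF


# Table of precomputed remainders, one entry per possible byte value.
CRC_TABLE = [divide_byte(dividend << 24) for dividend in range(256)]


def crc_fast(message):
    """Table-driven CRC: one table lookup replaces the eight division steps."""
    remainder = 0xFFFFFFFF
    for byte in message:
        index = reflect(byte, 8) ^ (remainder >> 24)
        remainder = CRC_TABLE[index] ^ ((remainder << 8) & MASK)
    return reflect(remainder, 32) ^ 0xFFFFFFFF


slow_value = crc_slow(b"123456789")
fast_value = crc_fast(b"123456789")
```

Both functions return the same checksum; the table-driven version simply trades 1 KiB of lookup table for the inner bit loop, which is why it needs fewer executed operations on either processor.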
The latter part of Table II summarizes the cycle counts and resource utilizations for all ADFs. Simulations are run with different ADFs so that resource utilization and cycle counts can be compared. Table II shows that the minimal architecture file does not offer good performance in terms of cycle count: it takes 5031 cycles to execute the CRC function application. According to the resource utilization column of Table II, the ADD operation of the minimal architecture accounts for 16.77% (844) of total executions, and word load (LDW) accounts for 9.8% (494) of total executions. The majority of the executions are spent on reflecting bit patterns. Within the ALU of the minimal.adf architecture, logic operators take a significant fraction of executions, because in software the bit pattern is reflected iteratively, one bit/byte at a time. However, when this reflection operation is included as part of the hardware architecture, i.e. in the ADF, the whole bit pattern is reflected at once.
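The cost of reflecting a bit pattern in software can be seen in a short sketch. The function below is an assumed, generic implementation of bit reflection (it is not the source code used in the experiments); it also returns the number of loop passes to make the per-bit cost explicit.

```python
def reflect_counting(value, nbits):
    """Reverse the lowest nbits bits of value and count the loop passes.

    Each pass performs a test, a shift and possibly an OR, which is why
    software reflection keeps the logic operators of the ALU busy.
    """
    result = 0
    passes = 0
    for bit in range(nbits):
        if value & 1:
            result |= 1 << (nbits - 1 - bit)
        value >>= 1
        passes += 1
    return result, passes


byte_result, byte_passes = reflect_counting(0xD0, 8)    # 1101 0000 -> 0000 1011
word_result, word_passes = reflect_counting(0x1, 32)    # lowest bit moves to the top
```

Reflecting one byte takes 8 passes and reflecting a 32-bit word takes 32, each made of several ALU operations. A dedicated FU can instead produce the reflected word with fixed wiring in a single operation.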
For that reason, in custom.adf, the REFLECTER FU is added with the REFLECT 8 and REFLECT 32 operators. Table II shows that custom.adf brings a significant improvement: the cycle count drops to 958 executions. In this custom architecture, most of the executions are consumed by the ADD operator, which accounts for 17.2% (165) of the total executions. This is the main bottleneck of the ALU, so duplicating the specific FUs reduces the executions needed. The architecture is therefore further modified by adding ADD function units, and it is named custom_1.adf. Here the cycle count is reduced further to 829 executions. There is a tradeoff between hardware and software implementations: if an operation is implemented in software, the processor requires more executions than with a hardware implementation of that same operation. Similarly, the final architecture custom_2.adf is modified using one user-defined function, CRCFast; this FU includes all the functions added in custom.adf and custom_1.adf. The result changes dramatically: the cycle count drops to only 15. Table II shows that the influence of FUs on cycle count is significant. Similarly, the transport buses and the number of registers in the RF contribute to reducing the cycle count. The architectures above are simulated with a single transport bus, which means the TTA programs are sequential. The amount of instruction level parallelism (ILP) depends on the number of operations that are considered simultaneously. In this paper, custom_2.adf is also simulated with two transport buses, and the total cycle count drops further to 7. If only one bus is used to simulate the custom_2 architecture, these parallel executions become sequential and take more cycles.
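The cycle counts quoted above can be turned into speedup factors relative to minimal.adf. The short calculation below uses only the numbers reported in the text; the dictionary labels are informal names chosen for this sketch.

```python
# Cycle counts as reported in the text for each architecture definition file.
cycle_counts = {
    "minimal.adf": 5031,
    "custom.adf": 958,
    "custom_1.adf": 829,
    "custom_2.adf (1 bus)": 15,
    "custom_2.adf (2 buses)": 7,
}

baseline = cycle_counts["minimal.adf"]

# Speedup = baseline cycles / cycles of the modified architecture.
speedups = {name: baseline / cycles for name, cycles in cycle_counts.items()}

for name, factor in speedups.items():
    print(f"{name:24s} {cycle_counts[name]:5d} cycles  speedup {factor:7.2f}x")
```

The REFLECTER FU alone gives roughly a fivefold reduction, the extra ADD units bring it to about six, and moving the whole CRC computation into one FU accounts for the remaining two orders of magnitude.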
Parallel execution depends on the number of independent operations; dependent operations must wait for the operation latency, so the instruction compiler must also take operation latencies into account. In addition, a large number of load and store operations are executed for the minimal.adf, custom.adf, and custom_1.adf architectures. All of these architectures have 5 general-purpose registers.
Here one additional register file (5 registers, 32 bits wide) is added to the minimal.adf architecture so that the register files can be accessed simultaneously within one clock cycle.
In Table II, the minimal.adf architecture is simulated for single and multiple RF units. For multiple RF units, the number of load and store operations decreases significantly compared to a single RF unit, and the cycle count decreases to 2,244. The number of add operations is also reduced, because these additions are used to calculate stack memory addresses for the spilled registers.
Conclusions
This paper compares RISC and TTA-based processors in terms of execution counts. The comparison was made using the OpenRisc and TCE tools, which are based on application-specific processor design frameworks. The purpose of such a framework is to improve processor performance. The results show that the TTA yields faster application-specific processors than the RISC processor and also supports a larger amount of instruction level parallelism. Moreover, this paper describes an important issue in designing ADFs for TTA: the key to reducing cycle count is transferring software workloads to the hardware. The results clearly show that the modifications of the ADFs increase the performance of the processors described in this paper.
