The main aim of this paper is to explain how application-specific function units (FUs) are generated to reduce the number of instructions in a Luby Transform (LT) codec processor. To this end, the Transport Triggered Architecture (TTA) is taken as the processor template for designing a high-speed TTA-based LT codec processor using the TTA-based Co-design Environment (TCE) tool. In this design, processor architectures named P1, P2, P3, P4, P5, and P6 are generated to gradually improve the performance of the TTA processor. P6 took only 4,466 cycles and 43 ms to simulate an LT codec system, and it needed only a single iteration to generate the decoded signal.
Introduction
Silicon intellectual property (SIP) blocks, or silicon IPs, have been used as components in silicon chip design since the mid-1990s. The constraints on SIP design quality became more demanding after the year 2000, and since then SIP has been widely accepted and used on a large scale [1]. In system-on-chip (SoC) design, programmability, reusability, and concurrent operation are the most pressing challenges, and these push the design work from the register transfer level (RTL) to a higher abstraction level [1]. TTA is a very useful platform for designing an application-specific processor (ASP) [2]. TTAs are built from different FUs and bus connections. In this architecture, the FUs are fully independent and are connected through an interconnection network. We develop different processor architectures by including efficient custom FUs and database FUs, modifying the bus network, adding register files (RFs), and so on. Besides these modifications of the TTA architecture, the design of the LT encoder and decoder should also be modified to obtain a simple and computationally efficient codec processor. To create the hardware for an LT codec, it is necessary to develop an implementation platform for the LT codec based on the SoC design. Recently, research has been carried out on implementing LT codec architectures on field-programmable gate array (FPGA) evaluation boards. The researchers proposed different RTL architectures whose performance was evaluated in terms of the resource utilization of the prototype board. Zhang presented an architecture for a soft-decision LT decoder with a block length of 1024 bits and 100 iterations [3]. In it, input-node and output-node processing techniques are described to accelerate decoding. To apply these node-processing units, an efficient router and reverse router are designed to represent the graph connectivity between input and output nodes. Brandon et al. presented a scalable bit-serial architecture for a low-density parity-check (LDPC) decoder [4].
Here, the decoder was implemented for a (256,128) regular (3,6) LDPC code using Taiwan Semiconductor Manufacturing Company (TSMC) 180 nm 6-metal CMOS technology. It has a decoded information throughput of 350 Mbps, a core area of 6.96 mm², and an energy efficiency of 7.56 nJ per uncoded bit at low signal-to-noise ratio (SNR) [4]. There are some other unpublished works based on the FPGA implementation of an LT codec application [5]. All these works followed application-specific integrated circuit (ASIC) design techniques. The bottleneck of this ASIC design approach is the manual effort required for RTL design. Sometimes the RTL design has to be optimized to reduce the footprint and the power consumed by the chip. Moreover, it is time consuming to translate input application architectures into synthesizable hardware description language (HDL) code. Therefore, to reduce this effort, another design automation technique is needed that takes the application written in a high-level language (HLL) and the processor architecture written in an architecture description language (ADL) as inputs and generates an RTL design for the ASP. This paper describes a high-speed LT codec processor whose performance is measured by the cycle count and the simulation time.
Design methodologies
In this paper, we mainly focus on the TCE tool for generating efficient LT codec processors [6]. Fig. 1(a) shows the basic design methodology for a high-speed LT codec processor. As seen, after writing the LT codec program in HLL, we select a processor platform: the TTA architecture using TCE. After that, the architecture is generated using an ADL or the techniques prescribed by the application-specific instruction-set processor (ASIP) tools. Then, we apply benchmarking and evaluation after finishing architecture optimization. We selected the cycle count and the simulation time as benchmark parameters; if these parameters are not satisfactory, the architecture of the platform must be optimized further. Besides these hardware modifications, it is necessary to modify the LT codec design by removing unnecessary instructions, reducing redundant operations, or selecting new algorithms. Fig. 1(b) shows the complete structure of the TCE ASIP design flow. From this figure, we see that the desired application in HLL and the design requirements are the inputs to the design flow. At the beginning of the design flow, a starting-point architecture, given as an architecture definition file (ADF), is required. The structure of the architecture is very important for meeting the desired requirements, and the flow offers flexible ways to modify the architecture to meet them. Therefore, the aim of this paper is to show how different ADFs reduce the cycle count needed to run the input application. The source code, together with the starting-point architecture (the ADF), is compiled by the TCE C compiler, which generates a TTA program exchange format (TPEF) binary file [7]. The program is then simulated, and the results are fed back to the starting-point architecture (ADF) to further adjust its parameters. If the minimal structure of the ADF fails to meet the requirements, then a custom architecture is applied to modify the minimal architecture.
Custom operations can be introduced to accelerate the application. First, it is necessary to identify a candidate custom operation; the designers then create a compiler definition for it using the operation set editor tool. In order to simulate the custom operation FUs, simulation models written in C/C++ are required. After this, the processor architecture and the HLL source code are modified to use the custom operation. In this paper, we show the performance of such custom operations in terms of cycle count and resource utilization, with the LT encoder and decoder as the input application.
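The iterative part of this flow (compile, simulate, compare against the requirement, then revise the ADF) can be summarised as a small loop. The following is only a sketch in Python: the function name `first_adf_meeting_budget`, the `simulate` callable that stands in for one compile-and-simulate run, and the cycle counts in the usage example are hypothetical and are not taken from the TCE tools or from the results of this paper.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_adf_meeting_budget(
    adfs: Iterable[str],
    simulate: Callable[[str], int],
    cycle_budget: int,
) -> Optional[Tuple[str, int]]:
    """Return the first (adf, cycles) pair whose cycle count fits the budget."""
    for adf in adfs:
        # Stand-in for compiling the HLL source against this ADF and
        # simulating the resulting program to obtain a cycle count.
        cycles = simulate(adf)
        if cycles <= cycle_budget:
            return adf, cycles
    # No candidate met the requirement: the architecture (or the input
    # design) has to be modified further.
    return None


if __name__ == "__main__":
    # Hypothetical cycle counts, used only to exercise the loop.
    fake_results = {"minimal.adf": 900, "custom_op.adf": 120}
    print(first_adf_meeting_budget(fake_results, fake_results.__getitem__, 200))
```

In practice the candidates would be ordered from the minimal architecture towards richer ones, so that the loop stops at the cheapest architecture that satisfies the benchmark.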
Overview of LT codec architecture
In an encoder, the output degree d is drawn randomly from a degree distribution function, as explained in [8]. In our encoder architecture, a uniform random number generator (RNG) is used to obtain the degree value from this degree distribution [9]. The addresses of the message bits are also chosen randomly, and the combination of the degree column and the message addresses satisfies the distributions given by the equations for the ideal soliton distribution (ISD) and the robust soliton distribution (RSD) [8]. For this reason, in the ASIP design, we translated the encoding process of the LT codec into HLL with a minimum number of executed operations, which is much simpler than using lookup tables (LUTs). In this paper, we used this algorithm as the encoding technique of the LT code and designed the encoder processor using ASIP tools. In an LT codec, the decoder is more complex than the encoder. From the encoding explanation, we see that direct RTL mapping is more difficult than HLL mapping. We first explain the ASIC architecture of the LT decoder in terms of Check Node Unit (CNU) and Variable Node Unit (VNU) operations [9]. Fig. 2(a) shows the CNU of the LT decoder architecture. In this CNU module, the LLR memory is used for the check node operation while the message passes through the check node. As in the encoder, the same degree distribution table is used, so that when the degree is one, the counter records the position of the degree-one node and the CNU memory stores the message read from the LLR memory at that address. When the degree is not equal to one, the counter continues counting, and the message read from the LLR memory at the counted address is multiplied with the message from the VNU memory.
The CNU memory therefore holds messages for degree one and updated messages for degrees greater than one. Messages pass through these CNU nodes, and the updated messages are stored in the CNU memory. As shown in Fig. 2(b), each variable node contains 4 LUTs. Two new LUTs, the edge information table and the index table, are included in the VNU operation. These additional tables hold the node and edge information provided by the degree distribution function. The VNU function unit takes data from the CNU memory and stores it in the VNU memory after the node routing and inverse node routing operations shown in Fig. 2(b). In the VNU, the processing unit accumulates messages serially from the check nodes and stores them in the variable node memory. For this reason, direct RTL mapping of the decoding process is considerably more complex than HLL mapping of the decoder. We explain the decoding procedure using HLL mapping. In this LT codec implementation, we have taken 128 bits for the information signal and 256 bits for the encoded signal. In order to recover the decoded signal from the encoded bit stream, a soft decoding procedure based on the sum-product algorithm is applied. Channel decoding in an LT decoder is based on the log-likelihood ratio (LLR) of a binary random variable $X \in \{\pm 1\}$ or $X \in \{0, 1\}$, defined by the LLR equation [10]. The LT decoder operates according to the sum-product algorithm by passing messages (LLR values) on a Tanner graph. Let $L(t_{i,j})$ denote an L value message passed from check node i to variable node j, and let $L(h_{i,j})$ denote an L value message passed from variable node i to check node j. From Jenkač and Mayer [10], $L(t_{i,n})$ can be written as:
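For reference, the check-node update commonly used in sum-product decoding of LT codes has the form below. The neighbour set $\mathcal{N}(i)$ of check node $i$ and the index convention are assumptions made here; Fig. 3 and [10] remain the authoritative statement of the equations used in this work.

```latex
\tanh\!\left(\frac{L(t_{i,n})}{2}\right)
  = \tanh\!\left(\frac{L(\hat{c}_i)}{2}\right)
    \prod_{j \in \mathcal{N}(i) \setminus \{n\}}
    \tanh\!\left(\frac{L(h_{j,i})}{2}\right)
```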
where $L(\hat{c}_i)$ denotes the received L value of the codeword from the channel. Similarly, the L value $L(h_{i,j})$ depends on the messages passed to variable node i, so $L(h_{i,j})$ and the decoding decision can be obtained from the equations given in Fig. 3. The decoding algorithm was developed from these equations [10], and to write the procedure in HLL, we followed the algorithm explained in Fig. 3. First, we take one 2D array ($L(t_{i,j})$) whose size is the encoded signal length by the maximum degree value. At the encoding end, we have already generated the edges, the index of those edges for the variable nodes, and the degree value of each check node. We first need to find which check nodes have a single degree; in other words, if the degree is 1, the LLR value of that check node is stored in the $L(t_{i,j})$ memory. Otherwise, the message-passing value and the edge and degree information are stored in the $L(t_{i,j})$ memory. Then, we take another 1D array ($L(u_i)$) whose size is the information signal length by 1.
According to the equations given in Fig. 3, the message value of each variable node should be stored in the $L(\hat{u}_i)$ memory. After that, the decoded signal is found by applying the hard decision, as shown in Fig. 3.
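The procedure described in this section can be sketched in a few lines of code. The sketch below is written in Python for brevity rather than in the C used for the actual input design. The function name `lt_soft_decode`, the nested-list layout of the messages, the sign convention (a positive L value means bit 0), and the small example graph are assumptions made for illustration; this is a generic sum-product decoder, not the authors' implementation.

```python
import math


def lt_soft_decode(llr, edges, k, iterations):
    """Sum-product decoding of an LT code on its Tanner graph.

    llr[i]   : channel L value of encoded bit i (positive means bit 0)
    edges[i] : indices of the information bits XORed into encoded bit i
    k        : number of information bits
    """
    # h[i][n]: message from variable node edges[i][n] to check node i
    h = [[0.0] * len(e) for e in edges]
    # t[i][n]: message from check node i to variable node edges[i][n]
    t = [[0.0] * len(e) for e in edges]
    total = [0.0] * k
    for _ in range(iterations):
        # Check-node update.
        for i, e in enumerate(edges):
            for n in range(len(e)):
                if len(e) == 1:
                    # A degree-1 check node passes its channel value directly.
                    t[i][n] = llr[i]
                    continue
                prod = math.tanh(llr[i] / 2.0)
                for m in range(len(e)):
                    if m != n:
                        prod *= math.tanh(h[i][m] / 2.0)
                # Clamp so that atanh stays finite for very reliable inputs.
                prod = max(min(prod, 0.999999), -0.999999)
                t[i][n] = 2.0 * math.atanh(prod)
        # Variable-node update: accumulate all incoming messages ...
        total = [0.0] * k
        for i, e in enumerate(edges):
            for n, v in enumerate(e):
                total[v] += t[i][n]
        # ... and send back the sum excluding the receiving check node.
        for i, e in enumerate(edges):
            for n, v in enumerate(e):
                h[i][n] = total[v] - t[i][n]
    # Hard decision on the accumulated L values.
    return [0 if value >= 0 else 1 for value in total]


if __name__ == "__main__":
    # Information bits (1, 0, 1) encoded as u0, u0^u1, u1^u2, u2, u0^u2.
    graph = [[0], [0, 1], [1, 2], [2], [0, 2]]
    received = [-4.0, -4.0, -4.0, -4.0, 4.0]
    print(lt_soft_decode(received, graph, 3, 3))
```

In the first iteration only the degree-1 check nodes contribute, because all variable-to-check messages start at zero; later iterations propagate those values through the higher-degree check nodes, which mirrors the search for single-degree check nodes described above.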
LT codec processors using TCE
In the TCE tool, the TCE C compiler takes the LT codec file written in HLL (.c) and the architecture written in the architecture definition file (.adf) as input designs. The TTA simulator (TTASIM) simulates the instruction set, and the results are reported in terms of cycle count, time, and resource utilization. If these results fail to satisfy the requirements, the .adf and .c files need to be modified. Designing a custom FU is an efficient technique for increasing processor performance. First, the processor P1 is designed as a minimal architecture. After that, the P1 architecture is modified step by step by including different custom FUs, adding more data buses, increasing the number of registers in the RF, and so on. In this paper, we only discuss the performance of custom FUs. P1 is the common platform for all architectures; P2 is developed by adding four FUs to the P1 file structure. Similarly, P3 is created by including a RANDOM custom FU in the P2 file, whereas P4 is generated by adding a DEGREE FU to the P3 file. In similar fashion, P5 and P6 are formed by adding an LLR FU and an Encoder_Decoder FU, respectively. Table I lists these architectures. Here, we explain how an important custom FU named DEGREE is generated. Before designing the custom FU for the LT decoder, we explain the main bottleneck in the decoding algorithm. The decoding algorithm uses a soft decoding procedure based on check node and variable node operations. In a variable node unit (VNU) operation, it is necessary to know how many edges are formed for each variable node so that the degree distribution of the message signal can be determined. Similarly, in a check node unit (CNU) operation, it is necessary to know how many variable nodes are connected to each check node (the edge information of the check nodes).
It is mandatory to find the single-edge check nodes (check nodes of degree 1 in each update), so the edges of the check nodes must be indexed. To build the custom FU for the LT decoder, we need to define three parameters for this custom FU and use the required outputs fetched from it.
The name of this custom FU is DEGREE. Moreover, at decoding, the encoded signal should be taken from the DEGREE FU. Fig. 4 shows the structure of the custom DEGREE FU, which is used for the decoding algorithm of the LT codec. The DEGREE FU gives four outputs, as labeled in Fig. 4. Of these outputs, the degree, edge, and index information become part of the architecture of the LT codec processor. As a result, the new ADF file P4 needs fewer cycles to perform the decoding operation. In this ADF architecture, all of the encoding operations (generation of the encoded signal and of the degree, index, and edge information) are part of the DEGREE FU, so the code related to these activities can be removed from the main input design written in C. The simulation model of the custom DEGREE FU is written in C++. This makes custom FUs a powerful technique in the TCE tool.
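To make the role of such an operation concrete, the sketch below models in Python what a DEGREE-style operation might compute for one encoded position. It is a behavioural illustration only, not the C++ simulation model used with TCE. The function name `degree_op`, its three parameters, the small `DEGREE_CDF` table, and the interpretation of the `index` output as the owning check-node position are all assumptions; the actual parameters and outputs are those shown in Fig. 4.

```python
import random

# Hypothetical cumulative degree table: (degree, cumulative probability).
# It stands in for the degree distribution table used by the encoder.
DEGREE_CDF = [(1, 0.10), (2, 0.60), (3, 0.80), (4, 1.00)]


def degree_op(message, position, seed):
    """Behavioural stand-in for a DEGREE-style custom operation.

    Returns (encoded_bit, degree, edges, index) for one encoded position.
    The message must contain at least as many bits as the largest degree.
    """
    # Seeding per position keeps the result reproducible, so the decoder
    # side can regenerate the same degree and edge information.
    rng = random.Random(seed * 100003 + position)
    r = rng.random()  # uniform in [0, 1)
    degree = next(d for d, cumulative in DEGREE_CDF if r <= cumulative)
    # Choose distinct message addresses for the edges of this check node.
    edges = rng.sample(range(len(message)), degree)
    encoded_bit = 0
    for address in edges:
        encoded_bit ^= message[address]
    index = position  # assumed meaning: the check node that owns these edges
    return encoded_bit, degree, edges, index


if __name__ == "__main__":
    info_bits = [1, 0, 1, 1, 0, 0, 1, 0]
    print(degree_op(info_bits, position=5, seed=1))
```

Because everything the decoder needs (encoded bit, degree, edges, index) comes out of one call, the corresponding C code in the input design reduces to a single operation invocation per encoded position, which is the effect described above.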
Simulation results
We translated the complete encoding and decoding algorithm into C. Before feeding the signal to the decoding module, we added noise to corrupt the transmitted signal, so the overall communication channel is modeled as additive white Gaussian noise (AWGN). The main aim of this paper is to implement this LT codec communication system using ASIP design tools. The results of this implementation show how efficiently we produced the LT codec processor, where efficiency is measured in terms of cycle count and the time required for simulation. The area and the number of gates and cells required to implement this architecture were discussed in [9]. The simulation procedures using the TCE tool were also discussed in detail in [11]. First, the minimal architecture (P1) is used, which contains the minimum resources that the TCE compiler needs to compile C code. The P1 architecture is therefore mandatory, and new architectures are formed by adding custom FUs to it or modifying it. Instead of copying whole FUs, duplicating only the specific operation of an FU reduces the total cycle count [11]. For this reason, P2 is developed by adding such resources to the minimal architecture. In order to increase the performance of the processor, new FUs and RFs are added to the P1 file, and the resulting architectures are listed in Table I. We developed a hierarchy of processors for the LT codec, and their performance is tabulated in terms of cycle count, time, and resource utilization. There are various ways to increase the performance of the processor. For example, increasing the width of the RFs, duplicating the FUs, increasing the number of transport buses, modifying the design architecture, and generating custom FUs for specific operations are all widely used techniques.
However, we focused on modifying the LT codec input design and generating custom FUs for the LT codec architecture. The other techniques are explained in detail in [11]. After the simulation with P1 is finished using TTASIM, the results give the cycle count, the time required for simulation, and the processor utilization, which are tabulated in Table II. Table II shows the implementation results of P1, P2, and the custom architectures. A new architecture named P2 is formed, which performs well compared to the P1 architecture. This kind of improvement is explained in another paper by the authors [11]. In this paper, we show how a custom operation is generated based on the LT codec architecture and how it reduces the number of cycles and improves resource utilization. To create such operations, we need to select specific operations in our LT codec algorithm from which to produce a custom FU. As discussed earlier, the RNG is very important in the LT encoder and decoder. In HLL, the default C random function was used to generate random numbers. Therefore, we generated a new FU, RANDOM, which generates the random number, and we use this FU in the architecture named P3. The results show that this custom FU takes only 230 cycles and accounts for 0.001% of the total execution. This ADF takes almost 84,900 fewer cycles than the P2 architecture. Using the P3 architecture, the LT codec takes 195,431,136 cycles and 1,954,311 ms. Still, this is not a sufficient reduction in the cycle count for implementing the LT codec, and a more efficient processor is needed. There are several ways to improve the performance of the processor. We modified the input design of the LT codec step by step. For example, the random number generator is widely used in encoders and channel noise generators.
If this RNG is included as part of the input design, it consumes almost 370 cycles (84,900/230) per function call, compared to an RNG included as part of the compiler design. So, if there are multiple calls to the RNG function in HLL, a huge number of cycles is required. One possible solution is to design a uniform RNG, but it is very difficult to generate a uniform RNG that satisfies the functionality of both the encoder and the decoder. We therefore modified the input design according to where a random number is actually needed. For example, to generate the degree distribution during encoding, rand() is used in its prescribed manner. For noise generation, on the other hand, we used LUTs instead of an RNG. Similarly, the decoding process of the LT codec is iterative. We need to design a decoder that takes fewer iterations, and the number of iterations depends on the degree distribution and on the number of redundant bits available to decode the encoded signal. In this paper, while preserving the functionality of the LT codec, we modify the degree distribution to reduce the cycle count and the simulation time. Later, we will show the design of a custom FU for the LT decoder. We now explain the cost of the different parts of the LT codec. Table III shows the simulation result for the LT encoder using the Encoder.adf architecture, whose CUS_ENC custom FU implements the major operation of the LT encoder. Table III shows that the encoder alone takes 23,946 cycles, and that this custom FU accounts for only 1% of the total execution. At first, however, we simulated the LT encoder using the P1 architecture, and it took a large amount of time and many cycles because there was no custom FU. Then, in Encoder.adf, we included the custom FU named CUS_ENC to transfer the major operation of the encoding algorithm to the compiler (hardware architecture).
From Table III, we see that this custom operation takes only 230 operations and reduces clock cycles to almost 7,717,027. This is a significant improvement in performance. Table IV shows the simulation result of the LT decoder using P4. The P4 configuration takes 184,541,996 cycles, which is 10,889,140 cycles fewer than the P3 architecture. From Table IV, it can be seen that the DEGREE FU takes only 358 cycles when its operations are part of the ADF architecture. With this operation, the processor improves efficiency by eliminating 10,889,140 cycles compared to P3.
Still, this is not sufficient in terms of cycle reduction, so further modification is needed. According to the sum-product algorithm (SPA), 'tanh' is used for sign identification in the CNU and VNU operations. Therefore, we made a custom FU for the 'tanh' function and included it in the architecture named P5. Table V shows the results of this processor. Comparing Tables IV and V, the LLR custom FU reduces the number of cycles by 163,425,299 relative to the P4 processor. LLR itself consumes only [...]. If we analyze the decoding part of the input design, the complexity of the decoding algorithm comes down to the number of iterations of the message-passing algorithm. Moreover, this number of iterations depends on the degree distribution of the encoded signal. For a constant degree distribution, the error is inversely proportional to the number of iterations. In this paper, we focused on the implementation of the encoder and decoder.
So, we slightly modified the degree distribution to ensure that the error is zero and calculated the cycle count with respect to the number of iterations. For example, for 7 iterations, P1 took a huge number of cycles because of the input design. In this input design, we included channel noise, and the degree distribution is not optimized. Moreover, the P1 architecture is a simple processor structure. Up to this point, the P5 architecture takes the fewest cycles to run the LT decoder. This architecture can be further modified by generating a custom FU from the Encoder.adf and P5 architectures. The name of this FU is Encoder_Decoder. Using this FU, the final architecture, P6, is formed. Table VI shows the final results using this architecture. It takes far fewer cycles than all the other architectures. When an operation is included as a function in the input design, more cycles are needed, because the TTA compiler translates the operation into TTA instructions that use the ALU and LSU FUs. On the other hand, when the operation is included as part of a custom FU, the TCE compiler can easily generate the TTA instructions for it independently.
Table VII shows the complete picture for all architectures. After the architecture is designed, TCE generates the complete processor for the specific application input design in VHDL. These are the step-by-step procedures for generating an ASP such as the LT codec application. The P6 processor performs very well compared to the other architectures. Moreover, these architectures can be further modified by duplicating the custom FUs, adding more data buses, or changing the RFs. Finally, once the optimized processor has been generated as HDL, it can be taken through prototype board or chip design procedures to obtain real timing, area, and power reports.
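The cycle counts quoted in this section can be cross-checked with a few lines of arithmetic. The short Python sketch below uses only numbers stated in the text; the P5 cycle count is derived here from the quoted reduction and is not a figure quoted from Table V. Note also that the P6 figure was obtained with a modified input design, so the ratio mixes architectural and algorithmic improvements.

```python
# Cycle counts and times quoted in the text.
P3_CYCLES = 195_431_136
P3_TIME_MS = 1_954_311
P4_CYCLES = 184_541_996
P6_CYCLES = 4_466
P4_TO_P5_SAVING = 163_425_299  # reduction attributed to the LLR FU

# Reduction from P3 to P4 (the DEGREE FU step).
saving_p3_to_p4 = P3_CYCLES - P4_CYCLES

# Cycle count implied for P5 (derived, not quoted in the text).
p5_cycles = P4_CYCLES - P4_TO_P5_SAVING

# Ratio between reported cycles and reported simulation time for P3.
p3_cycles_per_ms = P3_CYCLES / P3_TIME_MS


def speedup(before, after):
    """Ratio of cycle counts between two architectures."""
    return before / after


if __name__ == "__main__":
    print("P3 -> P4 saving:", saving_p3_to_p4)
    print("Implied P5 cycles:", p5_cycles)
    print("P3 cycles per ms:", round(p3_cycles_per_ms, 3))
    print("P3 / P6 cycle ratio:", round(speedup(P3_CYCLES, P6_CYCLES)))
```

The P3-to-P4 difference reproduces the 10,889,140 cycles stated above, and the reported P3 time corresponds to roughly 100 cycles per millisecond.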
Conclusions
This paper consists of three parts: efficient processor selection, state-of-the-art input design selection, and generation of the processor for that input design. For designing an application-specific system, TTA is a promising processor family for achieving high-speed response. Therefore, we used a TTA processor to design the LT codec processor with the TCE tool. We designed different custom FUs for the LT encoder and decoder. However, the response of the processor does not depend solely on the processor architecture; performance also depends on the architecture of the input design. Therefore, besides designing custom processor parts, we need to design the LT codec efficiently as a reference input. We reviewed some works that were implemented on FPGA prototyping boards. From this comparison, we found that the LT codec processor generated by the TCE tool performs well in terms of cycle count and required time.
Acknowledgments
This study was supported by a research fund from Chosun University, 2016. GoangSeog Choi is the corresponding author.
