INTRODUCTION
The complexity of signal processing systems is constantly increasing. To tackle the increasing program complexity, algorithm designers have looked for new ways to express programs in a less error-prone and more portable fashion than traditional imperative languages can offer.
A major step in increasing design portability has been the introduction of the CAL dataflow language [1]. A more restricted version of the CAL language, named RVC-CAL, has been standardized and is used to specify the algorithms and functions required by the Reconfigurable Video Coding standard, which embodies the recent trends of software modularity, portability and concurrency [2].
Dataflow programs written in RVC-CAL can be compiled into implementation languages with the Open RVC-CAL Compiler (Orcc) that has several backends [3]: C, C++, LLVM assembly and VHDL [4], which enable the algorithms specified in RVC-CAL to be executed on instruction processors or dedicated integrated circuits.
This paper presents a design flow that allows a designer of signal processing systems to write a program in the RVC-CAL dataflow language and automatically generate a multiprocessor implementation out of it. The full automation of the process greatly reduces the risk of introducing errors to the complex multiprocessor design.
In the proposed approach, the designer has the responsibility of describing the program in the RVC-CAL language, and must also provide the processors that the system uses. The processor design is based on the TCE toolset [5], which has a graphical user interface and allows the designer to assemble processors without any HDL design skills. The design flow presented in this paper creates a multiprocessor system as a final output and enables direct synthesis on FPGA boards. The functionality of the design flow is demonstrated with an MPEG-4 Simple Profile (SP) video decoder that is written in RVC-CAL. All the parts of the design flow are available as open source.
BACKGROUND
Before going into the details of our design flow, we explain the RVC-CAL language and the processor technology used.
The RVC-CAL Language
RVC-CAL is a dataflow language. Dataflow languages are popular in signal processing system design, as they allow the designer to abstract the signal processing system to logical entities that interact with each other. It is up to the designer to decide how to partition the signal processing system into different dataflow entities, which are called actors.
A dataflow actor reads data from its inputs, performs some data processing and finally outputs the results. In RVC-CAL, actors communicate with each other over FIFO buffers. The set of actors interconnected with FIFOs is called an RVC-CAL network. The data is wrapped inside tokens and each FIFO carries tokens of a specified size. Internally, actors work like finite state machines (FSMs) that contain states, state transitions and internal variables.
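The actor model described above can be illustrated with a minimal C sketch. It is not Orcc output: the names `fifo_t`, `actor_t` and `actor_fire`, the FIFO capacity, and the toy two-state behavior (pass one token unchanged, negate the next) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_CAPACITY 16  /* assumed capacity, chosen for the example */

/* Hypothetical software FIFO carrying int-sized tokens. */
typedef struct {
    int tokens[FIFO_CAPACITY];
    size_t head;   /* index of the oldest token */
    size_t count;  /* number of tokens currently stored */
} fifo_t;

static void fifo_push(fifo_t *f, int token) {
    assert(f->count < FIFO_CAPACITY);
    f->tokens[(f->head + f->count) % FIFO_CAPACITY] = token;
    f->count++;
}

static int fifo_pop(fifo_t *f) {
    assert(f->count > 0);
    int token = f->tokens[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return token;
}

/* The actor is a two-state FSM with one internal variable. */
typedef enum { STATE_PASS, STATE_NEGATE } actor_state_t;

typedef struct {
    actor_state_t state;  /* current FSM state */
    int firings;          /* internal variable: completed firings */
} actor_t;

/* Fire once if an input token and output space are available.
 * Returns true if the actor fired, false if it had to wait. */
static bool actor_fire(actor_t *a, fifo_t *in, fifo_t *out) {
    if (in->count == 0 || out->count == FIFO_CAPACITY)
        return false;
    int token = fifo_pop(in);
    if (a->state == STATE_PASS) {
        fifo_push(out, token);
        a->state = STATE_NEGATE;   /* state transition */
    } else {
        fifo_push(out, -token);
        a->state = STATE_PASS;     /* state transition */
    }
    a->firings++;
    return true;
}
```

A network would be formed by passing the output FIFO of one actor as the input FIFO of the next and calling each actor's fire function repeatedly.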
Conditional execution is the most important feature of RVC-CAL when it is compared to traditional dataflow languages: for example, Synchronous Data Flow (SDF) [6] does not allow conditional execution.
Transport Triggered Architecture Processors
Transport Triggered Architecture (TTA) processors resemble Very Long Instruction Word (VLIW) processors in the sense that they fetch and execute multiple instructions each clock cycle. A major difference, however, is that TTA processors have only one instruction: move, which simply transfers data from an input location to an output. For example, one move instruction can initiate a data transfer from the output of an add function unit (FU) to one of the inputs of a mul function unit.
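The move-only programming model can be made concrete with a toy interpreter in C. This is a sketch under simplifying assumptions, not a model of any TCE processor: the port names are hypothetical, there is a single transport bus (one move at a time), and every operation completes immediately. It relies on the general TTA convention that writing to a function unit's trigger port starts the operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ports of a toy machine with one adder and one multiplier. */
typedef enum {
    ADD_OPERAND, ADD_TRIGGER, ADD_RESULT,
    MUL_OPERAND, MUL_TRIGGER, MUL_RESULT,
    PORT_COUNT
} port_t;

/* One data transport. */
typedef struct {
    int from_immediate;  /* nonzero: `source` is a constant, otherwise a port_t */
    int source;
    port_t destination;
} move_t;

/* Perform one move; a write to a trigger port starts that unit's operation. */
static void execute_move(int ports[PORT_COUNT], move_t m) {
    int value;
    if (m.from_immediate) {
        value = m.source;
    } else {
        assert(m.source >= 0 && m.source < PORT_COUNT);
        value = ports[m.source];
    }
    ports[m.destination] = value;
    if (m.destination == ADD_TRIGGER)
        ports[ADD_RESULT] = ports[ADD_OPERAND] + value;
    else if (m.destination == MUL_TRIGGER)
        ports[MUL_RESULT] = ports[MUL_OPERAND] * value;
}

/* Run a sequence of moves and return the multiplier's result port. */
static int run_program(const move_t *program, size_t length) {
    int ports[PORT_COUNT] = {0};
    for (size_t i = 0; i < length; i++)
        execute_move(ports, program[i]);
    return ports[MUL_RESULT];
}
```

With this model, (2 + 3) * 4 takes four moves: 2 to `ADD_OPERAND`, 3 to `ADD_TRIGGER`, 4 to `MUL_OPERAND`, and finally `ADD_RESULT` to `MUL_TRIGGER`. The last move is the add-to-mul transfer mentioned in the text, and the sum never passes through a register file.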
In [7] it is stated that direct programming of the data transports reduces the register file traffic when compared to VLIWs, but on the other hand makes the compiler design quite challenging, as it is the compiler that schedules the data transports and makes sure that conflicts are avoided. As the compiler makes so many decisions at compile time, the runtime system is simplified and hence there are savings on the processor gate count and energy consumption.
The design of custom TTA processors has become easy and accessible to everyone through the open source TTA Codesign Environment (TCE) toolset [5, 8]. The TCE toolset offers a graphical user interface for custom processor, function unit and instruction design. The TCE toolset has a compiler which is based on LLVM and contains a processor simulator and profiler. TCE can also turn the processors into VHDL files and memory images, which enables easy FPGA synthesis.
Related Work
There has been prior work similar to that presented in this paper. Park, Oh and Ha [9] list several multiprocessor system design methods that use various dataflow models as input. Compared to our work, none of the design methods listed in [9] support the RVC-CAL language or target TTA processors.
In [10], a design flow is described for synthesizing heterogeneous multiprocessor systems out of CAL programs. The difference from our work is that the methodology in [10] does not target a specific platform, but remains on a more abstract level. In contrast, our work targets heterogeneous TTA processor networks and presents all necessary tools down to the level of FPGA synthesis.
PROPOSED SOLUTION
We propose a design flow that enables a signal processing system designer to write a program in the RVC-CAL language, and automatically produce an FPGA-ready multiprocessor system out of it. The full automation of the process greatly reduces the risk of introducing errors to the design. The design flow requires three different inputs from the designer: 1) the actors, 2) the actor interconnection network, and 3) the processors. The actors are source code files written in the RVC-CAL language. An example of an actor is Inverse Discrete Cosine Transform (IDCT) that reads 8 tokens and produces 8 tokens of data. The actor interconnection network is an XML file that describes the connections between actors. Finally, the user needs to provide a processor specification for each RVC-CAL actor in the system.
Having a dedicated processor for each actor enables customizing the processor to the requirements of that actor. The scalability of TTAs enables creating both high-performance and low-resource processors.
The actor files (written in RVC-CAL) and the actor network description file (written in XML) are processed by the Orcc compiler. For our purpose, Orcc produces a C language file out of each RVC-CAL actor, as well as a network description file that is in a special format required by our design flow. This part of the design flow is depicted in Figure 1.
The processors are designed with the TCE (TTA Codesign Environment) toolset. The designer can create the processors with a graphical user interface without writing any hardware descriptions by hand. For each processor, the TCE toolset produces an Architecture Definition File (ADF) and a set of VHDL files. This part of the design flow is depicted in Figure 2. The ADF is used by the TCE compiler to compile the C code of each actor into TTA machine language and produce memory images for the FPGA implementation.
To enable the design flow presented in this paper, some software tools and components had to be designed for Orcc and TCE. For inter-processor communication, special TTA FIFO function units (see Figure 2) had to be designed to enable the processors to communicate over hardware FIFOs. For Orcc, a special TTA backend had to be written (see Figure 1). The TTA backend produces C code that is specifically intended for TTAs that contain FIFO function units. The network description produced by the TTA backend is processed by the TTANetGen tool (Figure 3), which generates the interconnect between the TTA processors. These new components are available as open source (https://sourceforge.net/projects/efsmsched/) and are shown in gray in the figures. Next, these components and tools are described in detail.
The TTA FIFO Function Units
The special TTA function units (FUs) for accessing external hardware FIFO memories were designed to enable interprocessor communication with minimal overhead. The "FIFO read" TTA FU implements three instructions: status (returns number of tokens in the FIFO), read (reads a token) and peek (shows the value of the next token in the FIFO). The "FIFO write" TTA function unit, on the other hand, has just status and write instructions.
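The behavior of the two function units can be sketched as a software model in C. This is only an illustration of the instruction semantics listed above, not the C++ simulator model or the VHDL implementation: the type `hwfifo_t`, the function names and the FIFO depth are hypothetical, and the status instruction of the write unit is assumed to report the same token count as that of the read unit.

```c
#include <assert.h>
#include <stddef.h>

#define HWFIFO_DEPTH 8  /* assumed depth; not taken from the paper */

/* Hypothetical model of one hardware FIFO shared by a writer and a reader. */
typedef struct {
    int slots[HWFIFO_DEPTH];
    size_t read_pos;  /* slot holding the oldest token */
    size_t used;      /* number of tokens currently stored */
} hwfifo_t;

/* "FIFO read" unit: status, peek, read. */
static size_t fifo_read_status(const hwfifo_t *f) {
    return f->used;
}

/* Peek returns the next token without consuming it. */
static int fifo_read_peek(const hwfifo_t *f) {
    assert(f->used > 0);  /* caller is expected to check status first */
    return f->slots[f->read_pos];
}

/* Read returns the next token and removes it from the FIFO. */
static int fifo_read_read(hwfifo_t *f) {
    int token = fifo_read_peek(f);
    f->read_pos = (f->read_pos + 1) % HWFIFO_DEPTH;
    f->used--;
    return token;
}

/* "FIFO write" unit: status, write. */
static size_t fifo_write_status(const hwfifo_t *f) {
    return f->used;  /* assumption: token count, as for the read unit */
}

static void fifo_write_write(hwfifo_t *f, int token) {
    assert(f->used < HWFIFO_DEPTH);  /* caller is expected to check status first */
    f->slots[(f->read_pos + f->used) % HWFIFO_DEPTH] = token;
    f->used++;
}
```

The difference between peek and read matters for conditional execution: an actor can inspect the next token with peek to decide which action to take, and only the chosen action consumes the token with read.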
The instructions were implemented in C++ for the processor simulator proxim and in VHDL for FPGA implementations. The latencies of these new function units range between 1 and 3 clock cycles depending on the instruction.
The TTA Backend for Orcc
The backends of the Orcc compiler are easy to customize, as they are specified with StringTemplate (http://www.stringtemplate.org/) files. To make each actor directly multiprocessing-capable, the Orcc C language backend was modified such that it produces actors that access FIFOs with the TTA special instructions that were introduced in Subsection 3.1.
Each FIFO connection of an actor is given an index number which directly invokes a TTA function unit that is connected to the respective hardware FIFO. Thus, if a processor executes actor A, and actor A has 5 input ports, the processor
It is also worth mentioning that using C code as an intermediate language between Orcc and TCE is not mandatory. Orcc is in fact capable of directly producing LLVM assembly [11], which is already an intermediate language of TCE. Using LLVM assembly as a bridge between Orcc and TCE is a natural direction of future work.
Generating the Interconnect Between Processors
In general, the complexity of actors in an RVC-CAL network can vary considerably from actor to actor. As each RVC-CAL actor is running on a separate TTA processor, it is not sensible to reserve the same amount of instruction memory, data memory or computational resources (such as multipliers) for each processor.
After compiling the C code of each actor, the TCE toolset produces instruction and data memory images for each processor. The sizes of these memory images are analyzed by our TTANetGen software, which generates data and instruction memories of correct size. TTANetGen also generates a VHDL envelope for each processor, which encapsulates the processor with its private memories as a single entity.
With the network description originating from Orcc, TTANetGen generates the top-level VHDL file for the whole system. In the top-level VHDL, each processor and each hardware FIFO exists as a separate entity. These entities are automatically interconnected by TTANetGen. This part of the design flow is depicted in Figure 3.
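One way to derive a memory size from a memory image is to compute the narrowest address bus that covers the image, as in the C sketch below. How TTANetGen actually sizes the memories is not described here, so the rounding to a power of two and the function names `address_bits` and `rounded_depth` are assumptions for illustration only.

```c
#include <assert.h>

/* Smallest number of address bits that can index `words` memory words. */
static unsigned address_bits(unsigned words) {
    assert(words > 0);
    unsigned bits = 0;
    unsigned long long capacity = 1;  /* words addressable with `bits` bits */
    while (capacity < words) {
        capacity <<= 1;
        bits++;
    }
    return bits;
}

/* Memory depth after rounding the image size up to a power of two. */
static unsigned long long rounded_depth(unsigned words) {
    return 1ULL << address_bits(words);
}
```

Under these assumptions, an image of 1000 words would need 10 address bits and a memory of 1024 words, whereas an image of 1025 words would need 11 bits.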
DESIGN FLOW
In this section we describe the use of the design flow that is presented in this paper. We assume that in the very beginning the user has only an abstract idea of the program that he/she wishes to implement with the design flow. The very first task the user needs to do is to split his/her program into logical entities. The division of the application into entities is a very important part, because it will later on affect the maximum attainable performance, as well as the resource consumption on the target device.
After partitioning the abstract program, the user needs to write the functionality of the individual entities as RVC-CAL actors, and draw data connections between the actors wherever they need to communicate. For this, the Orcc compiler provides a text editor that assists in writing RVC-CAL. Finally, the designer can use the Orcc compiler to produce the C code implementations and interconnection information of the actor network. This phase was depicted in Figure 1.
As it is quite hard to predict the computational resource requirements of individual actors, it is recommended that the user designs one low-performance processor that he/she uses as the initial execution entity for each actor. This processor must be designed such that it meets the minimum needs of every actor, but does not offer superfluous resources.
Next, the C code implementations of the actors together with the processor description (ADF) are provided to the TCE compiler, which produces an executable binary file for each actor. Having the actor binaries, the designer can now use proxim to measure the performance of the actors on the initial processor. If necessary, the user can create more powerful custom processors for actors that need more performance. One of the most important ways to improve TTA processor performance is adding more transport buses. Each transport bus in the processor can perform one data move each clock cycle. As the processor network generally forms a sort of a pipeline, it is desirable that all processors have an identical latency in the final implementation.
After the designer has customized the processors and reached the performance requirements, he/she can use the TCE processor generator to produce VHDL implementations of each processor. The instruction and data memory images that are needed by the FPGA implementation are produced by a special tool that belongs to TCE. The VHDL description of the interconnect between processors, on the other hand, is produced by our TTANetGen.
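The reason for balancing the processors can be shown with a small C sketch: in a pipeline, the slowest stage bounds the throughput of the whole network, so that stage is the first candidate for additional transport buses. The function names and the idea of feeding in per-actor cycle counts from the simulator are assumptions for illustration, and the model ignores FIFO stalls and branching network topologies.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the stage that needs the most clock cycles per token. */
static size_t slowest_stage(const unsigned long *cycles_per_token, size_t stages) {
    assert(stages > 0);
    size_t slowest = 0;
    for (size_t i = 1; i < stages; i++) {
        if (cycles_per_token[i] > cycles_per_token[slowest])
            slowest = i;
    }
    return slowest;
}

/* Steady-state throughput of the pipeline, bounded by its slowest stage. */
static double pipeline_tokens_per_second(const unsigned long *cycles_per_token,
                                         size_t stages, double clock_hz) {
    unsigned long worst = cycles_per_token[slowest_stage(cycles_per_token, stages)];
    assert(worst > 0);
    return clock_hz / (double)worst;
}
```

For hypothetical cycle counts of 1000, 4000 and 2500 per token, the second stage limits the pipeline. Speeding up the other two stages would not raise the throughput, which is why resources are best spent on the slowest processor.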
By default, the design flow offers a special source actor that can be used to input processing data to the network from a dedicated on-chip memory. If the user wishes to use another kind of a data source, he/she must design it at this stage. Likewise, the default data output is the display, which enables directing the computation results to FPGA general output pins (such as LEDs). For the actual synthesis, the user can use his/her favorite design environment. However, the interfaces to on-chip memories (required by FIFOs and processor memories) have been designed with only Altera devices in mind.
EXPERIMENTS
To demonstrate the functionality of our design flow, we took the RVC-CAL program describing an MPEG-4 Simple Profile decoder and synthesized it using our design flow as a multi-TTA processor network. Figure 4 shows the automatically generated network that consists of 21 small processors. The synthesis results were transferred to the Altera Cyclone IV EP4CE115F29C7 FPGA. We used a video stream of QCIF resolution as test data to verify the correct operation of the network. The decoding result was compared against a checksum that had been computed for each frame on a workstation beforehand.
All of the memories and video bitstream fit on the FPGA on-chip resources (EP4CE115 has 432 kB of on-chip RAM), when the maximum video resolution was limited to QCIF. Higher video resolutions would exceed the available on-chip memory resources and would require resorting to off-chip memory, which was not done at this stage. Table 1 shows the memory usage of each processor in the MPEG-4 SP decoder. These numbers are provided by the memory image generator that belongs to the TCE toolset, and are used to automatically generate correct-sized memories.
Most of the processor instances in the network are identical. As the tools that form the basis of the tool chain, TCE and Orcc, allow creating heterogeneous processors, this was done for some actors that required more than an average amount of resources. The processors that required more performance can be seen in Table 1: they have more than 2 transport buses. In contrast, Figure 5 shows a tiny TTA processor that was used to execute the code of the serialize actor. The performance of the resulting implementation was 20 frames per second on a 50 MHz FPGA at QCIF video resolution. Table 3 shows the number of clock cycles consumed by intra and predicted macroblocks, which was well-balanced thanks to the possibility of tuning the performance of each processor. We found that the system's performance is currently limited by the backend that generates code for TTAs. The TTA processors could compute much faster, but the automatically generated code still had some unnecessary computations at this stage. Improving this is a clear direction for future work.
Regarding the FPGA resource usage (see Table 2), the example application fits well on the FPGA board. The TTA feature of instruction word compression helped reduce the need for scarce on-chip memory resources. Through instruction word compression, it was possible to save on the usage of memory blocks at the expense of logic element usage. The FPGA board limited the clock frequency to 50 MHz, whereas TTA processors allow much higher clock frequencies.
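The reported frame rate can be translated into an average cycle budget per macroblock with a short calculation, sketched here in C. The figures of 20 frames per second and 50 MHz come from the text; the QCIF size of 176 x 144 pixels and the 16 x 16 macroblock size are standard values. The result is a derived average that assumes steady throughput, not a measured number.

```c
#include <assert.h>

enum { QCIF_WIDTH = 176, QCIF_HEIGHT = 144, MACROBLOCK_SIZE = 16 };

/* Number of 16 x 16 macroblocks in one frame. */
static unsigned macroblocks_per_frame(unsigned width, unsigned height) {
    return (width / MACROBLOCK_SIZE) * (height / MACROBLOCK_SIZE);
}

/* Average clock cycles available per macroblock (rounded down). */
static unsigned long cycles_per_macroblock(unsigned long clock_hz,
                                           unsigned frames_per_second,
                                           unsigned macroblocks) {
    assert(frames_per_second > 0 && macroblocks > 0);
    return clock_hz / ((unsigned long)frames_per_second * macroblocks);
}
```

A QCIF frame contains 11 x 9 = 99 macroblocks, so 20 frames per second at 50 MHz corresponds to roughly 25,000 clock cycles per macroblock for the slowest processor in the pipeline.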
CONCLUSION
In this paper we have presented a design flow that enables automatic synthesis of RVC-CAL actor networks to application-specific transport-triggered processor networks. The functionality of the toolchain has been demonstrated by applying it to an RVC-CAL network that defines an MPEG-4 SP video decoder. The automated process realized the RVC-CAL dataflow network as 21 tiny, heterogeneous processors.
Future work includes the possibility of replacing some processor entities in the multiprocessor network with accelerator blocks that have been generated by a hardware code generator [4]. Also, the code generation for TTA processors needs to be improved to bring the performance of automatically generated C code closer to that of handwritten code. Finally, we are going to investigate the possibility of running several RVC-CAL actors on one processor.
ACKNOWLEDGEMENTS
The authors would like to thank Pekka Jääskeläinen and Otto Esko for technical help related to TTAs. This research was 
