A solution paradigm that has emerged in the embedded-systems market over the past few years is the programmable system on a chip (SoC), an integrated design that incorporates programmable cores, custom or semicustom blocks, and memories in a single chip. This paradigm allows the reuse of predesigned intellectual-property (IP) cores, thus amortizing a core's design cost over many system generations.
We have designed a new architecture that simplifies integration of heterogeneous IP for multimedia and streaming applications. The Multilevel Computing Architecture (MLCA) is a template architecture featuring multiple processing units and a top-level controller that follows well-developed superscalar principles to automatically exploit parallelism among coarse-grained units of computation or tasks. Like the sequential-programming model, the MLCA's programming model does not require programmers to specify synchronization and data communication. We also developed a set of code transformations for porting programs to the MLCA and improving their performance. These transformations are based on standard compiler analyses and hence can be incorporated into compilers for the MLCA.
Architectural overview
The MLCA is a two-level hierarchical architecture. The lower level consists of multiple processing units (PUs). A PU can be a full-fledged processor core (superscalar or very-long-instruction-word, for example), a digital signal processor (DSP), a field-programmable gate array (FPGA) block, or custom hardware. The upper level consists of a control processor (CP), a task dispatcher (TD), and a universal register file (URF). A dedicated interconnect links the PUs to the URF and to memory. In superscalar terms, the URF is similar to the general-purpose register file (GPR), and a PU corresponds to an execution unit (XU). (The "Related work" sidebar describes previous research underlying our work on the MLCA.)
Related work

The MLCA's tasks have similarities to the user-defined functions of GLU (Granular Lucid), a coarse-grained dataflow language. 1 GLU's combination of tasks in a sequential language (C) and a top-level dataflow exhibits very good parallel performance, comparable to lower-level explicit parallel programming systems.

The MLCA uses superscalar technology, which Tomasulo pioneered in 1967. 2 Hennessy and Patterson and Smith and Sohi offer excellent overviews of modern superscalar architectures. 3, 4 Verians et al. and Quisquater, Verians, and Legat extended superscalar principles to tasks instead of instructions and explored the associated scheduling. 5, 6 The core of their work is an efficient implementation of the instruction queue, which checks on operands' availability to perform out-of-order dispatching. However, they rely on a shared memory instead of a shared register file for intertask communication.

The following three SoC systems use multiple processing units for multimedia and other applications. However, they differ from the MLCA in their programming models. Ackland et al. developed a scalable DSP architecture that features a split-transaction bus for communication and cached semaphores for synchronization. 7 The programming model is also layered, separating tasks from the top-level control flow. 8 The dynamic scheduler is implemented in a runtime kernel and is configurable. The picoChip (http://www.picochip.com) is a cascadable, reconfigurable array processor architecture intended for third-generation wireless communications. In contrast to the MLCA, which has a quasisequential programming model, the picoChip uses VHDL programming. Cradle Technologies' 3SOC (http://www.cradle.com) is a shared-address-space multiprocessor SoC. It consists of processor clusters connected by two levels of buses. For synchronization, the system provides 32 semaphore registers, which must be explicitly used in a parallel program. In contrast, the MLCA automates task synchronization through the URF, providing a programming model closer to sequential programming.

We based the code transformations that improve program performance for the MLCA on several well-known compiler analyses and optimizations, including dataflow analysis, 9 array privatization, 10 code hoisting, and loop unrolling.

MAY-JUNE 2004

The MLCA's novelty is that its upper level supports out-of-order, speculative, and superscalar task execution. It uses the same techniques as today's superscalar processors, such as register renaming and out-of-order execution, to exploit parallelism among instructions. It leverages existing superscalar technology to exploit task-level parallelism across PUs, as well as possible instruction-level parallelism within a PU. The CP fetches and decodes task instructions, each of which specifies a task to execute. A task instruction also specifies the task's inputs and outputs as registers in the URF.
The CP detects dependencies among task instructions in the same way that a superscalar processor detects dependencies among instructions: using the URF's source and sink registers. The CP renames URF registers as necessary to break false dependencies among task instructions and then issues decoded task instructions to the TD unit. On the basis of dynamic dependencies, tasks can issue out of order and can also complete and commit their outputs out of order.
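The renaming step described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the `TaskInstr` class, the `rename` function, and the register names are hypothetical, the pool of physical registers is treated as unlimited, and nothing here reflects the actual CP hardware design.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class TaskInstr:
    name: str
    reads: list    # architectural URF registers read (task inputs)
    writes: list   # architectural URF registers written (task outputs)


def rename(program):
    """Map architectural URF registers to fresh physical registers.

    Every write receives a new physical register, so write-after-write and
    write-after-read hazards disappear and only true (read-after-write)
    dependencies remain. Returns the renamed program and, for each task,
    the set of earlier task indices it truly depends on.
    """
    fresh = count()
    current = {}     # architectural register -> latest physical register
    producer = {}    # physical register -> index of the task that writes it
    renamed, deps = [], []
    for idx, instr in enumerate(program):
        # Sources read the physical register holding the latest value.
        # A register never written before is treated as a live-in value.
        srcs = []
        for reg in instr.reads:
            if reg not in current:
                current[reg] = f"p{next(fresh)}"
            srcs.append(current[reg])
        deps.append({producer[p] for p in srcs if p in producer})
        # Each destination gets a brand-new physical register.
        dsts = []
        for reg in instr.writes:
            current[reg] = f"p{next(fresh)}"
            producer[current[reg]] = idx
            dsts.append(current[reg])
        renamed.append(TaskInstr(instr.name, srcs, dsts))
    return renamed, deps
```

For a hypothetical four-task program in which the first and third tasks both write the same register and the second and fourth tasks read it, `rename` reports that the second task depends only on the first and the fourth only on the third, so the two pairs are free to run in parallel.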
The MLCA enqueues task instructions in the TD unit, just as a superscalar PU enqueues instructions in the instruction queue. When a task instruction's operands are ready, the TD dispatches the task instruction to the PUs according to a scheduling strategy. The simplest strategy dispatches instructions to PUs in a round-robin fashion, but more dynamic strategies are also possible.
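As a rough illustration of the round-robin strategy, the following Python sketch simulates a dispatcher that starts each task once its producers have finished. The function name, the unit of time, and the tie-breaking rule are assumptions made for this example, and the model ignores issue bandwidth and register limits.

```python
def dispatch_round_robin(deps, durations, num_pus):
    """Simulate task dispatch to PUs in round-robin order.

    deps[i] is the set of task indices that task i truly depends on (all
    smaller than i); durations[i] is its run time in arbitrary cycles.
    Returns (start, finish, pu_of) lists indexed by task.
    """
    n = len(deps)
    start, finish, pu_of = [None] * n, [None] * n, [None] * n
    pu_free = [0] * num_pus      # time at which each PU becomes idle
    pending = set(range(n))
    next_pu = 0
    while pending:
        # Candidates: tasks whose producers all have a known finish time.
        ready_at = {
            t: max((finish[d] for d in deps[t]), default=0)
            for t in pending
            if all(finish[d] is not None for d in deps[t])
        }
        # Dispatch the task whose operands are ready first;
        # ties are broken in program order.
        task = min(ready_at, key=lambda t: (ready_at[t], t))
        pu = next_pu
        next_pu = (next_pu + 1) % num_pus
        start[task] = max(ready_at[task], pu_free[pu])
        finish[task] = start[task] + durations[task]
        pu_free[pu] = finish[task]
        pu_of[task] = pu
        pending.remove(task)
    return start, finish, pu_of
```

Because strict round-robin ignores which PU is idle, it can queue a task behind a busy PU while another PU sits free; a more dynamic strategy would pick the PU that becomes available first.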
The MLCA is a template architecture. Thus, it doesn't specify the form of interconnect among the PUs. Several implementations are possible, including buses, crossbars, and multistage interconnects. (For certain configurations, we have developed a dedicated multistage interconnect called Octagon. 1) In addition, because the CP enforces intertask dependencies, and because the PUs communicate data primarily through the URF, there is no need to assume a particular memory architecture. Possible implementations include the PUs sharing a single memory, each having its own private memory, or any combination of the two, depending on the application.
Programming model
The MLCA's hardware features give rise to a natural programming model very similar to sequential programming. The MLCA programming model is layered. The bottom layer consists of task bodies, or simply tasks. Each task implements a particular functionality with defined inputs and outputs. A task can be a sequential C program, a block of assembly code executing on a programmable PU such as a processor or DSP core, or the predefined functionality of a nonprogrammable PU such as a hardware block.
The model's top layer is a sequential task program that executes on the CP. It specifies task instructions, using a C-like language called Sarek. This language replaces function calls with task calls and adds explicit direction indications for function arguments.

Figure 2 shows the Add task expressed as a C function that computes the sum of two integers. The function has no formal arguments. Instead, it communicates with the Sarek program through an API, obtaining input data with a readArg call and writing results using an analogous writeArg call. For example, readArg(1) reads the function's second input, and writeArg(0) writes the function's first output. The task also returns a condition code that is written to a condition register in the CP. A Sarek program can use this condition code to make control decisions.

Figure 3 shows the main part of the corresponding Sarek program for the example task. It makes four calls to the Add task. Each call specifies the variable names of each task's inputs and outputs, as well as a direction indicator (in or out) for each variable. The second instance of Add must wait for the first instance to complete because of the true dependence caused by the variable totwidth. However, the third and fourth instances of Add can proceed out of order with the first two, even though they also write to and read from totwidth, because they have no dependency on the conditional call to Div. The hardware automatically renames the register holding totwidth for these two tasks to eliminate false output (write-after-write) and antidependencies (write-after-read).

MULTILEVEL COMPUTING ARCHITECTURE
IEEE MICRO
Sarek has only two data types: control variables and data variables. Control variables store the return values of task calls and determine the control flow in conditionals and loops. Sarek allows bitwise logical expressions on control variables. Data variables provide input and output arguments for task calls, as the example in Figure 3 illustrates.
Semantically, data variables that are output from tasks in a PU are available when the task writes them using the writeArg call. In contrast, control variables are available to the upper layer only when a task has completed execution. Consequently, the first conditional if (notzero) … in Figure 3 can be evaluated only after the preceding task Add has completed, even though the input argument totwidth for the following task Div is available earlier.
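The difference in availability between data and control variables can be modeled in a few lines of Python. This is a toy stand-in, not the real task API: the article's tasks are C functions that call readArg and writeArg, whereas `TaskContext`, `add_task`, `run_task`, and the meaning of the condition code (1 for a nonzero sum) are assumptions made for illustration.

```python
class TaskContext:
    """Toy stand-in for the task-side API (readArg/writeArg in the article)."""

    def __init__(self, inputs):
        self.inputs = list(inputs)
        self.outputs = {}   # output index -> value, visible as soon as written
        self.log = []       # order in which results become visible

    def read_arg(self, index):
        return self.inputs[index]

    def write_arg(self, index, value):
        self.outputs[index] = value
        self.log.append(("data", index, value))


def add_task(ctx):
    """Add two integers and return a condition code (1 if the sum is nonzero)."""
    total = ctx.read_arg(0) + ctx.read_arg(1)
    ctx.write_arg(0, total)            # the data output is published here
    return 1 if total != 0 else 0      # the condition code exists only on return


def run_task(task, inputs):
    """Run a task and record when its data and control results appear."""
    ctx = TaskContext(inputs)
    cond = task(ctx)
    ctx.log.append(("control", cond))  # control value only after completion
    return ctx.outputs, cond, ctx.log
```

In the returned log, the data entry always precedes the control entry, which mirrors why a conditional on a control variable must wait for the whole task while a consumer of the data output could in principle start earlier.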
Sarek generates HyperAssembler (HASM), an intermediate program representation similar to assembly. Figure 4 shows an HASM code fragment corresponding to the Sarek code in Figure 3. This code stores control variables in control registers (CRi), and stores data variables in universal registers (Ri). For universal registers, the program also specifies register usage as :r for inputs and :w for outputs, which the hardware uses for dependency analysis.
Benefits
The MLCA's combination of architecture and programming model offers important advantages for SoC designs:
• Reduced software complexity. The programming model alleviates the need for explicit parallel programming, reducing software complexity. It also separates synchronization and communications from computations, further reducing software complexity.
• Automatic extraction of parallelism. Speedup can be achieved through register renaming and out-of-order task execution. For dependencies such as write-after-write and write-after-read, the CP allocates new registers, letting the tasks run in parallel on separate PUs.
• Scheduling policy independent from source code. Our layered approach fits in the model advocated by Paul and Thomas, in which each layer (application, schedulers, and resources) can be tuned independently to attain an optimal cost-performance ratio. 
Two applications
We present two case studies of realistic multimedia applications that we ported to the MLCA. For each case, we describe the application and the code transformations necessary to obtain good performance. These transformations build on well-known compiler analyses and optimizations, including dataflow analysis, 3 array privatization, 4 code hoisting, 3 and loop unrolling. 3 Hence, programmers can easily incorporate them into Sarek and C compilers.
MAD
MAD (http://www.underbit.com/products/mad/) is an MPEG audio decoder that translates MPEG files into 16-bit pulse-code modulation (PCM) output. We use a stripped-down version of the code, which does not include multithreading but retains the original functionalities and code structure.
The input to MAD is a byte stream that represents a sequence of audio frames. Each frame consists of a frame header and frame data. The frame header contains configuration information such as audio layer type, channel mode, sampling frequency, stream bit rate, and location of the frame's main data in the input stream. Because frame size can vary, a frame header also contains the size of its corresponding frame.
MAD's main data structure is a C structure called mad_decoder, which contains global variables and also three other C structures: mad_stream, mad_frame, and mad_synth. The mad_stream structure stores the input stream's start and end addresses in memory, a pointer to the start of the current frame being decoded, a pointer to the next frame to be decoded, and buffers for decoding a frame. The mad_frame and mad_synth structures hold buffers for a frame's decoded output and PCM output, respectively. Thus, most of the pointers and buffers within mad_stream, mad_frame, and mad_synth are reused for each frame's decoding.

Figure 5 shows the main steps of the MAD program. The first step allocates and initializes the various data structures the program will use. The next step maps the file containing the input stream to memory. Then the frames are decoded one at a time until the program reaches end of file (eof). For each frame, the decoded output is copied to mad_frame. The PCM output is synthesized and placed in the mad_synth structure. The structure is sent to either a file or standard output. The final steps unmap the input file from memory and deallocate the various structures.
The MAD program performs frame decoding as follows: First, the program reads the frame's header and determines the current frame's length. Then, it parses frame data from the input stream. Data for successive frames does not synchronize with respective frame headers, and it also overlaps. Thus, the data header can point to data preceding it. The program copies this data into main_data, a buffer in the mad_stream structure, before starting decoding. It copies and decodes the data in three stages: The load stage copies the data in the current frame from the input stream into main_data. The decode stage decodes the data in main_data. The preload stage moves the data associated with the next header from one part of main_data to another, in preparation for the next frame's processing.
The first step in porting MAD to the MLCA is to determine the tasks and a corresponding Sarek program. This step is relatively simple because MAD's main function consists mostly of calls to top-level functions, which become tasks. Figure 6 shows the initial Sarek program.
The Init task corresponds to the function that creates and initializes all data structures contained in mad_decoder. A pointer to this structure appears as a parameter to each function in the MAD program and thus as a parameter to each task in the Sarek program. Because each task reads and writes variables to and from this structure, the pointer is designated as both input and output to each task.
Using a pointer to mad_decoder as input and output arguments to each task causes two problems. The first is that using a single pointer to reflect dependencies among tasks introduces false dependencies. These dependencies are false because tasks use different parts of mad_decoder and thus can execute in parallel. Using one pointer as both input and output causes the tasks to serialize, thus eliminating parallelism.
The second, more serious problem stems from using pointers as parameters to tasks. For a task to write to a memory location in a buffer or a structure, the pointer to this memory location must be an input argument to the task, even if the task is not reading the memory location. This pointer creates an unnecessary true dependence. As a result, every task using the pointer must receive the original copy of the pointer value, making hardware renaming inapplicable. Furthermore, even if we remove the pointer as an input parameter, the hardware will rename the value of the pointer in the corresponding register, not the data it points to, which is in memory.
To overcome these problems, we apply the following code transformations to both the Sarek program and tasks.
Task parameter deaggregation. This transformation exposes the elements of structures in the task parameter list, thus eliminating false dependencies and allowing the hardware to rename these parameters when appropriate. We perform the transformation by recursively replacing pointers to structures with their components, until all task parameters are of primitive types (such as int, float, and int *).
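The recursive expansion can be sketched in Python as follows. The `deaggregate` function and the structure layouts used with it are hypothetical; in particular, the field lists below are invented for illustration and are not the real libmad definitions.

```python
def deaggregate(name, ctype, structs):
    """Recursively expand a structure-typed parameter into primitive fields.

    structs maps a struct name to an ordered list of (field, type) pairs.
    A pointer to a struct ("T *" with T in structs) is expanded the same
    way as T itself; pointers to primitives (such as "int *") are kept.
    Self-referential structures are not handled by this sketch.
    """
    base = ctype[:-1].strip() if ctype.endswith("*") else ctype
    if base in structs:
        params = []
        for field, ftype in structs[base]:
            params.extend(deaggregate(f"{name}.{field}", ftype, structs))
        return params
    return [(name, ctype)]


# Hypothetical layout, loosely inspired by the structures named in the text.
EXAMPLE_STRUCTS = {
    "mad_decoder": [("stream", "mad_stream"), ("options", "int")],
    "mad_stream": [("this_frame", "int"), ("next_frame", "int"), ("ptr", "int *")],
}

EXAMPLE_PARAMS = deaggregate("decoder", "mad_decoder *", EXAMPLE_STRUCTS)
```

With this layout, the single `mad_decoder *` parameter expands into four primitive parameters, three from the nested stream structure and one from the top level, each of which can then receive its own direction indicator.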
For each parameter, we must determine the direction of access (in, out, or both). Using standard dataflow analysis techniques, we compute upward exposed uses and downward exposed writes for each parameter in a task body. If there is an upward exposed read of a parameter, we designate the parameter as input; if there is a downward exposed write, we designate the parameter as output.

Figure 7 shows the Header_Decode task header before and after transformation. In Figure 7b, two (of many) elements of mad_decoder now appear as task parameters. The first two parameters, this_frame and next_frame, are integers. Thus, the hardware can eliminate false task dependencies caused by the use of these variables. The third parameter, ptr, is a pointer to a buffer in memory, and as described earlier, it causes a renaming problem. We must resolve this problem through renaming during compilation, using buffer privatization.

Figure 7a (before transformation): Header_Decode (in mad_decoder, out mad_decoder);

Buffer privatization. The tasks in the while loop use the buffer called buff. Output and antidependencies prevent instances of the tasks in one iteration from executing in parallel with instances in another iteration. These dependencies result from the use of the same buffer area in memory by all iterations. These tasks write to the buffer at the beginning of each iteration and they read from it in the remainder of the iteration.

We can break the output and antidependencies by privatizing the buffer, that is, by giving each iteration a private copy of the buffer, as Figure 8b shows. We allocate a new buffer at the beginning of each iteration, using the new Init task, and deallocate it at the end of an iteration, using the new Finish task. The hardware renames parameter buff in each iteration, along with a corresponding private buffer. The addition of artificial dependencies guarantees that Finish will not start until all tasks in an iteration are complete.
For the privatization of buff to be legal, every read to a section of buff must be dominated by a write to the same section in the same iteration. This transformation is similar to array privatization, which is successful in the context of automatic loop parallelization. 5 The transformation requires buffer section analysis 4 and interprocedural alias analysis. 6 Buffer privatization is particularly effective in MAD because the program reuses buffers within all the main structures for each frame's decoding.
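A simplified version of this legality condition can be written as a Python check. The sketch assumes straight-line code within an iteration, so that program order stands in for dominance, and it assumes buffer sections are already identified by name; the `can_privatize` function and the section labels are illustrative, not part of the article's compiler.

```python
def can_privatize(iteration_accesses):
    """Check that every read of a buffer section follows a write to it.

    iteration_accesses is the ordered list of (task, kind, section) tuples
    for one loop iteration, where kind is "r" or "w". Straight-line order
    is used in place of a real dominance analysis.
    """
    written = set()
    for _task, kind, section in iteration_accesses:
        if kind == "w":
            written.add(section)
        elif section not in written:
            return False   # the read sees data from an earlier iteration
    return True


# buff: written at the start of each iteration, read afterward.
BUFF_ACCESSES = [("T1", "w", "all"), ("T2", "r", "all"), ("T3", "r", "all")]

# main_data: Load writes the second half, Decode reads both halves,
# Preload reads the second half and writes the first.
MAIN_DATA_ACCESSES = [
    ("Load", "w", "second"),
    ("Decode", "r", "first"),
    ("Decode", "r", "second"),
    ("Preload", "r", "second"),
    ("Preload", "w", "first"),
]
```

The check accepts the `buff` pattern and rejects the `main_data` pattern, because Decode reads the first half before anything in the same iteration has written it.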
Buffer replication. In some cases, not all reads to a buffer's sections are dominated by writes to the same sections in an iteration, making buffer privatization inapplicable. Nonetheless, we can introduce parallelism by overlapping task execution in one iteration with task execution in subsequent iterations. The transformation that accomplishes this is buffer replication.

Figure 9 illustrates buffer replication. The three tasks, Load, Decode, and Preload, which all use the main_data buffer, are part of MAD's Frame_Decode component. Load writes only to the second half of buff; Decode reads the entire buffer; Preload reads buff's second half and writes to its first half. Privatization is not possible because in an iteration, Decode reads buffer data written in the previous iteration by Preload. As Figure 10 shows, the dependencies among the tasks cause their execution to serialize.
To achieve parallelism, we replicate the main_data buffer and copy it into the buffer temp_data in the task called DoCopy, as Figure 9b shows. This copy goes to Decode, thus breaking the antidependence between Decode and Preload and allowing them to execute in parallel. A new copy of temp_data is allocated in every iteration of the loop, and the hardware automatically renames temp_data every iteration. Figure 10b shows task execution after buffer replication. Decode task instances now overlap in different iterations because each instance uses a different buffer.
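The effect of replication can be checked with a small sequential Python model. It does not run anything in parallel; it only shows that, once Decode works on a copy, running Preload before Decode yields the same results as the original order. The buffer size, the toy `decode` computation, and the function names are assumptions for this example.

```python
HALF = 2  # size of each half of the buffer (arbitrary for this sketch)


def load(buf, frame):
    buf[HALF:] = frame          # Load writes only the second half


def decode(buf):
    return sum(buf)             # Decode reads the entire buffer (toy computation)


def preload(buf):
    buf[:HALF] = buf[HALF:]     # Preload reads the second half, writes the first


def run_original(frames):
    buf = [0] * (2 * HALF)
    out = []
    for frame in frames:
        load(buf, frame)
        out.append(decode(buf))   # must finish before Preload overwrites buf
        preload(buf)
    return out


def run_replicated(frames):
    buf = [0] * (2 * HALF)
    out = []
    for frame in frames:
        load(buf, frame)
        temp = list(buf)          # DoCopy: a fresh temp_data in every iteration
        preload(buf)              # no longer has to wait for Decode
        out.append(decode(temp))  # Decode works on its own copy
    return out
```

Both versions produce identical output for the same frames, which is the property that lets the hardware overlap Decode with Preload and with the next iteration's tasks. The price is one extra buffer copy per iteration.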
Buffer replication requires the same set of analyses required for buffer privatization.

Some tasks can be split into multiple tasks. For example, task Frame_Decode consists of calls to three functions, Load, Decode, and Preload, which perform the actions previously described. However, these functions are not tasks. Thus, we split Frame_Decode by transforming the three functions into separate tasks that are now part of the Sarek program.
Code hoisting. A task cannot start until all its input parameters are available. Thus, it's desirable to write a task's output parameters as early as possible so that waiting tasks can proceed. By applying code hoisting to the body of a task, we move calls to writeArg to the earliest point possible.
Loop unrolling. The MLCA can support out-of-order speculative execution. Nonetheless, we apply loop unrolling to the main loop in the MAD Sarek program to increase the amount of task parallelism. The program copies the body of the main loop as many times as the number of processors used (P) and checks the loop condition every P iterations.
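The unrolled loop structure can be sketched in Python. The text does not say how the transformed program handles iterations that fall past the end of the input, so this sketch assumes they become no-ops; the `run_unrolled` function and its `decode` callback are hypothetical.

```python
def run_unrolled(frames, num_pus, decode):
    """Process frames with the loop body repeated num_pus times per pass.

    The end-of-input test runs once per group of num_pus iterations. Body
    copies that fall past the end of the input do nothing, which is an
    assumption about how leftover iterations are handled.
    Returns the results and the number of loop-condition checks.
    """
    results, checks, i = [], 0, 0
    while True:
        checks += 1
        if i >= len(frames):           # loop condition, checked once per group
            break
        for k in range(num_pus):       # stands for num_pus copies of the body
            if i + k < len(frames):
                results.append(decode(frames[i + k]))
        i += num_pus
    return results, checks
```

With 10 frames and P = 4, the loop condition is evaluated 4 times instead of 11, so the control processor has up to four independent body copies available to issue between consecutive control decisions.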
FMR
FMR is an audio application that performs FM demodulation on a 16-bit input data stream, producing a 32-bit output data stream. The input stream consists of data packets of 1,536 bytes each. The program's main steps use about 70 calls to 16 functions in a loop in the main function, making the derivation of a Sarek program straightforward. We used the same set of transformations for FMR that we used for MAD to realize parallelism among tasks in one loop iteration, as well as among task instances in successive iterations.
Evaluation and analysis
We developed a timed functional model of an MLCA instance, consisting of about 6,000 lines of C++/SystemC. The model reflects MLCA's overall structure: CP, TD, URF, and PUs, with associated PU caches and shared or distributed memory. In this MLCA instance, we used Advanced RISC Machine (ARM) processors as the PUs.
The model instantiates the desired configuration at runtime. Parameters include number and type of PUs; URF size; number of renaming registers; cache and memory configuration and associated latencies; and relative CP, TD, and PU speed.
Each PU can be configured with a cache and a combination of local and global memory. The interconnect adds a constant delay, and the memory model implements a simple contention mechanism, which enqueues requests in order and dequeues them at a given rate. The URF contention model is similar. 
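The contention mechanism described above amounts to a single in-order queue drained at a fixed rate. The Python sketch below is one possible reading of it, not the SystemC model itself: the function name, the parameter names, and the example delay and rate values are assumptions, and the return path through the interconnect is not modeled.

```python
def completion_times(arrivals, interconnect_delay, service_interval):
    """Return the time at which each memory request completes.

    arrivals: request issue times in nondecreasing order (cycles).
    Each request reaches the memory after a constant interconnect delay,
    is queued in order, and the queue is drained at one request per
    service_interval cycles.
    """
    done = []
    next_slot = 0   # earliest time the memory can start another request
    for issued in arrivals:
        at_memory = issued + interconnect_delay
        served = max(at_memory, next_slot)   # wait behind earlier requests
        next_slot = served + service_interval
        done.append(next_slot)
    return done
```

Three requests issued in the same cycle complete one service interval apart, while requests spaced further apart than the service interval see only the interconnect delay plus one service interval. The same function could model URF port contention with different parameter values.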
The caches are write-through, and thus memory always contains the most up-to-date copy of data. Cache lines need not be invalidated on every write to maintain consistency; rather, invalidating the entire cache at the end of task execution on the corresponding processor maintains cache consistency. We use eight-way set-associative caches, with an 8-byte cache block, and only a global memory. The model assumes four ports to the URF (and thus to the renaming registers) and four ports to the global memory.

We used simulation to evaluate the performance of MAD and FMR on the MLCA model. MAD decoded an input MP3 song file consisting of 126 frames, with two channels, a 22,050-Hz sample rate, and a 40-Kbps bit rate. It executed in 137,312,446 cycles on one processor. FMR decoded 21 input packets, each consisting of 1,536 bytes. It executed in 112,468,135 cycles on one processor.

Figure 11 shows the two applications' speedup as a function of the number of processors. We define each application's speedup for P processors as the ratio of the application's execution time on one processor to its execution time for P processors. The figure shows the speedup for various numbers of renaming registers. Each application exhibits scaling speedup, which is relatively close to the ideal when the number of renaming registers is sufficiently large. MAD's speedup at one processor is slightly less than 1. This reflects the overhead incurred by the code transformations. For example, each application's transformed code allocates and deallocates buffers during its execution, which does not occur in the sequential, untransformed application. Nonetheless, as the number of processors increases, the benefits of parallelism outweigh this overhead.
Figure 11 also shows that each application's speedup depends on the number of renaming registers; in general, the more renaming registers available, the higher the speedup. Further exploring this issue, Figure 12 shows the applications' speedup as a function of the number of renaming registers for different numbers of processors. Each application's speedup noticeably increases up to a certain breakpoint and then remains relatively flat. The breakpoint is different for each application and for each number of processors.
We explain this speedup behavior as follows. The increase in the number of renaming registers lets more tasks execute in parallel. (Renaming registers allow the hardware to break false intertask dependencies and thus to issue more task instructions in parallel. The hardware stops issuing instructions when it runs out of renaming registers.) However, the increase of registers has an impact only if it enables more tasks to execute. When adding a sufficient number of renaming registers has removed all false dependencies, each application's execution speed is dictated by the true dependencies among its tasks. Adding more renaming registers does not lead to the execution of more tasks. Thus, speedup improves until all false dependencies are broken and then remains flat. Similarly, the availability of processors dictates the maximum number of tasks that can execute, and the breakpoint is also a function of the number of processors. Because the number of false dependencies is different for each application, the breakpoint is also different for each application. However, Figure 12 also shows, especially for eight processors, that adding more renaming registers sometimes decreases the speedup, rather than improving it. This is caused by the impact of additional registers on task scheduling. 7

It's important to point out that the potentially large number of renaming registers poses no performance bottleneck in the MLCA's top level. The granularity of tasks executing on the PUs is several orders of magnitude greater than the URF register access time. Thus, URF registers are not likely to be accessed every cycle, and they need not be as fast as registers within the PUs. Indeed, as one indicator, we experimented with the performance impact of the number of URF access ports. In both applications, changing the number of access ports from one to eight had a negligible effect on performance.
This was also true for the number of memory ports. Changing the number of memory ports had a negligible effect on speedup of the two applications, indicating that neither the URF nor the memory is strongly contended for.

Our simulation results show that both applications exhibited good performance. We found that adding more processors results in scaling performance and that contention for resources is negligible. Thus, the MLCA is a viable SoC architecture for multimedia applications.
In future work, we will address issues such as the design and evaluation of a memory hierarchy for the MLCA, task definition and formation, task scheduling, and system evaluation using industry-standard applications. 
