Recent efforts in adapting computer networks into system-on-chip (SOC), or network-on-chip, present a setback relative to traditional computer systems for lack of an effective programming model, while not taking full advantage of the almost unlimited on-chip bandwidth. In this paper, we propose a new programming model, called context-flow, that is simple, safe, highly parallelizable, yet [...] of the much relaxed physical constraints and almost unlimited on-chip bandwidth.
INTRODUCTION
The continued advancement in semiconductor technology allows systems-on-chip (SOC) to accommodate an increasing number of computational elements and embedded memory. The industry has been using common busses and design-specific communication channels to interconnect these components. Such global-wiring communication architectures are unable to scale with the large dies fabricated in the near future with a 0.1 µm technology or below [1]. To overcome this problem and accommodate future applications that need massive parallelism, researchers proposed the use of interconnection networks, previously used to interconnect supercomputer components, to fulfill on-chip communication [...] banner of network-on-chip, we observe some common, yet important omissions.
First, while traditional computer architecture is well abstracted [...] abstraction of autonomous dynamic data structures. An application written with this programming model is not only simple, as it is "less" than the usual C code, but also safe in the sense of Java, as it is free of problems such as accesses to freed memory and dangling pointers. Second, we propose a new SOC platform architecture, called the context-flow architecture, revolving around an on-chip network infrastructure called a tunnel, which takes full advantage of the physical proximity of tightly coupled processing elements. The tunnel implements the on-chip remote procedure call abstraction, thereby achieving the transparency of the programming model, since an application does not have to change when the underlying architecture changes, yet at the cost of cheap local procedure calls, thereby achieving performance efficiency. Third, we have built a development suite by extending the popular SimpleScalar environment, which was designed for single-processor evaluation, so that complex applications can be simulated on the multi-processor context-flow architecture platform. We demonstrate the performance efficiency of this architecture with real-world applications.
The rest of the paper is organized as follows. In Section 2, we introduce the context-flow programming model. In Section 3, the design of the baseline context-flow architecture is described. In Section 4, we present the performance evaluation framework we built for our architecture, before we demonstrate in Section 5 its performance efficiency on two applications, namely an MP3 decoder and a cryptography accelerator. We then discuss related work.
CONTEXT-FLOW PROGRAMMING MODEL
A programming model is an abstraction that separates application from architecture. This separation is important because it allows applications to be developed and reused across different architectures, and vice versa. A programming model can be defined at different levels of abstraction, and a hardware/software infrastructure is usually needed to support such an abstraction. For example, an instruction set is a programming model defined at the low level to abstract away architectural details such as pipelining and out-of-order issue, and a massive amount of hardware logic is used to realize this abstraction. A programming language is defined at the higher level to abstract away the differences between instruction sets, and a compiler is used to realize such abstractions. For the same programming model, a middleware infrastructure, such as CORBA [6] or DCOM, can be used to abstract away architectural details of a distributed environment, so that a distributed application is implemented the same way as a sequential one.
The importance of a programming model, however, is ignored in the hardware-centric CAD community. Even though platform-based design is advocated to allow the reuse and customization of pre-aggregated components, the concept of a platform has not been formalized with a programming model for applications. Recent interest in building the communication infrastructure for massively parallel SOCs has led to the concept of network-on-chip. Building a programming model for network-on-chip either has to use explicit communication with send/receive system calls, a wide departure from the traditional imperative programming model, or has to build another middleware infrastructure on top of the network, leading to performance degradation that grows with the number of layers one communication session has to go through.
We propose a new programming model formally defined in Definition 1. A context-flow program (CFP) is extremely simple: it is a C program with the same sequential semantics. It can therefore be compiled using any conventional compiler and executed on any conventional machine. Contexts can be implemented by using the API shown in Figure 1. While the API consists of only three functions, it is the complete API seen by the application programmer. Here, cfNewContext creates a context and returns a unique identifier. cfDelContext destroys a context, thereby reclaiming all the memory blocks contained in the context. cfAlloc allocates a memory block of a certain size from the specified context. We now argue that a CFP is in fact simpler than a usual C program: note that the counterpart of cfAlloc, which would be responsible for memory block deallocation, is not provided by design. Instead, the memory is deallocated at the context level by cfDelContext. This relieves the programmer of the task of fine-grained memory management, thereby simplifying the programming task in a way similar to garbage collection.
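To make the three-function API concrete, the following is a minimal sketch in plain C. Figure 1 is not reproduced in this text, so the handle type `cf_context_t`, the argument order, the fixed table size `CF_MAX_CONTEXTS` and the malloc-backed bookkeeping are all assumptions made for illustration; only the three function names and their described behavior come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed signatures: the handle type and argument order are guesses,
 * since Figure 1 is not reproduced in this text. */
typedef int cf_context_t;

#define CF_MAX_CONTEXTS 16      /* hypothetical limit */

/* Hidden header that chains every block to the context owning it. */
typedef union cf_block {
    union cf_block *next;
    max_align_t align;          /* keeps the payload suitably aligned */
} cf_block;

static cf_block *cf_chain[CF_MAX_CONTEXTS];
static int cf_live[CF_MAX_CONTEXTS];

/* Create a context; returns its identifier, or -1 if none is available. */
cf_context_t cfNewContext(void)
{
    for (int i = 0; i < CF_MAX_CONTEXTS; i++) {
        if (!cf_live[i]) {
            cf_live[i] = 1;
            cf_chain[i] = NULL;
            return i;
        }
    }
    return -1;
}

/* Allocate size bytes owned by ctx; there is deliberately no per-block free. */
void *cfAlloc(cf_context_t ctx, size_t size)
{
    if (ctx < 0 || ctx >= CF_MAX_CONTEXTS || !cf_live[ctx])
        return NULL;
    cf_block *b = malloc(sizeof *b + size);
    if (b == NULL)
        return NULL;
    b->next = cf_chain[ctx];
    cf_chain[ctx] = b;
    return b + 1;               /* payload starts right after the header */
}

/* Destroy ctx and reclaim every block that was allocated from it. */
void cfDelContext(cf_context_t ctx)
{
    assert(ctx >= 0 && ctx < CF_MAX_CONTEXTS && cf_live[ctx]);
    while (cf_chain[ctx] != NULL) {
        cf_block *next = cf_chain[ctx]->next;
        free(cf_chain[ctx]);
        cf_chain[ctx] = next;
    }
    cf_live[ctx] = 0;
}

typedef struct node { int value; struct node *next; } node;

/* Build the list n, n-1, ..., 1 inside a fresh context and sum it. */
int sum_list_in_context(int n)
{
    cf_context_t ctx = cfNewContext();
    if (ctx < 0)
        return -1;
    node *head = NULL;
    for (int i = 1; i <= n; i++) {
        node *nd = cfAlloc(ctx, sizeof *nd);
        if (nd == NULL) {
            cfDelContext(ctx);
            return -1;
        }
        nd->value = i;
        nd->next = head;
        head = nd;
    }
    int sum = 0;
    for (node *p = head; p != NULL; p = p->next)
        sum += p->value;
    cfDelContext(ctx);          /* one call releases the whole list */
    return sum;
}
```

Note that `sum_list_in_context` never frees an individual node: the single `cfDelContext` call plays the role that a loop of `free` calls would play in an ordinary C program.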
Definition 1 Given a program with a set P of procedures, oper[...]
This simplification can lead to program safety in the same sense as what garbage collection brings to modern languages such as Java. A CFP is free from dangling pointers and accesses to freed memory thanks to the closure property of contexts: there cannot be any references to freed memory blocks, since the memory containing the reference should belong to the same context, and has therefore been freed already as well. On the other hand, the implementation of a context is far cheaper than a garbage collector, and in fact cheaper than the malloc/free in a normal program: the cost of memory allocation can be confined to constant time using a stack-based mechanism.
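The constant-time claim can be illustrated with a stack-style (bump) allocator over a fixed memory bank. This is only a sketch of one way such a mechanism could work, not the authors' implementation; the names `region`, `region_alloc`, `region_reset` and the bank size `BANK_BYTES` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BANK_BYTES 4096         /* hypothetical bank size */

/* One context backed by one bank; top is the stack pointer. */
typedef struct {
    union {
        max_align_t align;      /* forces worst-case alignment of the bank */
        unsigned char bytes[BANK_BYTES];
    } bank;
    size_t top;                 /* offset of the first free byte */
} region;

/* Deleting the context is a single store: every block vanishes at once. */
void region_reset(region *r)
{
    assert(r != NULL);
    r->top = 0;
}

/* Allocation is a bounds check plus a pointer bump: O(1), no free list. */
void *region_alloc(region *r, size_t size)
{
    const size_t a = sizeof(max_align_t);
    assert(r != NULL);
    if (size > BANK_BYTES)
        return NULL;            /* also rules out overflow in the rounding */
    size_t rounded = (size + a - 1) / a * a;
    if (rounded > BANK_BYTES - r->top)
        return NULL;            /* bank exhausted */
    void *p = r->bank.bytes + r->top;
    r->top += rounded;
    return p;
}
```

Because no block is ever released individually, the allocator needs no free list and no search, which is where the advantage over a general-purpose malloc/free comes from.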
A context is designed to be an abstraction of autonomous data structures. It can be anything: arrays, linked lists, trees, graphs, or a combination of them all. The concept of context offers a macroscopic view of the program and therefore makes coarse-grained parallelization much easier, as shall become apparent in the next section.
CONTEXT-FLOW ARCHITECTURE
An architecture is an aggregate of architectural components such that an application can be executed or implemented through a well-defined programming model. A micro-architecture is an aggregate of components such as fetch stage, decode stage, execution stage and memory stage to implement a sequential application in C or other programming languages by its instruction set. On the other hand, a macro-architecture is an aggregate of components such as processing elements (PEs) and memories to implement a parallel application by a programming model such as MPI. The composition of a macro-architecture in a traditional parallel system is pre-defined, whereas in the case of SOC, the composition is often customized according to one application or one family of applications. A macro-architecture is said to be homogeneous if all PEs are of the same type, e.g., processors, and heterogeneous if PEs can be microprocessors, DSPs, ASIPs or custom hardware cores.
We consider the design of a macro-architecture, called the context-flow macro-architecture (CFA), formally defined in Definition 2. Unlike an application in a traditional programming model, a CFP is highly parallelizable, since different procedures, each accessing their own private data structures maintained in different contexts, can run on different PEs of a CFA in parallel, without the concern of dependency hazards or cache coherence that frequently arise in traditional shared or distributed memory architectures. The accesses of contexts do switch from one procedure to another when a procedure call occurs. When the remote procedure call (RPC) abstraction is implemented by the on-chip network of a CFA, whose runtime configuration in Definition 2 is dynamically adjusted, a CFP is also highly transparent, meaning that it does not need to be changed no matter how the PEs in a CFA are allocated and how the procedures are mapped.
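The transparency property can be pictured with a toy dispatch table. The sketch below is an illustration under stated assumptions rather than the CFA mechanism itself: the procedures `scale` and `sum`, the mapping table `proc_to_pe`, the counter `remote_calls` and the helper `cf_call` are all hypothetical names, and a "remote" call is simulated by counting instead of crossing a real on-chip network.

```c
#include <assert.h>
#include <stddef.h>

enum { PROC_SCALE, PROC_SUM, NUM_PROCS };   /* hypothetical procedures */

typedef int (*cf_proc)(int *ctx_data, size_t n);

/* Double every element of the context data. */
static int scale(int *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] *= 2;
    return 0;
}

/* Add up the context data. */
static int sum(int *d, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += d[i];
    return s;
}

static const cf_proc proc_table[NUM_PROCS] = { scale, sum };

/* Runtime configuration: the PE that hosts each procedure. */
static int proc_to_pe[NUM_PROCS];
static int remote_calls;        /* calls that crossed a PE boundary */

/* Call a procedure by identifier; the caller never names a PE for it. */
int cf_call(int caller_pe, int proc, int *ctx_data, size_t n)
{
    assert(proc >= 0 && proc < NUM_PROCS);
    if (proc_to_pe[proc] != caller_pe)
        remote_calls++;         /* would be routed through the network */
    return proc_table[proc](ctx_data, n);
}

/* The application, running on PE 0: identical under every mapping. */
int app(void)
{
    int data[4] = { 1, 2, 3, 4 };
    cf_call(0, PROC_SCALE, data, 4);
    return cf_call(0, PROC_SUM, data, 4);
}
```

Changing the entries of `proc_to_pe` changes how many calls are remote, but neither the source of `app` nor its result.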
The key problem in the design of a CFA is the design of its on-chip network. We start by defining a programming model for the network, which abstracts how the network interacts with the PEs that it connects. We define the programming model in the form of an instruction set, as shown in Figure 2. The instruction set is simple: it contains only 10 instructions. It is encoded by the values of the wires on each port that connects a PE to the network. From the perspective of the network, it encodes a command or request from a PE. From the perspective of a PE, the instruction set is a complement of its own, for which it can assume the availability of a co-processor for actual execution, effectively by driving the right wires in the corresponding ports. We now consider how to implement an on-chip network that can implement this instruction set efficiently. There are several alternatives, each employing a different network topology.
As shown in Figure 3(a), a bus-based CFA maintains a private memory bank for each of its PEs; in other words, the connection configuration C in Definition 2 is static. A context is also maintained in the private memory bank. Consequently, every time an RPC is invoked, the content of the corresponding context needs to be copied to the memory bank that belongs to the callee, and this data transfer is carried out by a shared bus.
As shown in Figure 3(b), a packet-switch based CFA is the same as the bus-based one except that the data transfer can be performed more efficiently: while a shared bus may invite transfer congestion, a well-designed packet-switched network can distribute the communication traffic evenly.
Like previous efforts, these two alternatives do not take full advantage of the fact that the network we are designing is on-chip, and that the PEs are physically close to each other. We propose a new crossbar-based on-chip network, called a CFA tunnel. As shown in Figure 3(c), the tunnel maintains a pool of separate memory banks, as well as an intelligent crossbar switch. Each context is dynamically mapped to a single memory bank until it is deallocated, and the crossbar ensures that access to the memory is dynamically switched to the callee whenever an RPC occurs. Note that our crossbar should not be confused with the crossbars in previous efforts, which are still designed for the purpose of data transfer. Instead, the goal of our crossbar is to provide direct, wired access to memories. RPC, or the flow of contexts from one PE to another, can then be achieved at virtually no cost. It is important to note that there is a physical limit to the scalability of the CFA tunnel. As the network gets larger, the delay of the crossbar grows quickly, thereby increasing the cost of each memory access. This can be contained by employing a two-layer strategy, where PEs are partitioned into clusters based on the communication traffic among them, the intra-cluster network is based on the tunnel, and the inter-cluster network is based on packet switching. In this paper, we focus only on the study of the flat network, which we believe is appropriate for the applications we are interested in.
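The difference between copying a context and switching access to it can be sketched in a few lines of C. This is a simplified software model under assumed sizes (`NUM_PES`, `NUM_BANKS`, `BANK_WORDS`) and hypothetical function names; it counts transferred words only and does not model crossbar delay or bus contention.

```c
#include <stddef.h>
#include <string.h>

#define NUM_PES    2            /* hypothetical sizes */
#define NUM_BANKS  4
#define BANK_WORDS 256

/* Bus-style CFA: one private bank per PE, so an RPC copies the context. */
static int private_bank[NUM_PES][BANK_WORDS];

/* Returns the number of words moved over the shared bus. */
size_t bus_rpc(int caller, int callee, size_t ctx_words)
{
    if (caller == callee || ctx_words > BANK_WORDS)
        return 0;               /* local call, or request rejected */
    memcpy(private_bank[callee], private_bank[caller],
           ctx_words * sizeof(int));
    return ctx_words;
}

/* Tunnel-style CFA: contexts stay in a bank pool; only routing changes. */
static int pool[NUM_BANKS][BANK_WORDS];
static int bank_owner[NUM_BANKS];   /* crossbar state: PE wired to each bank */

/* Hand the bank holding the context to the callee; moves no data words. */
size_t tunnel_rpc(int bank, int callee)
{
    bank_owner[bank] = callee;
    return 0;
}

/* A PE reaches a bank only while the crossbar connects it to that bank. */
int *tunnel_access(int pe, int bank)
{
    return bank_owner[bank] == pe ? pool[bank] : NULL;
}
```

In this model the bus cost grows with the size of the context, whereas the tunnel cost is a single update of the routing state regardless of how much data the context holds.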
PERFORMANCE EVALUATION FRAMEWORK
We target complex applications, which are usually described in C using high-level language features such as pointer references and complex data structures. The speculated performance advantage can only be validated on such applications. A performance evaluation environment that can simulate a CFA with reasonable architectural detail for any CFP application is therefore needed.
A good example of an architectural evaluation environment is the SimpleScalar toolset developed at Wisconsin [7]. It is designed to study innovations in micro-architecture such as pipelining, branch prediction, and out-of-order issue. The environment provides a complete compiler tool chain that can compile a C application into a binary in the PISA instruction set. An instruction set simulator can then be used to simulate the binary while collecting performance metrics of interest. Figure 4(a) shows the pseudo code of sim-fast, a fast simulator provided in SimpleScalar, which maintains the processor state by a simulated memory (mem) and registers (regs). It starts by loading the application binary into a simulated memory, and then enters a loop which fetches one instruction at a time from the simulated memory, decodes it, and performs an action that is consistent with the instruction semantics, while updating simulated registers and memory accordingly.
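The load-then-interpret structure described above can be sketched with a toy interpreter. The four opcodes, the instruction layout and the memory sizes below are invented for illustration and are not PISA; the code is neither taken from SimpleScalar nor a reproduction of Figure 4(a).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { OP_HALT, OP_LOADI, OP_ADD, OP_STORE };   /* toy opcodes, not PISA */

typedef struct { uint8_t op, rd, rs, rt; uint32_t imm; } insn;

#define TEXT_WORDS 32           /* hypothetical sizes */
#define DATA_WORDS 32
#define NUM_REGS   8

typedef struct {
    insn     text[TEXT_WORDS];  /* simulated instruction memory */
    uint32_t mem[DATA_WORDS];   /* simulated data memory */
    uint32_t regs[NUM_REGS];    /* simulated register file */
    size_t   pc;
} cpu_state;

/* Load the "binary" into simulated memory; returns 0 on success. */
int sim_load(cpu_state *s, const insn *prog, size_t n)
{
    if (n > TEXT_WORDS)
        return -1;
    memset(s, 0, sizeof *s);    /* unused slots decode as OP_HALT */
    memcpy(s->text, prog, n * sizeof *prog);
    return 0;
}

/* Fetch, decode and execute until OP_HALT; returns instructions executed. */
unsigned long sim_run(cpu_state *s)
{
    unsigned long executed = 0;
    while (s->pc < TEXT_WORDS) {
        const insn *i = &s->text[s->pc++];          /* fetch */
        executed++;
        switch (i->op) {                            /* decode and execute */
        case OP_LOADI:
            s->regs[i->rd % NUM_REGS] = i->imm;
            break;
        case OP_ADD:
            s->regs[i->rd % NUM_REGS] =
                s->regs[i->rs % NUM_REGS] + s->regs[i->rt % NUM_REGS];
            break;
        case OP_STORE:
            s->mem[i->imm % DATA_WORDS] = s->regs[i->rs % NUM_REGS];
            break;
        default:                                    /* OP_HALT or unknown */
            return executed;
        }
    }
    return executed;
}
```

The returned instruction count stands in for the performance metrics a real simulator would collect inside the same loop.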
In the sequel, we first introduce how the SimpleScalar infrastructure is extended into a multi-processor, CFA performance evaluation environment. We then show how a C program is mapped into a CFA in our environment by a simple, yet complete example.
Sim-CFA
We consider a homogeneous CFA where each PE is implemented by a processor equipped with the PISA instruction set, complemented by the context-flow instruction set defined in Section 3. The processor state of the single-processor environment first needs to be replicated, as shown in Figure 4(b). While each PE has its own private address space, an unused memory segment of each PE, from address 0x00000000 to 0x0~FFFFFF, is mapped to the context memory pool. With this approach, high-level language features, such as array references, pointer indirection and structure member references, can still be used directly in the source code to access objects within the context.
The simulator was modified to run multiple SimpleScalar pro[...]. Some of the instructions will be used to implement the context-flow API (Section 2), while others will be used by the compiler described in the next section to implement RPC. As shown in Figure 4(b), the simulation engine starts by loading the binaries for each PE into the simulated memories. At each simulation cycle, for each PE, the simulator fetches an instruction from memory and decodes it. If its annotation field is non-zero, meaning that it is a context-flow instruction, the simulator invokes the corresponding on-chip network simulation to process a request on one of the ports of the network. If it is a memory access whose address falls into the range from 0x00000000 to 0x0~FFFFFF, the corresponding location inside the context memory pool is accessed. Otherwise, the simulator interprets the instruction the same way as [...].
[...] implementation that breaks down the calculation into two steps, as shown in Figure 5. To transform the program into a CFP, the first step is context definition. In this example, the context is simply the data array. Figure 6 presents a transformation of the source code that runs on two PEs, mapping top() and sqrtArray() to PE0 and addArray() to PE1.
Procedure mappings to system PEs are defined in "config.da1" along with these procedures' stamps. This file is used to generate proxies and main functions for each PE via an automatic code generator (Figure 7). Note that the main() for each PE simply runs an infinite loop waiting for calls to the procedures it implements. WAITFORRPC() and READ_ARGS() are simply macros that use cfiAckRPC() and cfiload(), respectively. Once the code is generated, the source files of each PE, along with the proxies' definitions, are compiled by the SimpleScalar gcc compiler [...]. Sim-CFlow can then simulate the modeled system [...]. We implemented the different networks defined in Section 3.
In this section we present performance results of several architectural configurations in comparison with our proposal. Evaluations were applied to two real-life applications, namely an MPEG1-Layer III decoder and a cryptography acceleration processor. The performance evaluation framework presented in Section 4 was used to conduct the experiments.
MPEG1-Layer III Decoder
MPEG1-Layer III, commonly referred to as MP3, is the de facto standard for high-quality, high-compression audio data. MP3 decoders became of interest after their popular use in portable multimedia devices.
An overview of the decoder stages is presented in Figure 8. The highlighted stages were implemented in our testbench. Each stage is implemented in a single procedure processing one data granule at a time. Procedures are grouped in PEs such that the sums of method delays within PEs are as close as possible, targeting efficient thread-level pipelining. Due to the absence of accurate hardware implementation performance numbers, the delay of each method is determined using the number of memory accesses per call, assuming a perfect pipeline implementation of the processors and that memory bandwidth is the primary bottleneck. Current datapath synthesis tools (such as Module Compiler by Synopsys) can easily pipeline the computational parts of the target algorithm. In our experiment, each configuration uses 6 CKBytes SRAM banks. Simulation results are shown in Table I, where the second column reports the throughput in cycles per request and the third column reports the average PE utilization.
[...] Figure 9. Delays of processing methods were obtained from actual RTL implementations [8] and comparison results [9]. The longest path of an input packet is to go through all three categories of processing, namely hashing (MD5 or SHA1), symmetric or private-key encryption (DES-ECB, DES-CBC, 3DES-ECB, 3DES-CBC, or RC4), and asymmetric or public-key encryption (RSA). Packets could skip hashing, public-key encryption, or both.
To carry out the experiment, we coded a packet generator that generates a packet mix which uses various processing paths according to a given distribution. A set of packets was generated and [...] designs.
RELATED WORK
The MIT Raw machine was one of the earliest designs to utilize on-chip interconnection networks [10]. It uses several 2-D mesh networks to connect an array of identical programmable tiles of RISC processing cores. Dally in [2] suggests the use of on-chip interconnection networks for future SOCs where traditional interconnection techniques do not scale. That work suggests the use of regular interconnection topologies, such as torus and mesh networks, as a means of communication between square tiles of identical dimensions that are not necessarily homogeneous. The work in [11] elaborates on this architecture, targeting design exploration at the system level. Their work proposes mapping algorithms that target the power/performance optimization problems for the regular communication architecture.
The use of crossbar-based interconnects has become popular in recent years. The Berkeley IRAM [12] and the Stanford Smart Memory system [13] both use a crossbar to interface a single general-purpose programmable RISC PE to an array of memory banks, targeting the high bandwidth that crossbars provide. However, the high-level interface we implemented in our tunnels is not used in those systems, as only a single PE is interfaced to the memory pool.
[8] Rudolf Usselmann, "DES/Triple DES IP cores," September 2001.
[9] Bruce Schneier,
