Abstract-Few analytical performance models that relate performance figures of merit to architectural design decisions are reported in recent studies of network-on-chip, which prevents the development of effective system-level synthesis techniques. In this paper, we propose an analytical performance model based on queueing theory for a recently reported network-on-chip platform, which features an extremely simple programming model, while providing superior performance measures when compared with alternative
I. INTRODUCTION
With the vast complexity growth of System-On-Chip (SOC) platforms, the number of critical design decisions and alternative implementations and configurations considered in order to map an application to the corresponding platform increases exponentially. Therefore, the ability to evaluate the effect of these possible alternatives accurately in a reasonably short time becomes indispensable.
So far, simulation-based approaches have been the dominant choices made by the industry for performance analysis of SOCs. These approaches are highly accurate, but also prohibitively time consuming for large systems, which prevents the evaluation of a large number of possible system configurations. Intuition and experience are usually relied on to select a few configurations to simulate out of the many feasible ones. However, such ad hoc decisions become less effective as the systems become larger. What is badly needed is a performance model that can give insight into how a performance metric is related to architectural mapping decisions, commonly referred to as an analytical performance model.

In recognition of the limitations of the simulation-based SOC exploration procedure, we developed an analytical performance model to statically model systems implemented on the recently proposed Context-Flow Architecture (CFA). Our model is based on Queueing Networks, a field that has received extensive research over many decades, and whose models have been used extensively in computer systems and networks modeling. Queueing network models have proved to be general, simple, accurate, and detailed, reporting various aspects of the target system and application performance measures.

In contrast to the previous work reported in the area, the follow...

0-7803-8631-0/04/$20.00 © 2004 IEEE

... much progress on that front. An SOC platform is either modeled in system-level languages, such as SystemC [2] or SpecC [3], where a distinction between application, architecture and hardware does not exist, or using traditional parallel programming models, such as MPI [4], which are usually very complex to implement. Second, while traditional networks in supercomputers are designed with the bandwidth limitation imposed by chip pin count, new SOC platforms, which are based on similar topologies, do not take full advantage of the much relaxed physical constraints and almost unlimited on-chip bandwidth.

We introduce a new programming model revolving around a new concept, called context, which is essentially an abstraction of autonomous dynamic data structures closed under the point-to relation. A context-flow program (CFP) can be viewed as a set of procedures operating on a set of contexts in a multi-threaded form, collaborating through the remote procedure call (RPC) abstraction to achieve the overall system behavior. Unlike an application in traditional programming models, a CFP is highly parallelizable, since different procedures, each accessing their own private data structures maintained in different contexts, can be run in a CFA on different processing elements (PEs) in parallel, without the concern of dependency hazards or cache coherence that frequently occur in the traditional shared or distributed memory architectures. The accesses of contexts do switch from one procedure to another when a procedure call occurs.

The key problem in the design of a CFA is the design of its on-chip network. We start by first defining an instruction set, which abstracts how the on-chip network interacts with the PEs that it connects (Figure 1). The instruction set is simple enough to contain only 7 instructions. It is encoded by the values of the wires on each port that connects a PE to the network. From the perspective of the network, it encodes a command or request from a PE. From the perspective of a PE, the instruction set is a complement of its own, for which it can assume the availability of a co-processor for actual execution, effectively by driving the right wires in the corresponding ports. In Figure 1, cfiAllocBank allocates a bank for a single context until deallocated by cfiFreeBank. cfiMalloc is used for subsequent allocations of arbitrary objects on the target context. cfiLoad and cfiStore are simple memory accesses. cfiRPC and cfiRet are used to implement the remote procedure call abstraction, where the context currently accessed by the caller is passed to the callee for further processing.
Previous efforts do not take full advantage of the fact that the network we are designing is on-chip, and the PEs are physically close to each other. In [1], we proposed a new on-chip network, called a CFA tunnel, that can implement this instruction set efficiently. As shown in Figure 2, the tunnel maintains a pool of separate memory banks, as well as an intelligent crossbar switch. Each context is dynamically mapped to a single memory until it is deallocated, and the crossbar ensures the access to the memory is dynamically switched to the callee whenever an RPC occurs. Note that our crossbar should not be confused with crossbars in previous efforts, such as switch fabrics of network routers, which are still utilized for the purpose of data transfer. Instead, the goal of our crossbar is to provide direct, wired access to memories. RPC, or the flow of contexts from one PE to another, can then be achieved at virtually no cost! Experimental results in [1] showed superior performance measures of CFA-based implementation when compared with alternative architectures.

It is important to note that there is a physical limit for the scalability of the CFA tunnel. As the network gets larger, the delay of the crossbar grows quickly, thereby increasing the cost of each memory access. This is contained by employing a two-layer strategy, where PEs are partitioned into clusters based on the communication traffic among them.

B. Queueing Networks

In this section, we provide some background on Queueing Networks, an efficient and accurate approach to computer system modeling. It has been used in the design of systems ranging from single network servers to wide area communication networks [8].

A queueing network consists of a set of communicating nodes of service providers. A job arrives at a node, waits in the corresponding queue when all servers are busy, gets processed, and departs for another node or out of the system. Figure 3 shows an example of a simple queueing system with some feedback flows. A key feature of queueing network models, and a reason for their success, is that they abstract away many of the low-level details associated with the various modeled systems. All they need is a set of timed parameters that affect the system performance.

The basic characterization entities of queueing network models are service providers, which represent the modeled system processing resources, and customers, which represent the system jobs (contexts in our case). A typical set of inputs of a queueing model are [8]:
- λ, arrival rate, which specifies the arrival intensity in customers per unit time.
- Dem_i, service demand at server i, which specifies the service time for each customer.

The outputs obtained by solving the system are:
- R, average system response time, which specifies the travel time between the system input and output.
- U_i, utilization of server i, i.e. the percentage of overall time the server is busy.
- Wq_i, queueing time of server i, which specifies the average waiting time at server i before a job gets serviced.
- R_i, residence time of server i, which is simply the sum of average waiting time and average service time at server i.
- Lq_i, queue length of server i.

If the jobs arriving to the system have some classification, usually referred to as Multi-Class Systems, the model inputs need to specify the job mix and required services, and the outputs will be returned per class as well as overall system measures. It is worth noting that the input and output measures mentioned above are just the essential requirements for the least detailed models. Further parameters and results are associated with other models used in various analysis tools, as presented below.
Finally, it is a common practice in queueing theory to describe a queue using Kendall's Notation (A/S/m/B/K/SD), where: A describes the distribution of interarrival times of customers, S is the distribution of service times, m is the number of servers, B is the maximum number of customers which can be accommodated by the annotated queue, K is the population size, and SD is the service discipline.
For example, M/D/2/10/500/FCFS is for exponentially distributed interarrival time, deterministic service time, two servers, buffer size of 10, population 500, and first-come-first-served discipline. Default values, such as infinite queue size and FCFS service discipline, can be omitted from this notation.
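As an illustration of how such a descriptor can be decoded, the following is a minimal Python sketch. The names `KendallQueue` and `parse_kendall` are hypothetical and not part of the original work; the sketch assumes fields are separated by `/` and that omitted trailing fields take the defaults mentioned above (infinite capacity and population, FCFS discipline).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KendallQueue:
    """Hypothetical container for the six Kendall fields A/S/m/B/K/SD."""
    arrival: str               # A: interarrival-time distribution
    service: str               # S: service-time distribution
    servers: int               # m: number of servers
    capacity: Optional[int]    # B: buffer size, None means infinite
    population: Optional[int]  # K: population size, None means infinite
    discipline: str            # SD: service discipline


def parse_kendall(notation: str) -> KendallQueue:
    """Parse 'A/S/m[/B[/K[/SD]]]', filling omitted trailing fields with defaults."""
    fields = notation.strip().split("/")
    if not 3 <= len(fields) <= 6:
        raise ValueError(f"expected 3 to 6 fields, got {len(fields)}")
    capacity = int(fields[3]) if len(fields) > 3 else None
    population = int(fields[4]) if len(fields) > 4 else None
    discipline = fields[5] if len(fields) > 5 else "FCFS"
    return KendallQueue(fields[0], fields[1], int(fields[2]),
                        capacity, population, discipline)


full = parse_kendall("M/D/2/10/500/FCFS")  # the example from the text
short = parse_kendall("M/M/1")             # defaults applied to B, K, and SD
print(full)
print(short)
```

The parser only splits and type-converts the fields; it does not check that the distribution symbols themselves are meaningful.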
III. ANALYTICAL PERFORMANCE MODEL

A. The Modeling Process
The close correspondence between the attributes of queueing networks and those of our CFA, as shown in Section II, suggests that queueing networks could be ideal modeling tools to describe our system.
The modeling process could be viewed as a conversion from system specifications in the Context-Flow domain to those recognized by queueing systems. The output of this stage would be a fully specified queueing network that can be easily solved using simple equations.
Whether the resulting system is single-class or multi-class depends on the application being mapped on a CFA.
The inputs of our modeling process are:
- Workload Specification, which defines the arriving job mix and the corresponding arrival rates. This can be obtained by a process called workload characterization, which is a complex process of profiling to arrive at a typical workload. A second possibility is that a typical workload would be defined initially as part of the system specifications [8].
- Procedure Frequency, which defines the number of calls made to each context-flow procedure per unit time. Again, this measure can be obtained by profiling of a typical workload, or by static prediction of the probability of edges of the application call graph for a typical workload.
- Mapping, which describes the assignment of procedures to target system processing elements.
The output of our modeling is a fully characterized queueing network.
Solving the model returns the performance estimates of various aspects of the system.
B. Stochastic Model
Traditional applications of queueing networks to model computer systems assumed a Poisson arrival process at the system inputs, and exponentially distributed service times at the service centers [8].
These assumptions imply that the resulting interconnection of processing elements forms a Jackson Network [9]. In this class of networks, each queue can be analyzed separately as an M/M/m queue. This model is parameterized only by the average arrival rate and average service rate, returning average waiting time, average queue length, and utilization. This approach proved quite successful in modeling such systems. For example, requests sent by users to a mainframe did have a random arrival pattern that was captured using a Poisson process, and the size of jobs to be serviced was also a randomized process. However, the immediate application of the same simplifying assumptions to model our architecture was unsuccessful. In an SOC, the arrival process and/or service times could easily be deterministic! For example, the arrival rate for an MPEG decoder is usually deterministic, and the service rate for ATM packet processing stages is also deterministic.
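To make the cost of the exponential assumption concrete, the following Python sketch compares the mean queueing delay of a single-server queue with exponential service (M/M/1) against one with deterministic service (M/D/1). It uses the standard Pollaczek-Khinchine formula for M/G/1 queues, which is a textbook result and not a formula taken from this paper; the function name `mg1_wait` and the numeric values are hypothetical.

```python
def mg1_wait(arrival_rate: float, service_time: float, cs2: float) -> float:
    """Mean queueing delay of an M/G/1 queue (Pollaczek-Khinchine formula).

    cs2 is the squared coefficient of variation of the service time:
    1.0 for exponential service (M/M/1), 0.0 for deterministic service (M/D/1).
    """
    rho = arrival_rate * service_time  # server utilization
    if not 0.0 <= rho < 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    return rho * service_time * (1.0 + cs2) / (2.0 * (1.0 - rho))


# Hypothetical numbers: one job every 125 cycles on average, 100 cycles per job.
lam = 0.008
tau = 100.0
wq_mm1 = mg1_wait(lam, tau, cs2=1.0)  # exponential service assumption
wq_md1 = mg1_wait(lam, tau, cs2=0.0)  # deterministic service
print(f"M/M/1 wait: {wq_mm1:.1f} cycles, M/D/1 wait: {wq_md1:.1f} cycles")
```

With Poisson arrivals, assuming exponential service when the service time is actually fixed overestimates the mean wait by a factor of two at any utilization, which suggests why a model that captures variability can matter for an SOC.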
In [10], W. Whitt described the Queueing Network Analyzer (QNA), a software package developed at Bell Laboratories to analyze complex queueing networks. The package uses GI/G/m approximation models to describe and analyze the given system. The arrival process is assumed to be a generalized interarrival (GI) process, and the service may have any general (G) distribution. The approximation made by this approach is that only the mean and squared coefficient of variation (SQV = var/mean^2) of the arrival and service processes are required for the calculations (a two-moment model). In addition to the basic input parameters described in Section II-B, we need to provide the SQV of the interarrival time of the external arrival process to each node i, and the SQV of the service time. The analysis process calculates the parameters of internal nodes, which enables the calculation of all required system measures. The model is capable of handling even more complicated system features, including superposition and splitting, which are outside the scope of this paper.

For our purpose, the proposed model seemed to be a suitable fit. The additional required parameters could easily be derived by workload characterization. The remaining question is the model accuracy, which will be reported in Section IV. In the sequel, we provide our approach to transform our CFA and application description into a fully described queueing network model.

Note that in this model $\sum_i m_{i,j}$ must add to 1. Values less than 1 imply logic/functionality replication and workload distribution. For example, if we want to replicate procedure p3 and divide the arrival requests such that one third of the requests go to PE1 and the rest to PE2, then the new mapping matrix will be:
To force single instantiation of procedure logic, we allow mapping figures to take only binary values, 0, 1.
Using the summing rule, when two procedures are assigned to a single PE, the arrival rate will be the sum of their frequencies. This conversion from the abstract domain to the queueing system domain can be captured using the mapping matrix, as shown in (4).
$$\lambda_i = \sum_{j} \left( m_{i,j} \cdot f_j \right) \qquad (4)$$
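The summing rule can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `pe_arrival_rates`, the three-procedure application, and the call frequencies are hypothetical, and the matrix layout (rows are PEs, columns are procedures) is an assumption consistent with the requirement that the entries for each procedure add to 1.

```python
def pe_arrival_rates(mapping, freq):
    """Return lambda_i = sum_j mapping[i][j] * freq[j] for every PE i.

    mapping[i][j] is the fraction of calls to procedure j served by PE i,
    so every column of the matrix must add to 1.
    """
    n_proc = len(freq)
    for row in mapping:
        if len(row) != n_proc:
            raise ValueError("each row needs one entry per procedure")
    for j in range(n_proc):
        column_sum = sum(row[j] for row in mapping)
        if abs(column_sum - 1.0) > 1e-9:
            raise ValueError(f"mapping fractions of procedure {j} do not add to 1")
    return [sum(m_ij * f_j for m_ij, f_j in zip(row, freq)) for row in mapping]


# Hypothetical call frequencies (calls per unit time) of procedures p1, p2, p3.
freq = [30.0, 60.0, 90.0]

# Binary mapping: PE1 hosts p1 and p2, PE2 hosts p3.
binary = [[1, 1, 0],
          [0, 0, 1]]

# Replicated mapping: one third of the p3 requests go to PE1, the rest to PE2.
replicated = [[1, 1, 1 / 3],
              [0, 0, 2 / 3]]

rates_binary = pe_arrival_rates(binary, freq)
rates_replicated = pe_arrival_rates(replicated, freq)
print(rates_binary, rates_replicated)
```

In the binary case the arrival rate at PE1 is simply the sum of the frequencies of the two procedures it hosts; in the replicated case part of the load of p3 moves from PE2 to PE1 while the total arrival rate is unchanged.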
D_i =

where Dp_j is the average processing time of a job by procedure j. Although Dp_j is assumed to be constant, the model can be easily extended to make procedure delays a function of the mapping. On heterogeneous systems, a single procedure could be mapped to different embedded processors with different architectural features, or to custom logic. To take that into account, we define Dp_j in terms of d_{i,j}, the average processing time of a job by procedure j when on PE i, as follows:

In queueing theory it is a common practice to use service rate instead of service or processing time:

$$\mu_i = \frac{1}{D_i}$$

Using these numbers we can derive major performance measures of processing elements using very simple formulas. The equation describing processing element utilization would be:

Using equations 7 and 8, and the SQV parameters for each node i, we can calculate further estimates of PE statistics. For example, the average waiting time at PE i is:

Where:

We can also derive performance estimates of the overall system. An average processing element utilization is:

And the average service time for a request is:

Using this model we can easily get performance measures for each procedure, each processing element, each job class, and the overall system. Further processing is needed if the more detailed probability distribution of the above quantities is required, which is outside the scope of this work.

... use as a test case followed by results and discussion.

A. Performance Evaluation Framework

At the system-level design, we target complex applications usually described in C using high-level language features such as pointer references, and accuracy can only be validated on such applications. A performance evaluation environment, which can simulate CFA with reasonable architectural details for any CFP application, is therefore needed.

A good example of an architectural evaluation environment is the SimpleScalar tool set developed at Wisconsin [11]. It is designed to study new innovations in micro-architecture such as pipelining, branch prediction, out-of-order issue, etc. The environment provides a complete compiler tool chain that can compile a C application into a binary in the PISA instruction set. An instruction set simulator can then be used to simulate the binary, while collecting performance metrics of interest.

We consider a homogeneous CFA where each PE is implemented by a processor equipped with the PISA instruction set complemented by the context-flow instruction set defined in Figure 1. While each PE has its own private address space, an unused memory space segment of each PE, from address 0x00000000 to 0x03FFFFFF, is mapped to con- ...

... references, can still be used directly in the source code to access objects within the context. We also coded a cycle-accurate implementation of the tunnel-based on-chip network defined in Section II-A. The SimpleScalar annotation interface was used to introduce the context-flow instruction set to each PE. Further details can be found in [1].
Our simulator collects all performance statistics that we need to fully describe the system performance during simulation. These statistics are compared with those derived in our queueing network-based estimation model for validation purposes.
B. Test Cases
We pursue the validation of our model through real-life applications, namely a Cryptography Acceleration Processor and an MP3 Decoder.

B.1 Cryptography Acceleration Processor

Cryptography acceleration processors are becoming of central interest with the increase of SSL-based traffic over the internet. In our benchmark, we implemented a number of symmetric and asymmetric algorithms commonly used in SSL and IPSec. The implemented functions and the possible flows of packets are shown in Figure 4. Delays of processing methods were mainly obtained from actual RTL implementations [12]. The longest path of an input packet is to go through all three categories of processing, namely hashing (MD5 or SHA1), symmetric or private-key encryption (DESECB, DESCBC, 3DESECB, 3DESCBC, or RC4), and asymmetric or public-key encryption (RSA). Packets could skip hashing, public-key encryption, or both.

B.2 MPEG1-Layer III

MPEG1-Layer III, commonly referred to as MP3, is the de facto standard for high-quality, high-compression audio data. MP3 decoders became of interest after their popular use in portable multimedia devices. An overview of the decoder stages is presented in Figure 5. The highlighted stages were implemented in our testbench. Each stage is implemented in a single procedure processing one data granule at a time.
C. Results and Discussion
To carry out the experiments on the SSL accelerator, we implemented a packet generator that generates a workload, or packet mix, which uses various processing paths according to given distribution parameters. For the MP3 decoder, on the other hand, we used some of the input files distributed along with the standard MP3 software.
In the case of the SSL accelerator, for a given workload we used the different mappings described in Table I. For example, in mapping 1 we map the RSA procedure to PE0, MD5 to PE1, SHA1 to PE2, and so on. The corresponding simulation and estimation results are reported in Table III, and the average estimation errors for each mapping over all PEs are presented in Figure 6. In Table III, for each mapping we report the average simulated residence time and that estimated by our model for each PE (other measures, such as response time and utilization, can be easily derived from model inputs and reported results). For example, for mapping 1 of the SSL accelerator, the average residence time at PE0 was 10255.5 cycles, while the estimated value was 7027.6; the residence time at PE1 was 462.1 cycles, while the estimated value was 413.6, and so on. Figure 6 reports the average estimation error for each mapping over all PEs. For example, estimation error for mapping 1 over all PEs was I 1 Sa%. Similarly, for the MP3 decoder we tried the mappings described in Table II, and the corresponding estimation results are also reported in Table III and Figure 6.

From the reported results, we can see that the estimation results were accurate in some cases, and varied (either high or low) in others, but correctly reported the relative time values at different PEs with acceptable average error (Figure 6), taking only a few seconds as opposed to many simulation hours. It turned out that the way the solver handles multi-class networks through simple aggregation could potentially be improved. To illustrate this issue, mapping 3 of the SSL test case was intentionally configured such that procedures with largely different processing times were mapped to the same PEs. Also, the use of a single variability parameter to characterize the variability of an arrival process to a queue was not optimal.
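The per-PE comparison can be illustrated with the two residence-time pairs quoted above. The following Python sketch assumes that the estimation error is the absolute difference relative to the simulated value; the paper does not state the exact error definition, so this choice, together with the function names, is an assumption.

```python
def relative_error(simulated: float, estimated: float) -> float:
    """Absolute estimation error relative to the simulated value (assumed metric)."""
    return abs(estimated - simulated) / simulated


def mean_error(pairs) -> float:
    """Average the relative error over a list of (simulated, estimated) pairs."""
    errors = [relative_error(sim, est) for sim, est in pairs]
    return sum(errors) / len(errors)


# (simulated, estimated) residence times in cycles for mapping 1 of the SSL
# accelerator, PE0 and PE1 only, as quoted in the text.
quoted = [(10255.5, 7027.6), (462.1, 413.6)]

errors = [relative_error(sim, est) for sim, est in quoted]
for pe, err in enumerate(errors):
    print(f"PE{pe}: {100.0 * err:.1f}% error")
print(f"mean over the two quoted PEs: {100.0 * mean_error(quoted):.1f}%")
```

Under this definition the PE0 estimate is off by roughly 31% and the PE1 estimate by roughly 10%. The mean printed here covers only the two quoted PEs, so it is not expected to match the per-mapping average over all PEs shown in Figure 6.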
More advanced solutions were reported in [13], and further enhancements to queueing network solvers are being proposed in this active area of research, which is outside the scope of this work. However, as we observed in our experiments, the solver used still serves as a first-order approximation of the queueing time at each PE. For example, the solver does not report a waiting time in thousands of cycles while the actual value is only in hundreds, or vice versa.
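To give a feel for what a two-moment, first-order estimate of queueing time looks like, the following Python sketch computes the SQV of two hypothetical traces and feeds them into Kingman's approximation for a single-server GI/G/1 queue. Kingman's formula is a standard textbook approximation used here as a stand-in; it is not claimed to be the formula implemented by QNA or by the solver used in the experiments, and the traces and function names are hypothetical.

```python
from statistics import mean, pvariance


def sqv(samples) -> float:
    """Squared coefficient of variation of a trace: variance / mean**2."""
    mu = mean(samples)
    return pvariance(samples, mu) / (mu * mu)


def kingman_wait(arrival_rate: float, service_time: float,
                 ca2: float, cs2: float) -> float:
    """Kingman's approximation of the mean queueing delay in a GI/G/1 queue.

    ca2 and cs2 are the SQVs of the interarrival and service times.
    """
    rho = arrival_rate * service_time  # utilization of the single server
    if not 0.0 <= rho < 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    return (rho / (1.0 - rho)) * ((ca2 + cs2) / 2.0) * service_time


# Hypothetical traces, in cycles: perfectly periodic arrivals and a service
# time that alternates between a short and a long job.
interarrivals = [125.0] * 8
services = [50.0, 150.0] * 4

ca2 = sqv(interarrivals)  # periodic arrivals have no variability
cs2 = sqv(services)
wq = kingman_wait(1.0 / mean(interarrivals), mean(services), ca2, cs2)
print(f"ca2={ca2:.2f}, cs2={cs2:.2f}, estimated wait={wq:.1f} cycles")
```

Because the estimate depends only on the means and the two SQV values, two traces with the same first two moments but different shapes produce the same estimate, which is one way to see why a single variability parameter per process limits accuracy.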
Although higher accuracy levels would have been appreciated, our proposed model is still valid, and it gets as accurate, flexible, and powerful as queueing theory itself. Even at the reported accuracy measures, the model will provide important optimization directions as part of a system-level optimization framework.

... [IS], very little work has been carried out focusing on the performance modeling of SOC architectures [IS]. In the following we give a brief review of those efforts focusing on the performance modeling of network-on-chip. We start by first developing a taxonomy to help categorize these works.
- A performance model is dynamic, if it relies on the use of simulation. It is static otherwise. In general, a dynamic performance model is more accurate with respect to a specific input trace. A static performance model is faster to evaluate.
- A performance model is analytical, or architecture-aware, if the result depends not only on the characteristics of the application, but also on the architecture and how the application is mapped to the architecture.
- A performance model is automatic, if it can be automatically constructed from the application and architectural mapping. It is manual otherwise.
- A performance model is validated, if its accuracy has been confirmed by detailed simulation.

SIMULATED AND ESTIMATED RESIDENCE TIME
Stochastic Automata Networks (SANs) were used in [20] to analyze an application and derive probability distributions for various performance aspects of the target application. This model is static but not architecture-aware. Furthermore, the construction of a SAN network from an application is not yet an automated process.
A static performance model for network packet processing architectures was derived in [21] using Network Calculus results. The proposed approach uses deterministic bounds to describe the arrival and service processes of the target system. The model is also analytical, yet incomplete in the sense that conflicts over communication resources are ignored, which could easily result in large errors of the estimated measures. As a result, estimation results of the test cases were not validated.
The work in [22] proposes a hybrid static/dynamic performance analysis methodology for bus-based SOC communication architectures. Although the flow was validated and accurate estimates were reported, a speedup of only 2 over hardware/software co-simulation was obtained.
In this work we propose a performance model of a concrete SOC platform equipped with both an efficient on-chip network and a simple application programming model. The proposed model is static, architecture-aware, automatically evaluated, and can be easily incorporated in a system-level synthesis framework.
VI. CONCLUSION AND FURTHER WORK
In this work we proposed the use of queueing networks to derive analytical performance models for a novel SOC platform. We illustrated the model usability and accuracy with real-life applications using a cycle-accurate simulation environment. The model is as flexible and powerful as queueing theory. It can easily be used in exploring the design space of CFAs for system-level synthesis, which represents promising future work in this field.
After having a better understanding of the behavior of intercluster traffic on candidate second-level networks, such as torus or mesh [5], future work will consider the incorporation of the queueing theoretic model in a complete static performance analysis of larger systems. At that stage, the enhanced model will become an essential part of a complete system-level design exploration framework.
