This paper presents a method for static performance analysis of SoC architectures. The method is based on a network calculus theory known as LR servers. This network calculus is extended and applied to make it support SoC performance analysis. Performance requirements of subsystems are elegantly captured as traffic flows and associated latency constraints. The SoC infrastructure is modeled as a set of LR servers to validate that the worst-case delays in handling the traffic flows meet the latency constraints. A multi-channel DVB-T set-top box case study demonstrates the power of the method. Key architecture choices, such as schedule or interconnect variant, can be varied easily to support exploration of architecture options.
Introduction
At the heart of consumer products for mobile communications and digital entertainment is typically a complex System-on-Chip (SoC) that performs a broad range of functions. These functions are implemented by subsystems on the SoC, where a subsystem may contain one or more IP blocks such as programmable processors and/or dedicated hardware blocks. The subsystems are connected by a SoC infrastructure consisting of interconnect and memory.
A SoC architect has many design options for a SoC architecture that need to be explored. Simulation-based approaches to architecture exploration typically require large efforts for building models of the SoC architectures and have long execution times. Also they do not provide guarantees that performance requirements are met under all circumstances, e.g. for all video content and for all possible interactions in the SoC infrastructure.
We propose a method for static performance analysis that can be used for early exploration of SoC architectures. The method can be used for worst-case analysis, so that the required performance can be guaranteed. Models can be constructed quickly and evaluated with short execution times. The method supports the SoC architect in gaining insight into the design problem at hand by providing valuable feedback for the metrics of interest, such as the required size of a queue in the SoC infrastructure. It helps the architect to set budgets for execution times, bandwidths, latencies, etc. that need to be met in downstream design activities to satisfy performance requirements.
Our method for static performance analysis is based on network calculus theory [1] [2] . Performance requirements of subsystems are captured as a set of traffic flows with associated latency constraints. The SoC infrastructure is modeled as a set of interconnected network elements. We then verify for the specified traffic flows whether the worst case delays incurred by the SoC infrastructure satisfy the latency constraints associated with these traffic flows.
The aim of this paper is to show that an extended form of the "Latency-Rate Servers" network calculus [2] can be applied for fast exploration of SoC architectures. We extend and apply the basic network calculus theory in order to make it support the performance analysis of SoC architectures. We show that the extended network calculus is sufficiently expressive to capture SoC architectures in concise models. We present a realistic design case of industrial complexity to show how different architecture options, such as schedule or interconnect variant, can be modeled and evaluated early, enabling the SoC architect to identify the valid options that meet the required deadlines.
In section 2 we introduce a network calculus theory known as Latency-Rate servers. In section 3 we discuss related work. In section 4 we present our performance analysis method. In section 5 we discuss a multi-channel DVB-T set-top box case study and we present the results of the analysis method. Finally, in section 6 we draw conclusions.
Latency-Rate servers
In network calculus, traffic can be characterized by an upper bound. In its basic form, this upper bound is a monotonically increasing function $c(t) = \sigma + \rho \cdot t$, where $\sigma$ represents the burstiness constraint in words and $\rho$ the rate of the traffic stream in words/s. In Figure 1, $b_1$ is a traffic stream on a particular link during a particular time interval. $b_1$ is upper bounded by $c_1$. Network calculus provides an elegant way to describe bounds on traffic for a sliding window of arbitrary size with just two parameters (i.e. $(\sigma, \rho)$).
For modeling and analyzing communication networks, Stiliadis and Varma [2], [3] have introduced a network element called Latency-Rate server (LR server). In Figure 2 a model of the internal structure of an LR server is shown. More than one producer transmits traffic (represented by $\sigma_i$ and $\rho_i$) to an LR server. This traffic is temporarily stored in queues. A multiplexer combines the traffic according to an arbitration policy. Finally, the traffic is demultiplexed and each packet is sent to a consumer.
The behavior of an LR server is determined by the latency ($\Theta_i$) and the allocated service rate in words/s ($\rho_i$) for input traffic stream $i$. An LR server guarantees an output service rate $\rho_i$, a time period $\Theta_i$ after receiving packets of stream $i$ (see Figure 1). So, if $b_1$ is between $c_1$ and $c_2$ (i.e. $\rho \cdot t$), then $b_2$ has a lower bound of $c_3$ (i.e. $\rho \cdot (t - \Theta)$).
The delay of a packet of stream $i$ (i.e. $D_i$) for a chain of $m$ LR servers is upper bounded by (see Figure 1)
$$D_i \le \frac{\sigma_i}{\rho_i} + \sum_{j=1}^{m} \Theta_i^j \tag{1}$$
The maximum backlog in words of the $k$-th LR server in a chain of LR servers for stream $i$ (i.e. $Q_i^k$) is upper bounded by (see Figure 1)
$$Q_i^k \le \sigma_i + \rho_i \cdot \sum_{j=1}^{k} \Theta_i^j \tag{2}$$
This determines the required queue size for the $k$-th LR server.
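As a minimal sketch of how these two bounds can be evaluated, the Python fragment below assumes the standard LR-server results of Stiliadis and Varma, i.e. a delay bound of σ/ρ plus the sum of all server latencies, and a backlog bound of σ plus ρ times the latencies accumulated up to server k. The function names and the numbers in the usage note are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def chain_delay_bound(sigma: float, rho: float, latencies: Sequence[float]) -> float:
    """Worst-case packet delay (s) through a chain of LR servers.

    Assumed form: D <= sigma / rho + sum of the per-server latencies.
    sigma is in words, rho in words/s, latencies in seconds.
    """
    if rho <= 0:
        raise ValueError("rho must be positive")
    return sigma / rho + sum(latencies)


def backlog_bound(sigma: float, rho: float, latencies: Sequence[float], k: int) -> float:
    """Worst-case backlog (words) in the k-th server of the chain (k is 1-based).

    Assumed form: Q_k <= sigma + rho * (sum of the latencies of servers 1..k).
    """
    if not 1 <= k <= len(latencies):
        raise ValueError("k must index a server in the chain")
    return sigma + rho * sum(latencies[:k])
```

For example, a stream with σ = 64 words and ρ = 10^6 words/s crossing two servers with latencies of 2 µs and 5 µs would have a delay bound of about 71 µs, and the second server would need a queue of about 71 words.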
Related work
Cruz pioneered a network calculus [1], originally intended for computer networks. He developed a mathematical framework for deriving worst-case bounds on the performance (e.g. the delay). He assumed that $R(t)$ represents the instantaneous rate in words/s of a stream flowing on a link at time $t$. Then, this traffic can be upper bounded by a monotonically increasing function $c(t)$, such that for all $t_2 \ge t_1$
$$\int_{t_1}^{t_2} R(t)\,dt \le c(t_2 - t_1),$$
where $c(t) = \sigma + \rho \cdot t$. Stiliadis and Varma extended this work and introduced the LR server as an abstract network element.
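The sliding-window character of this bound can be illustrated with a short Python sketch. It assumes a discretized trace (words observed per time slot of fixed duration `dt`), which is a simplification of the continuous-time definition; the function names are hypothetical.

```python
from itertools import accumulate
from typing import Sequence


def min_burstiness(arrivals: Sequence[float], dt: float, rho: float) -> float:
    """Smallest sigma such that every window of the trace satisfies
    words_in_window <= sigma + rho * window_length.

    arrivals[i] is the number of words seen in slot i; each slot lasts dt seconds.
    """
    prefix = [0.0] + list(accumulate(arrivals))
    worst = 0.0
    n = len(arrivals)
    for start in range(n):
        for end in range(start + 1, n + 1):
            words = prefix[end] - prefix[start]
            worst = max(worst, words - rho * (end - start) * dt)
    return worst


def conforms(arrivals: Sequence[float], dt: float, sigma: float, rho: float) -> bool:
    """True if the sampled trace respects the (sigma, rho) bound on all windows."""
    return min_burstiness(arrivals, dt, rho) <= sigma + 1e-9
```

The quadratic scan over all windows is chosen for clarity; it is adequate for short traces but would need a smarter algorithm for long ones.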
Chakraborty and Thiele define bounds for arrival curves and service curves of traffic streams in [4] . They present a task-level model that captures some properties of stream processing applications and supports investigation of task interactions. Jersak et al. [5] also focus on the task interaction. Their event streams are somewhat similar to the traffic model we employ.
In our paper, the focus is on the traffic between the subsystems and the SoC infrastructure, where we assume that the deadlines have been specified for the subsystems and do not depend on the interaction between subsystems. The techniques in [4] may be used for deriving such deadlines. Our techniques can then be applied to derive the actual execution times of tasks in order to check that the deadlines can be met.
In [6] Henriksson applies network calculus for the analysis of memory access latencies. He supports request-response streams and pipeline degrees to obtain tight bounds in the analysis. We adopt these concepts and derive expressions for total delay.
Performance Modeling and Analysis
The method of Stiliadis is used as the foundation for our performance analysis method. The (σ, ρ) traffic model is used to characterize traffic and LR servers are used for modeling SoC infrastructures. This method was not originally devised for SoC performance analysis. We observed that the method of Stiliadis does not support:
• the total delay for handling data consisting of multiple packets starting from the first word of the first packet until the last word of the last packet. Stiliadis only derives delays for individual packets where delay is defined as the time between last-word-in and last-word-out.
• the total delay for sending a request to a subsystem and receiving a response to that request.
• the total delay for sending requests, where the number of outstanding requests is limited.
• a model of a memory system.
• SoC arbitration policies (e.g. TDMA).
In [6] the concepts of request-response stream and pipeline degree are introduced and a DRAM model is derived. The contribution of this paper consists of deriving expressions for the total delays with support for three traffic characteristics (dropping impulse assumption, request-response stream and pipeline degree) and a SoC arbitration policy (TDMA).
For the performance analysis method, we have made the following assumptions.
• There are no dependencies between the different traffic streams of a subsystem.
• The input of a subsystem is according to specified traffic characteristics.
• The output of a subsystem can always be delivered (e.g. buffers are large enough to prevent overflow).
Dropping impulse assumption
Stiliadis assumes that a packet has been serviced when its last word has left the server. Then, the arrivals and departures of packets are considered as impulses (impulse assumption). Therefore, he uses last-word-in, last-word-out to determine the delay. Because the total delay is required for sending packets from Subsystem 1 via LR servers to Subsystem 2, first-word-in, last-word-out is required. Then, the arrival time of the first packet in the first LR server is part of the total delay. Therefore, a term $\frac{L}{C}$ is added to the total delay of Equation (1) [7].
Request-response streams
Subsystems interact via the SoC infrastructure. Stiliadis only determined the maximum delay for a packet traveling from one subsystem to another. In some cases, a subsystem sends requests via the SoC infrastructure to another subsystem and waits for responses from that subsystem. This type of traffic stream is called a "request-response stream" [6] . An example is a load request and a corresponding load response from a memory system. We extend the method of Stiliadis with request-response streams and derive an equation for the delay of such streams.
Assume the situation of Figure 3. Subsystem 1 produces requests and sends these requests via a chain of $m$ LR servers, with a total latency of $\sum_{j=1}^{m} \Theta^j_{req}$, to Subsystem 2. Subsystem 2 produces responses to the requests and sends these responses via a chain of $m$ LR servers, with a total latency of $\sum_{j=1}^{m} \Theta^j_{resp}$, back to Subsystem 1. Furthermore, Subsystem 2 requires $D_{proc}$ s to produce a response after a request is received. Finally, for each request exactly one response is generated. Assume that the request flow $b_1$ is characterized by $\sigma_{req}$, $\rho_{req}$ and packet size $L_{req}$ words. Furthermore, the response flow $b_3$ is characterized by $\sigma_{resp}$, $\rho_{resp}$ and $L_{resp}$. Finally, assume that If the request flow is bounded by $c_1$ and $c_2$ (see Figure 3), the LR servers guarantee that these requests arrive at Subsystem 2 ( is upper bounded by [7] (2)).
Pipeline degree
For most subsystems in SoC architectures, the number of outstanding requests is limited. In [6] the pipeline degree of a traffic stream is defined as the maximum number of outstanding requests. A higher pipeline degree can decrease the maximum delay of a traffic stream, but increases the complexity of the subsystem. We introduce the pipeline degree into the method of Stiliadis and derive an equation for the delay.
Assume that $n$ is the pipeline degree. Then, the $(n + k)$-th request cannot be sent earlier than the $k$-th response has been received. Assume the same situation as in Figure 3, but now the number of outstanding requests is limited. Then, the maximum delay for $x$ words of requests can be split into periods of sending $n$ requests and waiting until the first response arrives (i.e. $D_1$, see Figure 3). Therefore [7],
A maximum of $n$ request packets of $L_{req}$ words (i.e. $n \cdot L_{req}$ words) can be outstanding. Then, there are The backlog of the LR servers can still be determined by the original equation of Stiliadis (Equation (2)). $\sigma_i$ is now lower bounded by [1] [7]
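The effect of a limited number of outstanding requests can be illustrated with a small Python simulation. This is a sketch under simplifying assumptions that are not part of the analysis in [7]: every request sees the same constant round-trip time `rtt` (request path, processing and response path together), and consecutive requests can be issued at most once every `gap` seconds. It only encodes the rule that request n + k cannot be sent before response k has been received.

```python
def completion_time(num_requests: int, n: int, gap: float, rtt: float) -> float:
    """Time at which the last response arrives, for pipeline degree n.

    Hypothetical model: constant round-trip time rtt per request and a
    minimum spacing gap between two consecutive request transmissions.
    """
    if n < 1:
        raise ValueError("pipeline degree must be at least 1")
    send: list[float] = []  # send[i]: time request i is issued
    recv: list[float] = []  # recv[i]: time response i is received
    for i in range(num_requests):
        earliest = send[-1] + gap if send else 0.0
        if i >= n:
            # Request i must wait until response i - n has come back.
            earliest = max(earliest, recv[i - n])
        send.append(earliest)
        recv.append(earliest + rtt)
    return recv[-1] if recv else 0.0
```

With four requests, `gap = 1` and `rtt = 10`, the completion time in this toy model drops from 40 for n = 1 to 21 for n = 2 and 13 for n = 4, which matches the qualitative statement that a higher pipeline degree can decrease the maximum delay.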
DRAM memory model
A DRAM memory system has some specific characteristics that are important for the performance analysis of SoC architectures. The method of Stiliadis assumes that the processing delay of a packet is proportional to the size of the packet. In case of a memory system, this proportionality does not hold [8] . This can be taken care of by a so-called packet stretcher as described in [6] . Using the model of a DRAM memory system as described in [6] , the DRAM controller can be modeled as an LR server.
TDMA
A useful SoC arbitration policy is TDMA. TDMA uses a periodic schedule. In each round, the input streams of an LR server using TDMA are served in a Round Robin fashion. The maximum number of packets of stream $i$ (i.e. $w_i$) that the LR server can serve in a round is determined in advance. Let us assume that $V$ streams share an LR server with TDMA as arbitration policy and that the packets of stream $i$ have a size $L_i$. Then, the maximum amount of service stream $i$ receives from the LR server in one round (i.e. $\phi_i$) is
$$\phi_i = w_i \cdot L_i$$
The total amount of service the streams receive in one round from the LR server is called a "frame". The size of a frame (i.e. $F$) is
$$F = \sum_{i=1}^{V} \phi_i$$
The latency of a packet is the maximum amount of time between the moment the first word of that packet arrives in the LR server and the moment that the last word of the packet has left the LR server. Assume the LR server has a maximum rate of $C$ words/s. In the worst-case scenario the packet has to wait $\frac{F - \phi_i}{C}$ s before the stream is serviced. The packet itself is serviced in $\frac{L_i}{C}$ s. Therefore, the latency of stream $i$ for TDMA is [7]
$$\Theta_i = \frac{F - \phi_i + L_i}{C}$$
The LR server with TDMA allocates a maximum rate of $\frac{\phi_i}{F} \cdot C$ to stream $i$. In a similar way the $\Theta_i$ can be derived for other arbitration policies (see [7]).
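The TDMA parameters can be computed with a few lines of Python. The sketch below assumes that the per-round service is the number of slots times the packet size, that the frame is the sum of these services, and that the worst-case wait before service equals the service granted to all other streams in one round; the last point is an assumption made for this illustration. The function name and the example values are hypothetical.

```python
from typing import Sequence


def tdma_parameters(slots: Sequence[int], packet_sizes: Sequence[float], capacity: float):
    """Return (phi, frame, latency, rate) for streams sharing a TDMA LR server.

    slots[i]        : packets of stream i served per round (w_i)
    packet_sizes[i] : packet size of stream i in words (L_i)
    capacity        : maximum server rate in words/s (C)
    """
    if len(slots) != len(packet_sizes):
        raise ValueError("slots and packet_sizes must have the same length")
    phi = [w * size for w, size in zip(slots, packet_sizes)]  # service per round
    frame = sum(phi)                                          # frame size F
    # Assumed latency: wait for all other streams, then transfer one packet.
    latency = [(frame - p + size) / capacity for p, size in zip(phi, packet_sizes)]
    rate = [p / frame * capacity for p in phi]                # allocated rate
    return phi, frame, latency, rate
```

For three streams with `slots = [2, 1, 1]`, `packet_sizes = [32, 32, 64]` and `capacity = 1e8`, the frame is 160 words, the first stream is allocated 4 · 10^7 words/s and its latency in this model is 1.28 µs.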
Multi-channel DVB-T set-top box case study
A multi-channel DVB-T set-top box case study is used to illustrate the power of the performance analysis method. We analyze the performance of several schedule and interconnect variants.
System description
Digital video signals are received by an antenna (see Figure 4 ) and sent via the Radio Front-End (RF) to the DVB-T Channel Decoders (CDs). Four DVB-T CDs are used to decode four different channels. Two decoded channels are stored, two decoded channels are processed by H.264 decoders. The output of the H.264 decoders can be used for dual screen or dual window functionality. The scope of this case study is indicated by the dotted shape in Figure 4 .
A possible implementation is shown in Figure 5 . The four DVB-T CDs are combined on one subsystem. To temporarily store and exchange data between subsystems, an external DRAM memory system (indicated by DRAM controller and DRAM) is used. The DRAM controller is modeled as an LR server, using TDMA as arbitration policy. Each subsystem is connected via a private link to the DRAM controller. The output of the DRAM is sent back to the subsystems via a bus. The burstiness of the input and the output of the DRAM controller is constrained by using regulators (indicated by "R") (see [9] ). 
Schedule variants
The first step is to characterize all traffic streams between the subsystems and the memory system. One of the traffic streams is the traffic of the DVB-T CD to perform the decoding. The four DVB-T CDs are mapped onto one processor and are active alternately. A DVB-T CD outputs an interleaved stream of so-called OFDM symbols of 20 kB. In each active period of a DVB-T CD, two OFDM symbols have to be loaded from the DRAM into on-chip buffers ($a_i$ and $b_i$). During an active period, a new OFDM symbol is created in an on-chip buffer ($c_i$) which has to be written to the DRAM. One option is to load the OFDM symbols when the previous DVB-T CD is active (see Figure 6). This means
An alternative schedule variant is to divide the active period of a DVB-T CD into two equal parts. In the first part, the two OFDM symbols are loaded from the DRAM; in the second part the produced OFDM symbol is written to the DRAM. Now, both operations have a deadline of $\frac{1}{8}$ OFDM period. Therefore, the rates of this traffic are higher compared to Schedule 1. This variant is called Schedule 2.
An overview of the traffic streams can be found in Table 1. All traffic is to/from the DRAM. The σ of a traffic stream can be calculated by Equation (6). Note that H.264 also needs DRAM access for accessing private data.
A Mathematica model has been created to execute the performance analysis. The input consists of values of σ, ρ, L and n for each traffic stream and a model of the communication infrastructure using LR servers. Then, Equations (2), (3), (4) and (5) are used to calculate the maximum delay per traffic stream and the required queue size per LR server. Different TDMA-wheels, DRAM frequencies and pipeline degrees are analyzed. Only results where the maximum delay for each traffic stream meets the deadline are used. Per memory frequency and per maximum pipeline degree the best solution (i.e. the smallest queue size required for the LR server and the regulators) is selected.
The results are shown for Schedule 1 and 2 using the private links interconnect in Figures 7 and 8. Each pipeline degree has a unique color. Analyzing the results of Schedule 1, we observe the following.
• For frequencies smaller than 150 MHz, no solution has been found that guarantees all deadlines.
• By increasing the frequency, it is possible to meet all deadlines with a lower pipeline degree and smaller queues. If packets are processed faster, fewer packets have to be in flight simultaneously.
• The queue size is monotonic in the pipeline degree and the frequency.
• If the frequency gets higher, there is less benefit from reducing the pipeline degree.
The deadlines for reading and writing the OFDM symbols are stricter for Schedule 2 compared to Schedule 1. Therefore, a higher frequency and a higher pipeline degree are required to meet all deadlines (see Figure 8).
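The selection step of the exploration (discard every configuration that misses a deadline, then keep the smallest queue per frequency and pipeline degree) can be sketched in Python as follows. This is not the Mathematica model used for the case study; the `Candidate` record, its field names and the way delays are supplied are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Candidate:
    """One evaluated design point (hypothetical record layout)."""
    frequency_mhz: float
    pipeline_degree: int
    tdma_wheel: str                # label of the TDMA schedule that was tried
    max_delays: Dict[str, float]   # worst-case delay per traffic stream (s)
    queue_words: float             # total queue size required (words)


def select_best(candidates: Iterable[Candidate],
                deadlines: Dict[str, float]) -> Dict[Tuple[float, int], Candidate]:
    """Keep, per (frequency, pipeline degree), the feasible candidate
    with the smallest required queue size."""
    best: Dict[Tuple[float, int], Candidate] = {}
    for cand in candidates:
        # A candidate is feasible only if every stream meets its deadline.
        if any(cand.max_delays[s] > deadlines[s] for s in deadlines):
            continue
        key = (cand.frequency_mhz, cand.pipeline_degree)
        if key not in best or cand.queue_words < best[key].queue_words:
            best[key] = cand
    return best
```

A (frequency, pipeline degree) pair that is absent from the returned dictionary corresponds to a point for which no TDMA wheel met all deadlines.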
Bus interconnect variant
An alternative interconnect is analyzed in combination with Schedule 1. In a bus-based interconnect variant (see Figure 9), all traffic to and from the memory system is transmitted via one bus. Before transmitting data over the bus, a request has to be sent to the bus arbiter. The requests are regulated to constrain their burstiness. The bus arbiter (modeled as an LR server with TDMA) determines the order on the bus. The requests arriving at the DRAM controller are processed in a First-Come First-Served fashion. Using this interconnect variant, we get the results of Figure 10. With this bus-based interconnect the traffic to and from the memory system shares the same wires. Therefore, the latencies of the traffic streams increase. To compensate for this, the frequency and the pipeline degree have to be significantly higher compared to the architecture with private links to meet all deadlines.
Packet size
Until now, the read and write actions of the memory system use blocks of 128 bytes. An alternative is to use blocks of 64 bytes. Then, the read and write actions have more overhead in the memory system. The results are shown in Figure 11 , using private links interconnect and Schedule 1. Due to the extra overhead in the memory system, a higher frequency and a higher pipeline degree are required compared to using blocks of 128 bytes to meet all deadlines. In cases where all deadlines are met for a particular frequency and pipeline degree, it is better to use blocks of 64 bytes than of 128 bytes, because substantially smaller queues are required.
Conclusion
We have shown that an extended form of the Latency-Rate Server network calculus can be applied for fast exploration of SoC architectures. Performance requirements of subsystems are captured as a set of traffic flows, using a powerful and elegant traffic model, and associated latency constraints. After modeling the SoC infrastructure as a set of LR servers, it can be verified whether the worst-case delays incurred by the SoC infrastructure satisfy the latency constraints of the traffic flows. Several extensions, such as request-response streams, pipeline degree, and TDMA scheduling, were used to make the network calculus support SoC performance analysis.
With a multi-channel DVB-T set-top box case study we demonstrated that models of SoC architectures can be built and evaluated quickly. The modeling time of a variant of the case study was a couple of hours and the execution time of the model was less than one second on a standard PC. The case study illustrated that key architecture choices, such as schedule or interconnect variant, can be varied easily to support exploration of architecture options. The impact on the required frequency, queue size and pipeline degree could be evaluated, showing e.g. that the cheaper bus interconnect requires a higher frequency than the private links interconnect. We therefore conclude that the proposed method for static performance analysis can support a SoC architect in gaining early insight into the design problem at hand and in quickly identifying the most promising design options.
