MAXIMUM THROUGHPUT PERIOD IDENTIFICATION FOR SOC PERFORMANCE ANALYSIS by Peddireddy, Shravya et al.
   Shravya Peddireddy* et al. 
  (IJITR) INTERNATIONAL JOURNAL OF INNOVATIVE TECHNOLOGY AND RESEARCH 
Volume No.3, Issue No.6, October - November 2015, 2529 – 2533. 
2320 –5547 @ 2013-2015 http://www.ijitr.com All rights Reserved.  Page | 2529 
Maximum Throughput Period Identification 
for SoC Performance Analysis 
SHRAVYA PEDDIREDDY  
ECE Department  
JNTU, Kakinada 
VIJAYA RAM SREENI 
AMD R&D Center India Pvt. Ltd. 
Hyderabad, India  
E.V.NARAYANA 
Assistant Professor  
ECE Department JNTU, Kakinada 
Abstract— Performance projections are very important for any System on Chip (SoC) design. How fast 
these projections can be verified, and how accurately, depends on the quality of the workloads and the test 
bench. Because workloads vary in nature and evolve with architectural changes, identifying the right 
throughput period is a unique challenge, as it determines the achieved bandwidth of a workload. This 
paper presents the design of a standard algorithm that determines the valid throughput period for any 
kind of workload, independent of architectural changes. 
Index Terms— SoC Performance Verification, Throughput Period, Bandwidth 
I. INTRODUCTION 
The objective of SoC Performance Verification is 
to validate the key performance metrics at the 
pre-silicon level to ensure that the system performance 
specifications are met. During this process, the 
performance metrics such as Bandwidth (BW) 
calculated during Register Transfer Level (RTL) 
based simulations are compared with the 
theoretical expectations. This analysis helps in 
identifying bottlenecks and architectural drawbacks 
of the design which limit the overall system 
performance. 
The RTL-based simulations use directed tests 
that target a particular BW-intensive data path 
and present synthetic traffic that mimics the 
workload from various functional blocks (or 
clients) to the memory, i.e., Dynamic Random 
Access Memory (DRAM). Finding the right time 
interval (a part of the workload time) for 
measuring the workload BW (in GB/s) is 
important for reliable performance projections. 
This time interval is called the Throughput (TP) period. 
During the TP period, a memory-bound workload 
(one whose BW demand exceeds the BW supplied 
by DRAM) exhibits its true demand. Such a 
workload reaches this reliable state only once 
all the clients have started interacting 
with the memory. Figure 1 depicts a scenario 
where multiple clients (C1, C2, C3 and C4) 
access DRAM via the Memory Controller 
(MC) at certain times based on the workload 
demand. The individual client graphs show whether 
a client is active (issuing requests to DRAM) or not. 
Figure 2 illustrates what the DRAM activity looks 
like with this client traffic. 
 
Fig. 1.  Client Activity 
 
Fig. 2.  DRAM Activity 
Modern SoC architectures have removed the 
Command Processor (CP) interface to the MC and 
instead enable the clients to fetch the 
PM4 commands directly (non-workload transactions). 
These non-workload transactions appear 
as client transactions alongside the actual 
workload transactions, which can mislead the 
TP period calculation. Figure 3 below shows the 
non-workload and workload transactions per client 
and the idle periods (no-activity intervals). These 
idle periods should not be considered in the 
performance measurements. 
 
Fig. 3.  Client Transactions 
Performance measurement of workloads with Virtual 
Memory (VM) or Address Translation Cache (ATC) 
enabled is one of the key performance 
evaluations of SoC designs. VM/ATC 
performance measurements are done for both the 
cold (L2 Cache Miss) and warm (L2 Cache Hit) 
periods to understand the impact of each independently. 
The same workload is run twice to produce a 
cold and a warm scenario. Figure 4 below shows the 
cold and warm intervals. 
 
Fig. 4.  VM/ATC Scenario 
To determine a valid TP period for any kind 
of workload, a standard algorithm is needed 
that works irrespective 
of SoC architectural changes. The following 
sections describe two algorithms, Legacy and MAX, 
for estimating the TP period used in 
performance projections. 
II. LEGACY THROUGHPUT PERIOD ALGORITHM 
In this approach, the TP period is the time 
interval between the latest 
starting client and the earliest ending client 
accessing the DRAM. Figure 5 depicts an ideal 
DRAM access scenario with clients C1, 
C2, C3, C4, C5 and C6 on the Y-axis and their activity 
along the X-axis. Client C4 is the latest starting client (at 
simulation time Ts), whereas client C2 is the earliest 
ending client (at simulation time Te). As described 
in Equation 1 below, the Legacy algorithm 
takes the TP period to be the difference between these 
simulation times. 
Legacy TP period = Te – Ts (1) 
 
Fig. 5.  Ideal Case 
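The Legacy calculation of Equation 1 can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the function name legacy_tp_period, the representation of each client by its first and last request times, and the example times are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical input: client name -> (first request time, last request time), in ns.
ClientSpans = Dict[str, Tuple[float, float]]


def legacy_tp_period(spans: ClientSpans) -> Tuple[float, float, float]:
    """Return (Ts, Te, Te - Ts) as in Equation 1."""
    ts = max(start for start, _ in spans.values())  # latest starting client
    te = min(end for _, end in spans.values())      # earliest ending client
    if te <= ts:
        raise ValueError("no interval in which all clients are active")
    return ts, te, te - ts


# Illustrative times only: C4 starts last and C2 ends first.
clients = {
    "C1": (0.0, 100.0),
    "C2": (10.0, 80.0),
    "C3": (5.0, 95.0),
    "C4": (20.0, 90.0),
}
ts, te, period = legacy_tp_period(clients)  # ts = 20.0, te = 80.0, period = 60.0
```

Because only the first and last request of each client are consulted, any idle gap that falls between Ts and Te is silently counted as part of the period.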
Figure 6 below shows the expected DRAM 
activity over time. The peak activity occurs during 
the TP period. 
 
Fig. 6.  DRAM Activity 
As mentioned earlier, however, the workload presented 
by the clients may not be continuous and may 
comprise both workload and non-workload 
transactions along with idle periods. The 
drawback of the Legacy algorithm is that it 
includes idle periods in the calculated TP period: 
a client's first or last transaction may be a 
non-workload transaction, so the window between the 
latest start and the earliest end can span the idle gap 
that separates non-workload from workload traffic. 
Figure 7 depicts the Legacy TP 
period with an idle period included, leading to 
an inaccurate BW calculation. In this example, client C2 is the latest 
starting client whereas client C5 is the earliest 
ending client. 
 
Fig. 7.  Legacy TP period 
Moreover, in a VM/ATC-enabled scenario, the 
Legacy algorithm fails to identify the cold and warm 
periods separately, and the calculated TP period 
includes non-workload transactions along with idle 
periods, as illustrated in Figure 8 below. 
 
Fig. 8.  Legacy TP period for VM/ATC Test 
III. MAXIMUM (MAX) THROUGHPUT PERIOD ALGORITHM 
In this approach, the workload is interpreted as a 
combination of multiple valid activity regions that 
are separated by no-activity regions (idle 
periods). The potential activity regions are 
identified, and the longest one is 
chosen as the recommended TP period for the 
workload. 
The number of outstanding requests to DRAM is 
tracked per client and used as the basis for identifying the 
idle periods. When all the clients involved in the 
workload have become idle, that point is marked as 
the end of the current activity period. The next 
request from any of the clients starts the 
next activity period. Once an activity period is 
identified, the “Latest start” and the 
“Earliest end” times within it are chosen as the start and end times of the 
activity period. The longest of all the 
activity periods is chosen as the throughput period. 
In the case of VM/ATC tests, the MAX TP period is 
chosen as the cold period and the second-longest 
activity period is chosen as the warm period. 
The warm period is expected to be shorter than the cold period 
because the warm period has no L2 miss 
requests. Figure 9 shows the MAX TP period 
with non-workload transactions and idle periods 
excluded. As described in Equation 2 below, the 
MAX TP period is the difference between the end time (Tme) 
and the start time (Tms) of the longest activity period: 
MAX TP period = Tme – Tms (2) 
 
Fig. 9.  MAX TP period 
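The MAX procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that the per-client outstanding-request counts have already been reduced to busy intervals (spans during which a client has at least one outstanding request), that only clients active inside an activity region contribute to its "latest start" and "earliest end", and that a region in which those two times do not leave a positive window is skipped. The names activity_regions and max_tp_periods and the example times are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_ns, end_ns) with >= 1 outstanding request


def activity_regions(busy: Dict[str, List[Interval]]) -> List[Interval]:
    """Merge the busy intervals of all clients.

    The gaps between the merged regions are the idle periods, i.e. the
    times at which every client has zero outstanding requests.
    """
    regions: List[Interval] = []
    for start, end in sorted(iv for ivs in busy.values() for iv in ivs):
        if regions and start <= regions[-1][1]:
            # Overlaps (or touches) the current region: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            # An idle gap precedes this interval: a new region begins.
            regions.append((start, end))
    return regions


def max_tp_periods(busy: Dict[str, List[Interval]]) -> List[Interval]:
    """Return one (Tms, Tme) candidate per activity region, longest first."""
    windows: List[Interval] = []
    for r_start, r_end in activity_regions(busy):
        firsts, lasts = [], []
        for ivs in busy.values():
            inside = [(s, e) for s, e in ivs if s >= r_start and e <= r_end]
            if inside:  # assumption: ignore clients that are idle in this region
                firsts.append(min(s for s, _ in inside))
                lasts.append(max(e for _, e in inside))
        t_ms, t_me = max(firsts), min(lasts)  # latest start, earliest end
        if t_me > t_ms:
            windows.append((t_ms, t_me))
    return sorted(windows, key=lambda w: w[1] - w[0], reverse=True)


# Illustrative trace: a short non-workload burst, an idle gap, then the workload.
busy = {
    "C1": [(0.0, 10.0), (100.0, 400.0)],
    "C2": [(2.0, 8.0), (120.0, 380.0)],
    "C3": [(110.0, 390.0)],
}
t_ms, t_me = max_tp_periods(busy)[0]  # (120.0, 380.0)
```

In this example the non-workload burst forms its own short region (0 to 10 ns) and is outranked by the workload region, so the idle gap between them never enters the selected period.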
In case of VM/ATC tests, the MAX algorithm 
identifies cold and warm intervals separately as 
illustrated in Figure 10. 
 
Fig. 10.  MAX TP period for VM/ATC Test 
Equation 3 and Equation 4 show the calculated 
cold and warm MAX TP periods respectively. 
MAX TP period (Cold) = Tmec – Tmsc (3)  
MAX TP period (Warm) = Tmew – Tmsw (4) 
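The cold and warm selection behind Equations 3 and 4 can be sketched as a ranking of the candidate activity periods by duration. This is a minimal sketch, not the authors' implementation. The function name cold_warm_periods and the example windows are hypothetical, and the check that the cold period precedes the warm period is an added assumption based on the same workload being run twice.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start_ns, end_ns) of one activity period


def cold_warm_periods(windows: List[Window]) -> Tuple[Window, Window]:
    """Pick the longest window as cold and the second longest as warm."""
    if len(windows) < 2:
        raise ValueError("a VM/ATC test needs at least two activity periods")
    ranked = sorted(windows, key=lambda w: w[1] - w[0], reverse=True)
    cold, warm = ranked[0], ranked[1]
    # Assumption: the first run of the workload is the cold one, so a warm
    # window that starts earlier than the cold window signals a bad trace.
    if warm[0] < cold[0]:
        raise ValueError("warm period starts before cold period")
    return cold, warm


# Illustrative windows: non-workload burst, cold run, warm run.
windows = [(10.0, 15.0), (100.0, 300.0), (320.0, 480.0)]
cold, warm = cold_warm_periods(windows)
cold_period = cold[1] - cold[0]  # Tmec - Tmsc = 200.0
warm_period = warm[1] - warm[0]  # Tmew - Tmsw = 160.0
```

Ranking by duration automatically satisfies the expectation that the warm period is no longer than the cold period.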
IV. EXPERIMENTAL RESULTS AND ANALYSIS 
Once we obtain the desired TP period, the BW 
Utilization is computed by taking into account 
the number of requests within the TP period and 
the request size as shown in Equation 5 below. 
BW = (Number of Requests in the TP period * Request Size) / TP period (5) 
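Equation 5 can be illustrated with a short Python sketch. The request count, the 64-byte request size and the two period lengths below are hypothetical values chosen for easy arithmetic, not measurements from this work. With the TP period in nanoseconds and the request size in bytes, the quotient is directly in GB/s, taking 1 GB as 10^9 bytes.

```python
def bandwidth_gbps(num_requests: int, request_size_bytes: int, tp_period_ns: float) -> float:
    """Equation 5: bytes moved in the TP period divided by its length.

    One byte per nanosecond equals 10^9 bytes per second, i.e. 1 GB/s.
    """
    if tp_period_ns <= 0:
        raise ValueError("TP period must be positive")
    return num_requests * request_size_bytes / tp_period_ns


# Hypothetical workload: 50,000 requests of 64 bytes each (3,200,000 bytes).
bw_long = bandwidth_gbps(50_000, 64, 200_000.0)   # 16.0 GB/s, period padded by idle time
bw_short = bandwidth_gbps(50_000, 64, 128_000.0)  # 25.0 GB/s, idle time excluded
```

The same traffic yields a lower BW figure when idle time inflates the denominator, which is why the choice of TP period matters for the reported utilization.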
As shown in Fig. 11 below, the Legacy algorithm 
takes the non-workload transactions into account, so 
the no-activity regions become part of 
the calculated throughput period, resulting in incorrect 
performance numbers. 
 
Fig. 11.  Legacy v/s MAX TP period (Non-VM Test) 
The MAX algorithm also sees the 
non-workload transactions, but it treats the non-workload activity 
as a standalone activity period 
because of the idle period between the non-workload 
and workload transactions. Since the workload 
activity period is much longer than the non-workload 
activity period, the non-workload activity 
period is not chosen as the throughput 
period. Table I compares the 
TP periods identified by the Legacy and MAX 
algorithms for the Non-VM test. The 
BW utilization improves from 18.5 GB/s (62%) with the Legacy 
algorithm to 27.8 GB/s (93%) with the MAX algorithm. 
  
TABLE I.  LEGACY V/S MAX FOR NON-VM TEST 
                                             Parameters
Traffic Type                                 Algorithm  Start Time (ns)  End Time (ns)  Duration (ns)  BW Utilization (GB/s)
Texture Read/Write (or) SDMA Read/Write (a)  Legacy     846096.262       1073468.375    227372.113     18.5 (62%)
                                             Max        927864.362       1063468.375    135604.013     27.8 (93%)
a. The traffic type corresponds to when a Texture 
client is performing memory Read/Write operation. 
Figure 12 depicts the TP periods identified by the 
Legacy and MAX algorithms for the VM/ATC test. 
MAX identifies the cold and warm periods separately, whereas 
Legacy treats the whole run as a single 
period. 
 
Fig. 12.  Legacy v/s MAX TP period (VM Test) 
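Table II shows the improvement in BW utilization 
when the MAX algorithm is used for the VM/ATC test: 
16.4 GB/s (55%) for the single Legacy period, versus 
20.9 GB/s (70%) for the cold period and 27.2 GB/s (91%) 
for the warm period identified by MAX. 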
TABLE II.  LEGACY V/S MAX FOR VM TEST 
                                       Parameters
Traffic Type                           Algorithm   Start Time (ns)  End Time (ns)  Duration (ns)  BW Utilization (GB/s)
VM test: Texture Read/Color Write (b)  Legacy      677388.426       1217170.011    539781.585     16.4 (55%)
                                       Max (Cold)  877388.426       1063468.375    167825.924     20.9 (70%)
                                       Max (Warm)  1065588.047      1217146.680    151558.633     27.2 (91%)
b. The traffic type corresponds to when a Texture 
client is performing memory Read or a Color client 
is performing memory Write. 
V. CONCLUSION 
In this paper, we have described two algorithms 
for identifying the TP period for 
workload BW measurement. The MAX algorithm 
proves more effective than the 
Legacy algorithm in identifying the TP period, 
as is evident from the experimental results 
shown above. The MAX algorithm helps 
eliminate false alarms raised by low BW 
numbers that result from inaccurate TP period 
calculation. Moreover, the MAX algorithm can be 
used irrespective of SoC architectural changes. 
VI. ACKNOWLEDGMENT 
The authors would like to thank their manager, 
Kalyan Kumar Goje, for his technical guidance and 
support. 
