Modeling large Ethernet networks for the ATLAS high level trigger system using parameterized models of switches and nodes by Golonka, Piotr et al.
Piotr Golonka, Krzysztof Korcyl, Frank Saka
Abstract: Large local area Ethernet networks are strong
candidates to connect data sources and processing nodes in
high energy physics experiments. In the high level trigger
system of the ATLAS LHC experiment, several GBytes/s of
data, distributed over 1700 buffers, have to be delivered to
around a thousand processing nodes. Due to the network
size, its performance and scalability can only be assessed by
modeling. To avoid lengthy simulation runs and to concentrate
only on characteristics important for network transfers,
the components of the system need to be parameterized.
The network performance depends on the traffic patterns
generated by processing nodes and on the switching capabilities
of the network; we therefore evaluated and modeled both
processing nodes and switches. We have developed a parameterized
model of a class of switches, where a limited set of parameters,
collected from measurements on real devices, is used to
model switching characteristics. Another set of simple
measurements is used to collect values for parameters used to
model processing nodes running the Linux operating system
and the TCP/IP communications protocol suite.
In this paper we present the set of parameters used in
the models together with the measuring procedures used to
calibrate them. The calibrated models are then used to model
small test-bed setups with random traffic to validate our
approach.
Keywords: Ethernet, switches, TCP, ATLAS
I. Introduction
The ATLAS detector is designed to investigate many
different physics processes at the Large Hadron Collider
(LHC) and is due to begin operation in 2005. Proton-proton
bunch crossings will occur at 40 MHz, of which interesting
events will be selected by the first level trigger (LVL1)
at a rate of 100 kHz. The high level trigger (HLT),
composed of level 2 (LVL2) and the Event Filter (EF), further
limits the rate of accepted events to tens of Hz, a rate
manageable for the tape/disk drives. To reduce bandwidth
requirements for data transfers, the LVL2 system will use
information from the LVL1 to read full granularity data
from a subset of detector buffers, typically 100 out of a
total of 1700, containing interesting signatures (Regions Of
Interest). These data have to be delivered to a member
of the LVL2 processor farm (a set of 550-600 commodity
PCs) for further processing. Events accepted by the
LVL2 (1 kHz) require that data from all detector buffers
are collected by an event builder into a PC belonging to a
subfarm for final processing.
P.G. is with CERN, on leave from the Faculty of Nuclear Physics
and Techniques, University of Mining and Metallurgy, Cracow,
Poland; e-mail: Piotr.Golonka@CERN.CH
K.K. is with CERN, on leave from the Institute of Nuclear Physics,
Cracow, Poland; e-mail: Krzysztof.Korcyl@CERN.CH
F.S. is with CERN, on leave from the Royal Holloway University of
London, Egham, UK; e-mail: Frank.Saka@CERN.CH
CERN, European Organisation for Nuclear Research, 1211 Geneva
23, Switzerland.
The network connecting 1700
buffers and around a thousand LVL2 processors and EF
collecting nodes has to provide a throughput of more than 6
GBytes/s. A possible architecture for such a
network using Ethernet technology has been presented in
[1]. A number of concentrating switches, aggregating data
from detector buffers connected via Fast Ethernet, will be
linked via Gigabit Ethernet uplinks to the central Gigabit
Ethernet switch. A similar concentration is foreseen
for the LVL2 processing PCs, which will be connected via
Fast Ethernet to concentrating switches with Gigabit
Ethernet uplinks to the central switch. It is expected
that more than a hundred concentrating switches will be
necessary to provide the required throughput. The size of
the network excludes building a prototype, so the performance
of such a large network can only be estimated by modeling. In
our approach to building a model of the whole ATLAS HLT
system we developed a parameterized model of switches
[1], [2]. In the first part of this paper we review the operation
of the model with recently added extensions. The extensions
were made to accommodate switches which split Ethernet
frames into small cells when transferring them internally.
In the course of our studies on the ATLAS HLT system,
the issue arose of whether a high level communication
protocol such as TCP/IP [4] could be used for data transfers.
This led us to design and develop a parameterized model
of the TCP/IP stack. In the second part of
this paper we present our approach to the parameterization of
the TCP behaviour relevant to operation in the HLT
system.
II. The parameterized model of Ethernet
switches.
The parameterized model reflects the hierarchical architecture
of the switch (ports are grouped into modules, modules
are connected by a backplane to provide inter-module
transfers) and implements the store-and-forward mode of
operation. Figure 1 presents the parameterized model of a switch
for inter-module communication together with
a full list of parameters (including parameters for
intra-module transfers). A full description of the parameters can be
found in [1]. The operation of the parameterized model is
based on calculations using parameters representing buffering
and transfer resources in the switch. The operation
in the switch can be modeled as input queuing.
P1 = Input buffer length (#frames)
P2 = Output buffer length (#frames)
P3 = Max to backplane throughput (MBytes/s)
P4 = Max from backplane throughput (MBytes/s)
P5 = Max throughput for intra-module (MBytes/s) (not shown)
P6 = Max backplane throughput (MBytes/s)
P7 = Inter-module transfer bandwidth (MBytes/s)
P8 = Intra-module transfer bandwidth (MBytes/s) (not shown)
P9 = Inter-module fixed overhead (not shown)
P10 = Intra-module fixed overhead (not shown)
Fig. 1. Switch parameterized model for inter-module communication
When
the frame arrives at the switch, a check is made to see
whether there are enough resources to buffer the frame (the
current count of frames buffered in the input buffer should not
exceed the parameter P1). If the result of the check is negative,
the frame is dropped. Once the frame is buffered in
the input buffer, the current count of buffered frames in
the source module is increased and the routing decision is
made. Depending on whether it is an inter-module or
intra-module transfer, the corresponding parameter P9 or P10
is used to model the fixed overhead time for taking the
routing decision. Currently there are 4 types of transfers:
inter-module unicast, inter-module multicast, intra-module
unicast and intra-module multicast. The routing decision
is followed by the resource calculations made using P3, P4
and P6, or P5, depending on the transfer type. For example,
an inter-module unicast transfer requires
resources for a single frame transfer from the input buffer
of the source module to the backplane (limited by P3),
transfer over the backplane (limited by P6) and transfer
from the backplane to the output buffer of the destination
module (limited by P4). The frame
transfer is seen as a request to provide a certain amount
of bandwidth needed to commence the transfer. For
inter-module transfers the requested bandwidth is represented
by the parameter P7, and for intra-module transfers
by the parameter P8. Frames currently being transferred
occupy some part of the throughput represented by parameters P3, P4
and P6 for inter-module transfers and P5 for
intra-module transfers. Together with the evaluation of the transfer
resources, another check is made to verify whether there is enough
buffering capacity in the output buffer of the destination
module. If the available throughput is greater than or equal to
the requested bandwidth, and there is buffering available,
then the frame can start to transfer. Newly inserted frames
reduce the available throughput by a fraction corresponding
to the parameter P7 or P8, depending on whether they
are inter-module or intra-module transfers. The current count of
buffered frames in the output buffer is also incremented.
Once the resources have been allocated, calculations are
made to obtain the resource occupancy times. These times
are calculated from parameters P7 and P8 and the
frame size: the larger the frame, the longer the
resource will be allocated. When the available throughput
is less than the requested bandwidth, the frame has to
wait until the necessary resources become available (usually
when another frame leaves the switch). If more than one
frame is waiting for resources, it is up to the input
buffer manager to decide which frame will be transferred
next. The input buffer manager may implement different
policies to make this decision: the frame waiting the longest
time, the highest priority frame, etc.
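The admission logic described above can be sketched as follows for the inter-module unicast case. This is a minimal illustration, not the authors' simulation code: the class and function names are hypothetical, a single source and destination module are assumed, and only the parameters P1, P2, P3, P4, P6 and P7 are represented.

```python
from dataclasses import dataclass


@dataclass
class SwitchParams:
    p1: int      # input buffer length (frames)
    p2: int      # output buffer length (frames)
    p3: float    # max to-backplane throughput (MBytes/s)
    p4: float    # max from-backplane throughput (MBytes/s)
    p6: float    # max backplane throughput (MBytes/s)
    p7: float    # inter-module transfer bandwidth (MBytes/s)


@dataclass
class SwitchState:
    in_frames: int = 0       # frames held in the source input buffer
    out_frames: int = 0      # frames held in the destination output buffer
    used_p3: float = 0.0     # throughput currently occupied on each resource
    used_p4: float = 0.0
    used_p6: float = 0.0


def accept_frame(p, s):
    """Buffer an arriving frame, or drop it if the input buffer is full."""
    if s.in_frames >= p.p1:
        return False
    s.in_frames += 1
    return True


def try_start_transfer(p, s):
    """Start an inter-module unicast transfer if resources allow it."""
    free = min(p.p3 - s.used_p3, p.p6 - s.used_p6, p.p4 - s.used_p4)
    if free < p.p7 or s.out_frames >= p.p2:
        return False  # the frame keeps waiting in the input buffer
    s.used_p3 += p.p7
    s.used_p6 += p.p7
    s.used_p4 += p.p7
    s.out_frames += 1
    return True


def occupancy_time_us(p, frame_bytes):
    """Resource occupancy time; 1 MByte/s equals 1 byte per microsecond."""
    return frame_bytes / p.p7
```

Releasing the resources when a frame reaches the output buffer, and the choice of which waiting frame to serve next, are left out of this sketch.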
Many modern switches split a frame received from the
network into small cells (typically 64 bytes) and use these cells
for inter-module transfers. It may happen that two or more
transfers originating from different source modules and
destined for a common destination module overlap, in the sense
that the link from the backplane to the destination module
is used to interleave cells from these transfers. As a
result the link bandwidth is shared by the transfers,
resulting in longer latencies: the more concurrent transfers
there are, the smaller the bandwidth available for each one
and the longer the latency. The same scenario applies to
two or more transfers originating from the same source
module and going to different destinations. In this case the
shared link is the one from the module to the backplane. In the
parameterized model the limitation on link throughput is
represented by the P3 and P4 parameters. When modeling
switches with shared links these parameters are not used:
the model always accepts a request for a transfer.
Any time a new transfer is added, the new share of bandwidth
for each current transfer is calculated and used to
recalculate the transfer completion times. Each time a transfer
completes, the remaining transfers share the link
bandwidth equally and new calculations of their completion
times are made.
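The equal-share recalculation for a shared link can be illustrated with the sketch below. It is an assumption-laden illustration rather than the model's actual code: the function name is hypothetical, the link bandwidth is given in bytes per microsecond, and bandwidth is redistributed only when a transfer starts or completes.

```python
import math


def shared_link_completions(transfers, link_bw):
    """Completion times on a link shared equally by concurrent transfers.

    transfers: list of (start_us, size_bytes); link_bw: bytes per microsecond.
    Returns the completion time of each transfer, in input order.
    """
    order = sorted(range(len(transfers)), key=lambda i: transfers[i][0])
    remaining = {}                      # transfer index -> bytes still to send
    done = [None] * len(transfers)
    now, k = 0.0, 0
    while k < len(order) or remaining:
        next_start = transfers[order[k]][0] if k < len(order) else math.inf
        if remaining:
            rate = link_bw / len(remaining)          # equal share per transfer
            first = min(remaining, key=remaining.get)
            finish = now + remaining[first] / rate
        else:
            rate, first, finish = 0.0, None, math.inf
        step_to = min(next_start, finish)
        for i in remaining:                          # progress made so far
            remaining[i] -= rate * (step_to - now)
        now = step_to
        if next_start <= finish:                     # a new transfer joins
            i = order[k]
            remaining[i] = float(transfers[i][1])
            k += 1
        else:                                        # a transfer completes
            done[first] = now
            del remaining[first]
    return done
```

For instance, two 1000-byte transfers starting together on a 100 bytes/µs link each receive half of the bandwidth and both complete after 20 µs, twice the time a single transfer would need.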
When the frame arrives at the output buffer of the
destination module, it frees the allocated transfer and buffering
resources in the input buffer of the source module. It
is then up to the output buffer manager to decide which
frame from the output buffer will be sent out next. Similarly
to the input buffer manager, the output
buffer manager can implement different policies when making
its decision. When the frame finally leaves the switch
via the Media Access Controller (MAC), the current count
of frames in the output buffer is decremented. The allocation
of resources for a multicast or broadcast transfer might
be different from that for a single frame transfer. The policy for
handling multicast and broadcast transfers is strongly
bound to the internal details of the switch, and we have not
found any generalizations there. Currently, the model creates
a copy of the multicast (broadcast) frame for each
remote module housing at least one destination port, and
each copy is treated in the same way as a unicast frame.
A. Measurements of the parameters for the switch model.
In the paper [1] we reported limitations on measurements
imposed by using PCs. Using programmable network interface
cards ("intelligent" NICs) [2] we were able to generate
traffic at Gigabit Ethernet line speed. Problems of saturating
switches with many ports were solved by FPGA-based
testers [2] capable of generating line speed traffic on
32 Fast Ethernet ports. The basic measurements used for
calibration of parameters are ping-pong and streaming.
• The ping-pong measurements: a client sends a variable
size request to a server, which replies with a response of the
same size. The round trip time of the frame is
measured and used to calculate the transfer bandwidth.
• The streaming measurements: with flow control disabled,
the client(s) generate frames continuously with a
fixed inter-frame time. By reducing the inter-frame time we
look for the corresponding bandwidth at which the switch starts
losing frames. This defines the limits of switch
throughput.
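One way to turn ping-pong results into a bandwidth figure is to fit a straight line to latency versus frame size, as sketched below. This is an assumed procedure for illustration: the text does not spell out the fitting method, the function name is hypothetical, and the sample data in the usage note are synthetic.

```python
def fit_latency_line(sizes_bytes, latencies_us):
    """Least-squares fit of latency = overhead + size / bandwidth.

    Returns (overhead_us, bandwidth), with bandwidth in bytes per
    microsecond, which is numerically equal to MBytes/s.
    """
    n = len(sizes_bytes)
    mean_x = sum(sizes_bytes) / n
    mean_y = sum(latencies_us) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_bytes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sizes_bytes, latencies_us))
    slope = sxy / sxx                    # microseconds per byte
    overhead = mean_y - slope * mean_x   # latency extrapolated to size zero
    return overhead, 1.0 / slope
```

With synthetic one-way latencies of 11, 15, 20 and 25 µs for frames of 100, 500, 1000 and 1500 bytes, the fit gives an overhead of about 10 µs and a bandwidth of about 100 MBytes/s.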
We show how results from basic measurements are used to
extract values for parameters in the table:
Setup of the measurement | What is measured | What can be learned from the measurements
Ping Pong
B. Comparison of parameterized model with measurements.
In Figure 2 we compare the latency distribution produced
by the parameterized model, using parameters collected
on a 48 port Fast Ethernet switch, to the latency
measured in the test setup with a real switch¹. The switch
under test uses small cells for inter-module transfers, as
described in section II. In the test setup we used a 32 port
Fast Ethernet tester generating traffic to random destinations,
with the inter-packet time picked randomly from a negative
exponential distribution with an average corresponding to
either 50% or 75% of maximal load, for two data sizes, 700
bytes and 1500 bytes respectively. The latency distribution
is presented as the fraction of frames not reaching their
destination within the time shown on the latency axis. The
biggest discrepancy for the 50% load can be seen for latencies
in the range between 400 and 440 µs. The model predicts
that one frame per 1000 will have a latency greater than 440
µs, whereas measurements show that 1 frame per 1000
arrives at the destination later than 420 µs (the error is still less
than 5% for one per mille of frames).
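The traffic generation and the way the latency distribution is presented can be sketched as follows. The helper names are hypothetical, the line rate of 100 Mbit/s corresponds to Fast Ethernet, and the load calculation is a simplification that ignores the preamble and the inter-frame gap.

```python
import random


def mean_gap_us(frame_bytes, load, line_rate_mbit_s=100.0):
    """Mean start-to-start frame spacing that produces the given load."""
    frame_time_us = frame_bytes * 8 / line_rate_mbit_s  # Mbit/s == bit/us
    return frame_time_us / load


def exponential_gaps(mean_us, count, seed=1):
    """Inter-packet times drawn from a negative exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_us) for _ in range(count)]


def fraction_later_than(latencies_us, threshold_us):
    """Fraction of frames not delivered within threshold_us."""
    late = sum(1 for x in latencies_us if x > threshold_us)
    return late / len(latencies_us)


def relative_error(model_value, measured_value):
    """Relative difference between a model value and a measured value."""
    return abs(model_value - measured_value) / measured_value
```

Under these assumptions a 700-byte frame at 50% load gives a mean spacing of 112 µs, and `relative_error(440, 420)` is roughly 0.048, consistent with the "less than 5%" figure quoted above.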
¹ Only FE ports were used.
32 FE ports (tester) through the T5 compact switch, neg. exp. inter-packet gap, uniform destinations
measurement: 700 bytes, 50% load
measurement: 1500 bytes, 75% load
model: 700 bytes, 50% load
model: 1500 bytes, 75% load
T5 compact model parameters:
P1 = P2 = 1000 frames
P3, P4: not used (modeling shared links to and from the backplane)
P5: NA (shared memory, frames not copied)
P6 = 1052.64 MB/s (8 * P7: no limit on backplane)
P7 = 131.58 MB/s
P8: NA (shared memory, frames not copied)
P9 = 7.32 us
P10 = 4.0 us
Fig. 2. Comparison between parameterized model and measurements
III. Parameterized model of nodes.
High level communication protocols may have a substantial
impact on the type of traffic and the performance of the
ATLAS HLT system. Protocol control messages may increase
the overall number of packets in the system, and the
protocols' algorithms may change the time distribution of
packets (traffic patterns). Moreover, serving a stack of
protocols requires additional computing power, and thus decreases
the CPU time available for other tasks.
We use the Linux implementation of TCP/IP for our studies
and model calibration, as it is likely to be used by ATLAS.
We tried to omit system dependent details and perform
our studies at a higher level of abstraction, thus making it
possible to model other implementations in the future.
A. Network I/O : stack of protocols.
We simplified the behaviour of the TCP stack into a single
logical block called "TCPBroker". It models the network
processing steps ranging from the operating system
application interface to the NIC. It is also able to model CPU
loading due to communication. We used [3] as the main
source of information about the TCP/IP stack.
Data and control flow for Linux network input, together
with the parameters we use in our model, are presented in
Figure 3. Note that the processing of an incoming
packet in a real system is split into two parts. When a
new packet arrives at the network card, an interrupt
service routine is executed. Then the protocol-related "bottom
part" routines are executed with high priority, but outside
the interrupt handler [5].
This two-step processing is modeled inside the TCPBroker
block. As we perform our modeling in the time domain
using discrete events, we calculate latencies and execute
time-driven event sequences using queues of messages. The
latency for a packet that comes from the network is
accumulated in steps, as the packet traverses the TCPBroker
structure.
The first stage models interrupt handler execution. It
increases the latency of each received packet by a
constant value. It also sends a request for high-priority
processing time to the CPU module (CPU usage modeling
[...] a queue of packets that wait for "bottom part" processing.
Fig. 3. Control and data flow for network input operation.
The second step of modeling simulates "bottom part"
processing, i.e. handling all aspects of the network protocol stack.
At this stage the latency of the processed packet is increased
again and CPU time is requested, according to a set of
parameters described in the next section. In a real system,
"bottom part" processing may still be interrupted
by a network interrupt. We model this behaviour as well:
when a packet arrives at the NIC while second step processing
is being simulated, additional latency is added to
the packet being processed in the second step, in order to model
the suspension of processing due to the interrupt.
The second step of processing has high priority, thus no data
may be passed downstream to lower priority routines (user
space/application) while any frame is pending for processing.
Because of this, we need to add another queue to our
model. When the second step processing of a frame is completed,
the frame is put into this queue, where it waits until all pending
frames have executed their "bottom part". In this case the
latency of the waiting frame is increased accordingly in our
model. When there are no more frames pending for second step
processing, all frames from the queue are passed to a model
of the application
(see section III-C).
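The two-step receive processing can be illustrated with a small time-stepped sketch. It rests on simplifying assumptions that are not taken from the paper: a single CPU, constant interrupt and "bottom part" costs per packet, arrival times given in increasing order, and a hypothetical function name.

```python
def deliver_times(arrivals_us, interrupt_us, bottom_us):
    """Time at which each packet is handed to the application model.

    Every arrival runs the interrupt handler, which suspends "bottom
    part" processing. Packets that finished their bottom part are held
    until no packet is pending, then released together.
    """
    pending = []                       # [packet index, remaining bottom work]
    held = []                          # finished packets awaiting release
    delivered = [None] * len(arrivals_us)
    now, k = 0.0, 0
    n = len(arrivals_us)
    while k < n or pending:
        next_arrival = arrivals_us[k] if k < n else float("inf")
        if not pending or next_arrival <= now:
            # First stage: interrupt handler adds a constant latency.
            now = max(now, next_arrival) + interrupt_us
            pending.append([k, bottom_us])
            k += 1
            continue
        finish = now + pending[0][1]
        if next_arrival < finish:
            # Bottom part is suspended by the next interrupt.
            pending[0][1] -= next_arrival - now
            now = next_arrival
            continue
        now = finish
        held.append(pending.pop(0)[0])
        if not pending:
            for i in held:
                delivered[i] = now
            held.clear()
    return delivered
```

With an interrupt cost of 2 µs and a bottom-part cost of 10 µs, packets arriving at 0 and 5 µs are both delivered at 24 µs: the first packet waits in the second queue until the bottom part of the second one has finished.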
B. TCP model parameters.
In our studies we need only a simple behavioural model of
TCP, as the ATLAS trigger network will be a local area
network. We decided to implement the following parts of
TCP in our model: the connection establishing phase, address
and TCP port demultiplexing, and splitting/reconstruction of
the data stream to/from packets.
A peculiarity of the TCP protocol for request-response type
traffic is the sending of acknowledgments for received data.
These acknowledgments may use either special, dataless
packets (we use the term "dataless ACK packets"), or may
be "piggybacked" on a data packet. The whole mechanism
was modeled. Neither the retransmission mechanism nor
the roundtrip-time estimation algorithms were implemented,
as they are not of great importance to ATLAS.
Fig. 4. Comparison of model and measurements for data latency in
a simple request-response test (horizontal axis: message length [bytes]).
We found that introducing the following parameters is sufficient to
model traffic patterns in request-response data transfers:
• TCPServiceTime, RcvThroughput: control the receive
process. When a packet arrives from the network, the corresponding
latency is calculated as:
$\mathit{TCPServiceTime} + \frac{\mathit{packetlength}}{\mathit{RcvThroughput}}$
• TCPRate, SndThroughput: control the send process. When
a certain amount of data is to be sent, it is split into [...]
• DatalessACKTime: time taken when an acknowledgment [...]
• InterruptServiceTime: time taken in the network interrupt
handler routine. It is accounted for every frame that is
received from the network.
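The receive-side parameters can be combined as in the sketch below. It reflects one reading of the parameter list, namely that the interrupt time is added to the receive latency of every frame, and the function names are hypothetical. The 1460-byte segment size is the segment boundary quoted in section III-D.

```python
SEGMENT_BYTES = 1460  # data bytes carried by one full Ethernet segment


def segment_sizes(message_bytes, segment_bytes=SEGMENT_BYTES):
    """Split a message into the data sizes of the packets that carry it."""
    full, rest = divmod(message_bytes, segment_bytes)
    return [segment_bytes] * full + ([rest] if rest else [])


def receive_latency_us(packet_bytes, tcp_service_time_us,
                       rcv_throughput, interrupt_service_time_us):
    """Latency added on reception of one packet.

    rcv_throughput is in bytes per microsecond (equal to MBytes/s).
    """
    return (interrupt_service_time_us
            + tcp_service_time_us
            + packet_bytes / rcv_throughput)
```

For example, `segment_sizes(3000)` yields two full segments and one of 80 bytes, and each of them would be charged its own interrupt and service time on reception.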
In Figure 4 we present a comparison of the results given by
our model with measurements for a simple request-response
test. Values of the parameters used in the simulation are presented
as well.
C. CPU and application modeling.
Apart from network traffic patterns, we also model CPU
utilization on computation nodes. Our measurements
indicate that a large amount of CPU time is consumed by
network communications.
In addition to TCPBroker, our model of a computation
node contains a simplified model of a CPU and a multi-tasking
operating system. One or more³ application models
assigned to a node may request processing time from the CPU.
The TCPBroker block may also send "interrupt" requests to
the CPU, as mentioned in previous sections, thus delaying
the application's requests.
² In fact this is the only parameter that is peculiar to TCP protocol.
³ Functionality for multiple applications sharing the same processor
will be implemented soon.
Fig. 5. Comparison of model and measurements for CPU load in a
simple request-response test.
When the amount of CPU time requested by the application
has been granted and has elapsed, the application model is
informed about this. When no application requests the CPU
and no interrupts are raised, the CPU is in the idle state. By
comparing non-idle time to total time for a CPU we can
estimate the average CPU load.
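The load estimate amounts to the ratio of non-idle time to total time. A minimal sketch, assuming the simulation records non-overlapping busy intervals (the function name and the interval representation are illustrative only):

```python
def average_cpu_load(busy_intervals_us, total_time_us):
    """Average CPU load as non-idle time divided by total time.

    busy_intervals_us: non-overlapping (start, end) pairs during which
    the CPU served an application request or an interrupt.
    """
    busy = sum(end - start for start, end in busy_intervals_us)
    return busy / total_time_us
```

For example, busy intervals of 10 µs and 5 µs within a 100 µs window give a load of 0.15.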
In Figure 5 we present a comparison of preliminary
results produced by the model with measurements for the
request-response test. Although the results produced by
the model do not precisely reflect the measurements, we can
observe the approximate value and the data length dependency of
the CPU load caused by network I/O. The reason for the
discrepancy is that our model of an application was very
simple: it used only one parameter to represent the time used
by the application. We stress that these results were obtained
using parameters related to data latency only (no operating
system related parameters, e.g. process switching time,
have been modeled).
D. Parameters measurement procedure.
The setup for the measurements consisted of two PCs
connected directly by Fast or Gigabit Ethernet. Both
were of the same type to ensure a symmetric setup:
a 400 MHz Intel Pentium II with 128 MBytes of main
memory. We used Linux kernel versions 2.4.0-test10 and
2.4.2 and performed our tests using IP version 4 and
standard Ethernet frames without VLAN tags. A variation of
the ping-pong test described in section II-A
was performed: client and server exchanged a
message of variable size over an already established TCP
connection. The round trip time was measured and divided by two
to estimate the latency for sending a packet one way. In order to
measure CPU usage, another thread, the "CPU burner", was
run on the client machine. Its only task was to increment a
counter. By comparing the counter values for an "idle"
(i.e. not communicating) and a "busy" (running the ping-pong
test) client, we found the fraction of CPU time used
for communication. To trace the activity of a real system
performing TCP/IP communication we used a kernel profiling
facility called Linux Trace Toolkit [5]. We found this tool
very useful for looking inside the sequences of events that occur on
a machine during transmit and receive.
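The two derived quantities of this procedure can be written down as follows. The sketch assumes that the "idle" and "busy" counter values were collected over time windows of equal length, and the function names are hypothetical.

```python
def one_way_latency_us(round_trip_us):
    """One-way latency estimated as half of the measured round trip time."""
    return round_trip_us / 2.0


def communication_cpu_fraction(idle_count, busy_count):
    """Fraction of CPU time used for communication.

    idle_count: "CPU burner" counter on a non-communicating client.
    busy_count: counter while the ping-pong test is running.
    Both counts are assumed to cover time windows of equal length.
    """
    return 1.0 - busy_count / idle_count
```

If the counter reaches 1,000,000 on the idle client but only 600,000 while the test is running, about 40% of the CPU time went to communication.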
TCPServiceTime and TCPRate have been measured using
the latency plot (Fig. 4) for the zero length message
(fixed overhead). The proportion between TCPServiceTime
and TCPRate was found using the LTT tool. RcvThroughput
and SndThroughput contribute to the slope of the latency
plot for messages shorter than one Ethernet segment.
RcvThroughput can be measured from the slope of the latency
plot for messages spanning two Ethernet segments
(SndThroughput contributes there as a fixed overhead).
InterruptServiceTime can be measured from the height of the
step at the Ethernet segment boundary in the same plot.
The different heights of the steps at 1460 and 2920 can be used
to measure the DatalessACKTime (reception of a second
segment makes TCP produce and send the dataless ACK)
[6].
Another tool we used was tcpdump, a
standard utility program found in many UNIX distributions.
We used it to understand the traffic patterns generated
at the client and at the server nodes.
IV. Conclusion.
In this paper we presented our work to develop a simulation
tool to be used in modeling the ATLAS HLT system.
A parameterized model of switches has been verified on
a number of commodity off the shelf products. The model
has recently been extended to include devices using small
cells to transfer Ethernet frames internally. Good agreement
has been reached between measurements and results
from the modeling of small setups with multiple switches.
Modeling of networking nodes is still in progress. However,
simulation of the TCP protocol stack seems to adequately
represent measurements for simple applications,
including CPU load. This leads us to the conclusion that our
simplified, parameterized model of the protocol stack performs
quite well. No scalability issues for node modeling were
assessed in this paper; this important issue will be addressed in
the future. For further references and slides from the
conference poster, please see [6].
Acknowledgments
We would like to acknowledge contributions to our work
from R.W. Dobinson, C. Meirosu, M. LeVine and R.F. Beuran.
References
[1] Dobinson, R.W.; Korcyl, K.; Saka, F.; Modeling large
Ethernet networks using parametrized switches. OPNETWORK
2000 conference, Washington DC, 28 Aug - 1 Sep 2000,
http://nicewww.cern.ch/korcyl/opnetwork2000/op2k paper.pdf
[2] Dobinson, R.W.; Haas, S.; Korcyl, K.; Le Vine, M.; Lokier,
J.; Martin, B.; Meirosu, C.; Saka, F.; Vella, K.; Testing and
Modeling Ethernet Switches and Networks for Use in ATLAS
High-level Triggers. CERN-OPEN-2000-310; Pres. at: Workshop
on Network-Based Data Acquisition and Event-Building, Lyon,
France, 20 Oct 2000
[3] Stevens, W.R., TCP/IP Illustrated, vol. 1, 2. Addison-Wesley
Publishing Company, 1994. ISBN 0-201-63346-9
[4] Postel, J.B., ed. 1981c. Transmission Control Protocol. RFC 793
[5] Details and code for the Linux Trace Toolkit can be found at
http://www.opersys.com/LTT/
[6] http://cern.ch/Piotr.Golonka/conferences/RT2001.
