A real-time application for the CS-2 by Hauser, R & Legrand, I
Presented at the HPCN 95 Conference, Milan, May 1995
Reiner Hauser & Iosif Legrand




March 1995
CN/95/5
CERN - COMPUTING AND NETWORKS DIVISION
This work was sponsored by ESPRIT project P7255 - GPMIMD II
A Real-Time Application for the CS-2*

Reiner Hauser¹ and Iosif Legrand²

¹ CERN, 1211 Geneva 23, Switzerland
² DESY, 15738 Zeuthen, Germany

Abstract. Future high energy experiments planned at CERN for the
next-generation collider LHC will generate amounts of data which are
orders of magnitude larger than those produced by existing systems. A
trigger system is used to reduce these data in real time to an amount
which can be stored on mass storage for later off-line analysis. Due to
the high bandwidth and short latency constraints, the solution will most
probably consist of a parallel processing implementation. We describe
an implementation of the second-level trigger on a commercial
multiprocessor system, the Meiko CS-2. Measurements show that such systems
are already able to reach about 25 to 30 % of the required performance
today, which makes them promising candidates for a complete solution
in the next few years.

1 Introduction

The Large Hadron Collider (LHC) is a projected accelerator at CERN which
will go into production in about 10 years. The high bunch crossing rate of 40
MHz and the resulting amount of data represent a challenge for data
acquisition and triggers using currently existing computers and networks. The large
amount of data is usually reduced by using several levels of triggers, which select
'interesting' events and discard uninteresting data.

The first-level trigger will be implemented in highly parallelized custom
hardware and is expected to reduce the event rate to 100 kHz and the total data
bandwidth to about 100 GByte/sec. For the second-level trigger a number of
possible architectures are under study, ranging from special-purpose systolic
architectures [1][2] to farms of commercial workstation processors connected via
a high-speed network switch like ATM, Fibre Channel or SCI. The goal is to
take advantage of the expected technological progress during the next five to ten
years in the field of processors and networks. Our goal is to investigate how closed
parallel systems like the CS-2 can compete with custom-composed architectures.

The two main experiments proposed for LHC, Atlas [3] and CMS, follow
slightly different approaches for implementing the second-level trigger system.
The CMS solution uses a large general-purpose processor farm which
incorporates both the second- and third-level trigger. All processors and data buffers are
connected by a high-bandwidth switching network. Atlas follows the approach
of dividing the trigger process into several clearly separated steps, where each
step parallelizes and reduces the bandwidth further. This allows them e.g. to use
different processor technologies and network switches in different parts of the
trigger system, depending on the required bandwidth/latency/processing power
etc. Fig. 1 shows the general structure of the Atlas trigger architecture.

[Fig. 1: Atlas trigger architecture; surviving column labels: Latency, Rate [Hz]]

2 The Atlas Second-Level Trigger System

The current study has been carried out with the Atlas second-level trigger
architecture in mind. To avoid the transfer of all of the raw data from the detectors
to the second-level trigger, the first-level trigger selects so-called Regions of
Interest (RoI), which denote areas where interesting features were found. The
algorithm for the second-level trigger has been structured into several parts to
take advantage of the inherent parallelism of the problem.

The so-called feature extraction tasks work on a single RoI of a given
subdetector. In our implementation all the features from different subdetectors belonging
to a given RoI are collected into the RoI task. Finally, the so-called global task
collects the data from all the RoIs to compute the final decision.

The feature extraction tasks receive the data from the detector for one RoI.
This is typically in the order of a few kBytes/event. The computed features
which are sent to the RoI task are usually only a few bytes. In the following we
assume that packets are smaller than 64 bytes/event. The complete raw data
are stored in buffers until the second-level trigger has made the decision whether
they should be transferred to the third-level trigger or discarded. The size required
for these buffers obviously depends on the latency of the whole trigger system,
i.e. the time it takes to make the final decision.
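
The dependence of the raw-data buffer size on the trigger latency can be made concrete with a rough estimate. The sketch below multiplies the data bandwidth after the first-level trigger (about 100 GByte/sec, as quoted in the introduction) by the time a decision is pending. The latency values and the function name are hypothetical and chosen only for illustration; the text does not state a target latency.

```python
def buffer_bytes(bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Data that accumulates while one second-level decision is pending."""
    return bandwidth_bytes_per_s * latency_s

L1_BANDWIDTH = 100e9  # bytes/s after the first-level trigger (from the text)
L1_RATE = 100e3       # events/s after the first-level trigger (from the text)

# Mean raw event size implied by the two figures above.
event_size = L1_BANDWIDTH / L1_RATE

# Hypothetical decision latencies; not values taken from the paper.
for latency in (1e-3, 10e-3):
    size = buffer_bytes(L1_BANDWIDTH, latency)
    print(f"latency {latency * 1e3:.0f} ms -> {size / 1e6:.0f} MByte "
          f"({size / event_size:.0f} events)")
```

In this simple model the required buffer space grows linearly with the latency, so every reduction in decision time translates directly into smaller buffers.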
Fig. 2 shows the structure of the corresponding program as a graph, where
the nodes represent computation tasks and the edges the data flow between the
different tasks.

[Fig. 2: program structure; surviving node labels: Subdetectors, RoI Task, Global Task]

3 The CS-2 at CERN

The Meiko CS-2 is a distributed memory machine, using SPARC CPUs and the
Solaris operating system on each processing node. The nodes are connected by
a high-speed, low-latency network. The system installed at CERN as part of the
ESPRIT project GPMIMD-II consists of 32 nodes, mostly with 32 MByte of
memory each and a processor clock speed of 40 MHz.

The distinguishing feature of the CS-2 is its internal network switch. Transfers
can be initiated by a normal user program without requiring any system call.
The actual data transfer is handled by a communication processor [5], which
allows the main CPU to continue with other operations.

The communication facilities can be used from high-level message passing
libraries like MPI [6], but there is also a low-level interface, called the Elan Widget
library. It has an application interface built on logical channels between
different processors. There is also a direct memory access (DMA) interface, which
allows e.g. data transfers to a given address on a remote processing node without
intervention on the receiver side.

Fig. 3 shows the results of some bandwidth measurements for DMA and
channel communication. For large packets, transfer rates of more than 40 MByte/sec
can be achieved. However, for small packets (less than 200 Bytes) the transfer
time is usually dominated by the startup latency of the network, which is in the
order of 10 µs on the lowest level. When using the channel interface the startup
latency increases to about 25 µs.
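
Why the startup latency dominates for small packets can be seen from a simple linear cost model: transfer time equals startup latency plus payload size divided by bandwidth. The sketch below is an illustration under stated assumptions, not a reproduction of the measurements: it uses 40 MByte/sec as the asymptotic bandwidth (the text reports "more than 40"), takes 1 MByte as 10^6 bytes, and applies the 10 µs and 25 µs startup latencies quoted above. The function name is hypothetical.

```python
def transfer_time_us(size_bytes: int, latency_us: float,
                     bandwidth_mbyte_s: float = 40.0) -> float:
    """Startup latency plus payload time; 1 MByte/sec equals 1 byte/us."""
    return latency_us + size_bytes / bandwidth_mbyte_s

DMA_LATENCY_US = 10.0      # lowest-level startup latency (from the text)
CHANNEL_LATENCY_US = 25.0  # channel interface startup latency (from the text)

for size in (64, 200, 4096, 65536):
    dma = transfer_time_us(size, DMA_LATENCY_US)
    chan = transfer_time_us(size, CHANNEL_LATENCY_US)
    print(f"{size:6d} bytes: DMA {dma:8.1f} us, channel {chan:8.1f} us, "
          f"latency share (DMA) {DMA_LATENCY_US / dma:5.1%}")
```

Under these assumptions a 64-byte packet spends most of its transfer time in startup latency, so for small packets the per-packet overhead matters more than the peak bandwidth.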
[Plot not recoverable; surviving labels: Channel Transfer Mode, Size (Bytes)]
Fig. 3. Bandwidth measurements on the CS-2

4 Implementation

We have implemented the global decision part of the second-level trigger
algorithm on the Meiko CS-2 multiprocessor system. In our setup each feature
extraction, RoI and global task is mapped to a single processor. The feature
extraction algorithms are not actually implemented in the nodes; instead the
programs generate random data and feed it into the rest of the system as fast as
possible. Since the communication channels include a flow control mechanism,
we have an easy way of measuring the event rate with which the whole system
can keep up.

The RoI task receives packets and puts them into a list until all the data
for a given RoI have been collected. Buffers are never copied when they are in
memory. Only pointers to them are passed around. When all expected packets
have arrived, the actual procedure for the RoI algorithm is called. This routine
typically executes in less than 10 µs. The results are then sent to the global task.

The global task is structured in a similar way. A central loop collects packets
from different RoIs, until all the data for a given event have been received. The
execution time for the global task is dependent on the number of RoIs for the
current event. It typically takes about 30 µs to perform its task.

The communication library of the CS-2 allows for asynchronous transfers.
All parts of the program take advantage of this property to overlap their
communication with their computation. They always initiate a request for sending
or receiving a packet, and then use polling to test for completion if necessary.

The communication part of the program has been implemented using the two
different methods described above: first with the virtual channel based method,
then with direct DMA transfers. The channel implementation is much more
natural for the programmer. Buffers can be allocated as needed and just sent to
the receiver node without caring about the buffers on the destination side. The
receiver may also choose an arbitrary buffer and issue an asynchronous receive
request, then continue with some other task.

The disadvantage of this scheme is that it requires several packet exchanges
between the communication processors to get the destination address and then
transfer the data to that buffer. Furthermore the communication library requires
that the programmer polls each outstanding transmission separately and does
one final function call to complete the transfer.

The DMA implementation uses a different approach: a large number of buffers
on the receiver side are preallocated, then a list of the addresses of these buffers
is sent to the sender node. The sender is allowed to transfer data to the receive
buffers without further permission and without waiting for an acknowledgement. Only
after it has used all the buffers does it have to wait for a new list of addresses
from the receiver. This reduces the overhead for a single packet transfer. The
overhead for the necessary synchronisation is amortized over the large number of
buffers sent without further intervention on the receiver side. On the receiver side
the processor just checks if anything new has arrived in the buffers (indicated
by a special flag in the packets).

5 Results

The system was run with varying values for the number of
subdetectors and the number of RoIs. These two parameters determine the total
number of processors needed (see Fig. 2). The maximum numbers were
determined by the current setup of the CS-2. Each basic task was assigned to a single
processor node. The measurements were done by sending one million events into
the system and computing the mean time per event. From this an event rate
can be calculated for the given setup.

The number of subdetectors varied from two to four, and the number of RoIs
from two to four. For the channel implementation, event rates vary from about
20 to 40 kHz, depending on the parameters and the communication method
used. As expected, the achievable rate decreases with an increasing number of
subdetectors/RoIs. Table 1 shows the results.

As one can see, the DMA implementation is usually faster than the channel
implementation. A more interesting point is that the speedup of the DMA
implementation over the channel version increases when the system as a whole is
getting larger. This indicates that the overhead of the channel library may not
be neglected and that the polling of a large number of incoming channels may
take up considerable time. There are two columns for the DMA
implementation, one with 512 buffers and one with 2048. As one can see, the performance
of the latter is usually better for small systems, while it decreases for large
systems. This is due to the increased communication which is necessary to send
the addresses for the buffer pool: for 2048 buffers about 8 kByte of data has to be
transferred to each sending processor.
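
The trade-off between the two buffer pool sizes can be illustrated with a small calculation. The sketch below assumes 4-byte buffer addresses, which is an assumption, although it is consistent with the quoted figure of about 8 kByte for 2048 buffers. It also assumes one packet per event from a given sender. The function names are hypothetical, and the model ignores everything except the address lists.

```python
import math

ADDRESS_BYTES = 4  # assumed address size; 2048 * 4 bytes = 8 kByte as in the text

def address_list_bytes(pool_size: int) -> int:
    """Size of one list of receive-buffer addresses sent to a sender."""
    return pool_size * ADDRESS_BYTES

def refills(n_packets: int, pool_size: int) -> int:
    """Number of address lists a sender needs in order to ship n_packets."""
    return math.ceil(n_packets / pool_size)

N_PACKETS = 1_000_000  # events per measurement run (from the text), one packet each

for pool in (512, 2048):
    print(f"pool {pool:4d}: list of {address_list_bytes(pool)} bytes, "
          f"{refills(N_PACKETS, pool)} synchronisations for {N_PACKETS} packets")
```

In this model the total address traffic per packet is the same for both pool sizes. What changes is how often a sender has to wait for a new list and how large each list is, and the larger list has to be delivered to every sending processor.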
[Table 1; columns: RoIs | Subdetectors | DMA (512 bufs) | DMA (2048 bufs) | Channel | Processors used]

6 Discussion

We have shown that it is possible to implement an algorithm for the global decision
part of the LHC second-level trigger on a commercial multiprocessor system, the
Meiko CS-2, achieving a throughput which is only a factor of three to five below
the required rate.

A new version of the communication processor, the ELAN-2, will be available
at the end of this year. The latency for DMA transfers will be reduced to less
than 3 µs. In the near future the CERN machine will be upgraded to 100 MHz
processor nodes, thereby doubling the processing performance. These two factors
together should increase the performance of the application by at least a factor
of two, bringing the system quite close to the required limit.

References

[1] J. Badier et al., IEEE Trans. Nucl. Sci. 40, 1993.
[2] D. Belosloudtsev et al., to appear in NIM, Feb 1995.
[3] Atlas - Technical Proposal for a General Purpose Experiment at the Large Hadron
    Collider at CERN, CERN/LHCC/94-43, 1994.
[4] CMS - The Compact Muon Solenoid - Technical Proposal, CERN/LHCC/94-38,
    1994.
[5] Communications Processor Overview, Meiko Computing Surface Documentation,
    1993.
[6] MPI: A Message-Passing Interface Standard, Version 1.0, 1994.
[7] Elan Widget Library, Meiko Computing Surface Documentation, 1993.
