The aim of this work is to provide a framework for simulating and optimizing the processor farm that will perform event building, on-line third level triggering, and complete event reconstruction for the future HERA-B experiment. A discrete event, process oriented simulation developed in concurrent µC++ is used to model the farm nodes, which run under multi-tasking constraints, together with the different types of switching elements and digital signal processors that are interconnected to distribute the data through the system. A graphic interface to the simulation, which allows features to be monitored on-line and trace files to be analyzed, provides a powerful development tool for evaluating and designing parallel processing architectures. Control software and data flow protocols for event building and dynamic processor allocation are presented for two architectural models.
Introduction
The high event and data rates of future high energy physics experiments require complex on-line parallel processing systems with large I/O bandwidth capabilities. The design of such dedicated distributed processing systems requires detailed simulations which must be able to describe many processing nodes running complex real-time data processing tasks and which are interconnected by a large bandwidth communication network.
The first step in such simulations is the design of a model that captures the dynamic behavior of the system. We are studying asynchronous distributed data processing systems for which the real processing time depends strongly on the input data. A multi-thread approach to modelling such systems is both close to reality and well suited to the object oriented paradigm, and it gives the simulation program a structure in which different hardware architectures and software protocols can be tested conveniently.
Simulation of Distributed Systems
Modelling complex processes which occur concurrently and have to be time-correlated by message passing may be achieved by a process oriented, discrete event simulation. A multi-thread approach is well suited to describe the complex programs which will be used for such on-line data processing architectures. The basic simulation has been implemented in µC++, a concurrent object-oriented language [1] designed to support parallel programming models on uni-processor or shared memory multiprocessor systems. A process-oriented simulator must be able to create active objects, each having an independent thread and a program status attribute for parallel synchronization. A scheduling mechanism maintains a list of every threaded object in the system and controls the execution of these active objects and the way they interact with each other through message passing and simulation primitives.
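The scheduling idea described above can be illustrated with a minimal sketch. It assumes Python generators as a stand-in for µC++ tasks; the names `Scheduler`, `Mailbox`, `source` and `node`, and all numerical values, are hypothetical and not taken from the actual framework. A process yields a number to wait for that amount of simulated time, or yields a mailbox to block until a message arrives.

```python
import heapq
from collections import deque
from itertools import count


class Scheduler:
    """Holds simulated time and the time-ordered list of runnable processes."""

    def __init__(self):
        self.now = 0.0
        self._queue = []      # entries: (wake-up time, sequence number, process)
        self._seq = count()   # sequence number breaks ties between equal times

    def spawn(self, process, delay=0.0):
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), process))

    def run(self):
        while self._queue:
            self.now, _, process = heapq.heappop(self._queue)
            try:
                request = next(process)       # run the process up to its next yield
            except StopIteration:
                continue                      # process has finished
            if isinstance(request, Mailbox):
                request.block(process)        # wait for a message
            else:
                self.spawn(process, request)  # wait for `request` time units


class Mailbox:
    """Message queue on which a single receiving process can block."""

    def __init__(self, sched):
        self.sched = sched
        self.messages = deque()
        self.waiting = None

    def send(self, message):
        self.messages.append(message)
        if self.waiting is not None:          # wake the blocked receiver
            self.sched.spawn(self.waiting)
            self.waiting = None

    def block(self, process):
        if self.messages:
            self.sched.spawn(process)         # a message is already available
        else:
            self.waiting = process


def source(box, n_events, period):
    """Active object that emits one event every `period` time units."""
    for event_id in range(n_events):
        yield period
        box.send(event_id)


def node(sched, box, n_events, processing_time, log):
    """Active object that receives events and processes them one by one."""
    for _ in range(n_events):
        yield box                             # block until an event arrives
        event_id = box.messages.popleft()
        yield processing_time                 # delay statement for the computation
        log.append((event_id, sched.now))


def simulate():
    sched, log = Scheduler(), []
    box = Mailbox(sched)
    sched.spawn(source(box, n_events=3, period=5.0))
    sched.spawn(node(sched, box, n_events=3, processing_time=2.0, log=log))
    sched.run()
    return log
```

With these illustrative values, `simulate()` returns `[(0, 7.0), (1, 12.0), (2, 17.0)]`: each event is sent at a multiple of 5 time units and completes 2 time units later.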
Inheritance was used throughout the design of the active objects necessary for the simulation, and appropriate specific methods were implemented. Two main types of active objects are used to simulate distributed data processing architectures. The "hardware" objects are used mainly to describe the time constraints related to the I/O capabilities (bandwidth, latency) of the processors. The "software" objects are C-like programs which follow the data processing algorithms. In practice such a model of the program describes the computational time for each function using appropriate delay statements. It requires a pointer to the "hardware" object on which it is assumed to run, in order to access the resources and the I/O methods of the hardware model. This simulation scheme is well suited to the object oriented paradigm and is close to reality, so that different architectural models can easily be evaluated for the same data processing tasks.
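The split between "hardware" and "software" objects might look as follows. This is only a sketch under stated assumptions: the class names, the simple latency-plus-size-over-bandwidth timing model, and all numbers are hypothetical, and simulated time is merely accumulated rather than scheduled.

```python
class Hardware:
    """Hypothetical hardware model: describes only I/O timing."""

    def __init__(self, bandwidth_bytes_per_s, latency_s):
        self.bandwidth = bandwidth_bytes_per_s
        self.latency = latency_s

    def transfer_time(self, n_bytes):
        # Assumed timing model: fixed latency plus size over bandwidth.
        return self.latency + n_bytes / self.bandwidth


class Software:
    """Program model that accumulates simulated time on a given hardware."""

    def __init__(self, hardware):
        self.hardware = hardware   # the "pointer" to the hardware object
        self.elapsed = 0.0

    def delay(self, seconds):
        self.elapsed += seconds    # delay statement

    def read_input(self, n_bytes):
        self.delay(self.hardware.transfer_time(n_bytes))


class TriggerTask(Software):
    """Derived program: read one event, then compute for a given time."""

    def process_event(self, n_bytes, compute_s):
        self.read_input(n_bytes)
        self.delay(compute_s)
        return self.elapsed


# The same program evaluated on two illustrative hardware models.
slow_link = Hardware(bandwidth_bytes_per_s=1e6, latency_s=1e-3)
fast_link = Hardware(bandwidth_bytes_per_s=1e7, latency_s=1e-4)
t_slow = TriggerTask(slow_link).process_event(n_bytes=1000, compute_s=0.5)
t_fast = TriggerTask(fast_link).process_event(n_bytes=1000, compute_s=0.5)
```

Because the program only talks to its hardware through `transfer_time`, swapping the architecture does not require any change to the program model.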
We used such "active" objects to model general purpose processors or digital signal processors (DSPs) running various programs. The process oriented approach described above is, however, not efficient for modelling large switches, for which an event oriented simulation is more suitable. In our simulation, the switching elements are therefore described by "passive" finite state systems, where the states of the links in the switching elements are controlled and updated by the more complex active "hardware" objects. In this way the total number of threads in the simulation program is reduced, allowing for a more efficient execution.
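A passive switching element of this kind can be sketched as a plain state table without a thread of its own. The class below is a hypothetical illustration, not the model used in the simulation: each output link is either free or owned by one input, and the active objects are assumed to call `connect` and `release` themselves.

```python
class CrossbarSwitch:
    """Passive finite state model of a crossbar: no thread, only link states."""

    def __init__(self, n_ports):
        # State of each output link: None if free, else the owning input port.
        self.output_owner = [None] * n_ports

    def connect(self, input_port, output_port):
        """Called by an active object; returns False if the output is busy."""
        if self.output_owner[output_port] is None:
            self.output_owner[output_port] = input_port
            return True
        return False

    def release(self, input_port, output_port):
        """Free the output link, but only if this input actually owns it."""
        if self.output_owner[output_port] == input_port:
            self.output_owner[output_port] = None


switch = CrossbarSwitch(n_ports=4)
first = switch.connect(0, 2)    # input 0 takes output 2
second = switch.connect(1, 2)   # input 1 finds output 2 busy
switch.release(0, 2)
third = switch.connect(1, 2)    # now the link is free again
```

Here `first` and `third` are `True` and `second` is `False`; the cost of the switch in the simulation is a list update rather than an extra thread.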
To re-create the input process, the simulation has to sample random variables from a given distribution for every event feature. To represent the stochastic nature of these features, C++ classes are used in the simulation to generate random numbers from theoretical distributions and from evaluated histograms. The statistical analysis of the results requires other auxiliary tools, which are included in this simulation environment.
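Sampling from an evaluated histogram can be done, for example, by inverting the cumulative bin contents. The sketch below is one common technique and is not claimed to be the method used in the C++ classes of the simulation; the class name, the uniform spreading inside a bin, and the example histogram are assumptions.

```python
import bisect
import random
from itertools import accumulate


class HistogramSampler:
    """Draws values distributed according to a histogram (inverse CDF method)."""

    def __init__(self, bin_edges, counts, seed=None):
        if len(bin_edges) != len(counts) + 1:
            raise ValueError("need one more bin edge than bin counts")
        self.edges = list(bin_edges)
        self.cumulative = list(accumulate(counts))   # running sum of contents
        self.rng = random.Random(seed)

    def sample(self):
        # Pick a bin with probability proportional to its content.
        u = self.rng.random() * self.cumulative[-1]
        i = bisect.bisect_right(self.cumulative, u)
        i = min(i, len(self.cumulative) - 1)         # guard against rounding
        # Spread the value uniformly inside the chosen bin (an assumption).
        return self.rng.uniform(self.edges[i], self.edges[i + 1])


# Illustrative histogram: three bins on [0, 3) with contents 1, 0 and 3.
sampler = HistogramSampler(bin_edges=[0.0, 1.0, 2.0, 3.0], counts=[1, 0, 3], seed=1)
samples = [sampler.sample() for _ in range(10000)]
fraction_last_bin = sum(x >= 2.0 for x in samples) / len(samples)
```

In this example the empty middle bin is never selected, and `fraction_last_bin` should be close to 3/4, the relative content of the last bin.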
A graphic interface is connected to the simulation program to control the main features, to perform animation, to display different histograms, and to analyze trace files from the simulation. We have found this a very useful tool for debugging and optimizing the system.
Simulation Studies

Requirements for the HERA-B experiment
The high event rate of the future HERA-B experiment [2] (10 MHz, corresponding to the bunch crossing rate), with multiple interactions per bunch crossing, will produce more than 10^7 particles per second per square centimeter in the innermost detector region. The event rate is expected to be reduced by about five orders of magnitude by a four-level trigger system.
The first and second level triggers will operate on a limited range of data, due to the hard time constraints for these systems. In the data acquisition scheme the event building is performed after the second level trigger decision. The events are then routed to the third level trigger, a farm of processors acting in parallel on successive events in order to keep up with the expected input rate of 2000 events/s. The third level trigger (L3) is based on "off-line" analysis code which performs track fits and kinematical reconstruction in order to provide a high quality selection of interesting physics. For the events which have passed the third level trigger, complete event analysis (L4) will have to be performed mainly "on-line" because of the long running periods and the high event rate. Estimated distributions [3] for the event size and for the processing times of the third level trigger and the full event reconstruction are presented in figure 1. It is most likely that this "fourth level" farm will be combined with the third level farm, using the same processors. This scheme provides more flexibility to the trigger and data analysis system, and reduces the storage and transfer of raw data.
"Classical" farm model
One possible architecture for solving this challenging data processing task is based on connecting all the farm nodes to a large switch (built from the C104, a 32x32 crossbar switch element [4]) which distributes the complete events from a parallel event builder unit. Note that the event building task requires all the data buffers to send their data at nearly the same time to one destination. With such switching elements this cannot be done without multiple asynchronous sending processes at the data buffer level, which may be a very difficult task for the second level buffer units.
The control of the farm nodes and the dynamic node allocation are done by a supervisor unit connected to the same network. To estimate the processing time for the supervisor task we implemented a simple supervising algorithm on a Digital Signal Processor (DSP), the TI-TMS320C40 [5], and used three other DSPs to simulate the messages from the farm nodes and the event builder. Typically, serving such a request, including the I/O, takes the supervisor unit a few microseconds. In order to reduce the total number of messages to be processed by the supervisor unit, the control protocol for the farm nodes is based on two types of messages: "buffer nearly empty" and "buffer nearly full". By using the "buffer nearly full" message as an instruction to stop the allocation of a node for subsequent events, we ensure that if an event is distributed to a node before this message is processed by the supervisor, the node is able to buffer at least one more event.
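The two-message protocol can be sketched as follows. This is an illustration under assumptions, not the algorithm run on the DSP: the round-robin allocation, the threshold values, and the names `NodeBuffer` and `Supervisor` are hypothetical. The "nearly full" threshold is placed one slot below the capacity, so that an event already in flight can still be buffered.

```python
from collections import deque


class NodeBuffer:
    """Event buffer of one farm node; reports only threshold crossings."""

    def __init__(self, capacity, low_water=1):
        self.capacity = capacity
        self.high_water = capacity - 1   # keep one slot for an event in flight
        self.low_water = low_water
        self.occupancy = 0

    def push(self):
        if self.occupancy >= self.capacity:
            raise OverflowError("event lost: buffer full")
        self.occupancy += 1
        return "nearly_full" if self.occupancy == self.high_water else None

    def pop(self):
        self.occupancy -= 1
        return "nearly_empty" if self.occupancy == self.low_water else None


class Supervisor:
    """Allocates nodes round robin, skipping nodes that reported nearly full."""

    def __init__(self, n_nodes):
        self.available = deque(range(n_nodes))
        self.blocked = set()

    def on_message(self, node, kind):
        if kind == "nearly_full" and node not in self.blocked:
            self.blocked.add(node)
            self.available.remove(node)
        elif kind == "nearly_empty" and node in self.blocked:
            self.blocked.discard(node)
            self.available.append(node)

    def allocate(self):
        """Return the next free node, or None if every node is blocked."""
        if not self.available:
            return None
        node = self.available.popleft()
        self.available.append(node)
        return node
```

A node that is neither blocked nor released generates no traffic at all, which is the point of the scheme: the supervisor only hears from a node when its buffer crosses one of the two thresholds.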
Due to the large differences in the processing time for different events (more than three orders of magnitude) the statistical fluctuations in loading the farm are significant. The buffering of several events per node is important for an efficient use of the processing power and for reducing the statistical fluctuations in the farm loading. For a simple farm control mechanism, such as the one presented above, an efficient use of the processing nodes cannot be achieved without quite large memory resources for buffering. Farm loading characteristics for several input rates (normally distributed) are presented in figure 2 as a function of time. Using the assumptions presented in 2, a farm having 128 nodes keeps up with an input rate of 2000 events/second. If the input rate increases to 2500 events/second, quite soon all the nodes will be in a "nearly full" state and the supervisor will not be able to provide addresses for all the requests from the event builder. The overall latency distribution has long tails because events are buffered and the processing time for the full event reconstruction is much larger than for the third level trigger.
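As a rough consistency check of these numbers, one can use the steady-state condition that a farm of N nodes sustains an input rate R only if the mean total processing time per event does not exceed N/R. The short calculation below assumes perfect load balancing and neglects I/O and supervisor overheads; the function name is hypothetical and the resulting values are an inference, not figures quoted from the simulation.

```python
def time_budget_per_event(n_nodes, input_rate_hz):
    """Largest mean processing time per event (in seconds) a farm can sustain."""
    return n_nodes / input_rate_hz


budget_2000 = time_budget_per_event(128, 2000.0)   # 0.064 s
budget_2500 = time_budget_per_event(128, 2500.0)   # 0.0512 s
```

Under these simplifications, a 128-node farm that keeps up at 2000 events/second but saturates at 2500 events/second would correspond to a mean processing time per event between roughly 51 ms and 64 ms.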
"Farm of mini-farms" model
In this architectural model, presented in figure 3a, the farm is organized in several mini-farms with, e.g., 16 nodes, each having its own local supervisor unit (LSV). A self-organized switch built from DSPs performs the event building task in two partial stages. Boards having six ADSP 21060s [6, 7] connected through a common bus are used as building blocks for distributing the data through the system. Four such units perform in parallel a partial event building for the same event by collecting data from the second level buffers. Each mini-farm unit uses a DSP from such a board to collect a complete event, which will be processed in one of its nodes. This final stage of the event building is done in parallel for successive events. Performing the event building task in such a distributed way allows this scheme to handle data rates of up to 400 MB/s. Once an event is collected, it is distributed to a farm node through a DSP link. Four DSPs on such a mini-farm control board are used for this task. The dynamic node allocation and the buffer memory management of each node are done by another DSP, which performs the LSV task. The DSP structure allows a very low latency for the control messages and makes possible an LSV algorithm which optimizes the event queuing inside a mini-farm. A global supervisor unit connected to all the LSVs may be used to avoid the overloading of a mini-farm due to statistical fluctuations. A mini-farm address combined with the second level accept message is broadcast to the second level buffers. This scheme of independent supervisor units reduces the size of the data buffers and optimizes the overall latency distribution in the system, which is presented in figure 3b.
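The two levels of supervision can be sketched as below. The text does not specify the LSV queuing algorithm or the policy of the global supervisor, so the shortest-queue and least-loaded rules used here are assumptions chosen only for illustration, as are the class names and the sizes in the example.

```python
class LocalSupervisor:
    """LSV sketch: tracks the queue length of each node in one mini-farm."""

    def __init__(self, n_nodes, buffer_slots):
        self.queued = [0] * n_nodes
        self.buffer_slots = buffer_slots

    def load(self):
        return sum(self.queued)

    def has_room(self):
        return any(q < self.buffer_slots for q in self.queued)

    def allocate(self):
        """Assumed policy: give the event to the node with the shortest queue."""
        node = min(range(len(self.queued)), key=self.queued.__getitem__)
        if self.queued[node] >= self.buffer_slots:
            return None                  # every node buffer is full
        self.queued[node] += 1
        return node

    def done(self, node):
        self.queued[node] -= 1           # node finished one event


class GlobalSupervisor:
    """Assumed policy: send the next event to the least loaded mini-farm."""

    def __init__(self, lsvs):
        self.lsvs = lsvs

    def choose_minifarm(self):
        candidates = [i for i, lsv in enumerate(self.lsvs) if lsv.has_room()]
        if not candidates:
            return None
        return min(candidates, key=lambda i: self.lsvs[i].load())


# Two small mini-farms with two nodes and one buffer slot per node.
lsvs = [LocalSupervisor(n_nodes=2, buffer_slots=1) for _ in range(2)]
gsv = GlobalSupervisor(lsvs)
trace = []
for _ in range(5):
    farm = gsv.choose_minifarm()
    node = lsvs[farm].allocate() if farm is not None else None
    trace.append((farm, node))
```

In this example the first four events alternate between the two mini-farms and fill all four node buffers, so the fifth request cannot be served until some node reports an event as done.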
Summary
In developing a general simulation framework to evaluate and optimize different on-line farm architectures we have found the multi-thread object oriented approach, combined with an adequate graphic interface, very useful.
An architecture that distributes not only the events for data processing, but also the event building task and the control protocol, is more flexible and allows a better use of the resources. This simulation work is complemented by a small farm prototype for the benchmarking of processing times, I/O bandwidth and latency in an integrated system. ...
...
