Low cost SMP (Symmetric Multi-Processor) systems provide substantial CPU and I/O capacity. These features, together with the ease of system integration, make them an attractive and cost-effective solution for a number of real-time applications in event selection. In ATLAS we consider them as intelligent input buffers (an "active" ROB complex), as event flow supervisors or as powerful processing nodes.
I. INTRODUCTION
This work is based on the idea of using commercial commodity components in the ATLAS high-level trigger and DAQ systems [1]. One of these components is the SMP (Symmetric Multi-Processor) system. It provides substantial CPU power and allows access to the main memory and input/output interfaces from all processors through a very high-speed system bus or switch (Figure 1). The workload is balanced among the processors by the single operating system. Single-bus SMPs are the most cost-effective way to build small shared-memory multi-processor systems (up to four CPUs), as modern microprocessors are designed to support such an architecture.
A 100 MHz system bus provides a burst bandwidth of 800 MB/s, which becomes a bottleneck for more powerful processors and for a larger number of CPUs.
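The 800 MB/s figure is consistent with a 64-bit (8-byte) data path transferring once per clock cycle during a burst. The bus width is an assumption made here for illustration; it is not stated in the text. A minimal check of the arithmetic:

```python
# Assumption: 64-bit (8-byte) data path, one transfer per clock in a burst.
BUS_CLOCK_HZ = 100_000_000   # 100 MHz system bus (from the text)
BUS_WIDTH_BYTES = 8          # assumed width, not stated in the text


def burst_bandwidth_mb_per_s(clock_hz: int, width_bytes: int) -> float:
    """Peak burst bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return clock_hz * width_bytes / 1e6


print(burst_bandwidth_mb_per_s(BUS_CLOCK_HZ, BUS_WIDTH_BYTES))  # 800.0
```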
The performance and scalability of a multi-processor system may be increased by using a multi-bus SMP system (a hierarchy of buses or a multi-ported memory architecture), a switch-based architecture or a hybrid of the two. These solutions deliver significantly higher performance. However, they are generally optimised for a particular number of processors.
Low-cost single-bus commercial SMP systems have been generally available from several manufacturers since 1998 and are presently limited to 4-processor systems with 2 or 3 PCI buses (6-7 or 10-11 PCI slots, respectively). 8-processor systems with up to 4 PCI buses (10-12 PCI slots) have recently appeared on the market and are still relatively expensive.
We discuss a number of possible real-time applications in the ATLAS Trigger/DAQ Data Flow System, present the results of our measurements and suggest a long-term programme of work.
II. APPLICATION AREAS
In the ATLAS High-Level Trigger (HLT) system we consider several areas in the Data Flow System where SMPs can be used (Figure 2). In the Read Out subsystem they may serve as intelligent input buffers, forming an "active" Read Out Buffer (ROB) complex, a possible cost-effective alternative to the VME-based implementation. In the LVL2 Data Collection subsystem an SMP may find its place as a LVL2 Data Flow Supervisor. SMP systems can also be used as powerful processing units in the HLT Data Flow subsystem.
[Figure 2. Labels: LVL1 Trigger; Data from Read-Out Drivers (RODs).]
A. Active ROB complex
The proposed application for SMPs in the Read Out subsystem is the "active" ROB complex. The original idea was to build detector-adapted computing stations with multiple processors, all having access to all ROBs (Read Out Buffers) of a full detector, using commercial components from the HPCN (High Performance Computing and Networking) market: several multiprocessor boards with a proprietary interconnect, all working under a shared-memory paradigm (Figure 3). The availability of components has reduced the initial goal to a multi-ROB station; we call it an "active" ROB complex because the processors actively contribute to alleviating the critical traffic over the LVL2 selection network. Processors in the SMP-based "active" ROB complex are assumed to share memory and access to a number of ROBs. The grouping of ROBs is adapted to individual detectors; the limit is set at 16 ROBs per "active" ROB complex. Open questions include the achievable aggregate bandwidth for multiple ROBs on multiple PCI buses, the limits set by the internal system bus of the SMP system and the overhead of the multi-threaded implementation of the "active" ROB tasks.
B. LVL2 Data Flow Supervisor
A detailed description of the LVL2 Data Flow Supervisor can be found in the literature [2]. It consists of several processors (VME or PC based) connected to the LVL2 Data Collection network, and is designed to scale simply by adding more processors. An additional unit, the Region Of Interest (ROI) Builder [3], combines the different data streams from the LVL1 system into a record for each event and distributes the data to processors within the Supervisor farm. The Supervisor processor manages the event through the LVL2 Data Flow system: it allocates a LVL2 processing unit, forwards the ROI record to it, receives the decision back, packs the decisions and multicasts them to the ROBs. The Supervisor also transmits the LVL2 decisions, which may be grouped to reduce the rate of messages, to the Data Flow Manager of the Event Filter system. An SMP-based LVL2 Data Flow Supervisor might prove advantageous for the allocation of LVL2 processing units, for grouping of LVL2 decisions, for interaction with the Data Flow Manager and for the Supervisor monitoring tasks. The main issues are input/output bandwidth, processing capabilities, scalability and fault tolerance.
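The grouping of LVL2 decisions can be illustrated with a small sketch. It is not the Supervisor code: the class name `DecisionGrouper`, the record layout and the group size of 4 are hypothetical, and the multicast to the ROBs is replaced by a plain callback.

```python
from typing import Callable, List, Tuple

# Hypothetical record layout: (event number, accept flag).
Decision = Tuple[int, bool]


class DecisionGrouper:
    """Buffers LVL2 decisions and sends them out in groups."""

    def __init__(self, group_size: int,
                 multicast: Callable[[List[Decision]], None]):
        self.group_size = group_size
        self.multicast = multicast      # stand-in for the multicast to the ROBs
        self.pending: List[Decision] = []

    def add(self, event_number: int, accepted: bool) -> None:
        self.pending.append((event_number, accepted))
        if len(self.pending) >= self.group_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, even if the group is incomplete."""
        if self.pending:
            self.multicast(self.pending)
            self.pending = []


sent: List[List[Decision]] = []
grouper = DecisionGrouper(group_size=4, multicast=sent.append)
for event_number in range(10):
    grouper.add(event_number, accepted=(event_number % 2 == 0))
grouper.flush()   # send the incomplete final group
print(len(sent))  # 3 messages instead of 10
```

With a group size of n the message rate drops by roughly a factor of n, at the cost of delaying each decision until its group is full.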
C. Processing unit
The LVL2 Data Flow Supervisor allocates the LVL2 selection task for each event to a LVL2 processing unit, which in turn distributes these tasks among several worker threads. The subsequent processing of the event (data collection from the ROBs, feature extraction and steering) is performed in a single thread. While a thread is waiting for data from the ROBs, other threads continue to work on different events.
This inherent concept of multi-threaded event data processing makes possible an efficient use of SMP systems. Workload balancing is automatically provided by the operating system and communication and synchronization tasks are easily accomplished.
A substantial amount of work and measurements on the SMP application for the HLT Data Flow Event Filter system has been reported in [4]. We therefore did not perform any specific measurements in this area.
III. ONGOING ACTIVITIES
A. Active ROB complex
All programs were developed using standard software tools. The commercial microEnable driver was capable of working in the multi-processor / multi-bus environment. A high-level interface was written to handle requests for ROB data (by event number) and to provide a polling mechanism. A multi-threaded application program was written in C++ to perform the basic measurements. Multi-threaded applications allow the system to distribute the tasks to different processors.
Measurements of the aggregate bandwidth for multiple ROBs, system bus limits and the multi-thread implementation overhead were performed.
The bandwidth measurements (Figure 5) show that the double PCI bus can be put to use. The rate increase for the second PCI bus (i.e., going from two to four ROBs) is 88% for large packets and 55% for the preferred packet size of 1 Kbyte. This is achievable because the bandwidth of the SMP system bus is higher than the total bandwidth of the two PCI buses. The input/output rate does not depend only on the PCI and system bus bandwidths. It also depends on the system bus loading: the available PCI bandwidth drops by about 20% when the memory bus is loaded with a read/write flow of 130+130 MB/s, i.e. it goes from 160 to 130 MB/s for 1 Kbyte ROB fragments. The achieved system bus bandwidth is, therefore, about half of its theoretical limit. This is not unexpected: the system bus of the four-processor boards is a known bottleneck.
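The "about half" estimate can be checked by adding up the flows quoted above. The sketch assumes that the PCI input traffic and the memory read and write flows simply add on the system bus, that the 130+130 figure is in MB/s, and that the theoretical limit is the 800 MB/s burst bandwidth of the 100 MHz system bus.

```python
BUS_PEAK_MB_S = 800.0        # burst bandwidth of the 100 MHz system bus
PCI_UNLOADED_MB_S = 160.0    # PCI input rate without extra memory traffic
PCI_LOADED_MB_S = 130.0      # PCI input rate under memory load
MEM_READ_MB_S = 130.0        # additional read flow on the memory bus
MEM_WRITE_MB_S = 130.0       # additional write flow on the memory bus

# Relative loss of PCI bandwidth caused by the memory traffic.
pci_drop = 1.0 - PCI_LOADED_MB_S / PCI_UNLOADED_MB_S

# Assumed total traffic on the system bus: PCI input plus read plus write.
bus_total = PCI_LOADED_MB_S + MEM_READ_MB_S + MEM_WRITE_MB_S
bus_fraction = bus_total / BUS_PEAK_MB_S

print(f"PCI drop: {pci_drop:.0%}")              # PCI drop: 19%
print(f"bus load: {bus_total:.0f} MB/s")        # bus load: 390 MB/s
print(f"fraction of peak: {bus_fraction:.0%}")  # fraction of peak: 49%
```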
A multi-threaded test program was used to combine I/O and processing activities. All the events in these measurements contain four fragments of 1 KB each, i.e., each event is 4 KB. A processing time varying from 0 to 400 µs is applied to each event. Four types of threads are used in the test program:
One RequestThread generates the requests for the requestQueue.
One CollectionThread reads the requestQueue and collects all event fragments needed for the event from the ROBs. Collected events are put on the eventQueue.
One or more WorkerThreads read data from the eventQueue and spend a certain amount of CPU time, the "algorithm", on each event before the result is put on the resultQueue.
One ResponseThread reads the resultQueue.
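The four thread types and three queues can be sketched as follows. This is an illustration of the queue structure only, not the C++ test program: the ROB read is replaced by a hypothetical `read_rob_fragment` stand-in, the event and worker counts are arbitrary, and the "algorithm" is a sleep.

```python
import queue
import threading
import time

N_EVENTS = 20            # hypothetical number of events for the sketch
N_WORKERS = 3            # one or more WorkerThreads
FRAGMENTS_PER_EVENT = 4  # four fragments per event, as in the measurements
FRAGMENT_SIZE = 1024     # 1 KB per fragment
ALGORITHM_TIME_S = 0.0   # "algorithm" time per event (0 to 400e-6 in the text)
STOP = None              # sentinel that shuts a stage down

request_queue = queue.Queue()
event_queue = queue.Queue()
result_queue = queue.Queue()
results = []


def read_rob_fragment(rob_id, event_number):
    """Hypothetical stand-in for a ROB read; returns dummy data."""
    return bytes(FRAGMENT_SIZE)


def request_thread():
    for event_number in range(N_EVENTS):
        request_queue.put(event_number)
    request_queue.put(STOP)


def collection_thread():
    while (event_number := request_queue.get()) is not STOP:
        fragments = [read_rob_fragment(rob, event_number)
                     for rob in range(FRAGMENTS_PER_EVENT)]
        event_queue.put((event_number, b"".join(fragments)))
    for _ in range(N_WORKERS):        # one sentinel per worker
        event_queue.put(STOP)


def worker_thread():
    while (item := event_queue.get()) is not STOP:
        event_number, data = item
        time.sleep(ALGORITHM_TIME_S)  # placeholder for the "algorithm"
        result_queue.put((event_number, len(data)))
    result_queue.put(STOP)


def response_thread():
    stopped = 0
    while stopped < N_WORKERS:        # run until every worker has finished
        item = result_queue.get()
        if item is STOP:
            stopped += 1
        else:
            results.append(item)


threads = [threading.Thread(target=request_thread),
           threading.Thread(target=collection_thread),
           threading.Thread(target=response_thread)]
threads += [threading.Thread(target=worker_thread) for _ in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), "events processed")
```

Because of the Python global interpreter lock, this sketch does not reproduce the multi-processor scaling of the measurements; it only shows how the stages are decoupled by the queues.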
For zero "algorithm" time the system is completely dominated by I/O time, which varies between 33 and 50 µs per event, corresponding to an event rate of 20-30 kHz (Figure 6).
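The conversion from I/O time per event to event rate is a simple reciprocal. The sketch below also derives the corresponding input data rate for 4 KB events; that figure is not quoted in the text and assumes 1 KB = 1024 bytes.

```python
EVENT_SIZE_BYTES = 4 * 1024   # four 1 KB fragments per event (1 KB = 1024 B assumed)


def event_rate_khz(io_time_us: float) -> float:
    """Event rate in kHz when each event costs io_time_us of I/O time."""
    return 1e3 / io_time_us


def data_rate_mb_per_s(io_time_us: float) -> float:
    """Input data rate in MB/s; bytes per microsecond equals MB/s."""
    return EVENT_SIZE_BYTES / io_time_us


for io_time_us in (33.0, 50.0):
    print(f"{io_time_us:.0f} us/event -> "
          f"{event_rate_khz(io_time_us):.1f} kHz, "
          f"{data_rate_mb_per_s(io_time_us):.0f} MB/s")
```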
Since one CPU (running the Collection thread) handles the input from the ROBs, the remaining CPUs are mainly idle. Adding more Worker threads only increases the competition for scarce events, resulting in additional overhead.
"Algorithm" times of 180 µs/event or more give consistent utilization results where all CPUs are occupied. The observed maximum of 75% CPU utilization is achieved for an "algorithm" time of 400 µs/event, i.e., one processor is doing data collection while three processors are processing events. More work is required to show the user-level and system-level effects of multi-threading.
B. LVL2 Data Flow Supervisor
Investigations are under way to determine the feasibility and possible benefits of using an SMP system as the Supervisor in place of the current complement of PCs.
A possible SMP Supervisor structure is shown in Figure 7.
[Figure 7. Labels: Input from LVL1 and TT; Interface to LVL2 Data Flow.]
The main concern is the input/output bandwidth of the Supervisor. In the final implementation of the LVL2 ROI Builder [8], the input buffers on 7 links from LVL1 will accommodate up to 63 32-bit words per event fragment. This leads to a maximum aggregate input bandwidth to the LVL2 Supervisor of about 180 MB/s at a 100 kHz event rate and a 1.8 Kbyte event size.
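The event size and bandwidth follow directly from the link parameters. The sketch assumes that every one of the 7 links delivers a maximum-size fragment for each event, which gives the upper bound quoted in the text.

```python
LINKS_FROM_LVL1 = 7        # input links into the ROI Builder
WORDS_PER_FRAGMENT = 63    # maximum 32-bit words per event fragment
BYTES_PER_WORD = 4         # 32-bit words
EVENT_RATE_HZ = 100_000    # 100 kHz event rate

# Worst case: every link delivers a full fragment for every event.
event_size_bytes = LINKS_FROM_LVL1 * WORDS_PER_FRAGMENT * BYTES_PER_WORD
bandwidth_mb_s = event_size_bytes * EVENT_RATE_HZ / 1e6

print(event_size_bytes)   # 1764 bytes, about 1.8 Kbytes
print(bandwidth_mb_s)     # 176.4 MB/s, about 180 MB/s
```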
Measurements on the four-processor boards show that a bandwidth of 40 MB/s per input port can be achieved for a packet size of 1 Kbyte. Therefore, 5-6 links from the ROI Builder to the Supervisor may be necessary to carry the event data at the maximum input load.
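The link count is the input load divided by the per-port bandwidth, rounded up. In the sketch below the `headroom` parameter and its 20% value are hypothetical, added only to show how a safety margin moves the result from 5 to 6 links.

```python
import math

INPUT_BANDWIDTH_MB_S = 180.0   # maximum aggregate input load (from the text)
PER_PORT_MB_S = 40.0           # measured per input port for 1 Kbyte packets


def links_needed(load_mb_s: float, per_link_mb_s: float,
                 headroom: float = 0.0) -> int:
    """Smallest number of links that carries the load plus fractional headroom."""
    return math.ceil(load_mb_s * (1.0 + headroom) / per_link_mb_s)


print(links_needed(INPUT_BANDWIDTH_MB_S, PER_PORT_MB_S))        # 5
print(links_needed(INPUT_BANDWIDTH_MB_S, PER_PORT_MB_S, 0.2))   # 6
```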
A similar output bandwidth from the Supervisor to the LVL2 selection network will be necessary for communication with the LVL2 processing units and ROB systems. Further measurements of the performance of the LVL2 Data Collection network interface need to be done in order to estimate the necessary number of output links.
It is quite obvious that the single-bus SMP system we are using for the preliminary measurements will not be suitable for the SMP Supervisor implementation. Preliminary measurements of the multi-threaded implementation of the Supervisor program have been done using a test program with two threads, so that only two CPUs are occupied. They show that, with an increasing number of ROIs in the event data (from 2 to 8), the thread-switching overhead is reduced from 4 to 2.7 µs/thread.
The number of available PCI I/O slots limits the scalability of the SMP-based LVL2 Supervisor. A possible solution (which also addresses the fault tolerance issues) is a double-SMP implementation. However, this may limit the possible advantages for the allocation of LVL2 processing units, for grouping of LVL2 decisions, for interaction with the Data Flow Manager and for the Supervisor monitoring tasks.
IV. CONCLUSIONS
Preliminary measurements show that commercial SMP systems, I/O cards and present software may be used in several areas of the ATLAS Data Flow System (Read Out, Data Collection and HLT Data Flow). SMP systems of the type evaluated are available from several manufacturers and are packaged in a way that makes their integration into a system a relatively easy task.
The I/O capacity available in SMP systems can largely be put to use by commercial interface cards and can satisfy the requirements of different parts of the ATLAS Data Flow System. Measurements of internal communication at application level have shown that the substantial CPU capacity in the SMP system can be largely made available to user programs in situations approaching those of ATLAS Data [...] behavior of SMP systems needs more [...] context of the ATLAS Data Flow
