The present work describes the architecture and data flow analysis of a highly parallel processor for the Level 
I. BTEV LEVEL 1 TRIGGER
Fermilab's BTeV experiment has been proposed with a three-level Trigger System [1]. The Level 1 trigger will perform calculations using data from two detector systems: the pixel detector and the muon detector. Level 1 must perform crude data preprocessing plus Track and Vertex reconstruction while keeping the processing time as low as possible. Levels 2 and 3, acting on events that passed Level 1, will gather information from these sub-detectors plus other sub-detectors to perform full pattern recognition.
The Level 1 Trigger processes all events generated by collisions of protons and antiprotons in Fermilab's Tevatron. The average event interarrival time is about 132 ns at a luminosity of . Since the Level 1 processing time will be approximately three orders of magnitude longer, the Level 1 processor will pipeline and process many events in parallel.
Figure 1 shows the L1 Trigger building blocks. The Data Preprocessors have two main building blocks, the Pixel Preprocessor and the Segment Tracker. The Pixel Preprocessor formats and sorts the raw data coming from the Pixel Detector. There are two main modules in the Pixel Preprocessor, the x-y coordinate translator (XYPC) and the Time Stamp Sorter. The XYPC module formats the data, converting groups of Pixel Detector raw hits into x-y coordinate referenced data. The data generated at every bunch crossing (BCO) of the accelerator is stamped with a distinctive temporal label called a Time Stamp (TS). The data sorting is done by the TS-ordering function based on the data Time Stamp. The next processing stage is the Segment Tracker, which generates triplets of points describing the beginning and the end of all tracks in each event. Each Pixel Preprocessor and Segment Tracker processes a small geographic portion of the pixel detector.
The Data Router or Switch routes all data that share the same Time Stamp to the same Track and Vertex processor. Each data event is assigned to a single processing node because trigger decisions are made on an event-by-event basis. The Track and Vertex processors are grouped in larger units of hardware called Farmlets. Processors in a Farmlet share some resources such as data I/O, main buffering and network connections.
A key subject in the design of a multi-thousand node processing system is the data flow. A data flow analysis guarantees that bottlenecks are eliminated and that sufficient processing and storage are allocated. The analysis also generates results that can be used to optimize the system architecture.
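As a rough illustration of the pipelining requirement stated above, the number of events in flight follows from Little's law (mean number in system = arrival rate × mean time in system). The sketch below assumes a processing time of exactly 1000 interarrival times as a stand-in for "approximately three orders of magnitude"; that figure and the function name are illustrative assumptions, not design values.

```python
def events_in_flight(interarrival_ns: float, latency_ns: float) -> float:
    """Little's law: mean number in system = arrival rate * mean time in system."""
    arrival_rate = 1.0 / interarrival_ns  # events per ns
    return arrival_rate * latency_ns


INTERARRIVAL_NS = 132.0                       # average interarrival time, from the text
ASSUMED_LATENCY_NS = 1000 * INTERARRIVAL_NS   # assumption: three orders of magnitude longer

print(events_in_flight(INTERARRIVAL_NS, ASSUMED_LATENCY_NS))
```

Under this assumption, on the order of a thousand events are being processed at any instant, which is why the events must be spread across many processing nodes.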
The data flow analysis of the BTeV L1 Trigger has been done using mathematical models and simulations. The mathematical models make extensive use of queuing theory. The L1 Trigger data inputs and outputs are described as stochastic processes. Subsystem behaviors are described by a set of differential-difference equations and solved for either their transient or equilibrium states. An advantage of this modeling approach is its generality: the current models are general enough to apply to a large class of parallel processing architectures.
The mathematical models have also been validated by behavioral simulation of the Level 1 Trigger Processor. The input to the simulators comes from the simulation of the BTeV detector [1]. The dataflow simulations represent the timing and trigger functions as conceived today and are as close as possible to their final implementation.
II. PIXEL PROCESSOR AND SEGMENT TRACKER (PP&ST) DATAFLOW ANALYSIS
The dataflow analysis of the L1 Trigger cannot be covered entirely in this paper. Only the main sections of the PP&ST will be shown. A data flow analysis of the Track and Vertex processors is presented in paper N36-61 in this conference. For a comprehensive reading on both subjects please see [2] [3]. The queuing model used for the PP&ST is shown in 
III. THE TS-ORDERING PROCESS
The TS-ordering process opens a new queue when it receives data with a TS different from all the ones in the existing queues. A queue dies when the data reception for that event is complete. Since the L1 Trigger has no way to tell the end of one event, we use a deterministic processing time. We have set this time equal to a complete revolution of the TS clock, that is 159 Bunch Crossing Clocks (BCOs) (~21µs).
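The birth and death rule described above can be sketched as a small toy model. This is an illustration only, not the trigger's implementation: the class and method names are hypothetical, time stamps are treated as absolute BCO counts with no wrap-around, and hits are assumed to arrive in non-decreasing time order.

```python
from collections import OrderedDict

TS_PERIOD_BCO = 159  # one revolution of the TS clock, from the text


class TSOrderer:
    """Toy model: one queue per time stamp, closed a fixed time after it opens."""

    def __init__(self, hold_bco: int = TS_PERIOD_BCO):
        self.hold_bco = hold_bco
        self.queues = OrderedDict()  # ts -> (birth_bco, list of hits)

    def receive(self, now_bco: int, ts: int, hit) -> None:
        # A TS not present in any open queue gives birth to a new queue.
        if ts not in self.queues:
            self.queues[ts] = (now_bco, [])
        self.queues[ts][1].append(hit)

    def close_expired(self, now_bco: int):
        """Return (ts, hits) for every queue whose hold time has elapsed."""
        done = []
        while self.queues:
            ts, (birth, hits) = next(iter(self.queues.items()))
            if now_bco - birth < self.hold_bco:
                break  # queues are kept in birth order, so the rest are younger
            del self.queues[ts]
            done.append((ts, hits))
        return done


orderer = TSOrderer()
orderer.receive(0, ts=0, hit="a")
orderer.receive(3, ts=1, hit="b")
orderer.receive(5, ts=0, hit="c")
```

Because every queue lives for the same fixed time, queues die in the order they were born, which is what lets the sketch stop at the first queue that has not yet expired.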
The analysis of the TS event ordering is fairly complex because the process must consider not only the queue birth-death distribution but also the size distribution of each individual queue. Each individual queue represents a nonstationary process. However, some simplifications can be made. We can define a new process that only considers the number of queues in the TS event ordering system, regardless of their size. This new process is a well-defined birth-death Markov chain. Each state represents the number of existing queues in the system (Figure 3). The process can be modeled as an M/D/∞ process. The birth times of the queues are generated by random queue arrivals. The interarrival times can be considered exponentially distributed. Queue deaths are caused by complete events leaving the system at deterministic interdeparture times of 159 BCOs.
Figure 3. TS-ordering queues state transition diagram
Let λ represent the rate at which new queues are generated. Simulations show that the total Pixel Detector Half Station data rate is 0.9 events/BCO at 4 interactions/BCO. This rate is divided by the number of highways, N=8, giving λ=0.1125 events/BCO into the Data Preprocessors. µ, the service rate, is deterministic and equal to the reciprocal of the time we choose to wait before considering an event complete. In this example we set µ to 1/(159 BCOs), or 0.006289 BCO⁻¹. The M/D/∞ process is always stable. The probability distribution function of this system is given by:
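For reference, the standard equilibrium result for an M/G/∞ queue, of which M/D/∞ is a special case, is a Poisson distribution with mean λ/µ. The sketch below assumes that textbook result and the rates quoted above; the function and variable names are illustrative.

```python
import math

LAMBDA = 0.9 / 8   # queue births per BCO (0.1125), from the text
HOLD_BCO = 159     # deterministic queue lifetime 1/mu, in BCOs


def queue_count_pmf(k: int, lam: float = LAMBDA, hold: float = HOLD_BCO) -> float:
    """P[k queues open] for an M/G/inf system: Poisson with mean lam * hold."""
    mean = lam * hold
    # Evaluate in log space to stay numerically safe for large k.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


mean_queues = LAMBDA * HOLD_BCO
print(f"mean number of open queues: {mean_queues:.4f}")
print(f"P[more than 40 queues open]: {1.0 - sum(queue_count_pmf(k) for k in range(41)):.3e}")
```

With these rates the mean is 0.1125 × 159, or about 17.9 open queues. The tail probability in the last line is the kind of figure that can guide how many queues the hardware should provision.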
The average number of queues in the system is given by:
The analysis of individual queue size can be performed as follows. We can calculate the conditional probability distribution function of queue occupancy given that there are n queues and that the total number of data words in the queues is m. The distribution of data among the queues can be modeled as a generalized binomial (multinomial) distribution:
Equation (2) is still conditioned on a fixed number of queues in the system. However, it lets us study the distribution of data in the queues for certain key values of n. For instance, we can let n be the average number of queues or some upper bound.
Equation (2) shows that, for a given n, M1(t)…Mn(t) are independent Poisson processes, each with parameter λt/n. Since the interarrival times of a Poisson process are exponentially distributed, the time to the k-th arrival of an event in (1) follows a k-stage Erlang distribution. In our case the distribution is conditioned on fixed n.
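The independence claim can be checked directly. As a sketch, assume the total number of words M(t) accumulated over a time t is Poisson with mean λt, and that each word falls into any of the n queues with equal probability 1/n (the equal-probability multinomial form is an assumption made here). Multiplying the two distributions gives a product of n Poisson terms:

```latex
\begin{aligned}
P[M_1 = m_1, \dots, M_n = m_n]
  &= \frac{e^{-\lambda t}(\lambda t)^{m}}{m!}
     \cdot \frac{m!}{m_1! \cdots m_n!} \left(\frac{1}{n}\right)^{m},
     \qquad m = \sum_{i=1}^{n} m_i \\
  &= \prod_{i=1}^{n} \frac{e^{-\lambda t / n}\,(\lambda t / n)^{m_i}}{m_i!}
\end{aligned}
```

The joint distribution factors into one term per queue, so the queue occupancies are independent and each is Poisson with mean λt/n.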
The average number of hits in the TS-ordering queues can be calculated using the average number of TS-queues and the average number of hits per event. 
IV. THE X-Y PIXEL CLUSTER (XYPC) QUEUE
The x-y pixel cluster (XYPC) queue can be modeled as a "bulk" M/M/1 process. In such a process the data arrives at the input queue in "bulks". Every time the TS-ordering process closes a queue, that entire queue is placed in the x-y translator buffer. The bulks are variable in size, each equal to the size of the event that generates it. In other words, the x-y translator's queue is composed of a number of queued customers, which are in turn of variable length. This problem is a generalization of the system with r-stage Erlangian service, in this case with variable r. The bulk arrival state-transition diagram can be represented as in Figure 6. Let gi = Prob[bulk size is i], then
The equilibrium equations for the bulk arrival system can be written as:
The quantities we are looking for are the mean size of the x-y translator queue and the average service time. The solution of the equilibrium equations involves z-transform methods. The bulk M/M/1 queue size in equilibrium suffers a "modulation" effect caused by the size of the events (bulks). The modulation is reflected in the discrete convolution expressed by the summation in equation (1). Convolutions appear in the z-transform domain as products of z-transforms. The z-transform of the probability distribution of the x-y translator queue size, P(z), is:
where G(z) is the z-transform of the probability distribution of the bulk size. The utilization factor ρ is defined, as usual, as ρ = 1 − p0. The value of ρ can also be obtained from (2) by taking into account that P(1) = 1. Then,
Substituting (4) into (2), the expected number of queues in the bulk M/M/1 process is
Using equation (3) and simplifying, (5) can be written as
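For orientation, the textbook mean for a bulk-arrival M/M/1 queue can be evaluated numerically. The sketch below assumes the standard result N = ρ(E[g] + E[g²]) / (2 E[g] (1 − ρ)) with ρ = λE[g]/µ, which may differ in form from equation (5). The function name and the sample bulk-size distribution are hypothetical.

```python
def bulk_mm1_mean_size(lam: float, mu: float, bulk_pmf: dict) -> float:
    """Mean number in system for a bulk-arrival M/M/1 queue.

    lam: bulk (event) arrival rate, mu: per-item service rate,
    bulk_pmf: {bulk size i: g_i}, with probabilities summing to 1.
    """
    m1 = sum(i * g for i, g in bulk_pmf.items())      # E[g], mean bulk size
    m2 = sum(i * i * g for i, g in bulk_pmf.items())  # E[g^2], second moment
    rho = lam * m1 / mu                               # utilization factor
    if rho >= 1.0:
        raise ValueError("unstable queue: rho must be below 1")
    return rho * (m1 + m2) / (2.0 * m1 * (1.0 - rho))


# Hypothetical bulk-size distribution, for illustration only.
example_pmf = {1: 0.5, 2: 0.3, 4: 0.2}
print(bulk_mm1_mean_size(lam=0.1, mu=1.0, bulk_pmf=example_pmf))
```

When every bulk has size 1 the expression reduces to ρ/(1 − ρ), the ordinary M/M/1 mean. For the same utilization, a wider spread of bulk sizes raises the mean queue size, which is the "modulation" effect described above.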
The unknown parameter σ of equation (5)
There is a queue associated with each input to store the incoming data. We have also defined 7 other internal queues for temporary data storage, which allow pipelining through the processing modules (Figure 8).
independent and their input interarrival times are exponentially distributed with parameter λ, which can be easily estimated from the data sample. The service time distribution for each module can also be estimated from the simulations, which show that the service times are exponentially distributed as well. Hence, the mean event queue sizes are obtained by
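Assuming each module behaves as an independent M/M/1 queue, as the exponential interarrival and service times suggest, the standard mean number in system is ρ/(1 − ρ) with ρ = λ/µ. A minimal sketch follows; the module names and service rates are placeholders, not measured values.

```python
def mm1_mean_size(lam: float, mu: float) -> float:
    """Mean number in an M/M/1 system: rho / (1 - rho), with rho = lam / mu."""
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("unstable queue: lam must be smaller than mu")
    return rho / (1.0 - rho)


# Placeholder (arrival rate, service rate) pairs in events per BCO.
module_rates = {
    "module_a": (0.1125, 0.50),
    "module_b": (0.1125, 0.25),
    "module_c": (0.1125, 0.15),
}
for name, (lam, mu) in module_rates.items():
    print(name, round(mm1_mean_size(lam, mu), 3))
```

The mean queue size grows sharply as the service rate approaches the arrival rate, so the slowest module dominates the buffering requirement.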
A more detailed analysis can model the Segment Tracker as a "bulk" service process. The data arrives hit by hit but is serviced event by event, where every event represents a "bulk". The "bulks" are of variable size and modulate the queue sizes. The equilibrium equations for this process can be derived from the state transition model shown in Figure 9.
VI. CONCLUSIONS
The dataflow analysis of the L1 Pixel trigger has fed valuable information back into the system design. The analysis of data bandwidth and latency has been very useful for balancing the workload across the system with respect to parameters such as bandwidth, latency, queue sizes and system service times. Those numbers are available in [2]. The latency in the L1 Trigger system is dominated by the TS-ordering function and by the Track and Vertex algorithm. The former is constrained by the data generation process in the Pixel Detector and is hence harder to modify. The latter can be reduced by speeding up the algorithms and by taking advantage of the speed of FPGAs for part of their implementation. The dataflow analysis has also benefited the design of a fault-tolerant trigger. Since stages cannot provide infinite data queuing or infinite processing bandwidth, they must deal with occasional buffering and processing overflows. We deal with this problem by throttling the data stream and by purging events to reduce queue sizes and processing load. A well-implemented throttle must handle data inefficiency gracefully. The overflows worsen in the event of a failure in the system. This problem is the main subject of the analysis reported in paper NS-xx of the current conference.
