Abstract-This paper implements and analyzes a highway traffic-flow simulation based on continuum modeling of traffic dynamics. A traffic-flow simulation was developed and mapped onto a parallel computer architecture. Two algorithms (the one-step and two-step algorithms) to solve the simulation equations were developed and implemented. Tests with real traffic data collected from the freeway network in the metropolitan area of Minneapolis, MN, were used to validate the accuracy and computation rate of the parallel simulation system (with 256 processors). The execution time for a 24-h traffic-flow simulation over a 15.5-mi freeway, which takes 65.7 min on a typical single-processor computer, took only 88.51 s on the nCUBE2 and only 2.39 s on the CRAY T3E. The two-step algorithm, whose goal is to trade off extra computation for fewer interprocessor communications, was shown to save significantly on the communication time on both parallel computers.
I. INTRODUCTION
A VERY important component of the Intelligent Highway System is a traffic simulation system. Such a system uses computer hardware and software to simulate traffic on freeways and arterial networks. Input/output (I/O) devices provide real-time traffic data measurements from a network of traffic detectors (loops or cameras). The system uses a mathematical traffic-flow model to perform traffic-flow simulation and predict the traffic conditions in real time. These predictions can be used for real-time traffic control and vehicle guidance. Fig.1 shows an architecture for such a traffic simulation system. It consists of three layers: a data layer, a simulation layer, and an analysis layer. The data layer is responsible for system I/O. It imports current traffic data from roadway sensors and provides access to databases containing roadway layout information. Above this, in the simulation layer, the traffic flow model is implemented to compute current and future traffic flow along the highway. At the top, the analysis layer combines simulation results with traffic management heuristics to make decisions about how best to control and/or route highway traffic. This information is then exported to both the traffic control systems and vehicle guidance systems via the data layer. Macroscopic or continuum traffic-flow models based on traffic density, volume, and speed have been proposed and analyzed in the past (see, for example, [1] - [6] and the references therein). These models involve partial differential equations defined on appropriate domains with suitable boundary conditions, which describe various traffic phenomena and road geometries.
The main objective of this paper is to demonstrate that the parallelization of the traffic-flow simulation component in a real-time system is feasible for macroscopic models. Some preliminary results on the issue of parallelizing computational fluid dynamics (CFD) methods for transportation problems were presented in [5] . Such a real-time simulation system can be designed using a parallel computer as its computational component. We design such a computational component by parallelizing a CFD method to solve the momentum conservation (macroscopic) model [4] - [6] and implementing it on the nCUBE2 and Cray T3E parallel computers.
Tests with real data from the I-494 freeway in Minneapolis, MN, were conducted. We run the simulations on a nCUBE2 and on a Cray T3E parallel computer. Processing elements (PEs) of the nCUBE2 are proprietary processors with rate of 20 MHz. Processors on the Cray T3E are DEC Alpha 21 164s with a clock speed of 450 MHz. Tests were run on the nCUBE2 at Sandia National Laboratory, and on the Cray T3E at the NPACI San Diego Supercomputing Center and at the Pittsburgh Supercomputing Center. The execution time for a 24-h traffic-flow simulation of a 15.5-mi freeway, which takes 65.65 min of computer time (on a 133-MHz single-processor Pentium computer), took only 88.51 and 2.39 s on the parallel traffic simulation systems implemented on nCUBE2 and Cray T3E, respectively. We adopted the lax-momentum traffic model [11] . We next outline the continuous model equations. Let and be the traffic density and flow, respectively, at the space-time point . The generation term represents the number of cars entering or leaving the traffic flow in a freeway with entries/exits. The traffic flow, density, and speed are related by
The relaxation time is assumed to vary with density according to where and are model parameters. The model equations are where and are the following vectors in discretized form A typical choice of parameters is and . These parameters depend not only on the geometry of the freeway but also on the time of day and even on weather conditions. We next outline the discrete model. Let and be the time and space mesh sizes. We use the following notation: is the density (vehicles/mile/lane) at space node and at time is the flow (vehicles/hour/lane) at space node and at time , and is the speed (mile/hour) at space node and at time . At time , the density value and volume value are computed directly from the density and volume at the preceding time step
The method is of first-order accuracy with respect to , i.e., the error is . To maintain numerical stability, time and space step sizes must satisfy the Courant-Friedrichs-Lewy (CFL) condition , where is the free-flow speed (see [4] ). Typical choices for the space and time meshes of ft and s are recommended for numerical stability [6] , [11] .
Let be the number of processors available in the system. The parallelization of the discrete model is obtained by partitioning the space domain (freeway model) into equal segments Seg Seg and assigning each segment to the processors (PEs) . The choice of indexes defines a mapping of the segments to the processors [3] , [11] .
The computations associated with each segment have as their goal to compute the density, volume, and speed over that segment. The computation in the time dimension is not parallelized. At a fixed discrete time, this essentially means that the quantities and are computed by processor , iff the space node belongs to the segment Seg . This segment-processor mapping must be such that the communication delays for data exchanges, required in the computation, are minimized. This segment-processor mapping is a linear array mapping onto a hypercube (nCUBE) and Torus (T3E).
The contents of this paper are organized as follows. In Section II, a processor allocation scheme is presented. In Sections III, the parallelization of the traffic flow model is discussed. In Sections IV and V, the implementation of the model on the nCUBE2 and the Cray T3E is described. In Section VI, the simulation tests are shown. In Sections VII-IX, a performance study of the simulations is presented. In Section X, conclusions are drawn.
II. PE SCHEDULING
All the space nodes in the simulation must be allocated to PEs. This is also known as PE scheduling. For a given number of PEs, , it is desirable to allocate them in such a way that all PEs are utilized in the most efficient way possible, while keeping in mind any load-balancing requirements. Assume that there are space nodes. Then ideally, we allocate nodes per PE. Obviously, the closest we can come is an allocation of (ceiling operation) or (floor operation), where "floor" equals the integer division and "ceiling" equals "floor" 1 if is greater than . Our first approach to this problem we call Method I: allocate or space nodes to as many PEs as possible, with the remainder going to the last utilized PE. This algorithm has some side effects. Namely, as gets larger, so does the number of idle PEs. As an example, consider the case where and is an increasing power of two. We are faced with the situation in Table I , where rem is the number of space nodes that the last PE will need to process (i.e., the number remaining after the initial or distribution). This scenario is clearly not desirable from a load-balancing perspective. It does, however, free up some PEs for some other potentially useful work. In a real-time environment, these PEs could be doing I/O, or applying algorithms to the simulation results in order to effectively manage the traffic flow. However, at the largest values, where the lowest run-times are obtained, nearly 17% of the PE's are idle-an unreasonably large number. To obtain better load balancing, a second scheduling method was developed that was used in our implementation.
Method II seeks to improve load balancing by distributing the space nodes as evenly across the PEs as possible. This will require some combination of and . If and are the number of PEs allocated and space nodes, respectively, then we know
Given (1) and (2), we combine (3) and (4) to solve for and and obtain from Therefore, PEs each allocate space nodes, and PEs each allocate space nodes. In addition to knowing how many space nodes to allocate to each PE, we must devise a mapping of a subset of space nodes to each PE. Given that each PE knows its identity via a uniquely valued parameter , then the mapping is defined via the following pseudocode:
We introduce here the concept of upstream and downstream space nodes. If we think of the traffic as moving from upstream to downstream, then an upstream node is the leftmost or lowest index node within a segment, and the downstream node is the rightmost or highest index node within a segment.
Note that the first PEs were chosen to allocate space nodes. It could just as easily have been the first PEs allocating space nodes. The following example helps clarify this procedure:
Given and , then two PE's allocate 5 space nodes, and one PE allocates 6. For : upstream-node downstream-node offset temp upstream-node downstream-node offset temp upstream-node downstream-node
We should also note at this point that this allocation method places some constraints upon the simulation. Clearly, there are situations where either or can be zero. This is an indication that there are more processors than are necessary for the given number of space nodes (i.e., some PEs would be idle). This situation is flagged as an error and the program terminated. This allocation mechanism is used for all algorithms that follow.
III. THE ONE-AND TWO-STEP ALGORITHMS
In this section, we first explain the existing algorithm for the lax-momentum computation. In the one-step algorithm, one time step is executed before an interprocessor communication (IPC) is required. In the two-step algorithm, two time steps are executed before an IPC is required.
The core of the lax-momentum computations, for a given time step and space node, is quite simple. The general form of the computations is as follows (expressed in a -like pseudocode). In this notation, the odd and even references correspond to successive time steps. The evenk, evenq, and evenu variables are the previous time steps values for density, flow, and speed, respectively, and the oddk, oddq, and oddu variables are the current time step density, flow, and speed for which a new solution is being sought: and are constants across all processors, and oddk oddq and oddu are the and described in Section I. In a straightforward implementation, once the odd values are computed and distributed to the neighboring segments (on neighboring PEs), the even variables are loaded with the odd values, and the computation resumes for the next time step. The major steps of the one-step algorithm are demonstrated in Fig. 2 , which depicts a hypothetical situation with three PEs and 15 space nodes. Each segment can be viewed as a boundary value problem whose upstream and downstream nodes contain new boundary values for each time step. Thus, after partitioning the space nodes into segments of size nodes each (five in our example), each processor will allocate space for nodes to hold the boundary value as well.
The flow of the algorithm (time) begins at the bottom of the figure and proceeds upward [ Fig. 2(a)-(c) ]. We begin the algorithm at some arbitrary time with all even variables set to some initial value [step (a)]. From (5) and (6) Note that the upstream boundary on the farthest upstream segment, and the downstream boundary on the farthest downstream segment, will be determined by other means [marked by a box in step (b)]. These data are either prerecorded roadway data (our case) or could be obtained, in real time, from sensors strategically placed upon an actual roadway. Once the new (odd) values have been moved into the old (even) variables, the algorithm is ready to proceed with step . As mentioned previously, in any system with high IPC latency, the designer must structure the algorithm so that large amounts of computation are performed between communication steps. The only possible way to reduce the number of IPCs between processing elements here is to see if more than one computation can be done before an IPC is necessary. Whether this can be done or not depends upon the functional form of the computation. If we rewrite (5) and (6) We ignore oddu because it is simply a function of oddk and oddq and is easily obtained once they are known. We can see that each new computation has inputs from only adjacent (both upstream and downstream) and current nodes from the previous (even) time step. In a sense, each new (odd) computation is independent for each node, given that the neighboring nodes' data are known. Therefore, it should be possible to do a second computation on at least part of the nodes within a segment before incurring the cost of an IPC. The challenging question now is what to do with the upstream and downstream boundaries and their neighbors. During each IPC, we will send both the boundary values from the first complete time step and the inputs necessary for the neighbor to compute the second time steps' boundary values. When these computations are complete, we will be ready to begin the next two-step iteration. This is the two-step algorithm. It incurs a very small overhead in central processing unit (CPU) time, but the IPC time is cut almost in half. The assumption is that the extra computations in completing the second time steps boundary are more than made up for by the saved IPC. We will quantify these savings in Section VI. Finally, the two-step algorithm places one additional constraint on our PE allocation technique: both and must be three or more in order for there to be enough nodes for the partial second step to be performed.
Returning once again to our example and Fig. 3 , we can see the major steps of this algorithm.
Steps (a) and (b) are exactly as they were in the one-step algorithm. In step (c), the "inner" space nodes are computed for the second time step.
Step (d) is the new IPC. Note that not only the boundary value from the first time step [step (d), items 2 and 3] but also the additional data necessary for the neighbor to compute the "missing" information are sent so it can complete the second time step [step (d), items 1 and 4]. Several computations are performed in step (e). Once the first time steps' boundary has been loaded [step (d), item 2], the PE can then complete the second time step for its own upstream node [step (e), item 5]. With the partial information delivered in step (d), item 1, it can complete the upstream boundary for the second step [step (e), item 6]. Similar processing is applied to the data from step (d), items 3 and 4, to compute a new downstream node [step (f), item 7)] and boundary value [step (g), item 8 and step (h), item 9]. The algorithm is now ready to proceed with step .
IV. IMPLEMENTATION ON THE nCUBE2
The nCUBE2 is a multiple-instruction multiple-data (MIMD) hypercube parallel processing computer. Each PE contains a proprietary processor with a 20-MHz clock and 4 MB of local memory. Peak theoretical performance is 4.1 millions of floating point operations per second (MFLOPS) per PE. The hypercube architecture is a distributed-memory message passing architecture. In a hypercube of dimension , there are processors, labeled . Two processors and are directly connected (neighbors) iff the binary representations of and differ in exactly one bit. Each edge of a hypercube graph represents a direct connection between two processors. Thus, any two processors in a hypercube are separated by at most other processors. Fig. 4 illustrates a hypercube graph of dimension . The number of processors to be allocated to a job is chosen by the user, but must be a power of two. Table II summarizes interprocessor communication times for neighbor processors and basic floating-point operation times for the nCUBE2 [13] . We see that communication even between neighboring processors is many times slower than floating-point operations.
In an architecture with high communication latencies (such as nCUBE2), the algorithm designer must structure the algorithm so that as much computation as possible is done between communication steps.
Two important factors that influence the delivered performance on this machine are load balancing and reduction of communication overhead. 
V. IMPLEMENTATION ON THE CRAY T3E
The Cray T3E (model 900) is a distributed shared-memory MIMD architecture with a three-dimensional (3-D) torus topology and bidirectional channels. Each PE consists of a DEC Alpha 21 164 processor, a system control chip, local memory, and a network router. The custom-made control chip implements the distributed shared memory, which consists of all the local memories in the PEs. Each processor is connected to six other processors in a 3-D toroidal mesh, as seen in Fig. 5 . All PEs in opposite "faces" of the mesh are connected to each other. The T3E supports low-latency, high-bandwidth communications via this mesh and is capable of delivering a 64-bit word every system clock in all six directions, for a raw bandwidth of 600 MB/s, with data bandwidths of 100-480 MB/s after protocol overheads.
The DEC Alpha 21 164 processor is capable of issuing four instructions per clock period, giving it a theoretical peak rate of 900 MFLOPS. Each PE supports 128 MB of local memory.
Optimization on the T3E is not a straightforward process. The user is given fine-grained control over what aspects of optimization they wish to control. There are no less than 26 selectable options, most having more than one level of control. Without in-depth knowledge of the exact relationships among the compiler, the source code, and the objectives of the application, it is nearly impossible to know which combination of optimizations would be optimal. Most were selected based on their descriptions, together with some basic benchmarking of those whose effects could not be predicted beforehand. Some combinations yielded better performance, others worse. Table III shows the options as benchmarked (these benchmarks were made using the complete simulation code, but on a smaller data set size for expediency).
All pairs of options were tried, the most obvious being pipeline3 with scalar3, since they both resulted in improvements. This combination resulted in a run time of 5.211 s (a ratio of 0.923). But a better combination was discovered: pipeline3 with unroll, resulting in a run time of 5.170 s (a ratio of 0.915), and the best run time of all the options/combinations tested. Note that no three-option combinations were tested due to time constraints. The unroll directive instructs the compiler to unroll all loops generated by the compiler. The pipeline directive instructs the compiler to aggressively pipeline the software to the CPU, including speculative loads and operations. Both of these compiler directives will result in longer compilation times but faster execution times.
Because the T3E maps a linear array of PEs into a mesh, the mapping of highway segments to processors is a straightforward linear mapping. For performance reasons, it is important that neighboring segments are mapped to neighboring processors. The most problematic implementation issue is the paradigm with which distributed memory is implemented. The T3E has several different IPC mechanisms to choose from. At the highest level are standard message passing interfaces (like the standard PVM application programming interface) to the lower level (and faster) interfaces built around shared-memory operations. It is at this level that an IEEE POSIX-like shared memory interface is defined which is much more efficient than PVM. Because of this, the Cray shmem get() and shmem put() shared-memory routines were selected.
VI. SIMULATION TESTING
The simulation was implemented on the 1024-node nCUBE2 computer located at the Massively Parallel Computer Research Laboratory, Sandia National Laboratories, Albuquerque, NM. The simulation was also run on the Cray T3E at the San Diego NPACI, San Diego, CA, and Pittsburgh PSC Supercomputing Center, Pittsburgh, PA.
As a test site, a multiple entry/exit section of the I-494 highway was chosen in Minneapolis, MN. This section of eastbound I-494 extends from I-394 in the west to Nicollet Avenue in the east. It is 15.5 mi long, with 17 exit and 19 entry ramps. Data for the simulation were recorded on April 9, 1997, and span a 24-h period beginning at midnight of that day. The Appendix shows a sample of the data. To test the simulation, the time and space mesh sizes were chosen as s and ft. The discrete model contained 814 space nodes. The tests were analyzed in two ways: comparison with real data and computational performance.
Traffic data are collected at the upstream/downstream boundaries of the freeway section and at check-station sites (checknodes) inside the freeway section. The error statistics are sampled in Tables IV and V. The relative errors (Rel. 2-Norm) are at a level of about 10% for the volume but lower for the speed measurements. These error levels are consistent with past simulations carried out by simulation systems based on a single-processor computer (see [1] and [11] ). 
VII. PERFORMANCE STUDY
In general, the serial (or single PE) computational performance of a given algorithm implemented on a given computer architecture is expressed in terms of MFLOPS. In order to derive this measure, an estimate of the number of floating-point operations (FLOPS) is needed for the algorithm in question. Upon examining (5)- (7) in the algorithm, we see there are some 32 floating-point operations contained within the main simulation loop. There are, in actuality, 34 such operations (this pseudocode was somewhat simplified for purposes of clarity). In general, each space node computation requires 34 FLOP. If we rewrite the one-step algorithm pseudocode in terms of the number of FLOPS performed, we arrive at the following:
for ns 5-minute time steps do for 300 seconds do for each space node on this PE do end for IPC advance time seconds end for end for Some explanation is in order. The space node loops certainly make sense in terms of the number of nodes that are being solved for in both the first and second steps of the algorithm. Plus it makes sense that there would be two additional nodes solved for (marked ) since the second step does not operate on all the space nodes that the first step does (two less). In actuality, the two-step algorithm requires slightly more computation than the one-step. In the one-step case, the segment end-nodes are exchanged between neighboring PEs during the IPC to be used as segment boundary values for the next compute cycle. This cannot happen in the two-step case, since the segment end-nodes have yet to be calculated for the second step. This forces neighboring PEs to both compute the boundary values, but for different purposes: one as the segment end-node and the other as the boundary value for the next compute cycle. Thus, we must perform two additional node computations (marked ) for a total number of operations of ns To derive the desired MFLOPS value, we need only divide the total number of operations by both the single PE execution time and 10 . For this simulation, ns and .
VIII. nCUBE2
Table VI summarizes the MFLOPS results for nCUBE2. Here, we see the two-step algorithm outperforming the one-step algorithm when the software is run on a single PE, as mentioned at the end of Section VII.
For the parallel performance analysis (for both the nCUBE2 and T2E), we evaluate the following measures: the serial execution time , the parallel execution time , the parallel speedup , and the parallel efficiency . Additionally, can be broken down further to component measures of input, computation, and output. The computation time is simply the time for the discrete model computations. This time can be further decomposed into calculation time and IPC time.
NCUBE2 performance data are presented first as Tables VII and VIII, then as Figs. 6-8. Table IX contains the two-step over the one-step gains.
Based on the single PE timings, the restructuring of the code, which eliminates the odd and even swapping, saves a total of 15%. Even assuming that this fraction remains constant across PE sizes, the two-step algorithm, by way of halving the number of IPCs, still saves an additional 12% at the 256 PE size, where IPC times are the highest (as a fraction of total compute time). On average, the IPC time is reduced by 42% by the two-step algorithm. A theoretical expected peak value would be 50%, but in practice cannot be obtained. The data content of the IPC for the two-step algorithm is more than twice that of the one-step algorithm, which will result in slightly longer IPC times. Overall, these data show that the two-step algorithm is ideally suited to the nCUBE2 architecture, where IPCs are quite costly compared to computation.
IX. CRAY T3E
Table X summarized the MFLOPS results for T3E. One may wonder why the two-step algorithm outperforms the one-step algorithm when the software is run on a single PE. This is a side effect of the implementation of the two-step algorithm. Let us recall from (5)- (7) in the algorithm that the space nodes currently being solved for have their data stored into locations prefixed with odd, while the data for the same space node, but for the previous time step, are prefixed with even. In the one-step algorithm, after the odd data are computed, the data are simply copied into the even variables for use in the next time step. However, in the two-step algorithm, the computations are done "in place," as it were, so that the first step is stored into the odd variables and the second time step is stored into the even variables, thus avoiding the overhead of the copy operation. This has the benefit of lower computation times but the disadvantage of approximately doubling the size of the core computational section of the code.
Cray T3E performance data are presented first as Tables XI and XII, then as Figs. 9-11. Table XIII contains the two-step over one-step gains. We see that the two-step algorithm is faster than the one-step algorithm for . For , the one-step is faster because a) the IPC time is very fast on the T3E, and even halving it does not lead to great savings overall and b) there are slightly more computations in the two-step algorithm than the one-step. Thus, there is a breakpoint (in ) for the gain of the two-step, and it occurs at . This breakpoint is so small here because the number of road space nodes assigned to each PE in our simulation is very small (e.g., 12 or 13 nodes per PE for ). In realistic simulations, we expect this to be much larger and, thus, the two-step algorithm will be even faster.
X. CONCLUSION
A very important component of an intelligent highway management system is a traffic simulation system. The design of a real-time traffic simulation system is a challenging problem. In this paper, the design of a parallel (macroscopic) traffic simulation system is demonstrated. This system could be used as a component of a real-time simulation system. This parallel system was implemented on the NCUBE2 and on the Cray T3E parallel computer. Tests were run with real traffic data to validate the accuracy and computational rate of the system. A 24-h 15.5-mi simulation, with real traffic data, took 2.39 s on the Cray T3E and 88.51 s on the NCUBE2 versus 65.65 min on a typical single processor system (a 133-MHz Pentium). Two algorithms were implemented, offering tradeoffs in execution time, IPC time, and memory size. The two-step algorithm, when compared to the one-step, reduced significantly the communication time on both parallel computers and would be of considerable savings when larger highway segments (e.g., entire beltway systems) are modeled. With increase in problem size, even higher efficiencies are to be expected. APPENDIX SAMPLE HIGHWAY SCHEMATIC AND DATA FILE Fig. 16 is a small section of a schematic from I-494 near Minneapolis (see Sections V and VII). Specifically, this is the interchange at I-494 and I-35W. It should be treated as example data only. The detector locations correspond to simulation site locations, e.g., detector 120 corresponds to exit ramp 14, located at space node 382 in this study. The values that show XXX or XXXX indicate a detector failure. Some ramps contain multiple detector locations. For example, detectors 3090, 3091, and 863 all point to the ramp from northbound I-35W to eastbound I-494. But only detector 863 data were contained in the data files (please see the equation at the top of the next page). This is a fragment from an original data file, exactly as delivered by MNDOT. This fragment is one of nine total files that covered the 24-h period beginning midnight April 9 through midnight April 10 and covering the 15.5-mi section of highway simulated. A three-digit line number has been added to the beginning of each line for discussion purposes. The " " at the end of most lines indicate that additional columns have been removed for brevity. The column containing the detector number contains the volume data for that detector.
001 VOLUME n OCCUPANCY
