We discuss a parallel-processing experiment that uses a particle-in-cell (PIC) code to study the feasibility of doing large-scale scientific calculations on multiple-processor architectures. A multithread version of this Los Alamos PIC code was successfully implemented and timed on a UNIVAC System 1100/80 computer. Using a single copy of the instruction stream, with common memory to hold the data, eliminated data transmission between processors. The multiple-processing algorithm exploits the large, independent tasks inherent in the PIC code, as well as the configuration of the UNIVAC System 1100/80. Timing results are given for the multithread version of the PIC code using one, two, three, and four identical processors; they show promising speedups relative to the overall run times measured for a single-thread version of the PIC code.
INTRODUCTION
Anticipating a need for increased computational speed¹ for laboratory codes (which is unlikely to be attained by single-processor systems), we have initiated studies to test the feasibility of doing parallel processing on multiple-processor architectures.² In part, our hope is to learn about multiple-processor architectures, the compatibility of algorithms with particular parallel-processing environments, parallel-processing speedups as a function of the number of processors, and the desirable characteristics of multiple-processor architectures in general.
This paper presents the results of our investigation into the feasibility of parallel processing a specific scientific problem on a commercially available multiple-processor system and, in particular, the computational speedups obtained as a function of the number of processors employed. The problem used in this experiment involves a particle-in-cell (PIC) method for simulating the electrostatic interactions of a collisionless plasma. We first outline the PIC algorithm and graphically describe its parallel-processing structure as implemented in our experiment. A general description of the UNIVAC System 1100/80 is then given, followed by a discussion of the implementation of the PIC code on that system. Finally, the results of our experiment are given, showing overall computational speedups as a function of the number of processors and the equivalent number of parallel activities.
PARTICLE-IN-CELL
The problem selected for our parallel-processing experiment models the collisionless, electrostatic interaction between two superimposed plasma beams with a relative drift velocity.³ The code uses a particle-in-cell method for studying the interaction and resulting motion of the charged particles in this simulation.⁴ This code is of general interest to us because it represents a class of algorithms that vectorize only to a limited extent on our vector computers. Because of the structure of the PIC algorithm (discussed in the following section), the conversion to parallel processing was made with relative ease.
PIC Algorithm
The particle-in-cell method used in this study decomposes a region of space into a collection of cells. These cells are then used for tracking particle movement, and they assist in evaluating relevant physical properties. An initialization stage sets up two ensembles of charged particles (we shall use particles to mean charged particles throughout this paper) constituting the two superimposed, collisionless plasma beams. During this initialization, the particles are distributed uniformly in space and randomly in velocity. The movement of particles is discretized in a time step (dt).
During each computational time step of the simulation (see Figure 1), cell-centered charges (C) are calculated by linearly weighting each particle's charge contribution to the four nearest-neighbor cell centers. Using this charge distribution, Poisson's equation with periodic boundary conditions is solved for the associated electrostatic potential (φ) on the grid of cell centers, with the resulting electric (E) field interpolated to individual particle positions. Under this E field, each particle's position and velocity (see Figure 2) are advanced (pushed).
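The charge-weighting and particle-push steps described above can be sketched as follows. This is a minimal illustration and not the Los Alamos code: the function names, the unit cell size, the placement of cell centers at half-integer coordinates, and the simple explicit velocity-then-position update are all assumptions made for the example. The Poisson solve for φ is omitted, so the field arrays are taken as given.

```python
import math

def _weights(x, y, nx, ny):
    """Yield (i, j, w) for the four cell centers nearest to (x, y).

    Assumes unit cells on a periodic nx-by-ny grid, with the center of
    cell (i, j) at (i + 0.5, j + 0.5). The four weights sum to 1.
    """
    gx, gy = x - 0.5, y - 0.5                 # position relative to cell centers
    i0, j0 = math.floor(gx), math.floor(gy)
    fx, fy = gx - i0, gy - j0                 # fractional offsets in [0, 1)
    for di, wx in ((0, 1.0 - fx), (1, fx)):
        for dj, wy in ((0, 1.0 - fy), (1, fy)):
            yield (i0 + di) % nx, (j0 + dj) % ny, wx * wy

def deposit_charge(xs, ys, q, nx, ny):
    """Linearly weight each particle's charge q onto the cell-centered array C."""
    c = [[0.0] * ny for _ in range(nx)]
    for x, y in zip(xs, ys):
        for i, j, w in _weights(x, y, nx, ny):
            c[i][j] += q * w
    return c

def gather_field(field, x, y):
    """Interpolate a cell-centered field component to a particle position."""
    nx, ny = len(field), len(field[0])
    return sum(field[i][j] * w for i, j, w in _weights(x, y, nx, ny))

def push(x, y, vx, vy, ex, ey, qm, dt, nx, ny):
    """Advance one particle by dt under the field (ex, ey).

    qm is the charge-to-mass ratio. The update order (velocity, then
    position) is an assumed scheme; positions wrap periodically.
    """
    vx += qm * ex * dt
    vy += qm * ey * dt
    return (x + vx * dt) % nx, (y + vy * dt) % ny, vx, vy
```

Because the four weights for each particle sum to one, the total of the cell-centered array equals the total particle charge, which gives a quick sanity check on any deposition routine of this kind.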
PIC-Parallel-processing Structure
The computational structure of the PIC algorithm, as implemented on the UNIVAC System 1100/80, takes advantage of the large, natural computational divisions in the particle initialization and in parts of the particle-in-cell calculation.

Our parallel-processing version of the PIC code was implemented on a UNIVAC System 1100/80 multiple-processor system. The System 1100/80 may be configured with one to four processors; UNIVAC designates the one-, two-, three-, and four-processor configurations as the 1100/81, 1100/82, 1100/83, and 1100/84, respectively. A global software manager (EXEC) executes on all processors and, coupled with hardware devices, drives the multiple-processor architecture of the System 1100/80. The processors share a common memory, which allows multiple-program execution of tasks written in FORTRAN or COBOL. A principal feature of the System 1100/80 is its ability to parallel-process a single instruction stream operating on data in common memory. This capability, supported by the COBOL compiler but not by the FORTRAN compiler, was essential for our particular experiment.

[Figure caption, partially recovered: ... (1 and 4), where A_n = total number of parallel activities (multithread), n = total number of particles, n_i = number of particles for activity i, C = total charge (distribution), and C_i = charge computed for activity i.]
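The division of particles among activities, using the symbols from the figure caption, can be illustrated with a short sketch. Everything here is hypothetical scaffolding: the function names, the contiguous partitioning rule, the nearest-cell deposit (used in place of linear weighting to keep the example short), and the use of Python threads are assumptions, and the paper does not say how the partial charges were combined. Each activity i processes its own n_i particles and accumulates a private charge array C_i; the total C is the sum over activities.

```python
from concurrent.futures import ThreadPoolExecutor

def split_particles(n, a_n):
    """Return a_n contiguous (start, stop) ranges covering n particles.

    Range sizes (the n_i values) differ by at most one.
    """
    base, extra = divmod(n, a_n)
    bounds, start = [], 0
    for i in range(a_n):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def partial_charge(cells, q, n_cells, start, stop):
    """Activity i: accumulate C_i for particles start..stop-1 in a private array."""
    c_i = [0.0] * n_cells
    for k in range(start, stop):
        c_i[cells[k]] += q          # nearest-cell deposit (simplifying assumption)
    return c_i

def total_charge(cells, q, n_cells, a_n):
    """Run a_n activities over the particle list and sum their C_i into C."""
    bounds = split_particles(len(cells), a_n)
    with ThreadPoolExecutor(max_workers=a_n) as pool:
        partials = list(pool.map(
            lambda b: partial_charge(cells, q, n_cells, b[0], b[1]), bounds))
    # Combine the private arrays; no activity ever writes to another's array.
    return [sum(column) for column in zip(*partials)]
```

Giving each activity a private array avoids write conflicts on shared cells at the cost of extra memory. Note that Python threads in CPython do not run this kind of loop concurrently, so the sketch shows the data decomposition only, not a speedup.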
Implementation
The PIC code was written entirely in FORTRAN and implemented with a single copy of the instruction stream. The management of data addressing and the mechanics of parallel-processing synchronization were devised and implemented by Dave Hammer of Sandia National Laboratories, Albuquerque, NM.† Figure 4 represents a simplified diagram of a UNIVAC 1100/84 (four-processor) system, on which our PIC timing runs were made. Although not indicated in the diagram, an activity is not necessarily processed by only one physical processor. In fact, during the complete computational cycle of an activity, all processors may time-share its execution. A distinction, therefore, is made between activities and processors. All relevant particle-in-cell data were put into common blocks and partitioned for use by specific activities. Because of software addressing limitations, the PIC code was restricted to a maximum of 262k (decimal) words of total memory. For each particle, five data quantities (two for position and three for velocity) were required. Three mesh quantities, each defined on a 34 x 34 mesh, were required and were duplicated for each of a maximum of eight (particle-push) activities. A total of 37k particles were initialized for processing, requiring 213k words of memory (particle plus mesh data). A further 47k words of memory were used for the instruction stream, address mapping, and activity synchronization scheme.

[Figure 4: simplified diagram of a UNIVAC 1100/84 (four-processor) system. Recoverable labels: Central Memory, Cache.]

†By devising an address mapping and a synchronization scheme for multithread activities, Hammer essentially converted the System 1100/80 into a FORTRAN parallel-processing machine for our use.
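The memory figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that "37k" and "47k" mean 37,000 and 47,000, and that "262k (decimal)" refers to 2^18 = 262,144 words; the paper gives only the rounded values, so these readings are assumptions.

```python
PARTICLES = 37_000            # "37k" read as 37,000 (assumption)
WORDS_PER_PARTICLE = 5        # two position + three velocity components
MESH_SIDE = 34                # 34 x 34 mesh
MESH_QUANTITIES = 3           # three mesh quantities
MAX_ACTIVITIES = 8            # mesh data duplicated per particle-push activity
OVERHEAD_WORDS = 47_000       # "47k" read as 47,000 (assumption)
LIMIT_WORDS = 2 ** 18         # "262k (decimal)" read as 262,144 (assumption)

particle_words = PARTICLES * WORDS_PER_PARTICLE
mesh_words = MESH_QUANTITIES * MESH_SIDE * MESH_SIDE * MAX_ACTIVITIES
data_words = particle_words + mesh_words
total_words = data_words + OVERHEAD_WORDS

print(f"particle data: {particle_words:,} words")
print(f"mesh data:     {mesh_words:,} words")
print(f"data total:    {data_words:,} words")
print(f"grand total:   {total_words:,} of {LIMIT_WORDS:,} words")
```

Under these readings the particle and mesh data come to 212,744 words, consistent with the quoted 213k, and the grand total of 259,744 words fits within the addressing limit with little room to spare.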
PIC PARALLEL-PROCESSING RESULTS

Figure 5-Plot of number of processors versus speedup, corresponding to Table I (axis label: Number of Processors).
A multithread version of the PIC code was executed on a UNIVAC System 1100/80 with one, two, three, and four identical processors. Overall run times were measured, and the results are given in Table I and Figure 5. The speedup values are the ratios of the overall execution time of a single-thread version of PIC (running on one processor) to the overall execution time of the multithread PIC code running on two, three, and four processors. We found that a maximum speedup of three was attained when using four processors with four activities spawned for each task. Because the multithread PIC was not totally parallel (see Figure 3), the speedup for four processors may not indicate the full potential of the PIC algorithm. The times recorded and used for the parallel-processing speedup calculations were wall-clock times, with timing runs made in a dedicated mode. Because of resource and time limitations, actual CPU times were not measured; therefore, no estimates could be made of the losses in effective processing time during the synchronization stage of each multithread activity.
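The speedup definition above, and one way to interpret a speedup of three on four processors, can be sketched as follows. The use of Amdahl's law here is our own illustration and is not part of the original analysis: it assumes that all of the shortfall from ideal speedup comes from a fixed serial fraction and that synchronization overhead is zero, which the paper could not verify because CPU times were not measured. The function names are hypothetical.

```python
def speedup(t_single, t_multi):
    """Ratio of single-thread wall-clock time to multithread wall-clock time."""
    return t_single / t_multi

def efficiency(s, n_proc):
    """Speedup per processor; 1.0 would be ideal linear scaling."""
    return s / n_proc

def amdahl_parallel_fraction(s, n_proc):
    """Parallel fraction p implied by speedup s on n_proc processors.

    Solves s = 1 / ((1 - p) + p / n_proc) for p (Amdahl's law, assumed).
    """
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n_proc)

def amdahl_speedup(p, n_proc):
    """Speedup predicted by Amdahl's law for parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n_proc)

p = amdahl_parallel_fraction(3.0, 4)
print(f"efficiency on 4 processors: {efficiency(3.0, 4):.2f}")
print(f"implied parallel fraction:  {p:.3f}")
print(f"projected speedup on 16:    {amdahl_speedup(p, 16):.1f}")
```

Under these assumptions the reported result corresponds to a parallel fraction of 8/9, and the same fraction would cap the speedup on 16 processors at about six. That projection is only as good as the zero-overhead assumption and should be read as a rough bound on what the partially parallel code could deliver, not as a measured result.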
CONCLUSIONS
Our results strongly suggest the possibility of significant computational speedups for a multiple-processor architecture similar to the UNIVAC System 1100/80. The coupling between algorithm and processing architecture illustrates not only the seemingly high degree of compatibility between our particular code and the computing environment, but also the need to identify those algorithms for which specific multiple-processor architectures are most effective.
The straightforward use of FORTRAN in coding the multithread PIC algorithm greatly simplified the overall task of implementing our parallel-processing experiment. Programming in FORTRAN is certainly a characteristic of Laboratory codes, and would be a desirable feature to retain when converting such codes from serial to parallel-processing systems.
Encouraged by our results, we are currently studying the possibility of a totally parallel version of the PIC algorithm. We also plan to investigate parallel processing on multiple-processor architectures possessing as many as 16 processors.
