Introduction
The ASC (Associative Computing) model was developed by Johnnie W. Baker, Jerry L. Potter, and others at Kent State University [16]. This model is a generalization of the SIMD paradigm in which each of up to k instruction streams (ISs) sends commands to a unique set in a dynamic partition of the SIMD processors. A wide range of algorithms and several very large programs have been implemented using the ASC language, including a parallel optimizing compiler for ASC, two rule-based inference engines, and an associative PROLOG interpreter [15] [16] [7] [2] [6] [20] [21]. The goal of this paper is to simulate PRAM, a popular model of computing, with ASC; this provides an automatic transfer of PRAM algorithms to ASC with runtime bounds. PRAM is the best known model of parallel computation, and more algorithms have been written for it than for any other parallel model. By simulating PRAM, ASC can utilize all PRAM algorithms.
Much previous work has addressed the simulation of PRAM with parallel computers. The work here expands what was known previously by combining existing network methods with the added power gained by having one or more instruction streams that can coordinate and communicate globally.
ASC: A Generalized Data Parallel Model
Hillis and Steele [10] defined the data parallel programming paradigm in the early 1980s, but this work did not provide a complete computational model. The ASC model extends this concept to a complete programming and computational model. It is appropriate to have a model that emphasizes data parallel programming, even though other computational models may be capable of supporting it. Other recently developed models such as BSP and LogP tend to focus on MIMD and distributed computing concepts and thus capture that style of computing best [5]. Furthermore, a 1991 survey reported that 90 percent of parallel applications are data parallel in nature [4].
This section describes the associative model of computation presented in the IEEE Computer article "ASC: An Associative Computing Paradigm," which is based on work done at Kent State University [16]. Also, see [1] [2] [11] [15]. The model can be supported efficiently on a wide variety of existing platforms. Moreover, ASC is a model for a currently buildable, multi-purpose parallel computer which supports a wide array of applications, from grand challenge applications requiring massive parallelism to on-chip parallelism for PC interactive graphics. The model reduces the arduous chore of task allocation and embodies the intuitive and well accepted method of data parallel programming. ASC encompasses many of the available data parallel languages, and there is also an actual language called ASC [16]. The ASC language has been implemented on various platforms such as the Thinking Machines CM-2, Goodyear/Loral/Lockheed-Martin's Aspro, the Wavetracer, personal computers, and UNIX workstations.
The data parallel style of programming is actually very sequential in nature, and is therefore much easier for traditional sequential programmers to master than MIMD programming, which employs task allocation, load balancing, synchronization points, etc. A standard associative language (such as ASC) is implementable across many distinct platforms, providing true portability for parallel algorithms [16]. No specific machine architecture is required to use the ASC model effectively. It can be efficiently implemented on PCs, single workstations, SIMDs, MIMDs, and distributed systems. This model is intended to standardize the concept of associative computing and to provide a basis for complexity analysis of data parallel and associative algorithms [8].
Description of ASC
The ASC model is a hybrid SIMD/MIMD model and is capable of both styles of programming. A frequent criticism of SIMD programming is that several PEs may be idle during if-else or case statements. The instruction streams provide a way to concurrently process conditional statements by partitioning the PEs among the ISs.
In the most basic terms, the associative model (ASC) has a large array of processing elements (PEs) and one or more instruction streams (ISs), each of which broadcasts its instructions to a unique set in a dynamic partition of all PEs. The number of ISs is normally expected to be small in comparison to the number of PEs. The multiple ISs supported by the ASC model allow greater efficiency, flexibility, and reconfigurability than is possible with only one IS. An ASC machine with j ISs and n PEs will be written as ASC(n,j).
Each PE (or cell) has a local memory and ASC supports the associative processing concept, which is to locate objects by content instead of by location in the combined local memory of the PEs. This is accomplished by searching a specified field of each PE for a given data item. Each PE is capable of performing local arithmetic and logical operations and the other usual functions of a sequential processor other than issuing instructions.
Cells can be active, inactive, or idle. Active cells execute the program broadcast from an IS. An inactive cell is in an IS's group of cells but does not execute instructions until the IS instructs inactive cells to become active again. Idle cells are currently inactive and contain no essential program data but may be reassigned as active cells at a later time. ISs can be active or idle. An active IS is currently issuing instructions to a group of cells or waiting to perform a join. An idle IS is not assigned to any PEs and is waiting until another IS forks, partitioning its PEs between itself and a previously idle IS. All PEs may be assigned to one IS in a single operation. Also, a PE can dynamically change the IS to which it listens using local data and comparisons. This, in fact, is the only form of IS switching needed for the ASC simulation of PRAM presented in this paper. For instance, if an IS broadcasts some value to a set of PEs, the PEs could set this value as their active IS in the next instruction cycle, or choose not to switch.
ASC specifically supports data parallelism; the data reduction operations of and, or, min, and max; one or more instruction streams (ISs), each of which is sent to a distinct set in a dynamic partition of the processors; broadcasting by the ISs; and task assignment to ISs using control parallelism or data locality, which allows PEs to switch ISs based on local data. The ASC model is shown in Figure 1, which includes three networks, real or virtual: the PE interconnection network, the IS interconnection network, and the network between the PEs and ISs.
There are no restrictions on the type of cell network used with the ASC model. It could be the mesh, hypercube, shuffle-exchange, or many others. The programmer does not need to worry about the actual network present or the routing scheme, but only that ASC is capable of generalized routing with some latency. Some of the most obvious choices for an actual ASC machine are the linear array and the mesh, because they are easy to implement in VLSI and are easily expanded.
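As a rough illustration of the associative search and reduction operations listed above, the following Python sketch models each PE as a small record, searches one field by content, and applies reductions over the responding PEs. The record layout, the field names, and the helper names (associative_search, reduce_max, reduce_or) are assumptions made for illustration; they are not part of the ASC language.

```python
from typing import Dict, List

PE = Dict[str, int]  # hypothetical model of one PE's local memory


def associative_search(pes: List[PE], field: str, value: int) -> List[bool]:
    """Every PE compares its own field against the broadcast value."""
    return [pe.get(field) == value for pe in pes]


def reduce_max(pes: List[PE], mask: List[bool], field: str) -> int:
    """MAX reduction over the PEs that responded to the search."""
    return max(pe[field] for pe, active in zip(pes, mask) if active)


def reduce_or(mask: List[bool]) -> bool:
    """OR reduction: did any PE respond?"""
    return any(mask)


cells = [
    {"color": 1, "weight": 5},
    {"color": 2, "weight": 9},
    {"color": 1, "weight": 7},
]
responders = associative_search(cells, "color", 1)
heaviest = reduce_max(cells, responders, "weight")
```

The sequential loops stand in for work that all PEs would perform at once; in the model, the search and each reduction are counted as single operations.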
Simulation of PRAM
This section presents algorithms for ASC to simulate PRAM. Most notably, the ASC simulation of PRAM achieves a constant lower bound when the PRAM algorithm uses the same order of shared memories as ASC has ISs. A specific case is when an ASC(n,1) machine simulates a priority CRCW algorithm employing only O(1) shared memories; then a cycle is simulated in constant time.
Results in [13] [14] [4] show that network models in which each PE has a constant bound on the number of links per node (bounded degree networks) can simulate a priority CRCW cycle in O(log n) with high probability. Thus, by combining the one-IS ASC simulation presented here with existing methods that use networks, the resulting simulation has a constant runtime when emulating a constant number of shared memories, and a probabilistic upper bound of O(log n) for certain logarithmic diameter bounded degree networks (like the hypercube). Moreover, for any network these simulations have an upper bound of route(n), where route(n) is the amount of time in terms of n to perform a priority CRCW operation using the network. For example, a mesh network could perform this operation in O(√n) [19].
When simulation of a model is performed there are various operations to consider. For parallel models, the operations that need to be simulated are parallel execution of processors and data movement between processors [3]. These are determined with a mapping of processor resources from one model to another and with the algorithmic emulation of operations. When the time complexity of each operation performed in a cycle is divided by the complexity to perform the same operation on the machine being simulated, the maximum time to emulate any operation gives the slowdown of the simulation and also an indication of the relative powers of the two machines, or how different they are [13] [14] [4]. Similar models should simulate each other in nearly the same amount of time. A stronger model simulates a weaker model in significantly less time than the weaker model simulates the stronger one.
A Synchronous Definition of PRAM
A PRAM(n,m) machine is defined as a collection of n sequential (RAM) machines and a set of m global memories. Each RAM of the PRAM has an instruction set, a local memory, and a specific address in the range 0, ..., n-1 [9].
During one cycle of execution, each processor executes the same instruction with different data synchronously. The instruction can be a local computation, a read from a global memory 0, ..., m-1, or a write into a global memory 0, ..., m-1. Almost all known PRAM algorithms are for synchronous PRAM [17]. It is assumed that one machine cycle takes a constant amount of time regardless of the hardware required to build such a machine. This paper deals mainly with priority CRCW and combining CRCW, although the EREW and CREW simulations are also derived. Priority CRCW allows the PE with the largest address to write its value when several PEs are writing to the same global memory. With combining CRCW, the values written by all PEs to the same location are combined by some arithmetic or logical operation such as addition. A diagram of PRAM is shown in Figure 2.
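The two write-conflict rules can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than a definition taken from the cited work: write requests are modeled as (PE address, memory address, value) triples, the function names are hypothetical, and the combining rule is assumed to combine only the values written in the current cycle (the old cell content is overwritten).

```python
from typing import Callable, Dict, List, Tuple

# Assumed request format: (PE address, target memory address, value).
WriteRequest = Tuple[int, int, int]


def priority_write(memory: List[int], requests: List[WriteRequest]) -> None:
    """Priority CRCW: for each cell, the PE with the largest address wins."""
    winners: Dict[int, Tuple[int, int]] = {}
    for pe, addr, value in requests:
        if addr not in winners or pe > winners[addr][0]:
            winners[addr] = (pe, value)
    for addr, (_, value) in winners.items():
        memory[addr] = value


def combining_write(
    memory: List[int],
    requests: List[WriteRequest],
    combine: Callable[[int, int], int],
) -> None:
    """Combining CRCW: values aimed at the same cell are merged by `combine`."""
    merged: Dict[int, int] = {}
    for _, addr, value in requests:
        merged[addr] = combine(merged[addr], value) if addr in merged else value
    for addr, value in merged.items():
        memory[addr] = value


requests = [(0, 1, 10), (3, 1, 30), (2, 1, 20), (1, 0, 5)]
mem_priority = [0, 0]
priority_write(mem_priority, requests)
mem_combining = [0, 0]
combining_write(mem_combining, requests, lambda a, b: a + b)
```

With these requests, PEs 0, 2, and 3 all target cell 1: the priority rule keeps the value from PE 3, while the combining rule with addition stores the sum of the three values.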
Why Simulate PRAM with ASC?
ASC is a practical data parallel model for real machines, and there is a wealth of PRAM algorithms, so it is of interest to obtain ASC algorithms by simulating PRAM. Many PRAM algorithms map very well to ASC while some may need adjustment in terms of reducing communications at perhaps the cost of performing more computations, especially when considering parallel slackness which is defined as the ratio of data to the number of processors. In other words, there should be sufficiently more data than processors so that a parallel algorithm runs optimally.
One algorithm that maps very well to ASC is the O(log n) expected time PRAM convex hull algorithm that assumes a random distribution of points [12]. It is shown in this section that an ASC machine with only one IS and no network can simulate priority CRCW with a constant number of shared memories in constant time. There are no doubt many other useful priority CRCW algorithms that also use a constant number of shared memories, and these have a corresponding ASC algorithm that executes in the same time.
Even without a PE network, ASC has intrinsic capabilities to perform some operations faster than PEs connected only by a bounded degree network, since such a network cannot perform a 1-to-n broadcast operation in less time than the order of the diameter of the network. For example, a mesh needs √n time to broadcast data, and a hypercube needs log n time to broadcast [19] [14].
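To give a feel for the size of these diameter bounds, the following Python sketch computes the number of hops a 1-to-n broadcast needs from a worst-case source on a square mesh and on a hypercube. The function names are hypothetical, the mesh is assumed to be a √n by √n grid without wraparound links, and the hypercube is assumed to have n equal to a power of two.

```python
import math


def mesh_broadcast_hops(n: int) -> int:
    """Hops from one corner of a sqrt(n) x sqrt(n) mesh to the opposite corner."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    return 2 * (side - 1)


def hypercube_broadcast_hops(n: int) -> int:
    """Diameter of a hypercube with n nodes, i.e. log2(n)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


mesh_hops = mesh_broadcast_hops(1024)
cube_hops = hypercube_broadcast_hops(1024)
```

For n = 1024 the mesh needs 62 hops and the hypercube 10, whereas an IS broadcast is counted as a single constant-time operation in the ASC model.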
Operations Needed for Simulation
To simulate PRAM with any machine, three operations need to be handled: parallel execution of the PEs, reading from the shared memories, and writing to the shared memories [3]. It is assumed that the ASC machine has the same number of PEs as the PRAM and that an ASC PE has the same computational power as a PRAM PE. However, the PRAM reads and writes are highly parallel and complex in nature, since all PEs may simultaneously send or receive data at arbitrary memory locations. Thus all possible communication patterns need to be considered. Since there are j possible data paths provided by the presence of the ASC ISs, the time to communicate is dependent upon the number of ISs, giving an O(n/j) routing time. The routing scheme becomes more complicated when concurrent reads and writes are allowed, and when the number of shared memories is greater than the number of PEs.
Notation Used
The number of PEs is n, and the number of ASC ISs is j. There are m shared PRAM memories, and it is assumed that PRAM is synchronous. The term priCRCW will denote all PRAMs up to and including the power of priority CRCW, and the term comCRCW will denote all CRCW PRAMs at least as powerful as combining CRCW PRAM. Combining CRCW resolves write conflicts by combining the data written to a shared memory with some operator.
ASC Simulation of Concurrent Read
Without a PE network, the concurrent read operation of CRCW(n,m) is simulated with ASC(n,j) in O(n/j) time with high probability. Figure 3 shows the ASC algorithm.
The m PRAM shared memories are hashed into ASC's n PEs, divided more or less equally among the PEs. Each IS also manages about an equal number of PEs and thus the memories stored in those PEs. This PE to IS assignment is also based on an optimal hashing function. A concurrent read by the PRAM PEs is simulated by each IS collecting, on average, O(n/j) read requests from the PEs. This step is accomplished in O(n/j) with high probability if an optimal hashing function (uniform parallel hashing) is assumed. Then each IS processes O(n/j) read requests in O(n/j) time, reading the required data from the n PEs. This is possible because the j ISs may connect to j strictly disjoint subsets of PEs.
Figure 3: ASC simulation of a PRAM concurrent read (listing incomplete in the source)
* Gather O(n) read requests to the j ISs (O(n/j) avg. time)
1 [...]
    set these selected PEs to inactive
* Read O(n) data requests into j ISs (O(n/j) avg. time)
10 set all PEs to active
11 assign each PE to IS which manages memory to be read
12 while (any PEs are active)
13   process next entry in ISs read-request list
14   ISs read a hashed(memory) into read-request list
* Broadcast read data into requesting PEs (O(n/j) avg. time)
15 if (a PE is not reading) then set the PE to inactive
16 assign each PE to IS which manages the memory to be read
17 while (any entries in IS lists are unprocessed)
[...]
Examining CR Algorithm
Figures 4, 5, 6, and 7 show an example of the major simulation steps of a concurrent read. The following steps, which refer to Figure 3, show the operations taken to simulate one concurrent read. First, in line 1 all PEs and ISs perform operations in parallel. At line 2 all PEs are made active. Line 3 forces any PEs that are not reading to be temporarily inactive so that they no longer perform any operations. Line 4 forces each PE to listen to the IS which manages the PE holding the memory location it wishes to read. In line 5, iterations are performed until all PEs are inactive and thus have had their read requests saved in the ISs. At line 6 each IS selects an arbitrary PE out of all of its active PEs and saves its read request in a list located at the IS. Each read request will be saved in only one list in an IS, one list [...]
All operations except the loops are assumed to take a constant amount of time in ASC. Thus the limiting factor is the size of any read-request list. Because there are n PEs, on average the lists should be O(n/j) in size, and the entire read process takes O(n/j) with high probability. If all PEs are reading from only k shared memories where k ≤ n, then the size of the lists will be O(k/j) and the time to complete a concurrent read is O(k/j). Figures 6 and 7 show the data being moved out to the PEs.
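The three phases of the concurrent read (gather, read, broadcast) can be mimicked sequentially to see where the list-size bound comes from. The Python sketch below is a simplified model and not the algorithm of Figure 3 itself. It makes several assumptions that are not in the text: memory cell a is placed in PE (a mod n) and PE q is managed by IS (q mod j) as a stand-in for the optimal hashing function, requests for the same address share one list entry, and each of the three phases is charged one iteration per list entry.

```python
from typing import Dict, List, Optional, Tuple


def simulate_concurrent_read(
    shared: List[int], requests: List[Optional[int]], j: int
) -> Tuple[List[Optional[int]], int]:
    """Return (value delivered to each PE, loop iterations charged).

    shared[a]   : content of PRAM shared memory cell a
    requests[p] : address PE p wants to read, or None if PE p is not reading
    j           : number of instruction streams
    """
    n = len(requests)

    def managing_is(addr: int) -> int:
        # Assumed placement: cell -> PE (addr % n) -> IS ((addr % n) % j).
        return (addr % n) % j

    # Phase 1: each IS gathers the distinct addresses requested of it,
    # one list entry per iteration.
    lists: List[List[int]] = [[] for _ in range(j)]
    for addr in requests:
        if addr is not None and addr not in lists[managing_is(addr)]:
            lists[managing_is(addr)].append(addr)
    longest = max((len(entries) for entries in lists), default=0)

    # Phase 2: each IS reads the cells named in its list.
    fetched: Dict[int, int] = {a: shared[a] for entries in lists for a in entries}

    # Phase 3: each IS broadcasts its entries; requesting PEs keep the value.
    delivered = [None if a is None else fetched[a] for a in requests]
    return delivered, 3 * longest


shared_memory = [10, 11, 12, 13, 14, 15, 16, 17]
values, cost = simulate_concurrent_read(
    shared_memory, [3, 3, 3, 3, None, 0, 1, 2], j=2
)
one_cell_values, one_cell_cost = simulate_concurrent_read(shared_memory, [5] * 8, j=1)
```

In the second call all eight PEs read the same cell through a single IS, so every list has one entry and the charged cost does not grow with n, which matches the constant-time case discussed in the text.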
ASC Simulation of Concurrent Write
In O(n/j) time, ASC(n,j) can also simulate the concurrent write of priority CRCW(n,m) with high probability using a method very similar to the concurrent read simulation. Only an overview is given because of the correspondence between the two algorithms. The ASC algorithm is shown in Figure 8. Again, the shared memory of PRAM is stored in the PEs, hashed among the PEs. The memory addressing scheme is the same as in the concurrent read algorithm in that each IS manages a roughly equal number of PEs (and shared memory) based on a hashing function. During a concurrent write, data needs to be moved from potentially all n PEs to the PEs containing the target memory cells. If more than one PE is writing to the same cell, the priority rule states that the PE with the highest address is allowed to write. This is handled by selecting the maximum addressed PE with the ASC reduction operation across a set of active PEs in constant time.
Figure 8: ASC simulation of a PRAM priority concurrent write (listing incomplete in the source)
* Gather O(n) write requests into IS lists (O(n/j) time)
* Use reduce max when >1 PE writes to same location
1 for all PEs and ISs do in parallel
2 set all PEs to active
3 if (a PE is not writing) then set the PE to inactive
4 assign PEs to IS that manages memory request to write
5 while (any PEs are active)
6   all ISs select maximum active PE
[...]
    insert reduced write data in IS write-request list
10   set these selected PEs to inactive
* Write O(n) data out to memories hashed in PEs (O(n/j) time)
11 set all PEs to active
12 assign each PE to IS that manages memory to write
13 while (any entries in write-request list remain)
[...]
Examining CW Algorithm
The steps for simulating a concurrent write with ASC are shown in Figure 8. The algorithm is similar to, but shorter than, the reading algorithm, so it is described without great detail. Lines 1-10 form O(n/j) sized lists of PE write requests with high probability. When several PEs are trying to write to the same location, the ASC maximum reduction operation is used to store the write request of the largest addressed PE. Thus write conflicts are resolved in constant time.
Lines 11-15 write the data in the write-request lists currently located in the ISs to the correct PEs. First, line 11 sets PEs back to active status. Line 12 assigns each PE to the IS which manages it. The algorithm then loops in line 13, with each IS processing one entry from its own list until all are considered. Line 14 fetches the next entry in each IS's write-request list, starting with the first. Lastly, line 15 writes a word of data into the correct hashed memory location.
Much like the concurrent read algorithm, all statements except loops are assumed to take constant time in the ASC model. The simulation of PRAM priority concurrent write takes the average time of the loops, each of which is based on the size of the lists created. If the memories are optimally hashed into the n PEs, on average the list sizes created in each IS are O(n/j) and the algorithm also runs in O(n/j) time. If k ≤ n distinct memories are written into, then the list size is O(k/j), and the algorithm runs in O(k/j).
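A sequential Python sketch of the write simulation shows how the maximum reduction keeps one list entry per target cell. As with any such model it rests on assumptions that are not in the text: cell a is placed in PE (a mod n) managed by IS ((a mod n) mod j) in place of the optimal hashing function, the function name is hypothetical, and the gather and write-out phases are each charged one iteration per list entry.

```python
from typing import Dict, List, Optional, Tuple


def simulate_priority_write(
    shared: List[int], requests: List[Optional[Tuple[int, int]]], j: int
) -> int:
    """Apply one priority concurrent write in place; return iterations charged.

    requests[p] is (address, value) for PE p, or None if PE p is not writing.
    """
    n = len(requests)
    # One write-request list per IS: address -> (winning PE address, value).
    lists: List[Dict[int, Tuple[int, int]]] = [dict() for _ in range(j)]

    # Gather phase: the MAX reduction over the PEs writing to one address
    # leaves a single entry, that of the largest PE address.
    for pe, request in enumerate(requests):
        if request is None:
            continue
        addr, value = request
        entries = lists[(addr % n) % j]
        if addr not in entries or pe > entries[addr][0]:
            entries[addr] = (pe, value)
    longest = max((len(entries) for entries in lists), default=0)

    # Write-out phase: each IS stores its entries into the hashed cells.
    for entries in lists:
        for addr, (_, value) in entries.items():
            shared[addr] = value
    return 2 * longest


cells = [0, 0, 0, 0]
write_cost = simulate_priority_write(cells, [(2, 7), (2, 8), None, (1, 5)], j=1)
```

PEs 0 and 1 both write to cell 2, and the value from PE 1 survives because it has the larger address; only two distinct cells are written, so the list has two entries regardless of how many PEs collide.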
Overview of ASC simulation of PRAM
The time to complete a priority concurrent write with ASC(n,j) simulating priority CRCW(n,m) is the same as that for completing a concurrent read. By the priority rule, multiple PEs writing to the same memory location are handled in one constant time step, using the ASC maximum reduction operation, so the write operation finishes in O(n/j) time. If there are reads or writes involving only k memories, where 0 < k ≤ n, then the time can be written as O(k/j). In short, any n processor PRAM algorithm that requires only a constant number of shared memories is simulated with an ASC(n,1) machine in O(1) time per cycle. An example of this is shown in a paper by M. Atwah, where the convex hull problem for n randomly distributed points is solved by a PRAM algorithm in O(log n) time with only a constant number of shared memories [2]. There are no doubt many efficient priority CRCW algorithms that require only a constant number of shared memories. These algorithms can execute on a simple one-IS ASC machine in the same time as on a PRAM.
Furthermore, by obvious transitivity ASC(n,j) simulates the weaker CREW(n,m) and EREW(n,m) PRAM in the same time as it simulates priority CRCW(n,m) [18] [12]. If the number of PRAM shared memories is of the same order as the number of instruction streams, m = O(j), then ASC(n,j) simulates priority CRCW(n,j) in O(1). It may be interesting to note that no matter how much memory or how many ISs are present, as long as all PEs write to a subset of memory cells that is constant in size, a step of simulation on ASC(n,1) takes constant time. So it is possible for the simulation to run in constant extra time per machine cycle in the best situation.
Considerations when ASC has a PE Network
If the ASC machine simulating PRAM has an interconnection network between the PEs, traditional methods can be used to simulate PRAM with the network alone, not using the ISs to route data. The simulation of PRAM with parallel machines that have PEs with local memories and a network is well studied. The concurrent read and write problem is essentially a routing problem on the network. For instance, simulation methods that use a mesh network exist where the time to simulate each cycle is O(√n). This is in fact the same time it takes to sort n data items on a mesh with n processors. Simulation methods exist for other networks with a bounded number of connections per node (bounded degree networks). The fastest simulates PRAM in probabilistic O(log n) time, while a wide variety of constant and non-constant degree networks simulate PRAM in average time O(D), where D is the diameter. These networks include the cube-connected cycles, butterfly, shuffle-exchange, mesh, hypercube, star, etc. [14] [3]. The shared memories are generally hashed amongst the local memories of the n PEs, similarly to the ASC simulation method.
However, even if the PRAM being simulated has a constant number of shared memories (e.g., one shared memory), the time for the network simulation of PRAM is still bounded below by the diameter of the network, or the time it takes one PE to send a message to another PE. The ASC simulation of priCRCW with a constant number of shared memories requires constant time, which is better than the network methods cited above achieve.
When ASC has a network, the ASC simulation of PRAM is a hybrid algorithm consisting of a simulation algorithm using only the network and the ASC method as presented previously. This hybrid algorithm proceeds by alternately performing one network routing step and one ASC routing step, with a constant extra cost in time and space. Whenever one of the methods finishes a single cycle of simulation, the other algorithm is terminated, and the next cycle is then started. At worst, if both methods finish at the same time, the simulation requires only a constant factor more time, and is no worse than the time for the fastest method to complete. At best, when considering k ≤ n shared memories, this hybrid algorithm simulates a step of PRAM in O(min(k/n, route(n))), where route(n) is the time, as a function of the number of PEs n, to simulate one cycle of PRAM using the PE network, and O(k/n) is the time to simulate PRAM with ASC. The ASC method has the capability of improving on the fastest PRAM simulation time, even with the one-IS ASC machine. The best running time of this hybrid algorithm is constant for a constant number of shared memories and is no worse than O(route(n)) using a network simulation method.
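The alternation between the two methods can be sketched in Python with each method modeled as an iterator that performs one routing step per call and reports whether that step completed the cycle. This is only a scheduling illustration: count_steps is a hypothetical stand-in for a real routing method that needs a known number of steps (assumed to be at least one), and no actual routing is performed.

```python
from typing import Iterator, Tuple


def count_steps(total: int) -> Iterator[bool]:
    """Stand-in routing method: yields True on the step that finishes the cycle."""
    for step in range(1, total + 1):
        yield step == total


def hybrid_cycle(network: Iterator[bool], asc: Iterator[bool]) -> Tuple[str, int]:
    """Alternate one network step and one ASC step until either method finishes.

    Returns the name of the method that finished and the total steps executed.
    """
    executed = 0
    while True:
        for name, method in (("network", network), ("asc", asc)):
            executed += 1
            if next(method):
                # The other method is abandoned and the next cycle can start.
                return name, executed


asc_first = hybrid_cycle(count_steps(10), count_steps(2))
network_first = hybrid_cycle(count_steps(3), count_steps(50))
```

In both calls the total number of executed steps is at most twice the step count of the faster method, which is the constant-factor overhead described above.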
Reexamining the PRAM algorithm for convex hull on random data points, the fastest known network method to simulate this algorithm on a logarithmic diameter network (e.g., a hypercube, cube-connected cycles, or butterfly) takes O(log n) simulation steps, each of O(log n) time, for a total execution time of O(log² n). However, the ASC method alone, or the ASC-network hybrid method, executes this algorithm in optimal O(log n) time, an improvement over existing bounded degree network algorithms.
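The comparison is a product of the number of PRAM cycles and the cost of simulating one cycle. Written out, with the subscripts chosen here only as labels for the two approaches:

```latex
\begin{align*}
T_{\text{network}} &= \underbrace{O(\log n)}_{\text{PRAM cycles}} \times \underbrace{O(\log n)}_{\text{time per cycle}} = O(\log^2 n), \\
T_{\text{ASC}} &= \underbrace{O(\log n)}_{\text{PRAM cycles}} \times \underbrace{O(1)}_{\text{time per cycle}} = O(\log n).
\end{align*}
```

The constant per-cycle cost for ASC relies on the algorithm using only a constant number of shared memories, as stated earlier.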
An Obvious PRAM Simulation of ASC
This section presents a way to simulate ASC(n,j) with combining CRCW(n,m). It is included for completeness, even though the algorithm is simple, in order to provide some way to compare the two models. Combining CRCW (comCRCW) is used since it simplifies implementing the ASC reduction operators of AND, OR, MAX, and MIN by having the power to make such combinations.
The n PRAM PEs perform the same operations that the ASC PEs can perform, and the IS information for IS k is stored in PE (k mod n), where 0 ≤ k < j. PE (k mod n) simulates the operations of instruction stream k, and it is assumed that each PRAM PE also has the power of an IS.
However, since ASC ISs are asynchronous, the execution of all ISs is done iteratively in j steps. Since j steps are taken regardless, all control communication between ISs finishes in O(j) time by having each PE that simulates an IS write and read synchronization information in a j sized array stored within a single shared memory (assuming that a shared memory has j words). Since EREW and CREW PRAM simulate CRCW using the same number of PEs and shared memories with a known extra cost of O(n/m + log n), the extra time for these two models to simulate ASC is easily obtained from the O(j) combining CRCW simulation [12] [18].
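The iterative execution of the j ISs can be sketched in Python as follows. This is a minimal model under stated assumptions, not the construction's full detail: each IS contributes exactly one broadcast operation per ASC cycle, the operation is a function applied to a single data word per PE, and the names host_pe and simulate_asc_cycle are hypothetical. Synchronization traffic between ISs is not modeled.

```python
from typing import Callable, List


def host_pe(k: int, n: int) -> int:
    """PRAM PE that stores the state of instruction stream k."""
    return k % n


def simulate_asc_cycle(
    pe_data: List[int],
    listening_to: List[int],
    instructions: List[Callable[[int], int]],
) -> int:
    """Execute one ASC cycle on a synchronous PRAM by serializing the j ISs.

    listening_to[p] is the IS that PE p is assigned to; instructions[k] is the
    operation IS k broadcasts in this cycle. Returns the PRAM steps used (j).
    """
    steps = 0
    for k, operation in enumerate(instructions):
        # One PRAM step: every PE listening to IS k applies its operation.
        for p, stream in enumerate(listening_to):
            if stream == k:
                pe_data[p] = operation(pe_data[p])
        steps += 1
    return steps


data = [1, 2, 3, 4]
steps_used = simulate_asc_cycle(
    data, [0, 1, 0, 1], [lambda x: x + 10, lambda x: x * 2]
)
```

The inner loop over PEs represents work done in parallel in a single PRAM step, so the cost of the cycle is the number of ISs, in line with the O(j) bound above.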
Conclusions of ASC-PRAM simulations
It has been shown how the ASC model can simulate PRAM, and a means for PRAM to simulate ASC has been given. The results in this section allow the number of ISs and shared memories to be unbounded to allow wide comparisons of the models. Table 1 summarizes the current simulation results for the indicated models. Table 2 shows the comparison of PRAM and ASC when the amount of shared memory is proportional to the number of ISs, or m = O(j). When this is true, ASC(n,m) simulates priCRCW(n,m) in constant time with a constant amount of space, while comCRCW(n,m) simulates ASC(n,m) in O(m) time with O(m/n) extra space required per PE. The actual number of ASC ISs implementable is no doubt a slow growing function of n, yet even ASC with one IS has a great deal of computational power in practice [15].
[Tables 1 and 2: only fragments survive in the source: O(n + m log n); priCRCW(n,1), ASC(n,1), O(1); ASC(n,1), comCRCW(n,1)]
The main goal and future direction of this work is to provide a theoretical foundation for a parallel programming system (ASC) that can be implemented on virtually any type of machine, sequential or parallel, tightly bound or loosely coupled, and to show that ASC has the power to execute well known PRAM algorithms, some of them optimally. Allowing parallel slackness, even difficult PRAM algorithms such as sorting or FFT could be simulated with ASC optimally. It is hoped that this model will enable portable parallel programs to be written for existing and future machines in such a way that the masses of programmers accept the data parallel model as easily as they accept classic sequential programming, finally creating a bridge to practical parallel programming.

