Abstract -In this correspondence we consider a model for dynamic memories that are characterized by small cell fan-out and a small number of I/O ports. Many schemes have been proposed in the literature to interconnect dynamic memory cells. These usually exhibit a tradeoff between random and block access times. We propose a scheme that combines the interconnection scheme of a previous work with the idea of interleaving.
With this we show that both random and block access times can be optimized. We analyze access times for our scheme and compare them to those for other schemes in the literature. We define delay between two block accesses and compare the dynamic memory organization schemes on the basis of delay.
Index Terms -Access algorithm, access times, dynamic memories, interconnection networks.
I. PRELUDE
Introduction: Fast small memories in the form of registers and content-addressable memories are implemented in current semiconductor technology. Both semiconductor technology and magnetic core technology are used to implement primary memories. Much larger volumes of storage space are available on magnetic disks and drums, where data items are stored at fixed locations on a magnetic surface which rotates relative to a read/write mechanism. Data items stored at contiguous locations on the magnetic surface are often grouped together into sets called blocks. The average time to access a single datum on a rotating magnetic surface includes the time for one half rotation of the storage media, which is commonly several milliseconds [1] . A block of data can be accessed at rates comparable to primary memory access speeds once the first datum of the block has become available.
The considerable gap of almost four orders of magnitude that exists between the access times for a single datum in primary memories and in rotating mass memories may be bridged by certain types of dynamic storage technologies such as magnetic domain and charge-transfer devices. These devices require the continuous movement of data within the storage medium itself. In contrast to disks and drums, this movement need not be cyclic. Integrated devices, consisting of magnetic bubbles [2] , charge coupled devices, or MOS shift registers, are already in limited use [3] , [4] .
These storage technologies require the development of new memory models and access algorithms. One popular model of a dynamic memory due to Stone [3] control mechanism provides each subset with a control signal that determines the data path to be taken by each datum in the subset. Each datum moves from its source cell to its destination cell in one unit time. The result of the data transfers within each subset during this interval is a permutation of the order of data items within that subset. All data transfers between the memory and the outside world take place through a memory cell called the I/O port whose contents may be read and written from outside the memory.
Let $t_1$ be the time at which the control mechanism starts to service a read request and $t_2 \ge t_1$ be the time at which the datum is available at the I/O port. Then $t_2 - t_1$ is the random read access time for the datum.
For a write request, let $t_1$ be the start of service and $t_2$ be the time when the datum is written into the I/O port of the memory. Then $t_2 - t_1$ is the random write access time. When the random read and write access times are the same, they are referred to as the random access time.
A block is a set of b logically contiguous data items and a block access is the access of this set in the order of their logical addresses.
If $t_1$ is the time at which the control starts servicing a read (write) block access request and $t_2$ is the time at which the last datum of the block is read from (written into) the I/O port, $t_2 - t_1$ is defined as the read (write) block access time.
When memory traffic is heavy, arriving requests are likely to find the memory servicing prior requests. We seek a measure that quantifies memory throughput. Let us consider two consecutive access requests, which have yet to receive any service from the memory. We assume access requests are serviced on a first-come first-served basis. Let $t_1$ be the time when data transfers through the I/O port for the first memory request terminate and $t_2$ be the time at which data transfers through the I/O port for the second memory request begin. Then $t_2 - t_1$ is defined as the delay between the two requests. Delay is inversely related to memory throughput.
The random and block access times and the delay are performance parameters for the memory. The problem is to come up with an organization of cells and interconnections that optimizes the performance measures defined above.
Previous work dealing with similar models can be found in [3] , [5] - [7] . The work of Wong and Coppersmith [8] takes the constraints of a specific memory technology into account.
II. THE DECK MEMORY
By using a dynamic memory organization due to Stone [3] as a basic unit and interleaving data addresses among the proper number of units of proper size, we optimize the organization for both random and sequential access while actually decreasing the delays relative to other organizations.
Stone [3] showed that the lower bound for random access time in a dynamic memory is logarithmic in the number of locations in the memory (all logarithms are to the base 2). The memory organization he proposed used two permutations, the shuffle and the exchange shuffle, and a single binary-valued control signal to select between them. The fan-out of each cell is 2. For the sake of simplifying the control mechanism, memory sizes are restricted to $N = 2^m$ locations, where m is a positive integer. Each [5]-[7]. We show that it is possible to reduce block access times without affecting the coefficient by increasing
The deck memory is partitioned into modules among which logical addresses are interleaved. Each module is organized as a dynamic memory as proposed by Stone [3]. The I/O port of each individual module will be referred to as a distinguished location and serves as the access port for the module. The I/O port of the deck memory is an additional single cell. An I/O propagation network is used to route data items between the distinguished locations of modules and the memory I/O port. The network may be integrated with the memory modules using the same technology, or constructed separately, possibly using a different technology. We will show that the network is extremely simple for memories of reasonable size.
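The two permutations used inside each module can be illustrated with a short sketch. It assumes the common definition of the perfect shuffle as a one-bit left rotation of the m-bit cell address, takes the exchange shuffle to be the shuffle followed by complementing the low-order bit, and places the I/O port at cell 0. These conventions and all function names are assumptions made for illustration; the exact definitions in [3] may differ in detail.

```python
def shuffle(i: int, m: int) -> int:
    """Perfect shuffle: rotate the m-bit cell address i left by one bit."""
    mask = (1 << m) - 1
    return ((i << 1) | (i >> (m - 1))) & mask


def exchange_shuffle(i: int, m: int) -> int:
    """Shuffle, then complement the low-order bit (assumed composition order)."""
    return shuffle(i, m) ^ 1


def route_to_port(i: int, m: int) -> list[int]:
    """Return m control bits (0 = shuffle, 1 = exchange shuffle) that move
    the datum in cell i to cell 0, the assumed I/O port."""
    controls = []
    for _ in range(m):
        msb = (i >> (m - 1)) & 1  # this bit becomes the low-order bit after the shuffle
        i = exchange_shuffle(i, m) if msb else shuffle(i, m)
        controls.append(msb)
    assert i == 0  # every address bit has been rotated into place and cleared
    return controls
```

Under these assumptions, any of the $N = 2^m$ cells reaches the port in exactly m steps, which matches the logarithmic random access time quoted above.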
If the I/O propagation network is constrained by fan-out restrictions it can be realized as a binary tree. Alternately, a hierarchy of shuffle/shuffle exchange networks can be used as described in [9] with similar results. The choice of implementation is likely to be determined by the memory technology. It is beyond the scope of this paper to consider implementation details for specific technologies. Some recent research that may be of relevance in this respect is the work on VLSI layout complexities for various interconnections [10] , [11] .
The memory control mechanism consists of a memory module controller for each module that obtains directions from a global controller. Transfer time for control information from the global to the local controllers contributes to the access time. The modules are numbered from 0 through $r - 1$, and datum j resides in module $j \bmod r$ and has local address $\lfloor j/r \rfloor$ within that module.
When service of a request to access datum j begins, the global controller broadcasts the logical address j to all the module controllers. Module controller $j \bmod r$ recognizes that it contains this datum at local address $\lfloor j/r \rfloor$ and routes it to the module's distinguished location. Under severe fan-out restrictions the control broadcast can be done by a binary tree network. The major control paths in the deck memory are shown in Fig. 1 for r = 8.
Logical addresses are integers in the range $[0, N - 1]$. The r modules contain N/r addresses each. To simplify the control mechanism, N, N/r, and hence r are restricted to powers of 2. In the next section we show how to choose r.
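The interleaved address mapping described above can be sketched in a few lines. The function names `locate` and `logical` are hypothetical and chosen only for this illustration.

```python
def locate(j: int, r: int) -> tuple[int, int]:
    """Map logical address j to (module number, local address) for r modules."""
    return j % r, j // r


def logical(module: int, local: int, r: int) -> int:
    """Inverse mapping: recover the logical address from its module and local address."""
    return local * r + module


def module_sizes(N: int, r: int) -> list[int]:
    """Count how many of the N logical addresses fall in each of the r modules."""
    sizes = [0] * r
    for j in range(N):
        sizes[j % r] += 1
    return sizes
```

For example, with r = 8 modules, datum 13 resides in module 5 at local address 1, and consecutive logical addresses always fall in consecutive modules.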
Two observations can be made with reference to possible semiconductor technology realizations of dynamic memories. In the case of large memories, the partitioning of memory into several modules reduces the problem of crossing connections. Recent results by Kleitman et al. [10] indicate that the area for laying out an exchange shuffle graph for N locations on a planar surface is $O(N^2/\log^2 N)$.
The area required to lay out r exchange shuffle graphs for N/r locations each is smaller than that required to lay out one exchange shuffle graph for N locations. Also, fabrication of circuitry on the device is simplified due to the replication of the modules constituting the deck.
A. Access Times for the Deck Memory
Let $t_c$ be the time required for control information to travel from the global controller to the module controller and $t_a$ the time for data to pass between a distinguished location and the memory I/O port. Random access at a memory module involves the application of $\log(N/r)$ permutations within a module [3]. The random read access time is $t_{ra} = t_c + \log(N/r) + t_a$. For a write operation, the datum to be written is routed from the I/O port to the appropriate distinguished location to arrive there exactly at the completion of the access sequence within the module. A datum must enter the I/O propagation network $t_a$ time units before the module completes the necessary sequence of permutations. The write access time is the time between the start of service for the write request and the insertion of the datum into the I/O port. Hence, the random write access time is $t_{wa} = t_c + \log(N/r) - t_a$, assuming that $t_a \le t_c + \log(N/r)$.
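As a quick numerical check of the two expressions above, the following sketch evaluates the random read and write access times for given N, r, $t_c$, and $t_a$. The function names are hypothetical, and N and r are assumed to be powers of 2 as the text requires.

```python
def log2_pow2(x: int) -> int:
    """Exact base-2 logarithm of a positive power of 2."""
    if x < 1 or x & (x - 1):
        raise ValueError("expected a power of 2")
    return x.bit_length() - 1


def random_read_time(N: int, r: int, t_c: int, t_a: int) -> int:
    """t_ra = t_c + log(N/r) + t_a."""
    return t_c + log2_pow2(N // r) + t_a


def random_write_time(N: int, r: int, t_c: int, t_a: int) -> int:
    """t_wa = t_c + log(N/r) - t_a, valid only when t_a <= t_c + log(N/r)."""
    module_time = log2_pow2(N // r)
    if t_a > t_c + module_time:
        raise ValueError("t_a exceeds t_c + log(N/r)")
    return t_c + module_time - t_a
```

With N = 2^20, r = 16, and $t_c = t_a = 4$, the read takes 24 time units and the write 16, since the write overlaps I/O propagation with the module access.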
Access to r logically contiguous data items can proceed concurrently in the r memory modules. Each module can deliver one data item to the I/O propagation network in $\log(N/r)$ time, and hence the maximum data delivery rate of the memory is r data items per $\log(N/r)$ time.
For a block access, the starting address of the block (logical address j) and the length of the block are given to the global controller. Each time unit, the global controller broadcasts the logical address to be accessed next. One time unit after module $j \bmod r$ accesses datum j, module $(j + 1) \bmod r$ accesses datum $j + 1$. This behavior is repeated for all subsequent data items in the block. r time units later, module $j \bmod r$ has to access datum $j + r$. We require that the module have completed its service for datum j by this time. This requirement restricts the choice of r, i.e., $\log(N/r) \le r$. Then the I/O propagation network can continuously receive one datum every time unit after receiving the first datum of the block.
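The pipelined block access described above can be simulated with a minimal sketch. It ignores $t_c$ and $t_a$ (both taken as zero), assumes one address broadcast per time unit, and uses the hypothetical name `block_schedule`; it is an illustration of the timing argument, not the control algorithm of [9].

```python
def block_schedule(j: int, b: int, N: int, r: int) -> list[int]:
    """Return the times at which data j, j+1, ..., j+b-1 reach their module's
    distinguished location, with one new address broadcast per time unit.

    Raises ValueError if a module is asked to start a new access before it
    has finished the previous one, i.e. when log(N/r) > r.
    """
    module_time = (N // r).bit_length() - 1  # log2(N/r) for N/r a power of 2
    free_at = [0] * r                        # time at which each module is idle again
    ready = []
    for k in range(b):
        module = (j + k) % r
        start = k                            # address j+k is broadcast at time k
        if start < free_at[module]:
            raise ValueError(f"module {module} still busy at time {start}")
        free_at[module] = start + module_time
        ready.append(start + module_time)
    return ready
```

For N = 2^20 and r = 16 the module time is 16, so a block of 40 items streams out at one datum per time unit and the last item is ready at time 16 + 40 - 1 = 55. With r = 8 the same request fails, because the module time of 17 exceeds r.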
The number of data paths and the amount of logic required to realize the I/O propagation network can be expected to rise at least linearly with r, while increasing r decreases the time to bring a random datum in a module into its distinguished location only logarithmically. In addition, we will show later that the random access time for the entire memory may actually increase as r is increased due to the control transfer and I/O propagation times. Hence, r will be chosen to be as small as possible.
Specifically, $r = f(N)$, where $f(n) = \inf\{\, p : p \ge \log n - \log p \text{ and } p \text{ is a power of 2} \,\}$ is defined for $n > 1$. The random access time for the deck memory is, therefore, $t_c + \log N - \log f(N) \pm t_a$ (+ for read and - for write), where $\log N - \log f(N)$ is the module access time and is shown in [9] to be greater than $\log N - \log\log N - 1$ for $N > 1$.
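The definition of f can be evaluated directly by trying successive powers of 2. The sketch below does this and also computes the module access time; the helper names are hypothetical, and the bound it is compared against is the one quoted from [9].

```python
from math import log2


def f(n: int) -> int:
    """Smallest power of 2, p, satisfying p >= log2(n) - log2(p), for n > 1."""
    if n <= 1:
        raise ValueError("f is defined only for n > 1")
    p = 1
    while p < log2(n) - log2(p):
        p *= 2
    return p


def module_access_time(N: int) -> float:
    """log N - log f(N): the number of permutation steps within one module."""
    return log2(N) - log2(f(N))
```

For N = 2^20 this gives f(N) = 16 and a module access time of 16, which lies above the lower bound of about 14.68 given by $\log N - \log\log N - 1$.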
B. A Simple Deck Memory
The I/O propagation network and the control propagation network perform the tasks of multiplexing/demultiplexing and broadcasting. If the number of memory modules is small, these networks can be implemented in much the same fashion as MSI multiplexers and demultiplexers. In this case $t_c$ and $t_a$ are assumed to be negligible. Table I illustrates how f(N) grows with N, when f(N) and N are related by $f(N) + \log f(N) = \log N$. Observe from this table that a dynamic memory of 1 million logical addresses need be partitioned into only 16 modules. Therefore, even for large memories the number of memory modules required is small, validating the assumption that the control and I/O propagation times are negligible.
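The relation $f(N) + \log f(N) = \log N$ can be tabulated with a few lines of code. The rows produced below are computed from that relation alone and are not copied from Table I; the function name and the choice of upper limit are assumptions for illustration.

```python
def table_rows(max_modules: int = 32) -> list[tuple[int, int, int]]:
    """Return (f, log N, N) for each power-of-2 module count f up to max_modules,
    where N satisfies f + log2(f) = log2(N)."""
    rows = []
    p = 1
    while p <= max_modules:
        log_n = p + (p.bit_length() - 1)  # log N = f(N) + log f(N)
        rows.append((p, log_n, 2 ** log_n))
        p *= 2
    return rows
```

The row for 16 modules gives log N = 20, i.e. N = 1 048 576, in agreement with the statement that a memory of 1 million addresses needs only 16 modules.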
The random access time is log N -log f(N) and the block access time is log N -log f(N) + b -1. The minimum time between consecutive random accesses is one time unit except in the case that both accesses are made to the same module. The average delay given is (log N -log f(N) + 1)/2, computed under the assumptions that (1) all accesses are to blocks of size >r and (2) starting block addresses are independently and uniformly distributed over the r memory modules [9] . The first assumption gives us a worst case bound under the second assumption. If some blocks are of size <r, delays will be smaller on the average. If all blocks are multiples of r in size and every starting block address is in the same module, then the delay is exactly one time unit.
III. THE TREE-CONNECTED DECK MEMORY
Design: One criticism of the simple deck memory is the assumption that the time to transfer information through the control and I/O propagation networks is negligible. In this section we consider the design of a deck memory for which fan-out restrictions invalidate this assumption.
If the fan-out restrictions for the I/O propagation network and control circuitry are similar to those for the memory modules, we can construct these networks as binary trees. The control propagation network consists of 2r -1 cells as shown in Fig. 2 . The links represent unidirectional data paths going in the direction of the leaves from the root. The global controller is connected to the root of the tree, and the module controllers to the leaves.
A similar tree is used to interconnect the distinguished locations and the I/O port, but all links are bidirectional. The resulting memory organization (shown in Fig. 2 ) will be referred to as the tree-connected deck memory.
There should be no conflicts (two or more items routed to the same cell at the same time) within the I/O propagation network during a block access. Conflicts involving successive block accesses can be prevented either by spacing the service of consecutive requests appropriately or by replicating the network and using one for reads and the other for writes. For details see [9] .
Each network requires $2r - 1$ additional cells, further motivating the need to minimize r. For r = f(N), there are $O(\log N)$ additional cells. Each cell in the control propagation network has a fan-out of 2. With replicated I/O propagation networks for reading and writing, the fan-out is at most 2 for each cell.
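The cell count of $2r - 1$ follows from summing the levels of a complete binary tree with r leaves, as the following sketch shows. The function names are hypothetical.

```python
def tree_cells(r: int) -> int:
    """Count the cells of a complete binary tree with r leaves (r a power of 2)
    by summing the number of cells on each level, from leaves to root."""
    cells, level = 0, r
    while level >= 1:
        cells += level
        level //= 2
    return cells


def tree_depth(r: int) -> int:
    """Number of links between the root and any leaf, i.e. log2(r)."""
    return r.bit_length() - 1
```

For r = 8 the tree has 8 + 4 + 2 + 1 = 15 cells and depth 3. Since f(N) is below $2 \log N$, a tree over f(N) modules stays within $4 \log N$ cells.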
Access Times: The random and block access times are summarized in Table II. We assume that $t_a = t_c = \log r$. Note that the random read and write access times differ for the tree-connected deck. This is because for a read, the propagation of a datum through the I/O network must follow its access within a module, but for a write, module access and propagation through the I/O network proceed concurrently. A random read access therefore takes longer than a random write access, and is bounded from above by $\log N + \log\log N + 1$ [9], which is $O(\log N)$.
(Table II legend: "--" denotes not analyzed; "emp" denotes empirical.)
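The read/write asymmetry can be checked numerically by substituting $t_a = t_c = \log r$ and r = f(N) into the access time expressions of Section II-A. The values below are derived from those expressions under that assumption and are not taken from Table II; the function names are hypothetical.

```python
from math import log2


def modules(N: int) -> int:
    """f(N): smallest power of 2, p, with p >= log2(N) - log2(p)."""
    p = 1
    while p < log2(N) - log2(p):
        p *= 2
    return p


def tree_deck_random_times(N: int) -> tuple[float, float, float]:
    """Return (read time, write time, upper bound on read time) for a
    tree-connected deck with r = f(N) modules and t_a = t_c = log2(r)."""
    r = modules(N)
    t = log2(r)                      # assumed propagation time of each tree
    module_time = log2(N) - log2(r)  # permutation steps inside a module
    read = t + module_time + t       # control, then access, then data out
    write = t + module_time - t      # data propagation overlaps the access
    bound = log2(N) + log2(log2(N)) + 1
    return read, write, bound
```

For N = 2^20 the read takes 24 time units and the write 16, and the read time stays below the bound of about 25.32.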
A. Heavily Loaded Tree-Connected Deck
The memory is often a performance bottleneck in computer systems. In many systems memory requests are queued for service and the queues are seldom empty. If we assume that there is always at least one access request waiting without having received any service, then an important performance parameter is the average delay.
The average and worst case delays are summarized in Table II. For this analysis we assume that the tree-connected deck memory has replicated trees for input and output propagation. The delay between consecutive accesses is a function of the nature of the accesses (read, write), and the difference mod r between the final address of the first access and the first address of the second. In computing the average delays, we assume that starting addresses are uniformly distributed over the r modules. Table II gives the worst case and the average case delays for each possible combination of read and write requests. The details of the analysis are found in [9]. An upper bound for the average delay that is independent of the relative frequencies of reads and writes is given in Table II. If reads occur three times as often as writes, a smaller upper bound for the average delay of approximately $\tfrac{1}{2}\log(N/r) + 43/64 + 1$ is obtained [9].
The price we pay for overlapping consecutive accesses is in the increased complexity of the control mechanism. Table II also summarizes the performances of other proposed dynamic memory organizations. These are referred to as Stone's random access memory [3] , the Tour memory [5] , [12] , the Aho Ullman memory [6] , Stone's cyclic memory [5] , Wong and Tang's two permutation memory [7] , Wong and Tang's three permutation memory [7] , the Lenfant memory [13] , and the unpartitioned and partitioned Kluge memories [1] . A summary of these organizations is given in [9] . Since we ignored I/O and control propagation times for the simple deck, we shall not include it in the comparison. Instead we will consider the deck with log r propagation times.
IV. PERFORMANCE COMPARISON
In memories that are not partitioned into subsets (all those listed in Table II except the deck memory), random access time is equal to the delay because permutations for successive accesses cannot be overlapped.
Overall, the deck organization results show improved performance when compared to other dynamic memory organizations.
At the same time, the complexity of the control mechanism has increased considerably. The interconnection structure of the memory may also have increased in complexity, making implementation more difficult. The average cell fan-out is increased to slightly above 2, and $O(\log N)$ extra cells are used, adding to the cost of the memory.
Comparison of the memory organizations in Table II indicates the advantages offered by the deck memory over other memory organizations. The average case random access and block access times are smaller for Wong and Tang's three permutation memory. This is due to the larger cell fan-out of 3 in this organization. However, the deck outperforms Wong and Tang's organization with respect to worst case random access and block access times and worst and average case delays. On the basis of the delay expressions, the deck is the best organization for operation under heavily loaded conditions. In a paging environment where all accesses are to pages whose sizes are multiples of r and whose starting addresses lie in the same module, the delay between consecutive accesses is one time unit. A study of the performance of the deck in nonpaged environments where request reordering is permitted would be interesting.
Processing - Moduli Selection and Logical Minimization
C. C. GUEST
(ROM). These systems require a memory size of $S = 2^p \times q$ bits, where p is the number of input bits and q is the number of output bits. In these processors the inputs determine the address of the answer.
2) Content-addressable memory: The truth table for each output bit may be stored in a content-addressable memory (CAM). A unity result or a null result truth table may be constructed from those combinations of inputs which cause a particular output bit to be a logical one or a logical zero, respectively. The unity result (null result) truth table represents the canonical sum-of-products (product-of-sums) expression for the logical function corresponding to each output bit. In a content-addressable memory, inputs are compared to the stored tables and detected matches determine the state of each output bit. The stored input words ("reference patterns" in pattern recognition terminology) are the function minterms in the sum-of-products expression or the function maxterms in the product-of-sums expression. The numbers compiled in this correspondence represent the number of function minterms for each output bit for addition and multiplication. In the optical holographic implementation of a content-addressable memory, the number of function minterms represents the number of holograms that need to be stored in the system [1].
Manuscript received May 13, 1983.
3) Hardware logic gates: A truth table may also be implemented through the direct use of Boolean logic gates. Each binary output variable when represented as a sum of products (or product of sums) of binary input variables may be implemented with three levels of logic to form a programmable array logic (PAL) device [2]. The numbers compiled in this paper represent the minimum number of AND gates that must be formed to realize each output bit.
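The size measures named in the three implementation options above can be made concrete with a small sketch. It evaluates the ROM size $S = 2^p \times q$ and counts the minterms of the unreduced unity result truth table for each output bit of a single-modulus operation. It assumes that residues are binary coded, that unused input codes are simply omitted, and that no logical minimization is applied; the function names are hypothetical, and the counts are not taken from the tables of this correspondence.

```python
from typing import Callable


def rom_bits(p: int, q: int) -> int:
    """ROM size S = 2^p * q bits for p input bits and q output bits."""
    return (2 ** p) * q


def unity_minterms(modulus: int, op: Callable[[int, int], int]) -> list[int]:
    """For each output bit (low-order first), count the input pairs (a, b),
    0 <= a, b < modulus, for which that bit of op(a, b) mod modulus is 1."""
    bits = max(1, (modulus - 1).bit_length())
    counts = [0] * bits
    for a in range(modulus):
        for b in range(modulus):
            result = op(a, b) % modulus
            for k in range(bits):
                if (result >> k) & 1:
                    counts[k] += 1
    return counts
```

For example, a direct ROM for multiplying two 8-bit words (p = 16, q = 16) needs 1 048 576 bits, whereas modulo-5 addition needs only 10, 10, and 5 unreduced minterms for its three output bits.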
Independently of how a truth table might be implemented and of which technology might be utilized, it is essential to know how many reference patterns must be recognized in order to judge the size and the complexity of the resulting system. It has recently been shown [1] that the number of reference patterns significantly decreases if residue arithmetic is used. Also, a laser holographic system functioning as a content-addressable memory to implement a truth-table look-up processor that operates on binary coded residue numbers has been proposed [1] and its operation analyzed [3]. Such a parallel processor might ultimately be used, for example, to compute real-time images from synthetic aperture radar [4]. This type of operation requires several trillion multiplications per second [5], [6]. The results presented here apply both to content-addressable memory systems and to hardware logic gate implementations such as in very large-scale integration (VLSI) design. In VLSI design, the use of PAL's with their highly repetitive geometric structure greatly simplifies the design problem, which otherwise may be too time consuming and expensive to attempt [7].
The purpose of this correspondence is to present the minimum sizes of the truth tables needed to perform full-precision addition and multiplication in the residue number system. The resulting sizes of the truth tables depend on 1) selection of the moduli set, and 2) logical minimization of the corresponding truth tables. The sizes of the reduced truth tables are presented for full-precision addition and multiplication represented as multiple-input single-output operations for moduli . The moduli sets that are optimum in the sense of requiring the minimum number of reference patterns are determined for the addition and multiplication of pairs of 4, 8, 12, and 16 bit words for both unreduced and reduced truth tables. The number of required reference patterns for each case is determined and compared to the corresponding number for direct binary arithmetic.
A related analysis on gating complexity in residue number system computing has been performed by Guffin [8]. However, his results are limited to the special cases of state-change addition and state-change multiplication and further limited to only prime moduli. Papachristou [9] has presented a technique for direct truth table implementation of residue-based functions by an encoding scheme that employs PAL's. In the present research 1) results are presented for each logical expression minimized separately (representing a multiple-input single-output system) even though results for treating sets of logical expressions together (multiple-input multiple-output system) are available during the intermediate steps; and 2) results are presented only for full-precision operations. The very practical and important results for fixed-point and floating-point computations are contained within these more general full-precision results.
II. RESIDUE NUMBER SYSTEM TRUTH TABLES
The use of residue arithmetic in computing has been extensively studied over many years [10]-[17]. The [1]. In residue arithmetic, the calculations associated with each modulus are independent of the calculations associated with the other moduli; e.g., there are no
0018-9340/84/1000-0927$01.00 © 1984 IEEE
