a power of two), the best choice for f is 64, giving a memory requirement of approximately 272 Kbits per input port.
To close, let's estimate the overall chip count required to add the multicast capability to a point-to-point network with 256 inputs and outputs. We will assume that the chips with little internal memory are constrained only by pin count, that each chip has 128 pins available for inputs and outputs, and that the data paths are one bit wide. For the example configuration, then, 32 chips are required for the concentrator, 16 for the adder, 6 for the dacs and 56 for the copy network. Assuming that each of the 64 btcs consists of one memory chip and one control chip, we have 128 more chips for the btcs. This gives a total of 238 chips, or less than one per port. The fifos following the btcs would add perhaps another chip each. Hence, the cost of adding a multicast capability to Lee's architecture is about two chips per port.
The effect of using fanout-aligned addresses is that copy $j$ of any given cell can only appear at a copy network output with index $k$ where $k \bmod \bar{f}_i = j$. This means that only the btcs connected to those copy network outputs require the information about how to translate copy $j$. To obtain the largest possible reduction in the total amount of memory as a result of this change, the copy network outputs that share a common btc must be spaced $N/s$ positions apart from one another, and $N/s$ must be constrained to be a power of 2. That is, copy network output $k$ should be connected to btc $h$, where $h = k \bmod (N/s)$. This is shown in Figure 5.
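The wiring rule $h = k \bmod (N/s)$ can be explored with a small sketch. The function names and the sizes used here (64 copy network outputs, 16 btcs) are illustrative assumptions, not values from the design; the code simply enumerates which btcs can ever receive copy $j$ of a cell whose rounded fanout is $\bar{f}$.

```python
def btc_for_output(k, num_btcs):
    """Copy network output k is wired to btc k mod num_btcs (num_btcs plays the role of N/s)."""
    return k % num_btcs


def btcs_reachable_by_copy(j, fbar, num_outputs, num_btcs):
    """Set of btcs that can receive copy j of a cell with rounded fanout fbar.

    With fanout-aligned addresses, copy j only appears at outputs k
    satisfying k mod fbar == j, so only those outputs are enumerated.
    """
    return {
        btc_for_output(k, num_btcs)
        for k in range(num_outputs)
        if k % fbar == j
    }


# Illustrative sizes only (assumptions, not taken from the text).
NUM_OUTPUTS = 64
NUM_BTCS = 16

reachable = btcs_reachable_by_copy(j=1, fbar=4,
                                   num_outputs=NUM_OUTPUTS,
                                   num_btcs=NUM_BTCS)
print(sorted(reachable))  # [1, 5, 9, 13]
```

In this toy configuration only 4 of the 16 btcs need the translation information for copy 1, which is the kind of memory reduction the spacing rule is meant to achieve.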
The number of bits of memory per input port, using fanout-aligned addresses, is
The third improvement we can make trades off additional capacity in the network stages preceding the btcs for reduced memory. The idea here is to constrain the routing of connections through the copy network so that only a subset of the btcs can receive any given copy, hence eliminating the need to store the information about that copy in the remaining btcs. We accomplish this by modifying the dummy address encoders to use "fanout-aligned addresses." Let $f_i$ be the fanout of the cell appearing on the input to dac $i$ and let $F_i = \sum_{j=0}^{i-1} f_j$ be the fanout sum, which was computed by the adder network and placed in the cell header. As we have seen, the dac replaces these two fields with values $\mathit{lo}$ and $\mathit{hi}$, and the copy network then copies the cell to all outputs in the range from $\mathit{lo}$ to $\mathit{hi}$.
In the original design $\mathit{lo} = F_i$ and $\mathit{hi} = F_{i+1} - 1$. We modify the scheme as follows. First, let $\bar{f}_i$ be the smallest power of 2 that is at least as big as $f_i$. Next, let $\mathit{lo}$ be the smallest multiple of $\bar{f}_i$ that is at least as big as $3F_i$ and let $\mathit{hi} = \mathit{lo} + f_i - 1$. As before, the cell is accepted by the dac if $\mathit{hi}$ is no larger than $N$. Figure 5 shows an example of this algorithm.
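As a rough illustration of the modified address computation, the following sketch derives the fanout-aligned range for each cell from its fanout. The function names, the sample fanouts and the value of `n` are assumptions chosen for the example. Every fanout is assumed to be at least 1, and the fanout sum is assumed to run over all cells, as computed by the adder network.

```python
def next_power_of_two(f):
    """Smallest power of 2 that is at least as big as f (f >= 1)."""
    p = 1
    while p < f:
        p *= 2
    return p


def fanout_aligned_ranges(fanouts, n):
    """Return (lo, hi, accepted) for each cell, in dac order.

    fanouts[i] is f_i; the running total fanout_sum is F_i, the sum of
    the fanouts of all earlier cells.
    """
    ranges = []
    fanout_sum = 0  # F_0 = 0
    for f in fanouts:
        fbar = next_power_of_two(f)
        # Smallest multiple of fbar that is >= 3 * F_i (ceiling division).
        lo = -(-(3 * fanout_sum) // fbar) * fbar
        hi = lo + f - 1
        accepted = hi <= n  # acceptance test as stated in the text
        ranges.append((lo, hi, accepted))
        fanout_sum += f
    return ranges


# Hypothetical input: four cells with fanouts 3, 5, 2 and 1, and n = 64.
example = fanout_aligned_ranges([3, 5, 2, 1], n=64)
for lo, hi, accepted in example:
    print(lo, hi, accepted)
```

For these inputs the ranges are 0 to 2, 16 to 20, 24 to 25 and 30 to 30. Each range starts at a multiple of the cell's rounded fanout (4, 8, 2 and 1 respectively), and no two ranges overlap.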
Observe that $3F_i \le \mathit{lo} < 3F_i + \bar{f}_i < 3F_i + 2f_i$ and so $\mathit{hi} < 3F_i + 3f_i = 3F_{i+1}$.
This ensures that the ranges of copy network outputs selected by different cells are disjoint from one another. Notice also that if $i$ is the index of the first dac to reject a cell, then,
