This paper describes a novel VLSI CMOS implementation of a self-compacting buffer (SCB) for the dynamically allocated multi-queue (DAMQ) switch architecture. The SCB is a scheme that dynamically allocates data regions within the input buffer for each output channel. The proposed implementation provides a high-performance solution for the buffered communication switches that are required in interconnection networks. This performance comes not only from the DAMQ approach but also from the pipelined implementation and novel circuitry. The major components of the SCB are described in detail in this paper. Owing to its pipelined architecture, the system can perform a read, a write, or a simultaneous read/write operation per cycle.
Introduction
An n by m buffered switch is a critical component in many interconnection networks. The performance of these networks is closely related to the architecture of the buffered switch. This paper describes a VLSI design and implementation of a switch architecture that uses a self-compacting buffer [1].
A router is composed of input controllers, an n by n switch, and output controllers. The input controller receives incoming packets and determines the appropriate output channel number according to a routing algorithm. The n by n switch delivers the packets from the n input controllers to the n output controllers, and the output controller sends the packet to the neighboring node. Figure 1 shows an example of a block diagram for a router and an input controller. The input controller has three major functions. First, it is responsible for receiving the packet and distributing the header part of the packet to the routing algorithm handler and to the packet flow controller. Second, it determines the output channel number based on the header information which is received from the input controller [3, 4]. This task is carried out by the routing algorithm handler. Third, it allocates and deallocates the buffer space for incoming and outgoing packets. This function is performed by the packet flow controller.

Buffered switches can be classified into four types. The classification is based on how the input queues are manipulated and how data is stored. The four types are: first-in first-out (FIFO), statically allocated fully connected (SAFC), statically allocated multi-queue (SAMQ), and dynamically allocated multi-queue (DAMQ). FIFO, SAFC, and SAMQ buffered switches do not use the buffer space efficiently [2]. A better way of using the buffer is to allocate buffer space dynamically, as is done in the DAMQ switch. The space allocated for each buffer changes dynamically to fulfill the buffer space demands at a particular time. It has been reported that the DAMQ switch achieves the best performance among the four switch types [1, 2]. The self-compacting buffer implements the DAMQ using a small amount of hardware and taking advantage of VLSI technology.

This paper is organized as follows. Section 2 introduces the self-compacting buffer architecture and its properties.
In Section 3, the cell designs and implementation of the buffer, buffer controller, and channel pointers are described in detail. The timing of the self-compacting buffer is explained in Section 4. Some concluding remarks are provided in Section 5.
Self-Compacting Buffer
The self-compacting buffer (SCB) architecture has been organized to implement the dynamically allocated multi-queue scheme for buffer management. The SCB consists of a buffer, buffer controller, channel pointers, and channel update. The overall organization of the SCB is shown in Figure 2. The function of the SCB is to store incoming packets from the input port and transfer outgoing packets to the switching network. An output channel number, received from the routing algorithm handler, points to an address in the buffer where its data is held. The channel pointer determines the buffer address for that channel number and updates the channel's address depending on the action (read or write). The buffer address is passed to the buffer controller along with the empty (the selected channel contains no valid data) and full (all locations contain valid data) signals. With this information, the buffer controller sets the corresponding lines to move data from and/or to the buffer and to shift data within the buffer.
The self-compacting buffer is divided dynamically into regions, with every region containing the data associated with a single output channel. This scheme supports the dynamically allocated multi-queue (DAMQ) buffer management method introduced by Tamir and Frazier [2]. The self-compacting buffer scheme has the following properties:

Property 4. For every output channel i, there is an integer denoting the number of entries present in the region reserved for that output channel.
The set of properties of the buffer suggests that when an insertion/deletion in the buffer occurs via a write/read operation, there should be a mechanism to access arbitrarily the region associated with a channel. In particular, if the insertion of the packet requires space somewhere in the middle of the buffer, the required space must be created by moving all the data which reside below the insertion address. Furthermore, a read from the top of the region for an output channel may create empty spaces in the middle of the buffer. The data below the read address is shifted up to fill the empty spaces. The buffer space maintained under the self-compacting buffer scheme is shown in Figure 3. Below, we discuss in detail the proposed high-performance self-compacting buffer organization and VLSI implementation.

In this section we describe the VLSI implementation of the buffer architecture described in Section 2. The requirements and circuitry of the three major components (buffer, buffer controller, and channel pointers) are described in detail here.
Buffer Organization and Cell Design
The buffer consists of a finite number of storage locations. The organization of the buffer is shown in Figure 4, where p storage locations are considered. For a location k, the following actions can occur:
Shift up: row k moves its contents up to row k - 1,
Shift down: row k moves its contents down to row k + 1,
Hold (no action): row k holds its data,
Write: row k moves the write bus contents onto the cell, and
Read: row k moves its contents to the read bus.

These actions have to be performed by the buffer once the proper paths are set up. The buffer cell has to implement the shift up, shift down, hold, write, and read actions as described above. A buffer cell in this buffer organization shares the up, down, read, and write signals with cells in the same row, while the read and write bus lines are shared by cells in the same column. Figure 5 shows a CMOS circuit that implements the proposed buffer. It should be pointed out that at this time transistors T_fb and T_pass are both off to isolate the incoming data from the outgoing data. The proposed buffer cell allows a read and a write from the same cell to take place at the same time; as the previous data leaves the cell, the new data can be stored. This capability is also required to implement shift up and shift down as explained below.
Shift up/down data: When shifting down or up, the cell must be able to separate the incoming data from the outgoing data. Transistor T_d or T_up is on when shifting down or shifting up occurs, respectively.
While a shifting operation takes place, transistors T_fb and T_pass are set off to isolate data in and data out. When shifting data down from storage cell k to k + 1, the path is set as follows. Data comes from the second inverter in cell k, passing through transistor T_d,k+1 and into the first inverter of cell k + 1. Similarly, when shifting data up from storage cell k to k - 1, data comes from the second inverter in cell k, passing through transistor T_up,k and into the first inverter of cell k - 1.
Buffer Controller
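The row actions listed in this section can be illustrated with a small behavioral model. The following Python sketch is not the CMOS cell itself. It assumes that the old contents of every row stay stable while new contents are latched, which mirrors the isolation of incoming from outgoing data described above. The names `step_buffer`, `actions`, `write_row`, and `write_data` are hypothetical and chosen only for this illustration.

```python
def step_buffer(rows, actions, write_row=None, write_data=None):
    """Apply one cycle of row actions to a column of p rows (behavioral sketch).

    rows[k]    : contents of row k (None means no valid data).
    actions[k] : what happens to the OLD contents of row k:
                 "hold", "up" (to row k-1), "down" (to row k+1),
                 or "read" (contents leave on the read bus).
    Returns (new_rows, read_data).
    """
    p = len(rows)
    old = list(rows)        # snapshot: outgoing data is isolated from incoming data
    new = [None] * p
    read_data = None
    offset = {"hold": 0, "up": -1, "down": 1}
    for k, act in enumerate(actions):
        if act == "read":
            read_data = old[k]      # row k drives the read bus
            continue
        if old[k] is None:
            continue                # nothing valid to move
        dst = k + offset[act]
        if not 0 <= dst < p or new[dst] is not None:
            raise ValueError(f"illegal move of row {k}")
        new[dst] = old[k]
    if write_row is not None:
        if new[write_row] is not None:
            raise ValueError("write collides with shifted data")
        new[write_row] = write_data  # write bus contents enter the cell
    return new, read_data


# Insert "x" at row 1: rows 1 and below shift down to open a space.
rows, _ = step_buffer(["a", "b", "c", None],
                      ["hold", "down", "down", "down"],
                      write_row=1, write_data="x")
print(rows)
```

In this model a row can both give up its old contents and accept new data in the same cycle, which is the property the text requires of the cell for simultaneous read/write and for shifting.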
The self-compacting buffer operations include read, write, and simultaneous read/write. There are four distinct cases by which the actions of storage cells in the buffer are determined. These four cases are explained below.

Case 1: Single Write (Insertion). For a given address to write data in, all storage locations whose addresses are less than the write address retain their data. The storage locations whose addresses are greater than or equal to the write address must shift their data contents down to open a space in the buffer for incoming data.
Case 2: Single Read (Deletion). All storage locations whose addresses are less than the read address hold their data. The rest of the storage locations shift the contents of their storage location up.
Case 3: Simultaneous Read/Write (address of read < address of write). In this case, the storage locations with addresses smaller than the read address are not affected. The storage locations with addresses greater than the read address and less than or equal to the write address shift their contents up. The rest of the storage locations take no action.
Case 4: Simultaneous Read/Write (address of write < address of read). In this case, only the storage locations whose addresses are greater than or equal to the write address and less than the read address shift their contents down. Other storage locations require no action.
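The four cases can be summarized as a per-row decision. The Python sketch below encodes the address comparisons exactly as they are worded above and returns only the shift direction for each row; the read and write lines of the selected rows are assumed to be driven separately. The function name `shift_lines` is hypothetical, and the behavior when the read and write addresses coincide is an assumption, since the text does not cover that situation.

```python
def shift_lines(p, read_addr=None, write_addr=None):
    """Return "hold", "up", or "down" for each of the p rows (sketch).

    The boundaries follow the wording of cases 1 to 4 literally.
    """
    lines = []
    for k in range(p):
        if read_addr is None and write_addr is None:
            action = "hold"                                   # no request
        elif read_addr is None:                               # case 1: single write
            action = "down" if k >= write_addr else "hold"
        elif write_addr is None:                              # case 2: single read
            action = "up" if k >= read_addr else "hold"
        elif read_addr < write_addr:                          # case 3
            action = "up" if read_addr < k <= write_addr else "hold"
        elif write_addr < read_addr:                          # case 4
            action = "down" if write_addr <= k < read_addr else "hold"
        else:
            action = "hold"   # equal addresses: assumption, not stated in the text
        lines.append(action)
    return lines


print(shift_lines(6, write_addr=2))
print(shift_lines(6, read_addr=1, write_addr=3))
```

Writing the cases this way makes it visible that cases 3 and 4 only touch the rows between the two addresses, so the rest of the buffer is undisturbed during a simultaneous read/write.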
The buffer controller determines how the buffer's data is moved within the buffer as well as to the read bus (R_BUS) and from the write bus (W_BUS). Based on the case of the current requests, the controller sets the path for the data to be moved.
Case selection is determined just after a write and/or read request is received. When a single write or read occurs, the buffer controller decodes the address and selects the corresponding write or read line for that row of cells. The rest of the rows with addresses larger than the selected one are set to shift down or shift up the stored data. These cases correspond to cases 1 and 2. If a simultaneous read and write occurs, both addresses are decoded at the same time. The distinction between cases 3 and 4 is the direction in which the data is shifted. This direction depends on where the read and write actions are located within the buffer, which must be determined in order to set the buffer's down and up lines.
When the buffer is full (i.e., all storage locations contain valid data), the buffer controller prevents data from being written to the buffer unless a simultaneous read/write occurs. In this case, writing data is allowed since the simultaneous read creates a space in the buffer. When the selected channel is empty, an empty signal informs the buffer controller that this channel's buffer address contains no data. If a channel is empty, the buffer controller cancels the read request for that channel. Thus, the full and empty input signals cancel write and read operations, respectively. Figure 6 shows the CMOS circuit diagram for selection of a down line in row k. The internal write signal W_int is generated using the full signal, which prevents writing when the buffer is full, and the external write request signal. Therefore, signal W_int determines when writing to the buffer is allowed. If there is an allowable write request (i.e., W_int = 1), the down-w_k line is precharged through transistor T_pre.
At the same time, transistor T_b is on to set transistor T_kill off, preventing a short circuit between T_kill and T_pre since T_c is on. After precharge occurs, transistor T_pass allows the write_k signal to pass to the buffer control cell. Writing to row k sets signal write_k to 1; this signal turns T_kill on and T_c off. All down lines above row k will be discharged and all down lines below will remain charged (i.e., set to logic 1). When there is no allowable write request (i.e., W_int = 0), transistors T_a and T_kill are on and transistor T_c is off.
This in turn causes a discharge of the down-w_{k-1} line through T_kill and of the down-w_k line through cell k + 1. Thus, all the down signals are set to logic 0. A circuit similar to the one in Figure 6 is used for the read request to set the up-r lines. In this case signals read_k and R_int are used.
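The gating of requests and the resulting pattern on the down lines can be sketched at the logic level. The Python model below is an interpretation, not the transistor circuit of Figure 6: it assumes that a read is cancelled when the selected channel is empty, that a write into a full buffer is allowed only when an accepted read accompanies it, and that "above row k" means rows with smaller addresses. The function names `internal_requests` and `down_w_lines` are hypothetical.

```python
def internal_requests(read_req, write_req, empty, full):
    """Return (R_int, W_int), the requests that survive empty/full gating."""
    r_int = read_req and not empty
    # A write into a full buffer is allowed only if an accepted read frees a slot.
    w_int = write_req and (not full or r_int)
    return r_int, w_int


def down_w_lines(p, write_row, w_int):
    """Logic-level view of the precharged down lines for a write to write_row."""
    if not w_int:
        return [0] * p              # no allowable write: every line is discharged
    lines = [1] * p                 # precharge phase: all lines charged
    for j in range(write_row - 1, -1, -1):
        lines[j] = 0                # discharge ripples through the rows above row k
    return lines


print(internal_requests(True, True, empty=False, full=True))
print(down_w_lines(5, write_row=2, w_int=True))
```

The charged lines form a contiguous run from the written row to the bottom of the buffer, which matches the rows that shift down in a single write.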
Once the down and up lines have been set, they are allowed to pass to the buffer depending on the
Channel Pointers
If there are n output channels in the router system, the buffer should be able to accommodate data for that many channels. Data for each channel is dynamically allocated in the buffer; there is an address that corresponds to the beginning of the data for each channel. As write or read operations occur, the amount of data in each channel increases or decreases. This causes the buffer to expand or compact, changing the location address of data for a set of output ports. This address is kept and updated by a set of channel pointers. Each channel pointer holds the address of the data for the corresponding output channel.
When a single read or write occurs, data is read from the top or written to the bottom of the selected channel, decreasing or increasing its space. All channels numbered higher than the selected channel update their addresses, decrementing or incrementing them by 1. A read and a write can occur simultaneously to the same channel; no channel pointer is updated for a simultaneous read/write request in this case.
When the read and write requests are for different channels, the channels that lie between these two channels are updated. This is done in the same fashion that the buffer is updated. If the read channel number is less than the write channel number, then all channels greater than the read channel and less than or equal to the write channel decrement their addresses by 1. Similarly, when the write channel number is less than the read channel number, all channels greater than the write channel and less than or equal to the read channel increment their addresses by 1.
When a channel contains no data (i.e., it is empty), its starting address is the same as the starting address of the next channel. The empty signal can be generated by comparing the addresses of adjacent channels; when these addresses are identical, the empty signal for the particular channel is set. The channel update (increment or decrement) is canceled if the read or write operation for the given address is canceled due to an empty or full condition.
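The pointer update rules above can be checked with a short behavioral model. This Python sketch keeps one start address per channel and, as an assumption of the sketch, adds one extra sentinel entry that marks the end of valid data so that the last channel also has a "next" address for the empty comparison. Cancellation on empty or full is left to the caller. The names `update_pointers` and `is_empty` are hypothetical.

```python
def update_pointers(ptr, read_ch=None, write_ch=None):
    """Return updated channel start addresses after one accepted operation.

    ptr has n + 1 entries: ptr[c] is the start address of channel c and
    ptr[n] is a sentinel for the end of valid data (assumption of this sketch).
    """
    new = list(ptr)
    last = len(ptr) - 1
    if read_ch is not None and write_ch is not None:
        if read_ch < write_ch:
            for c in range(read_ch + 1, write_ch + 1):
                new[c] -= 1         # channels in (read, write] move up
        elif write_ch < read_ch:
            for c in range(write_ch + 1, read_ch + 1):
                new[c] += 1         # channels in (write, read] move down
        # same channel: no pointer changes
    elif read_ch is not None:
        for c in range(read_ch + 1, last + 1):
            new[c] -= 1             # single read: higher channels decrement
    elif write_ch is not None:
        for c in range(write_ch + 1, last + 1):
            new[c] += 1             # single write: higher channels increment
    return new


def is_empty(ptr, ch):
    """A channel is empty when its start equals the next channel's start."""
    return ptr[ch] == ptr[ch + 1]


ptr = [0, 0, 0, 0]                  # three channels, all empty after reset
ptr = update_pointers(ptr, write_ch=1)
print(ptr, is_empty(ptr, 0), is_empty(ptr, 1))
```

Note that the entry for channel 0 is never modified by these rules, consistent with channel 0 always holding the first address in the buffer.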
A CMOS circuit diagram for a channel pointer cell is shown in Figure 9. Reset of the circuit through transistor T_reset causes all channels to set their starting address to 0 (i.e., they are all empty). It should be noted that the first channel (i.e., channel 0) always contains the first address in the buffer. An output channel address is provided to the channel pointers with every read and/or write request. The channel address is decoded and the corresponding channel is selected, turning on transistors T_r and/or T_w, respectively. This allows the buffer address to be passed to the read and/or write address bus. The sum or difference when incrementing or decrementing a channel's starting address is generated through an XOR of the stored address bit and the carry in, C_in. This result passes through T_clock and T_sum, which are both clocked transistors. However, T_sum only turns on if the sum_k signal is set to logic 1. The sum_k signal is generated when an add or subtract occurs. In addition, transistors T_clock and T_sum are never on at the same time, since this would form a closed loop through the inverters and the output of the XOR. Adding or subtracting determines whether C_in is propagated to C_out or killed (set to logic 0). If the add_k signal is set to logic 1, transistor T_add is on, allowing the address bit to pass. Therefore, C_out is equal to C_in when the address bit is set to logic 1; otherwise C_out is killed. Setting the sub_k signal to logic 1 allows the address bit to pass through transistor T_sub. In this case, if the address bit is set to logic 0, C_out is equal to C_in.
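The carry logic described for the channel pointer cell amounts to a ripple incrementer/decrementer. The Python sketch below models one bit slice and chains slices from the least significant bit. It assumes that the carry into the least significant bit is 1 whenever a pointer is being updated and that addresses wrap around modulo 2^width; neither detail is stated in the text. The names `pointer_bit_slice` and `step_address` are hypothetical.

```python
def pointer_bit_slice(bit, c_in, add):
    """One channel pointer bit: returns (new_bit, c_out).

    new_bit is the XOR of the stored bit and the carry in.
    Add:      C_out = C_in when the bit is 1, otherwise killed.
    Subtract: C_out = C_in when the bit is 0, otherwise killed.
    """
    new_bit = bit ^ c_in
    if add:
        c_out = c_in & bit
    else:
        c_out = c_in & (1 - bit)
    return new_bit, c_out


def step_address(value, width, add):
    """Increment (add=True) or decrement (add=False) a width-bit address."""
    carry = 1                       # assumption: C_in of the lowest bit is 1
    result = 0
    for k in range(width):
        bit = (value >> k) & 1
        new_bit, carry = pointer_bit_slice(bit, carry, add)
        result |= new_bit << k
    return result


print(step_address(5, 4, add=True), step_address(5, 4, add=False))
```

Because the same XOR produces both the sum and the difference, only the carry selection differs between incrementing and decrementing, which is why a single cell can serve both operations.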
Finally, while no operations are being performed, transistor T_fb is turned on to provide feedback.

The current self-compacting buffer implementation uses a two-phase clock scheme. In this implementation the operations of the channel pointers, buffer controller, and buffer are pipelined; thus, these operations can be overlapped. A detailed description of the timing approach follows.
A timing diagram for some operations that can occur is shown in Figure 10. The operations shown are a simultaneous read/write, a single read when the selected channel is empty, and a single write when the buffer is full. The channel pointers receive the output channel number at the beginning of clock phase φ2. At this time, the channel address is decoded and, subsequently, the buffer address is passed on to the buffer controller at φ1. Updating of the channel pointers begins at the trailing edge of φ2, and generation of the read, up, write, and down lines begins at the trailing edge of φ1. In the next φ2 cycle the data is written, read, and shifted in the buffer. The same timing is used for a single read or single write, except that only the operations for the read or the write, respectively, are carried out. If the selected channel is empty when a read is requested or the buffer is full when a write occurs, the channel update, buffer controller, and buffer operations are canceled during that cycle. This is shown in the timing diagram for a read only and a write only. For a simultaneous read/write with an empty and/or full signal, the read and/or write operations would be canceled. The empty and full signals in a given cycle result from the channel update of the previous cycle.
The channel and buffer addresses remain valid for a complete clock cycle although they are not needed for the entire cycle. However, the next address cannot be presented until the end of the cycle since updating the channel pointers requires a full cycle. Therefore, the packet flow controller can receive a channel address at every cycle. Data can also be read and written every cycle, except if a channel is empty during a single read or the buffer is full during a single write.
Concluding Remarks
A novel VLSI CMOS implementation of a self-compacting buffer (SCB) for the dynamically allocated multi-queue (DAMQ) switch architecture has been presented in this paper. The DAMQ switch has been shown to provide the best performance among the buffered switch architectures [2]. The SCB is a novel scheme to dynamically allocate data regions for each output channel as required in the DAMQ switch. The SCB allocates only the required buffer space per channel, allowing data expansion as needed to accommodate data storage demands.
We have presented the major components of the SCB architecture as well as the VLSI CMOS circuitry associated with these components. The components of the SCB are capable of performing a read, a write, or a simultaneous read/write operation. The major blocks of the SCB architecture include the buffer, which stores data, moves it in and out, and shifts it; the buffer controller, which selects a case for each operation and thereby determines data movements in the buffer; and the channel pointers, which keep track of the dynamic changes of the data in the buffer. For each of these components we have developed novel circuitry.
The proposed SCB VLSI implementation has been extensively simulated and fabricated. The SCB system has been pipelined to further enhance its performance. This system can perform a read and/or a write operation per cycle, as shown in the system's timing.
