Design and implementation of interface units for high speed fiber optics local area networks and broadband integrated services digital networks by Pang, Joseph et al.
High
Appendix D
Speed Fiber Optic
by
Dr. Fouad Tobagi
Stanford
Interfaces
University
/v c c /_-" O6/
//_ -7_ " c/2""
3117D 
(NASA-CR-I_7384) DESIGN AND IMPLEMENTATIqN
OF INTERFACE UNITS FOR HIGH SPEED FIBER
OPTICS LOCAL AREA NETWORKS AND BROADBAND
INTEGRATED SERVICES DIGITAL NETWORKS
(Stanford Univ.) 38 p CSCL 20F G3174
N91-12336
unclas
0311769
https://ntrs.nasa.gov/search.jsp?R=19910003023 2020-03-19T20:07:30+00:00Z
Design and Implementation of Interface
Units for High Speed Fiber Optics Local
Area Networks and Broadband Integrated
Services Digital Networks
ismail Dalglq Joseph Pang
Fouad A. Tobagi
Department of Electrical Engineering,
Stanford University, Stanford, CA94305-4055
31 October 1990
Abstract
This constitutes the final report for the portion of the JSC/Stanford
Cooperative Research Program dealing with High-Speed Fiber Optic
Networking. More explicitly, the research effort in this area comprised
the design and implementation of interface units for high speed fiber
optic local area networks (FOLANs) and Broadband Integrated Ser-
vices Digital Networks (B-ISDN). Since we initiated this work several
years ago, a number of network adapters that are designed to support
high speed communications have emerged. We have been examining
these proposals very closely, in order to identify their critical features,
to build upon their experiences and to assess their appropriateness for
very high speed communication in the Gb/s range. Our approach to
the design of a high speed network interface unit has been to imple-
ment packet processing functions in hardware, using VLSI technology.
We describe the VLSI hardware implementation of a buffer manage-
ment unit (BMU), which is required in such architectures.
1 Introduction
An unfortunate problem with high speed networking today is the limited
throughput that can be attained by the users in comparison with the avail-
able bandwidth of the transmission medium. This limitation exists even
when only a single pair of users are actively communicating over the network,
and stems from the complexity of the protocols used at the various layers
(namely 2-4), and their implementation (typically in software). With the use
of fiber optic technology and efficient multi access protocols, the transmis-
sion medium's bandwidth of local area networks has been increasing. Today,
the FDDI standard is designed to run at 100 Mb/s. Furthermore, a great
deal of research effort is under way to support broadband multimedia com-
munications over a wide area network. An effort within CCITT is already
underway to standardize Asynchronous Transfer Mode (ATM), which is the
most appropriate switching technique to support communications over such
a broadband integrated services network. At present, the ATM standard
specifies line speeds equal to 155.520 Mb/s, 620.080 Mb/s, and above. Con-
sequently, numerous applications have emerged with very high bandwidth
requirements, such as the transmission of high resolution images or video
signals, communications among remote supercomputers to perform jointly a
task such as flight simulation, etc. Such applications require packet process-
ing within the network interfaces at rates several orders of magnitude higher
than that possible with current software implementation of network proto-
cols. Accordingly, the problem of designing high speed network interface
units has become a crucial one.
The throughput limitation in today's network interfaces is mainly due to
the software implementation of the protocols. Indeed with the exception of
the CRC and addressing functions at the logical link control (LLC) sublayer,
all functions residing at the LLC sublayer and above are implemented in
software, either in the host processor or using a microprocessor in a network
adapter board. Furthermore, packets and their headers are stored in a shared
memory which gets accessed several times while processing the same packet,
thus complicating the memory management problem as well as introducing
memory access delays.
In one of our earlier research efforts, we measured the performance of
a program implementing the IEEE Std 802.2 Logical Link Control (LLC)
protocol in order to improveour understandingof packetprocessingrequire-
ment within a network interface [1]. Our results showedthat most of the
program execution time is spent in a few relatively low-level functions; most
importantly, the block movementof data in memoryand packetqueueman-
agement.The importanceof thesefindings wasparamountasthey suggested
the main areasto beaddressedin the designof high speednetwork interfaces.
We are taking advantageof these findings in our researcheffort described
here.
Since we initiated this work several years ago, a number of network
adapters that are designed to support high speed communications have emerged.
All of these network adapters were designed around an on-board micropro-
cessor with the exception of the XTP Protocol Engine which is designed
for VLSI implementation [2]. We have been examining these proposals very
closely, in order to identify their critical features, to build upon their experi-
ences and to assess their appropriateness for very high speed communication
in the Gb/s range. This task is not yet complete, and we will pursue it in the
future. At the present time, we are in a position to describe accurately the
architecture underlying each adapter and the expected performance given
the specification of their components. In Section 2 we give a brief descrip-
tion of three key architectures, namely; NAB [3], CAB [4] and XTP Protocol
Engine. Not unlike XTP, our approach to the design of HSNIU has been to
implement packet processing functions in hardware, using VLSI technology.
One principal reason for this choice is that with VLSI hardware implementa-
tion of protocol functions, high rate on-the-fly processing may be achieved,
thus overcoming many of the bottlenecks experienced in architectures based
on software implementation. Another reason is that, with VLSI hardware
implementation, one may incorporate multiple packet processors in the _ame
user device, either to achieve higher throughput or to accomodate multiple
distinct streams of data as will be required in multimedia applications, while
keeping the interface design relatively compact and at low cost. In Section
3, we describe the steps accomplished so far in this endevour particularly
the VLSI hardware implementation of a buffer management unit (BMU),
required in such architectures.
3
2 Related Work
2.1 Introduction
In this section, we describe several existing high speed network adapters,
explaining their operation and identifying the similarities and differences
among them. We have chosen to study 3 such designs, namely, the VMP
Network Adapter Board (NAB) [31, the Nectar CAB [4] and the XTP Pro-
tocol Engine [2]. Among the three, the NAB and the CAB are designed
around a microprocessor-based architecture, whereas the Protocol Engine
was designed with the VLSI implementation in mind.
In addition, we are still examining several other designs such as a net-
work adapter being designed at the IBM Zurich Research Laboratory which
is based on a multiple-processor architecture using transputers [5]. How-
ever, our studies of these designs are not finalized yet, and thus we are not
discussing these designs here.
2.2 The VMP Network Adapter Board (NAB)
The NAB is a network adapter designed to support efficient distributed pro-
cessing over a high speed network. It is designed for the VMP multiprocessing
system, but it is applicable to other computer systems as well. It is designed
to implement the VMTP transport protocol. However, without any major
changes in its architecture, it can implement other transport protocols as
well.
The internal architecture of the NAB consists of five major components,
as shown in Figure 1. The following is a description of the functions of each
component:
Network Access Controller (NAC): It implements the network access
protocol, performs the data conversion between serial and 32-bit par-
allel forms, and transfers the data between the network and the packet
pipeline.
Packet Pipeline: It consists of two separate pipelines, one for sending pack-
ets and one for receiving packets, thereby achieving full duplex com-
munications. The sending packet pipeline generates transport level
Host Bus
I E
HC68020 I
CPU BusI
?i
Serial portI
32
L
_J2
k2__t2
Neb ork
Figure 1: The NAB
5
checksums and optionally encrypts the data. Similarly, the receiving
packet pipeline performs checksum calculations and if needed, decyrpts
the incoming data. The packet pipelines operate at the network data
rate and perform processing on the fly. In order to be able to do
that, the checksum portion is placed at the end of a packet in VMTP.
Checksumming and encryption are the only two functions that require
accessing to the data portion of the packet, so using these pipelines,
the memory access requirements for the on-board processor is greatly
reduced as it would only need to access the header portion of a packet.
Buffer Memory: Video RAM ICs are used as buffer memory in the NAB.
Such an IC provides two independently accessed ports: one provides the
traditional random-access transfer, and the other provides high speed
serial access through a 4-bit wide shift register. The organization of
the VRAM chips used in the NAB design is shown in the Figure 2. 8
chips are used per memory bank to provide a word size of 32 bits in
both the random access and seriaI ports, as also shown in that figure.
The serial access port does not need address setup and decode time;
so it is faster than the random access port. In the NAB prototype, the
serial access time is 40ns per word and the random access time is 200
ns, giving an effective transfer rate of 800 Mb/s over the serial port
and 160 Mb/s over the random access port. Because of this difference
in the bandwidth, the serial access port is used for both transferring
the data between the network and the buffer memory and between
the host computer and the buffer memory. The random access port
is only used by the on-board processor for protocol processing. The
transfers between both the buffer memory and the host and the buffer
memory and the network have a bursty nature since an entire packet
is transferred at a time. Therefore, the choice of using the serial port
for these transfers is appropriate.
In order to be able to receive a packet coming from the packet pipeline
into the buffer memory at the same time as writing a previously received
packet from the buffer memory into the host (or vice versa), the buffer
memory is organized as two separate banks. A 2x2 crossbar switch
connects both of these banks to both the packet pipeline and the host
adapter so that a packet can be copied from the packet pipeline into
8Video
RAM ICS
per bank
for 4
=b'_ / !word
I
I
Address
I
andom
Accessllll 4bits
Data IIII
P_FI IH_
i_ 256x256x4bit
memory array ]
I I Serial
I2_4o,,_ Acces:
I shift reg I Data
Port
I
_J
i
Figure 2: VR, AM buffer memory organization for one memory bank
one of the memory banks, and then by changing the state of the switch,
it can be copied to the host, while another packet is being received into
the other buffer memory bank. This effectively doubles the bandwidth
of the serial port. The state of the crossbar is controlled by the on-
board processor.
On-board processor: The purpose of the on-board procesor is to manage
the buffer queue and implement the transport level protocol, and also to
control the operation of the host block copier and the packet processing
pipeline. A 16-Mhz MC68020 was chosen for this purpose, which has
a rating of 2 MIPS.
Host Block Copier: Its purpose is to transfer data between the host com-
puter and the NAB. It includes a DMA controller to efficiently copy
data from the host memory into the NAB buffer memory. There is
a 1024 byte control register in this unit which accessible by both the
NAB and the host computer. This register is used to exchange requests
and responses between the NAB and the host. When sending packets,
two different schemes are used depending on the size of the packet. For
long packets that would not fit into the control register, the host places
the pointers into the control registers for the DMA controller to get the
data from the memory. For short packets, the host writes the data into
the control register directly without invoking the DMA controller on
the NAB. This scheme is motivated from the idea that a short packet
to be transmitted is most likely to be a message that is very recently
prepared, and hence a copy of it is in the host processor cache. Thus,
instead of putting it back into the main memory and then using the
DMA controller, it is more efficient to copy the data directly from the
cache into the control register.
When receiving a packet from the network, the NAC performs the serial
to parallel conversion and starts transfering the received packet into the
decryption hardware in the packet pipeline. Also, it interrupts the on-board
processor to notify it about the packet arrival. At the first decryption stage, a
decryption key is searched for, and if it is located, the data proceeds through
the decryption stages. Otherwise, the decryption stages are disabled and the
data proceeds directly to the checksumming stage. If the packet passes the
8
checksummingcorrectly, it is transferred to the shift register in one of the
memory banks, depending on the state of the crossbar switch. In exceptional
cases, such as the buffer memory being full or the packet not having a correct
checksum, the packet is lost and the packet pipeline is flushed and restarted.
This is handled by the on-board processor.
When the entire packet is moved into the shift register, it is transferred
into an available location in the random access memory specified by the on-
board processor, and queued there for header processing. After the header
processing is completed, the packet is trasnferred to the host memory using
the host block copier.
For sending a packet, the host writes the request to the control register
(along with the data for a short packet, and the pointers to the data for a long
packet). As soon as the control register is written, its contents are transferred
to a queue of requests for the attention of the on-board processor. While the
data of the packet is being transferred into the buffer memory, the on-board
processor prepares its header. When the header processing is finished, the
packet is transferred to the packet pipeline, by putting the crossbar in the
opposite state as it was when the packet was moved into the memory and
then copying the packet to the pipeline through the shift register. When the
entire packet is passed from the pipeline and transmitted over the network,
the NAC interrupts the on-board processor which in turn returns the packet
buffer to the pool of empty buffers. If a packet is being received from the
network while another packet to be sent is being transferred to the buffer
memory, the received packet is queued at a stage provided between the packet
pipeline for reception and the shift register of the buffer memory. Other than
this situation, a packet that is to be sent is delayed until there is no packet
being received, i.e., the priority is given to the received packets.
It is clear that for the header processing function not to be the bottleneck,
the on-board processor must finish processing one packet header in a time
shorter than the time to receive an entire packet. The maximum network
bandwidth at which this can be achieved depends on the relative bandwidths
of the seriaJ and random access ports, the on-board processor speed and the
length of the packets, as well as the header processing requirements of the
particular protocol being implemented. Another bandwidth limitation comes
from the packet pipeline. In the current prototype, the packet pipeline is able
to operate at data rates around 100 Mb/s.
2.3 The Nectar Communication Accelerator Board
(CAB)
Nectar is an experimental network for multiprocessing, built to support the
WARP operating system. It consists of a number of crossbar switches, called
HUBs, and network adapter boards, called CABs.
Host bus
I Program i cPui rMemory,IReg'ster5
_._, I_o_e_,ooI
I !
Network
I'lemory
!
VPIE I
Adapter
i
I
Data
Hemory
i
I
Controller
I
Devices
I
I
Figure 3: The CAB
Just like the NAB, the CAB is a processor-based design. It is designed
to be general enough to implement several transport protocols. It consists of
8 main parts which are interconnected as shown in Figure 3. The following
is a description of these parts:
CPU: It is used to implement most of the protocol processing, and also it
controls the operation of the CAB. In the CAB prototype, a 16Mhz
SPARC processor is used for this purpose.
Registers and Devices: This part consists of various devices to support
high speed communications such as a hardware checksum device and
hardware timers. Also, some of the registers to store the state of the
10
CAB areincluded in this part. Differently from the NAB, thesedevices
are placed on the CPU bus, providing flexibility at the expenseof
additional data memory bandwidth due to the carrying of the data
from the memory to thesedevicesover the CPU bus.
Program Memory: This is the memory that is intended to be used only
for storing the programsfor the CPU. A fast static RAM (35ns) is used
for this purposeso that the peak execution rate of the processorcan
be sustainedwithout any caching.
Data Memory: It is usedas the buffer memory for both the receivedand
sent packets. In the prototype implementation, a 35ns dual ported
static RAM is usedas the buffer memory,with one port connectedto
the data memory bus and the other to the CPU bus. The CPU bus
is used only for the protocol processingby the CPU which includes
copying the entire data portion of the packetover this bus to the hard-
ware checksumdevice. The queuesin this memoryaremanagedby the
CPU.
Fiber Out/Fiber In Queues: They consist of the media accessunits for
the sentand receivedpackets,respectively.TAXI chipsareusedto per-
form the serial to parallel conversion.Also, FIFO queuesare provided
here for both incoming and outgoing packets.
DMA Controller: A DMA controller resides on the data memory bus and
it is normally used to transfer the packets between the network and the
data memory as well as between the memory and the host. The DMA
controller also waits for data to arrive if the input queue is empty, or
for data to drain if the output queue is full, thereby performing some
form of flow control.
VME Interface: It provides an interface for both the data memory bus
and the CPU bus to the host bus, which is assumed to be a VME bus.
Through this interface, both the program and the data memories on
the CAB are directly addressable by the host, and so are all registers
and devices of the CAB. In addition, the CAB CPU can access the host
resources through this interface.
11
Memory Protection: The CAB is intended to offload the host of applica-
tion tasks when it is not performing communication functions. In order
to support this it runs a simple multitasking kernel which requires a
memory protection mechanism that is provided in the CAB. Currently
it supports up to 32 protection domains.
The communication between the host and the CAB is normally accom-
plished using DMA, shared buffers and VME interrupts. However, it is also
possible for the host to bypass the CAB by performing all of the protocol pro-
cessing by itself and accessing the Fiber In and Fiber Out queues directly.
In the following, we will describe the former mode of operation.
The DMA controller continually polls the Fiber In queue for new packets.
When a new packet is received, after the serial to parallel conversion of the
packet is performed and the packet is buffered in the Fiber In queue, it is
transferred by the DMA controller to a memory location in the data memory
that is provided by the CPU, where it is queued for protocol processing. The
CPU performs the processing of the packet header, which is dependent on
the particular protocol being executed. Meanwhile, the data portion of the
packet is transferred over the CPU bus to the hardware checksum device,
either by the CPU itself or by the checksum device which incorporates a
simple DMA mechanism.
When the processing of the packet is finished, the data is copied to the
host memory using the DMA controller. Therefore, the packet gets moved
twice over the data memory bus and once over the CPU bus. Additionally,
on the CPU bus, the packet header is accessed for protocol processing. This
access volume is likely to be less than the access volume for transferring the
packet once over the bus for reasonably long packets. As a result, an upper
bound on the throughput that can be achieved is half of the bandwidth of
the data memory bus.
The sending of the packets proceeds in an analogous fashion. The CPU
is notified of a send request by the host by placing a request message in a
special mailbox in the CAB data memory. Then the CPU sets up the DMA
controller to transfer the data from the host memory to the data memory.
After the packet processing is completed, which now includes packetization
and hardware checksum generation, the DM_ controller is set up by the CPU
to transfer the packet from the data memory to the Fiber Out queue.
12
Overall, the CAB architecture does not seemto be too much different
from the NAB architecture. The main differencesare the following: (i) in
the CAB, the checksumdevice is placedon the CPU bus, resulting in extra
memory accesseson the bus, whereasin the NAB it is placedbetween the
network accesscontroller and the buffer memory; (ii) the CAB memory and
devicesare accessibleto the host, thereby allowing the host to produceand
consumedata directly in the CAB data memory; (iii) support is provided
to offioad the host applications to the CAB CPU when it is not performing
networking functions.
2.4 The Protocol Engine
Contrary to the other network adapters, the protocol engine is designed to be
implemented as a VLSI chip set. It executes a simplified transport protocol
called XTP, which is optimized for receiver performance, as normally the
receiver part of a protocol constitutes a bottleneck. It is intended to perform
the packet processing on-the-fly, in real time.
The architecture of the protocol engine consists of seven main parts as
shown in Figure 4. The following is a description of each part:
Protocol Engine Control Logic: Along with the address logic and buffer
logic, the protocol engine control logic forms the core of the system.
It contains a state machine for handling the transport protocol and
controllers for the address, network, buffer and host components. The
XTP is connection oriented and each connection has an associated state
vector stored in the protocol engine. The protocol engine control logic
is responsible for managing these state vectors, performing sequence
number checks, packetization, observing round-trip delays, adjusting
connection timers accordingly and performing retransmissions.
Address Logic: Instead of sending a long address associated with a connec-
tion with every packet, it is only sent in the header of the first packet
along with a key. In the subsequent packets, only the key is sent in
the header, implying the long address. When a packet is received by
the protocol engine, the address logic searches for the long address in
the route table and state vector memory, and if it is found, loads the
state vector corresponding to that connection to the control logic. The
13
Protocol
engine
control
logic
Network
I
Network
interface
&
network
_l interface
I°gl I
l°gl I
I I Buffer
logic
Route table
State vector
Memory
!
Buffer memory
Host Bus
Figure 4: The Protocol Engine
14
addresslogic also is used to store the updated state vectors into the
route table and state vector memory.
Route Table and State Vector Memory: This memory is usedto store
the route table, long addressesand the state vectorscorrespondingto
the connections. It is managedby the addresslogicasmentionedabove.
Buffer Memory: It is the placewhere the incoming and outgoing packets
are queued. Similarly with the NAB, VRAM chips are used for this
purpose where the serial port is used for the network and host accesses
and the random access port is used for protocol processing.
Buffer Logic: It is used for transferring the data between the buffer memory
and the protocol engine control logic, network interface logic and host
interface logic. It is also used for managing the multiple queues in the
buffer memory.
Host Interface Logic: This component is used for moving the data be-
tween the protocol engine and the host computer. Its design is depen-
dent on the particular host bus on which the protocol engine is being
used. However in most cases, it would contain a DMA controller to
copy the data between the host memory and the protocol engine buffer
memory.
Network Interface Logic: This part performs the data transfers between
the network and the protocol engine, and it implements the media
access protocol. It also performs the checksumming function and serial
to parallel conversion. The frame check sequence is stored at the end
of the packet in XTP, hence allowing the checksumming to be done on
the fly.
When receiving a packet, the network interface logic begins to transfer
the packet to the address logic as it comes in. As soon as the header gets
into the address logic, the address logic uses the address key in the header to
find the corresponding state vector for the packet's connection. While this is
being done, the incoming data field of the packet gets to be stored into the
buffer memory. Also, the computation of the frame check sequence (FCS)
gets performed as the data passes through the network interface logic. If the
15
state vector is located, it is loaded into the control logic. Otherwise, this
is either a new connection, in which casea new state vector is generated,
or it is a faulty packet and it is rejected. After the state vector is loaded,
the control logic checksthe sequencenumbersagainst the expectedsequence
numbers. At the time whenthis is finished, the packetis completely received
and buffered,and the FCSis received.If the FCSchecksout correctly and the
sequencenumbersconformwith the expectedones,then the packet is linked
to the appropriate queueand the state of the correspondingconnection is
updated and stored. After this, the host is notified about the arrived packet
and the data is copiedinto the host memoryby usingthe host interface logic.
When sendinga packet,it is first copiedthrough the host interface to the
buffer memory and linked in the queue that correspondsto the belonging
connection. Then the state vector for that connection is loaded into the
control logic which in turn preparesthe packet header. Then it updates
the state vector and sets up the timer corresponding to this connection.
After this, the packetsbeginsto get transmitted on the network through the
network interface logic. As the data passesby, the network interface logic
computesthe FCS and appendsit to the end of the packet.
3 The High Speed Network Interface Unit
(HSNIU)
3.1 Architecture
3.1.1 Introduction
Our goal is to design and implement a high speed network interface unit
(HS-NIU) capable of providing network users with the throughput required
for broadband applications. Furthermore, we are seeking a design which can
be useful for a wide range of applications with different end-to-end commu-
nications requirements. This effort requires the performance of three basic
steps:
• The determination of all the functions that the interface must perform,
with a clear identification of the critical ones;
16
The determination of an architecture, according to which the func-
tions are to be partitioned, which provides the desired modularity and
achieves the required performance;
• The choice of appropriate technologies by which to implement the func-
tions.
It should be clear that the three steps identified so far are closely inter-
related, and that the process of devising an adequate solution to the design
of an HS-NIU is an iterative one, comprising in-depth evaluations of many
trade-offs and alternatives which cut across all three areas.
Preliminary investigation has shown that there are 5 basic critical factors
which must be taken into account in the design. These are:
1. The medium's bandwidth
2. The host's bandwitdh
3. The interface's memory bandwidth
4. The packet processor's bandwidth
5. The medium access controller's reaction time.
Functions within the interface will undoubtedly be performed at different
speeds, and it is the goal of an optimum design to keep in balance all the
various elements performing these functions. Given the five factors and the
differences among them, it appears logical to divide the interface into three
basic units (see also Figure 5):
• The Medium Dedicated Unit (MDU)
• The Storage and Processing Unit (SPU)
• The Host Dedicated Unit (HDU)
The MDU will certainly have to run at the speed of the medium as far as
data transfer is concerned, and will have to include a media access controller
whose reaction time is compatible with the performance requirements. The
17
Host bus
Host
Interf.
Unit
!
I
i
l
i
i
i_ Transmit Buffer I data
_ _add_essT_
Media
Access
Unit
Network
Y
Figure 5: Block diagram of the HSNIU
18
MDU will perform parallel-serialconversions,and all functions that may be
implementedserially,suchasCRC and encoding. The MDU will beentirely
implementedin hardware.
The SPU will provide the storagecapability for user data, and will per-
form packetprocessingsuchas packetization,LLC and the transport proto-
col. We pay attention to the designof the SPU so that the memory band-
width is sufficiently high to interfaceto the MDU and the HDU. Implementa-
tion of the processingunit may be a combinationof hardwareand firmware,
dependingon the specificfunctions implemented.
Finally, the HDU will provide the interface with the user host and the
SPU. It will alsobe entirely implementedin hardware.
Of the three units mentioned above, it is consideredthat the SPU is
the most crucial one and hencewe concentratedon it as the main part of
our researchat this stage. Indeed it is the coreof any interface regardless
of the type of medium, accesscontrol protocol or host under consideration.
Furthermore, it is the leastexperimentedwith, and it is the most challenging
in determining the best architecture for a high speedinterface.
As part of our effort on designingthe SPU, wehave designedthe Buffer
ManagementUnit for transmissionat the logic level. We arecurrently work-
ing on designingthe buffer managementunit for reception. After this, the
next step will be to implement a prototype of the SPU in VLSI.
The folowing parts of this section mainly cover the description of the
Buffer ManagementUnit for transmission.
3.1.2 Functions of the Buffer Management Unit For Transmission
(BMU-T)
The BMU-T serves to replace the most frequently executed routines in con-
ventional software buffer management. After some careful considerations,
it has been suggested that the BMU-T must perform the following 3 major
functions:
• To manage the transmit dual-port buffer as a FIFO queue.
, To perform packetization on the fly.
• To handle data link Go-Back-N ARQ protocol.
19
3.1.3 Specific algorithms used in the Buffer Management Unit
Although there are interactions between the three functions of the BMU,
they can be more or less discussed separately.
a) FIFO Management:
As mentioned before, the dual-port buffer is managed as a FIFO queue to
avoid the addressing computations that are done in software management.
Depending on the application, the host will generate messages of variable
lengths and forward them to the transmit buffer. Upon request from the host,
the BMU-T must supply the appropriate addresses so that each message will
be stored in a sequential manner.
Each message will be packetized while it is being stored in the buffer. The
addresses defining the packet boundaries (the end address of each packet) will
be stored and maintained in an internal FIFO queue and inside the BMU-T.
Packets will be removed from the buffer by the Media Access Unit sequen-
tially. Again, the BMU-T will be responsible of supplying the addresses.
To implement the above algorithm, two address counters are required,
one for the incoming data (from the host) and the other for the outgoing
data (to the network). Furthermore, a number of supervisory signals are
required. Figure 6 shows a block diagram of the FIFO management section.
The supervisory signals are Full, Empty and EOP (end of packet). Full
indicates that the address is full and prevents the host from writing new data
to the buffer. Empty indicates that the queue is empty I and prevents the
Media Access Unit from reading from the buffer. EOP indicates that the
current outgoing data word is the end of a packet.
To generate the supervisory signals Full and Empty, we need a number of
status pointers for the address queue. They are TOQ, BOQ+I and LACK.
TOQ is the top-of-the-queue pointer, BOQ+I points to the next position
after the bottom-of-the-queue and LACK points to the last acknowledged
packet. When BOQ+I coincides with LACK, Full becomes true and when
TOQ coincides with BOQ+I, Empty becomes true.
EOP is generated by comparing the outgoing address counter with the
end address of the current packet.
lWe shall see later that Empty may have another meaning.
2O
Addr
Cntr
Full
Address
Queue
TOQ
BOQ*I
LACK
End
Packet
Addr
Out
Addr
Cntr
End of Packet
Empty
Figure 6: FIFO Management Section of the BMU-T
b) Packetization:
Messages will be broken up into as many maximum size packets as possible
until the end of message is asserted by the host. Thus a message will contain
a number of maximum length packets with possibly a short packet at the
end.
c) Go-Back-N ARQ protocol:
To achieve reliable communication, the BMU-T will be required to execute
the Go-Back-N ARQ protocol. In this project, we use a modified Go-Back-
N ARQ protocol whereby each outgoing packet will be attached a sequence
number N(S) and these packets will be acknowledged by a sequence number
N(R). A new N(R) acknowledges all packets from the last N(R) to this new
N(R) minus 1. Thus, the new N(R) is the expected sequence number at
the other end of the communication path. Only a certain maximum number
of packets can be outstanding without acknowledgement at any given time.
When this is maximum is reached, a local timer is triggered. If the timer
expires before an acknowledgement arrives, all the outgoing packets must be
retransmitted. If an acknowledgement arrives, the timer will be reset and
the normal operation resumes.
21
In view of the aboveprotocol, the BMU-T must maintain outstanding
packetsby the pointer LACK. Moreover,if the numberof outstanding packets
is equal to the maximum possible,the BMU-T must stop the Media Access
Unit by the signal Empty. This is why Empty may take on a different
meaning from an empty queue. When the timer expires, a number of actions
must be taken:
1. Reload TOQ by LACK+I.
2. Reload outgoing address counter.
3. Reload next end packet address.
4. Reload N(S).
5. Reset timer.
6. Disable any incoming N(R).
The details of implementing the timeout mechanism is rather complicated
because so many actions have to be taken and not all of them can be done
in the same cycle. We present the details in subsection 2.2 below.
3.1.4 An Integrated Block Diagram of the BMU-T
The result of the above considerations led us to the logic block diagram of
the BMU-T as shown in Figure 7.
3.1.5 Testing Strategy
We shall use the LSSD testing technique. The scan path will include:
1. All counters
2. All pointers
3. Stored N(S) and N(R) values
4. End of packet address
22
In Trl-
Addr State Addr Queue
Cntr Input
Buffer
Pktz
Block
Pointers
and
Control
End
Pkt
Addr
Com-
parator
Out
Addr
Cntr
I
Go-Back-N Block
Timer Section
Figure 7: The logic block diagram of the BMU-T
By shifting in the appropriate test vectors, we can test:
1. All counters
2. All pointers
3. Logic generating external signals Full, Empty and EOP
4. Address queue cells (By writing through in-addv-cntv and reading from
nezt-end-pkt-addr)
5. Packetization Logic
6. Go-Back-N Logic
3.2 Micro-architecture
3.2.1 Individual Block Descriptions
Figure 7 is an appropriate partition of the BMU-T. We describe these blocks
in detail in this section.
23
/
4In
Addr
Cntr
enable
en&_1
e
(b)(a) (c)
Cln
a
Figure 8: In Address Counter
In Address Counter: The in-addr-cntr is a 4-bit binary counter with an
enable signal shown in Figure 8a. Figure 8b shows the wiring among the four
counter cells and Figure 8c shows the schematic of each counter cell. When
enable is high, the counter will advance but when enable is low, the counter
outputs will remain the same.
The enable signal is generated by the packetization block by host-req .T_'[.
Address Queue with Tri-State Input Buffers: Figure 9a shows the
Address Queue and the input buffer along with their sizes. The Address
Queue has an input bus and two output busses. The input bus is a static
bus while the outpus buses are precharged as shown in Figure 9b.
The write and read select signals come from the TOQ, BOQ+I, LACK
pointers. The enable signal comes from the packetization block by maz-pkt
OR EOM (maximum length packet written or end of message).
Output Address Section: Figure 10a shows the sub-blocks and their sizes
in the output address section. Figure 10b,10c and 10d show the schematics
of the next-end-pkt-addr, comparator and the out-addr-cntr respectively.
24
Tri-
4 State 4
Buffer 4
(a)
5
Addr Queue
Pointers
v 4
4
Vdd
-' ,.Out Bus 2
G_S2&el P RSI &el
I" IRegisterCell
en&,, .I I! .I "J
(b)
Figure 9: Address queue and the tri-state input buffer
25
J End
Pkt
I Ad_
Out bus 2
_oad
¢,
Comparator
EOP
(a)
Data
In
LD&el
(b)
>
Out Addr
Cntr
Load Enable
aO bO al bl
(c)
a2 b2 a3 b3
EOP
Data In LD&et
.en&LD&el
en&el
(d)
Figure 10: Output Address section
26
The load signal to end-pkt-addr is derived from EOP, Empty and timeout
signal from the Go-Back-N block. The load signal to the out-addr-cntr is
derived from the timeout signals. The enable signal is derived from man-
req.Empty and also the timeout signals.
Packetization Block: Figure 11a shows the internal structure of the pack-
etization block and Figure 1 lb shows the schematics of the pkt-siz-cntr cell.
This cell is similar to the in-acldr-cntr except there is a reset signal here. The
pkt-siz-cntr is a 3 bit counter and the count of the most significant bit is the
cntr-full signal. The host-req and the EOM are external signals from the
host. The Full signal is generated from the pointer section.
The two output signals are used to enable the in-addr-cntr, the tri-state
buffer and to advance the BOQ+I pointer.
Pointers and Control: All pointers are 8-bit long (no encoding). Fig-
ure 12a show the internal structure of the pointer section. Figures 12b, 12c
and 12d are the schematics for the cells in TOQ, BOQ+I, LACK pointers
respectively. TOQ is shiftable and loadable by LACK+l, BOQ+I is only
shiftable (advanced by a signal from pktz-block), LACK is only loadable (by
new N(R) less 1).
Comparators are used to generate the Full and Empty signals.
Go-Baek-N Block: Figure 13a shows the structure of the Go-Back-N sec-
tion. The N(S), N(R) are 4-bit shift registers, N(S) being loadable and the
undecoded N(R) is glued beside the decoded N(R) (it is not part of the shift
register).
Timer Section: Figure 13b shows the internal structure of the timer sec-
tion. The timer is a 4-bit counter with anenable signal. Figure 13c shows
the schematics of the counter cell.
The rest of the circuit is really a small FSM to generate the timeout and
timeout(2 '_d cycle) from the tirneout(1 _t cycle) signal.
As mentioned before, there are a number of actions needed to be taken
when timeout occurs and these cannot be done in one single cycle. Thus
27
Pkt
Size
Cntr
(3bit)
Full
host-_
eoao,_....
cntr-f_
EOM
reset _F-
enable
in-addr-cntr
enable
te buffer
&
BOQ+ 1
(a)
en & oi
i
en &ol
t Comb. ILogic I
I
(b)
Figure 11: Packetization Block
28
Load
Shirt
TOQ
BOQ * I
LACK
F LACK
(a)
q-empty
Data LD & el
,e2 To_e_t
Prey cell [___
(b)
5H&a!
(c)
LD&al
Data 1
e2
(d)
Figure 12: Pointers and Control Block
the small FSM is used to extend the timeout signal by one cycle. A timing
diagram will illustrate the idea:
timeout(1 "t cycle)
timeout(2 '_d cycle)
timeout
0000100...0100
0000010...0010
0000!!0...0110
Notice that timeout(1 "t cycle) is really the Co,t of the most significant
bit. The counter is either enabled or disabled but in both cases the Co,, of
the MSB will become zero in the next cycle. This is how we arrive at the
above timing diagram.
3.2.2 Clocking
We adopt a strict clocking discipline in this project. All inputs and outputs
of each block are stable _z except for the outputs of the address queue which
are valid _1, Furthermore, we have configured all our blocks in such a way
that all combinational logic are performed during _2. This facilitates easier
29
I
load
tlmeout( Ist c_
EOP
Tlm eou._I_
m
N(R) valto
Encoder
4
N(S)
N(R)
4
Decoder
2
new N(R)
q-empt
empty
enable
timer
N(R) > new N(R)
enable
Ca)
4-bit
cntr
(timer)
tlmeout tlmeout timeout
(Ist cycle) (2nd cycle)
(b)
(C)
Figure 13: Go-Back-N block and the Timer section
30 ¸
testing and provides a natural scan path for LSSD. Figure 14 shows the
clocking types of the inputs and outputs of various blocks.
v!
In
Addr
Cntr 4S_1 :rl-
_tate
)uff_
ql
Addr Queue
TOO
ql
BOO + I
LACK
sI
,pty (st)
P (sl)
Go-Back-N (S)(sl)
J(R)(sl)
T_mer Section L_ -req
(sl)
I rimti eout(si)
Figure 14: Clocking types of the various lines
3.2.3 Logic Descriptions of Datapath
The datapath of the system is given by the top part of Figure 14. The source
of data is the in-addr-cntr. When there is a host-req and the queue is not
full, the in-addr-cntr will start counting. When a packet has been received
(i.e., current packet size is maximum or EOM is asserted by the host), the
tri-state input buffer will be enabled and the end address of the packet will
be stored into the FIFO queue according to the pointer BOQ+I, the first
available position. Clearly, BOQ+I should be updated.
On the output side, the out-addr-cntr will be enabled when there is a
man-req and the queue is not empty (or no maximum number of outstanding
packets). The value of this counter is constantly checked against the end
address of the packet to generate the signal EOP. When EOP is true,
the TOQ pointer is shifted and then the end-pkt-addr will be loaded. This
operation requires two machine cycles so we have to assume that no packets
31
areshorter than 2 data words in order to maintain correct operation. There
are also subtleties when the queuebecomesempty. Figure 15b showshow
the control signalsto the output blocks aregenerated.
In En_
kddr _'tatel I I I I I Pk,
Cntr I Ib°"! i I I l_Ad
I O!"TOtLL
A packet has
been received
r 1 COJ
(a)
enable eop
_i..__ I )ut
:; ,.r
or :ntr
load enable
o-empt' !> ' I I
End
Pkt
Addr
timeout (I st cycle) load
timeout (2nd cycle) Out
mau-req Addr
Cntr
empty
(b)
Figure 15: The datapath
Several remarks will be useful. The in-addr-cntr and out-addr-cntr always
hold the next expected addresses. Thus, the host and media access unit
can use these addresses without delay. The Full and Empty signals will
show whether they are valid or not. Secondly, reading and writing in the
same address queue location is not possible. Thirdly, the timeout actions
can only be taken in 2 cycles. Thus, the three distinct signals timeout,
timeout(1 't cycle), timeout(2 "d cycle) are all used to generate the necessary
32
control signals. A final point is that we need to load in the appropriate
end-pkt-addv when the queue goes from empty to non-empty. The circuit
diagram in Figure 15b shows how this is done.
3.2.4 Logic Description of Control Blocks
The control blocks shown in Figure 7 (lower portion) can be collectively
considered as one large FSM. However, it can be broken down into a number
of relatively simple functional units. We shall look at the functions, inputs
and outputs of each unit.
Pkt
Siz
Cntr
host-req EOM Full
cntr-ful_.
enabl_ counter
Combinational
Logic
enable in-addr-cntr
store end-pkt-addr
advance BOQ*I
Figure 16: Packetization block
Packetization Block: This block is used to keep track of the size of the
current incoming packet. When a maximum size packet is received or end
of message arrives, the unit has to let the end packet address be stored into
the address queue and advance the BOQ-+-i pointer. Figure 16 shows the
functions of this block.
The details of the combinational logic block is shown in Figure 11a. The
host-req and EOM are external signals from the host. The Full signal is
from the pointer control block.
Strictly speaking, there are 2'* possible internal states in this FSM where
n is the size of the counter. However, as far as the inputs and outputs are
concerned, there are only 2 states: counter full or not full. Thus, with the
help of the counter, this machine becomes a trivial FSM.
33
Pointer and Control Block: This block is used to maintain the queue,
i.e., to keep track of the top of the queue, bottom of the queue and the
last acknowledged packet and to generate the Full and q-empty supervisory
signals. Figure 17 shows the inputs and outputs of this block.
from Load LACK* i
Go-Back-N
block i_
Advance
from TOQ
packetization
Advanceblock
80Q ÷ I
From Go-Back-N
block Load
r_ew
ack
iI
To Go-Back-N
TOO block
q-empty
BOQ + I
S .To Packetization
LACK Ful}Block
i,n%w LACK
from Go-Back-N
block
Figure 17: Inputs and outputs of the Pointer and Control Block
This is again a trivial FSM with only 3 states: queue full, queue empty
or neither. When TOQ coincides with BOQ+I, the queue is empty. When
BOQ+I coincides with LACK, the queue is full.
Go-Back-N block: This is the most complicated among the 3 control
blocks. It consists of 3 sub-blocks. Figure 18a shows these blocks. The
first block is the output control block. Although it is more complicated, it
is analogous to the packetization block. It generates the Enable and Load
signals to the out-addr-cntv and end-pkt-addr. The second is the sequence
number block. This block keeps track of the sequence numbers N(S) and
N(R) and generates the appropriate control signals. This third block is the
timer section which implements the timeout mechanism.
Figure 18b shows the inputs and outputs to the control section. It is just
a combinational logic block where the logic is shown in Figure 13a.
Figure 18c shows the structure of the sequence number block. It is also
a trivial FSM with 2 states: N(S) equals N(R) - 1 or not equal.
34
Output
Control
Sequence
Number
Section
Timer
Section
(a)
tlmeout (Ist cycle)
tlrneout (2nd cycle)
Empty
Combinational
Logic
mau-req
(b)
load end-pkt-addr
load out-aadr-cntr
enable out-addr-cntr
q - e m Pt Y-,,_'I-._ Empty
t,meoot--_ _>.
tlm N(S) N[R)vall_ Enable
advanceN_S1Nm) _"_ II_Ii_Imer
(C)
Sequential
Umt I
t_me tlme tlme
enable
out out out
(Ist (2nd
cy) cy)
(d)
Figure 18: Go-Back-N Block
35
Figure 18d shows the input and outputs to the timer section. The se-
quential circuit is shown in Figure 13b. This timer has 3 states: no timeout,
timeout(1 st cycle) and timeout(2 '_d cycle).
4 Conclusions
We have presented our design of a high speed network interface unit (HSNIU)
in this report. We have chosen to implement the packet processing functions
in hardware, using VLSI technology, in order to satisfy our goal of achieving
high rate on-the-fly processing for the multimedia communications applica-
tions while keeping the interface design relatively compact and at low cost.
The multimedia workstations that incorporate video and audio in addi-
tion to data and graphics will probably be very common in the near future.
We note that the architecture and the operating system of these workstation
also play an important role in achieving high bandwidth communication. As
an example, in such a workstation, the ability to transfer an image from
the video camera to the network adapter directly, without first transferring
the data into the main memory of the workstation and then moving it to
the network adapter would probably increase the performance significantly.
Similarly, on the receiving side, it makes sense from the performance point
of view to be able to move the received image directly to the frame buffer,
without having to first copy it to the main memory. Thus, in the design of
the multimedia workstations, such issues must be taken into consideration.
With this goal in mind, we have been examining several existing computer
architectures and have identified three basic types of architectures based on
their bus structure. We are going to undertake timing analysis of multime-
dia workstation architectures, taking into consideration the effects introduced
by the operating system. For example, it might make a significant difference
whether the operating system allows a user process to directly access the
network interface or not. We will then consider the integration of issues at
both the system's architecture and the network adapter unit's architecture
in order to determine the best configuration meeting the new objectives.
36
References
[1] H. Kanakia and F. Tobagi, "Performance Measurements of a Data Link
Protocol," in Proc. 1CC'86, (Toronto), 1986.
[2] G. Chesson, "The Protocol Engine Project," UNIX Review, pp. 70-77,
Sept. 1987.
[3] H. Kanakia and D. R. Cheriton, "The VMP Network Adapter Board
(NAB): High-Performance Network Communication for Multiproces-
sors," in Proc. ACM SIGCOMM'88, pp. 175-187, 1988.
[4] E. A. Arnould et al., "The Design of Nectar: A network Backplane for
Heterogeneous Multicomputers," in Proc. ACM SIGCOMM'89, pp. 205-
216, 1989.
[5] D. Giarizzo, M. Kaiserwerth, T. Wicki, and R. C. Williamson, "High
Speed Parallel Protocol Implementation," in Protocols for High-Speed
Networks (H. Rudin and R. Williamson, eds.), pp. 165-180, IFIP, El-
sevier Science Publishers B.V. (North-Holland), 1989.
37
