Hardware buffering of data streams
A general single buffer approach
Ted Hörnstedt and Jonas Åberg
E-mail: ted.hornstedt@chello.se, cja@gmx.net
Master Thesis at
Department of Microelectronics and Information Technology,
Royal Institute of Technology, Stockholm, Sweden.
Axis Communications AB, Lund, Sweden.
March 11, 2004
Abstract
Today Axis Communications sells a wide range of network connected products that
are based upon their own design of an embedded processor named ETRAX. It is a
System on Chip (SoC) that supports an extensive range of protocols. Their current
DMA implementation is not optimal for more advanced protocols such as USB and
FireWire®.
The goal of this master thesis has been to evaluate and, if possible, suggest a design
of a buffer system architecture that will replace the old DMA system. The protocols
which we focused on were USB 2.0, FireWire 800®, RS232, Gigabit Ethernet and
Serial ATA.
The buffer system architecture we evaluated is on a higher abstraction layer than
the current ETRAX DMA implementation. It is focused on the different data streams
and their behavior instead of the actual physical connection. To be able to handle
these data streams, our system architecture uses an advanced buffer controller
together with one buffer memory. A simulator model was implemented from our
system architecture design. It was later incorporated into SID, a hardware component
simulator.
The conclusions drawn from the development and the simulations are that our
design is very suitable for SoCs that are to support a large range of high speed and
complex protocols. A rough size estimation shows that our design will only result in
a minor increase of the number of gates required compared to the current ETRAX
DMA system. SID was of no help during the evaluation of our internal design, but
it proved to be a powerful tool that can be very useful for embedded software
developers.
Acknowledgements
We have found this master thesis both rewarding and interesting. While writing
this master thesis for Axis Communications, we were located at the company's ASIC
department. This strengthened our knowledge about the ASIC development process.
We now appreciate that the design of an ASIC is a complicated and
time-consuming process.
One of the biggest challenges was that we had to invent all of the solutions to
this design ourselves, without looking at previous designs. This made us try several
different approaches to each problem and then, through discussions, select the
one that best suited our needs.
We would like to thank all of the people that have helped us with this master
thesis. We would like to explicitly thank:
Niklas Persson, our Axis Communications supervisor, who answered many project
related questions for us and corrected many errors in this report.
Mats Brorsson, our KTH supervisor and examiner, who helped us with this report.
Stefan Sandström, an Axis Communications employee, who with his vast knowledge
in hardware design helped us a lot and also gave us suggestions on how we could improve
this report.
Johan Jörgensen, an Axis Communications employee, who contributed lots of
ideas and discussions. He also helped us correct our English.
Contents
1 Introduction
1.1 Background
1.2 Scope of Thesis
1.3 Outline
2 Background study
2.1 IEEE 802.3, Section 3, Gigabit Ethernet
2.2 IEEE 1394b
2.3 RS232
2.4 Serial ATA: High Speed Serialized AT Attachment - 1.0a
2.5 Universal Serial Bus - Revision 2.0
2.6 Summary of protocols
3 General design ideas
3.1 Handling data streams - The virtual channel (VC) approach
3.2 How to organize the buffered data
3.3 Virtual channel data organization
3.4 Meta data - Virtual channel descriptor
3.5 End of Packet
3.6 Software overview
3.7 Transfer setup structure
3.8 I/O unit overview
3.9 Priorities
3.10 Event and error reports
3.11 DSB operation example - Attaching a digital camera
3.12 Summary
4 Speed and size constraints for the design
4.1 System specification
4.2 The main memory (DDR SDRAM)
4.2.1 Worst case calculations
4.2.2 Best case calculations
4.2.3 Average case calculations
4.3 Virtual channel size
4.4 How large shall the buffer be?
4.5 Virtual channel descriptor size
4.6 Block size calculations
4.6.1 The connection to the I/O units
4.7 The internal bus and the main memory bus
4.8 Summary
5 Between the ideas and the implementation
6 System implementation suggestion
6.1 Execution pattern
6.2 Queues
6.3 DSB Instructions
6.4 Inside the Virtual Channel Descriptor (VCD)
6.5 Software
6.5.1 Device driver interface
6.5.2 Internal functions
6.5.3 Data tables and vectors
6.6 Hardware units
6.6.1 Interface between the hardware and software
6.6.2 Functions
6.6.3 Memories and buffers
7 Simulation and testing
7.1 Simulator tool - SID
7.2 Implementation with SID
7.3 Simulation conclusions
7.3.1 Instruction buffer
7.3.2 Configuration
7.3.3 Buffer size requirements
7.3.4 LP and HP List behaviour
7.3.5 Queue master behaviour
7.3.6 Virtual channels and buffer memory allocation
7.3.7 Experience with SID
7.3.8 I/O transfer problem
8 Conclusions
9 Future investigations
9.1 Configuration values
9.2 Packet sizes and buffer size
9.3 Queues
9.4 Bridge I/O unit transfers
9.5 Virtually addressed memory
A Terms and abbreviations
List of Figures
2.1 Serial bus address space.
2.2 USB connection tree.
3.1 Basic idea - Single buffering unit serving several I/O-units.
3.2 Possible block allocation layout of a virtual channel in the buffer.
3.3 Organization of a channel's buffer.
3.4 Schematic hardware DSB overview with the DSB core block magnified.
5.1 Schematic hardware DSB overview with the DSB core block magnified.
5.2 A full transfer cycle.
6.1 Overview of the DSB's functional units and adjacent devices.
6.2 Transfer between the I/O units and the buffer master.
6.3 Transfer between the main memory and buffer.
6.4 A Cmd command transfer between the command buffer and the I/O units.
6.5 The software of the DSB.
6.6 The hardware of the DSB.
List of Tables
2.1 USB 2.0, transfer mode characteristics.
4.1 Estimated VC size.
4.2 Assumption of a realistic upper limit of maximum simultaneously open virtual channels.
4.3 Assumption of a realistic buffer usage.
4.4 Bits required to point out one block.
4.5 Number of blocks.
4.6 I/O bus speed vs. protocol speed.
6.1 The names of the cycles and what is executed during them.
6.2 The resources and explanation.
6.3 Resources needed for the different commands.
6.4 Resources needed for the different commands.
6.5 Possible combination of two commands.
6.6 Valid combinations of commands issued during the same cycle.
6.7 Execution pattern.
6.8 DSB transfer instruction.
6.9 What a VCD needs to include.
6.10 Static VCD - 64 kbytes buffer memory, 128-byte blocks (total 512 blocks), VCD 32 bytes, 6 ptrs per VCD, 10 pkg end ptrs.
6.11 Dynamic VCD - 64 kbytes buffer memory, 128-byte blocks (total 512 blocks), VCD 32 bytes, dynamic number of blk ptrs and pkg ends per VCD.
6.12 Command buffer: data contains.
6.13 VCD alter queue: data contains.
Chapter 1
Introduction
Today Axis Communications sells a wide range of network connected products that
are based upon their own embedded processor named ETRAX. It is a System on Chip
(SoC) that supports an extensive range of protocols. Their current implementation
of the DMA controller and DMA buffers on the ETRAX processor is not suitable for
more advanced protocols such as USB and FireWire® due to timing problems when
it comes to multiple data channels.
A solution to this upcoming problem is to replace the current DMA system with
a whole new architecture which is able to handle the current protocols and still offer
the flexibility required by the high-performance protocols of tomorrow.
One way to do this is to create an advanced buffer controller with a
single larger buffer memory, instead of one buffer for each DMA channel, that can handle
several protocols simultaneously. Designing one controller that can handle
many bandwidth-demanding data streams at the same time is a complex problem. If
not carefully designed from the beginning it can become very complicated.
In our master thesis project we demonstrate a single buffer approach that fulfills
all requirements mentioned above.
Our approach is a rather simple software and hardware implementation. We have
managed to get a good balance between the amount of software and hardware.
The software takes care of all operations that are not time critical and things that are
complicated to implement efficiently in hardware.
1.1 Background
Axis Communications originally started with making IBM print servers in 1984 [1].
Over the years Axis Communications has become a market leader in network video
solutions, as well as printer servers and document management. Most of the products
are based on an in-house developed SoC called ETRAX (originally the name stood for
Ethernet, Token Ring, AXIS; the Token Ring support has since been discontinued).
The latest available model is called ETRAX 100LX. It is an optimized SoC solution
for putting peripherals on the network [2]. It has been designed to run Linux®.
It has support for many different transfer interfaces and future models will have
support for even more. Since this is an SoC with all of the interfaces built into the
chip, the DMA that resides in the chip will handle all of the input/output data streams.
Today the buffering of the inputs and outputs of the transfer interfaces is done with
FIFOs. One FIFO is used for each input and output. Simple protocols work
with just one data stream at a time, while in the more complicated protocols, like
USB, there can be several data streams that are to be transferred over one
physical connection through one FIFO. The problem with this is that each time you
change data stream source or target, the FIFO must be emptied, which takes time
and reduces the overall performance.
In the future, when faster and more complicated transfer protocols are added to
the SoC, the FIFO method might prove to be too slow.
Axis Communications therefore wanted a study to be done on a new approach to
buffering the incoming and outgoing data streams. Their main idea was to use only a
single buffer memory to store all the input/output data. A joint buffering mechanism
for all the different transfer protocols is needed. This should make the buffering of
various data streams more scalable and configurable. It should also make it easier to
add a new protocol when the need for it arises.
1.2 Scope of Thesis
We have investigated the possibility of implementing a unit that buffers incoming and
outgoing data using only a single big buffer. This unit serves multiple
different interfaces on an SoC similar to the ETRAX. This data buffering unit handles
protocols ranging from very slow to fast, both non-packet
based and packet based. The protocols of interest are: IEEE
802.3 Gigabit Ethernet, IEEE 1394b (FireWire®), RS232 (Serial port), Serial ATA
and USB 2.0.
We were given no strict guidelines for the system design, except that Axis wanted
a single buffer design to be investigated. Axis wanted us not to be affected by their
earlier design. Before our own system suggestion was completed, we were not allowed
to look at Axis's current buffer design. Instead they gave us the very basic concept of
their design together with its problems and limitations, to
ensure that we would deal with them in our design.
Then we gave a suggestion of how to implement such a system architecture. Before
we decided on a specific implementation we tried to find some different design
alternatives, but we did not do any serious calculations or simulations to guide us.
The design decisions were mostly based on discussions within our group or with Axis
developers.
In the end we made a simulation model of the suggested system implementation,
in order to see if the design really works as planned and to investigate if the
performance meets our expectations. Because of a shortage of time we did not do any
simulation comparisons between our model and other models, determine the
optimal configuration values for each protocol, implement anything in a hardware
description language or write any software drivers.
1.3 Outline
Chapter 2 is a survey of the included protocols and summarizes aspects that will
be important to our design.
Chapter 3 gives a general overview of the major building blocks and basic
concepts that will be used in the design of the Data Stream Buffer (DSB).
Chapter 4 includes some timing constraint calculations for the protocols and the
surrounding hardware. It also includes attempts to determine some of the
configuration parameters of the building blocks presented in chapter 3.
Chapter 5 is an overview of how data is transferred inside the DSB hardware.
Chapter 6 shows the details of our DSB design.
Chapter 7 describes the simulation model of the DSB architecture presented in
chapter 6 and the results and conclusions from the simulations.
Chapter 8 contains the project conclusions.
Chapter 9 covers what could be done in the future concerning this matter.
Appendix A includes a glossary with all the terms and abbreviations we use in our
report.
Chapter 2
Background study
This chapter covers a background study of the transfer protocols of interest. We
summarise the most important aspects of each protocol when it comes to
finding a suitable system architecture. Interesting features are transfer speed, the number
of devices that can be attached, the different data transfer types and whether the protocol
has some packet concept. These protocols have been chosen because Axis Communications
found them representative of protocols likely to be added to their SoC in
the future. The primary reason for selecting the RS232 protocol is to highlight the
differences in comparison to the other protocols, e.g. its low speed and lack of a concept
of packets.
2.1 IEEE 802.3, Section 3, Gigabit Ethernet
This is a standard for LAN (Local Area Networks) that uses the CSMA/CD (Carrier
Sense Multiple Access with Collision Detection) access method. The IEEE standard
802.3 [9] with its latest revision supports transfer rates from 1 Mbit/s to 1000 Mbit/s.
The original 802.3 standard was published in 1985. It supported transfer rates up
to 10 Mbit/s and only worked in half duplex mode.
In 1995 IEEE added support for 100 Mbit/s transfer rate, also known as “Fast
Ethernet”. It now also supports a full duplex mode, with 100 Mbit/s in each direction.
In 1998 1000 Mbit/s or “Gigabit Ethernet” was adopted by the IEEE.
Gigabit Ethernet supports both full duplex and half duplex. In full duplex mode
one connection can transfer 1 Gbit/s in each direction. Gigabit Ethernet
supports only asynchronous packet transfers and only one type of data. The buffering
is therefore not complicated, but it needs to be fast. An Ethernet packet is called a
frame and has a 22-byte header. The data payload is 46-1500 bytes long.
If we discount the header in a frame carrying the maximum 1500 bytes of data,
the effective transfer rate is around 985.5 Mbit/s.
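The effective rate quoted above follows from the ratio of payload bytes to total frame bytes. A minimal sketch of that arithmetic in Python, assuming (as the text does) that the 22-byte header is the only per-frame overhead; the function and constant names are illustrative and not taken from the thesis:

```python
HEADER_BYTES = 22          # frame header size as stated in the text
MAX_PAYLOAD_BYTES = 1500   # largest data payload
LINE_RATE_MBIT = 1000.0    # raw Gigabit Ethernet rate in one direction


def effective_rate_mbit(payload_bytes, header_bytes=HEADER_BYTES,
                        line_rate=LINE_RATE_MBIT):
    """Part of the line rate that carries payload, in Mbit/s."""
    return line_rate * payload_bytes / (payload_bytes + header_bytes)


# Maximum-size frames: 1500 / 1522 of the line rate
print(round(effective_rate_mbit(MAX_PAYLOAD_BYTES), 1))  # 985.5
# Minimum-size frames pay the same header for far less data
print(round(effective_rate_mbit(46), 1))                 # 676.5
```

The second figure suggests why a buffer design cannot assume maximum-size frames: with small frames the same link delivers noticeably less payload per second, but just as many frames or more.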
2.2 IEEE 1394b
IEEE 1394b-2002 [12] is an amendment to IEEE 1394-1995 [10] and IEEE
1394a-2000 [11], which together define a high performance serial bus. The standard
supports the serial bus in two so-called environments: an external cable
connection and a backplane bus. Apple Computer Inc. has
implemented the external cable subset and calls it FireWire®. The cable environment
is by far the most used and it is what we focus on here. It is IEEE 1394's ability
to move large amounts of data between computers and peripheral devices fast that
has made it popular.
The transfer rates declared in 1394-1995 are 100 Mbit/s, 200 Mbit/s and 400 Mbit/s.
In the 1394a amendment, which was approved by IEEE in 2000, the arbitration
method was improved to get better throughput. The latest amendment, 1394b,
approved in 2002, specifies methods to obtain transfer rates of 800 Mbit/s, 1600
Mbit/s and 3200 Mbit/s. Apple's current implementation of FireWire® supports
transfer rates up to 800 Mbit/s.
The IEEE 1394 standard supports both isochronous and asynchronous transmission
and it can address up to 1024 buses with 64 nodes each (see fig. 2.1). It is thus possible
to address 64k nodes in an IEEE 1394 network.
Figure 2.1: Serial bus address space.
Two types of packets are used in data transmission: Acknowledge
and Primary packets.
An Acknowledge packet is an asynchronous packet that is sent as an immediate
response to a non-broadcast asynchronous packet.
Primary packets are divided into two subgroups: isochronous and asynchronous
packets. They contain the data payload or a request/response command.
The maximum payload for an isochronous Primary packet, in the 800 Mbit/s
transmission mode, is 8192 bytes. If we remove the headers and other transmission
overhead, the maximum effective data transfer speed will be around 785.3 Mbit/s.
The maximum payload for an asynchronous Primary packet, in the 800 Mbit/s
transmission mode, is 4096 bytes. The maximum effective data transfer speed for the
asynchronous transfer mode is around 782.5 Mbit/s, calculated in the same way as for
the isochronous transfer mode. But since the asynchronous mode sends a lot of small
packets, like request and response packets, the real effective data transfer speed will
be somewhat lower.
2.3 RS232
The RS232 standard was renamed to EIA232 in 1991 [4], but the old name is still by
far the most common. The first versions of the RS232 standard were developed in the
1960s. RS232 is a serial protocol mainly focused on data transferred over modem
connections.
The RS232 standard claims that it is applicable for use at bit rates ranging from 75
bit/s up to a nominal limit of 20000 bit/s [7], but there are several common products
that operate beyond that limit. Axis Communications' older products support up
to 6.25 Mbit/s.
Data can be transferred both asynchronously and synchronously, in full duplex or half
duplex. Physically there is one wire for sending data, one wire for receiving data and
a couple of status/control wires.
The standard includes both a 25-pin connector with dual data channels and a 9-pin
connector with one data channel. Today, the 25-pin connector is rarely used.
The RS232 standard is not a traditional packet based protocol, but it uses
handshaking on special wires before and after data transmission. To simplify things, this
can be considered as the start and end of a packet, even though it is not really so.
2.4 Serial ATA:
High Speed Serialized AT Attachment - 1.0a
The Serial AT Attachment (ATA) protocol [8] is a replacement for the older ATA
protocol. The main changes are that Serial ATA transfers data serially instead of
in parallel and that only one device can be connected to each host. In the old ATA
standard you could have two devices per host. Both standards are targeted for in-
the-box storage devices only.
The main reasons for making this new protocol standard are to increase the transfer
speed and to replace the bulky flat cables used for ATA. The old parallel
implementation is about to reach its transfer speed limit.
Serial ATA is an asynchronous packet oriented protocol. From idle mode both the
host and the device can initiate a transfer. Since there is only one device per host
the communication protocol is not that complicated.
The transfer speed of the first generation of Serial ATA is 1.5 Gbit/s, but since
each byte is sent with two extra control/checksum bits the effective data transfer rate is 1.2
Gbit/s or 150 MByte/s.
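The two figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes, as stated in the text, that every 8 data bits occupy 10 bits on the wire, and the function name is illustrative.

```python
def sata_effective_bit_rate(line_rate_bit_s, data_bits=8, line_bits=10):
    """Payload bit rate when each data byte occupies `line_bits` on the wire."""
    return line_rate_bit_s * data_bits // line_bits


rate = sata_effective_bit_rate(1_500_000_000)   # first generation, 1.5 Gbit/s
print(rate / 1e9)        # 1.2   (Gbit/s)
print(rate // 8 / 1e6)   # 150.0 (MByte/s)
```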
Data can be transferred both unidirectionally and bidirectionally, but unidirectional
transfers are most common.
The amount of data that can be sent with one DMA transfer command is limited
to 4 Gbyte.
2.5 Universal Serial Bus - Revision 2.0
The Universal Serial Bus (USB) [6] protocol is designed to provide a general protocol
for attaching a wide range of external devices to a computer. The main goal of earlier
versions of this standard was to make sure it was flexible, yet simple enough to be
integrated with cheap hardware in order to reach a wide market. The latest USB
standard, version 2.0, adds the possibility to attach high-speed devices. The USB 1.1
standard supports up to 12 Mbit/s and the fairly new USB 2.0 supports transfer rates
up to, and including, 480 Mbit/s.
As seen in figure 2.2, the USB hierarchy has the host unit at the top of the
connection tree. The host unit controls the flow to and from attached devices. To
the host you can connect both hubs and devices. A hub is a connection point for several
devices and possibly other hubs. There can only be one host in a system. A host can
connect up to 127 devices and the maximum hub depth is five hubs, which means
that you can have at most six cables between the host and the most distant device.
This restriction exists because of delay constraints.
Figure 2.2: USB connection tree.
Currently USB supports three different speed levels: low-speed at 1.5 Mbit/s,
full-speed at 12 Mbit/s and high-speed at 480 Mbit/s. When excluding packet headers
and similar overhead, the maximum effective data transfer rate will be a little over 400
Mbit/s.
The USB protocol is packet oriented. One of the important protocol properties is
what the standard calls a frame. A frame is a short time interval in which special events are
due to happen at fixed places. Certain types of transfers can only occupy
certain parts of the frame. Because of the great transfer speed differences there are
two types of frames: frames and microframes. A frame is 1 ms long and is used in
low-speed and full-speed transfers, and a microframe is 125 µs long and is used in
high-speed transfers.
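To get a feeling for how much data one frame can hold, the speeds and frame lengths above can be combined. The sketch below computes only the raw upper bound per frame or microframe; it deliberately ignores packet headers, bit stuffing and the reserved parts of the frame, so real payloads are smaller. The names are illustrative.

```python
# Nominal signalling rate (Mbit/s) and frame period (microseconds)
USB_MODES = {
    "low-speed":  (1.5, 1000),    # 1 ms frame
    "full-speed": (12.0, 1000),   # 1 ms frame
    "high-speed": (480.0, 125),   # 125 µs microframe
}


def raw_bytes_per_frame(rate_mbit_s, period_us):
    """Upper bound on bytes on the wire during one (micro)frame.

    Mbit/s multiplied by microseconds gives bits directly.
    """
    return rate_mbit_s * period_us / 8


for name, (rate, period) in USB_MODES.items():
    print(f"{name}: {raw_bytes_per_frame(rate, period):.1f} bytes")
# low-speed: 187.5 bytes
# full-speed: 1500.0 bytes
# high-speed: 7500.0 bytes
```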
Transfers of information between the devices and the host are done via so-called
pipes. A pipe is an open transfer channel which can be bidirectional or
unidirectional. There are four types of pipes: Control, Bulk, Isochronous, and Interrupt.
The Control pipe type is used for setting up other pipes and for configuration. A
Control pipe is opened by default when you attach a device to a host network. The
Bulk pipe type is a unidirectional pipe used for sending larger amounts of data
between the host and the device. (If you need to both write and read you have to open
two Bulk pipes.) The Isochronous pipe type should be used when a continuous data
stream is needed, for example voice recording. The Interrupt pipe type is made to
suit typical human input devices, for example keyboards and mice. See table 2.1 for
the transfer mode characteristics.
Table 2.1: USB 2.0, transfer mode characteristics.
Transfer type               Transfer characteristic   Direction
Control transfers           Bursty                    Bidirectional
Bulk data transfers         Non-periodic              Unidirectional
Interrupt data transfers    Low-frequency, Periodic   Unidirectional
Isochronous data transfers  Periodic                  Unidirectional
2.6 Summary of protocols
As described in this chapter the protocols have a wide range of transfer speeds, from
RS232's 75 bit/s to Serial ATA's 1.2 Gbit/s. The protocols also allow a wide
range of packet sizes, from a few bytes (RS232 and USB) to several gigabytes (Serial
ATA).
The USB and IEEE 1394 protocols will have multiple data streams which are
multiplexed over the same physical connection, while the other protocols will only
have one data stream at a time for each physical connection.
Since the protocols do not have much in common, the conclusion is that the
buffering system must be very flexible and completely protocol independent.
Chapter 3
General design ideas
The main purpose of any DMA system is to relieve the CPU of the task of transferring
data between the I/O units and the main memory. Other purposes are to buffer
data streams, reduce the effects of transfer peaks and transfer data to and from the
main memory as efficiently as possible. To fulfill all these requirements some sort of buffer
is needed. In order to operate correctly the buffer has to prefetch data from main
memory on an outgoing data stream and make sure it has enough buffering
space on an incoming data stream. A basic overview of the design is shown in
figure 3.1.
During the design procedure we have tried to keep in mind what software is best
at and what hardware is best at. Tasks which have very strict timing requirements
should be implemented in hardware. The hardware should also provide an interface
to software.
The hardware parts shall only do what is time critical at the clock cycle level and what
is necessary in order to make the software interface convenient. The software is
more suitable than the hardware for handling lists, tables, dynamically allocated structures
and complicated algorithms. We have also tried to minimize the number of hard
coded configuration values.
The next sections of this chapter cover the major design ideas that make the
hardware buffering unit work.
3.1 Handling data streams -
The virtual channel (VC) approach
We believe that the best approach to channel handling is based upon the concept of
a data stream and not upon the physical connection as in the current ETRAX
DMA architecture. This data stream based approach is much more suitable for more
complex protocols such as USB and FireWire®, where several transfers can be going
on simultaneously. There is no way to know in advance how these data streams will
be mixed over the physical connection. We call each stream of data that passes between
the main memory and the I/O units a virtual channel (VC). There is also no way to
know how many virtual channels will be connected to one physical connection, so
having a single big buffer that serves every virtual channel is a
far more suitable approach than having one buffer for each virtual channel. Figure 3.1
illustrates the single buffer approach and the surroundings. We call the design Data
Stream Buffer (DSB).
Figure 3.1: Basic idea - Single buﬀering unit serving several I/O-units.
3.2 How to organize the buﬀered data
As mentioned previously we have a single large buffer for all transfers. This approach
demands a more sophisticated organization of the buffer compared to having a dedicated
buffer for each channel. Since both very slow protocols such as RS232 and faster
ones such as Serial ATA shall be supported, the required buffer space will differ. Naturally,
a fast protocol requires a large amount of buffer space since it handles a large amount
of data, while a slow protocol handles little data and will therefore need only a small
buffer memory space.
There are several possible ways to organize the data stored in the buffer. One
of the simplest is to divide the whole buffer into small blocks of equal size. The
strengths of this method are that there is no penalty for fragmentation and that it is
simple to handle. Figure 3.2 shows how a virtual channel can be allocated in the
buffer. It does not matter whether the blocks are allocated next to each other or not.
Figure 3.2: Possible block allocation layout of a virtual channel in the buffer.
It will be rather easy to keep track of the allocated blocks. Only a simple list
will be needed. This list is best handled by the software, because the software is
responsible for the creation of new virtual channels and the removal of old. Inside the
hardware it is enough that each virtual channel knows which blocks it has allocated.
A global view of free and allocated blocks is not needed in the DSB hardware.
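The software-side bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `BlockAllocator`, its methods and the choice to hand out arbitrary free blocks are hypothetical and not part of the DSB design.

```python
from typing import Dict, List


class BlockAllocator:
    """Hypothetical software view of free and allocated buffer blocks."""

    def __init__(self, buffer_size: int, block_size: int) -> None:
        if buffer_size % block_size != 0:
            raise ValueError("buffer size must be a multiple of the block size")
        self.block_size = block_size
        # Indices of blocks that no virtual channel owns.
        self.free_blocks: List[int] = list(range(buffer_size // block_size))
        # Virtual channel id -> indices of the blocks it owns.
        self.allocated: Dict[int, List[int]] = {}

    def create_channel(self, vc_id: int, n_blocks: int) -> List[int]:
        """Give a new virtual channel n_blocks blocks; they need not be adjacent."""
        if vc_id in self.allocated:
            raise ValueError("virtual channel already exists")
        if n_blocks > len(self.free_blocks):
            raise MemoryError("not enough free blocks")
        blocks = [self.free_blocks.pop() for _ in range(n_blocks)]
        self.allocated[vc_id] = blocks
        return blocks

    def remove_channel(self, vc_id: int) -> None:
        """Return all blocks of a removed virtual channel to the free list."""
        self.free_blocks.extend(self.allocated.pop(vc_id))
```

Only the list of blocks returned by `create_channel` would have to be handed to the hardware; the free list itself stays in software.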
3.3 Virtual channel data organization
A key feature of the data that ﬂows through the buﬀer is that it will be written once
and read only once. After that is done, the data is old and is no longer needed. A
good organization would be to handle it like a circular buﬀer. One pointer points out
the current byte and one pointer points out where the fresh data ends. The ﬁgure
below illustrates a circular memory.
Figure 3.3: Organization of a channel’s buﬀer.
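A minimal sketch of such a circular buffer is given below. It assumes one pointer to the current byte, one pointer to the end of the fresh data, and a flag that separates the full case from the empty case when the two pointers are equal. The class and method names are hypothetical.

```python
class CircularChannelBuffer:
    """Illustrative write-once, read-once circular buffer for one channel."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)
        self.size = size
        self.curr = 0       # next fresh byte to read
        self.end = 0        # where the fresh data ends (next write position)
        self.full = False   # tells full from empty when curr == end

    def fresh_bytes(self) -> int:
        if self.full:
            return self.size
        return (self.end - self.curr) % self.size

    def write(self, payload: bytes) -> None:
        if len(payload) > self.size - self.fresh_bytes():
            raise OverflowError("buffer overrun")
        for byte in payload:
            self.data[self.end] = byte
            self.end = (self.end + 1) % self.size
        if payload and self.end == self.curr:
            self.full = True

    def read(self, count: int) -> bytes:
        if count > self.fresh_bytes():
            raise BufferError("buffer underrun")
        out = bytearray()
        for _ in range(count):
            out.append(self.data[self.curr])
            self.curr = (self.curr + 1) % self.size
        if count:
            self.full = False   # data that has been read is old and may be overwritten
        return bytes(out)
```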
3.4 Meta data - Virtual channel descriptor
Information about each virtual channel has to be stored somewhere. We call this meta
data the virtual channel descriptor (VCD). During a transfer the VCD of the transferring
channel will be read and written many times, so it must stay in hardware. Since it
is accessed so frequently, the whole VCD must be delivered to the requesting unit in one
cycle. An alternative is to send each unit only the parts of the VCD
that are vital for that specific unit. The requirement that a VCD transfer take only
one clock cycle limits the size of the VCD considerably. To maximize performance, the
VCD is kept in a separate memory, and to reduce the amount of data transferred
between different parts of the DSB system, only a pointer to the VCD is passed around.
3.5 End of Packet
Nearly all modern protocols are packet based. The I/O units will understand when
a packet begins and when it ends. The DSB system must keep track of the end of
packets to keep the software updated with what happens. We will from now on refer
to “end of packets” as “packet ends”.
3.6 Software overview
A good idea is to let the software do everything that is not clock cycle time critical.
Examples of such events are the creation and removal of virtual channels. The creation
and removal of virtual channels require list searches and modiﬁcations and that is
much easier to do in software and can be allowed to take some extra time.
3.7 Transfer setup structure
The DSB needs to know where in the main memory to put the received data and where
to take the data that is to be sent from. Therefore a main memory data structure
that the DSB can read has to be implemented. We found two diﬀerent approaches to
implement such a data structure. Either there can be some kind of a list structure or
it can be some kind of instructions. The list structure will be easier to interpret in
hardware but will require more memory to be read before a transfer can be initialized.
The instructions will require more complex hardware but the transfer structure will
mostly be smaller, hence faster for the DSB to read.
3.8 I/O unit overview
The bus that connects the I/O units to the DSB hardware is called the I/O bus.
The I/O units will act as slaves on the I/O bus. The DSB is the master of the
I/O bus. The I/O units will be told when they are allowed to transfer data on the
I/O bus. To keep track of which I/O unit’s turn it is to transfer, there has to be some
sort of priority order.
The I/O unit must keep track of its ongoing transfers since the DSB hardware does
not.
3.9 Priorities
Priorities are needed at two places in this design: one to decide which I/O unit may
transfer data on the I/O bus, and one to decide which data shall be transferred to or
from the buffer. The first, the I/O bus priority order, can easily be solved by having a
list that tells which I/O unit is allowed to transfer data on the I/O bus. It is
suitable to configure this priority vector during boot-up, so it can be tuned to suit a
particular system. A solution to the second, the main memory priority problem, is to
put the virtual channels on a special memory access queue when they need
main memory access. The need is based upon how much more data the virtual channel
can handle before a VC overflow or underflow occurs. This can be implemented with
one or more queues. One advantage of having dual queues instead of one is that the
requests can be divided according to the level of need: one queue for virtual channels
that are very soon running out of free space or fresh data and urgently need to
receive or deliver new data, and another queue for virtual channels that need
to deliver or receive data rather soon, but can wait. With the dual queue approach
the risk of getting a buffer overrun or a buffer underrun because of queue problems
should be much smaller.
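The dual queue idea can be sketched as below. The class `MemoryAccessQueues` and its methods are hypothetical, and two details are assumptions made for the sketch: the urgent queue is always served first, and a channel that becomes urgent while waiting in the other queue is moved over.

```python
from collections import deque


class MemoryAccessQueues:
    """Illustrative dual queue for virtual channels that need main memory access."""

    def __init__(self) -> None:
        self.high = deque()  # channels about to overflow or underflow
        self.low = deque()   # channels that need service soon, but can wait

    def enqueue(self, vc_id: int, urgent: bool) -> None:
        if vc_id not in self.high and vc_id not in self.low:
            (self.high if urgent else self.low).append(vc_id)
        elif urgent and vc_id in self.low:
            # Assumption: promote a waiting channel when its need becomes urgent.
            self.low.remove(vc_id)
            self.high.append(vc_id)

    def next_channel(self):
        """Return the next channel to serve, or None if nothing is waiting."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```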
3.10 Event and error reports
The DSB hardware needs to report to the software when something special has happened,
such as an error, a packet end or a transfer end. The reports are signaled via interrupts
to the processor.
In the DSB hardware two errors can occur because of limited access
to the main memory: buffer overrun and buffer underrun. They occur
if the DSB cannot get enough access to the main memory, with the result that a
virtual channel's buffer either becomes full of incoming data (overrun) or does not contain any
data to be sent (underrun).
3.11 DSB operation example - Attaching a digital
camera
Imagine that you are about to download your vacation photos from your digital
camera. The ﬁrst thing that you will do is to connect it to the USB port. The USB
I/O unit will notify the camera driver software by sending an interrupt directly to
the CPU. The camera driver software will tell the DSB software to allocate space
for the new virtual channels. When that is done, the camera driver software can
initialize transfers. The camera itself cannot do that. Once the driver
and the camera have established that they understand each other, it is time to download the photos. This is
done by issuing a read command to the DSB unit. During initialization of the read
command, the VCD that was created earlier in software is downloaded into the
DSB hardware. When the DSB is configured, it broadcasts the read command on
the I/O bus, and the USB I/O unit detects that the read command is for one of its
attached devices. The DSB is the master of the I/O bus, and during certain periods
the USB I/O unit is allowed to transfer data. The data is received by the
DSB and stored in the buffer. Before the buffer is full, the DSB starts to write the
data out to the main memory. These two transfers are performed in parallel. When
the DSB has received all the data requested by the read command, it writes
the remaining data to the main memory and signals completion by raising an interrupt
to the CPU.
3.12 Summary
When the software initializes a new transfer, the DSB will broadcast the new transfer
command to all I/O units. The I/O unit that recognizes the corresponding device as
its own will save the provided information. The DSB hardware will have no memory
of this action, except that the relevant VCD will be updated. It is now up to the
I/O unit to request the data for the newly started transfer. Regardless of the
transfer direction, it is the I/O unit that has to request a transfer. This approach makes the
DSB hardware very simple, since it does not have to keep track of anything except
the priority queues and the priority vector. The important information is stored in
the VCDs, which are only accessed as a result of data transfers to and from the
I/O units. The software keeps track of all administrative information, such as free
memory blocks and used virtual channels. The information needed to configure the
DSB hardware is provided during the setup of a new virtual channel.
The schematic figure 3.4 illustrates how the parts of the DSB that are mentioned
above are connected to each other.
Figure 3.4: Schematic hardware DSB overview with the DSB core block magnified.
Chapter 4
Speed and size constraints for
the design
This chapter includes some basic calculations to determine some of the design param-
eters of the building blocks deﬁned in chapter 3.
4.1 System speciﬁcation
The diﬀerent protocols that this system shall be able to handle are discussed in
chapter 2. We have assumed the following system speciﬁcation:
• A 200 MHz system clock.
• A 200 MHz Double Data Rate (DDR) SDRAM main memory accessed through
a memory arbiter.
• A write-back cache that takes care of the cache coherence when the DSB writes
to the main memory.
4.2 The main memory (DDR SDRAM)
Reads and writes to DDR SDRAM memories are performed in bursts. The Double
Data Rate means that data is transferred at both the rising and falling edge of the
SDRAM clock. This doubles the data transfer rate compared to traditional SDRAM
memories.
There are a couple of interesting features from our point of view. The burst length
for DDR memories is two, four or eight transfers. Since data is transferred on both clock
edges, a burst takes one, two or four clock cycles and 8, 16 or 32 bytes are transferred (32
bits are transferred on each clock edge [5]). The setup time for read transfers is called CAS
latency. A CAS latency of 2 or 2.5 cycles is standard, but CAS latencies of 1.5, 3 or
3.5 exist as well. There is no CAS latency for write transfers.
DDR memories are pipelined and multi-banked. This means that concurrent operations
are allowed and that the CAS latency can be hidden during normal transfers,
but it must be included in worst case calculations. When a row change, a bank change
or both occur, the response time increases drastically.
Even though data is transferred on both rising and falling clock, commands can
only be issued at the rising edge.
4.2.1 Worst case calculations
The memory arbiter enqueues outstanding memory requests. There is no way to know
how long it will take before the memory arbiter issues the DSB request.
4.2.2 Best case calculations
The best case transfer of a single read to the main memory is seven cycles and the
best case for a single write is ﬁve cycles. This results in a maximum transfer speed
of 3488 Mbit/s for read and 4883 Mbit/s for write.
4.2.3 Average case calculations
An educated guess would be that an average 32 byte read takes around ten
cycles and an average 32 byte write takes eight cycles. A write is faster since CAS
latency does not affect writes. This gives a read
transfer rate of about 2440 Mbit/s and a write transfer rate of about 3050 Mbit/s. Assuming
that reads and writes are equally common, the average transfer rate
is about 2745 Mbit/s.
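The figures in sections 4.2.2 and 4.2.3 can be reproduced with the small calculation below. Two things are assumptions inferred from the numbers rather than stated in the text: that the cycles are counted at a 100 MHz memory clock (the same 100 * 10^6 that appears in the peak bandwidth term of section 4.3), and that one Mbit is taken as 2^20 bits.

```python
MEM_CLOCK_HZ = 100e6   # assumption: cycles are counted at a 100 MHz memory clock
BURST_BITS = 32 * 8    # one 32 byte burst
MBIT = 2 ** 20         # assumption: binary megabit (2^20 bits)


def rate_mbit_per_s(cycles: int) -> float:
    """Transfer rate when one 32 byte burst takes the given number of cycles."""
    return BURST_BITS * MEM_CLOCK_HZ / cycles / MBIT


best_read = rate_mbit_per_s(7)    # about 3488 Mbit/s
best_write = rate_mbit_per_s(5)   # about 4883 Mbit/s
avg_read = rate_mbit_per_s(10)    # about 2441 Mbit/s
avg_write = rate_mbit_per_s(8)    # about 3052 Mbit/s
avg_mixed = (avg_read + avg_write) / 2
```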
4.3 Virtual channel size
The size of the diﬀerent virtual channel buﬀers is hard to estimate. We have focused
on one approach when it comes to choosing virtual channel buﬀer sizes. A better
way to optimize the size of the virtual channels is to simulate diﬀerent loads and
situations. We have been rather pessimistic in our size calculations which results in
rather large buﬀers. The reason for this approach is that we are more interested in a
stable functionality than trying to minimize the virtual channel buﬀer sizes.
Our approach is to minimize the eventual setup time for each transfer. Data is
transferred to and from DDR memories in bursts. As much data as possible should
be transferred each time to minimize the transfer setup time. There is some amount
of data that is optimal to transfer considering the buﬀer size requirements and the
memory bus occupation time requirements. To ﬁnd this ultimate point the following
equation is used to illustrate the ratio between how fast the DDR memory can transfer
data and how fast the protocol in question can transfer data:
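\[
\text{DDR protocol ratio} =
\frac{\dfrac{8 \cdot \text{buffer size}}{\text{transfer rate}}}
     {\text{DDR setup time} + \dfrac{8 \cdot \text{buffer size}}{32 \cdot 2 \cdot 100 \cdot 10^{6}}}
\]
where $8 \cdot \text{buffer size}$ is the size of the buffer in question in bits and
$32 \cdot 2 \cdot 100 \cdot 10^{6}$ is the peak DDR memory bandwidth.
It is easy to see that the DDR memory setup time matters less and less as the buffer size
increases. When
\[
\text{DDR setup time} \ll \frac{8 \cdot \text{buffer size}}{32 \cdot 2 \cdot 100 \cdot 10^{6}}
\implies
\text{DDR protocol ratio} \approx \frac{k}{\text{transfer rate}},
\]
where $k$ is a constant.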
However, each virtual channel can not have that much buﬀer. Some kind of
middle way between buﬀer size and transfer eﬃciency is needed. Table 4.1 displays
the estimated buﬀer memory size for the diﬀerent protocols.
When it comes to making the buffers as small as possible for the fastest protocols, the
optimal transfer size is around 350-400 bytes. This implies that the largest buffers
will be around 700-800 bytes, since it is not likely that the entire buffer of a virtual
channel will be transferred at the same time. There is no real lower limit for low
speed devices; it is more a question of how small buffers the DSB can control
and how much time the DSB can occupy the main memory bus.
Table 4.1: Estimated VC size.

Protocol      Transfer rate      Estimated VC size (bytes)   DDR protocol ratio
Serial ATA    1.2 Gbit/s         768                         4.50
IEEE 802.3    1 Gbit/s           768                         5.5
IEEE 802.3    100 Mbit/s         768                         55
IEEE 802.3    10 Mbit/s          128                         381
IEEE 1394b    800 Mbit/s         768                         7.0
USB 2.0       LS (1.5 Mbit/s)    128                         10803
USB 2.0       FS (12 Mbit/s)     128                         536
USB 2.0       HS (480 Mbit/s)    768                         45
RS232         6.25 Mbit/s        128                         600
4.4 How large shall the buﬀer be?
It is very hard to say how many devices can be attached and used at the same
time. Table 4.2 shows an assumption of how many virtual channels a realistic
system configuration can cause.
Table 4.2: Assumption of a realistic upper limit of simultaneously open
virtual channels

Protocol      Max open virtual channels   Comment
Serial ATA    4                           Two S-ATA ports, R and W channels for each
IEEE 802.3    4                           Two Ethernet ports, R and W channels
IEEE 1394b    8                           Two units attached, R and W channels for both asynchronous and isochronous transfer
USB 2.0       25                          Five units attached, five channels each
RS232         4                           Two serial ports, R and W
Sum           45
So allowing up to 64 channels to be open at the same time would be more than
enough for most uses. If we allow the fastest virtual channel to use 800 bytes, the
worst case memory allocation will be 800*64 bytes (51200 bytes). Addressing
51200 bytes requires 16 bits, which gives an upper buffer limit of 64 kbytes. Below is an
assumption of how much memory the configuration mentioned earlier will need.
Table 4.3: Assumption of a realistic buffer usage.

Protocol      Open virtual channels   Required memory (in bytes)
Serial ATA    4                       4*768
IEEE 802.3    4                       4*768
IEEE 1394b    8                       30*128 (Average)
USB 2.0       25                      30*128 (Average)
RS232         4                       4*128
Sum                                   ≈ 14k
It is very hard to say how much the IEEE 1394b and the USB 2.0 devices will
need since they all have different transfer rates and channel types. For
many system configurations 16 kbytes of buffer will be enough, but since no
simulation of the required buffer memory has been made, it is a good idea to let the
design handle more. We have decided to set the maximum amount of memory that
the DSB can handle to 64 kbytes.
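The arithmetic of this section can be checked with a few lines. The variable names are only illustrative; the numbers are the ones used in the text and in table 4.3.

```python
MAX_CHANNELS = 64
LARGEST_VC_BYTES = 800

# Worst case: every channel uses the largest virtual channel size.
worst_case_bytes = MAX_CHANNELS * LARGEST_VC_BYTES

# Number of address bits needed to reach every byte of the worst case.
address_bits = (worst_case_bytes - 1).bit_length()
buffer_limit_bytes = 2 ** address_bits

# Realistic usage, row by row as in table 4.3.
realistic_bytes = 4 * 768 + 4 * 768 + 30 * 128 + 30 * 128 + 4 * 128
```

With these inputs the worst case is 51200 bytes, 16 address bits are needed, and the realistic usage comes to 14336 bytes, that is 14 kbytes.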
4.5 Virtual channel descriptor size
As mentioned in section 3.4, the VCD memory must be accessed in one cycle because
the information it holds is vital to the performance. What limits the size of the
VCD is how wide the memory data bus can be. With today's technology and for a
reasonable cost, we have assumed a maximum bus width of 512 bits. The
interesting configurations are one port with a 512 bit wide data bus, or two ports
with 256 bit wide data buses. (Each port can do one transfer to or from
the memory each clock cycle.) Since the information provided by the VCD is required by all parts of
the DSB hardware, the two-ported variant is far more likely to be used, which
limits the VCD to 32 bytes (256 bits).
4.6 Block size calculations
The cache line size is 32 bytes (256 bits), so the preferable option from a cache
coherency point of view would be to use the same virtual channel block size or a
multiple of 32.
Table 4.4 shows the number of bits required to point out one block in the buﬀer
and table 4.5 displays how many blocks there will be in the buﬀer for diﬀerent buﬀer
sizes and block sizes.
Table 4.4: Bits required to point out one block (rows: VC block size, columns: buffer size)

VC block size   8 kbytes   16 kbytes   32 kbytes   64 kbytes   128 kbytes   256 kbytes
8 bytes         10         11          12          13          14           15
16 bytes        9          10          11          12          13           14
32 bytes        8          9           10          11          12           13
64 bytes        7          8           9           10          11           12
128 bytes       6          7           8           9           10           11

Table 4.5: Number of blocks (rows: VC block size, columns: buffer size)

VC block size   8 kbytes   16 kbytes   32 kbytes   64 kbytes   128 kbytes   256 kbytes
8 bytes         1024       2048        4096        8192        16384        32768
16 bytes        512        1024        2048        4096        8192         16384
32 bytes        256        512         1024        2048        4096         8192
64 bytes        128        256         512         1024        2048         4096
128 bytes       64         128         256         512         1024         2048
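The entries of tables 4.4 and 4.5 follow directly from the buffer size and the block size. A small sketch, with hypothetical function names, assuming that both sizes are powers of two:

```python
def block_count(buffer_bytes: int, block_bytes: int) -> int:
    """Number of equally sized blocks in the buffer (table 4.5)."""
    return buffer_bytes // block_bytes


def pointer_bits(buffer_bytes: int, block_bytes: int) -> int:
    """Bits needed to point out any one block (table 4.4)."""
    return (block_count(buffer_bytes, block_bytes) - 1).bit_length()


KBYTE = 1024
# The configuration chosen later in this chapter: 64 kbytes buffer, 128 byte blocks.
chosen_blocks = block_count(64 * KBYTE, 128)
chosen_bits = pointer_bits(64 * KBYTE, 128)
```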
To minimize the number of block pointers in the VCD, the slowest protocol
(RS232) will only have one block. It does not matter much if the block is too
large for RS232, apart from the wasted buffer space. Since the other protocols
are much faster than the RS232 protocol we have chosen a block size of 128 bytes.
128 bytes does not waste too much space for RS232, and it means that the fastest
protocols need only 7-9 block pointers in their VCD.
4.6.1 The connection to the I/O units
In order to optimize the I/O bus width some calculations regarding the protocols are
needed. In table 4.6 there are comparisons between the theoretical peak data rate of
each protocol transferred over a 200 MHz data bus with diﬀerent bus widths. The
ratio between the bus bandwidth and the protocol bandwidth is calculated by:
\[
\text{IOBus protocol ratio} =
\frac{\text{bus width} \cdot 200 \cdot 10^{6}}{\text{protocol max transfer rate}}
\]
Table 4.6: I/O bus speed vs. protocol speed

Protocol                         256 bits   128 bits   64 bits   32 bits   16 bits   8 bits
Serial ATA, 1.2 Gbit/s           42.667     21.333     10.667    5.333     2.667     1.333
Gigabit Ethernet, 985.5 Mbit/s   51.953     25.977     12.988    6.494     3.247     1.624
IEEE 1394b, 785.3 Mbit/s         65.198     32.599     16.3      8.15      4.075     2.037
USB 2.0, 413 Mbit/s              123.971    61.985     30.993    15.496    7.748     3.874
RS-232, 6.25 Mbit/s              8192       4096       2048      1024      512       256
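The ratios in table 4.6 can be recomputed from the formula above. The sketch below uses the protocol rates from the table with decimal prefixes (1 Mbit = 10^6 bits); the dictionary and function names are hypothetical.

```python
BUS_CLOCK_HZ = 200e6

# Peak protocol data rates in bit/s, taken from table 4.6.
PROTOCOL_RATES = {
    "Serial ATA": 1.2e9,
    "Gigabit Ethernet": 985.5e6,
    "IEEE 1394b": 785.3e6,
    "USB 2.0": 413e6,
    "RS-232": 6.25e6,
}


def io_bus_protocol_ratio(bus_width_bits: int, protocol_rate: float) -> float:
    """Bus bandwidth divided by the peak bandwidth of the protocol."""
    return bus_width_bits * BUS_CLOCK_HZ / protocol_rate


# One row per protocol, one column per bus width, as in the table.
table = {
    name: [round(io_bus_protocol_ratio(width, rate), 3)
           for width in (256, 128, 64, 32, 16, 8)]
    for name, rate in PROTOCOL_RATES.items()
}
```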
This table does not give a realistic view of when data can be transferred over
the bus, since several units share it. That makes things much more
complicated, so a more realistic measure is to cut the bus performance
in half (i.e. a reduction by a factor of two).
The conclusion drawn from this table is that all protocols will probably manage on
a 128 bit wide bus.
The I/O units connected to the I/O bus will not be able to transfer data every
time a byte arrives. They must have small FIFOs that can buffer data while they wait
for their turn to transfer data.
4.7 The internal bus and the main memory bus
Since the I/O bus is 128 bits wide, it is suitable to make the buffer bus 128 bits wide
too. However, the memory arbiter wants data in packets of 256 bits, which forces the
DSB to have a small 256 bit buffer that it uses for transfers between the main
memory and itself. It is better to keep the data buses as narrow as possible since
that requires less hardware.
4.8 Summary
The bottleneck of the DSB is clearly the DDR memory bandwidth.
Chapter 5
Between the ideas and the
implementation
The purpose of this short chapter is to build a bridge between the ideas and limita-
tions presented earlier and the implementation plans which are presented in the next
chapter.
Figure 5.1: Schematic hardware DSB overview with the DSB core block magniﬁed.
When an I/O unit wants to transmit or receive data, it must ﬁrst wait until the
DSB tells it that it is allowed to transfer data. In order to transfer data, the I/O
unit tells the DSB which virtual channel the transfer refers to. All this happens
over the I/O bus. When the DSB knows which virtual channel shall be used,
it requests the channel's VCD from the VCD memory. From the delivered VCD the
DSB extracts the address where the data shall be stored or loaded from. The
address is then sent to the buffer to prepare for the transfer.
After this is done the VCD is updated with the incremented address and is then
written back to the VCD memory. If the I/O unit wants to write data into the
buffer, this can be done during the same cycle, but if it wants to read data, the
buffer returns the requested data one cycle later. It is simpler to handle all
transfers equally, independent of direction. Therefore we have decided that writing
data to the buffer will also be delayed one cycle. In short, a complete transfer
of data between an I/O unit and the buffer takes three cycles. See figure 5.2.
Figure 5.2: A full transfer cycle.
The memory bus logic takes care of transferring data between the DSB and the
main memory. To know what to do, it looks for jobs in the queue where the I/O bus
logic queues virtual channels that need more data from the main memory or need to
store data in the main memory. A virtual channel ends up in the queue when most
of its buffer has been used. When the memory bus logic has received a pointer
to a virtual channel, it requests information about it from the VCD memory. After
it has received the information, it fetches the virtual channel's instructions from the
main memory. When the instructions have arrived the memory bus logic decodes
them and figures out the address in the memory where data shall be stored or loaded,
depending on the direction of the virtual channel. The memory bus logic performs the
needed memory transfers and during the transfer it keeps the VCD memory updated
with the latest changes. The memory bus logic and the I/O bus logic should be able
to operate in parallel.
Chapter 6
System implementation
suggestion
The suggested system implementation is shown in ﬁgure 6.1. This is an enhancement
of the simple overview of the DSB hardware shown in ﬁgure 5.1. It shows the func-
tional units, the memories, the signals between and how the software interacts with
the hardware. This ﬁgure will be described in detail during this chapter.
First, in section 6.1, we explain the execution of the suggested system. Sections 6.5
and 6.6 are meant as a guide to the units of figure 5.1 and should aid in understanding
the function of these units.
Figure 6.1: Overview of the DSB's functional units and adjacent devices.
6.1 Execution pattern
This section explains how certain parts can be parallelized in order to increase the
throughput of the system. We call the parallelization method the execution pattern.
As mentioned in chapter 5, a data transfer between an I/O unit and the buffer
requires three cycles. We have named the three cycles WR (Write
and Read setup), C (Command) and D (Data); see table 6.1 below.
Table 6.1: The names of the cycles and what is executed during them.

Cycle name   Description
WR           The I/O unit tells the DSB which VC is involved in the transfer.
C            The I/O unit tells the DSB how much data shall be transferred during the D cycle, and about transfer/packet ends.
D            Data is transferred between the I/O unit and the buffer.
The cycles WR, C and D are something that the I/O units must follow, not something
that they create whenever they feel a need for them. The DSB controls
when an I/O unit can perform a WR, a C or a D. From here on, WR, C and D will therefore
be called commands. The involved parts of the DSB hardware are shown in figure 6.2.
Figure 6.2: Transfer between the I/O units and the buﬀer master.
The next step is to analyse which memories and buses the different commands
require in order to operate. The following table describes the resources that
the commands require.
Table 6.2: The resources and explanation.

Resource name     Description
Buffer data       Data bus
Buffer control    Control signals for the buffer such as direction and address
I/O bus data      The part of the I/O unit connection where data is transferred
I/O bus control   The other part of the I/O unit connection where control and status signals are transferred
VCD memory        The access to the VCD memory. Both control and data.
Table 6.3 shows the commands and the needed resources.
Table 6.3: Resources needed for the different commands.

Command   I/O Bus Data   I/O Bus Ctrl   VCD    Buffmem Data   Buffmem Ctrl
WR        -              X              X(r)   -              -
C         -              -              X(w)   -              X
D         X              -              -      X              -
During the transfer between the buffer and the main memory, the VCD memory
needs to be updated. Therefore we introduce two more commands, M (Memory setup)
and T (Transfer). During the M command the logic that transfers data between the buffer
and the main memory is allowed to read from the VCD memory, and during T it can
write to it.
Figure 6.3: Transfer between the main memory and buffer.
The last command that is needed is one that delivers different commands to the
I/O units, such as new transfer and transfer end (if needed). We have decided to call
this command Cmd. It needs the I/O Bus Ctrl path (see fig. 6.4).
Figure 6.4: A Cmd command transfer between the command buffer and the I/O units.
The complete table of DSB commands is shown below. M and T are internal and
the others are on the I/O bus.
Table 6.4: Resources needed for the different commands.

Transfer             Name   I/O Bus Data   I/O Bus Ctrl   VCD    Buffmem Data   Buffmem Ctrl
Buffmem - I/Os       WR     -              X              X(r)   -              -
                     C      -              -              X(w)   -              X
                     D      X              -              -      X              -
Main mem - Buffmem   M      -              -              X(r)   -              X
                     T      -              -              X(w)   X              -
Cmds - I/Os          Cmd    -              X              -      -              -
The longest sequence of commands that must be performed before a task is
finished is the WR, C and D sequence. Therefore, parallelizing the DSB's execution
unit to a depth of three levels is suitable.
Table 6.5 shows which pairs of commands can be issued in the same cycle.
Table 6.5: Possible combination of two commands.
Command 1 Command 2 Can be executed at the same time?
WR C Yes
WR D Yes
WR M Yes
WR T Yes
WR Cmd No
C D Yes
C M No
C T Yes
C Cmd Yes
D M Yes
D T No
D Cmd Yes
M T Yes
M Cmd Yes
T Cmd Yes
Table 6.6 shows which combinations of three commands can be issued during the same cycle.
Table 6.6: Valid combinations of commands issued during the same cycle.
Command 1 Command 2 Command 3 Can be executed at the same time?
WR C D Yes
WR C T No (3 VCD memory accesses)
WR D M Yes
WR M T No (3 VCD memory accesses)
C D Cmd Yes
C T Cmd Yes
D M Cmd Yes
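The entries of tables 6.5 and 6.6 can be derived mechanically from the resource usage in table 6.4. The sketch below assumes that every resource except the VCD memory can serve one command per cycle, and that the VCD memory has two ports and can therefore serve two. The resource labels and the function name are hypothetical.

```python
from itertools import combinations

VCD_PORTS = 2  # assumption used in the tables: a two-ported VCD memory

# Resources used by each command, transcribed from table 6.4.
RESOURCES = {
    "WR":  {"io_ctrl", "vcd"},
    "C":   {"vcd", "buf_ctrl"},
    "D":   {"io_data", "buf_data"},
    "M":   {"vcd", "buf_ctrl"},
    "T":   {"vcd", "buf_data"},
    "Cmd": {"io_ctrl"},
}


def can_issue_together(*commands: str) -> bool:
    """True if the given commands can be executed during the same cycle."""
    vcd_accesses = sum("vcd" in RESOURCES[c] for c in commands)
    if vcd_accesses > VCD_PORTS:
        return False
    for first, second in combinations(commands, 2):
        # Any shared resource other than the VCD memory is a conflict.
        if (RESOURCES[first] & RESOURCES[second]) - {"vcd"}:
            return False
    return True


conflicting_pairs = [
    pair for pair in combinations(RESOURCES, 2) if not can_issue_together(*pair)
]
```

Under these assumptions `conflicting_pairs` contains exactly the three pairs marked "No" in table 6.5, and the three-command results agree with table 6.6.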
As can be seen, several combinations of three commands can be
executed simultaneously. Had that not been the case, a three stage parallelization
would not have been suitable. It is thus clear that parallelizing the commands is a good
idea.
Below is a suggestion of how this parallelization could be done. It shows a sequence
of parallelized commands. The sequence requires that the VCD memory has
two ports; the port used is shown within parentheses.
Table 6.7: Execution pattern.

State               0        1       2        3       4        5        6        7
                    WR (0)   C (1)   D        Cmd     WR (1)   C (1)    D        Cmd
                    C (1)    D       M (0)    T (0)   -        -        M (0)    T (0)
                    D        Cmd     WR (1)   C (1)   D        WR (0)   WR (1)   C (1)
VCD memory usage:   2        1 (1)   2        2       1 (1)    2        2        2
The reason for putting an M and a T every fourth cycle is that the main memory can
deliver at best 256 bits of data every fourth cycle. So possible data
transfers must take place at those points. The pattern is constructed so that it can be repeated
over and over again.
Notice that in states 1 and 4 one port of the VCD memory is free for use. It will be
used by the software when it needs to modify VCDs, e.g. when starting new transfers or
creating a virtual channel.
As shown in table 6.7, the utilization is quite high as there are only two unused
slots in the pattern. The total execution pattern efficiency is 22/24 = 92 %. The
I/O bus is used for data transfers during 5 of 8 cycles, which results in a total data
transfer rate of
\[
\frac{5}{8} \cdot 200 \cdot 10^{6} \cdot 128 = 16 \text{ Gbit/s}
\]
The I/O bus data transfer rate is about four times higher than that of the DDR memory. This
means that a faster DDR memory can be used without any changes to
the DSB hardware, and that the I/O bus width can be decreased if the I/O units
are known and their bandwidth is far lower than that of the I/O bus.
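The utilization figures can be recomputed from the pattern in table 6.7. In the sketch below the pattern is transcribed row by row, with `None` marking an unused slot; the variable names are hypothetical.

```python
# Table 6.7, one list per row, one entry per state 0..7.
PATTERN = [
    ["WR", "C",   "D",  "Cmd", "WR", "C",  "D",  "Cmd"],
    ["C",  "D",   "M",  "T",   None, None, "M",  "T"],
    ["D",  "Cmd", "WR", "C",   "D",  "WR", "WR", "C"],
]
STATES = 8
VCD_USERS = {"WR", "C", "M", "T"}  # commands that access the VCD memory

used_slots = sum(cmd is not None for row in PATTERN for cmd in row)
total_slots = len(PATTERN) * STATES
efficiency = used_slots / total_slots

# States in which a D command moves data over the I/O bus.
data_states = sum(any(row[s] == "D" for row in PATTERN) for s in range(STATES))
io_bus_rate_bit_per_s = data_states / STATES * 200e6 * 128

# VCD memory accesses per state, to compare with the last row of the table.
vcd_usage = [sum(row[s] in VCD_USERS for row in PATTERN) for s in range(STATES)]
```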
6.2 Queues
As mentioned in chapter 3, there are two queues for virtual channels that are waiting
for some kind of transfer to or from the main memory. The first is a low
priority (LP) queue, where virtual channels that have used 50% of their buffer space
are enqueued. The second is a high priority (HP) queue, where virtual channels
that have used more than 75% of their buffer space are enqueued.
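The two thresholds can be expressed as a small classification function. This is a sketch only: the function name is hypothetical, and the handling of the exact boundaries (LP from 50% up to and including 75%, HP strictly above 75%) is an assumption, since the text does not specify it.

```python
LP_THRESHOLD = 0.50
HP_THRESHOLD = 0.75


def queue_for(used_bytes: int, buffer_bytes: int):
    """Return "HP", "LP" or None depending on how much of the buffer is used."""
    fill = used_bytes / buffer_bytes
    if fill > HP_THRESHOLD:
        return "HP"
    if fill >= LP_THRESHOLD:   # assumption: exactly 50% already qualifies for LP
        return "LP"
    return None
```

For a 768 byte virtual channel this places the channel in the LP queue once 384 bytes are used and in the HP queue once more than 576 bytes are used.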
6.3 DSB Instructions
The software driver has to set up a transfer structure in the main memory before a
transfer can start. We have chosen to implement this behaviour with a small set of
instructions, shown in table 6.8.
Table 6.8: DSB transfer instruction.

Instruction   Arguments       Description
data          address, size   Points out a memory area of specific size.
pe            -               Last byte is the packet end.
te            -               Transfer end.
irq           number          Generate IRQ.
cjmp          rel adr         Where to jump for retransferring data on collision. Relative address.
jmp           rel adr         Jump to relative address.
nop           -               No operation instruction.
The thought behind this idea is that the “data” instruction is the master instruc-
tion and all other instructions that come after it until a new “data” instruction comes,
are related to it.
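The relation between a "data" instruction and the instructions that follow it can be illustrated by grouping an instruction list. The tuple encoding, the function name and the example addresses are hypothetical, and rejecting an instruction that appears before the first "data" instruction is an assumption of this sketch, not something the design prescribes.

```python
def group_by_data(instructions):
    """Group each "data" instruction with the instructions that relate to it."""
    groups = []
    for name, *args in instructions:
        if name == "data":
            groups.append([(name, *args)])       # a new master instruction
        elif groups:
            groups[-1].append((name, *args))     # belongs to the latest "data"
        else:
            raise ValueError("instruction before the first data instruction")
    return groups


# A hypothetical transfer: two memory areas, each ending a packet,
# with an IRQ and a transfer end after the second one.
program = [
    ("data", 0x1000, 512),
    ("pe",),
    ("data", 0x2000, 256),
    ("pe",),
    ("irq", 3),
    ("te",),
]
groups = group_by_data(program)
```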
6.4 Inside the Virtual Channel Descriptor (VCD)
During the previous chapters and sections, a lot of things have been mentioned which
will be kept in the VCD memory. Table 6.9 summarizes what has to be included in
the VCD.
Table 6.9: What a VCD needs to include.

Name            Description
curr buff ptr   Byte oriented pointer to the next fresh byte
cirr buff ptr   Byte oriented pointer to the end of the circular buffer that the blocks form
curr inst ptr   Pointer to the current DSB instruction. Byte oriented
curr mem ptr    Current memory address to the next fresh byte. Byte oriented
pkg pointers    Byte oriented end of packet pointers.
pkg usage       Bitvector of what packet ends that are in use.
blk pointers    Pointers to the allocated blocks
blk usage       Bitvector of what block pointers that are in use.
dir             Direction of the channel
in lp           A flag to inform that the virtual channel is already queued in the LP list
in hp           A flag to inform that the virtual channel is already queued in the HP list
full empty      A flag to differentiate between a full and empty virtual channel
All this has to fit within 32 bytes. The next table tries to identify the size and
number of the entries listed in table 6.9 that need to be in a VCD.
In chapter 4 we stated that the VCD cannot be larger than 32 bytes, that a block
shall be 128 bytes, and that the buffer shall be allowed to be up to 64 kbytes large.
The result is that the buffer can be divided into at most 512 blocks. To point out
512 blocks, nine bit pointers are required. There are two approaches to choose
between. The first uses a fixed number of block pointers and a
fixed number of packet end pointers (see table 6.10).
Table 6.10: Static VCD - 64 kbytes buffer memory, 128 byte blocks (total 512 blocks),
VCD 32 bytes, 6 blk ptrs per VCD, 9 pkg end ptrs.

What                  Size for each (bits)   Total number of   Comment
curr buff ptr         10                     1                 Shall be able to address 6 * 128 bytes
cirr buff ptr         10                     1
curr mem list ptr     32                     1
curr mem ptr          32                     1
dir                   1                      1                 1=Read, 0=Write
blk pointer           9                      6
blk usage             1                      6                 Bitvector
pkg pointers          10                     9                 10 bits can point out any byte in the VC.
pkg usage             1                      9                 Bitvector
transfer end marker   1                      1                 If this bit is set, we're approaching the end.
in hp                 1                      1                 If bit set, in HP list.
in lp                 1                      1                 If bit set, in LP list.
full empty            1                      1                 To know the difference between an empty and full circular buffer.
unused                8                      1                 Reserved for future use.
However, at this point, without thorough simulations, there is no way to know whether
nine packet ends are enough for all supported protocols or more than necessary.
Simulations could also show that six blocks are not enough for the fastest protocols.
This approach is a bit too static.
The second approach is a little more complicated but better, since the numbers
of blocks and packet ends are dynamically allocated.
Table 6.11: Dynamic VCD - 64 kbytes buffer memory, 128 bytes blocks (total 512
blocks), VCD 32 bytes, dynamic number of blk ptrs and pkg ends per VCD.
What                 Size for each (bits)  Total number of  Comment
curr buff ptr        11                    1                Shall be able to address 13 * 128 bytes.
cirr buff ptr        11                    1
curr mem list ptr    32                    1
curr mem ptr         32                    1
num blks             4                     1                The first 1-16 of the configurable pointers are block ptrs.
config ptr block     12                    13               1 bit usage, 11 bits ptr (to reach 13 * 128 bytes).
dir                  1                     1                1=Read, 0=Write.
transfer end marker  1                     1                If this bit is set, we are approaching the end.
in hp                1                     1                If bit set, in HP list.
in lp                1                     1                If bit set, in LP list.
full empty           1                     1                To know the difference between an empty and a full circular buffer.
unused               5                     1
In this VCD structure (see table 6.11) there are 13 configurable pointers that can
be used for packet ends and block pointers, instead of 15 fixed ones (6 for blocks and
9 for packet ends). The dynamic approach requires slightly more complicated hardware
than the static approach.
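The dynamic layout can be illustrated with a short Python sketch that splits the 13 configurable entries into block pointers and packet-end pointers according to num blks. Placing the usage bit in the top bit of each 12-bit entry, the function names and the restriction of num blks to 1..13 are assumptions; table 6.11 only gives the widths.

```python
CONFIG_PTRS = 13             # configurable pointers per VCD (table 6.11)
PTR_BITS = 11                # pointer part of each 12-bit entry
USAGE_BIT = 1 << PTR_BITS    # assumed: usage flag stored in the top bit


def make_entry(ptr, in_use=True):
    """Build one 12-bit entry: 1 usage bit + 11-bit pointer."""
    if not 0 <= ptr < (1 << PTR_BITS):
        raise ValueError("pointer does not fit in 11 bits")
    return (USAGE_BIT if in_use else 0) | ptr


def split_entries(entries, num_blks):
    """Return (block pointers, packet-end pointers) that are marked in use."""
    if len(entries) != CONFIG_PTRS or not 1 <= num_blks <= CONFIG_PTRS:
        raise ValueError("bad VCD configuration")

    def used(part):
        return [e & (USAGE_BIT - 1) for e in part if e & USAGE_BIT]

    # The first num_blks entries are block pointers, the rest packet ends.
    return used(entries[:num_blks]), used(entries[num_blks:])


# Bit budget of table 6.11: it must again be exactly 256 bits.
DYNAMIC_BITS = 11 + 11 + 32 + 32 + 4 + 12 * CONFIG_PTRS + 1 + 1 + 1 + 1 + 1 + 5
assert DYNAMIC_BITS == 256
```

With two block pointers configured, the remaining eleven entries are available as packet ends, which is the flexibility the static layout lacks.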
6.5 Software
This section describes the software part of the system architecture illustrated in
figure 6.5. The driver interface is described first, followed by the internal functions
in section 6.5.2 and finally the data tables and vectors in section 6.5.3.
Figure 6.5: The software of the DSB.
6.5.1 Device driver interface
This is the interface between a speciﬁc device driver for a particular I/O unit and the
DSB hardware.
Open - VCD ptr Open( Id, Dir )
A device driver uses the Open interface to set up a connection before it reads from
or writes to a specific I/O device. The Open interface then creates a new virtual
channel. It uses the information stored in the VCD memory vector, the VCDmem copy
and the Block allocation vector to create a new VCD without consulting the
hardware. When the VCD is complete, it is transferred to the DSB hardware. Note
that virtual channels are unidirectional: to be able to both read and write, two
virtual channels have to be created.
Close - void Close( VCD ptr )
Closes a virtual channel by deallocating the virtual channel in the VCD memory vector
and freeing its blocks in the Block allocation vector. This is done when the connection
is no longer needed and a device driver wants to end it. The DSB hardware does not
need to be notified.
Boot - void Boot( VCconf, I/O prio vect )
Boot is the initialization interface of the DSB and should be invoked during boot
of the operating system. During this stage the settings of the different virtual
channels are configured. The configurable settings are the buffer size of each type of
virtual channel and the I/O priority list. Depending on which approach is chosen for
handling packet ends in the VCD, the number of possible packet ends is also
configured. All these settings, except the I/O priority list, are stored in a table in the
main memory and used when new virtual channels are created. The I/O priority
list is kept in a special memory in the DSB hardware.
Read/Write - void Read/Write( Id, VCD ptr, Part of VCD )
A device driver invokes the Read/Write interface when it wants to start a new read or
write transfer. The interface changes the memory pointers in the VCD and, through the
VCD Alter Queue, delivers the VCD pointer (VCD ptr) to a specific I/O unit. This
triggers the I/O unit to start transferring data.
Flush - void Flush( VCD ptr )
Forces all buﬀered data of a speciﬁed virtual channel to be ﬂushed out of the DSB
buﬀer. It forces a ﬂush by sending the VCD ptr to the Low Priority Queue. It only
works for read transfers.
Abort - void Abort( VCD ptr )
A device driver uses the Abort interface when it wants to stop an I/O transfer. The
Abort interface removes the VCD ptr from the I/O units by sending the Abort command
and the VCD ptr to the Command buffer, from which they are broadcast to the I/O units.
6.5.2 Internal functions
This section covers the internal functions of the software that are needed to
perform specific tasks on data structures that reside in either the software or
the hardware part.
Add VCD
This function allocates a VCD ptr that points to a free VCD in the VCD memory.
To do this it needs to ﬁrst ﬁnd a free space in the VCDmem allocation vector, which
represents the VCDs in the VCD memory. Then it allocates the space found in the
VCDmem allocation vector and sends the VCD ptr on to the Create VCD function.
Create VCD
This function creates the VCD in the VCD memory and in the VCDmem copy with the
information it gets from its connecting functions. To create the VCD in the actual
VCD memory it uses the VCD hardware altering mechanisms (VCDmem copy table,
Allocate blocks function and VCD Altering Queue).
Allocate blocks
This function allocates the blocks in the buﬀer that a particular virtual channel needs,
so that the Create VCD function knows which blocks it can use. This is done by
searching and allocating in the Block allocation vector.
Deallocate blocks
This function deallocates the blocks of a virtual channel in the VCDmem copy and
the Block allocation vector so that they can be reused by new virtual channels.
Remove VCD
This function deallocates a VCD in the VCDmem allocation vector so it can be reused
by a new virtual channel.
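The bookkeeping performed by these internal functions can be sketched in Python as follows. This is a minimal illustration, assuming a first-fit search, 64 VCD slots and 512 blocks; the class name, method names and the dictionary used for the VCDmem copy are hypothetical, and the transfer of the finished VCD to the hardware is left out.

```python
class DsbAllocator:
    """Software-side bookkeeping: which VCDs and buffer blocks are in use."""

    def __init__(self, num_vcds=64, num_blocks=512):
        self.vcd_used = [False] * num_vcds        # VCDmem allocation vector
        self.block_used = [False] * num_blocks    # Block allocation vector
        self.vcdmem_copy = {}                     # VCD ptr -> allocated blocks

    def add_vcd(self):
        """Add VCD: find and claim a free VCD slot, return its VCD ptr."""
        for ptr, used in enumerate(self.vcd_used):
            if not used:
                self.vcd_used[ptr] = True
                return ptr
        raise MemoryError("no free VCD")

    def allocate_blocks(self, vcd_ptr, count):
        """Allocate blocks: claim `count` free blocks for one virtual channel."""
        free = [b for b, used in enumerate(self.block_used) if not used]
        if len(free) < count:
            raise MemoryError("not enough free blocks")
        blocks = free[:count]
        for b in blocks:
            self.block_used[b] = True
        self.vcdmem_copy[vcd_ptr] = blocks
        return blocks

    def deallocate_blocks(self, vcd_ptr):
        """Deallocate blocks: release the blocks recorded in the VCDmem copy."""
        for b in self.vcdmem_copy.pop(vcd_ptr, []):
            self.block_used[b] = False

    def remove_vcd(self, vcd_ptr):
        """Remove VCD: mark the VCD slot as free again."""
        self.vcd_used[vcd_ptr] = False
```

In this sketch an Open call would correspond to add_vcd followed by allocate_blocks, and a Close call to deallocate_blocks followed by remove_vcd.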
6.5.3 Data tables and vectors
This section describes the data structures and their contents. There are two tables
and two vectors of data in the DSB software.
Block allocation vector
This vector contains information on which blocks in the buffer are free and which
are being used.
The vector is written to by the Deallocate blocks function. The Create VCD function
reads and writes to this vector.
VCDmem allocation vector
This is a vector that holds information on which VCDs are free and which are currently
being used.
It is read and written by the Remove VCD and the Add VCD functions.
Virtual channel conﬁguration
This table contains the sizes of the buffers that the different virtual channels need.
They are updated in the initialization process by the Boot interface.
The table is read by the Open interface, which uses it to look up the VC configuration
of the specific I/O unit.
VCDmem copy
This table contains information on which blocks each VCD has allocated. The
information is stored so that the Close interface can deallocate the blocks in the
Block allocation vector.
The table is read and written to by the Deallocate blocks function and is read
by the Create VCD function.
6.6 Hardware units
This section describes the DSB hardware, shown in ﬁgure 6.6.
Figure 6.6: The hardware of the DSB.
6.6.1 Interface between the hardware and software
The interface between the hardware and software consists of the I/O priority vector,
the command buffer, the VCD alter queue and the LP list (see figure 6.6). As seen in
figure 6.1, the I/O priority vector interface consists of the I/O prio vect. The
command buffer interface consists of the VCD ptr and the abort command, and the VCD
alter queue interface consists of the VCD, the VCD ptr and the create command. The
last interface, the LP list, consists of the VCD ptr.
6.6.2 Functions
This section describes the control logic units of the DSB hardware.
Buﬀer master
The buffer master is the control logic that handles the buffer when an I/O unit wants
to write to or read from the buffer.
The buffer master is controlled by the bus master with the WR, C and D control
signals described in section 6.1. When the buffer master gets the VCD for the virtual
channel it is about to read from or write to, it checks whether the virtual channel
needs to be filled or emptied. If so, it sends the VCD ptr to either the low or the high
priority queue. The low priority queue is used when the virtual channel has consumed
over 50% of its buffer space, and the high priority queue when over 75% is consumed.
Before sending the VCD ptr to either of the priority queues, the buffer master checks
the VCD to make sure that the same VCD ptr is not already in that priority queue, since
we do not want multiple entries of the same VCD ptr in a priority queue. The buffer
master also checks that the VCD ptr is not in the high priority queue before sending the
VCD ptr to the low priority queue.
The buffer master uses interrupts to signal packet ends, transfer ends, buffer
underruns and buffer overruns (Pkg E, T E, BU, BO).
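The queueing decision of the buffer master can be illustrated with the Python sketch below. The 50% and 75% levels and the in lp / in hp flags come from the text. Representing the VCD as a dictionary, the key names and the function name are assumptions made for the example.

```python
from collections import deque

LP_LEVEL = 0.50   # low priority threshold from the text
HP_LEVEL = 0.75   # high priority threshold from the text


def request_service(vcd, lp_queue, hp_queue):
    """Queue a virtual channel for a memory transfer if it has consumed enough buffer.

    `vcd` is a dict with the hypothetical keys 'ptr', 'consumed', 'size',
    'in_lp' and 'in_hp'; the last two mirror the VCD flags of the same name.
    """
    fill = vcd["consumed"] / vcd["size"]
    if fill > HP_LEVEL:
        if not vcd["in_hp"]:                 # never queue the same VCD ptr twice
            hp_queue.append(vcd["ptr"])
            vcd["in_hp"] = True
    elif fill > LP_LEVEL:
        if not vcd["in_lp"] and not vcd["in_hp"]:
            lp_queue.append(vcd["ptr"])
            vcd["in_lp"] = True


# A channel with a 6 * 128 byte buffer that keeps filling up.
lp, hp = deque(), deque()
vc = {"ptr": 3, "consumed": 0, "size": 768, "in_lp": False, "in_hp": False}
for consumed in (300, 400, 500, 600, 700):
    vc["consumed"] = consumed
    request_service(vc, lp, hp)
```

After the loop the VCD ptr sits once in each queue: it entered the LP list when the channel passed 50% and the HP list when it passed 75%.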
Bus master
The bus master sends the control signals to the I/O units, the buffer master, the
memory transfer master and the VCD alter master. It always sends three commands each
cycle (although a command can be a no-operation command). These commands
follow an eight-cycle pattern described in section 6.1. The bus master checks the I/O
priority vector to determine which I/O unit is allowed to initiate a transfer.
Memory decoder
The main purpose of the memory decoder is to transfer data, to and from the main
memory. To do this it needs to be able to interpret the instructions that tell the
memory decoder what data is to be sent or where to put the data that has been
received.
The memory decoder gets the VCD and the VCD ptr from the queue master and
checks the VCD to see how much data should be transferred and in which direction.
It then arbitrates for the main memory to fetch the next 256-bit instruction block.
When the block arrives, the memory decoder interprets the instructions and programs
the memory transfer master. While transferring data between the main memory and the
VC's buffer area, it may be necessary (although unlikely) to read more instructions and
to program the buffer master several times. When the whole transfer is done, the memory
decoder tells the queue master that it is done and ready for a new order.
If the memory decoder detects a transfer end instruction while writing to the main
memory, it sends an abort command and the VCD ptr to the command buffer.
If the memory decoder detects a transfer end with an interrupt attached to it while
reading from the main memory, the interrupt number is sent to the buffer master.
During a write sequence a packet collision can happen, and in the worst case the
memory decoder has to refetch the data. The queue master tells the memory decoder
when to do this.
VCD updates are done when the memory transfer master signals that it is done
with its task.
The memory decoder uses interrupts to signal packet ends and transfer ends
(Pkg E, T E) that occur while writing to the main memory.
Memory transfer master
The memory transfer master sets up transfers between the buﬀer and the main mem-
ory. It does this by using the MT buﬀer and the memory decoder. The MT buﬀer is
needed to bridge between the 256 bits wide main memory bus and the 128 bits wide
buﬀer bus.
The memory transfer master asks the memory arbiter for permission to transmit
data. For each 256 bits (or less) block of data, a new arbitration round must take
place.
The memory transfer master is controlled by the bus master with the M and T
control signals.
Queue master
The main purpose of the queue master is to look for jobs in the low and high priority
queues, check that they are still valid and tell the memory decoder to start a transfer.
The queue master should always look for jobs in the high priority queue first.
Only when there are no jobs left in the high priority queue should it look in the
low priority queue.
Since there are two queues, the same VCD ptr may be in both queues. The queue master
therefore needs to check the VCD to see whether the buffer has already been filled or
emptied. It also checks that the buffer of a low priority job really is more than
50% consumed before sending the job to the memory decoder. This requires
an extra bit for each entry in the low priority list to keep track of whether the VC is
about to be flushed or not. A flushed VC shall be executed regardless of its status.
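The selection rule of the queue master can be sketched as follows. The sketch assumes that LP entries are (VCD ptr, flush bit) pairs and that a hypothetical callback returns the consumed fraction of a channel's buffer as read from its VCD. HP entries are passed on without a validity check here and the resend bit of the HP list is omitted, which is a simplification.

```python
from collections import deque


def next_job(lp_queue, hp_queue, fill_level):
    """Pick the next VCD ptr to hand to the memory decoder, or None.

    lp_queue holds (vcd_ptr, flush) pairs, hp_queue holds plain VCD ptrs.
    fill_level(vcd_ptr) is a hypothetical callback returning the consumed
    fraction of the channel's buffer (0.0 to 1.0).
    """
    if hp_queue:                             # high priority always goes first
        return hp_queue.popleft()
    while lp_queue:
        ptr, flush = lp_queue.popleft()
        # A stale LP entry (already served via the HP list) is dropped;
        # a flush is executed regardless of the fill level.
        if flush or fill_level(ptr) > 0.5:
            return ptr
    return None


levels = {1: 0.8, 2: 0.2, 3: 0.1}
hp = deque([1])
lp = deque([(2, False), (3, True)])
order = [next_job(lp, hp, levels.get) for _ in range(3)]
```

Here channel 1 is served first from the HP list, the stale entry for channel 2 is skipped because its buffer is only 20% consumed, and channel 3 is served because its flush bit is set.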
VCD alter master
The VCD alter master reads from the VCD alter queue and writes to the VCD in
the VCD memory and passes read and write commands to the command buﬀer. In
the eight cycle long pipeline pattern, the VCD alter master can write to the VCD
memory in two cycles.
6.6.3 Memories and buﬀers
Buﬀer
The buffer is the central hardware unit of the DSB. It stores all the data that the
virtual channels receive or want to transmit.
It is a single-ported, byte-addressable memory. It has to operate at
the same 200 MHz frequency as the rest of the hardware design.
The memory is always accessed in two cycles: the address in the first cycle and
the data in the second.
Memory size: Estimated to be 16 kbytes (the DSB is designed to handle up to
64 kbytes).
I/O priority vector
The vector holds a priority list of the I/O units. The priority list specifies which I/O
unit is allowed to transfer data to or from the buffer.
The vector is software configurable, since not all I/O units will exist in every
system. Also, by configuring the priority of the I/O units, a system with a
particularly high load on a specific I/O unit can be configured to let that unit
transfer data more often. However, the DSB is designed to fulfill the demands of all
I/O units at a very high load, so in the vast majority of cases no changes to the
priority vector will be needed.
The memory functions as a circular buffer: when it is read, the current
pointer steps to the next element in the vector. Writes to the vector are handled
during the same cycle, while a read returns the data during the next cycle.
Memory size: 64 entries of 4 bits each, total size 256 bits.
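The circular read behaviour of the priority vector can be shown with a small Python sketch. It assumes that each 4-bit entry holds the number of an I/O unit, so that a unit listed in more slots is granted the bus more often; the class and method names are hypothetical and the one-cycle read latency is not modelled.

```python
class IoPriorityVector:
    """64 entries of 4 bits, read out in a circular fashion."""

    ENTRIES = 64

    def __init__(self, unit_ids):
        if len(unit_ids) != self.ENTRIES:
            raise ValueError("the vector holds exactly 64 entries")
        if any(not 0 <= u < 16 for u in unit_ids):
            raise ValueError("each entry is 4 bits wide")
        self.slots = list(unit_ids)
        self.current = 0

    def read(self):
        """Return the I/O unit for this slot and step to the next element."""
        unit = self.slots[self.current]
        self.current = (self.current + 1) % self.ENTRIES
        return unit


# Unit 1 is listed in three of every four slots, unit 2 in the remaining one.
vec = IoPriorityVector([1, 1, 2, 1] * 16)
first_eight = [vec.read() for _ in range(8)]
```

With this configuration unit 1 is offered three transfer slots for every slot offered to unit 2.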
Command buﬀer
The command buffer buffers commands sent from the software to the I/O units. The
commands are then broadcast to all I/O units when the execution pattern says so.
There are two types of commands that pass through the command buffer: the
transfer start (read and write) and the abort commands. For read and write, the unique
id of a specific I/O unit, device and channel must be given. It is a 22-bit value.
For abort, the VCD ptr is enough.
The memory functions as a circular buffer: when 32 bits are written or read, it
automatically points to the next place in its memory. Writes are handled during
the same cycle, while a read returns the data during the next cycle.
Table 6.12: Command buffer: entry contents.
Name      Size (in bits)  Comments
Command   4               Read, Write or Abort
ID        22              I/O unit, device, channel
VCD ptr   6
Memory size: 6 entries of 32 bits each, total size 192 bits (will probably be more
than enough).
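The fields of table 6.12 add up to exactly one 32-bit word (4 + 22 + 6). A possible encoding is sketched below in Python. The field order within the word and the numeric command codes are hypothetical, since the thesis only specifies the widths.

```python
CMD_BITS, ID_BITS, PTR_BITS = 4, 22, 6    # widths from table 6.12

# Numeric command codes are hypothetical; the thesis does not define them.
CMD_READ, CMD_WRITE, CMD_ABORT = 1, 2, 3


def encode_command(cmd, unit_id, vcd_ptr):
    """Pack a command into one 32-bit word laid out as [cmd | id | vcd_ptr]."""
    for value, bits in ((cmd, CMD_BITS), (unit_id, ID_BITS), (vcd_ptr, PTR_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field out of range")
    return (cmd << (ID_BITS + PTR_BITS)) | (unit_id << PTR_BITS) | vcd_ptr


def decode_command(word):
    """Split a 32-bit word back into (cmd, id, vcd_ptr)."""
    vcd_ptr = word & ((1 << PTR_BITS) - 1)
    unit_id = (word >> PTR_BITS) & ((1 << ID_BITS) - 1)
    cmd = word >> (ID_BITS + PTR_BITS)
    return cmd, unit_id, vcd_ptr
```

An abort would then be encoded with the ID field left at zero, for example encode_command(CMD_ABORT, 0, 63).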
Low priority (LP) list
This is one of the two queues of the DSB’s queue system described in section 6.2.
The buffer master and the flush command enqueue entries in the list and the
queue master dequeues them. The entries consist of a VCD ptr and a flush bit. The
flush bit tells the queue master that this is a flush and not a normal transfer (it
might flush only a few bytes). The list works like the command buffer, except that each
memory cell is seven bits large. We used 16 entries, but we did not have any scientific
basis for this choice. The required size should be estimated by simulation.
Memory size: 16 entries of 7 bits each, total size 112 bits.
High priority (HP) list
This is the second and faster queue of the DSB’s queue system described in section 6.2.
The high priority queue is like the low priority queue except that it has a resend
bit instead of a ﬂush bit. This bit signals the queue master to resend the last transfer
from the main memory.
Memory size: 16 entries of each 7 bits, total size 112 bits.
Memory transfer buﬀer
This is the buffer that the memory transfer master uses to bridge between the 256 bits
wide main memory arbiter data bus and the 128 bits wide buffer data bus.
It is basically a memory that can be read and written 128 bits or 256 bits at a time.
It is therefore divided into a high and a low part of 128 bits each. It can be written
or read either one part at a time or as a whole, depending on which bus is accessing
it.
Memory size: 256 bits
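The bridging role of the memory transfer buffer can be illustrated with the following Python sketch, in which the main memory side accesses all 256 bits and the buffer side accesses one 128-bit half at a time. The class and method names are hypothetical and timing is not modelled.

```python
class MemoryTransferBuffer:
    """256-bit buffer readable and writable as a whole or as two 128-bit halves."""

    HALF_MASK = (1 << 128) - 1

    def __init__(self):
        self.low = 0     # bits 0..127
        self.high = 0    # bits 128..255

    def write_full(self, value):
        """Main memory side: store a 256-bit value."""
        self.low = value & self.HALF_MASK
        self.high = (value >> 128) & self.HALF_MASK

    def read_full(self):
        """Main memory side: fetch the whole 256-bit value."""
        return (self.high << 128) | self.low

    def write_half(self, high, value):
        """Buffer side: store 128 bits into the high or the low part."""
        if high:
            self.high = value & self.HALF_MASK
        else:
            self.low = value & self.HALF_MASK

    def read_half(self, high):
        """Buffer side: fetch the high or the low 128-bit part."""
        return self.high if high else self.low
```

A transfer from main memory to the buffer would use one write_full followed by two read_half calls, and the opposite direction two write_half calls followed by one read_full.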
VCD alter queue
This is the queue that the software uses when it wants to write to the VCD memory.
It is a circular memory.
We have made the VCD alter queue three entries large. It needs two cycles to read
and write the large data entries.
Memory size: 3 entries of each 288 bits, total size 1152 bits.
Table 6.13: VCD alter queue: entry contents.
Name      Size (in bits)  Comments
Command   4               Read, Write or Create
ID        22              I/O unit, device, channel
VCD ptr   6
VCD       256
VCD memory
This is a byte-oriented, two-ported memory. Each transfer takes two cycles: the
first cycle is used for setup and the second for the transfer. This requires that the
memory is pipelined.
Memory size: 64 entries of 256 bits, total size 16384 bits.
Chapter 7
Simulation and testing
This chapter describes the process of taking the system architecture suggestion of
chapter 6 and making a simulation model from it. We also describe the tests
made on this simulation model and their results.
7.1 Simulator tool - SID
SID is a framework for building computer system simulations. [3] The basic block in
SID is called a component. A component is like a black box that acts like a speciﬁc
hardware logic block. SID provides buses and pins in order to make it possible for
components to interact with each other.
7.2 Implementation with SID
We implemented the hardware part of the suggested hardware buffer design in C and
C++ as a SID component. The software part is only partly implemented, since there
was no need to implement all of it to test the hardware. Nor is there anything in the
software part that should be complicated to implement.
To simulate the I/O units we wrote a SID component for a configurable generic
I/O unit that behaves like several I/O units. Our configuration setup was based upon
a normal SID snapshot, and we ran our tests on a simulated ARM processor without
any operating system.
Due to lack of time, we chose to implement the simpler and slower pipeline pattern and
the static VCD model (see table 6.10).
Our implementation of the DSB component is clock cycle exact. All internal and
external signals are simulated with delays to make the model behave like hardware. The
simulated loads were all set up in the same way. At startup a number of transfer
requests were queued. Each transfer request includes the transfer speed, an amount of
data, the size of that data and the direction. Since setting up a transfer is much faster
than completing one, most of the transfers are performed in parallel. To simplify
things we decided to keep the latency of the simulated DDR memory constant.
To verify that the DSB acted as expected, a graphical interface showing all signals and
memories was developed. Later, when we had validated that the DSB component
worked as expected, we replaced the visual inspection with comparisons between the
input and output data. The raw input data was selected so that as few errors as
possible could pass through the system undetected.
7.3 Simulation conclusions
Our simulations were mostly attempts to find out where and when the DSB fails, i.e.
when buffer underruns and buffer overruns occur. No matter how we divided the loads
between few channels with high bandwidth or many channels with low bandwidth, the
reason for failure was always the DDR memory bandwidth.
Our system design delivered what it promised. We simulated with several different
loads, and the configuration parameters that we suggested earlier worked.
The hardest parameters to estimate were the virtual channel sizes, and the
simulations showed that our estimates were acceptable. During the creation and
simulation of several configurations we did notice a couple of things in our design
that could have been done better.
7.3.1 Instruction buﬀer
To reduce the number of memory accesses and speed up the memory transfers, it would
be a good idea to add an instruction buffer. This instruction buffer should contain
the instructions for the four to six most recently used channels. Without an instruction
buffer, a channel's instructions have to be fetched each time its buffer needs to be
transferred, which results in a large overhead for slower channels. It is quite unlikely
that more than a few channels are doing transfers at the same time, so the number of
channels that have their instructions in the buffer can be rather low. It would require
a few hundred bytes of memory.
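One way to picture such an instruction buffer is as a small least-recently-used cache indexed by VCD ptr, as in the Python sketch below. The replacement policy, the class name and the fetch callback are assumptions, since the text only states that the last four to six used channels should be kept.

```python
from collections import OrderedDict


class InstructionBuffer:
    """Keeps the instruction blocks of the most recently used channels."""

    def __init__(self, fetch, capacity=4):
        self.fetch = fetch           # hypothetical callback: vcd_ptr -> instructions
        self.capacity = capacity     # the text suggests four to six channels
        self.entries = OrderedDict()
        self.memory_fetches = 0

    def get(self, vcd_ptr):
        """Return the instructions for a channel, fetching them only on a miss."""
        if vcd_ptr in self.entries:
            self.entries.move_to_end(vcd_ptr)       # mark as most recently used
            return self.entries[vcd_ptr]
        self.memory_fetches += 1
        instructions = self.fetch(vcd_ptr)
        self.entries[vcd_ptr] = instructions
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)        # evict least recently used
        return instructions
```

With 256-bit instruction blocks, four to six entries correspond to 128 to 192 bytes of instruction storage, which is in line with the estimate of a few hundred bytes once the tags are added.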
7.3.2 Conﬁguration
This system is complex enough to require a lot of testing before actual parameters
can be set. Most parameters are configurable from the software, but we decided
to keep the LP and HP levels fixed and the same for all kinds of transfers. This
decision does not seem to have been a good one, since it is hard to know in advance,
without extensive testing, what a suitable HP/LP level would be. It would therefore
be better to make the levels configurable for each type of channel. This would make
all parameters of the whole system configurable, which is good for possible future
needs.
7.3.3 Buﬀer size requirements
The buffer size requirements of all channels in the system depend on how fast the
fastest channel is. In a system with many slow channels and one fast channel, our
design must make sure that the slow channels have enough unused buffer space to last
through a whole memory transfer of the fast channel. This increases the buffer size
required for the slow channels, so that they do not run out of space while the fast
channel accesses the memory. One solution is to give the fast channel a low LP level,
so that it occupies the memory for a shorter time because it flushes less data each
time. However, this decreases the memory bus performance. A better way would be to
divide the memory access requested by the fast channel into several small ones. This
affects the performance only a little, and if a slow channel reaches its HP level, it
can be served between two of the fast channel's many accesses. A good size for each
part could be half a block or a full block.
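The proposed split can be sketched in Python as a helper that turns one large memory access into pieces of at most one block (or half a block), so that the queue master gets a chance to serve an HP job between two pieces. The function name and the (address, size) representation are assumptions.

```python
BLOCK_BYTES = 128    # block size used throughout the design


def split_transfer(start, length, chunk=BLOCK_BYTES):
    """Split one memory access into (address, size) pieces of at most `chunk` bytes."""
    if length < 0 or chunk <= 0:
        raise ValueError("length must be >= 0 and chunk must be > 0")
    pieces = []
    offset = 0
    while offset < length:
        size = min(chunk, length - offset)
        pieces.append((start + offset, size))
        offset += size
    return pieces
```

For example, a 300 byte access is split into two full blocks and a 44 byte remainder, and passing chunk=64 gives the half-block variant.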
7.3.4 LP and HP List behaviour
In our suggested implementation we wrote that the queue master shall check both the
LP and the HP list to see whether they contain any data. It would be better to add two
signals, one from the LP list and one from the HP list, that are high when the list
contains data and low when it is empty. That would make the queue master more efficient,
and jobs in the lists would be handled faster.
7.3.5 Queue master behaviour
From the beginning we decided that the queue master should always pick a task from
the HP list if there is one, and only do LP tasks when the HP list is empty. However, we
are not sure whether this can lead to bad behaviour, such as continuing to serve only
HP entries after a heavy load. This should be investigated further.
7.3.6 Virtual channels and buﬀer memory allocation
For some protocols the number of virtual channels can vary a lot. For USB and
IEEE1394 the number of virtual channels that are open at the same time can be
just a few or several hundred. The software must therefore be careful not to give
virtual channels unnecessarily much buffer space, or in the worst case it could run
out of free buffer space.
The buffer space each channel gets is based upon its maximum transfer rate.
However, if several transfers are going on at the same time, none of the transfers will
achieve the maximum transfer rate. As a result, the total amount of buffer
space that all channels have allocated is far more than a single channel transferring at
top speed would need.
7.3.7 Experience with SID
In order to see whether our design idea could actually work, we tried to make the
model behave like hardware as much as possible. We did not find the SID framework
helpful for that. SID is much more suitable for a higher level of abstraction, where
the interaction between the components and the software is the main focus, not the
behaviour inside a component, as it was in our case.
However, SID might provide useful help later, when suitable configuration values are
to be found during testing. Implementing hardware components that require
software drivers in SID before the components exist in hardware can be useful for
software developers, because they can start writing and testing code much earlier.
For those hardware components, however, the behaviour should be known in detail and
tested, so that the person implementing them in SID does not need to think about
the internal behaviour.
7.3.8 I/O transfer problem
There is a problem with the WR, C and D transfer model if an I/O unit requests
data from the buffer that is spread over two different blocks which are not adjacent.
Since the requested data is not in one contiguous area but in two separate places in
the buffer, we cannot address all of the data in one cycle. Possible solutions to this
problem are: to let the I/O units keep track of the block limits; to send only the data
in the first block and send the size of the transfer on a separate bus; to use a
two-ported buffer; or to always have the D stage two cycles after the C stage, so that
there is always time to read from two separate places.
Chapter 8
Conclusions
The conclusions drawn from the development and the simulations are that our design
is very suitable for SoCs that shall support a large range of high speed and complex
protocols. Due to the flexible design of the DSB, it is easy to add I/O units that
handle protocols not mentioned in this report. The design scales nicely,
and the only hard limit on how many I/O units can be connected is the DDR
memory bandwidth.
A rough size estimation made by the Axis hardware designers shows that our
design will only result in a minor increase of the number of gates required compared
to the current ETRAX DMA system.
Chapter 9
Future investigations
There are several tasks that could be analyzed further.
9.1 Conﬁguration values
There are several values in the DSB that can be conﬁgured. Some of those values are
hard to estimate without realistic simulations.
9.2 Packet sizes and buﬀer size
To optimize the VCD design and to configure it as well as possible, an extensive
simulation is needed of how the different protocols behave with respect to typical
packet sizes. The results will also affect the number of packet ends needed in the VCD.
Such a simulation would show whether the estimated buffer sizes are acceptable.
9.3 Queues
Some simulations on realistic loads will probably show if two queues are needed or
not and provide suitable queueing levels for the diﬀerent types of protocols.
9.4 Bridge I/O unit transfers
A possible future investigation is to analyze whether it is a good idea to let the CPU
ask the DSB for wanted data during the same cycle as it asks the level 1
cache.
Another interesting approach would be to investigate whether the buffer could cache
data that shall be transferred from one I/O unit to another.
9.5 Virtually addressed memory
In this investigation we have assumed that all data lies in physically addressed
memory, while in reality it will be in virtually addressed memory. To achieve higher
performance, it could be worthwhile to let the DSB understand virtual addresses.
Bibliography
[1] About AXIS. http://www.axis.com/corporate/corp/about axis.htm.
[2] AXIS ETRAX 100LX SoC overview. http://devloper.axis.com/products/etrax100lx/index.html.
[3] SID simulator homepage. sources.redhat.com/sid/.
[4] Christopher E. Strangio. The RS232 Standard.
http://www.camiresearch.com/Data Com Basics/RS232 standard.html, 1993-1997.
[5] Allan Graham. SDRAM/SGRAM DDR An Interpretation of the JEDEC Standard. August 1998.
[6] USB 2.0 Promoter Group. Universal Serial Bus Specification Revision 2.0. 2000.
[7] TIA Subcommittee TR-30.2. TIA/EIA-232-F. September 1997.
[8] SerialATA workgroup. Serial ATA: High Speed Serialized AT Attachment.
http://www.serialata.org/collateral/zipdownloads/serialata10a.ZIP, January 2003.
[9] IEEE 802.3 working group. IEEE Std 802.3 (Revision of IEEE Std 802.3, 2000
edition). 2002.
[10] IEEE Computer Society. P1394 working group. IEEE Standard for a High Performance
Serial Bus. Std 1394-1995. 1996.
[11] IEEE Computer Society. P1394a working group. IEEE Standard for a High
Performance Serial Bus - Amendment 1. Std 1394a-2000. 2000.
[12] IEEE Computer Society. P1394b working group. IEEE Standard for a High
Performance Serial Bus - Amendment 2. Std 1394b-2002. 2002.
Appendix A
Terms and abbreviations
ASIC: Application Speciﬁc Integrated Circuit.
Asynchronous transfer: Not synchronized; that is, not occurring at
predetermined or regular intervals. The term
asynchronous is usually used to describe
communications in which data can be transmitted
intermittently rather than in a steady stream.
Backplane: A circuit board containing sockets into which
other circuit boards can be plugged in.
Block: A piece of buﬀer memory, handled as the smallest
unit. Can be 32, 64 or more bytes.
Buﬀer memory: This is the memory that acts as a buﬀer between
the main memory and the I/O units.
DMA: Direct Memory Access. A way to transfer data between
the main memory and other devices.
DMA channel: A data transfer lane between the main memory
and an I/O unit.
DSB: Data Stream Buﬀer.
FIFO: First In-First Out. A memory where the data stored
ﬁrst comes out ﬁrst.
Gbit: 1024 Mbits
Gbyte: 1024 Mbytes
I/O device: The physical device that sends/receives data to/from
the Main memory via the DSB (e.g. a mouse, computer
or hard drive).
I/O unit: The controller inside the processor system
that is protocol specific and talks to the I/O device.
Isochronous transfer: Time dependent. A transfer that needs to be
delivered to the destination within a specified time
interval. Good for data that needs a minimum
bandwidth, for example voice recording.
kbit: 1024 bits
kbyte: 1024 bytes
Mbit: 1024 kbits
Mbyte: 1024 kbytes
Main memory: See memory.
Memory: This is the Main memory of the whole host system.
Not to be mixed up with the Buﬀer memory.
Read: Data transaction from an I/O device to the Main
memory.
SID: SID is a component-based simulator for hardware.
Synchronous transfer: Occurring at regular intervals. The opposite of
synchronous is asynchronous. Most communication
between computers and devices is asynchronous.
Transfer: Data that is sent in an arbitrary direction.
VC: See Virtual Channel.
VCD: See Virtual Channel Descriptor.
VCD ptr: A pointer to a Virtual Channel Descriptor.
Virtual Channel: A data transfer channel between the software
and an I/O unit. A physical unit can have several
channels. A channel is based upon how the data is
transferred between the host and the device. Each
virtual channel has just one direction.
Virtual Channel Descriptor: The meta-data for the virtual channel. It describes
the properties of the virtual channel, what memory
that is allocated, what kind of device it is
attached to.
Write: Transaction from the main memory to an I/O device.