Event Signaling Within Higher Performance Network Systems by Chung, Jeffrey D et al.
University of Pennsylvania 
ScholarlyCommons 
Technical Reports (CIS) Department of Computer & Information Science 
January 1996 
Event Signaling Within Higher Performance Network Systems 
Jeffrey D. Chung 
University of Pennsylvania 
C. Brendan S. Traw 
University of Pennsylvania 
Jonathan M. Smith 
University of Pennsylvania, jms@cis.upenn.edu 
Recommended Citation 
Jeffrey D. Chung, C. Brendan S. Traw, and Jonathan M. Smith, "Event Signaling Within Higher Performance 
Network Systems", January 1996. 
University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-96-15. 
This paper is posted at ScholarlyCommons. https://repository.upenn.edu/cis_reports/188 
Event Signaling Within Higher Performance Network Systems 
Abstract 
The Afterburner ATM Link Adapter has allowed us to evaluate three event-signaling schemes: polling, 
traditional interrupts and the clocked interrupts first investigated in our operating system work in 
AURORA. The schemes are evaluated in the context of a single-copy TCP/IP stack. The experimental 
results indicate that clocked interrupts can provide throughput comparable with traditional interrupts for 
dedicated machines (up to over 144 Mbps, the highest TCP/IP/ATM throughput reported), and better 
performance when the machines are loaded with an artificial workload. Polling, implemented to be used 
with an unmodified netperf measurement tool, was competitive for small TCP/IP socket buffer sizes 
(<32KB). We conclude that clocked interrupts may be preferable for applications requiring high 
throughput on systems with heavy processing workloads, such as servers. 
Event-signaling within Higher Performance 
Network Subsystems 
Jeffrey D. Chung, C. Brendan S. Traw and Jonathan M. Smith* 
University of Pennsylvania 
School of Engineering and Applied Science 
Computer and Information Science Department 
Philadelphia, PA 19104-6389 
{jdchung, traw, jms}@cis.upenn.edu 
1 Introduction 
1.1 Background 
This research is one of a series of results from an exploration of Asynchronous Transfer Mode (ATM) 
computer/network host interface architectures at the University of Pennsylvania, begun in 1990 
[Traw 93], as part of the ATM/SONET infrastructure of the AURORA Gigabit Testbed [Clark 93]. 
The goal of AURORA was to develop the technologies needed to support end-to-end gigabit per 
second networking between workstations. 
The initial research goal of the interface work was to identify and experimentally verify a kernel 
of services that were suitable for hardware implementation. These data movement and formatting 
intensive services include ATM Adaptation Layer (AAL) processing, segmentation-and-reassembly 
(SAR) and ATM demultiplexing. The architecture partitioned the protocol processing activities 
between the hardware host interface and software running on the host processor. Concurrent with 
the hardware implementation, software support appropriate for high-throughput networking had to 
be designed and implemented. This was an opportunity to explore non-traditional approaches to 
reducing copying, event-signaling, and application-programmer interfaces for IPC. Our schemes for 
reducing copying and a shared-memory IPC model have proven themselves effective [Smith 93], and 
while our clocked interrupt scheme was shown to be effective, it was not conclusively characterized 
with respect to other event-signaling schemes. 
*Work at Penn was supported by the National Science Foundation and the Advanced Research Projects Agency 
under Cooperative Agreement NCR-8919038 with the Corporation for National Research Initiatives, by the NSF under 
agreement CDA-92-14924, by Bell Communications Research under Project DAWN, by an IBM Faculty Development 
Award, and by the Hewlett-Packard Corporation. 
Brendan Traw is now at the Intel Architecture Laboratory, Hillsboro, OR and can be contacted at 
Brendan-Traw@ccm.jf.intel.com. 
1.2 Event-signaling and Clocked Interrupts 
Event-signaling within the network subsystem between the hardware network interface device and the 
software device driver is typically accomplished via polling or device-generated interrupts. In our 
implementation of an OC-3c ATM host interface for the IBM RS/6000 family of workstations [Traw 93, 
Smith 93], we replaced the traditional forms of this crucial function with "clocked interrupts." 
Clocked interrupts, like polling, examine the state of the network interface to observe events which 
require host operations to be performed. Unlike polling, which requires a thread of execution to 
continually examine the network interface's state, clocked interrupts perform this examination 
periodically upon the expiration of a fine-granularity timer. In comparison to traditional interrupts, 
clocked interrupts are generated indirectly by the timer and not directly by the state change event. 
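The distinction between the three schemes can be illustrated with a small event-count model. The sketch below is not taken from our implementation: the function names, the arrival times and the 10 microsecond polling loop cost are all hypothetical, and the model ignores handler execution time. It counts how often the handler runs, how often the adapter is examined, and the worst delay between an arrival and its detection.

```python
def interrupt_driven(arrivals_us):
    """Traditional interrupts: the device signals each state change directly."""
    return {"invocations": len(arrivals_us), "checks": 0, "worst_delay_us": 0}


def timer_driven(arrivals_us, period_us):
    """Examine the adapter once per period and service everything that has arrived.

    arrivals_us must be a non-empty list of positive integer times (microseconds).
    """
    ticks_with_work = set()
    worst_delay_us = 0
    for t in arrivals_us:
        tick = -(-t // period_us)  # index of the first tick at or after the arrival
        ticks_with_work.add(tick)
        worst_delay_us = max(worst_delay_us, tick * period_us - t)
    checks = -(-max(arrivals_us) // period_us)  # ticks needed to cover all arrivals
    return {
        "invocations": len(ticks_with_work),
        "checks": checks,
        "worst_delay_us": worst_delay_us,
    }


# Hypothetical arrival pattern: a burst, a pair, then a single packet.
arrivals = [100, 300, 500, 2100, 2150, 6000]

by_interrupt = interrupt_driven(arrivals)
by_clock = timer_driven(arrivals, period_us=2000)  # 500 Hz clocked interrupts
by_polling = timer_driven(arrivals, period_us=10)  # polling modeled as a 10 us loop

print(by_interrupt)
print(by_clock)
print(by_polling)
```

In this model polling is simply the limiting case of a very short period: it detects events almost immediately but examines the adapter hundreds of times, while the 500 Hz clock services the same six packets in three handler invocations at the cost of up to one period of added delay.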
1.3 Clocked Interrupt Discussion 
We developed an analytical model for clocked interrupt performance in our earlier work [Smith 93], 
and argued that for application workloads characterized by high throughput, heavy multiplexing, 
and/or "real-time" traffic, clocked interrupts should be more effective than either traditional polling 
or interrupts. For these intensive workloads, our analysis predicted that clocked interrupts should 
generate fewer context switches than traditional interrupts and require fewer CPU cycles than polling 
without significantly increasing the latency observed by the applications. We note that for 
traditional interrupts with interrupt service routines which detect additional packets enqueued on the 
adapter, many of the same benefits may accrue. Ramakrishnan [Ramakrishnan 93] has noted a 
problematic performance overload phenomenon known as receive livelock which clocked interrupts can 
help alleviate. 
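A back-of-the-envelope version of the context-switch argument can be written down directly. This is not the analytical model of [Smith 93]. It is a simplified sketch which assumes one handler invocation per packet for traditional interrupts (no batching in the service routine) and at most one invocation per tick for clocked interrupts. The throughput and packet size in the example are hypothetical.

```python
def packet_rate(throughput_mbps, packet_bytes):
    """Packets per second needed to sustain a given throughput."""
    return throughput_mbps * 1e6 / (8 * packet_bytes)


def handler_rate_bounds(throughput_mbps, packet_bytes, clock_hz):
    """Upper bounds on handler invocations per second under the stated assumptions."""
    rate = packet_rate(throughput_mbps, packet_bytes)
    return {
        "interrupt_per_s": rate,                # one invocation per packet
        "clocked_per_s": min(rate, clock_hz),   # never more than one per tick
        "packets_per_tick": rate / clock_hz,    # average batch size per tick
    }


# Hypothetical workload: 100 Mbps of 4096-byte packets, 500 Hz clock.
bounds = handler_rate_bounds(100, 4096, 500)
print(bounds)
```

Under these assumptions the per-packet scheme is bounded by the packet rate (about 3052 invocations per second here), while the clocked scheme is bounded by the clock rate regardless of load, which is also why it bounds the work done at interrupt level during overload.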
Our clocked interrupt implementation for the ATM subsystem of the IBM RS/6000 proved to 
be effective, but could not be directly compared with interrupts as the network interface was (by 
design) incapable of generating them. A second implementation of the hardware portion of the host 
interface architecture has been built as an OC-12c rate ATM Link Adapter for the HP Bristol Labs 
"Afterburner" [Banks 93] card. 
2 The HP Afterburner and UPenn ATM Link Adapter 
The hardware infrastructure for this evaluation consists of HP 9000/700 series workstations equipped 
with Afterburner generic interface cards and ATM Link Adapters. Figure 1 shows an Afterburner 
and ATM Link Adapter. The remainder of this section briefly describes the architecture and 
implementation of the Afterburner and ATM Link Adapter. 
2.1 Afterburner 
The Afterburner [Dalton 93], developed by HP Laboratories in Bristol, England, is based on Van 
Jacobson's WITLESS architecture. It provides a high speed generic packet interface which attaches 
to the SGC bus of the HP9000/700 workstations. A large pool of triple ported Video RAM (VRAM) 
is provided by Afterburner. The random access port of the VRAM is visible on the SGC bus allowing 
the VRAM to be mapped into the virtual address space of the workstation. The two serial ports 
are used to provide a bidirectional FIFOed interface to a network specific Link Adapter. Several 
additional FIFOs are provided to assist in the management of VRAM buffer tags. 
2.2 ATM Link Adapter 
A Link Adapter provides an interface between the general purpose Afterburner and a specific network 
technology. The UPenn SAR architecture [Traw 93] is the basis for the ATM Link Adapter. This 
architecture performs all per-cell SAR and ATM layer functions in a heavily pipelined manner which 
can be implemented in a range of hardware technologies. For the ATM Link Adapter the base SAR 
architecture has been extended to support a larger SAR buffer (up to 2 MB), AAL 5 including 
CRC32 generation and checking, and demultiplexing based on the full VPI, VCI, and MID. The 
performance of the implementation has been improved to 640 Mbps by using more advanced EPLD 
technology. Figure 2 shows the host/Afterburner/ATM Link Adapter configuration. 
Figure 1: Afterburner (left) and ATM Link Adapter (right) 
3 Implementation of the Clocked Interrupt Scheme on the 
Afterburner ATM Link Adapter 
We implemented the ATM Link Adapter device driver to operate in conjunction with HP Bristol 
"Single-Copy" TCP/IP [Edwards 94]. The kernel was modified to support a fine-granularity timer, 
as the standard 100 Hz soft clock rate was inadequate. We had initially tried to simply increase the 
frequency, but the number of dependencies elsewhere in the kernel caused a large number of failures. 
Our strategy then became one of transparently supporting an increased clock rate for the specialized 
network subsystem code, while allowing the remainder of the system to retain the standard frequency. This 
was done by modifying the operating system to increase the hardware clock interrupt rate, and 
changing the interrupt service vector to point to our specialized clock service routine rather than 
the usual hardclock interrupt service routine. Clock division is performed inside our code to call 
the hardclock interrupt service code at the proper rate. This approach, while aesthetically a bit 
crude, made the software engineering possible in a relatively short period. Thus, at each vector clock 
tick, occurring at the clocked interrupt clock rate, the link adapter is examined for packet arrivals. 
If packets are discovered the Interrupt Service Routine (ISR) for the ATM link adapter is invoked; 
this ISR provides the packet to the single-copy TCP/IP stack. 
Figure 2: ATM Link Adapter (block diagram; recoverable labels: Workstation, SGC Bus, Link Adapter, Reassembler, Physical Layer Interface, to network) 
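The clock-division arrangement described in this section can be sketched as follows. This is an illustrative model in Python, not the HP-UX kernel code: the class name, the callback names and the demo arrival ticks are all hypothetical stand-ins, and the model assumes the fast rate is an integer multiple of the standard rate.

```python
class FastClock:
    """Clock handler running at fast_hz that still drives hardclock at base_hz."""

    def __init__(self, fast_hz, base_hz, adapter_has_packets, adapter_isr, hardclock):
        if fast_hz % base_hz != 0:
            raise ValueError("fast_hz must be an integer multiple of base_hz")
        self.divisor = fast_hz // base_hz
        self.ticks_since_hardclock = 0
        self.adapter_has_packets = adapter_has_packets
        self.adapter_isr = adapter_isr
        self.hardclock = hardclock

    def tick(self):
        """Invoked on every hardware clock interrupt (at fast_hz)."""
        # Clocked interrupt: examine the adapter, run its ISR only if needed.
        if self.adapter_has_packets():
            self.adapter_isr()
        # Clock division: the rest of the system still sees base_hz.
        self.ticks_since_hardclock += 1
        if self.ticks_since_hardclock == self.divisor:
            self.ticks_since_hardclock = 0
            self.hardclock()


def run_one_second(fast_hz=500, base_hz=100, arrival_ticks=(3, 4, 250)):
    """Drive the handler for one simulated second and count what happened."""
    counts = {"examined": 0, "isr": 0, "hardclock": 0}
    current = {"tick": 0}

    def has_packets():
        counts["examined"] += 1
        return current["tick"] in arrival_ticks

    def isr():
        counts["isr"] += 1

    def hardclock():
        counts["hardclock"] += 1

    clock = FastClock(fast_hz, base_hz, has_packets, isr, hardclock)
    for i in range(fast_hz):
        current["tick"] = i
        clock.tick()
    return counts


print(run_one_second())
```

With a 500 Hz hardware rate and a 100 Hz standard rate, the adapter is examined 500 times in the simulated second while the stand-in for hardclock still runs exactly 100 times.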
Polling requires a continuous thread of execution to examine the state of the I/O device. Because 
our version of HP-UX lacks preemptive kernel threads, polling was implemented with a preemptable 
user process. To minimize the number of system calls, the device status flag was appropriately 
memory mapped for access by a user process. This allowed a user process to continually examine 
the state of the device in a preemptable thread of execution, albeit at some cost in overhead. The user 
process invokes the ISR through an ioctl() call; for measurement purposes we wrote a small helper 
daemon which performed this function rather than modify netperf, again at a cost in overhead. 
Preemptive kernel threads would remove both these additional sources of overhead. 
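The structure of such a polling helper can be sketched as below. This is only an illustration under stated assumptions: an anonymous memory map stands in for the memory-mapped device status flag, the `service` callback stands in for the ioctl() call that invokes the ISR, and the flag offset and the bounded iteration count are hypothetical. The actual daemon is not reproduced here.

```python
import mmap

STATUS_OFFSET = 0  # hypothetical position of the status flag in the mapping


def poll_loop(status, service, max_iterations):
    """Spin on the mapped status flag; call service() whenever it is set.

    Reading the flag costs no system call; only service() (the stand-in for
    the ioctl() that runs the ISR) would enter the kernel.
    """
    serviced = 0
    for _ in range(max_iterations):
        if status[STATUS_OFFSET] != 0:
            service()
            serviced += 1
    return serviced


def demo():
    status = mmap.mmap(-1, 1)      # anonymous mapping standing in for the device
    status[STATUS_OFFSET] = 1      # pretend the adapter flagged a packet arrival

    def service():
        # Stand-in for the ISR draining the adapter, which clears the flag.
        status[STATUS_OFFSET] = 0

    serviced = poll_loop(status, service, max_iterations=1000)
    status.close()
    return serviced


print(demo())
```

The loop examines the flag a thousand times to service one event, which is the CPU cost that polling trades for low detection latency.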
Thus, the current implementation includes support for interrupt generation as well as the 
examination of the card via polling or clocked interrupts. With support for all three types of state change 
notification, we could perform a comparative experimental evaluation of these mechanisms. 
4 Performance 
The hardware test configuration consists of the elements shown in Figure 3. Two HP 9000 Series 
700 Model 720 workstations are connected back-to-back via their Afterburner ATM Link Adapter 
subsystems. 
Figure 3: Experimental Setup (HP9000 Model 720, Afterburner, ATM Link Adapter; custom 160 Mbps physical layer; ATM Link Adapter, Afterburner, HP9000 Model 720) 
4.1 Measurements and analysis 
We analyzed the throughput of the resulting network stacks using the netperf tool [IND 95]. We have 
experimented with both ttcp and netperf, and have drawn two conclusions from these experiments. 
First, netperf results are reproducible; ttcp measurements exhibit significant variation in reported 
throughput, up to 20% in some cases. Second, netperf results correspond very closely with 
maximum ttcp reported throughputs. What this suggests is that netperf better controls the 
variables under study, while reducing noise from other factors. 
The results are given in Table 1. While behavior is more evident from graphs, space limitations 
precluded graphical presentation; the major observation is that polling does not keep up with the two 
other schemes for socket buffer sizes above about 32KB. All checksums were enabled for all tests; the 
measurements were performed on dedicated processors, with no other activity except for necessary system 
background processes. The tests were run with symmetric configurations; that is, both sender and receiver 
were using the same signaling mechanism. We will examine asymmetric configurations in future work. 
Table 1: TCP/IP Throughput as measured by netperf with 32KB messages (columns: socket buffer size in Kbytes, traditional interrupts, polling, and clocked interrupts at 500 Hz, 1 KHz, 2 KHz and 4 KHz) 
It is clear from the figures shown that at high clock rates, the clocked interrupt scheme is able to 
keep up with the traditional interrupt scheme, which is the best performer almost everywhere; the 
exception is polling, which does best for small packet sizes. In a lightly-loaded environment, 
interrupts would appear to be the best solution, except for some anomalous but repeatable results 
which show polling best for small socket buffer sizes. It is important to note that the maximum 
possible theoretical throughput from the link on which these tests are run is 144.9 Mbps, indicating 
that other limitations are bounding the measured TCP/IP throughput. 
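One way to arrive at a ceiling of this size is to assume that the 160 Mbps physical layer shown in the experimental setup carries standard 53-byte ATM cells, of which 48 bytes are payload. The sketch below works through that arithmetic. The derivation is an assumption on our part about how the 144.9 Mbps figure arises, and the function name is hypothetical.

```python
ATM_CELL_BYTES = 53     # standard ATM cell size
ATM_PAYLOAD_BYTES = 48  # payload carried by each cell


def cell_payload_rate_mbps(line_rate_mbps):
    """Payload bandwidth left after the 5-byte header of every ATM cell."""
    return line_rate_mbps * ATM_PAYLOAD_BYTES / ATM_CELL_BYTES


ceiling = cell_payload_rate_mbps(160)
print(round(ceiling, 1))
```

Under this assumption the cell header alone accounts for the gap between 160 Mbps and roughly 144.9 Mbps. AAL 5 trailers and padding and the TCP/IP headers consume part of the remaining payload bandwidth as well.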
Such a dedicated configuration is not characteristic of real environments, which are often loaded 
with other work and other network traffic. We created an artificial workload by continuously 
executing a "factor 99121010311157" command, which uses about 5.2 seconds of CPU time. This has a 
significant effect on the behavior of the three schemes, as can be seen by measuring the throughput 
with netperf with the artificial workload running on the receiver. In this case, for the socket buffer 
size of 262144 bytes, the traditional interrupts delivered 61.8 Mbps, polling delivered 48.6 Mbps, 
and clocked interrupts at 500 Hz gave 106.19 Mbps. 
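To put the loaded-receiver measurements side by side, the short calculation below expresses the clocked interrupt throughput as a multiple of each of the other two schemes. It uses only the three throughput values reported above. The dictionary keys and the helper name are illustrative.

```python
loaded_mbps = {
    "traditional interrupts": 61.8,
    "polling": 48.6,
    "clocked interrupts (500 Hz)": 106.19,
}


def speedup_over(others, reference, results):
    """Ratio of the reference scheme's throughput to each of the other schemes."""
    return {name: results[reference] / results[name] for name in others}


ratios = speedup_over(
    ["traditional interrupts", "polling"],
    "clocked interrupts (500 Hz)",
    loaded_mbps,
)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

On the loaded receiver, clocked interrupts at 500 Hz therefore delivered roughly 1.7 times the throughput of traditional interrupts and roughly 2.2 times that of polling.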
Another important factor in networking performance, and perhaps a more important parameter 
for distributed applications, is the round-trip latency induced by the software supporting the adapter. 
Since the hardware was a constant, we could directly compare the software overheads of the three 
schemes. This was done with the following test. An artificial network load was created using netperf 
with a socket buffer size of 262144 bytes and operating it continuously. Against this background load, 
ICMP ECHO packets of 4K bytes were sent to the TCP/IP receiver, which was where the 
event-signaling performance differences would be evident. Sixty tests were done to remove anomalies. 
Our results showed that traditional interrupts and clocked interrupts at 500 Hz performed similarly, 
yielding minimum, average and worst case times of 5/12/18 ms, and 4/11/25 ms, respectively. When 
the systems were not loaded, the performances were 3/3/3 ms and 4/4/6 ms. This suggests that 
clocked interrupts performed slightly better under heavy load, but slightly worse under unloaded 
conditions. 
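The latency cost of a clocked scheme on an otherwise idle system can be bounded from the clock rate alone. The sketch below assumes that an event arrives at a uniformly random point within a clock period and that the handler runs immediately at the next tick. It is a simplified illustration, not a model fitted to the measurements, and the function name is hypothetical.

```python
def detection_delay_ms(clock_hz):
    """Return (worst-case, mean) wait in ms before the next tick notices an event."""
    period_ms = 1000.0 / clock_hz
    return period_ms, period_ms / 2.0


for hz in (500, 1000, 2000, 4000):
    worst, mean = detection_delay_ms(hz)
    print(f"{hz} Hz: worst {worst} ms, mean {mean} ms")
```

At 500 Hz an event waits at most 2 ms and about 1 ms on average before it is noticed, which is of the same order as the small unloaded penalty observed for clocked interrupts. Raising the clock rate shrinks this wait at the cost of more frequent examination of the adapter.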
5 Conclusions 
Most importantly, the experiments reinforced the observation that packet size is the most important 
factor, by far, in maximizing observed throughput. However, such "jumbo-grams" are uncommon 
today (although they become more common with applications such as video-on-demand). Our study 
has shown the following to be true. First, clocked interrupts can provide throughput equivalent to 
the best throughput available from traditional interrupts; both methods provide better performance 
than polling as implemented here. Second, clocked interrupts provide higher throughput when the 
processor is loaded by a computationally-intensive process; this suggests that clocked interrupts 
may be a viable mechanism for heavily loaded systems such as servers, which might also suffer 
from Ramakrishnan's receive livelock. Third, clocked interrupts provide better round-trip delay 
performance when systems are heavily loaded for large ICMP ECHO packets. 
Taken as a whole, the data suggest that clocked interrupts may be an appropriate mechanism for 
many of the high-performance applications now being proposed, such as video-on-demand servers. 
We would also like to note that we have obtained the highest reported TCP/IP performance for an 
ATM network adapter at this point in time. 
6 Acknowledgments 
John Lumley's group at Hewlett-Packard's European Research Laboratories (Bristol, UK) 
collaborated on the Afterburner ATM Link Adapter (particularly David Banks and Costas Calamvokis) 
and provided the single-copy TCP stack (Aled Edwards and Chris Dalton) as a starting point for 
the work reported here. 
References 
[Banks 93] D. Banks and M. Prudence, "A High-Performance Network Architecture for a PA-RISC 
Workstation," IEEE JSAC 11(2), pp. 191-202 (Feb. 1993). 
[Clark 93] David D. Clark, Bruce S. Davie, David J. Farber, Inder S. Gopal, Bharath K. Kadaba, 
W. David Sincoskie, Jonathan M. Smith, and David L. Tennenhouse, "The AURORA Gigabit 
Testbed," Computer Networks and ISDN Systems 25(6), pp. 599-621, North-Holland (January 
1993). 
[Dalton 93] C. Dalton et al., "Afterburner: A network-independent card provides architectural 
support for high-performance protocols," IEEE Network, pp. 36-43 (July 1993). 
[Edwards 94] A. Edwards, G. Watson, J. Lumley, D. Banks, C. Calamvokis and C. Dalton, 
"User-space protocols deliver high performance to applications on a low-cost Gb/s LAN," in 
Proceedings, 1994 SIGCOMM Conference, London, UK, 1994. 
[IND 95] Hewlett-Packard Information Networks Division, "Netperf: A Network Performance 
Benchmark (Revision 2.0)", February 15, 1995. 
[Ramakrishnan 93] K. K. Ramakrishnan, "Performance Considerations in Designing Network 
Interfaces," IEEE JSAC 11(2), pp. 203-219 (Feb. 1993). 
[Smith 93] Jonathan M. Smith and C. Brendan S. Traw, "Giving Applications Access to Gb/s 
Networking," IEEE Network 7(4), pp. 44-52 (July 1993). 
[Traw 93] C. Brendan S. Traw and Jonathan M. Smith, "Hardware/Software Organization of a 
High-Performance ATM Host Interface," IEEE JSAC 11(2), pp. 240-253 (Feb. 1993). 
[Traw 95] C. Brendan S. Traw, "Applying Architectural Parallelism in High Performance Network 
Subsystems," Ph.D. Thesis, CIS Department, University of Pennsylvania, January, 1995. 
