Abstract
Introduction
In this paper, the capabilities of contemporary platform FPGAs are exploited through new design tools, in order to illustrate how highly mutable implementations of embedded networking for systems on chip can be produced, and how trade-offs between implementations can be rapidly investigated. A particular focus for investigation is the use of an embedded processor within the logic fabric of the FPGA. Although a hard processor is studied here, the general discussion is equally applicable to the use of a soft processor. The general methodology is demonstrated through the case study of implementing a simple web server on a platform FPGA. This server has a gigabit Ethernet connection to its environment, and uses a tailored embedded networking subset of the IP, TCP and HTTP protocols to communicate.
Two recent research tools were exercised in the implementation of the simple web server. The HAEC language [2], developed at Xilinx Research Labs as part of its research into network processing using platform FPGAs, was used to represent functions targeted at programmable logic. C was used to represent comparable functions targeted at an embedded processor. The PowerPC processor was automatically wrapped in a black box with a highly efficient, gigabit-rate interface, using a prototype tool that incorporates techniques from [1].
Web server functions
The web server implements only the minimum protocol subsets required to communicate, while always staying within the protocols. This is seen particularly in the case of TCP. Only a single TCP connection (and hence a single web request) is handled at a time, and it is always initiated by a client. The responder thus handles only basic TCP packet header processing, checksum calculations, and the connection setup (SYN/ACK) and connection teardown (FIN/ACK) subprotocols. Once a connection is set up, the TCP responder expects to receive an HTTP request message within a single TCP packet. It then immediately issues a TCP ACK packet, followed by the requested web page data in another TCP packet, and finally a teardown TCP FIN packet.
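The responder behaviour described above can be sketched as a small state machine. This is an illustrative Python model only, not the HAEC or C code used in the paper: the names `State`, `Segment`, `seg` and `Responder` are hypothetical, sequence numbers, checksums and retransmission are omitted, and the exact flags on each reply and the handling of the client's final FIN are assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    LISTEN = auto()        # no connection; waiting for a client SYN
    SYN_RECEIVED = auto()  # SYN/ACK sent; waiting for the client's ACK
    ESTABLISHED = auto()   # waiting for the single-packet HTTP request
    FIN_SENT = auto()      # page and FIN sent; waiting for the client's FIN


@dataclass(frozen=True)
class Segment:
    flags: frozenset       # subset of {"SYN", "ACK", "FIN"}
    payload: bytes = b""


def seg(*flags, payload=b""):
    """Convenience constructor for a segment (hypothetical helper)."""
    return Segment(frozenset(flags), payload)


class Responder:
    """Handles one client-initiated connection at a time."""

    def __init__(self, page):
        self.page = page
        self.state = State.LISTEN

    def receive(self, s):
        """Consume one incoming segment; return the segments to transmit."""
        if self.state is State.LISTEN and s.flags == {"SYN"}:
            self.state = State.SYN_RECEIVED
            return [seg("SYN", "ACK")]
        if self.state is State.SYN_RECEIVED and s.flags == {"ACK"}:
            self.state = State.ESTABLISHED
            return []
        if self.state is State.ESTABLISHED and s.payload:
            # The whole HTTP request is assumed to fit in this one packet:
            # acknowledge it, send the page, then start the teardown.
            self.state = State.FIN_SENT
            return [seg("ACK"), seg("ACK", payload=self.page), seg("FIN", "ACK")]
        if self.state is State.FIN_SENT and "FIN" in s.flags:
            self.state = State.LISTEN
            return [seg("ACK")]
        return []  # anything unexpected is silently dropped


if __name__ == "__main__":
    r = Responder(b"<html>hello</html>")
    for incoming in (seg("SYN"), seg("ACK"),
                     seg("ACK", payload=b"GET / HTTP/1.0\r\n\r\n"),
                     seg("FIN", "ACK")):
        replies = r.receive(incoming)
        print(sorted(incoming.flags), "->", [sorted(x.flags) for x in replies])
```

Because only one connection exists at a time and each request and response fits in one packet, the model needs four states and no buffering, which is in the spirit of the stripped-back TCP subset described here.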
Implementation trade-offs explored
To explore codesign trade-offs between programmable logic and embedded processor, eight different codesign points were selected for the web server implementation. At one extreme point, HTTP, all of the protocol handling is implemented in programmable logic alone. At the other extreme point, PPC, it is implemented on the embedded processor alone. Of the six intermediate points, the first of these, TCP, involves placing the simple HTTP processing into the PowerPC with all other protocol handling remaining in programmable logic. In the second intermediate 
Experimental results
The web server system was initially targeted at the XC2VP7 Virtex-II Pro platform FPGA, which includes eight multi-gigabit transceivers. The standard Xilinx ISE 6.3 tools were used to produce the bitstream for the FPGA from the VHDL description of the programmed soft platform generated by the HAEC compiler. The system was tested using a Xilinx ML300 board, which includes four gigabit Ethernet interfaces, one of which was connected by optical fiber to a Linux workstation that acted as a client for the server. The GMAC core, which has an eight-bit packet interface, was clocked at 125 MHz for full gigabit Ethernet rate. The threads in logic, which operate on 32-bit data, were clocked at 31.25 MHz. The PowerPC was clocked at 300 MHz, with the OCM bus running at 100 MHz. Table 1 reports the protocol handling latency (in nanoseconds), obtained from ModelSim simulations, associated with the programmable logic and the embedded processor for each codesign point. This latency does not include the time taken for the physical reception and transmission of packets, and also excludes any unavoidable latency introduced by arbitrary gaps between the receipt of packets.
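The two logic clock rates quoted above are rate-matched: an 8-bit interface at 125 MHz and a 32-bit datapath at 31.25 MHz both move one gigabit per second. A short check, in which the function name `throughput_bps` is purely illustrative and one data word is assumed to be transferred per clock cycle:

```python
def throughput_bps(width_bits, clock_hz):
    """Raw throughput of a datapath that moves one word per clock cycle."""
    return width_bits * clock_hz


gmac_bps = throughput_bps(8, 125e6)       # GMAC packet interface
threads_bps = throughput_bps(32, 31.25e6)  # threads in programmable logic

print(gmac_bps / 1e9, "Gbit/s on the GMAC side")
print(threads_bps / 1e9, "Gbit/s on the thread side")
```

Widening the datapath by a factor of four lets the protocol logic run at a quarter of the GMAC clock rate without losing line rate.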
Version name   Logic   PowerPC   Total
HTTP            1312         0    1312
TCP             1312      4391    5703
TCP/DATA        1312      6551    7863
TCP/SYN/FIN     1312      7951    9263
TCP/CHKS        1312     10111   11423
IP               928     15651   16579
ETH              576     22911   23487
PPC                0     27850   27850
Table 1. Protocol handling latency (ns)
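The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below transcribes the table, confirms that each total is the sum of the logic and PowerPC columns, and computes the ratio between the two extreme codesign points. The conversion of logic latency into clock cycles is an assumption: it supposes that logic latency is counted in periods of the 31.25 MHz thread clock (32 ns each).

```python
# Table 1: version name -> (logic ns, PowerPC ns, total ns)
TABLE_1 = {
    "HTTP":        (1312,     0,  1312),
    "TCP":         (1312,  4391,  5703),
    "TCP/DATA":    (1312,  6551,  7863),
    "TCP/SYN/FIN": (1312,  7951,  9263),
    "TCP/CHKS":    (1312, 10111, 11423),
    "IP":          (928,  15651, 16579),
    "ETH":         (576,  22911, 23487),
    "PPC":         (0,    27850, 27850),
}

LOGIC_PERIOD_NS = 32  # one period of the 31.25 MHz thread clock

for name, (logic, ppc, total) in TABLE_1.items():
    assert logic + ppc == total, name       # totals are column sums
    cycles = logic // LOGIC_PERIOD_NS       # assumed cycle count in logic
    print(f"{name:12s} total={total:6d} ns  logic cycles={cycles}")

# Ratio between the all-processor and all-logic codesign points.
speedup = TABLE_1["PPC"][2] / TABLE_1["HTTP"][2]
print(f"PPC / HTTP latency ratio: {speedup:.1f}")
```

Under that assumption every logic latency in the table is a whole number of clock periods, and the all-logic point is roughly 21 times faster than the all-processor point.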
When Ethernet MAC handling is moved to logic, the total latency decreases by 17%, and when the IP handling is also moved to logic, latency is reduced by an additional 23%. Moving the TCP checksum handling to logic gives a further 19% latency reduction. The most dramatic decrease occurs when the PowerPC is not used at all: protocol processing is then 21 times faster than on the PowerPC alone.
In general, the resource utilization for TCP handling is fairly low, but it is important to remember that this is a stripped-back embedded networking version of TCP. Performing the TCP checksums in logic is a fairly inexpensive codesign point. When the entire TCP handling is placed in logic, the resource utilization more than doubles compared with the case where TCP is handled entirely by the PowerPC.
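For reference, the TCP checksum is the standard Internet checksum: a 16-bit ones' complement sum over the data, complemented at the end. The sketch below is a plain Python rendering of that definition, not the paper's HAEC or C code; the function name is illustrative, and the TCP pseudo-header that a real implementation must also include in the sum is not assembled here.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as used by IP, TCP and UDP."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    sample = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(sample)))
```

One plausible reason this function is cheap in logic is that it reduces to an adder with end-around carry that accumulates as the packet streams past, whereas a processor must touch every word of the payload in software.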
The designs have also been investigated for the Virtex-4 FX platform FPGA. This device contains two hard tri-mode Ethernet MAC blocks, saving around 1400 LUTs. Increasing the clock rate of the PowerPC 405 from 300 MHz to 400 MHz yields the expected 25% reduction in latency.
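The 25% figure follows directly if processor latency is assumed to scale inversely with clock frequency, that is, if the number of processor cycles stays the same. A minimal check, with `latency_reduction` as an illustrative name:

```python
def latency_reduction(old_clock_hz, new_clock_hz):
    """Fractional latency reduction when the cycle count is unchanged."""
    return 1 - old_clock_hz / new_clock_hz


print(f"{latency_reduction(300e6, 400e6):.0%}")
```

This holds only for the part of the latency that is bound by the processor clock; anything limited by a slower bus or by the logic clock would not scale in the same way.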
Conclusions
This paper has illustrated a practical methodology for rapidly investigating trade-offs between different codesign points when implementing embedded networking protocols on a platform FPGA. A non-hardware expert was able to carry out the necessary programming for all of the codesign points within a total of six weeks. The resulting implementations have attractive attributes in terms of resources used and/or packet handling latency.
