Ongoing research in adaptive protocols and active networks has presumed that flexibility is offered exclusively through software systems, and the performance implications have generated considerable skepticism. The Programmable Protocol Processing Pipeline (P4) exploits the dynamic reconfigurability of RAM-based Field Programmable Gate Arrays (FPGAs) to provide both hardware performance and dynamic functionality to network components.
Introduction
A desire for flexible network infrastructures has stimulated research into adaptive protocols and active networks. This research [6] has presumed that flexibility is offered exclusively through software systems, and the performance implications have generated considerable skepticism. In particular, a number of researchers [12] have proposed that programmability be restricted to the control plane, as they believe that high data throughput cannot be achieved concurrently with dynamically interposed functions.
However, flexibility is not exclusive to software systems: new programmable logic devices can be reprogrammed rapidly enough that network components can operate at hardware speeds while providing dynamic functionality. The growth in size and speed of state-of-the-art programmable logic devices has stimulated new fields of research, e.g., reconfigurable computing [17].
*This research was supported by DARPA under Contracts #NCR95-%0963 and #DABT63-95-C-0073. Additional support was provided by the AT&T Foundation, the Hewlett-Packard Corporation, the Intel Corporation and the Altera University Grants Program.
We are exploring the application of dynamically reconfigurable hardware to adaptive protocols and active networks. To explore the design space where high speed requirements make software implementation a bottleneck, we have constructed an FPGA-based architecture called the Programmable Protocol Processing Pipeline (P4) [10]. We thus achieve functional acceleration with special purpose hardware while maintaining software-like flexibility of the system.
We focus on the example of TCP/IP performance in a noisy environment. We protect against noise-induced errors with forward error correction (FEC), and demonstrate a convolutional encoder and Viterbi decoder operating at OC-3 (155 Mbps) data rates on the P4. The next section briefly describes the P4 architecture.
P4 Architecture
The architecture of the P4 is shown in Figure 1. It comprises a set of RAM-based FPGA devices (Altera FLEX 8000 [1]) in a pipeline, with a switching array selecting which devices are engaged in processing a data stream. FPGA devices allow protocol processing algorithms to be implemented in hardware, while providing dynamic functionality through run-time reconfiguration.
Processing elements in the P4 are organized into a pipeline of programmable logic devices interconnected by the switching array. Each device has a FIFO buffer associated with it. A processing element reads the data from its FIFO buffer, performs its processing, and writes into the FIFO buffer associated with the next device in the chain. Connection to the next device is achieved via the switching array. The switching array can dynamically include or exclude processing elements, or reorder them on an as-needed basis.
When needed, a protocol processing function (in the form of an FPGA configuration) is added by downloading it into a free device and inserting this device into the pipeline chain. Unnecessary functions are switched out of the processing chain and the device becomes free. Altera's FLEX 8000 devices require about 100 ms to be reloaded, but can be switched in and out of the data path within a microsecond. The gate arrays can thus be viewed as a cache for selected protocol processing functions.
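The load-then-switch behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class and method names (`ProcessingElement`, `Pipeline`, `switch_in`, `switch_out`) are hypothetical and do not come from the P4 design, and the FPGA configurations are stood in for by plain Python functions.

```python
from collections import deque


class ProcessingElement:
    """One FPGA device plus the FIFO that feeds it (illustrative names only)."""

    def __init__(self, name):
        self.name = name
        self.function = None   # loaded configuration, or None if the device is free
        self.fifo = deque()

    def load(self, function):
        # Models the slow step: reloading a device (about 100 ms on the prototype).
        self.function = function


class Pipeline:
    """Models the switching array: an ordered chain of switched-in devices."""

    def __init__(self):
        self.chain = []

    def switch_in(self, device, position=None):
        # Models the fast step: changing the data path (within a microsecond).
        if device.function is None:
            raise ValueError("device must be loaded before it is switched in")
        index = len(self.chain) if position is None else position
        self.chain.insert(index, device)

    def switch_out(self, device):
        # The device leaves the data path and becomes free for reuse.
        self.chain.remove(device)

    def process(self, cell):
        # Each element reads from its own FIFO and hands the result to the next.
        for device in self.chain:
            device.fifo.append(cell)
            cell = device.function(device.fifo.popleft())
        return cell


if __name__ == "__main__":
    enc, mon = ProcessingElement("fpga0"), ProcessingElement("fpga1")
    enc.load(lambda cell: cell + b"E")   # stand-in for an encoder configuration
    mon.load(lambda cell: cell + b"M")   # stand-in for a monitor configuration
    pipe = Pipeline()
    pipe.switch_in(enc)
    pipe.switch_in(mon)
    print(pipe.process(b"cell"))
    pipe.switch_out(enc)
    print(pipe.process(b"cell"))
```

Keeping `load` and `switch_in` separate mirrors the cache view in the text: a configuration is loaded once and can then be switched in and out of the data path cheaply.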
The P4 prototype uses ATM cells as a convenient unit of processing. While the architecture is not ATM-specific, use of ATM allows interoperation with existing systems and validation of performance in 100+ Mbps operating regimes.
FEC Booster
The P4 prototype has been constructed as part of the Protocol Boosters project [5], which takes the approach of dynamically adding and deleting protocol functions. The P4 illuminates a design subspace where high speed requirements force the implementation of certain functions in hardware.
We have chosen an FEC as an example protocol processing function which might operate on an as-needed basis for greater efficiency. A convolutional encoder and Viterbi decoder were implemented to allow experimental evaluation. Our goal was not to construct a highly optimized code for a given link, but rather to explore the feasibility of performing a complex protocol processing function using the limited set of resources offered by the P4. Thus, the code was optimized for implementation on the P4 and operation at the P4's OC-3 data rate.
Bits of each data octet are grouped in four chunks of two bits and encoded independently using four parallel, rate 1/2, constraint length 3 convolutional encoders. Each encoder accepts two bits from the current octet and produces four output bits. The four parallel encoders thus produce two octets of data for each input octet.
Robustness
An important issue in protocol design is robustness. Although it protects user data from bit errors, convolutional encoding may increase the risk of other impairments such as cell losses and cell misinsertions if no countermeasures are applied. In general, the output of the Viterbi decoder depends on the history of its inputs. If a cell is lost, missing data may cause unpredictable behavior, and the error can propagate far into the future. To improve robustness in such cases, the encoder resets its state every 24 bytes (half of the 48-byte ATM cell payload) and the decoder resets its state after every cell.
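The coding scheme can be made concrete with a software sketch of a rate 1/2, constraint length 3 convolutional encoder, four of which run in parallel on each octet, together with a hard-decision Viterbi decoder for one stream. Several details are assumptions, not taken from the text: the generator polynomials (the common pair 7 and 5 in octal), the bit ordering within an octet, and the way the four outputs are packed into two octets.

```python
G = (0b111, 0b101)  # ASSUMED generator polynomials (7, 5 octal); not given in the text


def parity(x):
    return bin(x).count("1") & 1


def conv_encode(bits, state=0):
    """Rate 1/2, constraint length 3 encoder. Returns (coded bits, final state)."""
    out = []
    for b in bits:
        reg = (b << 2) | state                 # newest bit in the top position
        out.extend(parity(reg & g) for g in G)
        state = reg >> 1                       # keep the two most recent bits
    return out, state


def encode_octet(octet, states):
    """Encode one octet with four parallel encoders, giving two output octets."""
    word = 0
    for i in range(4):
        shift = 6 - 2 * i                      # ASSUMED ordering: most significant chunk first
        chunk = [(octet >> (shift + 1)) & 1, (octet >> shift) & 1]
        coded, states[i] = conv_encode(chunk, states[i])
        for bit in coded:
            word = (word << 1) | bit
    return bytes([word >> 8, word & 0xFF])


def encode_payload(payload, reset_every=24):
    """Encode a payload, resetting all encoder states every `reset_every` bytes."""
    out = bytearray()
    states = [0] * 4
    for n, octet in enumerate(payload):
        if n % reset_every == 0:
            states = [0] * 4
        out += encode_octet(octet, states)
    return bytes(out)


def viterbi_decode(coded, n_bits):
    """Hard-decision Viterbi decoding of one stream that started in state 0."""
    inf = float("inf")
    metric = [0, inf, inf, inf]
    paths = [[], [], [], []]
    for t in range(n_bits):
        rx = coded[2 * t: 2 * t + 2]
        new_metric = [inf] * 4
        new_paths = [None] * 4
        for s in range(4):
            if metric[s] == inf:
                continue
            for b in (0, 1):
                out, nxt = conv_encode([b], s)
                m = metric[s] + (out[0] != rx[0]) + (out[1] != rx[1])
                if m < new_metric[nxt]:
                    new_metric[nxt] = m
                    new_paths[nxt] = paths[s] + [b]
        metric, paths = new_metric, new_paths
    best = min(range(4), key=lambda s: metric[s])
    return paths[best]


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0, 0, 1, 0] * 6        # 48 bits: one encoder's share of 24 bytes
    coded, _ = conv_encode(bits)
    coded[10] ^= 1                             # inject a single bit error
    print(viterbi_decode(coded, len(bits)) == bits)
    print(len(encode_payload(bytes(48))))
```

Because each encoder restarts from state 0 at every reset boundary, a decoder can also restart from a known state there, which is what limits error propagation after a lost cell.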
The encoder generates two cells for each input cell. Both cells have the same value for the user indication bit in the ATM header. If an encoded cell within the AAL-5 protocol data unit (PDU) is lost, there will be a mismatch in the user indication bit at the end of the AAL-5 PDU. Prior to decoding, data are passed through the front end processing unit, which checks for matching user indication bits. Only pairs of cells that match are passed on for further decoding. If a mismatch is found, an all-zero cell with the appropriate user indication bit is inserted, as shown in Figure 3.
This will, of course, result in a series of bit errors after decoding, but will prevent any error propagation that might otherwise result. Only the AAL-5 PDU whose cell has been lost will be affected. It can be easily verified that the front end processing unit will also successfully isolate bad AAL-5 PDUs in the case of cell misinsertion, thus avoiding error propagation.
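The pairing check can be sketched in software as follows. This is a simplified illustration, not the logic of Figure 3: the `Cell` type and the `front_end` function are hypothetical, and the rule used here (pair a cell with its successor when their user indication bits match, otherwise pair it with an all-zero filler cell and keep the successor for the next pair) is one plausible reading of the text.

```python
from dataclasses import dataclass

CELL_PAYLOAD = 48  # bytes of payload in an ATM cell


@dataclass
class Cell:
    uib: int        # AAL-5 user indication bit from the ATM header
    payload: bytes


def front_end(cells):
    """Return a list of (cell, cell) pairs with matching user indication bits.

    On a mismatch, the first cell is paired with an all-zero filler cell that
    carries the same user indication bit, and the second cell starts the next pair.
    """
    pairs = []
    i = 0
    while i < len(cells):
        first = cells[i]
        if i + 1 < len(cells) and cells[i + 1].uib == first.uib:
            pairs.append((first, cells[i + 1]))
            i += 2
        else:
            pairs.append((first, Cell(first.uib, bytes(CELL_PAYLOAD))))
            i += 1
    return pairs


if __name__ == "__main__":
    # A PDU of four encoded pairs (uib 0,0,0,0,0,0,1,1) with one cell lost,
    # followed by an intact PDU of two pairs.
    uibs = [0, 0, 0, 0, 0, 1, 1] + [0, 0, 1, 1]
    stream = [Cell(u, bytes([n]) * CELL_PAYLOAD) for n, u in enumerate(uibs)]
    for a, b in front_end(stream):
        print(a.uib, b.uib, "filler" if b.payload == bytes(CELL_PAYLOAD) else "")
```

In this example the damaged PDU still decodes to the expected number of pairs, with errors confined to it, and the following PDU is paired correctly.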
Application
An important motivation for the Protocol Boosters concept is the problem encountered when protocols optimized for certain conditions operate outside those conditions; they perform extremely poorly. Flexible adaptive protocols and active networks cope with this problem by dynamically adapting the protocol stack to one appropriate for the current conditions in the network.
Wireless ATM [14] is an example where the protocol requires modification, as the original assumptions for ATM link reliability are no longer met. In an effort to improve the link quality, modifications of the link layer that incorporate strong FEC in combination with ARQ have been suggested [4, 13, 3, 18]. It is, of course, unlikely that an optimal error control scheme meeting the needs of all applications under all possible conditions exists. In [15] the author of the NEC Wireless ATM prototype [7] has pointed out that each service type will require an appropriate error control scheme, implying that the error control is not a static mechanism.
[13] considered protecting only the header of the ATM cell to prevent extensive cell losses and misroutings, and leaving the protection of the payload to the higher layers depending on the desired quality of service.
In addition to the different error control schemes needed for different service types, the bit error rate on a wireless link changes over time.
For adaptive protocols, FEC can be viewed as a functional element of the protocol stack which can be added, removed or changed on an as-needed basis. With a spectrum of FEC implementations of varying strengths and complexities available, a dynamic protocol can select the implementation that best fits the current conditions and QoS requirements.
In the enhanced network infrastructure provided by the P4, different FEC implementations are available as FPGA configurations. When an appropriate coding scheme is selected, processing elements in the P4 are configured and the result is the P4 operating as specialized hardware in the network. If the FEC algorithm must be replaced, the processing element is reconfigured and new specialized hardware is activated, reusing the same physical device.
Experiments
Our experimental work evaluates the effect on link throughput of the FEC implemented on the P4. With a tunable bit error rate induced on the link, we measured the TCP throughput seen by the receiver with and without the FEC booster described in Section 3.
Test Setup
The experimental setup is shown in Figure 4. The host is an Intel Pentium PC running Linux kernel release 2.0.29, with the "ATM on Linux" [2] patch and a Fore Systems PCAlOOE ATM adaptor [8]. Throughput testing is done with ttcp. For convenience, we used a single test machine with source and sink running as two separate processes. Since we were interested in testing the impact of the P4 on TCP throughput and not the impact of the workstation, this setup can deliver useful results.
Cells transmitted by the workstation are encoded using the first P4 in the test setup. At the output of the first P4, the utilized bandwidth is twice the bandwidth generated by the workstation due to the additional cells. To prevent buffer overflows in the operating P4, the device driver in the workstation must be rate controlled. Our rate limiting mechanism forces an idle period between the transmission of two consecutive packets so that the encoder in the P4 has an opportunity to insert all generated packets. There are tradeoffs among the buffer size on the P4, the maximum segment size for IP running over the link, and the length of the enforced idle period.
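A rough lower bound on the enforced idle period follows from the doubling of the cell stream. The sketch below assumes that each packet is carried in one AAL-5 PDU (8-byte trailer, padded to whole 48-byte payloads), that the encoder output link runs at the nominal OC-3 line rate, and that no buffering absorbs the extra cells; the function names and the default rate are illustrative and ignore SONET framing overhead. The idle period used in the experiments is not stated in the text and may well have been longer than this bound.

```python
import math

ATM_CELL_BYTES = 53        # 5-byte header plus 48-byte payload
ATM_PAYLOAD_BYTES = 48
AAL5_TRAILER_BYTES = 8


def cells_per_packet(packet_bytes):
    """Number of ATM cells needed to carry one packet in an AAL-5 PDU."""
    return math.ceil((packet_bytes + AAL5_TRAILER_BYTES) / ATM_PAYLOAD_BYTES)


def min_idle_seconds(packet_bytes, link_bps=155.52e6, expansion=2):
    """Time needed to send the cells that the encoder adds for one packet.

    `expansion` is the ratio of encoded cells to input cells (2 for the
    encoder described here); `link_bps` is an ASSUMED nominal line rate.
    """
    extra_cells = cells_per_packet(packet_bytes) * (expansion - 1)
    return extra_cells * ATM_CELL_BYTES * 8 / link_bps


if __name__ == "__main__":
    for size in (40, 576, 1500):
        idle_us = min_idle_seconds(size) * 1e6
        print(size, cells_per_packet(size), round(idle_us, 2))
```

Under these assumptions the idle period grows with the segment size, which is one side of the tradeoff among buffer size, maximum segment size and idle period noted above.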
Encoded cells are passed through a noisy link, emulated by inserting bit errors with the Network Impairment Emulator [16]. We vary the bit error rate and measure the TCP throughput seen by the receiving process on the workstation with and without the FEC booster in place.
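A software stand-in for such a noisy link can be sketched as below, with exponentially distributed gaps between consecutive bit errors. This is not the Network Impairment Emulator: the function name `inject_bit_errors` and its parameters are hypothetical, and truncating the exponential gap to a whole number of bit times is a simplifying assumption.

```python
import random


def inject_bit_errors(data, ber, seed=None):
    """Flip bits in `data` with exponentially distributed gaps between errors.

    `ber` is the target bit error rate (mean gap of about 1 / ber bit times).
    """
    out = bytearray(data)
    if ber <= 0:
        return bytes(out)
    rng = random.Random(seed)
    total_bits = len(out) * 8
    pos = int(rng.expovariate(ber))            # gap before the first error
    while pos < total_bits:
        out[pos // 8] ^= 0x80 >> (pos % 8)     # flip one bit, most significant bit first
        pos += 1 + int(rng.expovariate(ber))   # move past it, then draw the next gap
    return bytes(out)


if __name__ == "__main__":
    clean = bytes(100_000)
    noisy = inject_bit_errors(clean, ber=1e-4, seed=1)
    flipped = sum(bin(byte).count("1") for byte in noisy)
    print(flipped, "bit errors in", len(clean) * 8, "bits")
```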
The second P4 board decodes the cells and corrects any correctable bit errors. Decoded data are passed through the Cell Protocol Processor [11], which acts as a passive monitoring device for link traffic and the error rate after decoding.
Results
We ran the ttcp throughput tests for four cases: (1) without P4 boards in the data path; (2) with P4 boards doing no processing; (3) with P4 boards doing no processing, rate control on; and (4) with P4 boards configured as FEC encoder and decoder.
We varied the bit error rate (BER) over several orders of magnitude, with an exponentially distributed time between two consecutive bit errors (i.e., a Poisson error distribution). Results from the first case provide a baseline measurement. In the second case we tested whether inactive P4 hardware had any impact on the results. The first two cases exhibit almost identical results: an enormous dropoff in TCP/IP performance in the face of bit errors (the throughput is on a logarithmic scale in Figure 5). This is due to TCP/IP's strategy in the face of packet loss, which is to assume that the loss was a result of congestion rather than noise. The result is that the TCP/IP congestion window is rapidly reduced to the point where the protocol becomes "stop-and-wait", with the consequences shown in Figure 5. The third case shows the effect of rate control, namely that the throughput starts off considerably lower (a factor of 4 less) but drops off as rapidly as in the first two cases in the face of errors. This test was performed to separate the costs of rate control from the costs associated with the FEC processing.
In the last experiment, we measured TCP throughput with the FEC in place, and rate controlled, as before. As expected for the low BER region, the FEC booster does additional processing and uses extra bandwidth for the redundancy, beyond the cost of rate control, taking its throughput to about 8 Mbps.
In the high BER region, the TCP protocol stack benefits from FEC, which reduces the number of retransmissions and keeps the TCP window size larger. Without FEC, TCP completely stalls at high BER, while it is still able to operate with FEC in place. Figure 5 shows a logarithmic plot of the mean measured throughput as a function of BER, and Table 1 shows 90% confidence intervals for the measured throughput. Log plots are used since the BERs of interest cover many orders of magnitude. The upper solid line presents the throughput without the P4 in the datapath; the dashed line following it is the throughput with an idle P4. The overhead introduced by the P4 hardware is negligible. The lower solid line shows the TCP throughput with rate control and an inactive P4. Finally, the dash-dotted line shows the measured throughput with the P4 running the FEC booster. For BERs above the point where the curves with and without FEC cross, TCP gains from FEC. Given our earlier explanation of TCP's response to packet loss, it should be clear that the FEC, in reducing the impact of noise, reduces the probability of the incorrect assumption of congestion. Thus, the performance is improved.
The graph in Figure 5 illustrates an opportunity for an adaptive protocol. In particular, the intersecting curves at a BER of ca. 10^-7 suggest that FEC be employed only when the BER exceeds 10^-7. Thus, a protocol booster's policy module would constantly monitor the conditions on the link (e.g., using AAL-5 CRC or IP checksums), and switch on the FEC when needed. In Figure 5, the line followed by an ideal adaptive protocol is marked by an "o". In the Protocol Boosters framework, the FEC processing is the mechanism, under control of the aforementioned policy.
Next Steps
A prototype policy module for the FEC boosters and a signalling protocol [9], which enables P4 boards distributed over the network to synchronize their activities, are operational. The prototype has been implemented on the out-of-path controller associated with the P4 board. Controlling software manages the P4 board and configures its processing elements. Selecting the appropriate booster and deciding when to activate it are the two central roles of the policy module. In one realization, the policy module is a combination of software running on the controller and configured hardware running on the P4, assigned to monitoring the conditions on the link and collecting the information necessary for policy decisions. Controlling software periodically polls the policy module and, based on the policy decision, configures the P4. A signalling protocol is used to exchange control messages with other P4 boards when the booster is distributed (e.g., FEC encoder and decoder).
We are designing a second generation of the P4. Besides some technical improvements, such as a better buffer management scheme which should reduce the cost of the rate control mentioned in Section 5, we plan to address some conceptual issues which limit the applicability of the current version of the P4.
The main limitation of the current version is the lack of buffering for local processing. There are also some small dependencies on ATM. Local buffering is essential in supporting transparent boosters, which do not modify the original packet. An example of a transparent booster is an FEC booster that sends the FEC packets in addition to the original packets. The price paid here is in the memory resources where the packet is stored during its construction. Due to the lack of local memory resources on the P4, implementation of transparent boosters is limited.
Generalizing Adaptive FEC in Hardware
The P4 demonstrates near-software flexibility and performance comparable to special purpose hardware. We used the example of a convolutional code to show that modern hardware allows a novel investigation of the design space of programmable network infrastructures. In particular, the P4 demonstrated flexibility by loading an FEC into its pool of FPGAs, and this flexibility was employed in end-to-end throughput tests using TCP on an ATM-attached workstation. The TCP results showed that the FPGA-resident code allowed TCP performance in a BER regime where the protocol was previously inoperable.
The performance tradeoffs of the system with and without the FEC suggest the use of a hybrid strategy, using the FEC as needed, a scheme to which the P4 is well-suited. We believe that wireless ATM applications are among the uses for such a scheme.
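The as-needed strategy can be sketched as a simple threshold policy that engages the FEC only when the estimated BER is above roughly 10^-7, the crossover suggested by Figure 5. The `FecPolicy` class, its method names and the lower switch-off threshold (added here as hysteresis to avoid rapid toggling) are assumptions for illustration; they do not describe the prototype policy module.

```python
class FecPolicy:
    """Decides whether the FEC booster should be in the data path."""

    def __init__(self, on_threshold=1e-7, off_threshold=5e-8):
        # on_threshold follows the crossover read from Figure 5;
        # off_threshold is an ASSUMED hysteresis margin, not from the text.
        if off_threshold > on_threshold:
            raise ValueError("off_threshold must not exceed on_threshold")
        self.on_threshold = on_threshold
        self.off_threshold = off_threshold
        self.fec_enabled = False

    def update(self, estimated_ber):
        """Feed one BER estimate; return True if FEC should be active."""
        if not self.fec_enabled and estimated_ber > self.on_threshold:
            self.fec_enabled = True
        elif self.fec_enabled and estimated_ber < self.off_threshold:
            self.fec_enabled = False
        return self.fec_enabled


if __name__ == "__main__":
    policy = FecPolicy()
    for ber in (1e-9, 2e-7, 8e-8, 1e-8):
        print(ber, policy.update(ber))
```

In a deployment along the lines described in the text, the controlling software would poll such a policy periodically and switch the encoder and decoder configurations in or out accordingly.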
The important result of this demonstration is that schemes such as Protocol Boosters and Active Networking for flexible network infrastructures are not limited to poor performance regimes. For functions which can be implemented within the area limitations of FPGAs at any point in time, hardware performance levels can be achieved. This result refutes much of the skepticism which exists in the networking community about the performance of these approaches, and hence their impact on real networks.
