As has already been shown, link-layer compression is very effective when used in packet networks.
Introduction
An important application of packet networks is the interconnection of private LANs. In this capacity, a packet network protocol acts as a common carrier, transporting data over WAN links. A typical network of this kind is shown in Figure 1. One of the main characteristics of these networks is that the cost of the WAN link infrastructure is comparatively high. By increasing the amount of traffic that can be sent over these links, we can increase the effectiveness of the whole network. We propose to do this by compressing the transmitted data.

In particular, our algorithm works as follows. At the ingress/egress of the LAN into the WAN, we collect all the cells that have the same header. The headers are removed and the payloads are formed into a single data stream. Prior to transmitting this data stream, we apply our compression algorithm and then encapsulate the compressed stream into network packets. If, for example, an ATM network is used (which has a payload of 48 bytes), we fragment the compressed stream into 48-byte quantities and attach the original header to each of them. These newly formed cells are therefore routed over the WAN just as the original cells would have been. The cells are then decompressed in the gateway of the destination LAN. The major advantage of this approach over the one described in [2, 3] is that the WAN involved is oblivious to the data carried and requires no additional compression units. A more detailed description of the compression scheme can be found in [4].

This paper describes the Titan chip, a device that implements this compression scheme, which has also been shown to be very effective when applied to current networks. It is a single-chip approach that can compress network data at state-of-the-art network speeds while introducing very low latency. It supports a standardised interface for compatibility with existing network devices, is highly configurable, and provides real-time measurements for use by the network manager.

1 Supported by a Marie Curie Research Training Grant under TMR activity 3
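The gateway-side scheme described in the introduction (group cells by header, strip the headers, compress the joined payloads, re-fragment into 48-byte payloads and reattach the original header) can be illustrated with a minimal software sketch. It is not the Titan implementation: `zlib` stands in for the hardware compressor, and the 4-byte length prefix and zero padding of the last cell are assumptions made only so that the example round-trips. The function names are hypothetical.

```python
import zlib
from collections import defaultdict

HEADER = 5    # ATM cell header size in bytes
PAYLOAD = 48  # ATM cell payload size in bytes


def compress_cells(cells):
    """Group cells by header, compress each flow, re-fragment into cells."""
    flows = defaultdict(bytearray)
    for cell in cells:
        header, payload = bytes(cell[:HEADER]), cell[HEADER:]
        flows[header] += payload  # header removed, payloads joined per flow
    out = []
    for header, stream in flows.items():
        packed = zlib.compress(bytes(stream))  # stand-in compressor (assumption)
        # Hypothetical framing: length prefix so padding can be stripped later.
        packed = len(packed).to_bytes(4, "big") + packed
        for i in range(0, len(packed), PAYLOAD):
            chunk = packed[i:i + PAYLOAD].ljust(PAYLOAD, b"\x00")  # pad last cell
            out.append(header + chunk)  # original header reattached
    return out


def decompress_cells(cells):
    """Inverse of compress_cells; returns {header: original payload bytes}."""
    flows = defaultdict(bytearray)
    for cell in cells:
        flows[bytes(cell[:HEADER])] += cell[HEADER:]
    result = {}
    for header, stream in flows.items():
        n = int.from_bytes(stream[:4], "big")
        result[header] = zlib.decompress(bytes(stream[4:4 + n]))
    return result
```

Because every output cell keeps an unmodified header and the standard 53-byte size, a switch inside the WAN would handle it like any other cell, which is the property the scheme relies on.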
Implementation issues
Several data compression algorithms of differing philosophy, complexity, and application scope have been applied to network data. Many of these algorithms have been used in practice, primarily through software or hardware implementations at speeds of less than 100Mb/sec. Their disadvantage is that they fail to meet the speed and performance requirements of state-of-the-art and future systems. Efficient VLSI implementations of data compression and decompression can reduce the time and space overhead that compression incurs, an overhead that is undesirable in many real-life applications. Thus, a VLSI chip that (a) compresses data at speeds up to the transmission speeds of current networks, (b) introduces minimal latency, (c) is robust in the case of bit errors, and (d) has a complexity similar to that of a standard network interface device will probably increase the overall efficiency of a packet network.
The device described here implements the LZ77 algorithm which, as described in PAPA98a, is optimal for this compression scheme. The following sections outline the hardware architecture and implementation of this device. Before moving to the actual implementation, however, the exact format of the interconnections between the device and an ATM switch/interface card is described.
Titan Interconnections
The Titan chips are interconnected to each other and to the existing network terminals as shown in Figure 2. As the figure shows, both the compressed and the uncompressed data are carried in ATM cells.
Core Hardware Architecture
As in every dictionary-based compression device, the core comprises the dictionary and the comparison circuits around it. Therefore, the speed of the device depends heavily on the memory throughput and the latency of the comparisons. In the architecture proposed in this paper, the speedup techniques of pipelining, parallelism and repetition of information have all been used to accelerate this core. In particular, the main characteristics of the architecture are:
256-stage pipeline.
16 comparisons in parallel at each pipeline stage.
Memory repetition (100% more memory used) for higher memory throughput.
This architecture was implemented using the Verilog language and the synthesiser previously mentioned. The technology used for the synthesis of the Verilog code was (unless otherwise stated) a 0.25 µm CMOS technology with a worst-case 2-input NAND gate delay of 0.5ns and a worst-case memory latency of 4ns.
Using these speedup techniques, and after some optimisations of the device layout, network data can be compressed at speeds up to 1.4Gb/sec, and the latency introduced is within the acceptable limits for network traffic (5 cell times). The latency can be reduced further if greater hardware resources are used, as described in the next section.
Compression Unit
The compression unit implements the LZ77 algorithm, which is applied only to the payloads of the ATM cells. Its block diagram is shown in Figure 4. In general, the compressibility table determines whether a flow should be compressed or should bypass the main unit and the header memory, and the merge and bypass circuits ensure that the cells are formed and sent over the transmission link correctly.

Moving to the exact functionality of the device, the header bits selected by a certain register are first used as an index into the compressibility table, to determine whether this particular cell should be compressed. If the cell should not be compressed, it is sent through the bypass path to the merge circuit. If it is a compressible cell, the compressibility table points to the dictionary that should be used for compressing it. Each entry of the table also stores the last 15 bytes of the payload of the last cell on this flow. These bytes, together with the first byte of the new cell, form a 16-byte lookahead buffer which is sent to the compression unit. In the next clock cycle, the second byte of the cell arrives. The first two bytes of the cell, together with the last 14 bytes of the previous cell, form the new lookahead buffer, and so on. In this manner the 48 bytes of the payload are processed by the compression unit in 48 byte-clock cycles.

As stated above, the core circuit is organised as a 256-stage pipeline. A pipeline stage is shown in Figure 5. It consists of a memory bank of 31 bytes for each dictionary, a "crossbar" that routes each 16-byte quantity to a specific comparator, and 64 comparators that are each used 4 times in every clock cycle.
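The way the lookahead buffer slides across the cell boundary, one byte per byte-clock cycle, can be sketched in software as follows. This is only an illustration of the buffering described above; the function names are hypothetical, and the handling of the very first cell of a flow (where no previous payload exists) is left out.

```python
def lookahead_buffers(prev_tail, payload):
    """Yield one 16-byte lookahead buffer per incoming payload byte.

    prev_tail: the last 15 payload bytes of the previous cell on this flow,
               as stored in the compressibility table entry.
    payload:   the 48-byte payload of the new cell.
    """
    assert len(prev_tail) == 15
    window = bytes(prev_tail)
    for byte in payload:
        # Append the newly arrived byte and keep only the newest 16 bytes.
        window = (window + bytes([byte]))[-16:]
        yield window


def process_cell(prev_tail, payload):
    """Return the buffers for one cell and the tail to store for the next cell."""
    buffers = list(lookahead_buffers(prev_tail, payload))
    new_tail = bytes(payload[-15:])  # written back to the table entry
    return buffers, new_tail
```

For a 48-byte payload this yields exactly 48 buffers, matching the 48 byte-clock cycles quoted in the text.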
The inputs of each stage are (a) the 16-byte lookahead buffer, (b) the 15-bit address of the dictionary that should be used, (c) the register LONMA, which holds the longest match found up to this point of the pipeline and the dictionary address of the first byte of that match, and (d) the four registers PRENA1-PRENA4, which hold the 4 longest matches found in the previous pipeline stage and the addresses of these matches. The outputs of each stage are (a) the unchanged 16-byte lookahead buffer, (b) the unchanged 15-bit address field, (c) the possibly updated LONMA register, and (d) the new PRENA registers, which hold the 4 longest matches found in this stage.
The reasoning behind the size of the memory is as follows. The algorithm implemented has a longest possible match of 16 bytes. A match that starts at any of 16 consecutive dictionary bytes therefore extends at most 15 bytes beyond the last of them. So it is guaranteed that all the matches starting in the first 16 bytes are contained in the 31 bytes stored in the memory.
Since the main objective has been to minimise the time needed to compare the 16-byte lookahead buffer with every byte in the dictionary, parallelism is also used. Every pipeline stage has 64 byte-comparators, each used 4 times in each major cycle, so 256 byte comparisons are done per major cycle. Comparing the 16 candidate 16-byte strings with the 16-byte lookahead buffer requires 16 * 16 = 256 byte comparisons. As a result, in each clock cycle all the possible matches of the lookahead buffer with a particular 16-byte stream are identified.
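A software model of the comparisons done in one pipeline stage makes the 31-byte and 16 * 16 figures concrete. The sketch below assumes that a match is a common prefix between the lookahead buffer and a candidate string; the paper does not spell out the exact match semantics, and `stage_matches` is a hypothetical name. The hardware performs all byte comparisons in parallel, whereas this loop stops at the first mismatch.

```python
MAX_MATCH = 16                  # longest possible match, in bytes
BANK_SIZE = 2 * MAX_MATCH - 1   # 16 start positions + 15 trailing bytes = 31


def stage_matches(bank, lookahead):
    """Return the match length for each of the 16 candidate strings of a bank."""
    assert len(bank) == BANK_SIZE and len(lookahead) == MAX_MATCH
    lengths = []
    for start in range(MAX_MATCH):
        # The last candidate covers bank[15:31], so no read leaves the bank.
        candidate = bank[start:start + MAX_MATCH]
        n = 0
        while n < MAX_MATCH and candidate[n] == lookahead[n]:
            n += 1
        lengths.append(n)
    return lengths
```

In the worst case the loop performs 16 * 16 = 256 byte comparisons, which is the number the 64 comparators cover when each is used 4 times per major cycle.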
The exact timing of a pipeline stage is shown in Figure 6. The memory is accessed and, at the same time, the register LONMA is compared with the four PRENA registers. The longest of these 5 matches is stored in register LONMA, together with the corresponding address in the dictionary. After the memory is read, the first four 16-byte comparisons are executed and their results are stored in the corresponding registers. After the results are stored, the second set of comparisons starts and, at the same time, the 4 comparison results are compared with one another and the longest match is stored in register PRENA1. Similarly, all four PRENA registers are loaded with the 4 longest matches produced by the 4 sets of comparisons.

By using all the above speedup factors, the compressor can process data at a constant speed of 622Mb/sec while introducing a latency of 256 clock cycles, or 5 cell times. This is a significant improvement over current network compressors since, as already mentioned, the processing speed of the fastest such compressor is 25Mb/sec and its latency is up to 2 cell times.
In order for this design to be used in an even faster network (e.g. a 1044Mb/sec one), the only alteration needed is the following: instead of 64 comparators in each pipeline stage, 256 are needed, so that all the necessary comparisons can be done at the same time. Following the same calculations as in the preceding paragraphs, the latency of each pipeline stage will be 5ns, and thus a clock rate of 200MHz can be used. As a result, the compressor would be able to process data at rates up to 1.4Gb/sec.
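The register updates of a stage, as described in the timing discussion above, can be modelled as a pure function. This is a behavioural sketch only: matches are represented as hypothetical (length, address) tuples, and the tie-breaking rule (the earlier candidate wins, so LONMA is kept on a tie) is an assumption that the text does not state.

```python
from typing import List, Tuple

Match = Tuple[int, int]  # (match length, dictionary address of first byte)


def update_registers(lonma: Match, prena_in: List[Match],
                     stage_results: List[Match]) -> Tuple[Match, List[Match]]:
    """Model one stage: return the new LONMA and the new PRENA1-PRENA4."""
    assert len(prena_in) == 4 and len(stage_results) == 16
    # Longest of the 5 candidates; max() returns the first maximal item,
    # so the existing LONMA is kept when lengths are equal (assumption).
    new_lonma = max([lonma] + list(prena_in), key=lambda m: m[0])
    # The 16 comparisons run as 4 sets of 4; each set's longest match
    # is loaded into one PRENA register.
    prena_out = [max(stage_results[i:i + 4], key=lambda m: m[0])
                 for i in range(0, 16, 4)]
    return new_lonma, prena_out
```

Chaining this function over 256 stages, with each stage's `prena_out` fed to the next stage as `prena_in`, gives the longest match over the whole dictionary in LONMA once the final PRENA values have also been folded in.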
Decompression Unit
The decompression unit is much simpler than the compression unit, since there is no need for comparisons between the input data and the data stored in the dictionary. It just maintains the compressibility table and a 4KB dictionary for each compressible flow. Its block diagram is shown in Figure 7. Using the compressibility table, it first determines whether a cell contains compressed data. If the header corresponds to an uncompressible flow, the cell is sent over the bypass path. If it is a compressed cell, the compressibility table entry points to the dictionary that should be used to decompress it. Then, for each byte, the decompressor determines whether it is part of an (address, length) token or an uncompressed byte. In the latter case, it sends the byte to the merge circuit and stores it in the next free entry of the corresponding dictionary. In the former case, the memory item at the corresponding address is fetched and its first length bytes are written to the output buffer. The new string is also written to the next free position of the dictionary.

Since the decompression unit needs at most two memory accesses per input byte, and assuming the same delay parameters as in the compression unit, its latency is 7ns if a standard non-pipelined architecture is used. If the two accesses are performed in two different pipeline stages, in which case a dual-port SRAM is also needed, the latency of the device will be 4ns. Therefore, even when a non-pipelined architecture is used, the unit can decompress data at speeds up to 1066Mb/sec.
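The handling of literals and (address, length) tokens described above can be sketched for a single flow as follows. The input representation (a Python list mixing integers for uncompressed bytes and tuples for tokens) is hypothetical, since the on-wire token encoding is not given here, and the wrap-around of the 4KB dictionary is likewise an assumption.

```python
DICT_SIZE = 4096  # 4KB dictionary per compressible flow


class Decompressor:
    """Behavioural model of the decompression of one flow."""

    def __init__(self):
        self.dictionary = bytearray(DICT_SIZE)
        self.next_free = 0

    def _store(self, byte):
        # Write to the next free entry; wrapping at 4KB is an assumption.
        self.dictionary[self.next_free] = byte
        self.next_free = (self.next_free + 1) % DICT_SIZE

    def feed(self, items):
        """items: ints are uncompressed bytes, tuples are (address, length)."""
        out = bytearray()
        for item in items:
            if isinstance(item, tuple):
                address, length = item
                for k in range(length):
                    # One read and one write per output byte.
                    byte = self.dictionary[(address + k) % DICT_SIZE]
                    out.append(byte)
                    self._store(byte)
            else:
                out.append(item)
                self._store(item)
        return bytes(out)
```

For example, feeding the bytes of "abcd" followed by the tokens (0, 4) and (2, 3) returns b"abcdabcdcda". The two memory operations inside the token loop correspond to the "at most two memory accesses per input byte" that bound the unit's latency.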
Compressibility

Interface circuitry
The proposed device also includes an interface module for connecting the Titan to existing network switches or interface cards, and a processor interface for configuring the Titan and reading the real-time measurements it provides. Both interfaces conform to widely accepted standards, so the Titan can be connected directly to a wide range of network devices and processors.
Conclusions
The device presented here addresses some of the most important implementation issues of a network compression scheme, i.e. high throughput and low latency. It can compress network streams at speeds up to 1.4Gb/sec with a latency of just 5 packet times. It can therefore be used with state-of-the-art high-speed networks. The device is highly configurable and provides on-line measurements of the compressibility of the different streams. Although it is targeted at ATM networks, its basic architecture can be used with other network technologies as well.
