This paper presents the design and implementation of the InfiniBand link layer with special efforts made for packet latency reduction and buffer space optimization. The link layer is designed to avoid any architectural conflict while its components are executed in parallel as far as possible. For high-speed packet processing with the various quality of service supports required by InfiniBand, three candidates for packet receiving architecture are investigated. The maximum and minimum delays from an input to an output of a switch adopting each of the three candidates are estimated by mathematically modeling the switch delays. Then, the candidate architecture with the best performance is chosen, and a novel first-in first-out (FIFO) is designed to efficiently implement the chosen architecture. Simulation results show that the chosen architecture achieves the least packet latency and uses the least memory space among the three candidates. The link layer core is implemented in an InfiniBand host channel adapter system-on-chip called KINCA.
INTRODUCTION
The support for quality of service (QoS) is a major research issue in modern multimedia communication networks. However, most of the current networking systems that use cable interconnect standards such as Gigabit Ethernet or Fiber Channel have difficulty in supporting QoS effectively because these standards fail to provide enough QoS-specific mechanisms to satisfy the various QoS requirements [1] [2] [3] . This lack of mechanisms explains why the InfiniBand architecture is now being spotlighted for the multimedia network architecture [4] .
InfiniBand is a system interconnection standard that improves the interconnectivity between servers and I/O devices [5, 6]. It was designed to overcome existing barriers and to provide a framework for new and enhanced features and capabilities. As a result, it is widely considered as a technology for system interconnection that will serve as a universal enterprise fabric for various classes of networks and storage [7, 8]. Most of all, InfiniBand enables high-grade QoS support for the transmission of time-sensitive packets such as multimedia data; therefore, InfiniBand is often considered the most appropriate system area network (SAN) for delivering video-on-demand and other multimedia streaming services in the form of a storage cluster.
The link layer of InfiniBand is fully responsible for supporting the QoS mechanisms. However, it is not easy to implement efficiently in terms of both performance and cost due to complicated packet check routines at high-speed packet transfer rates and multiple virtual channels that require large buffer space, respectively. To implement a link layer that fully supports the InfiniBand QoS mechanisms, careful design and optimization of hardware architecture are necessary. Otherwise, many of the QoS mechanisms that are stated independently in the specification may cause an architectural conflict that, as a result, degrades the performance and wastes buffer memory space.
This paper presents the design and implementation of the InfiniBand link layer with special efforts made for packet latency reduction and buffer space optimization. For the efficient implementation of the link layer, a high-speed packet receiving architecture with the support of a smart first-in first-out (FIFO) [9] is proposed to overcome the bottleneck caused by the real-time processing of abundant and rapid packets. The proposed architecture and FIFO remove the temporary buffers in which incoming packet data is stored until error checking is finished, which cause the bottleneck in the traditional approach. Instead, the FIFO is capable of discarding packets immediately in response to signals that indicate an error detection. Thus, it reduces the packet processing latency because the input packet is directly stored in the FIFO; the hardware cost is also reduced due to the omission of temporary buffers. In addition, it makes the link flow control more efficient by discarding any error packet in one shot and thus using the available bandwidth of InfiniBand effectively. To verify the advantages, the paper analyzes the maximum and minimum delays and hardware costs of the proposed architecture as well as the existing architectures with their static models, simulates the behaviors of the architectures and compares the results.
THE COMPUTER JOURNAL Vol. 50 No. 5, 2007
The rest of the paper is organized as follows: the InfiniBand QoS mechanisms are introduced in the next section. Section 3 presents the proposed hardware architecture of the link layer efficiently supporting all the InfiniBand QoS mechanisms. Section 4 investigates architectural candidates for high-speed packet-receiving buffers and proposes a new architecture based on the investigation. Section 5 estimates the critical and optimal path delays of the candidates and Section 6 evaluates the performance and the hardware cost. Section 7 presents the implementation of the link layer with the proposed architecture as a silicon core and briefly introduces the KINCA chip, which is an InfiniBand host channel adapter (HCA) employing the proposed link layer core. Finally, conclusions are presented in Section 8.
INFINIBAND QOS MECHANISMS
The InfiniBand link layer provides various services that guarantee the support of various QoS demands ranging from connected to connectionless communications. The QoS mechanisms at the link layer are the virtual lane (VL), link-level flow control, VL arbitration and service level (SL)-to-VL mapping [10]. This section briefly introduces these mechanisms.
Virtual lane
The I/O ports of the InfiniBand components, such as the switches and HCAs, support the VLs that provide a mechanism for creating multiple virtual links within a single physical link. Each VL, which must be viewed as an independent resource for the flow control, contains its own transmit or receive buffer. An I/O port must support at least two VLs (VL0 and VL15) and not more than 16 VLs (from VL0 to VL15). All ports must include VL15, which is reserved exclusively for the subnet management and always has the highest priority among all the VLs. Although the implementation details of a VL are not defined in the InfiniBand standard specification, a VL is generally implemented as a temporary buffer with control logics. The buffer stores data packets that are to be either processed by the transport layer (or the crossbar switch) in the receiving part or sent out in the transmitter side. The size of one packet can be as large as 4222 bytes.
Flow control
To prevent the loss of packets due to network congestion, the InfiniBand architecture uses an 'absolute' credit-based flow control scheme between node components such as HCAs and switches. Many traditional flow control schemes such as those used in asynchronous transfer mode networks provide incremental updates that are just added to the available buffer space for packet sending in transmitters [11] . On the other hand, InfiniBand receivers provide a 'credit limit' that indicates the total amount of data that the transmitter has been authorized to send since the link was initialized [10] . InfiniBand requires the credit to be calculated not by a cell or packet, but by a block with the fixed size of 64 bytes. This block-based calculation makes it easier to manage and make better use of the buffer space.
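The absolute credit scheme described above can be illustrated with a minimal Python sketch. The names `blocks_for`, `Transmitter`, `credit_limit` and `blocks_sent` are hypothetical and chosen only for illustration; the sketch models a single VL and ignores the counter widths and flow control packet formats of the actual specification.

```python
BLOCK_BYTES = 64  # credit granularity stated in the text


def blocks_for(packet_bytes: int) -> int:
    """Number of 64-byte blocks a packet occupies (rounded up)."""
    return -(-packet_bytes // BLOCK_BYTES)


class Transmitter:
    """Toy transmitter for one VL under an absolute credit limit."""

    def __init__(self) -> None:
        self.blocks_sent = 0   # total blocks sent since link initialization
        self.credit_limit = 0  # latest absolute limit advertised by the receiver

    def update_credit(self, credit_limit: int) -> None:
        # The receiver advertises a running total, not an increment, so each
        # update simply overwrites the stored value.
        self.credit_limit = credit_limit

    def try_send(self, packet_bytes: int) -> bool:
        needed = blocks_for(packet_bytes)
        if self.blocks_sent + needed > self.credit_limit:
            return False  # not enough credit; hold the packet
        self.blocks_sent += needed
        return True


tx = Transmitter()
tx.update_credit(66)                 # room for one 4222-byte packet
first = tx.try_send(4222)            # accepted: 66 blocks
second = tx.try_send(64)             # refused until the limit is raised
tx.update_credit(67)
third = tx.try_send(64)              # accepted after the update
```

Because the limit is an absolute total, the transmitter only compares two counters; it never has to reconstruct the receiver's free space from a history of increments.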
VL Arbitration
If more than two data VLs (except VL15) are implemented, the priority of each VL is defined by the VL arbitration table. The arbitration uses a two-level weighted round-robin algorithm. There are two arbitration tables, one for scheduling packets for high-priority VLs and the other for low-priority VLs. Weighted round-robin arbitration is performed within each priority level. Each arbitration table contains up to 64 entries, and each entry specifies a VL number and a weight that represents the number of blocks (64 bytes) that can be transmitted from that VL. This weight must be in the range of 0 to 255 and is always rounded up to a whole packet. The entries in each table are cycled through when the table is activated. Switching between the two priority levels is based on the value of the 'Limit_Of_High_Priority (LHP)' register defined in the InfiniBand specification. The value of the LHP register specifies the maximum number of high-priority packets that can be transmitted before a low-priority packet is transmitted.
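The two-level weighted round-robin scheme can be sketched in Python as follows. This is a simplified model under stated assumptions, not the arbiter of the paper: the class names `TableCursor` and `VLArbiter` and the helper `drain` are hypothetical, VL15 is omitted, each queue holds only packet sizes in bytes, and the LHP limit is counted in packets as described in the paragraph above.

```python
from collections import deque

BLOCK_BYTES = 64


class TableCursor:
    """Cycles through one arbitration table of (vl, weight_in_blocks) entries."""

    def __init__(self, entries):
        self.entries = entries
        self.index = 0
        self.remaining = entries[0][1] if entries else 0

    def _advance(self):
        self.index = (self.index + 1) % len(self.entries)
        self.remaining = self.entries[self.index][1]

    def pick(self, queues):
        """Return (vl, packet_bytes) for the next sendable packet, or None."""
        for _ in range(len(self.entries)):
            vl, _weight = self.entries[self.index]
            if self.remaining > 0 and queues[vl]:
                size = queues[vl].popleft()
                blocks = -(-size // BLOCK_BYTES)
                # A started packet is always sent whole, even if it exceeds
                # the weight left for this entry (weight rounded up).
                self.remaining = max(0, self.remaining - blocks)
                if self.remaining == 0:
                    self._advance()
                return vl, size
            self._advance()
        return None


class VLArbiter:
    def __init__(self, high, low, limit_of_high_priority):
        self.high = TableCursor(high)
        self.low = TableCursor(low)
        self.lhp = limit_of_high_priority
        self.high_sent = 0  # high-priority packets since the last low one

    def next_packet(self, queues):
        low_waiting = any(queues[vl] for vl, _ in self.low.entries)
        if not (low_waiting and self.high_sent >= self.lhp):
            choice = self.high.pick(queues)
            if choice is not None:
                self.high_sent += 1
                return choice
        self.high_sent = 0
        return self.low.pick(queues)


def drain(arbiter, queues):
    """Collect the VL numbers in the order the arbiter serves them."""
    order = []
    while (choice := arbiter.next_packet(queues)) is not None:
        order.append(choice[0])
    return order


queues = {0: deque([256] * 5), 1: deque([256] * 2)}
arbiter = VLArbiter(high=[(0, 4)], low=[(1, 4)], limit_of_high_priority=2)
order = drain(arbiter, queues)
```

With five 256-byte packets on VL0 (high priority) and two on VL1 (low priority), the low-priority VL is served once after every two high-priority packets while it still has traffic waiting.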
SL-to-VL mapping
The InfiniBand specification defines a maximum of 16 SLs that can be used for QoS traffic classification. An SL-to-VL mapping table is used to define the VL that is used by each SL. Each data packet is marked with a class of service on the SL field of the 'Local Route Header' (LRH). As an InfiniBand clustering system can be constructed with switches that support different numbers of VLs, the total number of VLs used by a port and an SL-to-VL mapping table of each port should be configured by the subnet manager (SM).
HIGH-SPEED LINK LAYER ARCHITECTURE 617
THE HARDWARE ARCHITECTURE OF THE INFINIBAND LINK LAYER
The InfiniBand link layer is responsible for sending and receiving data across the fabric at the packet level [8]. It also handles the various QoS mechanisms mentioned in Section 2. The additional functions of the link layer include buffering, error detection, packet routing within the local subnet and address decoding [10]. Figure 1 shows the proposed link layer architecture. In the transmit block, data VLs contain a buffer of 8 kbytes each, and one management VL (VL15) contains a buffer of 4 kbytes for subnet management packets. Without a major change in the architecture, the number of VLs can be generalized to an arbitrary number. The size of the maximum transfer unit (MTU), that is, the size of the maximum packet payload, can be configured from 256 to 4096 bytes, which is half the size of one data VL. The packets in the five VLs are passed to the Data Packet Generator through a multiplexer that is controlled by the VL arbiter. The arbiter decides the priority of the packets based on the arbitration table that is downloaded from the SM. The Data Packet Generator receives data from the selected VL and packetizes the data to be sent out. The Link Packet Generator receives information from the Link Packet Check and Flow Control Machine in the receive block, composes a packet with the information, and sends it back to the opposite link layer of the physical link.
In the receive block, the Packet Receiver State Machine receives input packet streams and reformats the packets for the upper layer processing. The Data Packet Check Machine verifies the packets from the Packet Receiver State Machine. The Link Packet Check and Flow Control Machine controls the network traffic with the credit-based flow control mechanism using information delivered by the flow control packets. A multicast buffer is used to support the self-multicasting for the packets of 'Unreliable Datagram' and raw types as defined in the InfiniBand specification. The VL Control Blocks in both the receive and the transmit blocks are responsible for controlling the VLs. The Rx VL Control Block is important because it generates new read addresses in each VL buffer when the 'discard' indication from the transport layer occurs. This situation is described in detail in the following section.
InfiniBand requires a packet data stream of 2.5 Gbps per channel (with 8b/10b coding in the physical layer). To support up to the 4× channel mode in which data is transferred by a 32-bit width at 250 MHz, the physical layer's I/F block is designed to process 64-bit data for every clock cycle at 125 MHz.
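The data-path width follows from simple rate arithmetic, which the short Python check below makes explicit. The constant names are illustrative only; the figures are those quoted in the paragraph above.

```python
LANE_SIGNAL_BPS = 2_500_000_000   # 2.5 Gbps signalling rate per channel
CODING_NUM, CODING_DEN = 8, 10    # 8 data bits carried per 10 line bits
LANES = 4                         # 4x channel mode

lane_data_bps = LANE_SIGNAL_BPS * CODING_NUM // CODING_DEN
link_data_bps = lane_data_bps * LANES

narrow_path_bps = 32 * 250_000_000  # 32-bit width at 250 MHz
wide_path_bps = 64 * 125_000_000    # 64-bit width at 125 MHz

print(lane_data_bps, link_data_bps, narrow_path_bps, wide_path_bps)
```

Each channel carries 2 Gbps of data after coding, so four channels carry 8 Gbps, which matches both the 32-bit path at 250 MHz and the 64-bit path at 125 MHz.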
PACKET RECEIVING ARCHITECTURE
To support the InfiniBand QoS mechanisms with the capability of high-speed packet processing, the hardware architecture of the packet-receiving part needs to be designed carefully. One reason is that all the types of InfiniBand traffic defined in [1] need high utilization of the physical bandwidth and low packet latency to achieve a good QoS performance. Another reason is the need to handle efficiently errors that arise from more varied sources than in other existing standards, because InfiniBand operates at a much higher transfer rate [12-14] and supports very complicated communication services. The link-level error group, for example, has packet delimiter errors, CRC errors and errors on each field of the LRH, and the transport-level error group has errors in various headers that follow the LRH. If these types of errors are not checked promptly by an I/O port, a bottleneck may easily arise and the network bandwidth may be severely wasted. An InfiniBand packet consists of three fields: the header field, the data field (MTU size) and the error control field (CRC). Although the data field is received ahead of the error control field, the data field cannot be stored in the corresponding VL buffer until all the data in the error control field is received and checked. This requirement is mandatory because the specification states that the data can be stored in the corresponding VL only if no errors are detected [10].
J. LEE et al.
This section investigates the existing architectural candidates for the packet receiver in conventional network devices to assess their suitability for the high-speed packet receiving required by InfiniBand and proposes a candidate, which is compared with the previous ones in later sections.
Conventional architecture for packet receiving
To satisfy the specification, a packet receiver machine can generally be designed as shown in Fig. 2 . As indicated by arrow in the figure, all the fields of a packet are stored in a temporary buffer, namely a memory that consists of an SRAM and its controller. The packet is then analyzed and checked for errors by the Packet Inspection Logic as shown by arrow . Any error is indicated by signal from the inspection logic and the corresponding packet is discarded by clearing the memory buffer where the packet is stored. If no error is detected, the packet is sent to the corresponding VL FIFO toward the upper layer as arrow and the packet is sent to the upper layer, i.e. the transport layer in an HCA or the crossbar in a switch as indicated by arrow .
For current packet-based network communications and I/O standards that deal with a relatively slow data stream, the above architecture can process the data stream in time [15] [16] [17] [18] . However, applying the architecture to InfiniBand may be inefficient or even impossible for several reasons. First, the memory buffer and controller block in the figure cannot afford to process all incoming packets as fast as required by InfiniBand and, second, due to the bottleneck of the receiver, the processing latency may increase severely.
To alleviate the problems, the memory buffer should be able to accommodate at least two maximum-size packets, so that one can be checked while the other is forwarded to a VL or discarded in parallel. However, the buffer space for the data packets becomes a major overhead because the size of an InfiniBand packet can be as large as 4222 bytes. Another drawback of this architecture is that the space of the memory buffer is shared by all VLs. As a result, the space cannot be used for the flow control because the credit for the flow control should be based only on the buffer space dedicated to each VL. Thus, even if the memory buffer has empty space, it cannot be counted toward any VL credit.
Architecture with a one-bit error flag
Figure 3 shows a packet-receiver architecture with a one-bit error flag attached to each entry of a VL [19, 20]. The white bar in the left of each VL represents the one-bit error flag that can be treated as an additional bit of data stored in the FIFO. As shown by arrows and , incoming packets are immediately stored in the corresponding VL through the Packet Inspection Logic. If the packet contains any error, the error indication flag is set to '1' at the end of the packet. Whenever the upper layer or the crossbar controller reads, it checks the error flag and if it is set to '1', it discards the packet. As a result, all packets with link layer or transport layer errors are discarded in the upper layer. This architecture is simple and has a low latency due to the direct path indicated by arrows and without any temporary buffering. However, the credit size for the flow control can decrease because it stores all incoming packets, including the error packets. In addition, this architecture needs an additional buffer with a maximum packet size of 4222 bytes per port or per VL, depending on the switch structure that will be discussed in Section 5. The reason for the additional buffer is that the link error packets, even without any transport error, should first be stored in VLs, and then discarded by the upper layer or the crossbar when the error flag is set to '1' at the end of the packet. Only after the entire packet is confirmed as error-free can it be consumed by the upper layer or the crossbar.
The proposed architecture
This section proposes an efficient high-speed packet receiving architecture and a FIFO that supports it. Figure 4 shows the proposed architecture. The memory and controller block, which causes the bottleneck in the traditional approach, is removed in the architecture. The packet data is stored immediately in the corresponding VL FIFO when the packet data is received from an external channel. This is possible because the packet header contains the corresponding VL address. While the input packet is being stored, the inspection logic immediately checks various fields of the packet and accumulates the CRC value calculated at every clock cycle. Whenever an error is detected at the end or in the middle of packet receiving, the packet data is immediately discarded. The upper layer can also detect errors while pulling out data from the FIFO; that is, when an error is detected in the layer, the whole packet must be discarded from the FIFO. The architecture can also discard error packets immediately in response to a request from the transport layer.
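The idea of accumulating the CRC while the data streams in, rather than after buffering the whole packet, can be illustrated with a small Python sketch. The function name `receive_words` is hypothetical, and `zlib.crc32` is used only as a convenient stand-in for a running checksum; it is not claimed to match the CRC polynomials or field coverage of InfiniBand.

```python
import zlib


def receive_words(words, expected_crc):
    """Accumulate a CRC word by word and report whether the packet is good."""
    crc = 0
    for word in words:
        # zlib.crc32 accepts the running value as its second argument, so the
        # check never needs the whole packet in a separate buffer.
        crc = zlib.crc32(word, crc)
    return crc == expected_crc


# Two 64-bit words standing in for the data of one packet.
payload = [b"\x01\x02\x03\x04\x05\x06\x07\x08",
           b"\x11\x12\x13\x14\x15\x16\x17\x18"]
good_crc = zlib.crc32(b"".join(payload))

corrupted = [payload[0], b"\x11\x12\x13\x14\x15\x16\x17\x19"]
```

In hardware the same accumulation happens once per clock cycle, which is why the verdict is available as soon as the last word arrives and the packet can be committed or discarded at that point.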
Compared with the other architectures described above, the proposed architecture has a number of advantages. Firstly, in contrast to the traditional architecture in Fig. 2, the proposed architecture reduces the latency because the input packet is directly stored in the FIFO. Secondly, in contrast to the architecture in Fig. 3, it can use VL buffer spaces more efficiently by not storing the error packets. In addition, without any temporary buffer, the proposed architecture can reduce gate counts, i.e. area cost. Moreover, by avoiding any bottlenecks, it can efficiently support the available bandwidth of an external channel.
To implement the above architecture, the FIFO must be capable of discarding packets immediately (within one clock cycle) in response to signals that indicate an error detection. These signals can be generated by both the input side and the output side; that is, by the current layer and the upper layer. To handle the immediate packet discard, a FIFO architecture is proposed in Fig. 5 . The data input and output method is the same as a conventional FIFO. The signal 'write_data_in' refers to the input data bus and 'read_data_out' refers to the output data bus. The input data is latched to the FIFO when the 'write_enable_in' signal is asserted 'high', whereas the 'read_enable_in' signal is used to enable data to be read from the FIFO. The 'full_out' signal is asserted 'high' when the FIFO is full, and the 'empty_out' signal indicates that the FIFO is empty.
The 'waddr' register is the address pointer for the current input data, and the 'waddr_bound' register points to the address of the first data of the current packet. If any error is detected by the inspection logic, the 'waddr_load' signal is asserted. The value of the 'waddr_bound' signal is then loaded into the 'waddr' register. A change in the 'waddr' value means that the input data from 'waddr_bound' to 'waddr' has completely been discarded. Because the load of 'waddr' takes place in a single cycle, the data discard is performed in a single cycle too. If the packet has no errors after all the data have been received, the value of the 'waddr' register, which points to the start of the next packet is loaded into the 'waddr_bound' register, which is enabled by the assertion of the 'waddr_bound_load' signal.
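The write-side behavior described above can be modeled in a few lines of Python. This is a behavioral sketch only, assuming a simple circular memory: the class name `PacketFifo` and its method names are hypothetical, the full and empty logic is omitted, and the register names `waddr` and `waddr_bound` follow the text.

```python
class PacketFifo:
    """Behavioral model of the write side of a FIFO with packet rollback."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.mem = [None] * depth
        self.waddr = 0        # address for the current input data
        self.waddr_bound = 0  # address of the first data of the current packet

    def write(self, word) -> None:
        # Corresponds to 'write_enable_in': store one word, advance 'waddr'.
        self.mem[self.waddr] = word
        self.waddr = (self.waddr + 1) % self.depth

    def discard_packet(self) -> None:
        # Corresponds to 'waddr_load': one assignment rolls the write pointer
        # back to the packet start, whatever the packet length.
        self.waddr = self.waddr_bound

    def commit_packet(self) -> None:
        # Corresponds to 'waddr_bound_load': the packet was error-free, so
        # 'waddr' now marks the start of the next packet.
        self.waddr_bound = self.waddr


fifo = PacketFifo(depth=8)
for word in ("A0", "A1", "A2"):   # packet A, received without error
    fifo.write(word)
fifo.commit_packet()

fifo.write("B0")                  # packet B, found to be in error
fifo.write("B1")
fifo.discard_packet()

fifo.write("C0")                  # packet C reuses the space of packet B
fifo.commit_packet()
```

The discarded words are never erased; they simply become unreachable and are overwritten by the next packet, which is what makes the discard a single-cycle operation.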
In the data output side, an error packet can also be immediately discarded. The 'raddr_plus_1' signal points to the current read address. In normal operation, whenever the 'read_allow' signal is asserted, the 'raddr_plus_1' register is increased by one. Upon finding an error, the transport layer asserts the 'upper_layer_discard' signal. The Rx VL Control Block, which is shown in Fig. 1 , then updates the FIFO read pointer. Because the packet length is in the LRH, which is the first header of a data packet, the length is always stored in advance in a temporary register in the Rx VL Control Block. As the packet data is transmitted to the transport layer, the control block counts the length of the stored data accordingly and calculates the remaining length. The control block then reads the value of the 'raddr_plus_1' from the FIFO. The 'raddr_plus_1' is added to the remaining length, thereby enabling the transport layer to read the start address of the next packet. This new address is sent back to the FIFO through the 'new_addr' signal and stored in the 'raddr_plus_1'. This update of 'raddr_plus_1' causes the error packet to be discarded.
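The read-pointer update performed by the Rx VL Control Block amounts to one modular addition, sketched below in Python. The function name and its parameters are hypothetical; the sketch assumes that the packet length has already been converted to FIFO words and that 'raddr_plus_1' holds the address of the next word to be read.

```python
def next_packet_read_addr(raddr_plus_1: int, packet_words: int,
                          words_read: int, depth: int) -> int:
    """Address of the first word of the next packet after a discard."""
    remaining = packet_words - words_read  # words of the bad packet still stored
    return (raddr_plus_1 + remaining) % depth


# A 10-word packet in a 1024-word FIFO; the error is found after 2 words,
# when the read pointer is at address 1020.
new_addr = next_packet_read_addr(raddr_plus_1=1020, packet_words=10,
                                 words_read=2, depth=1024)
```

In this example the remaining eight words wrap around the end of the memory, so the next packet starts at address 4.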
The initial value of a register at system boot-up is indicated by 'set to 0', 'set to 1' or 'set to 2' denoted just below each register in Fig. 5. The 'waddr_plus_2', 'waddr_bound_plus_2', 'waddr_bound', 'waddr', 'raddr_plus_1' and 'raddr_plus_1_reg' registers are used to express the addresses of the 'dual_port_sync_sram' memory cell in the FIFO. These registers are updated according to the combination of three control signals: write_allow, waddr_bound_load and waddr_load. The FIFO has been registered as a US patent [9], where the details of the FIFO circuitry are presented.
ESTIMATION OF THE MAX AND MIN PATH DELAYS
The link layer is a major part of an InfiniBand switch as the switch consists of only the link and physical layers, including a crossbar. The performance enhancement in the link layer may help reduce the time taken for a packet to pass through the switch. This section estimates the maximum and minimum delays from an input to an output of a switch by statically modeling the three candidates for the link layer described in the previous section based on the delay computation method of [21] .
Maximum delay modeling for an error-free switch
In the case of a typical switch with a multiplexed crossbar, each input port that consists of the receive block of the link layer has a single input to a crossbar within the switch. Each output port that consists of the transmit block also has only one output from the crossbar. With this structure, the maximum delay that a packet suffers when crossing a switch can be computed. Consider a packet entering the input VL buffer. If the VL has a pure FIFO behavior without any consideration of error checking, all the packets that arrived in the VL before this packet must leave before it. The number of such packets depends on the VL size (VS) and the maximum packet size (MTU) and is ⌈VS/MTU⌉, where ⌈ ⌉ represents the round-up (ceiling) operation. If the round-robin algorithm is applied to the selection of the next packet to transmit, the behavior of the arbitration tables affects the behavior of the round-robin algorithm, adopting the priority policy that is applied to the output port. Thus, to compute the number of packets that can cross before one packet of a high priority, all possible packets from all the buffers must be considered. Consequently, the maximum number of packets P to consider before a certain high-priority packet crosses is:
where nPorts is the number of external ports of the switch, and nVLs is the number of VLs per port.
As all the packets of the same SL use the same VL, when the high-priority packet crosses the crossbar, all the packets that it encounters on the way belong to the same SL. The maximum number of the packets that the buffer of the output port can have is ⌈VS/MTU⌉ and this number should be added to equation (1). In addition, the one packet that could be crossing the crossbar should be added to equation (1). Thus, the maximum number of packets, P, becomes as follows:
The time in the output VL buffers will depend on the output arbitration tables. Thus, the term Scan is used for the maximum separation between the two entries corresponding to this VL in the high-priority arbitration table, or a complete round of the table, if the VL has only one entry in the high-priority table. Then, this value should be applied to equation (2) :
In addition, a possible packet leaving the switch and the effect of the LHP should be included as well. This limit allows the switch to send LHP × 4 kbytes of high-priority packets before sending a packet of a low priority. Thus, the number of low-priority packets that can be transmitted before a high-priority packet depends on the VS and on the MTU. Adding this value to the previous expression, equation (4) is obtained:
where P is the maximum number of packets that can be sent before sending a given high-priority packet. Finally, the maximum delay per switch that a certain high-priority packet experiences from the time it arrives at the switch until it leaves, T_max, is as follows:
Meanwhile, in the case of a switch with a full crossbar, where each VL has a dedicated input to the crossbar, two packets from two different VLs of the same input port could be crossing at the same time toward different output VLs. In this case, equation (4) can be simplified. Only the packets that are in the same VL of the other input ports could cross the crossbar before a packet of a high priority. Thus, the final expression is:
The Scan varies when new connections are accepted. However, with all the SLs for time-sensitive traffic assigned to the same VL, the expressions can be simplified by setting Scan = 1.
Actual modeling of the delay
When error checking is taken into account, i.e. the constraints imposed by each architectural candidate, equation (1) should be modified as follows:
• Type A: nPorts × 2 + nVLs × VS/MTU (7)
• Type B:
Here, for convenience, the conventional architecture for packet receiving is marked as Type A, the architecture with a one-bit error flag as Type B and the proposed architecture as Type C. As indicated in the expressions, the value P of Type B and that of Type A are bigger than that of Type C by the number of ports and by at least twice, respectively, regardless of the switch type. If it is assumed that MTU is set to 4096 bytes, VS to twice the maximum MTU, the numbers of ports and VLs to four each, and LHP to two, the value P for each candidate is listed in Table 1 . The VS is a reasonable embedded memory size, considering today's system-on-chips (SoCs) for InfiniBand network chipsets. As shown in the table, with the multiplexed crossbar, the Type C requires 17.77 and 9.76% less delay than Types A and B, respectively. With the full crossbar, Type C requires 38.09 and 23.52% less delay, respectively. Moreover, the delay is accumulated whenever the packet hops from one switch to another.
On the other hand, the minimum delay, which occurs when there are no competing packets (the contention-free case), is easily calculated using the previous concepts for the switch structures. Accordingly, for both crossbars, the minimum values for Types A, B and C are 2, 1 and 1, respectively. Type C is thus still superior to Type A, with the delay reduced by 50%.
The estimation discussed above does not reflect dynamic factors such as dynamic flow control, VL management and error packet discarding. Mathematical estimation of those factors is very complex and not viable. Thus, to investigate the effects of the dynamic factors in normal cases, simulations are performed and evaluated in the next section.
PERFORMANCE EVALUATION AND COMPARISON
This section performs network simulations to measure the dynamic effects of the three candidates in a realistic environment and calculates their area costs. Each architecture is written in Verilog HDL in the form of synthesizable register-transfer level (RTL) code. The models of an InfiniBand HCA and a switch are also written in Verilog RTL, covering the transport layer down to the physical layer. The three architectural candidate models are employed by the HCA and switch models and a 4 × 4 mesh structure is built to form an example InfiniBand network system. This architecture is reasonable for performance estimation because, in general, 15 to 64 nodes are used to analyze the behavior of SANs [21] [22] [23] [24] [25]. The details of the simulation environment are as follows: the data packet for transmission consists of the fields of LRH and 'Base Transport Header', the data payload and the fields of 'Invariant CRC' and 'Variant CRC'. The total packet size is 26 bytes (for the header) plus the size of the MTU (for payload and the CRCs), which ranges from 256 up to 4096 bytes. It is assumed that the models have four VLs, operate in the 4× mode of 10 Gbps at a clock speed of 125 MHz, and all the data paths in the link layer have a width of 64 bits. Each data VL of all the architectural candidates in a switch uses a memory cell of 1024 × 64 bits (twice the maximum MTU).
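The buffer sizing quoted in the simulation setup can be checked with a few lines of Python. The constant names are illustrative only, and the packet size uses the 26-byte header plus MTU figure given above.

```python
WORD_BITS = 64          # data path and memory word width
VL_DEPTH_WORDS = 1024   # memory cell depth per data VL
MAX_MTU_BYTES = 4096
HEADER_BYTES = 26       # header size used in the simulation setup

vl_bytes = VL_DEPTH_WORDS * WORD_BITS // 8
max_packet_bytes = HEADER_BYTES + MAX_MTU_BYTES
# Ceiling division: number of 64-bit words occupied by the largest packet.
words_per_max_packet = -(-max_packet_bytes * 8 // WORD_BITS)

print(vl_bytes, max_packet_bytes, words_per_max_packet)
```

The memory cell holds 8192 bytes, i.e. twice the maximum MTU, and the largest simulated packet of 4122 bytes occupies 516 of its 1024 words.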
To emulate the physical layer, the behavioral model of a PMC's SerDes chip [26] is used and additional features are designed in accordance with the specification of the InfiniBand physical layer. The transport and physical layers are the same as the practical models used to fabricate an InfiniBand HCA SoC, which is illustrated in the next section.
Latency
First, average packet latencies are measured with various injection rates. The packet latency is measured when the link state machine is in the 'LinkActive' state. The latency of each data packet is measured from the time the packet is sent from a buffer in the transport layer to the time it arrives at the transport layer in the destination node through the switch nodes. Note that the packet is sent immediately after the decision of the VL arbiter. There is a packet error generator between the HCA node and the switch node to corrupt normal packets into error packets. Flow control packets are sent after every transmission of a data packet, and they are also sent periodically for fast credit updates even when no data packets are transmitted. Note that the latencies due to the packet routing and the internal switching delays are not included in this measurement because this research focuses on efficient buffer management under dynamic credit-based flow control. To concentrate on differences among the three architectures, the overall latencies through the entire mesh network are divided by the number of hops to obtain the average latency of each link. VL0 is assigned for real-time (i.e. time-sensitive) traffic like Dedicated Bandwidth Time Sensitive by using the high-priority table. Using the low-priority table, best-effort traffic for Dedicated Bandwidth, Best Effort or Challenged is assigned to VL1, VL2 and VL3. This traffic classification and its table assignment are presented in [1, 4]. The results presented here are for a real-time to best-effort ratio of 7:3, and each VL in the transmit side can control the injection rate, ranging from 30 to 60%. These simulation conditions are widely used in previous research [21-25].
Figure 6 shows the average latencies of the data transfers through the network. Figure 6a presents the latencies of real-time traffic. For the rates from 30% to 60%, Type C shows the best performance and, as the rate increases, the performance differences become bigger. In the case of Fig. 6b, which presents the latencies of best-effort traffic, Type C also shows the best result. However, the differences between Types B and C are very small. Previous research [23] shows that the latencies of best-effort traffic are heavily affected by the network routing algorithm. In this case, link layer architectures do not affect the latency much and that is why all the three types have similar latencies.
As shown in the figures, the link layer architecture affects the network performance, especially for time-sensitive traffic, which is a major QoS factor in multimedia applications. Moreover, because the latencies accumulate each time a packet hops to another switch node, the effect is not negligible in such applications.
According to [12, 13], the PER of an InfiniBand cable is <0.00000422%. However, as a packet error is caused not only by physical cables but also by many other sources, the actual PER is much higher. Several InfiniBand device makers report a PER of up to 0.5%. Thus, the experiments are performed again with a 0.5% PER under a heavy traffic environment. Here, the weight ratio of the VL arbitration table is assumed to be 4:3:2:1 for VL0:VL1:VL2:VL3. Figure 7 shows the latencies of the data packets transmitted through VL1 with the 0.5% PER. The X-axis represents the sequence number of the data packets, and the Y-axis represents the latency measured in microseconds. The packet sequence numbering of the X-axis does not include the error packets. The slight change of slope in the middle of the graph is caused by VL0 saturation. As VL0 is saturated, the rate of packet transmission through VL0 is reduced. The transmitter therefore uses the other VLs more frequently to send packets out. As a result, the packet receive rate of VL1 increases, which in turn increases the packet processing latencies.
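To make the 4:3:2:1 weighting concrete, the following is a minimal sketch of a plain weighted round-robin arbiter. It is a simplification and not the InfiniBand two-table (high-priority/low-priority) arbiter used in the experiments; the function name, the packet labels and the queue depths are hypothetical.

```python
from collections import deque


def weighted_round_robin(queues, weights, rounds):
    """Serve each VL up to weights[vl] packets per round, skipping empty VLs."""
    order = []
    for _ in range(rounds):
        for vl, weight in weights.items():
            for _ in range(weight):
                if not queues[vl]:
                    break  # nothing pending on this VL; move on to the next one
                order.append(queues[vl].popleft())
    return order


# Weight ratio assumed in the experiment: VL0:VL1:VL2:VL3 = 4:3:2:1.
weights = {"VL0": 4, "VL1": 3, "VL2": 2, "VL3": 1}
# Hypothetical backlog of eight packets on every VL.
queues = {vl: deque(f"{vl}-p{i}" for i in range(8)) for vl in weights}

sent = weighted_round_robin(queues, weights, rounds=1)
share = {vl: sum(p.startswith(vl) for p in sent) for vl in weights}
print(share)
```

In this simplified model a VL that cannot send is skipped, so its share of the link passes to the remaining VLs. This is the same qualitative effect as the increase in VL1 traffic observed once VL0 saturates.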
As shown in the figure, Type C has lower latencies overall than the other types because the immediate discard of error packets by the proposed FIFO reduces the transmission time of the following packets. Type B in particular has larger latencies because of its inflexible error handling without temporary buffers.
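One generic way to discard an error packet without a temporary buffer is a FIFO whose write pointer can be rolled back to the end of the last good packet. The sketch below illustrates that general technique only. It is not the FIFO proposed in this paper, whose internal design is not reproduced here, and the class name, method names and word-level interface are all assumptions.

```python
class RollbackFifo:
    """Packet-aware FIFO: words become readable only after the packet is accepted."""

    def __init__(self, capacity_words):
        self.capacity = capacity_words
        self.mem = [None] * capacity_words
        self.read_ptr = 0    # next word to hand to the reader
        self.commit_ptr = 0  # end of the last packet that passed its check
        self.write_ptr = 0   # end of the words written so far (tentative)

    def write_word(self, word):
        """Store one word of the packet currently being received."""
        if self.write_ptr - self.read_ptr >= self.capacity:
            raise OverflowError("FIFO full")
        self.mem[self.write_ptr % self.capacity] = word
        self.write_ptr += 1

    def end_packet(self, crc_ok):
        """Commit the packet if it is good; otherwise drop it at once."""
        if crc_ok:
            self.commit_ptr = self.write_ptr
        else:
            self.write_ptr = self.commit_ptr  # roll back: space is free again

    def read_word(self):
        """Return the next committed word, or None if none is available."""
        if self.read_ptr == self.commit_ptr:
            return None
        word = self.mem[self.read_ptr % self.capacity]
        self.read_ptr += 1
        return word


fifo = RollbackFifo(capacity_words=8)
# Hypothetical traffic: a good packet, a corrupted packet, another good packet.
for words, crc_ok in ((["a0", "a1"], True), (["b0", "b1", "b2"], False), (["c0"], True)):
    for w in words:
        fifo.write_word(w)
    fifo.end_packet(crc_ok)

received = []
while (w := fifo.read_word()) is not None:
    received.append(w)
print(received)
```

Because the rollback is a single pointer update, the space occupied by a corrupted packet is available to the next packet immediately, and no separate staging buffer is needed to hold a packet until its check completes.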
Gate count
To measure the gate count, the 0.18 µm Faraday™ standard cell library is used for gate-level logic synthesis and the Synopsys™ Design Analyzer is used to synthesize the three architectural candidates. As previously discussed in Section 4, Type A requires additional temporary memory per port, whose size is about twice the maximum packet size, regardless of the switch structure. Type B also needs an additional buffer of the maximum packet size in the multiplexed crossbar, or four times that size in the full crossbar with four VLs. Table 1 shows the synthesis results of the three candidates with one port and four VLs each. As shown in Table 2, when a full crossbar switch is implemented, the cost of five ports with Type B is enough to implement up to six ports with Type C. This gate count difference can have a significant impact on the chip or board size because an InfiniBand switch consists of up to a few tens of ports and each port has its own physical and link layers [21, 25].
IMPLEMENTATION
This section presents the implementation details of the InfiniBand link layer core. The core includes four data VLs, from VL0 to VL3, due to the limit of silicon cost (die size). Thus, the overall architecture of Type C and Fig. 1 is implemented without any modification. The 0.18 µm Faraday™ standard cell library (FSA000A) is used for gate-level logic synthesis. The gate count and propagation delay are optimized using the Synopsys™ Design Analyzer. The built-in self-test (BIST) wrappers for the memory cells are inserted before synthesis, and the scan insertion for design for testability (DFT) is performed after synthesis with the SynTest™ tools. The design includes seven dual-port synchronous SRAM cells, nine SB110040s (1048 × 64 bits) and two SB104040s (512 × 64 bits). The total gate count of the link layer core is 106 970, including the BIST wrappers but excluding the memory cells and the DFT scan logic circuits. Including the memory cells, the total gate count is 1 328 229. The normal operating clock speed is 125 MHz, and the critical path delay is about 7.2 ns on the data output path of an SB110040 through a BIST wrapper in the worst-case static timing analysis with Synopsys™ PrimeTime.

The link layer core is employed by an InfiniBand HCA SoC that provides the interface for host multimedia electronics. The HCA implementation is presented only to show how the link layer core works together with the other layers and how it is characterized in real silicon. Figure 8 shows the organization of the HCA. The chip consists of four parts: the Host I/F that is connected to the PCI bus, the transport layer, the link layer and the physical layer. It also includes an interface with external SRAMs (up to 4 Mbytes) via AMBA AHB (32-bit at 125 MHz). Two ARM922T processors are used for the transport layer to support up to a 10 Gbps network. The SerDes block at the right end is a commercial chip that serializes/deserializes the InfiniBand packets.
The HCA chip supports the InfiniBand 4× port (2.5 Gbps × 4 = 10 Gbps).
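The clock and link figures quoted above can be cross-checked with a few lines of integer arithmetic. Two inputs in the sketch are not stated in this section and should be read as assumptions: the 64-bit internal datapath width, inferred here from the 64-bit SRAM word width, and the 8b/10b line coding that InfiniBand uses at the 2.5 Gbps lane rate.

```python
CLOCK_MHZ = 125            # normal operating clock (from the text)
CRITICAL_PATH_PS = 7200    # about 7.2 ns worst-case critical path (from the text)
LANES = 4                  # 4x port
LANE_RATE_MBPS = 2500      # 2.5 Gbps signalling rate per lane
DATAPATH_BITS = 64         # ASSUMPTION: inferred from the 64-bit SRAM words

# Clock period and timing slack, in picoseconds.
clock_period_ps = 1_000_000 // CLOCK_MHZ
slack_ps = clock_period_ps - CRITICAL_PATH_PS

# Link signalling rate and the data rate left after 8b/10b coding
# (8 data bits carried in every 10 line bits).
signalling_mbps = LANES * LANE_RATE_MBPS
data_mbps = signalling_mbps * 8 // 10

# Throughput of the assumed 64-bit datapath at the operating clock.
datapath_mbps = DATAPATH_BITS * CLOCK_MHZ

print(clock_period_ps, slack_ps)
print(signalling_mbps, data_mbps, datapath_mbps)
```

Under these assumptions the 7.2 ns critical path fits within the 8 ns clock period, and a 64-bit datapath at 125 MHz matches the data rate of the 4× link after line coding.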
A fault coverage of up to 95% is achieved with the following test schemes:
• ARM922T and PrimeCell IPs: TIC vectors from ARM Ltd.
• Embedded SRAM cells: BIST chains.
• Custom logic: full scan circuit insertion and ATPG.
• I/O pads and external interface: IEEE 1149.1 JTAG boundary scan logic and test vectors.
The layout of the HCA and its demo board are shown in Figs 9 and 10, respectively. In addition, its specification is shown in Table 3 . 
CONCLUSION
The main contribution of this paper is the design of an InfiniBand link layer with a high-speed packet receiving architecture and a smart implementation of the FIFO for data buffering. The improved performance of the link layer may increase the overall network performance because it helps reduce the time taken for a packet to pass through the switch, of which the link layer is a major part. To show the improvement, the paper analyzes the maximum and minimum delays from an input to an output of a switch by statically modeling the link layer and comparing it with existing architectures. In addition, the paper performs network simulations to measure the dynamic effects of the designed link layer in a realistic environment. Through both static estimation and dynamic simulation, the proposed architecture and its smart FIFO turn out to be effective in reducing the packet transmission latency and in removing temporary buffers, consequently reducing the hardware cost. Moreover, the link layer core is shown to be efficient in the bandwidth utilization of InfiniBand while fully supporting its QoS mechanisms. The paper implements the link layer core in silicon, characterizes it in real silicon, and shows how the link layer core works together with the other layers. The proposed architecture and FIFO design might also be applicable to other state-of-the-art I/O standards that have high-speed network and switching fabric architectures, such as HyperTransport, RapidIO, PCI Express, Fiber Channel and Gigabit Ethernet.
