Abstract
Introduction
User-level Networking (ULN) mechanisms seek to remove the operating system from the critical path of communication, providing the user with direct access to the network. The operating system is used only to set up protected channels of communication, removing the costs of crossing protection boundaries during data transfers.
Several research endeavors based on user-level networking [13, 11, 6] have demonstrated its performance benefits. Industry has also taken note of ULN's potential, and attempted to standardize it in the form of the Virtual Interface Architecture (VIA) [4] specification. Hardware [7] and software [2, 3, 1] implementations of VIA have been developed, together with preliminary studies examining the performance of this platform.
However, previous studies [1, 2, 3, 12] have largely focused on maximizing the performance seen by a single VIA channel. Equally important is the performance degradation of the channels as more are opened and used, and we refer to this issue as the scalability of VIA.
Higher multiprogramming levels necessitate more network channels (one channel is needed per protection domain at the very least). Cluster applications also keep several channels open to other nodes to lower demultiplexing and setup/tear-down costs. To prevent significant performance degradation, scalable VIA designs are needed. The goal of this paper is to examine, in detail, those aspects of the VI Architecture and implementation which impact its performance as the number of channels, and the experienced load on these channels, increases.
Several factors affect the scalability of a VIA implementation. These include the hardware capabilities that are available (number of DMA engines, capabilities of these DMA engines, number of processors, etc.), hardware speeds of these devices (bus bandwidths, memory latencies, DMA transfer rates, processor clock speed, etc.), and the software that coordinates the hardware. This paper primarily focuses on the first and third issues. Specifically, the paper starts with a baseline network interface card similar to the Myrinet M2F-PCI32C [8] interface, which offers DMA and processing capabilities, and examines ways of improving it from a hardware (hardware DMA queues, doorbell support, multiple processing engines) and software perspective, keeping scalability in mind. The software components consist of the user-level messaging libraries running on the host CPU, and the firmware running on the NIC processor.
The first contribution of this paper is an in-depth investigation, using a detailed simulation model, of the different hardware and software alternatives in implementing VIA. Secondly, this paper also identifies firmware and hardware enhancements that can improve the scalability of VIA.
The rest of this paper is organized as follows. Section 2 gives a brief overview of the fundamental mechanics of VIA. Section 3 presents the issues in implementing VIA that can impact its scalability, and Section 4 gives simulation results for the issues under discussion. Section 5 concludes with a summary of the results and identifies directions for future work.
The VI Architecture
This section explains some of the terminology and provides the background for various operations that are part of a VIA implementation. A detailed description of the architecture may be found in [4], as further clarified by [5]. This paper only addresses the hard requirements of [4], in spite of its considerable ambiguity.
In VIA, each process gets direct, protected access to the NIC through what is called the virtual interface (VI). A VI is opened by a process by going through the kernel, and thereafter provides user-level communication in conjunction with the trusted firmware running on the NIC. Protection amongst processes relies on selective memory mapping and the use of trusted components, such as the kernel for setup and the NIC firmware for communication.
The VIA protocol is point-to-point and connection oriented. We assume the Unreliable Delivery mode of the protocol throughout this paper.
Each VI comprises a pair of Work Queues (WorkQs) for sending (SendQ) and receiving (RecvQ), each with its associated Doorbells. WorkQ elements called Descriptors contain all the information needed to initiate a send or a receive operation. WorkQs and the data buffers associated with them are allocated in pinned host memory.
Logically, there are three operations that an application has to go through. To send a message, an application first allocates a data buffer in registered (pinned) memory and copies the contents of the message to it. Next, it posts a Send Descriptor in the SendQ. The Descriptor contains the address of the data buffer, the length of the message, and other control and status fields. Finally, the application rings the Send Doorbell to notify the NIC of the newly posted Descriptor. Control of the Descriptor is now transferred to the NIC.
An application can then wait for send completion notification by polling or blocking on an update to the SendQ Descriptor. Or it can poll or block on a Completion Queue (CQ) that has been explicitly associated with the SendQ during the latter's creation. A single Completion Queue can be used to coalesce notifications from multiple WorkQs. The NIC, after processing the send, updates the status fields of the Descriptor and optionally puts an entry into the associated CQ. Receive side actions are very similar.
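The send path described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class and function names (`Descriptor`, `VirtualInterface`, `nic_service`) are hypothetical and are not part of the VIA specification, and the buffer allocation and copy (the first of the three operations) are omitted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    buffer_addr: int      # address of the registered data buffer
    length: int           # message length in bytes
    done: bool = False    # status field, set by the NIC on completion


class VirtualInterface:
    """Toy model of one VI's send side (all names are illustrative)."""

    def __init__(self, completion_queue=None):
        self.send_q = deque()        # SendQ, conceptually in pinned host memory
        self.doorbell = 0            # number of posted, not yet noticed descriptors
        self.cq = completion_queue   # optional CQ, possibly shared by several WorkQs

    def post_send(self, buffer_addr, length):
        desc = Descriptor(buffer_addr, length)
        self.send_q.append(desc)     # post the Send Descriptor
        self.doorbell += 1           # ring the Send Doorbell
        return desc


def nic_service(vi):
    """Stand-in for the NIC firmware: process one rung doorbell, if any."""
    if vi.doorbell == 0:
        return False
    vi.doorbell -= 1
    desc = vi.send_q.popleft()
    desc.done = True                 # update the Descriptor status field
    if vi.cq is not None:
        vi.cq.append(desc)           # optional Completion Queue entry
    return True


cq = deque()
vi = VirtualInterface(completion_queue=cq)
d = vi.post_send(buffer_addr=0x1000, length=128)
assert not d.done                    # the NIC owns the descriptor until it is serviced
nic_service(vi)
```

After `nic_service` runs, the application could detect completion either by checking `d.done` (polling the Descriptor) or by reading `cq` (polling the Completion Queue).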
Scalability Considerations for VIA
Having looked at the overall structure of VIA, we now focus on those features that could affect its scalability. Some of these features are inherent in the protocol specifications while others are implementation issues. There is some overlap between the following subsections as they look at different implications of the same design choices.
To make the remaining discussion more concrete we assume a Baseline NIC similar to the Myrinet M2F-PCI32B card [8]. It has a custom processor, memory that can be mapped into a process' address space, one host side DMA engine for data transfers to/from registered host memory, and two DMA engines (one outgoing, one incoming) as part of the packet interface to the bidirectional wire link to the network fabric. Figure 1 shows a simplified view of the NIC that is used in the discussion to follow. For the purposes of this study, scalability is the number of VI channels that can be supported without significantly degrading the performance on an individual channel. Typically, the number of channels that are active simultaneously is a subset of those that have been opened. But since there are processing costs for an open VI regardless of its usage, scalability is defined in terms of VIs opened (declared) rather than being used.
Work and Completion Queues
Descriptors and the queues used to manage them are amongst the most critical aspects of the VI Architecture. The location of the queues, the size of Descriptors, and the way that event notifications are sent can all affect the latency and concurrency of messaging operations.
Location:
The VIA specification appears to mandate the placement of WorkQs in host registered memory. However, keeping the WorkQ on the NIC can save an additional DMA in the critical path. Scalability is also improved, as is shown later. On the flip side, WorkQs on the NIC require the application to poll them across the I/O bus (for completion notification), leading to severe performance degradation.
To get the best of both approaches, we propose the use of a shadow queue on the NIC. After posting a descriptor in a WorkQ in host registered memory, the application additionally writes the control segment part of the descriptor to a per-VI shadow queue maintained on the NIC. This allows the NIC to proceed with data transfer immediately after it notices the doorbell. Completion status is, however, still sent to the host resident WorkQ. Polling on it remains efficient and the semantics of completion calls are maintained. Also, the NIC can reuse a shadow queue entry independent of the corresponding Descriptor's lifetime in host memory.
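As a rough illustration of what the shadow queue saves, the following sketch counts host DMA operations for one send. The breakdown used here (one descriptor fetch, one data fetch for non-immediate messages, one status write-back, one optional CQ entry) is an assumed reading of the text rather than a statement of any particular implementation, and the function name is hypothetical.

```python
def host_dma_ops(long_message, shadow_queue, completion_queue):
    """Count host-side DMA operations for one send (assumed breakdown)."""
    ops = 0
    if not shadow_queue:
        ops += 1   # NIC fetches the descriptor from the host-resident WorkQ
    if long_message:
        ops += 1   # NIC fetches message data that does not fit in the descriptor
    ops += 1       # completion status is still written to the host-resident WorkQ
    if completion_queue:
        ops += 1   # optional entry in the associated Completion Queue
    return ops


without_shadow = host_dma_ops(long_message=True, shadow_queue=False, completion_queue=True)
with_shadow = host_dma_ops(long_message=True, shadow_queue=True, completion_queue=True)
```

Under these assumptions the shadow queue removes exactly one small DMA from each send, while the completion path is unchanged.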
Size of Descriptor payload: Independent of the above design option, the payload of a descriptor (in the immediate data segment) can be increased beyond the 4 bytes suggested by the example implementation in the VIA specification document [4]. For a message smaller than this payload, henceforth called an immediate message, an additional DMA (to get the data) could be eliminated. The Fast Descriptors idea of [3] also suggests increasing the payload to 16-32 bytes, without, however, a quantitative analysis to back the recommendation. The effect of descriptor payload size on scalability comes from a reduction in latency and from reduced contention for the DMA engine.
Use of Completion Queues:
The use of CQs reduces the host CPU time spent on polling for notifications. This not only improves the CPU utilization for useful work, but also allows it to do more protocol processing and indirectly allows more VIs to be active concurrently. On the flip side, the firmware on the NIC has to perform an additional DMA operation for each completion notification. Hence, it is not intuitively clear how CQ usage affects scalability overall.
Firmware design
Perhaps the most critical aspect of the performance of a VIA implementation is the design of the firmware that runs on the NIC. The elimination of the kernel from the critical path offloads much of the protocol processing load onto the NIC, as it is the only trusted component in the critical path. This also frees the host CPU for other useful work. However, the NIC typically cannot match the processing power or memory resources of the host. Hence, the firmware running on the NIC has to be carefully designed for optimizing performance with limited resources. Complex, memory- or compute-intensive code rarely fits the bill.
Doorbell support:
The doorbell mechanism is used to alert the NIC that a descriptor has been posted and needs processing. The most efficient way of doing this is to write to a mapped portion of the NIC memory which is polled by the NIC processor. This wastes NIC memory, as outlined in [3]. Collation of VI doorbells owned by the same host process onto a single NIC page helps somewhat.
As the number of declared/open VIs increases, the number of locations that must be polled increases, regardless of their usage or collation onto a single page. The penalty for the extra polling is paid by all the VIs, existing and newly added. This could cause a severe degradation in performance.
The problem can be alleviated to some extent by carefully choosing the order in which the firmware polls the different doorbell locations. If load is light with only one or a few VIs sending messages, the firmware could favor these active VIs; an extreme being the case where the firmware does not move on to another VI as long as the current one is ringing the doorbell. The latter is rather tempting when the performance criterion is the latency of only one VI. For more evenly distributed loads, a round-robin approach may be more suitable. In all cases, the complexity of determining the order affects the delay seen by the target workload(s).
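The cost of round-robin polling can be made concrete with a small sketch. It assumes a single rung doorbell, a uniformly random position of the polling loop when the doorbell is rung, and one unit of cost per location read; the function names are hypothetical.

```python
def polls_until_detected(num_declared, start, rung):
    """Round-robin scan: locations read before a rung doorbell is found."""
    for step in range(num_declared):
        idx = (start + step) % num_declared
        if idx in rung:
            return step + 1
    return None  # nothing was rung during one full cycle


def average_polls(num_declared, target):
    """Average detection cost over every possible starting position."""
    total = sum(
        polls_until_detected(num_declared, start, {target})
        for start in range(num_declared)
    )
    return total / num_declared


small = average_polls(4, 0)       # 4 declared VIs
large = average_polls(508, 100)   # 508 declared VIs, same single active VI
```

In this model the average detection cost is (d + 1) / 2 reads for d declared VIs, so it grows linearly with the number of declared VIs even though only one VI is active.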
DMA operations and event ordering:
The host side DMA engine is an important element in the critical path, as all NIC-initiated data transfers to and from host memory have to use this resource. For a WorkQ with an associated Completion Queue, as many as 3 host DMA operations unrelated to the message data have to be done (as shown in Figure 2 and explained in detail later). The host DMA for message data is of variable length.
When the firmware, during the course of message processing, finds the host DMA engine busy, it has to wait. How it spends this wait time is very important for scalability. A busy wait, especially for a long message, could be very expensive as it directly adds to the wait-for-service times seen by other messages on the same or on different VIs. Slightly better is to poll for doorbells while waiting, so that the average wait for detection of a rung doorbell is reduced. However, the first successful doorbell poll ends this overlap, unless a software queue of rung doorbells is maintained. One could also start processing an incoming message (from the wire) but might have to wait for the same host side DMA engine.
In either case, it is apparent that it might be advantageous to break up the processing of a message into well-defined stages of a pipeline. Such a breakup allows the NIC CPU to freely go from any stage of one message to any other stage of another. This requires that the state of a stage be maintained efficiently, allowing the suspension and resumption of state to be determined by asynchronous events rather than a predetermined firmware order. The asynchronous events are those whose occurrence is beyond the firmware's control, such as the ringing of a doorbell, the arrival of a message over the wire, and the completion of a DMA. One way of achieving the pipeline mentioned is to have a task queue for each DMA engine. Details of such a design and performance results for the same are presented in Section 4. Pipelining as an idea is not new even in the context of network interfaces [14].
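A minimal sketch of the task-queue idea follows, assuming a discrete time step and tasks given as hypothetical (label, duration) pairs. It is not the firmware modeled in this paper; it only shows how one event loop can keep several DMA engines busy without busy-waiting on any single one.

```python
from collections import deque


class DmaEngine:
    def __init__(self, name):
        self.name = name
        self.tasks = deque()   # software task queue for this engine
        self.current = None    # in-flight (label, duration) task, if any
        self.busy_until = 0    # time at which the in-flight transfer finishes

    def submit(self, task):
        self.tasks.append(task)

    def step(self, now, completed):
        # Retire the in-flight transfer once its time has elapsed.
        if self.current is not None and now >= self.busy_until:
            completed.append((self.name, self.current[0]))
            self.current = None
        # Start the next queued transfer as soon as the engine is free.
        if self.current is None and self.tasks:
            self.current = self.tasks.popleft()
            self.busy_until = now + self.current[1]


def run(engines, max_time=100):
    """Event loop: visit every engine each tick instead of waiting on one."""
    completed = []
    for now in range(max_time):
        for engine in engines:
            engine.step(now, completed)
        if all(e.current is None and not e.tasks for e in engines):
            break
    return completed


host = DmaEngine("host")
net = DmaEngine("net_send")
host.submit(("vi0-data", 3))
host.submit(("vi1-data", 2))
net.submit(("vi2-packet", 4))
order = run([host, net])
```

Here the network send for one VI proceeds while the host DMA engine works through transfers for two other VIs, which is the kind of overlap across messages that the pipelined design aims for.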
NIC hardware
Adding hardware support for various operations that have been identified as potential scalability bottlenecks is an interesting exercise. This support can be classified into two types: those which improve the performance of the NIC in general and hence allow more VIs to be processed in the same time, and those which are aimed at the VIA protocol in particular.
Hardware doorbell support: The description of the "doorbell region" in the newer Myrinet M2L-PCI64A NIC [9] reveals one way in which hardware support for doorbells could be provided. A certain (large) portion of the I/O address space is managed by the doorbell hardware such that all writes by the host into the entire space appear in FIFO order as seen by the NIC. This serves to collate the doorbell entries, across processes and VIs, into one or more queues. The hardware traps the memory references, and the physical memory required is only that of the expected maximum length of the doorbell queues, not that of the whole doorbell space.
DMA engines: Increasing the concurrency of DMA usage could cut down the overhead of waiting for a DMA engine to finish (whether the firmware is waiting to use it for the next transfer or is waiting to confirm completion of the current transfer). This can be done in two ways. A DMA engine could be programmed to initiate more than one DMA operation simultaneously; [9] permits up to 4 DMA channels to operate concurrently. Additionally, the DMA engine could read the parameters of the next DMA transfer request from a NIC memory resident queue instead of requiring the NIC processor to program it each time around. This feature is especially useful when a scatter-gather list is used within a Descriptor. Even otherwise, it provides hardware support for one stage of a pipelined firmware design and hence could improve scalability. [9] provides this kind of DMA hardware queue.
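The effect of collating doorbell writes into a FIFO can be sketched as below. This is a software stand-in for the hardware behavior described above, with hypothetical names; the handling of a full queue is an assumption, since the text does not describe an overflow policy.

```python
from collections import deque


class DoorbellFifo:
    """Toy model of a hardware doorbell queue shared by all VIs."""

    def __init__(self, capacity):
        self.capacity = capacity   # expected maximum queue length, not the VI count
        self.fifo = deque()

    def host_write(self, vi_id):
        """A host doorbell write is recorded in arrival order."""
        if len(self.fifo) >= self.capacity:
            return False           # assumed behavior: reject when full
        self.fifo.append(vi_id)
        return True

    def nic_next(self):
        """The NIC reads one queue head, however many VIs are declared."""
        return self.fifo.popleft() if self.fifo else None


bells = DoorbellFifo(capacity=16)
for vi_id in (507, 3, 507):
    bells.host_write(vi_id)
seen = [bells.nic_next() for _ in range(4)]
```

The NIC-side cost of finding the next rung doorbell is a single read, independent of the number of declared VIs.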
Performance Evaluation

Simulator and Workload
To examine some of the above issues, a simulator has been developed for a cluster node with a NIC capable of supporting VIA. The node is a 4-way symmetric multiprocessor and the NIC is similar to a Myrinet interface [9]. The network fabric is assumed to contain a small number of high-degree switches with negligible latency and contention. It is also assumed that error rates are negligible and packet order is preserved.
All these assumptions are applicable to the Myrinet switches, and deviations from these assumptions are not expected to significantly change the results and conclusions drawn from this study. All software and hardware elements of the host and the NIC are modeled. In particular, the PCI bus (64-bit, 66 MHz has been chosen here), long identified as the primary bottleneck on the host side for ULNs, has been modeled in great detail, including the PCI bridge. Similarly, the NIC firmware and hardware were modeled in detail, based on our extensive experience in programming and using different Myrinet cards. The simulator development has been undertaken in an industrial setting, incorporating a lot of proprietary hardware information, and these models have been used in product design exercises as well.
Microbenchmark workloads have been chosen to highlight each issue under investigation.
Each 4-way SMP node runs 4 identical processes, each of which establishes a separate VI channel with one other process on every remote node. The number of VIs at each node thus increases as 4 * (N - 1), where N is the number of nodes in the cluster. Considering cluster sizes (N) of 2, 8, 16, 32, 64 and 128 nodes, we have d = 4, 28, 60, 124, 252 and 508 VIs created (declared, though not necessarily used).
Each process on a node "owns" (N - 1) VI channels, which is henceforth referred to as the subrange for that process. This model closely resembles the recommended usage in the VIA specification [4].
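The declared VI counts follow directly from the formula above and can be checked with a short calculation. The function and constant names are illustrative only.

```python
PROCS_PER_NODE = 4  # one process per CPU on the 4-way SMP node


def declared_vis(num_nodes, procs_per_node=PROCS_PER_NODE):
    """VIs declared at one node: each process opens one VI per remote node."""
    return procs_per_node * (num_nodes - 1)


def subrange_size(num_nodes):
    """VIs owned by a single process."""
    return num_nodes - 1


cluster_sizes = (2, 8, 16, 32, 64, 128)
counts = [declared_vis(n) for n in cluster_sizes]
```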
An active VI is one that is actually being used to transfer messages. The number of active VIs across all processes on a node (u) at a given time is obviously no larger than the number of declared VIs (d). This distinction is made to highlight the fact that some NIC designs have to do a certain amount of work for each declared VI regardless of its usage. Logically, an application process iteratively performs a send followed by a receive on each active VI, though the separation between the send and the receive may change depending on the workload:
Type I: One VI, randomly chosen from the subrange, is active. A send on this VI is immediately followed by a receive before moving on to the next iteration.
Type II: All VIs in the subrange are active. A send is performed on all these active VIs before doing receives on all of them. Hence, the message transfer on any one VI can enjoy the benefit of overlap with transfers on other VIs of its subrange.
Type III: A fixed (small) number of VIs in a subrange are active. These are used in the same pattern as Type II workloads.
Type IV: A Type III workload in which more than one message operation (send or receive) is performed on each VI at a time. The intention is to have several outgoing/incoming messages on a VI, to capture situations where an application may be exchanging relatively long messages that are packetized into multiple shorter messages seen by the underlying VI layer, or is sending messages in close succession on a channel.
Each workload uses either immediate messages (which fit within the payload of a descriptor) or long messages (where the data has to be retrieved separately from the descriptor). System load is defined as the number of VI channels which are used actively out of the ones that have been declared open by a process. The load on any single active channel remains constant as we increase the number of active/declared VIs in an experiment.
The metric used for evaluation is the round trip latency incurred by messages on a node, averaged across all active VIs. Various design options have been studied and representative results are presented here. The scalability of a particular option is seen primarily in the rate of increase of latency with VIs (declared or active). However, the absolute latencies are also important, as will be clear from the discussion of results. We have also collected detailed statistics on utilization of buses, processors, and DMA engines, together with delays through different stages of the message transfer process. We do not present all these results, and briefly allude to them only when necessary.
Messaging Sequence
It is important to understand the messaging sequence for the rest of this discussion. Figure 2 shows the different actions/events in a message send, at the user-level library, the processor on the NIC, and the DMA engines (the send interface to the network and the interface to host memory). This sequence, which is similar to the implementation in [3], assumes that a long message is sent, descriptors are maintained in host memory, and completion queues are enabled.
Figure 2. Sequence of Actions in Sending a Message
This NIC is assumed to not have any other hardware support for message processing, and is referred to as the Baseline NIC that has been shown earlier in Figure 1. After a Send Descriptor is posted by the host library and the Send Doorbell has been rung, it takes a while for the NIC to notice it. In the following discussion, we take such a Base NIC design (no hardware support, and sequential execution of the messaging stages) and progressively refine it with both firmware and hardware enhancements to understand their impact on scalability.
Base Design: Need for a scalable solution
To motivate the need for looking at scalability, the Base NIC is used. Here, the NIC processor polls the doorbell queues for each declared VI in round robin order. Figure 3 shows the effect of declared (unused) VIs on a Type III workload with 8 active VIs (u=8) at a node, and 38 byte immediate messages. The latency increases almost linearly. This follows directly from the time spent by each message waiting for service, as seen by the second curve in the figure (labeled wait), which almost parallels the latency curve. The serial processing of messages means that after a doorbell has been rung, a message may have to wait in the send queue while the NIC finishes processing another message (Stage 1 in Figure 2). In addition, it has to wait for a period that is roughly half the time taken by the NIC to poll the doorbell queues of all other VIs. This is seen as the length of Stage 2 in Figure 2.
Figure 4 shows the same effect for a Type III workload where a maximum of 60 VIs are active (u is set to minimum(d, 60)). For d = 4, 28, 60, this translates to a Type II workload (all declared VIs are active), and for the rest it represents a constant application-offered load. A sharp increase is seen in the Type II part of the curve (the application load is increasing), while the Type III part shows a reduced slope (the application load is constant). The Type II part shows the waiting time increase caused mainly by the processing of messages in other VI queues, while the slope in the Type III part captures the effect of the delay caused by unsuccessful NIC polls alone. Scalability improvements lie in the reduction of latencies in both parts of the curve.
To demonstrate the effect of descriptor payload size, a Type II workload with immediate messages of various sizes is used. To further isolate the effect of NIC event processing overheads, the shadow queue described in Section 3.1 is used so that the NIC does not need to DMA the descriptor down after it detects a message waiting to be sent on a VI. Results are shown in Figure 5 for 8, 32, 64 and 128 byte messages. In each case, the data fits within the descriptor, and so the number of DMA operations (at the receive end) incurred is the same. The results indicate that a 32 or 64 byte message can incur a lower latency than a message of 8 bytes for 508 declared channels. This non-intuitive result can be explained by examining what happens on the NIC. Since it polls each queue in round robin order, if the card finishes processing one message fast enough, it can miss the posting of the send on the next VI (which then has to wait a full cycle of doorbell checks). With a slightly longer message, this does not happen, as the card takes longer to finish processing the send (it also takes longer for the host to finish posting its send, but that is not enough to compensate for the other factor).
While this result is a special case, it does serve to illustrate that descriptor payload sizes affect both raw latency and scalability. Hence, workload-specific setting of the payload size may be considered.
Pipelined/Overlapped firmware design
The wait for completion of DMA operations wastes the CPU resource on the Base NIC (Stages 4, 6, and 8 of the pipeline in Figure 2). Overlapping DMA operations with the processing of another VI can improve the throughput and scalability of an implementation. To study this, we model a NIC firmware which maintains a software task queue for each of the DMA engines. An event (such as detection of a doorbell) causes an entry to be inserted into the task queue of the appropriate DMA engine (if it is not free). The DMA operation is initiated whenever the entry reaches the head of the queue. The firmware is thus reduced to an event processing loop which schedules work to and from each of these queues, and this model is referred to as Base+O. It should be noted that the queue is maintained in software.
Figure 6 compares the latencies of Base+O (abbreviated to B+O in the graph) with Base. The workload is of Type III where 12, 28 and 60 VIs are active/used, labeled u12, u28 and u60. The benefits of pipelining are evident in all the workloads, with the benefit becoming more pronounced at higher loads (larger number of active VIs). A larger message size of 4 KB is used in these experiments, which allows a fuller scope for overlap. The pipelined design used here allows the DMA engine transfers to overlap for either the same message (if scatter-gather is used), or different messages from the same or different VIs. Only the latter effect is seen here.
To observe the effect of the length of the DMA operation, the same workload is exercised with smaller (128 byte) messages in Figure 7. Here, the results are reversed, with the latencies increasing (more for higher loads) for a pipelined firmware. The length of the DMA operation is not sufficient to overlap with processing, and the overhead of queue/state maintenance actually degrades performance. The benefits of a software pipeline would depend on the length of Stages 6 and 8 in Figure 2. The lengths of the DMA transfers in Stages 4, 10 and 12 are small and independent of the message size for non-immediate messages. As such, they offer little scope for overlap.
Hardware DMA Queues
There are a couple of disadvantages with the pipelined firmware operation in the previous section. First is the cost associated with software queue maintenance. Second is the lag/gap between the completion of a DMA operation and the detection of that condition by the firmware to initiate the next operation (as opposed to the situation in Figure 2 where Stages 5, 7, and 11 can immediately succeed Stages 4, 6, and 10 respectively). An interrupt could be used on DMA completion, but it adds to the overheads.
Hardware support for queuing of DMA entries, as in [9], could ameliorate these problems. Upon completion of one DMA operation, the DMA engine could directly proceed to the next queued operation. We refer to this model as Base+O+DMAQ.
We compare the benefits of a firmware designed with this hardware support to Base+O. Figure 8 shows that this hardware support (abbreviated as B+O+D) does not help for larger messages (4 KB). Figure 9 reveals that short messages (128 bytes or lower) do benefit, with the effect being more pronounced for higher loads. However, these latencies are still worse than a non-overlapped design (compare to Figure 7).
The provision of hardware support for doorbells enables the NIC to avoid polling all declared VI channels. In terms of Figure 2, Stage 2 can be eliminated completely. Figure 10 demonstrates the significant improvement in latencies with this feature (Door+O+DMAQ, abbreviated as D+O+D) for a Type III workload using a varying number of active VIs (12, 28 and 60) and 128 byte messages, compared to Base+O+DMAQ (the only difference between the designs being compared is the hardware doorbell support). We find that the design with hardware doorbell support keeps the latency constant for an offered application load, while the latency grows with the number of declared VIs (even when the offered load remains constant) for the cases without hardware support. This directly results from the wait time experienced by a message before it is noticed by the NIC.
The benefits of hardware doorbell support are not really felt by long messages (the graphs are not explicitly shown here). From Figure 8, it is seen that the curves automatically flatten out for a certain offered load with long messages, because of the sufficient overlap. Hence a hardware doorbell is not really needed in such cases, and the overlap can achieve the same effect.
Separate Send and Receive Processing
We can stretch the hardware support even further by providing two relatively independent pipelines for processing sends and receives respectively, so that these logically separate operations interfere with each other minimally. This can be achieved by providing two processors and two hardware doorbell queues, one each devoted to the send and receive pipelines. A hardware doorbell queue is not strictly needed for receive doorbells, since they need be answered only when a message comes in for that VI, and the message destination can be used to directly get to the corresponding doorbell. However, memory management of NIC buffers is easier with a unified way of answering send and receive doorbells. The separate Send and Receive DMA engines to/from the network (present even in the Base NIC) ensure that the only point of contention between the two pipelines is the Host DMA. While this could be alleviated to some extent by providing multiple DMA channels within the same DMA engine, we have not modeled it. Figure 11 shows the large improvements in latencies seen with this enhancement. The improvement is more pronounced with higher operating loads.
Shadow Queues
Further improvements for a hardware supported pipelined firmware are possible by balancing the pipeline segments to improve throughput and, consequently, latency seen across VIs. Using the shadow queues (with MMIO) that have been proposed in Section 3.1 eliminates the small sized DMAs for descriptors from the critical path, giving the opportunity for further overlaps in processing. The reduction in latency is shown in Figure 12. The workload used here is of Type III with 128 byte messages.
Putting it all together
Through this section, we have progressed from a Base NIC which provides minimal hardware support (a single host DMA engine, two network DMAs, some memory, and a programmable processor) to a sophisticated NIC (Door+O+DMAQ+2CPU) supporting DMAs with hardware queues, hardware doorbells, and separate processors to handle sends and receives respectively. Correspondingly, the firmware has also moved from a sequential mode of operation to one which does a sophisticated overlap of operations, together with shadow queues to even eliminate some parts of the messaging pipeline. We have incrementally made this progression, showing the benefit of each enhancement along the way. We can put the overall benefits of these enhancements in perspective by comparing the sophisticated design with the baseline. Figures 13 and 14 show the improvements with the enhancements for the different message sizes (128 bytes and 4 KB respectively). We find the improvements materializing not just in the increasing load region of the curves (the initial steep part), but also continuing in the constant load region. The improvements are seen for both low (12 active VIs) and high (60 active VIs) offered loads.
Concluding Remarks and Future Work
Scalability is an important consideration, especially as multiprogramming levels continue to increase and each executing application keeps several open communication channels to avoid setup and tear-down costs. Previous investigations into ULNs, including those specific to VIA [1, 3, 12], have not looked at scalability in depth.
This paper has identified some of the hardware and software design issues affecting the scalability of a VIA implementation, at the host and the network interface card (NIC).
It has then evaluated many of these issues using a detailed simulator. The evaluation exercise has started with a naive (baseline) implementation of VIA on a fairly primitive NIC. It is shown that such an implementation has serious scalability problems, with the latencies increasing rapidly not only with the offered load, but also with the number of channels declared. This motivates the study of scalable VIA designs.
The baseline design has then been progressively refined, one enhancement at a time, to study its impact on scalability. Each enhancement exercise has given interesting insights that can be useful to hardware designers for incorporating scalability-aware features, and to software developers for adapting to the underlying hardware based on application characteristics:
On a NIC without extensive hardware support (doorbells or hardware DMA queues), it is important to overlap the NIC CPU processing with DMA operations when the messages are long. For small messages, queue maintenance overheads dominate and the NIC CPU is better off doing a busy wait.
Contrary to expectations, support for hardware queues in DMA engines does not appear to significantly improve scalability. For long messages, the overlap design with software based queues does quite well by itself. For short messages, while the hardware queues do improve performance compared to the software approach, the scalability is still worse compared to a non-overlapped design. It appears that accommodating concurrent DMA transfers, as was discussed in Section 3.3, may be a better alternative.
Clearly, hardware doorbell support improves scalability across the entire spectrum of message sizes. The resulting system is more scalable not just in the constant offered load region, but also with increasing load (number of active VIs).
Providing separate send and receive pipelines with two CPUs on the NIC significantly improves the latency and scalability of the system.
Cutting down on a DMA operation for the descriptors definitely improves message latency, as well as the scalability of the system. To achieve this goal, this paper proposes using a shadow queue on the NIC while the actual descriptors are maintained in host memory.
There are several issues, some of which are identified in Section 3, that have not been investigated in depth. The most important of these is the order of event processing in the NIC firmware. We would also like to explore the impact of NIC memory, translation lookup and different hardware parameters on the scalability of VIA. Having looked at VIA from latency, throughput and scalability angles, it is a natural progression to examine the different issues from a Quality of Service (QoS) perspective. QoS issues also require a study of the interactions between the NIC/ULN and the CPU scheduler [10]. These investigations are part of our ongoing and future work.
