Abstract-Several companies have introduced powerful network processors (NPs) that can be placed in routers to execute various tasks in the network. These tasks range from IP-level table lookup algorithms to application-level multimedia transcoding. An NP consists of a number of on-chip processors that carry out packet-level parallel processing operations. Ensuring good load balancing among the processors increases throughput. However, such multiprocessing also gives rise to increased out-of-order departure of processed packets. In this paper, we first propose a Dynamic Batch Co-Scheduling (DBCS) scheme to schedule packets in a heterogeneous network processor, assuming that the workload is perfectly divisible. The processed loads from the processors are ordered perfectly. We analyze the throughput and derive expressions for the batch size, scheduling time, and maximum number of schedulable processors.
To effectively schedule variable length packets in an NP, we propose a Packetized Dynamic Batch Co-Scheduling (P-DBCS) scheme by applying a combination of the deficit round robin (DRR) and surplus round robin (SRR) schemes. We extend the algorithm to handle multiple flows based on a fair scheduling of flows according to their reservations. Extensive sensitivity results are provided through analysis and simulation to show that the proposed algorithms satisfy both the load balancing and in-order requirements in packet processing.
I. INTRODUCTION
With the advent of powerful network processors (NPs) in the market, many computation-intensive tasks such as routing table look-up, classification, IPSec, and multimedia transcoding can now be accomplished more easily in a router.
Such an NP-based router permits sophisticated computations within the network by allowing its users to inject customized programs into the nodes of the network [1]. An NP provides the speed of an ASIC and at the same time is programmable. Each NP consists of a number of on-chip processors that can provide high throughput for network packet processing and application level tasks [2], [3], [4]. However, processing of packets belonging to the same flow by different processors gives rise to out-of-order departure of the packets from the NP and incurs high delay jitter for the outgoing traffic. For TCP, it has been proved that out-of-order transmission of packets is inimical to the end-to-end performance. For many applications like multimedia transcoding [5], it is imperative to minimize this out-of-order effect because the receiver may not be able to reorder the packets easily or tolerate high delay jitter. Today's receivers vary widely, from palm devices and PDAs to desktops, and may or may not have enough storage and reordering capabilities. Examples of multimedia transcoding in an active router are found in the MeGa project [6] of the University of California, Berkeley, and the Journey network model [7] at NEC-USA, where routers provide customizable services according to packet requests. Efficient packet scheduling is necessary in order to guarantee both high throughput and minimal out-of-order departures of packets. However, these two goals contradict each other because scheduling on more processors increases throughput but also increases out-of-order departure of packets.
Packet processing in an NP can be considered similar to link aggregation techniques that employ multiple physical links from a source to the same destination. Link aggregation provides increased bandwidth and reliability between two devices (switch-to-switch or switch-to-station) as more channels are added. Implementations include the Cisco EtherChannel in the Cisco ONS 1500 Series based on the proprietary Inter-Switch trunking (ISL), Adaptec's Duralink port aggregation, and products from 3COM, Bay Networks, Extreme Networks, Hewlett Packard, and Sun. A practical link striping protocol, called Surplus Round Robin (SRR), is proposed by Adiseshu [8] to schedule variable length packets over multiple links with different capacities. They demonstrate that striping is equivalent to the classic load-balancing problem over multiple channels. They solve the variable packet size problem by transforming a class of fair queuing algorithms into load sharing algorithms at the sender. Although their solution is elegant and efficient, it requires the receiver to run a corresponding resequencing algorithm to ensure in-order delivery of packets.
The aim of this paper is to derive an efficient packet scheduling algorithm for a network processor that comprises a number of processors (or channels) for packet processing. We derive a Dynamic Batch Co-Scheduling (DBCS) algorithm that considers a backlogged queue and can be applied to dynamic arrival of packets in an overload situation. Expressions for load distribution in heterogeneous network processors are derived first by assuming that the schedulable workload is perfectly divisible in terms of bytes. The divisible load theory (DLT) for parallel processing has been suggested in [14], [15]. However, these works did not consider sequential ordering of the processor execution times, as required for packet transmission over the output link. Our algorithm schedules the packets in batches by computing the optimal batch size, scheduling time, and number of schedulable processors given the maximum packet size and network processor parameters. A batch is similar to the concept of a time epoch at which scheduling is done. Several interesting results are derived regarding the scalability of our algorithm.
Because the arriving packets cannot be distributed to processors in bytes, we also derive a packetized version of the DBCS algorithm by applying a combined version of the DRR and SRR algorithms [16], [8]. The P-DBCS algorithm produces better results in terms of throughput and out-of-order rate compared to round robin and pure SRR schemes. We then extend the P-DBCS algorithm to handle multiple flows having reservations. When applying the P-DBCS algorithm to multiple flows, both load balancing and fair scheduling requirements should be satisfied.
Hence, we revise the expression of the batch size and the packet dispatching condition to reflect the new requirements. Finally, we perform a number of simulations and sensitivity studies to verify the accuracy of our theory and to obtain performance results over wide-ranging input parameters.
The rest of the paper is organized as follows. In section II, we present the preliminaries and certain design issues in a network processor. In section III, we design the Dynamic Batch CoScheduling (DBCS) algorithm through theoretical derivation and analysis. In section IV, we propose and design a packetized version of the DBCS algorithm, called Packetized-DBCS (P-DBCS), to deal with variable length packets. In section V, we present how to achieve fairness among multiple network flows using P-DBCS. Simulation results are presented in section VI in comparison with several other schemes. Finally, in section VII, we conclude the paper with possible future extensions.
Figure 1 illustrates the multiprocessor architecture model of a router using a network processor (NP). The NP consists of one dispatching processor $P_d$, a few worker processors $P_1$ through $P_N$, and a transmitting processor $P_t$. The Intel IXP NP divides its set of microengines this way for packet processing [2]. The dispatching and transmitting processors communicate with the I/O ports sequentially. The dispatching processor $P_d$ schedules incoming packets among the worker processors for packet processing. The transmitting processor $P_t$ receives packets from processors $P_1$ through $P_N$ and sends them to the output port. The aim of the packet scheduling algorithm is twofold: 1) the input load is balanced among the processors $P_1$ through $P_N$; 2) the flow order is maintained when the packets are transmitted to the transmitting processor $P_t$.
A similar problem called channel striping has been addressed in the literature by Adiseshu [8]. There are $N$ channels between the sender and the receiver. The sender implements the striping algorithm SRR to stripe incoming traffic across the $N$ channels, and the receiver implements a resequencing algorithm to combine the traffic into a single stream. The striping algorithm aims to provide load sharing among multiple channels. It does not consider the transmission order among the packets in different channels. Hence, the receiver needs to run a resequencing algorithm to restore the packet order of the original flow. A strict synchronization between the sender and the receiver is difficult to implement.
Although the packet scheduling problem looks different from the channel striping problem, there are many similarities. First, in channel striping, the packets are transmitted over the channels. In NPs, the packets are processed on the worker processors. These two times are equivalent and proportional to the load size in bytes. Second, in channel striping, the time to move packets from the single input port to different channels is assumed to be negligible in [8]. Actually, there should be a time overhead in executing SRR at the IP level processing. As for packet scheduling in an NP, the time to move packets from the dispatching processor to a worker processor cannot be ignored because of the time taken by the dispatching processor and the transmission between the two processors.
Finally, the transmitting processor in an NP removes the packets from the worker processors on an FCFS basis, whereas the receiving processor in the striping model executes a resequencing algorithm. Hence, the models developed in this paper are applicable both to packet processing in an NP and to packet transmission over multiple channels.
For the rest of the paper, we explain our models and algorithms in terms of NPs without loss of generality. The whole procedure experienced by one packet when it is processed in an NP consists of three steps: 1) D-step: the dispatching processor dispatches the packet to a worker processor; 2) P-step: the worker processor processes the packet; 3) T-step: the worker processor sends the packet to the transmitting processor. Correspondingly, in the channel striping problem, only one step, namely the P-step, is modeled, if we take transmission of a packet as a type of processing.
Hence, the packet scheduling problem in an NP is more complicated. We propose a sequential completion pattern of the packet processing and thus a sequential data delivery by the worker processors, as illustrated in Figure 2. Let there be $N$ worker processors in the router. Each worker processor $P_i$, $\forall i$, first receives some packets from the dispatching processor $P_d$ (D-step), then processes these packets (P-step), and finally sends the packets to the transmitting processor $P_t$ (T-step) sequentially. We propose to let the dispatching processor $P_d$ distribute the packets among the $N$ worker processors $P_1$ through $P_N$ in such a way that each worker processor completes processing and transmission sequentially. In other words, the worker processor $P_i$ ($i = 2, \ldots, N$) starts delivering packets to $P_t$ immediately after $P_{i-1}$ completes its T-step. Therefore, sequential packet delivery is ensured, and the load is balanced among multiple processors. We call a scheduling algorithm that produces such a scheduling pattern Dynamic Batch CoScheduling (DBCS). A nice property of the above model is that sequential data delivery is achieved without any additional control. Simply by arranging the computation and communication phases, a scheduling algorithm is obtained.
However, to design a practical algorithm, there are many issues to be addressed. First, how many packets should be dispatched to each worker processor to produce the desired pattern? Second, what if the packets are of variable lengths? Third, given a network processor configuration and any packet arrival rate, can we always find a way to schedule packets in such a sequential delivery pattern? Fourth, how can fair scheduling be ensured among multiple flows having different reservations? The rest of the paper analyzes these situations and provides answers to the above questions. The complete DBCS algorithm is designed to schedule packets over multiple scheduling rounds, taking care to always maintain the sequential delivery of multiple batches of data. The scalability of the DBCS algorithm is analyzed and important conclusions are reached. To schedule variable length packets, a packetized version of the DBCS algorithm (P-DBCS) is devised based on the combination of DRR and SRR. In this way, we always minimize the absolute difference between the actual load distribution and the ideal load distribution. Finally, the P-DBCS algorithm is extended to handle multiple flows having reservations.
III. DYNAMIC BATCH COSCHEDULING
In this section, we build a theoretical model for the general packet scheduling problem in an NP, and develop a load scheduling algorithm, Dynamic Batch CoScheduling (DBCS), to produce the scheduling pattern described in Figure 2. There are two design goals of DBCS. The first is to ensure load balancing for a group of heterogeneous processors. The second is to ensure strict in-order delivery of data. In the following derivation, we assume that the input load is divisible at the granularity of one byte.
A. Load Distribution in a Single Batch
For an NP, we set up the following mathematical model for ease of theoretical analysis. As shown in ... Meanwhile, each worker processor has a direct link $l_{s,i}$ to the transmitting processor $P_t$. The dispatching processor receives packets from the input link, divides the input load into $N$ parts, and then distributes these load fractions to the corresponding worker processors. Each worker processor $P_i$ starts processing immediately upon receiving its load fraction $\alpha_i$ and continues to do so until this fraction is finished. Finally, $P_i$ sends the processed fraction to the transmitting processor.
As demonstrated in Figure 4, the load is partitioned among processors such that all the worker processors stop processing sequentially and therefore deliver the processed fractions to the transmitting processor sequentially. That is, worker processor $P_{i+1}$ starts delivering packets to the output port only after, and right after, $P_i$ completes its delivery. Thus, all the packets are processed in parallel and are sent to the output port in order without any break. In this way, we eliminate out-of-order packet delivery. To achieve such a sequential delivery pattern, we obtain recursive equations for $i = 1, \ldots, N-1$ from the timing diagram.
To find a feasible batch size, we note that at least one packet should be dispatched to each processor in a scheduling round. Therefore, we define an important system parameter, the minimal schedulable batch size $I$, as the total number of bytes that should be scheduled in one scheduling round. Suppose the maximal possible length of a packet (in bytes) that may arrive at the NP is $L$. The minimal schedulable batch size $I$ is defined to be $I = CL$, where $C$ is the minimal positive integer such that every load fraction can hold a packet of maximal length (equation 5).
Equation 5 guarantees that at least one packet fits into the load fraction that can be dispatched to a worker processor. Hence, $CL$ constitutes a minimal load size that should be guaranteed for one scheduling round. Given $I$ as the minimal batch size, the batch size $B$ is set to be a multiple of $I$ as follows: $B = mI$,
where $m$ is a positive integer referred to as the batch granularity. In the following section, we will see how the batch granularity $m$ affects the system throughput given the minimal schedulable batch size $I$. Note that the basic requirement of a sequential delivery pattern that consists of multiple rounds is that the delivery of two adjacent batches cannot overlap, i.e., the delivery of a batch cannot start until the delivery of the previous batch completes. So when we schedule load in multiple rounds, gaps may be introduced between the initiation of two adjacent batches, because 1) the dispatching processor $P_d$ cannot initiate a new batch until the first worker processor completes its T-step, and 2) if the initiation of a new batch is too early, it may result in overlapped data delivery between two consecutive batches. During this gap, the dispatching processor $P_d$ has to wait even if it is idle. We denote this idle time of $P_d$ as $Gap_d$, as shown in Figure 5. Similarly, gaps may exist between the delivery of two adjacent batches, because the first worker processor cannot start transmitting packets until its T-step completes. During this gap, the transmitting processor $P_t$ has to be idle. We denote this gap as $Gap_t$, as shown in Figure 5. Hence, in a heterogeneous system, as long as the dispatching processor maintains a gap $Gap_d$ between two adjacent batches, a dynamic scheduling algorithm that ensures both sequential delivery and load balancing is obtained.
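The batch-size definitions above can be illustrated with a short sketch. The display form of equation 5 is not legible in this copy, so the code assumes it amounts to $\alpha_i C L \ge L$ for every worker, which is one reading of "at least one packet fits into the load fraction". The function names `minimal_batch_size` and `batch_size` are hypothetical.

```python
def minimal_batch_size(alpha, max_packet):
    """Return (C, I) with I = C * L.

    ASSUMPTION: C is the smallest positive integer such that every load
    fraction alpha_i * C * L can hold one maximum-length packet L.
    """
    c = 1
    # A small tolerance guards against floating-point round-off in alpha.
    while min(alpha) * c < 1.0 - 1e-9:
        c += 1
    return c, c * max_packet


def batch_size(minimal, granularity):
    """Batch size B = m * I for a positive integer batch granularity m."""
    if granularity < 1:
        raise ValueError("batch granularity m must be a positive integer")
    return granularity * minimal


if __name__ == "__main__":
    c, i_min = minimal_batch_size([0.5, 0.3, 0.2], max_packet=1500)
    print(c, i_min, batch_size(i_min, 2))
```

With equal fractions the sketch gives $C = N$, so the minimal batch holds one maximum-length packet per worker.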
In a homogeneous system, we denote $z_{r,i} = z_{s,i} = z$ and $w_i = w$ for $i = 1, 2, \ldots, N$. According to equations 2 and 3 ...
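The sequential delivery condition of subsection A can be checked numerically. The sketch below is not the paper's equations 2 and 3, which are not legible in this copy. It assumes a simple timing model: the dispatcher serves the workers one after another, the per-byte costs `z_r[i]`, `w[i]`, and `z_s[i]` (hypothetical names for the receive, processing, and send costs, with the time units already folded in) are known, and worker $i+1$ must begin its T-step exactly when worker $i$ ends its own. Under those assumptions the fractions satisfy $\alpha_i (w_i + z_{s,i}) = \alpha_{i+1} (z_{r,i+1} + w_{i+1})$ with $\sum_i \alpha_i = 1$.

```python
def load_fractions(w, z_r, z_s):
    """Load fractions alpha_i under the ASSUMED sequential-delivery model.

    Recursion (an assumption, reconstructed from the described timing):
        alpha_i * (w_i + z_s_i) = alpha_{i+1} * (z_r_{i+1} + w_{i+1})
    normalized so that the fractions sum to 1.
    """
    ratios = [1.0]
    for i in range(len(w) - 1):
        ratios.append(ratios[-1] * (w[i] + z_s[i]) / (z_r[i + 1] + w[i + 1]))
    total = sum(ratios)
    return [r / total for r in ratios]


def t_step_windows(alpha, batch, w, z_r, z_s):
    """Return the (start, end) time of each worker's T-step for one batch."""
    clock = 0.0  # time at which the dispatcher finishes the current D-step
    windows = []
    for a, wi, zr, zs in zip(alpha, w, z_r, z_s):
        load = a * batch
        clock += load * zr            # D-step: workers are served in turn
        start = clock + load * wi     # P-step ends, T-step may begin
        windows.append((start, start + load * zs))
    return windows


if __name__ == "__main__":
    w, z_r, z_s = [6.0, 4.0, 8.0], [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]
    alpha = load_fractions(w, z_r, z_s)
    for window in t_step_windows(alpha, 9000, w, z_r, z_s):
        print(window)
```

For a homogeneous system this assumed model yields equal fractions, and each T-step window starts where the previous one ends.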
D. Scalability Analysis
According to the analysis presented in the above subsection, the best scheduling time in a homogeneous system is determined by the number of processors $N$, the maximal packet length $L$, and the batch granularity $m$. Let there be $N$ homogeneous worker processors, and let $N_{saturate}$ be calculated from the network processor parameters $(w, z)$ as $N_{saturate} = w/z + 2$.
1) If $N < N_{saturate}$, a gap should be introduced between the initiation of two adjacent batches. The length of the gap is defined as $Gap_d$ in equation 13. Figure 6(a) demonstrates a scheduling example. In this example, we assume $w = 6\,\mu s$/byte and $z = 1\,\mu s$/byte; hence, $N_{saturate} = 8$. For ease of discussion, this set of ...
Hence, we reach very interesting conclusions about the system scalability.
1) If there are fewer than $N_{saturate}$ worker processors, as shown in Figure 6(a), ...
2) If there are exactly $N_{saturate}$ worker processors, all processors are fully utilized and there is no break between the delivery of adjacent batches, as shown in Figure 6(b). In this case, the communication rate and the processing rate perfectly match to give the best performance.
3) If there are more than $N_{saturate}$ worker processors, as shown in Figure 6(c), there are idle periods on each worker processor instead of on the dispatching processor. In this case, the communication rate cannot catch up with the processing rate. Hence, some worker processors complete processing before the dispatching processor completes dispatching the whole round of data. Obviously, the computation resources are underutilized because the single dispatching processor becomes the bottleneck.
Therefore, $N_{saturate}$ represents the optimal number of worker processors that one dispatching processor can support. Note that $N_{saturate}$ is determined only by the computation/communication ratio, $w/z$, independent of the batch size. Hence, for a given network processor with configuration parameters $(w, z)$, and assuming that processing always consumes much more time than communication, i.e., $w > z$, there always exists an optimal number of worker processors that one dispatching processor can support. If there are more than $N_{saturate}$ worker processors, one more dispatching processor should be adopted to schedule load among the surplus worker processors to achieve good system performance.
(... Figure 7(b)). In addition, as the batch granularity $m$ increases, the system throughput decreases. This degradation can be explained by the increasing gap length as the batch size increases. Therefore, the minimal schedulable batch size is the optimal batch size in terms of maximizing the system throughput. In our scheduling algorithm, we define the minimal schedulable batch size $I$ as a function of the maximal packet length $L$. In this way, the minimal schedulable batch size is always adapted to the packet size and thus achieves the optimal system throughput.
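The three scalability regimes can be summarized in a few lines of code. This is only a sketch built on the stated relation $N_{saturate} = w/z + 2$ for a homogeneous system; the function names and the regime labels are hypothetical, and no throughput formula is implied.

```python
import math


def n_saturate(w, z):
    """Saturation point N_saturate = w / z + 2 for a homogeneous system."""
    return w / z + 2


def regime(n_workers, w, z):
    """Classify a configuration into one of the three regimes in the text."""
    ns = n_saturate(w, z)
    if math.isclose(n_workers, ns):
        return "saturated"          # all processors fully utilized
    if n_workers < ns:
        return "dispatcher-gap"     # dispatcher waits between batches
    return "dispatcher-bound"       # workers idle, dispatcher is the bottleneck


if __name__ == "__main__":
    # Parameters of the worked example: w = 6 us/byte, z = 1 us/byte.
    for n in (4, 8, 12):
        print(n, regime(n, w=6.0, z=1.0))
```

With the example parameters, the saturation point is 8 workers, so 4 workers leave the dispatcher waiting and 12 workers make the dispatcher the bottleneck.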
IV. PACKETIZED DYNAMIC BATCH COSCHEDULING
In the previous section, a dynamic scheduling algorithm, DBCS, was designed for an NP to ensure both load balancing and perfect in-order delivery. However, we have assumed so far that the load is divisible at the granularity of one byte. In practice, we have to schedule workload at the granularity of one packet, and the packets may be of variable lengths. To schedule variable length packets in an NP, we design a packetized version of the DBCS algorithm (P-DBCS) in this section.
As discussed in section III, according to the DBCS algorithm, given a batch size $B$, the ideal load distribution among multiple processors in any scheduling round is calculated as $(B_1, B_2, \ldots, B_N)$, with $B_i = \alpha_i B$.
To guarantee the desired sequential delivery, the actual load distribution $(\hat{B}_1, \hat{B}_2, \ldots, \hat{B}_N)$ should approximate the ideal load distribution as closely as possible.
The rationale behind the algorithm is as follows: in equation 17, the left-hand side represents the absolute value of $Balance_{which}$ when the packet is scheduled, and the right-hand side represents the absolute value of $Balance_{which}$ when the packet is not scheduled. Therefore, we always make the decision that minimizes the absolute value of $Balance_{which}$. Because $|Balance_{which}| = |\hat{B}_i - B_i|$, the absolute difference between the actually scheduled load and the ideal load, measured as $\sum_{i=1}^{N} |\hat{B}_i - B_i|$, is also minimized. Note that equation 17 can be simplified as $Balance_{which} \ge PacketSize/2$.
Compared with two other well-known packet management policies, Surplus Round Robin (SRR) and Deficit Round Robin (DRR), our scheme improves over both by taking their combination. In each scheduling round, the absolute difference between the actual scheduled load and the ideal load, denoted as $|\hat{B}_i - B_i|$, is bounded by the maximal packet length $L$ in either the DRR or the SRR scheme, while it is bounded by half of the maximal packet length, $L/2$, in our scheme.
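The difference between the three dispatch rules can be seen on a toy quota. The sketch below paraphrases DRR and SRR in their commonly described forms (DRR sends a packet only when the remaining quota covers the whole packet; SRR keeps sending while the quota is positive) and uses the simplified P-DBCS condition $Balance_{which} \ge PacketSize/2$. The function names and policy labels are hypothetical, and the numbers are illustrative rather than taken from the paper.

```python
def should_dispatch(balance, size, policy):
    """Decide whether the head-of-line packet goes to the current processor."""
    if policy == "drr":
        return balance >= size        # never overshoot the quota
    if policy == "srr":
        return balance > 0            # may overshoot by up to one packet
    if policy == "pdbcs":
        return balance >= size / 2    # pick whichever side leaves |balance| smaller
    raise ValueError(f"unknown policy: {policy}")


def residual(quota, sizes, policy):
    """Serve packets in order until the rule says stop; return the final balance."""
    balance = quota
    for size in sizes:
        if not should_dispatch(balance, size, policy):
            break
        balance -= size
    return balance


if __name__ == "__main__":
    for sizes in ([600, 600], [700, 700]):
        print(sizes, {p: residual(1000, sizes, p) for p in ("drr", "srr", "pdbcs")})
```

In both toy cases the P-DBCS rule ends with a residual whose magnitude is no larger than that of either DRR or SRR, which is the behavior the $L/2$ bound describes.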
As discussed above, we have minimized $\sum_{i=1}^{N} |\hat{B}_i - B_i|$ for a single scheduling round. Over multiple scheduling rounds, the deviation of the actual load from the ideal load may accumulate into a larger and larger deviation. To avoid this, the ideal load $B_{which}$ for the next scheduling round is always adjusted as $B_{which} = B_{which} + Balance_{which}$ at the end of each scheduling round. The proposed P-DBCS algorithm is presented in Figure 8.
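One scheduling round of the packetized scheme can be sketched as follows. Figure 8 is not reproduced in this copy, so this is an illustration of the two rules stated in the text (dispatch while $Balance_{which} \ge PacketSize/2$, then carry the leftover balance into the next round), not the authors' listing. The function name `pdbcs_round` and its parameters are hypothetical.

```python
from collections import deque


def pdbcs_round(packets, ideal, carry):
    """Dispatch one batch of packets to the workers, in processor order.

    packets: deque of packet sizes in arrival order (consumed in place)
    ideal:   ideal bytes per processor for this round (B_1 .. B_N)
    carry:   balances left over from the previous round
    Returns (assignment, balance); `balance` is the carry for the next round.
    """
    balance = [b + c for b, c in zip(ideal, carry)]
    assignment = [[] for _ in ideal]
    for i in range(len(ideal)):
        # Keep feeding processor i while scheduling the next packet leaves
        # |balance| no larger than not scheduling it.
        while packets and balance[i] >= packets[0] / 2:
            size = packets.popleft()
            assignment[i].append(size)
            balance[i] -= size
    return assignment, balance


if __name__ == "__main__":
    queue = deque([600, 400, 900, 300, 500, 700])
    carry = [0, 0]
    while queue:
        plan, carry = pdbcs_round(queue, [1000, 1000], carry)
        print(plan, carry)
```

Packets are assigned in arrival order and processors are visited in order, so the dispatch order matches the intended delivery order. While the queue stays backlogged, each balance ends the round within half a maximum packet length of zero.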
V. FAIR SCHEDULING AMONG MULTIPLE FLOWS
In section IV, the P-DBCS algorithm was developed to schedule incoming packets among multiple processors. It not only achieves load balancing in the presence of variable length packets, but also ensures minimal out-of-order departure of processed packets. In the above discussion, we have assumed that the incoming packets are all treated equally. However, in practice, the incoming packets may belong to different network flows that have made different reservations.
In this section, we extend the P-DBCS algorithm to provide fair scheduling among multiple flows. We assume that all flows are continuously backlogged. Let there be $M$ flows, where each flow $i$ has made a reservation $r_i$ and the inequality $\sum_{i=1}^{M} r_i \le 1$ holds. We aim to service the packets of different flows at a rate that is proportional to their reservations. As shown in Figure 9, the incoming packets belong to two different flows, with their reservations defined as $(r_1, r_2) = (0.75, 0.25)$. To guarantee that flow 1 is serviced three times faster than flow 2, we need to schedule the packets as follows: in each scheduling round, the total number of bytes processed for flow 1 is three times that of flow 2. Note that the P-DBCS algorithm has a nice property: the scheduling order of the packets is maintained at the output link. Therefore, as long as the packets of different flows are scheduled for processing proportionally to their reservations, the processed packets depart in order and proportionally to their reservations, as illustrated by Figure 9.
To extend the P-DBCS algorithm to handle multiple flows, the basic idea is as follows: given the batch size $B$ and the flow reservations $(r_1, r_2, \ldots, r_M)$, the number of bytes that are scheduled in each round for flow $i$ ($i = 1, 2, \ldots, M$), denoted as $F_i$, is
To perform a practical scheduling, we must guarantee that at least one packet is dispatched from any one flow. Hence, one more constraint must be set for the minimal schedulable batch size I as follows:
Therefore, $I$ should be redefined as $I = CL$ such that $C$ is the minimal integer that satisfies both constraints. As shown in Figure 10, the mapping between the sorted $B_i$s and the $F_j$s determines an allocation of flows to the processors. This allocation specifies the visiting order of both processors and flows.
When the packet scheduler works, it keeps two pointers: one pointer, $i$, points to the processor that is currently receiving packets; the other pointer, $j$, points to the flow that is currently being serviced. The scheduler also keeps two balance counters: one counter, $Balance_i$, records the remaining number of bytes that should be dispatched to processor $i$; the other counter, $FBalance_j$, records the remaining number of bytes that should be serviced for flow $j$. To achieve both load balancing among processors and fair scheduling among multiple flows, when the packet scheduler looks at the head-of-line packet of flow $j$, it compares $PacketSize$ to both $Balance_i$ and $FBalance_j$. The algorithm to handle multiple flows using P-DBCS is illustrated in Figure 11. The algorithm achieves both load balancing among processors and fair scheduling among flows.
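The two-pointer, two-counter bookkeeping described above can be sketched in code. Figure 11 and the expression for $F_i$ are not legible in this copy, so the sketch rests on labeled assumptions: the per-flow quota is taken as $F_j = r_j B / \sum_k r_k$, a packet is dispatched only when both counters pass the half-packet test of section IV, and processors and flows are simply visited in index order rather than through the sorted mapping of Figure 10. All names are hypothetical.

```python
from collections import deque


def flow_quotas(batch, reservations):
    """ASSUMED per-round byte quota per flow: F_j = r_j * B / sum(r)."""
    total = sum(reservations)
    return [batch * r / total for r in reservations]


def schedule_round(queues, proc_quota, flow_quota):
    """One round over per-flow queues (deques of packet sizes).

    Returns (plan, balance, fbalance); plan lists (processor, flow, size).
    """
    balance = list(proc_quota)    # Balance_i: bytes still owed to processor i
    fbalance = list(flow_quota)   # FBalance_j: bytes still owed to flow j
    plan = []
    i = j = 0                     # current processor and current flow
    while i < len(balance) and j < len(fbalance):
        if not queues[j]:
            j += 1                # nothing queued for this flow
            continue
        size = queues[j][0]
        if fbalance[j] < size / 2:
            j += 1                # flow j has used its share; next flow
        elif balance[i] < size / 2:
            i += 1                # processor i is full; next processor
        else:
            queues[j].popleft()
            balance[i] -= size
            fbalance[j] -= size
            plan.append((i, j, size))
    return plan, balance, fbalance


if __name__ == "__main__":
    queues = [deque([1000] * 5), deque([500] * 5)]
    quotas = flow_quotas(4000, [0.75, 0.25])
    print(schedule_round(queues, [2000, 2000], quotas))
```

In this toy run, flow 1 receives three times as many bytes as flow 2 within the round, and both processors receive their full quota.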
Fair Scheduling of Multiple Flows Using P-DBCS
Parameter Initialization:
(1) Using equations (2) and (3), determine $\alpha_i$, $\forall i$.
(2) Using equation (20), determine the minimal batch size $I$.
In this section, we verify the analytical results given in the earlier sections through rigorous simulations. First, we study the performance of P-DBCS and highlight certain intrinsic advantages of P-DBCS. Next, to explicitly show the performance gain expected from P-DBCS, we compare P-DBCS with several other strategies proposed in the literature. Lastly, we demonstrate the effectiveness of P-DBCS in handling multiple incoming flows with different reservations.
We developed a C simulator to model a heterogeneous NP system with one dispatching processor, multiple worker processors, and one transmitting processor. This simulator is designed to process variable length packets in an NP. The packet size of the incoming flow is drawn randomly from an exponential distribution. For simplicity, we assume a homogeneous backlogged system. Given that the communication bandwidth of an Intel IXP2400 network processor is 100 Mbits/sec, and considering some other foreseeable software overheads, we configure the system using the following parameters in our simulation: $z_{r,i} T_{cm} = 1\,\mu s$/byte, $w_i T_{cp} = 6\,\mu s$/byte, $z_{s,i} T_{cm} = 1\,\mu s$/byte, $\forall i$. We use these values in the theoretical derivation of our scheduling strategy and strictly apply them in the simulation runs. The mean packet size is varied from 1KB to 10KB and the number of worker processors $N$ is varied from 1 to 20 to observe the performance.
A. Performance of the Packetized DBCS Algorithm
To observe the performance of P-DBCS for processing variable length packets in a realistic system, we conduct the simulation by varying the mean packet size of the incoming packets from 1KB to 10KB. As suggested by the scheduling algorithm in Figure 8 ... Clearly, the throughput for the various mean packet sizes closely matches the analytical result, whereas the out-of-order rate is relatively higher than what is expected by the theory. However, this discrepancy is reasonable because the theoretical load distribution cannot be strictly guaranteed in the presence of variable length packets. As a result, the sequential delivery pattern is disturbed and out-of-order delivery occurs. In Figures 12(a) and 12(b), the curves for different mean packet sizes are intertwined, indicating the capability of P-DBCS to adapt to different packet sizes. It is also interesting to note that the saturating point of throughput occurs when there are 10 worker processors in the realistic system, while the theoretically derived saturating point is 8. This discrepancy is caused by the presence of variable packet lengths, which lead to deviation of the actual load distribution from the ideal load distribution. After the saturating point, when more processors are used, the out-of-order rate tends to increase. Therefore, we need to select a proper number of processors that guarantees high throughput and incurs a tolerable out-of-order rate.
... for each scheme as modeled in P-DBCS.
As shown in Figures 13(a) and 13(b), P-DBCS achieves the highest throughput and the smallest out-of-order rate among all the schemes. P-DBCS and SRR-DBCS greatly reduce the out-of-order rate over SRR and RR because DBCS maintains an in-order delivery pattern while balancing the load, whereas SRR and RR do not consider in-order delivery at all. RR offers the worst throughput due to the potential load imbalance incurred by blindly dispatching packets. By adapting the load to the heterogeneity of processors, P-DBCS, SRR-DBCS, and SRR achieve comparable throughput because good load balancing is ensured. The advantage of P-DBCS over SRR-DBCS can be observed in its relatively lower out-of-order rate. In conclusion, P-DBCS outperforms all other schemes in both load balancing and sequential delivery. While taking extra care to minimize the out-of-order rate, P-DBCS still produces the highest throughput.
C. Fair Scheduling of Multiple Flows Using P-DBCS
Let there be six flows, each making a different reservation, defined as (0.3, 0.3, 0.1, 0.1, 0.1, 0.1). To evaluate the effectiveness of the P-DBCS algorithm in providing fair scheduling among multiple flows, we generate the flows at the same arrival rate and observe whether the service rate of each flow is controlled by its reservation. Figure 14(a) shows the total throughput and each flow's throughput as a function of the number of processors. Clearly, the total throughput is fairly shared among the flows. All six flows are serviced at a rate proportional to their reservations, although they arrive at the same rate. Hence, fair scheduling among flows is successfully achieved by our algorithm. Note that, while scheduling among multiple flows, the system still produces a total throughput comparable to that presented in Figure 12(a). Figure 14(b) demonstrates each flow's out-of-order rate as a function of the number of processors. The out-of-order rate per flow increases slightly as the number of processors increases. A flow with a higher reservation tends to have a higher out-of-order rate because a larger portion of its packets are serviced and involved in the out-of-order delivery caused by variable packet lengths.
Overall, the out-of-order rate falls into a small range which is similar to the out-of-order rate presented in Figure 12 (b).
VII. CONCLUSION
In this paper, we proposed an efficient packet scheduling algorithm, Packetized Dynamic Batch CoScheduling (P-DBCS), for a heterogeneous network processor system. P-DBCS is capable of scheduling variable length packets among a group of heterogeneous processors to ensure both load balancing and minimal out-of-order packet delivery. P-DBCS is based on the Dynamic Batch CoScheduling (DBCS) algorithm, which we developed from rigorous theoretical derivation by assuming that the workload is perfectly divisible. Several important theoretical results were presented in the paper for the DBCS algorithm. Extensive sensitivity results were provided through analysis and simulation to show that the proposed algorithms satisfy both the load balancing and in-order requirements in packet processing. To provide fair scheduling among multiple flows, we extended P-DBCS to service packets according to their reservations. Simulation results verified that fair scheduling among multiple flows is successfully achieved. We plan to implement the proposed packet scheduling algorithms in a real network processor, such as the Intel IXP 2400/2800, to verify the performance.
