lm.
A solution entails finding an assignment of tasks to nodes such that each task is assigned to exactly one node and no node is over-utilized. Furthermore, task allocation needs to be carried out in a way that minimizes cost while satisfying the system's performance specifications. For embedded systems, many factors can influence task allocation decisions. In fact, task allocation decisions can be influenced by any design attribute that affects cost or performance. Accordingly, task allocation needs to be formulated as a multi-dimensional problem, and the tasks and nodes need to be modeled as multi-dimensional objects.
The task allocation problem is, in general, NP-complete, and heuristic algorithms are required. In this paper, the task allocation problem for bus-based multicomputers is shown to be isomorphic to a generalization of the vector packing problem. Solution techniques are developed by considering heuristic solutions to the packing problem. These techniques are presented, described and verified experimentally on a mixture of real and synthetic test cases. The rest of this paper is organized as follows. Section 2 introduces a model of the task allocation problem. Section 3 re-states task allocation as a packing problem which is shown to be a generalization of vector packing.
Section 4 reviews related work in the areas of packing and task allocation and describes the relevance of this work. Section 5 introduces heuristic algorithms for solving the packing problem. Section 6 describes the experimentation strategy and presents results obtained by applying the heuristic algorithms to both real and synthetic test cases. Section 7 concludes and summarizes the paper.
Modeling Task Allocation
This section presents an embedded system model. It serves as an abstraction for evaluating task allocation alternatives. For more detailed information, refer to [4].
Software Model
A synchronous data flow graph (SDFG) is used to model the software [16]. The nodes represent tasks and the arcs signify the communication between them. The amount of data produced and consumed on each task invocation is known a priori. Thus, the resource requirements of the application are statically predictable, allowing static task allocation.
A task's demand for hardware resources is application-specific and can occur across many independent dimensions, such as CPU throughput, memory or I/O channels. An application-specific demand vector is used to model this. This vector is of arbitrary length, and each element corresponds to a resource that is:
1. Available on one or more of the processing nodes.
2. A constraint on the solution.
Once the elements of the demand vector are defined, each task in the application is modeled by its own vector instance. Thus, the demand that a task imposes on a processing node is multi-dimensional, and is defined with respect to the application-specific demand vector.
Target Architecture
The target architecture consists of an arbitrary number of heterogeneous nodes that communicate via message passing over a bus. The resources available on a node are specified with respect to a capacity vector, which is analogous to the demand vector used to model the tasks.
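The demand and capacity vectors described above can be illustrated with a minimal sketch. It assumes that demand is additive in every dimension, which is a simplification (the feasibility constraints used in this paper may be arbitrary), and the function name `fits` and the three example dimensions are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def fits(demands: Sequence[Vector], capacity: Vector) -> bool:
    """Return True if the summed demand stays within capacity in every dimension.

    Assumption: demand is additive per dimension (illustrative only).
    """
    if not demands:
        return True
    # zip(*demands) groups the demand values dimension by dimension.
    totals = [sum(column) for column in zip(*demands)]
    return all(total <= cap for total, cap in zip(totals, capacity))


# Hypothetical dimensions: (CPU share, RAM in KB, I/O channels).
node_capacity = (1.0, 64, 4)
task_set = [(0.3, 16, 1), (0.4, 32, 2)]
print(fits(task_set, node_capacity))
```

A node whose task set fails this check in any single dimension is over-utilized, regardless of how much slack remains in the other dimensions.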
To model the bus, an analytic function is used that predicts message transmission time as a function of message size. An example of such a function for the CAN bus is shown in Figure 1. This function is used in conjunction with a scheduling model (e.g. [28]) to determine real time schedulability of the bus.
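As an illustration of what such an analytic function can look like, the sketch below uses a worst-case bit-stuffing bound that is commonly applied in CAN schedulability analysis for frames with 11-bit identifiers. It is not the function from Figure 1, which is not reproduced here. The function names, the splitting of large messages into 8-byte frames, and the default bit rate are assumptions.

```python
def can_frame_bits(payload_bytes: int) -> int:
    """Worst-case length in bits of one CAN frame with an 11-bit identifier."""
    if not 0 <= payload_bytes <= 8:
        raise ValueError("a classical CAN frame carries 0 to 8 data bytes")
    stuffable_overhead = 34  # overhead bits that are subject to bit stuffing
    data_bits = 8 * payload_bytes
    # Worst case: one stuff bit after every four bits following the first.
    stuff_bits = (stuffable_overhead + data_bits - 1) // 4
    # 13 further overhead bits are not subject to stuffing.
    return stuffable_overhead + data_bits + 13 + stuff_bits


def transmission_time_us(message_bytes: int, bit_rate_bps: int = 1_000_000) -> float:
    """Transmission time in microseconds, assuming a message is split into 8-byte frames."""
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    full_frames, remainder = divmod(message_bytes, 8)
    bits = full_frames * can_frame_bits(8)
    if remainder or message_bytes == 0:
        bits += can_frame_bits(remainder)
    return bits * 1e6 / bit_rate_bps


print(transmission_time_us(8))
```

Because the frame overhead is paid per frame, the resulting function is a staircase in the message size rather than a straight line.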
Communication
If communicating tasks are assigned to different nodes, then message traffic will occur over the bus. However, since the bandwidth of the bus is finite, it can only accommodate a limited number of messages. Hence, a solution is feasible only if the cumulative message demand is less than the schedulable bandwidth of the bus.
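The bus condition above can be sketched as follows. The sketch assumes that each arc carries a known, additive bus load, expressed in the same unit as the schedulable bandwidth. The names `bus_demand` and `bus_feasible` and the example values are hypothetical.

```python
from typing import Dict, Tuple

Arc = Tuple[str, str]


def bus_demand(arcs: Dict[Arc, int], assignment: Dict[str, int]) -> int:
    """Sum the load of every arc whose endpoints are assigned to different nodes."""
    return sum(
        load
        for (src, dst), load in arcs.items()
        if assignment[src] != assignment[dst]
    )


def bus_feasible(arcs: Dict[Arc, int], assignment: Dict[str, int],
                 schedulable_bandwidth: int) -> bool:
    """Feasible only if cumulative message demand is below the schedulable bandwidth."""
    return bus_demand(arcs, assignment) < schedulable_bandwidth


# Hypothetical loads in kbit/s; tasks a and b share node 0, task c is on node 1.
arcs = {("a", "b"): 200, ("b", "c"): 300, ("a", "c"): 100}
assignment = {"a": 0, "b": 0, "c": 1}
print(bus_demand(arcs, assignment), bus_feasible(arcs, assignment, 500))
```

Arcs between tasks on the same node contribute nothing, which is why grouping communicating tasks together lowers bus utilization.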
Furthermore, the arcs within the SDFG define a precedence ordering among tasks. However, it is assumed that data traveling along the arcs is buffered, which decouples the tasks and allows them to execute asynchronously at their own period, without violating precedence relationships. Thus, each processing node is presented with a periodic task set that is consistent with the assumptions required by rate monotonic scheduling [17, 18].
Feasibility Constraints
An assignment of tasks to nodes that satisfies all task requirements without over-utilizing any of the hardware components is said to be feasible. A set of constraints is needed to define the feasibility condition. These constraints are application specific, but they can be sorted into two broad categories: processor and bus constraints.
Processor constraints define when a node can support its task set. Likewise, bus constraints define when the bus can support a message set. These constraints are typically based on scheduling models for the processing nodes (e.g. [17, 18]).
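One well-known example of a processor constraint is the rate monotonic utilization bound, under which n independent periodic tasks are schedulable if their total utilization does not exceed n(2^(1/n) - 1). The sketch below applies that sufficient (not necessary) test; whether the experiments in this paper use this exact test is an assumption, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def rm_utilization_bound(n: int) -> float:
    """Rate monotonic utilization bound for n periodic tasks."""
    return n * (2 ** (1 / n) - 1)


def node_schedulable(tasks: Sequence[Tuple[float, float]]) -> bool:
    """Sufficient test: tasks are (execution time, period) pairs in the same time unit."""
    if not tasks:
        return True
    utilization = sum(exec_time / period for exec_time, period in tasks)
    return utilization <= rm_utilization_bound(len(tasks))


print(node_schedulable([(1, 4), (1, 5)]))
```

A task set that fails this test may still be schedulable under an exact analysis, so the bound trades some capacity for a very cheap check.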
Task Allocation Objectives
The goal of task allocation is to obtain an assignment of tasks to nodes that is feasible, and that:
1. Minimizes the number of processing nodes.
2. Minimizes the utilization level of the broadcast bus.
These are competing goals and they must be balanced effectively.
Task Allocation as a Packing Problem
Once a node is specified, only a subset of tasks can be assigned to it without violating feasibility constraints. This leads to the packing-based problem representation which is shown in Figure 2. Under this representation, each node is modeled by a vector bin. The bins are multi-dimensional, and the dimensions are defined by the capacity vector. For each bin, the capacity in each dimension is chosen to match the specifications of the processing node being modeled.
Similarly, each task is modeled by a vector object. The objects have demands for the hardware resources which are defined with respect to the demand vector. For each object, the demand for each resource is chosen to match the resource requirements of the task being modeled.
Lastly, the bus is modeled by a scalar bin. It is characterized by a scalar capacity, which is equal to the bus's schedulable bandwidth.
With this problem representation, task allocation becomes a matter of packing the vector objects into the vector bins such that none of the bins (including the bus's scalar bin) overflow. Note that demand for the [...] for the new packing problem, and [...] relationship to vector packing. The name vector packing with constrained object grouping (VPCOG) will be used to refer to the new packing problem.
Related Work
[...] Conversely, with vector [...] are independent. The objects [...] with the restriction that the [...] sizes cannot exceed the bin's [...]. This corresponds to the hypothetical case of a bus with unschedulable bandwidth.
[...] 19, 23, 26], mathematical programming [6, 20] and heuristic [7, 9, 11, 25]. The allocation techniques described in this paper are based on heuristic solutions to the VPCOG packing problem. They differ from existing approaches in two main regards. First, tasks and nodes are modeled as multi-dimensional objects, and task allocation is treated as a multi-dimensional problem. Second, these algorithms base task assignment decisions on a set of user-supplied feasibility constraints (footnote 2). These constraints are arbitrary, and they allow the designer to incorporate precise scheduling models for the hardware resources (e.g. CPU and bus). Thus, the solutions obtained by the allocation algorithms are provably correct within the context of the system's timing requirements.
Heuristic Packing Algorithms
This section describes heuristic solutions to the VPCOG packing problem. A set of candidate algorithms is presented, inspired by the classic first- and best-fit solutions to the bin packing problem. All of the algorithms are one-pass and greedy. They progress by choosing objects, one by one, and assigning them to bins. This continues until all objects are assigned and the packing is complete, or else a set of objects remains that will not fit into any of the bins. In this case, the algorithm fails.
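The one-pass greedy structure described above can be sketched with a plain first-fit rule. This sketch covers only the vector bins and assumes additive demand; it ignores the scalar bin that models the bus, so it is a simplified illustration rather than one of the 256 algorithms studied here. The name `first_fit` is hypothetical.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def first_fit(objects: List[Vector], capacities: List[Vector]) -> Optional[List[int]]:
    """Assign each object to the first bin it fits into; return None on failure."""
    loads = [[0.0] * len(cap) for cap in capacities]
    assignment: List[int] = []
    for obj in objects:
        for index, cap in enumerate(capacities):
            # The object fits if no dimension of the bin would overflow.
            if all(load + d <= c for load, d, c in zip(loads[index], obj, cap)):
                loads[index] = [load + d for load, d in zip(loads[index], obj)]
                assignment.append(index)
                break
        else:
            # No bin can hold this object, so the algorithm fails.
            return None
    return assignment


bins = [(10, 10), (10, 10)]
print(first_fit([(6, 2), (5, 5), (4, 3)], bins))
```

Each object is examined exactly once and is never moved after placement, which is what makes the approach one-pass and greedy.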
A total of 256 algorithms are considered. They are specified with respect to a five-character acronym which is defined in Figure 4. The first character of the acronym specifies the bin selection method. Likewise, the second character defines how the utilization level of a vector bin is measured. If two or more bins are found to be equally good according to the bin selection policy, then a tie-breaking strategy is needed. The third character specifies the tie-breaking criterion. The fourth character specifies the order in which the vector objects are packed. The fifth and final character of the acronym specifies the method for determining the size of a vector object.
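To make the last two acronym positions concrete, the sketch below reduces a vector object to a scalar size using either its maximum or its average element, the two measures compared in the experiments, and then sorts objects in decreasing order of size. Normalizing each element by the bin capacity is an assumption made here so that dimensions with different units are comparable, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def size_max(demand: Vector, capacity: Vector) -> float:
    """Size of an object as its largest capacity-normalized element."""
    return max(d / c for d, c in zip(demand, capacity))


def size_avg(demand: Vector, capacity: Vector) -> float:
    """Size of an object as the mean of its capacity-normalized elements."""
    return sum(d / c for d, c in zip(demand, capacity)) / len(demand)


def decreasing_order(objects: List[Vector], capacity: Vector,
                     size: Callable[[Vector, Vector], float] = size_max) -> List[Vector]:
    """Sort objects once, largest first, before packing begins."""
    return sorted(objects, key=lambda obj: size(obj, capacity), reverse=True)


capacity = (10, 100)
objects = [(5, 50), (6, 10)]
print(decreasing_order(objects, capacity, size_max))
print(decreasing_order(objects, capacity, size_avg))
```

The two measures can rank the same objects differently: an object that is dominant in one dimension ranks high under the maximum measure but may rank low under the average.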
Results
A set of experiments was performed to gauge the effectiveness of the heuristic packing algorithms. The instance of the VPCOG problem shown in Figure 5 was used during experimentation. The problem requires tasks to be packed into unit processing elements which communicate over a 1 Mbps CAN bus. Intra-processor communication is free, while inter-processor communication consumes bus bandwidth (based on the function shown in Figure 1). The capacity and demand vectors contain six elements, which correspond to CPU throughput, ROM, RAM, digital I/O channels, analog I/O channels, and pulse-width modulated timer channels, respectively. The unit processing element's resource capacities were chosen arbitrarily. Tasks have varying levels of demand for the resources. Demand is computed statically, and is known prior to allocation. To determine feasibility, resource capacities are compared against the cumulative demand imposed by the task (and message) sets.
Footnote 2. Within the packing paradigm, the feasibility constraints are invoked to determine whether a vector object will "fit" inside of a bin, without causing any of the bins to "overflow".
Sixteen real and synthetic SDFGs were used as test cases. The real test cases were adapted from a commercial automotive electronic application described in [3]. The synthetic test cases are a mixture of random and hand generated SDFGs which, taken as a whole, span a large portion of the design space.
When a heuristic packing algorithm is applied to a test case, three metrics are used to gauge its effectiveness:
1. Number of vector bins (i.e. unit processors) used.
2. Scalar bin (i.e. bus bandwidth) utilization level.
3. Run time.
Experiments were carried out in four stages, and a divide-and-conquer method was used to compare the 256 possible algorithms. The results are summarized in the Appendix, and the subsections that follow describe the stages.
[Figure 4: definition of the five-character algorithm acronym. Each acronym specifies an algorithm.]
Stage One
The goal of stage one is to determine the effect of node ordering. This is done by comparing first-fit algorithms with different ordering schemes. Specifically, six algorithms are considered.
The results show that ordering based on object size tends to decrease the number of vector bins, but results in poor utilization of the scalar bin. In fact, this prevents a feasible solution from being found for three of the test cases. Conversely, ordering on arc size tends to decrease the scalar bin utilization level, but often requires more vector bins than the baseline. Ordering on both object and arc sizes works best, outperforming all other schemes. This method exploits the benefits of object ordering and arc ordering (i.e. fewer vector bins and lower scalar bin utilization) without suffering from their weaknesses (i.e. not finding a solution). The effect of basing node size on the size of the maximum or average element was marginal and inconclusive. Furthermore, the run time variation across algorithms was not significant, since ordering the objects only requires them to be sorted once (O(n lg n)) before packing begins.
To summarize, ordering based on object and arc sizes is the most effective technique, and equating object size to the size of the maximum vector element is preferred for simplicity. Hence, only (xxxBX) algorithms are considered in subsequent stages.
Stage Two
The results reveal several things. First, selecting the least utilized vector bin (i.e. NxRBX) performs worse than the first-fit baseline algorithm (F-BX). This is intuitive: selecting the least utilized bin is a poor packing heuristic since it never encourages completely filling a started bin. This leads to excessive resource fragmentation and more bins, on average, than first-fit. Likewise, selecting the most utilized vector bin (i.e. XXRBX) requires no fewer vector bins, on average, than the first-fit baseline algorithm. This underscores the fact that vector packing has no analog to bin packing's best-fit-decreasing algorithm. The reason for this is also intuitive: [...] being placed into a vector bin. In fact, the [...] "best fit" for any particular dimension [...] be a poor choice across the remaining dimensions.
[...] the vector bin that minimizes utilization of [...] (i.e. Nb-RBX) out-performs the other [...] first-fit baseline algorithm, [...] technique requires no more [...], but it does yield signifi[...] is a good choice for task allocation [...] the object ordering scheme to a natural and dynamic [...]. Because of its superior solution qualities, (Nb-RBX) is preferred and only (NbxxBX) algorithms are considered in the experimentation stages that follow.
Footnote. As defined in Figure 4, (F-BX) = first-fit bin selection, decreasing order based on object and arc sizes, and object size based on the maximum vector element.
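A bin selection rule that favors low bus traffic can be sketched as follows. The sketch assumes that, among the bins an object fits into, the rule picks the one that adds the least inter-node communication with already placed neighbors, and that ties go to the earliest candidate. This is an illustrative reading and not a confirmed definition of the (Nb-RBX) policy; the names `added_bus_load` and `select_bin` are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Arc = Tuple[str, str]


def added_bus_load(task: str, bin_index: int,
                   assignment: Dict[str, int], arcs: Dict[Arc, int]) -> int:
    """Bus traffic created by placing task in bin_index, given earlier placements."""
    load = 0
    for (src, dst), traffic in arcs.items():
        if task not in (src, dst):
            continue
        other = dst if src == task else src
        # Only neighbors that are already placed on a different node use the bus.
        if other in assignment and assignment[other] != bin_index:
            load += traffic
    return load


def select_bin(task: str, candidate_bins: Iterable[int],
               assignment: Dict[str, int], arcs: Dict[Arc, int]) -> int:
    """Pick the candidate bin with the smallest added bus load (first one on ties)."""
    return min(candidate_bins,
               key=lambda b: added_bus_load(task, b, assignment, arcs))


arcs = {("a", "b"): 300, ("a", "c"): 100}
placed = {"b": 0, "c": 1}
print(select_bin("a", [0, 1, 2], placed, arcs))
```

In this example task a joins the node that hosts b, its heaviest communication partner, so only the lighter arc to c crosses the bus.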
Stage Three
The goal of stage three is to determine the effect of employing a tie breaking strategy. To accomplish this, the following two algorithms are compared:
Summary
This paper described a generalization of the vector packing problem, VPCOG, which was shown to be isomorphic to task allocation for bus-based multicomputers. Since task allocation and the corresponding packing problem are NP-complete, heuristic solution techniques are required. A total of 256 heuristic packing algorithms were considered, and their performance was compared using a divide-and-conquer experimentation method on sixteen real and synthetic test cases with respect to three metrics: the number of vector bins (i.e. processing nodes), the utilization level of the scalar bin (i.e. bus bandwidth) and run time. Through experimentation, the (Nb-RBX) algorithm was found to be the most effective heuristic.
The (Nb-RBX) packing algorithm represents an effective and efficient way of performing task allocation for bus-based multicomputers. It is capable of minimizing the number of processing nodes needed for a design while simultaneously minimizing the utilization level of the broadcast bus. Furthermore, it supports a multi-dimensional representation of the task allocation problem, and it allows scheduling models to be incorporated so that timing correctness is achieved as a byproduct of task allocation.
"-" indicates that the algorithm failed to find a solution.
