Abstract-Task preemption is a critical enabling mechanism in multitask very large scale integration (VLSI) systems. On preemption, data in the register files must be preserved for the task to be resumed. This entails extra memory to preserve the context and additional clock cycles to save and restore the context. In this paper, techniques and algorithms to incorporate micropreemption constraints during multitask VLSI system synthesis are presented. Specifically, algorithms to insert and refine preemption points in scheduled task graphs subject to preemption latency constraints, techniques to minimize the context switch overhead by considering the dedicated registers required to save the state of a task on preemption and the shared registers required to save the remaining values in the tasks, and a controller-based scheme to preclude the preemption-related performance degradation by: 1) partitioning the states of a task into critical sections; 2) executing the critical sections atomically; and 3) preserving atomicity by rolling forward to the end of the critical sections on preemption have been developed. The effectiveness of all approaches, algorithms, and software implementations is demonstrated on real examples. Validation of all the results is complete in the sense that functional simulation is conducted to complete layout implementation.
trol is necessary. For example, in asynchronous transfer mode (ATM), network mechanisms for delay and jitter control are critical. Once again, task preemption in addition to data buffering is a key mechanism to bound jitter in these applications.
Along a different dimension, multitask very large scale integration (VLSI) systems are becoming commonplace. For example, Fujitsu offers the 86 K line of application-specific programmable processor (ASPP) products [4] , while Motorola offers numerous digital signal processing (DSP) ASPPs [5] . Furthermore, LSI Logic has introduced a comprehensive line of ASPP products [9] . An ASPP can be dynamically configured to one of the implemented tasks. Although the reconfiguration time of an ASPP may be very low (because reconfiguration entails transitioning from the final state of the current task to the start state of another task), it may not be acceptable for an urgent task in a real-time system to wait until the current task is completed. Task preemption is required to temporarily stop the current task and execute the urgent task.
On receiving a task preemption request, the state of the active task must be saved, and the context of the new task must be loaded and executed. On completion of the new task, the state of the preempted task must be restored and the interrupted task resumed to completion. Important factors that should be considered while implementing task preemption include the following: 1) Preemption latency, defined as the maximum time from the instant a preemption request is received to the instant the task state is saved. Clearly, any acceptable preemption handling mechanism should yield a latency that is appropriate for the preemption request. 2) Context switch cost, i.e., the additional hardware incurred by the installation of a preemption handling scheme. A saved state should contain only enough information (and no more) so that the preempted task can be resumed at the precise point where it was interrupted. The task state should consist of the contents of the general-purpose registers, the condition registers, and the relevant portion of background memory. 3) Performance degradation associated with the implemented preemption scheme. There are two main sources of performance degradation. On a preemption request, some task states that have already been executed may be aborted. Retracing these aborted states adds to the finish time of the aborted task.
In addition, any scheme that saves the context of a preempted task in background memory may stall execution units. This is because, typically, there are only a small number of ports to save (restore) data to (from) the background memory.

In this paper, a systematic methodology for incorporating preemption constraints in multitask VLSI systems is presented. Specifically, it is shown how context switch cost and performance degradation can be minimized while satisfying task-specific throughput and preemption latency constraints.
A. Micropreemption: A Motivating Example
Consider a system implementing two tasks A and B. Task A takes four clock cycles and task B takes six clock cycles for one iteration. Let A1, A2, A3, and A4 denote parts of task A that are executed in the first, second, third, and fourth clock cycles, respectively. Similarly, let B1, B2, B3, B4, B5, and B6 denote parts of task B that are executed in the first, second, third, fourth, fifth, and sixth clock cycles, respectively. The assumptions in the micropreemption controller design are summarized in Fig. 1 .
A simulation snapshot of micropreemption in the two-task VLSI system is shown in Fig. 2.¹ Towards minimizing the controller and context switch overhead, it is mandated that task A can be preempted in STATEs A1 and A3 alone. These are called task preemption points [24]. Initially, the system is in STATE A4. When the system goes to STATE A1, execution of a new task is requested by setting the SELECTED TASK to B and the data inputs to appropriate data values (these have not been shown here for simplicity). However, it is not a valid preemption request (since PREEMPT MASK is high). Even when a valid preemption request arrives in STATE A2 (i.e., all conditions in item 4 in Fig. 1 are satisfied), task A is not aborted immediately. Rather, the computation rolls forward to the end of STATE A3 before the preemption request is serviced. Notice that it has taken two clock cycles from the time a valid preemption request arrived (beginning of STATE A2) to the time the new task (task B) became active (end of STATE A3). From the point of view of task B, this is its preemption latency. From the point of view of the multitask VLSI system as a whole (and task A in particular), rolling forward of the computation has eliminated the performance degradation due to an immediate abort. In a nutshell, preemption points A1 and A3 have partitioned the execution of task A into two critical sections {A2, A3} and {A4, A1}. Similarly, task preemption points B2, B4, and B6 for task B partition it into three critical sections {B1, B2}, {B3, B4}, and {B5, B6}.

¹ The controller has been implemented using a 1 µm SCMOS standard cell library and simulated using IRSIM.
The controller is shown in Fig. 3 . It is a collection of finite state machines (one for each implemented task) and has a state register file that holds the identification of the currently active tasks. At every clock cycle, a different task can be initiated or a preempted task resumed by the task select signal. The controller signals are pipelined so that the controller delay does not affect the critical path.
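The roll-forward behavior of the motivating example can be sketched in a few lines of Python. This is a simplified illustration rather than the controller of Fig. 3: it assumes that the states of a task form a single cycle numbered from 1, that every request is valid (the PREEMPT MASK signal is not modeled), and that a request arriving at the beginning of a state is serviced at the end of the next preemption-point state. The function names are hypothetical.

```python
def critical_sections(n_states, preemption_points):
    """Split the cyclic state sequence 1..n_states into critical sections.

    Each section ends at a preemption point and starts right after the
    previous one, wrapping around the end of the iteration.
    """
    if not preemption_points:
        raise ValueError("at least one preemption point is required")
    points = sorted(preemption_points)
    sections = []
    for i, end in enumerate(points):
        state = points[i - 1] % n_states + 1  # state after the previous point
        section = [state]
        while state != end:
            state = state % n_states + 1
            section.append(state)
        sections.append(section)
    return sections


def roll_forward_latency(n_states, preemption_points, request_state):
    """Clock cycles from the start of request_state to the end of the
    next preemption-point state (the request is serviced there)."""
    if not preemption_points:
        raise ValueError("at least one preemption point is required")
    cycles, state = 1, request_state
    while state not in preemption_points:
        state = state % n_states + 1  # roll forward by one state
        cycles += 1
    return cycles


# Task A: four states, preemption points A1 and A3.
print(critical_sections(4, {1, 3}))                      # [[4, 1], [2, 3]]
print(roll_forward_latency(4, {1, 3}, request_state=2))  # 2
```

Under these assumptions the sketch reproduces the critical sections {A4, A1} and {A2, A3} and the two-cycle latency described for a request arriving in STATE A2.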
B. Research Contributions
In this paper, techniques and algorithms to incorporate micropreemption constraints during multitask VLSI system synthesis are presented. The main contribution of the proposed research includes the following.
1) Algorithms to insert and refine preemption points in scheduled task graphs subject to preemption latency constraints.
2) Techniques to minimize the context switch overhead by considering the dedicated registers required to save the state of a task on preemption and the shared registers required to save the remaining values in the tasks.
3) A controller-based scheme to preclude the preemption-related performance degradation by a) partitioning the states of a task into critical sections; b) executing the critical sections atomically; and c) preserving atomicity by rolling forward to the end of the critical sections on preemption.

The rest of the paper is organized in the following way. First, the related work along several dimensions is briefly surveyed. Next, computational and hardware models are discussed. Sections II and III introduce the proposed approach, formulate the micropreemption synthesis problems, and describe the proposed algorithms for micropreemption synthesis. Section IV presents the experimental results. Section V concludes by summarizing the results.
C. Related Research
The most relevant related work can be traced along the four lines of research and development: multitask VLSI systems, multithreading, interrupt/preemption processing schemes, and behavioral/system level synthesis.
Reconfigurable computing platforms have attracted considerable attention recently. A fast-growing billion-dollar field programmable gate array (FPGA) industry is supported by a number of commercial and research tools [19]. A number of special-purpose reconfigurable computers have been built. Early work in this direction includes the systems realized at the University of Texas, Austin (TRAC) [7]. The Splash system enables reconfigurability to more than 100 different configurations that are well suited for several computational tasks in molecular biology [10]. Several generations of data path reconfigurable video processors with accompanying compilation support have been developed at the University of California, Berkeley [26]. In FPGA-based reconfigurable processors, a new personality vector is loaded to change the functionality. Since such a reconfiguration process entails significant performance overheads, on-board reconfiguration is limited to applications with infrequent context switching. Dynamically programmable gate arrays (DPGAs) [28] are capable of storing multiple personality vectors. A previously loaded functionality need not be overridden by preemptive loading of another functionality as long as the personality memory can accommodate both of them. However, the area overhead due to multiple personality vectors is not negligible. The memory requirements have increased from 10% in conventional FPGAs to 33% in DPGAs. ASPPs [35], [36], [39] have been introduced as an excellent candidate for multifunctional applications with frequent context switching. Though their functionalities must be determined in the design phase, a single ASPP implementing multiple functions obtains significant area savings when compared with the dedicated application-specific integrated circuit (ASIC) implementations of the functions. In addition, its capability for on-the-fly reconfiguration distinguishes it from other techniques.
Multithreading has been proposed and studied to improve low processor utilization due to large communication latencies and synchronization delays in large-scale multiprocessors. A processor that rapidly switches to an alternate thread of computation during a remote memory request or synchronization can achieve high utilization. There are two basic forms of multithreading. Fine-grained multithreading maintains multiple threads in the processor and interleaves the different threads on a cycle-by-cycle basis. This eliminates most pipeline dependencies because the instructions in a single thread are separated, but suffers from poor single-thread performance [14] , [23] . Coarse-grained multithreading, on the other hand, interleaves the instructions of different threads only on some long-latency events such as cache misses and failed synchronization attempts [13] , [33] . This approach, however, does not eliminate the instruction dependencies. Simultaneous multithreading is proposed to dispatch instructions from several independent threads to multiple functional units of a superscalar processor, each cycle, thus eliminating the notion of switching threads [27] , [34] . While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design due to multiple task frames, each consisting of a process status register, a program counter chain, and a register set [27] . An alternative implementation [34] uses register renaming to remove dependences between instructions from different threads, and a single large register file to support logical registers of all threads plus additional registers for register renaming. This architecture also entails the requirements of a deeper pipeline to avoid clock rate degradation, and in turn, better branch prediction. 
The first commercial multithreaded processor is a two-threaded coarse-grained processor with three-cycle thread switch and no single-thread performance degradation, targeted for transaction processing applications [33], [37]. Also, the Intel Pentium 4 Hyper-Threading processor was introduced as the first commercial general-purpose simultaneous multithreading processor [40], [41]. While the multithreading techniques are targeted for general-purpose processors in which the hardware utilization is relatively low due to the uncertainty of application functionalities, the proposed micropreemption techniques are targeted for application-specific multifunctional processors in which the hardware is highly utilized, but further optimization is still possible.
The IBM 360/91 supports precise and imprecise interrupt handling [1] . Upon the receipt of request for a precise interrupt, instruction decoding is temporarily halted and all issued instructions are allowed to complete. On the other hand, if an imprecise interrupt is requested, the state of the system is lost. Similarly, in CRAY-1 [2] , the instruction issue is temporarily terminated, all vector and memory bank references are allowed to complete, and the interrupt handler is loaded and executed. Hwu and Patt [3] proposed a check-pointing approach to handling interrupts. The checkpoints (which incur some penalty in processor performance) are used to divide the sequential instruction stream into smaller units to reduce the cost of resumption. Sohi [12] integrated the functions of reservation stations and reorder buffers into the register update unit to realize precise interrupts. In addition, Smith and Pleszkun [8] presented architectural solutions such as saving the intermediate state of vector instructions and saving a sequence of instructions that must be executed before saving the program counter. Mosberger et al. [32] presented a software-only solution to the synchronization problem in uniprocessors. Their idea was to execute atomic sequences without any hardware protection, and to roll the sequence forward to the end, thereby preserving atomicity. Simon and Patel determined preemption points having the minimum number of live cache lines for a given interval [24] . Furthermore, preemption points are inserted in real-time operating systems (RTOSs) so that real-time jobs may experience shorter blockings [38] .
Behavioral synthesis has been an active research area for more than two decades [11], [21], and numerous outstanding systems have been built targeting both data path oriented and control oriented applications [11]. Synthesis systems that optimize power [17], testability [15], and fault-tolerance [31], [39] have been developed. System level synthesis has become an active research topic [22], [30]. Examples include hardware/software cosynthesis techniques targeting microcontroller design [20] and hardware/software interface generation techniques [25], [29].
No previous work that provides a systematic methodology for incorporating preemption constraints in application-specific multitask VLSI systems is known. This is the first work to minimize context switch cost and performance degradation while satisfying task-specific throughput and preemption latency.
D. Computational and Hardware Models
The proposed computational model for a single task is homogeneous synchronous data flow [6], a special case of the data flow process network family of computational semantics. The model assumes a periodic computation on an incoming semi-infinite stream of data along the time axis. Within this model, a task is represented as a hierarchical control data flow graph (CDFG) G(N, E, T), with the nodes N representing the flow graph operations, and the edges E and T representing, respectively, the data and timing dependences between the operations. Note that the control dependences are subsumed by the timing control dependences. The homogeneous synchronous data flow model provides semantics for numerous behavioral synthesis systems targeting numerically intensive applications [16]. Many of the most popular DSP, video, continuous media, communication, control, and graphics applications follow the selected computational model.
In modern designs, a variety of register file models have been used [16] . From among them, the dedicated register file hardware model for modeling at the structural register-transfer (RT) level was selected. This model clusters all registers in register files, and each file is then connected only to the inputs of the corresponding execution units. An important benefit of the chosen hardware model is that it reduces the interconnect at the expense of additional registers. This tradeoff is particularly important for modern and future submicron technologies.
It is important to note that although homogeneous synchronous data flow semantics and dedicated register file model are followed, a majority of the results, optimization techniques, and software implementations are directly applicable to almost all applications modeled by an arbitrary synchronous dataflow process network computational model and different hardware models.
II. MICROPREEMPTION

A. Issues and the Proposed Approach
Incorporating micropreemption constraints in a multitask VLSI system raises the following issues.
1) Where should preemption points be inserted?
2) What should the strategy for saving the context (on preemption) be?
3) How should performance degradation associated with preemption be minimized?

On preemption, the data in the register files must be preserved somewhere for a task to resume. In general-purpose microprocessors, these values are transferred to background memory before an interrupt is serviced. This technique is not acceptable in multitask VLSI systems due to the attendant performance penalty. Alternatively, a register windowing technique is used in the SPARC architecture [18]. In this scheme, data are saved in registers within the processor even when a new computation environment is required. However, it entails nonnegligible area overheads for duplicated registers.
In contrast, an intuitively simple technique is proposed by classifying registers into two groups. 1) Dedicated registers² ($R_d^t$) are used for storing the values of edges of a task that straddle preemption points. These edges that straddle a preemption point are called the red edges, and they represent intermediate values essential to resume the task if preempted. 2) Shared registers ($R_s^t$) are shared by the values associated with the remaining edges (of all tasks) in the system. These edges that do not straddle a preemption point are called the green edges. The dedicated registers of a task can also be used to store the values associated with the green edges in the task. However, the shared registers cannot be assigned to red edges. Since dedicated registers are local to a task, the context switch overhead associated with them can be obtained as the sum of the dedicated registers over all tasks. On the other hand, the context switch overhead due to shared registers is the maximum value across all tasks. Putting it all together, the context switch cost of a multitask VLSI system with task set T can be obtained as

$$\mathrm{Cost}(T) = \sum_{t \in T} R_d^t + \max_{t \in T} R_s^t. \qquad (1)$$
Performance degradation resulting from aborting a task is eliminated by 1) partitioning the task states into critical sections; 2) executing the critical sections atomically; and 3) preserving the atomicity of a critical section by rolling forward to the end of the critical section on preemption. This is analogous to the classical approach to precise interrupts. Within such a scheme, the address of a specific instruction, say instruction x, is saved when the processor state is saved. All instructions preceding x have been executed. Instruction x, and those that follow it, have not been executed. Instruction x is thus a precise interrupt point. Since a scheme wherein the context is saved in the registers is adopted, context saving can be carried out in parallel without stalling execution units.

² Coefficient registers $R_c^t$ that hold the constants used in the task are not targeted during the context switch optimization. This is because constants generally differ from one task to another and cannot share registers.
B. Micropreemption Controller
The preemption on an ASPP consumes zero clock cycles because the ASPP can be reconfigured to execute another function right after the completion of the current function. Toward investigating the ASPP on-the-fly reconfiguration, consider an example with two tasks A and B. Fig. 4(a) shows their state-transition graphs (STGs). The dark circles represent the starting states. The STG of the ASPP implementing these two tasks is shown in Fig. 4(b). The starting states of the STGs are merged into a state that branches out to one of the two STGs. The corresponding controller is a finite state machine (FSM) with an additional control input that selects the task to be executed, as shown in Fig. 4(c).
However, if the latency of the current task is long, it may not be acceptable for an urgent task in a real-time system to wait until the current task is completed. An interrupt is required to temporarily stop the current task and execute the urgent task. After the completion of the urgent task, the interrupted task must be able to resume. Consider an example in which task A is interrupted by task B at the second state as shown in Fig. 5(a). The third state of task A is the resume state, and it must be pushed onto a stack [Fig. 5(b)] so that it can be popped when task B is completed [Fig. 5(c)]. The corresponding controller is a pushdown automaton (PDA) as shown in Fig. 6. A stack is added to the basic FSM to store the resume states. An interrupt occurs if the task ID of the previous clock cycle (Old ID) is not identical to the current task ID (New ID). The state to jump to is calculated from the task select signal.
Most DSP, video, control, and communication applications assume semi-infinite streams of data as their inputs. Computational tasks on such streams are never completed. Instead, they preempt each other so that their contexts are switched alternately. Consider an ASPP implementing an encoder and a decoder in a wireless local area network (LAN) system. While the encoder produces data to an output buffer, the decoder gets data from an input buffer. If the input buffer is about to become full while the ASPP is generating the encoded data to the output buffer, the ASPP must be reconfigured to decode the data in the input buffer. If the encoder is preempted by the decoder at its second state [Fig. 7(a)], the state transits to the first state of the decoder [Fig. 7(b)]. After several iterations, preemption may occur at the first state of the decoder. Now, the state must transit to the third state of the encoder. In addition, if the encoder is preempted again by the decoder, the state must transit to the second state of the decoder no matter at which state of the encoder the preemption is requested. From this analysis, it is observed that: 1) each task needs an FSM; 2) only one FSM must be operating, while other FSMs are idle; and 3) the output signals of the FSMs must be multiplexed so that only those from the operating FSM can reach the data paths.
A machine named multiple context FSM (MCFSM) was developed by merging these FSMs. This machine replaces the state register of the basic FSM with a state register file. Each entry in the register file represents a context of the FSM, which controls one of the tasks. Fig. 3 shows the block diagram of the MCFSM. At each clock cycle in which preemption is enabled, a task can be initiated or resumed by the task select signal, and its ID is put into the task ID queue. While the current task ID selects the current STATE out of the register outputs, the previous task ID (T − 1) enables one of the register inputs to store the next STATE (T + 1). The circuit is operated by a nonoverlapping two-phase clock {CLOCK 1, CLOCK 2}. During CLOCK 2, the current state is fed into the control logic to compute the control signals and the next state. CLOCK 1 delivers the control signals to the data paths and stores the next state. The control signals are pipelined so that the controller delay does not affect the critical path.
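The role of the state register file can be illustrated with a small behavioral model. This is only a sketch under simplifying assumptions: each task's STG is taken to be a simple cycle of states numbered from 1, preemption is allowed at every clock cycle, and the two-phase clocking, task ID queue, and control-signal pipelining are not modeled. The class and method names are hypothetical, as are the state counts used in the example.

```python
class MultipleContextFSM:
    """Toy model of an MCFSM: one saved state per task in a register file."""

    def __init__(self, states_per_task):
        self.states_per_task = dict(states_per_task)
        # State register file: the state each task executes next (1-based).
        self.state_file = {task: 1 for task in self.states_per_task}

    def clock(self, task_select):
        """Execute one state of the selected task; other contexts stay idle."""
        state = self.state_file[task_select]
        n_states = self.states_per_task[task_select]
        # Only the selected task's entry is updated with its next state.
        self.state_file[task_select] = state % n_states + 1
        return task_select, state


# Hypothetical state counts for the encoder/decoder example.
fsm = MultipleContextFSM({"encoder": 4, "decoder": 3})
selects = ["encoder", "encoder", "decoder", "encoder", "decoder"]
trace = [fsm.clock(task) for task in selects]
print(trace)
```

In this trace the encoder is preempted after its second state, the decoder starts in its first state, the encoder later resumes in its third state, and the decoder resumes in its second state, which mirrors the transitions described for Fig. 7.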
This controller does not have a scheduling scheme that arbitrates the competition between urgent tasks. An external scheduler that has an adequate arbitration policy can be combined with this micropreemption technique, which enables on-the-fly preemption between tasks.
III. MICROPREEMPTION: ALGORITHMS
In this section, algorithms for 1) preemption point insertion and 2) preemption context synthesis that minimizes the context switch overhead during multitask VLSI system synthesis are presented. The optimization problem can be defined as follows.
Given an underlying hardware model and N scheduled tasks, each with its own time bound (λ) and maximum preemption latency (τ ), insert preemption points, and bind edges to registers, so that the context switch overhead is minimized.
Simultaneous preemption point insertion and register binding to optimize the context switch overhead is NP complete. Even the register binding subproblem is NP complete because the circular-arc graph coloring problem (which is known to be NP complete) can be transformed in polynomial time into it. This is due to the cyclic nature of the lifetimes of variables.
The micropreemption synthesis flow is shown in Fig. 8. Initially, all tasks are scheduled in an integrated fashion by considering their word length, precision, hardware, and topological similarities. All operations of task i that are scheduled to clock cycle j constitute task slice $T_{i,j}$. A task slice graph for a task i is then constructed by drawing an edge from task slice $T_{i,j}$ to $T_{i,j+1}$. For example, consider a task $t_1$ that has been scheduled into three clock cycles in Fig. 9(a). The corresponding task slice graph is shown in Fig. 9(b). Task slice $T_{1,1}$ is annotated with the number of adders and the number of registers used in clock cycle 1 by task $t_1$ (2+, 2r). The next three steps form the core of micropreemption synthesis and are the focus of this paper. First, preemption points are inserted using the number of edges straddling a clock cycle as an estimate of the context switch overhead. The resulting preemption point set for each task will have more than the minimum number of preemption points. Next, starting with a single preemption point set for each task, a list of preemption point sets for each task is generated by pruning preemption points with high context switch overhead. One preemption point set from each list (i.e., one for each task) is then selected, and the selected sets are merged to yield a preemption point set with low context switch overhead. Finally, the preemption context is synthesized by binding edges to registers subject to preemption constraints. The output is then passed through hardware mapping and layout generation tools to synthesize a multitask VLSI system.
A. Preemption Point Insertion
Towards investigating preemption point insertion, consider a task with five edges ($e_1, \ldots, e_5$), an application latency of eight clock cycles, and an edge-to-register binding shown in Fig. 10. The register file has one input port and one output port that are accessed at the first half cycle and the last half cycle, respectively. The register overhead of a preemption point can be estimated as the number of edges straddling it. For instance, assuming a preemption latency of three yields clock cycles 1, 4, and 7 as preemption points as shown in Fig. 10. The preemption points are marked by "P." On preemption point insertion, $e_1$ and $e_5$ become red edges and are assigned to two dedicated registers $r_1$ and $r_2$. Edge $e_3$ becomes a green edge and is assigned to shared register $r_3$. Edges $e_2$ and $e_4$ become green edges but are assigned to dedicated register $r_2$. Initially, preemption points are inserted one task at a time using a polynomial-time heuristic algorithm Insert_Preemption_Points() of Fig. 11. It inserts preemption points (into each task) such that the context switch overhead due to dedicated registers is minimized.
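The straddling-edge estimate can be made concrete with a short sketch. The code below is not the Insert_Preemption_Points() algorithm of Fig. 11, which is not reproduced here; it is a simple greedy stand-in that assumes a cyclic schedule with cycles numbered from 0, edge lifetimes given as hypothetical (birth, death) pairs, and a preemption latency constraint expressed as the maximum gap between consecutive preemption points.

```python
def straddle_counts(lifetimes, n_cycles):
    """counts[c] = number of edges live across the boundary at cycle c.

    An edge (birth, death) is assumed live across every cycle from birth
    up to, but excluding, death, wrapping around the iteration boundary.
    """
    counts = [0] * n_cycles
    for birth, death in lifetimes.values():
        cycle = birth
        while cycle != death:
            counts[cycle] += 1
            cycle = (cycle + 1) % n_cycles
    return counts


def insert_preemption_points(counts, max_latency):
    """Greedy sketch: start at the cheapest cycle, then repeatedly pick the
    cheapest cycle within max_latency of the last point (ties go to the
    farthest cycle) until the cyclic schedule is covered."""
    if max_latency < 1:
        raise ValueError("max_latency must be at least one cycle")
    n = len(counts)
    start = min(range(n), key=lambda c: counts[c])
    offsets = [0]  # distances from start, walking once around the cycle
    while n - offsets[-1] > max_latency:
        last = offsets[-1]
        window = range(last + 1, min(last + max_latency, n - 1) + 1)
        offsets.append(
            min(window, key=lambda off: (counts[(start + off) % n], -off))
        )
    return sorted((start + off) % n for off in offsets)


# Hypothetical lifetimes for five edges in an eight-cycle schedule.
lifetimes = {"e1": (6, 2), "e2": (1, 3), "e3": (2, 5), "e4": (3, 5), "e5": (5, 7)}
counts = straddle_counts(lifetimes, 8)
print(counts)                               # [1, 2, 2, 2, 2, 1, 2, 1]
print(insert_preemption_points(counts, 3))  # [0, 3, 5]
```

The greedy choice is not guaranteed to be optimal; it is meant only to show how the per-cycle count of straddling edges drives where preemption points are placed.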
B. Preemption Point Refinement
Minimizing dedicated registers alone does not necessarily reduce the context switch overhead. Instead, it may increase the number of shared registers and hence the total context switch overhead. For example, assume that preemption points are inserted at clock cycles 0 and 5. Consider the following scenarios: 1) Scenario 1 [Fig. 12(a)]: Red edges $e_1$ and $e_5$ are assigned to dedicated registers $r_1$ and $r_2$, respectively. This results in two dedicated registers and zero shared registers for the task with a context switch overhead of two registers. 2) Scenario 2 [Fig. 12(b)]: Red edges $e_1$ and $e_5$ are bound to dedicated register $r_1$. The green edges $e_2$, $e_3$, and $e_4$ are bound to shared registers $r_2$ and $r_3$. This results in one dedicated register and two shared registers with a context switch overhead of three registers. Scenario 1 is superior to scenario 2 if all other tasks in the system do not require shared registers. Scenario 2 is superior to scenario 1 if at least one of the remaining tasks uses more than two shared registers. Based on these observations, it is clear that both the shared and dedicated registers must be considered in an integrated manner to optimize the context switch overhead.
For each task, the preemption point set generated by the insertion step is used as a starting point. A list of candidate preemption point sets is then generated by pruning preemption points with large context switch overhead [steps 2-7 in Refine_Preemption_Points() in Fig. 13]. Both dedicated and shared registers are used to compute the context switch overhead. Since the peak usage of shared registers cannot be known a priori, edges are bound to registers [using Preemption_Context_Synthesis()] to evaluate the context switch overhead exactly. This pruning technique is possible because for each task, preemption point insertion usually inserts more preemption points than are necessary. Finally, the best preemption point set for each of the tasks is obtained by using the context switch cost function given by (1). This is summarized in steps 8-11.

Fig. 13. Algorithm for preemption point refinement. For each task $t_i$ in the task set T, $E_i$ is the set of edges, $\lambda_i$ is the input latency, and $\tau_i$ is the preemption latency.
Consider a multitask VLSI system implementing three tasks $t_1$, $t_2$, and $t_3$ shown in Fig. 14. Following the preceding steps, task $t_1$ has two candidate preemption point sets with context switch overheads (3, 4) and (2, 5). Similarly, tasks $t_2$ and $t_3$ have four and three preemption point sets, respectively.
The context switch overhead for each preemption point set is given as the two-tuple (number of dedicated registers, number of shared registers). Selecting preemption point set 2 for $t_1$, preemption point set 4 for $t_2$, and preemption point set 3 for $t_3$ will result in a context switch cost of (2 + 2 + 1) + max(5, 7, 5) = 12. The context switch cost of selecting preemption point set 2 for $t_1$, preemption point set 3 for $t_2$, and preemption point set 2 for $t_3$ is (2 + 3 + 1) + max(5, 5, 5) = 11. From among the 2 × 4 × 3 = 24 configurations [which are enumerated, one at a time, by Generate_Configuration()], this has the lowest context switch overhead.
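The cost arithmetic above can be checked with a short sketch that evaluates the context switch cost (sum of dedicated registers plus maximum of shared registers) and enumerates configurations exhaustively. The function names are hypothetical, and only the (dedicated, shared) tuples that appear in the text are used; the remaining candidate sets of Fig. 14 are not reproduced, so the enumeration below covers a subset of the 24 configurations.

```python
from itertools import product


def context_switch_cost(selection):
    """Cost of one configuration: one (dedicated, shared) tuple per task."""
    dedicated = sum(d for d, _ in selection)  # dedicated registers add up
    shared = max(s for _, s in selection)     # shared registers are pooled
    return dedicated + shared


def best_configuration(candidates):
    """Enumerate one candidate per task and keep the cheapest combination."""
    return min(product(*candidates), key=context_switch_cost)


# The two configurations worked out in the text.
print(context_switch_cost([(2, 5), (2, 7), (1, 5)]))  # 12
print(context_switch_cost([(2, 5), (3, 5), (1, 5)]))  # 11

# Subset of the candidate lists, limited to tuples quoted in the text.
candidates = [
    [(3, 4), (2, 5)],  # t1: preemption point sets 1 and 2
    [(3, 5), (2, 7)],  # t2: preemption point sets 3 and 4
    [(1, 5)],          # t3: sets 2 and 3 both evaluate to (1, 5)
]
best = best_configuration(candidates)
print(best, context_switch_cost(best))
```

Exhaustive enumeration is practical here because the number of candidate sets per task is small after pruning; the product grows multiplicatively with the number of tasks.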
C. Preemption Context Synthesis
The optimization problem associated with preemption context binding can be defined as follows.
Given a scheduled task and a set of preemption points, bind the edges to the registers so that 1) the red edges are bound to dedicated registers and 2) the total number of registers is minimized.
The algorithm, outlined in Fig. 15 , minimizes the number of dedicated registers first and then minimizes the number of shared registers. Initially, the algorithm groups the edges into red and green edges using Classify(). Then the red edges are bound to dedicated registers. Finally, the green edges are bound. The ordering is important since while green edges can be bound to either the dedicated or the shared registers, red edges can only be bound to dedicated registers. A graph coloring heuristic Bind() [outlined in Fig. 15(c) ] is used for binding. Edge e has a set of neighbors (e.neighbor) and a register (e.register) bound to it. The edge with the largest number of bound neighbors is selected and bound to a register that is not bound to any of its neighbors.
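The two-phase binding described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Fig. 15. It assumes that the neighbors of an edge are the edges whose lifetimes overlap with it, that the red/green classification is already available, and that ties are broken by edge name. The function names and the register labels (`D0`, `S0`, ...) are hypothetical.

```python
def bind(edges, neighbors, binding, pool, prefix):
    """Greedy coloring step: bind each edge to a register from `pool`
    that none of its neighbors uses, allocating a new one if needed."""
    unbound = set(edges)
    while unbound:
        # Pick the unbound edge with the most already-bound neighbors
        # (ties broken by name so the result is deterministic).
        edge = max(sorted(unbound),
                   key=lambda e: sum(n in binding for n in neighbors[e]))
        taken = {binding[n] for n in neighbors[edge] if n in binding}
        free = [r for r in pool if r not in taken]
        if free:
            reg = free[0]
        else:
            reg = f"{prefix}{sum(r.startswith(prefix) for r in pool)}"
            pool.append(reg)
        binding[edge] = reg
        unbound.remove(edge)

def preemption_context_synthesis(red, green, neighbors):
    """Bind red edges to dedicated registers first, then green edges to
    any compatible register (dedicated or shared)."""
    binding = {}
    dedicated = []
    bind(red, neighbors, binding, dedicated, "D")
    pool = list(dedicated)          # green edges may reuse dedicated registers
    bind(green, neighbors, binding, pool, "S")
    shared = pool[len(dedicated):]  # registers created for green edges only
    return binding, dedicated, shared

if __name__ == "__main__":
    nbrs = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"}}
    print(preemption_context_synthesis(["a", "b"], ["c", "d"], nbrs))
```

In the usage example, the green edges `c` and `d` fit into the two dedicated registers already allocated for the red edges, so no shared register is needed. This is the reason red edges are bound first.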
This greedy approach to the preemption context synthesis combined with the flexibility provided by the preemption point Fig. 12(b) in which the maximum preemption latency is 6, and in which edges are bound by this greedy approach. Preemption points are inserted at clock cycles 0 and 5 since they have the minimum number of edges. However, since the preemption point at clock cycle 5 is removed in the preemption point refinement phase, the edges are bound as shown in Fig. 12(a) by the same greedy algorithm. One of these solutions is then chosen so that the global overhead is minimized.
The HYPER high-level synthesis system is used as the experimental platform. HYPER executes all the conditional paths in parallel and chooses one of the outputs at the end. However, if only one of the conditional paths is executed at any given time, the register binding algorithm should be slightly modified to take into account the exclusive execution of the paths. The registers are shared among the conditional paths. A register that is bound to a red edge is dedicated to the task no matter which path the edge is on. Therefore, it is better to bind red edges to registers that are already bound to red edges of other conditional paths.
IV. EXPERIMENTAL RESULTS
The micropreemption synthesis techniques proposed in this paper were validated on the set of DSP, video, control, and communication applications summarized in Table I. The selected applications span a wide range of complexities in computational structures and include Arai's fast DCT algorithm (ARAI), decimate-by-four wave digital filter (DECBY4), four-state linear controller (FSLC), S. Winograd's small-N DFT for N = 8 (FFT8), digital wavelet transform (WAVELET), and ninth degree bireciprocal WDF with Butterworth response (WDF9). For each application, columns 2-5 show the number of nodes (|N|), the number of edges (|E|), the word length (wl), and the critical path (cp), respectively. The input latency (T) for each application is shown in column 6. The next four columns give the hardware allocation. The column titled "reg" shows the number of registers used in the implementation. The numbers in parentheses are the register counts for constants. Synthesis modules for hardware mapping and layout generation from the HYPER high-level synthesis system were used to complete the synthesis trajectory.
A. Register Overhead Evaluation
Micropreemption synthesis algorithms were invoked on a set of multitask VLSI systems implementing these applications. The results of six multitask VLSI systems are summarized in Table II . The first column shows the applications implemented in each system. The next three columns summarize the hardware allocation. The last three columns give the number of registers for the case when no preemption points are inserted (0-p), when one preemption point is inserted (1-p), and when preemption points are inserted at all clock cycles (all-p). Using the 0-p case as the basis, the register overhead for the 1-p case varies from 2% to 39%. At the other extreme, the register overhead for the all-p case varies from 30% to 70%.
To evaluate the impact of the proposed techniques on the overall area, the synthesis trajectory was completed by passing these designs through the hardware mapping and layout synthesis phases. The area overhead for the six designs, measured on actual layouts, is summarized in Table III. The areas are reported for the 0-p case and the all-p case. Again, using the 0-p case as the basis, the area overhead for the all-p case varies from 4% to 11%, as shown in the last column.
When compared to the background-memory-based schemes, the proposed scheme does not need 1) additional ports on the register files, which are used to save/restore data to/from the background memories without stalling the currently running task; 2) additional buses to interconnect the register files to the background memories; and 3) additional control logic to compute memory addresses. Since the preemption requests can be serviced in parallel without stalling the execution of the tasks, there is no performance penalty.
B. Tradeoff Between Preemption Latency and Context Switch Overhead
The context switch overhead can be reduced by limiting the preemption points to those control steps in which very few variables are alive. On the other hand, this increases the preemption latency and may violate the user-specified constraint on preemption latency. Towards investigating the tradeoff between preemption latency and context switch overhead, the proposed techniques were invoked for six preemption latencies. Constituents of the context switch overhead (dedicated, shared, and coefficient registers) are plotted in Fig. 16. The preemption latency is normalized to $\frac{1}{n}\sum_{i=1}^{n} (\tau_i/\lambda_i)$, where $n$ is the number of tasks in the ASPP, $\tau_i$ is the maximum preemption latency for task $i$, and $\lambda_i$ is the input latency of task $i$.
As the preemption latency increases: 1) the number of dedicated registers decreases; 2) the number of shared registers increases; and 3) the total number of registers decreases monotonically. Finally, the number of coefficient registers is invariant to the preemption latency.
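The normalized preemption latency used in this comparison is the mean, over the tasks, of the ratio of maximum preemption latency to input latency. A minimal sketch of that computation follows. The function name and the sample values in the usage example are hypothetical and are not taken from Table I or Fig. 16.

```python
def normalized_preemption_latency(tasks):
    """tasks: list of (tau, lam) pairs, where tau is the maximum preemption
    latency and lam is the input latency of a task, both in clock cycles."""
    if not tasks:
        raise ValueError("at least one task is required")
    return sum(tau / lam for tau, lam in tasks) / len(tasks)

if __name__ == "__main__":
    # Hypothetical three-task system, for illustration only.
    print(normalized_preemption_latency([(3, 6), (2, 8), (5, 10)]))
```

A value of 1 corresponds to every task being preemptible only once per input period, and smaller values correspond to tighter preemption latency constraints.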
C. Register Transfer Level Analysis
The microarchitecture with preemption points for the multitask VLSI system implementing {ARAI, FFT8, WAVELET} is shown in Fig. 17. Registers in the register files are classified as one of the three types by using different gray levels as shown in the legend in Fig. 13. The number of registers increases by 15% (from 73 to 84) as the preemption latency decreases by 50%. However, since there is no interconnect overhead, the increase in the total chip area is very small.
V. CONCLUSION
Techniques and algorithms to incorporate micropreemption constraints during multitask very large scale integration (VLSI) system synthesis were presented. The area overhead of the proposed scheme is under 12%. A controller-based scheme to eliminate the performance degradation by 1) partitioning the task states into critical sections; 2) executing the critical sections atomically; and 3) preserving atomicity by rolling forward to the end of the critical sections on preemption was also implemented.
