Jack W. Davidson
Department of Computer Science
University of Virginia, Charlottesville, Virginia 22903
jwd@cs.virginia.edu
Abstract
Application-specific instruction set processor (ASIP) design is a promising approach for improving the cost-performance ratio of an application. ASIPs are especially useful for embedded systems (e.g., digital cameras, cellular phones, color printers, etc.) where a small increase in performance and decrease in cost can have a large impact on a product's viability. A novel computer organization called the counterflow pipeline (CFP) [25] has several characteristics that make it attractive for custom embedded processors. The CFP has a simple and regular structure, local control, a high degree of modularity, asynchronous implementations, and inherent handling of complex features such as register renaming and pipeline interlocking. However, the original CFP is not performance competitive for ASIPs. In this paper, we extend the CFP to a wide-issue organization, called a wide counterflow pipeline (WCFP), that has much higher performance and is better suited for automatic design of instruction-level parallel (ILP) processors than the original proposal. Our work is a novel and practical application of the CFP to automatic ASIP design.
Introduction
General-purpose processors are good for average workloads, as typified by popular benchmark suites such as SPEC'95. These architectures are optimized to execute applications containing a set of common operations and exhibiting similar behavior (e.g., the frequency of control-transfer operations). The ever increasing demand for performance and the need to handle arbitrary code have led to very complex and costly general-purpose microarchitectures. For example, the 4-way superscalar HP PA-8000 processor [16] is optimized for exploiting ILP across a variety of application types; this goal requires the PA-8000 to tolerate high-latency operations such as memory accesses and frequent control-transfer operations. Fully realizing such an aggressive design requires large instruction windows and complex structures (e.g., register rename buffers, data prefetching support, etc.) to overcome high-latency operations and control dependences. The PA-8000 does this with a 56-entry instruction re-order buffer, data prefetch instructions, predicated execution, and branch prediction and history tables.
The cost of aggressive general-purpose processors is usually prohibitive for embedded applications; however, many such applications would benefit from high performance. An alternative to a general-purpose architecture is a custom processor matched to an application's performance and cost goals. An ASIP has the flexibility to include the minimal instruction set and microarchitecture elements that give good performance and low cost (e.g., power consumption, code size, quick turnaround, etc.) for a single code without the complexity of devices for general-purpose codes. Because cost and time-to-market constraints are very important to embedded systems [24], an ASIP architecture should permit automatic design, including high-level architectural design.
The WCFP is a good candidate for this type of fast architectural synthesis because of its superior composability and simplicity. These properties greatly reduce the complexity of automatic design because a synthesis system does not have to design control paths, determine complex bus and bypass networks, etc., as it would for a traditional architecture, such as a custom VLIW [8]. The WCFP also has local point-to-point communication between functional blocks, which may lead to faster and lower power processors than architectures that use traditional bus structures for interconnecting functional devices. Local communication is especially important as feature size is reduced, where wire latency and power consumption dominate transistor latency and power [11].
Our previous work discusses the advantages and disadvantages of application-specific CFPs [4, 5]. In this paper, we describe extensions to the CFP to make it better suited for custom ILP processors. Although ASIPs based on the original CFP take some advantage of ILP, they do not fully exploit an application's ILP because they issue only one instruction at a time. We observed that higher levels of performance may be obtained by widening the CFP's instruction pipeline to take better advantage of ILP with the assistance of compiler transformations such as if-conversion and software pipelining.
The WCFP is ideal for automatic design of ILP processors because its pipeline width, depth, and functional repertoire can be easily and quickly tailored to match the performance requirements of an application. Indeed, we have developed an end-to-end architectural design system that accepts an application in the programming language C and automatically generates a custom processor for that application using the WCFP. Our previous work describes the design methodology [7] and software infrastructure [6] needed for architectural synthesis of custom WCFPs. We have not previously described the WCFP architecture in detail. This paper focuses on WCFP architectural trade-offs and describes several microarchitecture enhancements necessary to get high performance, including increased instruction width, result packing, buffering stage results, register caching, and predicated execution. A future paper will address implementation and cost issues.
This paper is organized as follows. Section 1 describes our design strategy, and Section 2 introduces the original CFP as background for the extensions that we propose. Section 3 describes the WCFP and presents several enhancements to the basic CFP to get good performance from custom pipelines. Section 4 compares the performance of WCFPs to CFPs and VLIW architectures. Section 5 describes related work, and the final section concludes and summarizes the paper.
Design Strategy
The kinds of high-performance embedded applications that we are targeting with custom WCFPs typically have two parts: control code and a computationally intensive part. The computation part is typically a kernel loop that accounts for the majority of execution time. Increasing the performance of the most frequently executed portion of an application increases overall performance. Thus, synthesizing custom hardware for the computation-intensive portion of an application is an effective way to increase performance.
The types of applications we are considering need only a modest kernel speedup to improve overall performance. For example, JPEG has a function j_rev_dct() that accounts for 60% of total execution time. This function consists of applying a single loop twice (to do the inverse discrete cosine transformation), so it is a good candidate for a custom counterflow pipeline. For JPEG, a small kernel speedup of 6 or 7 achieves most of the overall speedup.
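The claim about JPEG can be checked with Amdahl's law. The following is a minimal sketch that takes the 60% kernel fraction from the text; the function and variable names are illustrative and are not part of our design system.

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only the kernel is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


# Share of JPEG execution time spent in the kernel, as stated in the text.
KERNEL_FRACTION = 0.60

# Limit of the overall speedup as the kernel speedup grows without bound.
upper_bound = 1.0 / (1.0 - KERNEL_FRACTION)

for s in (2, 6, 7, 100):
    print(f"kernel speedup {s:>3}: overall {overall_speedup(KERNEL_FRACTION, s):.2f}"
          f" (limit {upper_bound:.2f})")
```

With a 60% kernel, the overall speedup can never exceed 2.5, and a kernel speedup of 6 already gives an overall speedup of 2.0, so larger kernel speedups yield diminishing returns.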
Our target system architecture has two processors: a traditional processor (e.g., a MIPS core) for executing control code and a WCFP processor for executing the computation portions of an application. We are presently studying how best to incorporate a WCFP into the whole system and whether it should be incorporated as a loosely coupled co-processor or a tightly coupled custom functional unit. Similarly to other application-specific systems such as RaPiD [10] and PipeRench [13], our initial plan is to use a co-processor structure, where the control and kernel processors communicate via I/O ports and have a unified data memory address space. To begin computation on the kernel processor, the control processor initializes the kernel processor's loop live-in registers and goes into a low-power sleep mode while the kernel processor executes. When the kernel processor completes execution, it wakes up the control processor, which regains program control and copies loop live-out registers from the kernel processor. Care must be taken that the split processor system architecture does not introduce so much communication/synchronization overhead between the control and kernel processors that performance is not improved. In fact, in some cases, it may be advantageous to optimize the control/kernel processor interface to the specific needs of an application. For this paper, we focus only on the automatic architectural design of WCFPs for kernel loops, and in the future, we will address system architecture design and how to map an entire application onto the system.
For applications where there is not a clearly identifiable kernel, the above strategy will not be as effective. However, most embedded applications we have examined have execution profiles similar to JPEG: one kernel that accounts for over 50% of the overall execution time of the application. We profiled several applications from the MediaBench benchmark suite [19] and found that most of these applications had one loop that accounted for the majority of execution time. These applications include GSM 6.10 full rate speech coding, adaptive differential pulse code modulation, image compression/decompression, MPEG-III audio playback (not included in MediaBench), and CCITT G.721 voice compression/decompression. For these benchmarks, the kernel computation accounted for 53% to 85% of execution time (when aggressive function inlining is applied).
Design Evaluation
To evaluate WCFPs, we use several common benchmarks, including loops extracted from embedded applications in the MediaBench suite [19]. The benchmarks include integer versions of the Livermore loops k1, k5, k7, and k12, the finite impulse response filter (fir), vector dot product (dot), and four other kernels extracted from complete applications. These loops include the 2-D discrete cosine transformation (dct) used in image compression and an implementation of the Floyd-Steinberg image dithering algorithm (dither). We also extracted the vector computation a = b^k mod d from RSA encryption (mexp). The final loop is from GSM 6.10 speech decoding (gsm).
Counterflow Pipelines
In this section, we briefly describe the original CFP [25] as background for discussing our WCFP architecture. The CFP has two pipelines flowing in opposite directions as shown in Figure 1. One is the instruction pipeline, which carries instructions from an instruction fetch stage to a register file stage. When an instruction issues, an instruction bundle is formed that flows through the pipeline. The instruction bundle has space for the instruction opcode, operand names, and operand values. The other pipeline is the result pipeline, which conveys results from the register file to the instruction fetch stage. Whenever a value is inserted in the result pipeline, a result bundle is created that holds a result's name (i.e., register name) and value. The instruction fetch stage decodes and issues instructions and creates their instruction bundles. It also discards results from the pipeline. The register file holds destination register values of instructions that have exited the pipeline. It is updated with an instruction's destination register whenever an instruction enters the stage. The CFP has pipelined functional units called sidings that execute instructions. Sidings are connected to the processor through launch and return stages, which initiate siding operations and return values from sidings. Figure 1 shows an example siding for memory that is connected to the pipeline by mem-launch and mem-return. Instructions may also execute in a pipeline stage of an appropriate type (e.g., an addition stage) without using a siding.
The instruction and result pipelines interact: instructions copy values to and from the result pipeline (called garnering). This interaction is governed by rules that ensure sequential execution semantics. There are also rules that ensure result values are current for their position in the pipeline and not values from previous operations that use the same register names.
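To make the interaction concrete, the following is a minimal sketch of what may happen when an instruction bundle and a result bundle meet in a stage. It follows our reading of the rules in [25] (garner a matching source, refresh a result from an executed instruction, invalidate a result that an unexecuted instruction will overwrite); the class and function names are hypothetical, and the sketch omits stage arbitration and sidings.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ResultBundle:
    name: str            # register name carried by the result
    value: int
    valid: bool = True   # cleared when the value is stale for younger instructions


@dataclass
class InstructionBundle:
    opcode: str
    dest: str
    sources: Dict[str, Optional[int]]   # operand name -> value (None until garnered)
    executed: bool = False
    dest_value: Optional[int] = None


def meet(instr: InstructionBundle, result: ResultBundle) -> None:
    """Apply the (assumed) interaction rules when the two bundles share a stage."""
    if not result.valid:
        return
    # Garner: copy a matching result value into a source operand that is still empty.
    if result.name in instr.sources and instr.sources[result.name] is None:
        instr.sources[result.name] = result.value
    if result.name == instr.dest:
        if instr.executed:
            # The instruction holds the newer value: refresh the result bundle.
            result.value = instr.dest_value
        else:
            # The instruction will overwrite this register: the result is stale
            # for every instruction behind it, so invalidate it.
            result.valid = False
```

For an operation such as add r7,1,r7, a result carrying r7 is first garnered as a source and then invalidated, since the instruction has not yet produced the new r7.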
As Figure 1 shows, CFPs have local communication: functional devices communicate only with their neighbors. This has two advantages. First, synthesis does not need to determine device interconnection; it is implicit in pipeline stage order. Second, local communication may lead to fast and low power designs, especially as global wire delays dominate critical path latencies.
Although the simplicity and local communication of the CFP are appealing, there are some potential disadvantages [5]. First, the CFP requires arbitration between adjacent pipeline stages and, in practice, it has proven difficult to build fast (and correct) asynchronous CFP arbiters and control circuits because of race conditions, circuit hazards, and handshaking. However, some designs have been proposed [18]. Second, enforcing the pipeline rules may be expensive because they require examining an instruction's local state (e.g., whether it has been launched, executed, invalidated, etc.) and comparing an instruction's operands to a result bundle. Enforcing the pipeline rules may become a performance bottleneck because it affects instruction throughput and the speed at which results are sent to their consumer instructions. We have found that careful instruction scheduling and arrangement of pipeline stages can reduce the performance impact of the pipeline rules. Finally, CFPs may use more chip area than traditional architectures because pipeline registers (and other stage elements) tend to be very wide: they hold fully decoded and instantiated instruction and result bundles. However, the area cost of CFPs may be partly offset by the lack of global interconnection. Also, as time-to-market and power consumption replace chip area as the primary design considerations for embedded systems, a composable and simple architecture, like the CFP, that enables automatic design becomes extremely attractive. In future work, we will investigate how CFP implementation issues, such as power consumption, affect architectural trade-offs and automatic CFP design.
We have built a microarchitecture simulator and design infrastructure for studying CFPs [6]. The simulator models asynchronous CFPs by varying computational latencies. Moving an instruction or result between stages takes 1 time unit, garnering a result takes 3 time units, and launching or returning an instruction from a siding takes 3 time units. Executing an operation such as an addition takes 5 time units. High-latency operations are scaled relative to low-latency ones, so an operation such as multiplication (assuming it is five times slower than addition) takes 25 time units. The statistics in this paper were collected using our simulator and the latencies above, with simulation models for the WCFP.
Wide Counterflow Pipelines
The WCFP takes advantage of ILP across many loop iterations to obtain better performance than pipelines based on the CFP. We have developed a design system for tailoring a WCFP to an application using software pipelining and design space exploration. In this section, we first summarize our design methodology and then describe our extensions to the original CFP to improve performance and to support automatic design.
Design Methodology
WCFP architectural synthesis operates on predesigned functional devices such as pipeline stages and functional sidings. The design space of WCFPs is defined by processor functionality and topology. Processor functionality is the type and number of devices in a pipeline, and topology is the interconnection of those elements. Processor functionality is characterized by a user-supplied design database of computational elements that indicates device type and semantics for each database entry.
The WCFP design system accepts an application program (in C) with its kernel loop annotated as an input to the code optimizer vpo [2], which compiles the application and transforms the loop using code optimizations such as strength reduction, induction variable elimination, global register allocation, loop invariant code motion, global common subexpression elimination, etc. vpo passes the optimized kernel loop to WCFP architectural synthesis, which selects and instantiates computational devices from the design database and derives the processor interconnection network. The synthesis process has five steps as shown in Figure 2.
The first synthesis step constructs a software pipeline loop using iterative modulo scheduling [23] . The software pipeline kernel specifies the operations and functional elements to include in a WCFP.
Step 2 does pipeline extraction, which instantiates pipeline stages and sidings for kernel instructions from the design database. After pipeline extraction, step 3 creates an instruction set architecture from the software pipeline kernel and the WCFP. Once the custom WCFP and ISA are determined, step 4 generates the complete instruction schedule for the software pipeline kernel loop. The final synthesis step does pipeline refinement, which iteratively adjusts the order of pipeline stages to match the kernel loop's execution behavior. Stage order is refined using a simple heuristic that identifies locations in the pipeline with heavy resource contention and re-arranges stages to reduce this contention. A more detailed discussion of WCFP synthesis is in Childers
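The first synthesis step depends on how many functional elements are available, because modulo scheduling cannot achieve an initiation interval below the resource-constrained bound. The sketch below computes that bound in its simplest form, assuming every operation occupies one unit of its resource class for one issue slot; the full formulation in [23] uses reservation tables and also considers recurrences. The resource names and counts are hypothetical.

```python
from collections import Counter
from math import ceil
from typing import Dict, Iterable


def res_mii(op_resources: Iterable[str], units_available: Dict[str, int]) -> int:
    """Resource-constrained lower bound on the initiation interval.

    op_resources lists, for each operation in the loop body, the resource class
    it needs; units_available gives the number of units of each class.
    """
    uses = Counter(op_resources)
    return max(ceil(count / units_available[res]) for res, count in uses.items())


# Hypothetical loop body: three memory operations, two ALU operations, one multiply.
loop_body = ["mem", "mem", "mem", "alu", "alu", "mul"]

print(res_mii(loop_body, {"mem": 1, "alu": 2, "mul": 1}))  # one memory unit
print(res_mii(loop_body, {"mem": 3, "alu": 2, "mul": 1}))  # three memory units
```

Adding memory units lowers the bound in this example, which illustrates the kind of trade-off between functional repertoire and loop throughput that a design space exploration has to weigh.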
Wide Instructions
The original CFP issues one operation per instruction and tries to overlap the execution of multiple operations in separate pipeline stages to get good performance. We found that this was not sufficient to attain competitive levels of performance for ASIPs. To solve this problem and to make the CFP better suited for ASIP design, we extended the original proposal to a VLIW architecture that issues multiple operations per instruction. Figure 3 shows an example WCFP that issues three operations per instruction. The width of the result pipeline is matched to the maximum width of an instruction (in this example, the width is 2 since the add-cmp stage generates two results) to ensure there is enough space in results to convey all destination values of a wide instruction. There are no rules for what operations may be scheduled in an instruction; however, every operation in an instruction must use a different destination register to ensure correct execution, except for predicated operations that are guarded by complementary predicates. Although there is no dependence checking among operations in an instruction, dependences between instructions are enforced inherently by the WCFP's pipeline rules without any centralized interlocking mechanisms. Restrictions such as issuing multiple memory operations together are determined by the operation repertoire of functional devices. For example, issuing two loads together requires a memory unit with the capability of performing two memory accesses simultaneously. The operations in an instruction move through a WCFP in lock-step, although they may execute in different stages or sidings. For example, the operations in the instruction (ld [r5],r6; add r7,1,r7; cmp r10,0) execute in two stages in Figure 3. The load is started in mem-launch and finished in mem-return, and the addition and comparison are done in add-cmp.
Doing operations in separate stages lets them execute at the best point in the pipeline. The best location for an operation to execute is usually in the stage immediately after the point where the operation acquires its last source operand. Because individual operations in an instruction may garner their operands in different stages and become ready to execute at different times, the location where each operation executes can be tailored to the data flow behavior of the application to significantly improve performance.
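The destination-register rule for wide instructions is simple enough to check mechanically. The sketch below is one possible check; the Op record, the encoding of a guard as a (predicate register, required value) pair, and the treatment of cmp as writing no general-purpose register are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Op:
    opcode: str
    dest: Optional[str]                       # None if no register is written
    pred: Optional[Tuple[str, bool]] = None   # (predicate register, required value)


def complementary(a: Op, b: Op) -> bool:
    """True if the two operations are guarded by opposite values of one predicate."""
    return (a.pred is not None and b.pred is not None
            and a.pred[0] == b.pred[0] and a.pred[1] != b.pred[1])


def is_legal(instr: List[Op]) -> bool:
    """Every pair of operations must write different registers, unless the pair
    is guarded by complementary predicates."""
    for i, a in enumerate(instr):
        for b in instr[i + 1:]:
            if a.dest is not None and a.dest == b.dest and not complementary(a, b):
                return False
    return True


# The example instruction from the text: (ld [r5],r6; add r7,1,r7; cmp r10,0)
example = [Op("ld", "r6"), Op("add", "r7"), Op("cmp", None)]
print(is_legal(example))
```

A compiler back end could run such a check after scheduling, since the hardware itself does no dependence checking among the operations of one instruction.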
Pipeline Stage
One problem with the design of the CFP is how to arrange control and processing of a pipeline stage. Figure 4 shows a block diagram of our approach for a WCFP execution stage.
Conceptually, an execution stage has three processes: 1) garner, 2) execute, and 3) control. Garner checks for matching register names in an instruction and result bundle. If a match is found, the instruction and result are updated according to the pipeline rules of the original CFP [25]. Execute checks whether an instruction executes in the stage. If it does, it waits until the operation it executes has all of its source operands before doing the operation. After executing the operation, a new result is generated and inserted into the result pipeline. Control tells the arbitration units whether the stage is ready to advance an instruction or result, or to accept a new instruction or result.
The control process also moves instruction and result bundles between stages and inserts results generated by the execute process into the result pipeline. The writeback buffer in Figure 4 is used to insert new register values into the result pipeline. The buffer holds destination values generated by executing an operation in the stage, and the control process moves values from the buffer to the result pipeline. There are two feasible ways to do this. In the first choice, control waits until the result register is empty before moving the value from the writeback buffer to the result register. This inserts a new value into the stream of result bundles flowing through the result pipeline. This approach is relatively simple; however, it increases pressure on result pipeline bandwidth by generating new result packets for every destination register, which causes additional garner operations over the execution lifetime of a loop.
The second alternative moves the buffer's value immediately to the result register, regardless of whether the register is empty. When there is a bundle in the result register with enough space to hold the buffer's value, the value is packed into the existing bundle. If the bundle in the result register is full, the buffer's value is retained until the full result bundle exits the stage. Once the full bundle exits, a new bundle is inserted containing only the buffer's value. Packing results limits the number of individual packets flowing through the result pipeline, which reduces the number of garner operations and improves result pipeline bandwidth utilization. Figure 5a shows speedup for custom pipelines that pack results versus pipelines that do not. The pipelines in the figure have an instruction width of four and are customized to each benchmark using the strategy above. Figure 5a shows that for most benchmarks performance is improved significantly. For dct and k7, the speedup is 1.3. However, for two benchmarks, dither and W, there was no performance improvement. Figure 5b explains why performance is improved for each benchmark. The speedup for each benchmark comes from the reduction in the number of garners done during a benchmark's execution. For dither and W, there is very little reduction in garners; i.e., the data flow for these benchmarks is such that few results are packed into existing result bundles.
For dct and k7, many results are packed into existing bundles, which reduces the number of garners for each benchmark and improves performance. Packing results also reduces the number of garners for the other benchmarks; however, there is less of a relative speedup compared to dct and k7. The impact of reducing the number of garners is small because these benchmarks already do a good job of masking garner latency. The writeback buffer has another advantage: it lets results flow through a stage during the execution of an operation by decoupling the execute process from other stage activity. Our experiments indicate that for good performance it is key to avoid delaying the delivery of results to their consumer operations, and a writeback buffer lets results move through a stage as quickly as possible. The inclusion of a buffer improves performance by up to 13%.
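The two ways of moving a value from the writeback buffer into the result pipeline can be contrasted with a small model of one stage. This is a sketch under simplifying assumptions: a result bundle is a list of (register, value) entries, the stage holds at most one bundle, and the Stage class and try_insert method are hypothetical names rather than part of our simulator.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Entry = Tuple[str, int]  # (register name, value)


@dataclass
class Stage:
    width: int                                # maximum entries in one result bundle
    result_reg: Optional[List[Entry]] = None  # bundle currently in the stage, if any
    writeback: Optional[Entry] = None         # value waiting to enter the result pipeline

    def try_insert(self, pack: bool) -> bool:
        """Try to move the writeback value into the result pipeline.

        Without packing, the value waits for an empty result register.
        With packing, it may also join a bundle that still has free space.
        Returns True if the value left the writeback buffer.
        """
        if self.writeback is None:
            return False
        if self.result_reg is None:
            self.result_reg = [self.writeback]        # start a new bundle
        elif pack and len(self.result_reg) < self.width:
            self.result_reg.append(self.writeback)    # pack into the existing bundle
        else:
            return False                              # retain the value for now
        self.writeback = None
        return True


stage = Stage(width=2, result_reg=[("r1", 1)], writeback=("r6", 9))
print(stage.try_insert(pack=False))  # result register occupied, so the value waits
print(stage.try_insert(pack=True))   # free slot available, so the value is packed
```

In the packing case the consumers of r1 and r6 see one bundle instead of two, which is the source of the reduction in garner operations discussed above.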
The design of a WCFP stage presents a possible limitation to the architecture. The comparator network in each stage that enforces the pipeline rules is potentially complex, especially for very wide pipelines. A fully parallel network is O(n × m) in size, where n is the number of operands in an instruction and m is the number of results in a result bundle. The size of the network is a concern because as network height increases, the delay to determine a register name match also increases, and a large network requires complex routing of instruction operands and results among the comparators. The size of each comparator is relatively small because the comparators operate only on register names, which are typically 5 bits wide. Figure 6a shows the effect of an increase in comparison latency on performance. The figure shows that for a fixed instruction width, an increase in comparison latency of up to 30% causes performance to be degraded by 7-13.5%. Figure 6b shows average degradation for an increase in comparison latency of 3-30%. An increase in latency of up to 30% causes a maximum average increase in execution latency of only 11.1%. This is very encouraging since the benchmark pipelines were not recustomized to take into account the increase in comparison latency. Also, the instruction width was held constant, and a wider instruction may improve overall performance by taking better advantage of ILP despite an increase in comparison latency.
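The growth of the comparator network can be illustrated with a short calculation. The sketch assumes three register names per operation (two sources and one destination) and a result bundle as wide as the instruction; both are illustrative assumptions, and only the 5-bit register name width comes from the text.

```python
from typing import Tuple


def comparator_network(ops_per_instr: int, names_per_op: int,
                       results_per_bundle: int, name_bits: int = 5) -> Tuple[int, int]:
    """Return (comparators, compared bits) for a fully parallel n x m network."""
    n = ops_per_instr * names_per_op   # register names in one instruction bundle
    m = results_per_bundle             # register names in one result bundle
    comparators = n * m
    return comparators, comparators * name_bits


for width in (1, 2, 4, 8):
    count, bits = comparator_network(width, 3, width)
    print(f"width {width}: {count} comparators, {bits} bits compared per stage")
```

Under these assumptions, doubling the width quadruples the number of comparators, while each comparator stays narrow.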
Predicated Execution
To reduce control transfer operations and to support aggressive ILP code transformations, the WCFP supports predicated execution. As demonstrated by other work [20, 21], this can expose significant ILP by flattening the control flow graph into a sequence of operations that can be scheduled together. The WCFP has guarded operations and predicate-generating comparisons similar to those of IMPACT EPIC [1] and HPL PlayDoh [17]. The WCFP handles predicated execution naturally by treating predicates as any other source operand and checking the value of the predicate before executing or launching an operation. Individual operations within an instruction are predicated; there is no "instruction predicate" that guards all operations in an instruction. For example, the instruction (ld-p p0,[r5],r6; add-p p1,r7,1,r7; cmp r10,0) has three operations, two of which are guarded by different predicates (ld-p and add-p). Predicating each operation gives maximum scheduling flexibility since operations with different predicates may be scheduled in the same instruction. Operations that are guarded by complementary predicates with the same destination register may also be scheduled in the same instruction.
The hardware support needed for predicated execution is modest. First, there must be enough predicate and general-purpose registers to aggressively apply if-conversion. The WCFP has separate register files for predicate and general-purpose registers. The size of these files is adjusted by our synthesis system to meet the needs of a given kernel loop. For the benchmarks in this paper, the register file is small, with an average size of 26.
Second, execution units should not generate a result on a false predicate. Predicated operations are always executed regardless of their predicate value. However, the value generated by an execution unit is inserted in the result pipeline only if the predicate is true. Otherwise, the generated value is ignored. Finally, the status of an operation's predicate must be checked before updating the register file with an operation's destination register to ensure that the destination value is valid.
Predicated execution in the WCFP requires that the original value of the destination register be inserted into the result pipeline when a false predicate is encountered. This is done by including the destination register as a source operand to a predicated operation, and by copying the original value of the destination register to the result pipeline on a false predicate. The original value is inserted because dependent operations may be waiting for the destination register, and if a value were not generated, those operations would deadlock.
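The treatment of a false predicate can be summarized in a few lines. This is a minimal sketch of the value that an execution point inserts into the result pipeline for the destination register; the function name and its signature are hypothetical.

```python
import operator
from typing import Callable, Sequence


def predicated_result(op: Callable[..., int], pred: bool,
                      sources: Sequence[int], old_dest: int) -> int:
    """Value inserted into the result pipeline for the destination register.

    The operation is always executed. On a true predicate the computed value
    is inserted; on a false predicate the computed value is ignored and the
    original destination value (carried as an extra source operand) is
    inserted instead, so that dependent operations are not left waiting.
    """
    computed = op(*sources)
    return computed if pred else old_dest


# add-p p1,r7,1,r7 with r7 = 7
print(predicated_result(operator.add, True, (7, 1), old_dest=7))
print(predicated_result(operator.add, False, (7, 1), old_dest=7))
```

Either way a result for r7 enters the result pipeline, which is what prevents consumers of r7 from deadlocking.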
Although it is possible to squash operations with false predicates before they reach their execution point, the increase in stage complexity is usually not worth the small performance improvement. To squash predicated operations as soon as possible requires that all stages have the ability to recognize an operation with a false predicate and to insert the original destination value of the operation (contained in the instruction bundle) into the result pipeline. The pipelines in this paper support squashing operations with false predicates only at their execution point, which simplifies the hardware for stages that do not handle predicated operations.
An alternative predication scheme to the one proposed above inserts the destination register of a false-predicate operation into the result pipeline when the operation flows through the register file. However, this scheme may substantially delay the delivery of source operands to dependent operations, and from experimental work, we have found that it is not competitive with the scheme described above. The pipelines in this paper use the first approach with a register cache at the beginning of the pipeline (see Section 3.5) that holds register values from results that have recently exited the pipeline. In most cases, the previous value of the destination register is acquired from the register cache, which avoids requesting the destination register from the register file.
Register Caching
The CFP has its register file at the end of the pipeline; however, placing the register file at the end of the pipeline may lead to poor performance because every instruction entering the pipeline must request its source operands directly from the register file. The source operands for an instruction are forwarded to their consumer instructions via the result pipeline. This forwarding can hurt performance because instructions may stall waiting for their source operands and because there is an increase in the number of individual results flowing in the pipeline, which causes additional comparisons and resource conflicts for result registers. Although Sproull, Sutherland, and Molnar suggest using a "register cache" at the beginning of the pipeline to mitigate this problem [25], they do not investigate the structure or performance of such a cache. For our work, we studied three register caches and how these caches improve performance.
The WCFP keeps a cache of register values that have exited the pipeline in recent result bundles. In this way, the register cache serves as a future buffer from which instructions just entering the pipeline can acquire their source operands. The register cache is a stage that is located at the very beginning of the pipeline after instruction fetch. To maintain register cache consistency, whenever an instruction moves through the cache, the destination registers of the instruction are marked invalid in the cache. This ensures that subsequent instructions will not acquire an old value for a destination register of an instruction that is currently active in the pipeline.
We consider three register cache organizations: a result cache (R), a pass-through cache (PT), and a result and pass-through cache (R+PT). The R cache tracks the most recent value of registers that have exited the result pipeline. There is a valid bit associated with each cache entry that indicates whether a particular register has recently been seen by the cache. The valid bits are set for every register in a result that flows through the cache, and the bits are reset for each destination register in an instruction. An instruction that has source operands that are marked as invalid in the cache requests those operands from the register file (which acts as a history buffer of computation that has exited the pipeline). The PT cache, on the other hand, has a single bit for each register that indicates whether there is an instruction active in the pipeline that writes to that register (the pass-through cache is a scoreboard that tracks which registers will be set by active instructions). Pass-through bits are reset for every register contained in a result that enters the cache. The pass-through bit helps to avoid requesting source operands from the register file when there is an instruction in the pipeline that generates a particular operand. This reduces result pipeline pressure, which may improve performance. The final organization (R+PT) is a combination of the first two: it caches register values and tracks the status of destination registers. Figure 7a shows the effect of each cache organization on performance as loop speedup relative to a baseline pipeline that does not have a register cache (otherwise, the baseline is identical to each benchmark's pipeline).
For these caches, a cache access (a hit or a miss) takes 5 time units. The figure shows that a register cache can improve performance by up to 38%. In a few cases, however, performance is not greatly affected; it is even degraded for fir, modexp, and k12 (depending on the cache type). The performance degradation is due to the register cache causing pipeline stalls (while accessing the cache) early in the pipeline that affect the ability of instruction fetch to continue feeding the pipeline.
The figure also shows that the PT and R+PT organizations are the most effective at improving performance. Indeed, in many cases, the PT cache does surprisingly well. This good performance is especially encouraging since the PT cache has much less state per cache entry than the R+PT cache. The relative difference in performance between the PT and R+PT organizations is an indication of the importance of reducing result pressure. That is, when there is little difference between the PT and R+PT organizations (e.g., for dither, k1, W, dot, and k12), the reduction in result pipeline pressure is more important than ensuring that an instruction acquires its source operands early in the pipeline. When the R+PT cache outperforms the PT cache (e.g., fir, gsm, dct, and k7), the early acquisition of source operands is more important than reducing result pipeline pressure.
Figure 7b shows the hit rates of each register cache organization. The R cache has the lowest hit rate, as expected, ranging from 0.01-0.36, with an average of 0.17. The PT and R+PT caches have much higher hit rates, ranging from 0.8-0.98 (average 0.88) and 0.98-0.99 (average 0.99), respectively. The misses for the PT cache are associated with inactive destination registers and with loop live-in and loop invariant registers, while the misses for the R+PT cache are cold start misses for loop live-in registers. The PT cache always misses for loop invariant registers since they are never updated by an instruction. The high hit rates of the PT and R+PT caches explain why these structures improve performance: far fewer requests are sent to the register file, and fewer results flow in the pipeline, than with the R cache.
For predicated operations, the high cache hit rate implies that the previous value of the destination register is typically acquired from the register cache, rather than from the register file. The use of a register cache is important for the predication scheme outlined in Section 3.4 to avoid unnecessarily increasing result pipeline pressure (due to including the original destination register as a source to a predicated operation), and as Figures 7a and 7b show, including a register cache can significantly improve performance.
Performance Comparison
To investigate how the WCFP compares to other ASIPs, we compared custom WCFPs to ASIPs based on the original CFP and to conventional VLIWs.
CFP vs. WCFP
The WCFP exploits ILP across multiple loop iterations through the use of software pipelining. The original CFP architecture uses speculative execution to achieve the same effect, albeit in a very limited way. The custom CFPs generated by our design techniques rely on hardware loop unrolling to get multiple loop iterations active in the pipeline at the same time. In practice, custom CFPs usually have only two iterations active at the same time, and in fact, frequently the (i+1)th iteration only begins to execute while the ith iteration is executing. Furthermore, data dependences between loop iterations may limit the amount of execution overlap and the degree to which speculative execution can be applied across iterations. Through the use of software pipelining, the WCFP executes operations across many loop iterations at the same time. The WCFP increases instruction width so that multiple independent operations may be scheduled and executed simultaneously. By doing so, the WCFP better exploits ILP than the CFP.
Table 1 compares custom CFPs to custom WCFPs with an instruction width of four. The CFPs have the same register cache (R+PT) and functional repertoire as the custom WCFPs. The WCFP achieves a maximum speedup of 3.9 over the CFP, with an average of 2.5. The higher performance of the WCFP is due to increased instruction width and better utilization of ILP. For fir, k1, k7, gsm, dot, and mexp, the WCFP's performance is 2.5-3.9 times that of the CFP, which is approximately the same as the average increase in instruction width. For these benchmarks, the width of kernel instructions (i.e., instructions in the software pipeline kernel) is 2.7-3.8 operations. k5, k12, and dither have a much smaller speedup. In this case, the speedup is constrained by instruction width; these benchmarks have 1.6, 2.2, and 2.52 operations per instruction. These benchmarks have less available ILP, so the CFP and WCFP have similar performance. Also, as discussed below, the WCFP's performance is sometimes constrained by advancing operations in an instruction in lock-step through the pipeline.
While the WCFP outperforms the CFP, an important question is at what cost. For example, when going from a single-issue CFP to a dual-issue WCFP, is the cost of the WCFP twice that of the CFP? Although the full results are not reported here, this question was addressed by developing a model of pipeline cost and comparing the speedup/cost ratio of CFPs vs. WCFPs [3]. We found that a quad-issue WCFP increases cost over a CFP by an average of 1.87 times, while performance increases an average of 2.5 times. In most cases, the speedup/cost ratio is above 1 for the WCFP, with an average of 1.37. The two exceptions are k5 and dither, which have speedup/cost ratios of approximately 0.5 because performance was not improved, as discussed above. From these results, we conclude that the WCFP is cost effective since performance is usually improved significantly for a small relative increase in cost over the CFP.
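As a quick check of the arithmetic, the speedup/cost ratio can be computed directly from the two averages quoted above. The helper name `speedup_per_cost` in this small Python sketch is an assumption made for illustration. Dividing the average speedup by the average cost gives a figure close to, but not the same as, the reported average ratio of 1.37, presumably because that figure averages the per-benchmark ratios, and a mean of ratios generally differs from a ratio of means.

```python
def speedup_per_cost(speedup: float, relative_cost: float) -> float:
    """Speedup/cost ratio; values above 1 mean speedup outpaces the added cost."""
    return speedup / relative_cost


# Averages quoted in the text for a quad-issue WCFP relative to a CFP.
AVERAGE_SPEEDUP = 2.5
AVERAGE_RELATIVE_COST = 1.87

ratio_of_averages = speedup_per_cost(AVERAGE_SPEEDUP, AVERAGE_RELATIVE_COST)
print(f"{ratio_of_averages:.2f}")  # prints 1.34
```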
WCFP vs. VLIW
To evaluate the WCFP's performance relative to traditional VLIWs, we derived custom WCFPs and VLIWs tailored to the resource requirements of each benchmark. In this section, we summarize the results from Childers and Davidson [7] that compare WCFP and VLIW architectures. The custom WCFPs were generated using the design techniques described in Section 3.1. The custom VLIWs were tailored to the benchmark loops by varying the type of functional devices available so that operations could be scheduled in the software pipelined loop without concern for resource constraints other than instruction width. The only constraint placed on the custom VLIWs was instruction width; during scheduling, there are no conflicts for functional devices.
The custom WCFPs achieve cycles per operation measurements that are competitive with traditional VLIWs with similar resource capabilities. We found that the WCFP has performance within 0-18% of the traditional VLIW (the average is 8.6%), and in most cases, the performance is within 7% of the VLIW. For some cases, the WCFP does better than the VLIW. For example, a custom WCFP for the k5 benchmark has 22% better performance than a custom VLIW. This benchmark has a good balance between the type of operations and the flow of instructions and results, which led to peak performance with minimal pipeline stalls.
The performance of the WCFP is encouraging because WCFP functional elements communicate only with their neighbors, unlike traditional VLIW architectures, which have complicated bus networks, pipeline bypasses, and other global structures to connect functional devices [8]. Also, as Flynn et al. [11] have observed, architectures with local communication will be especially important in the future as feature size continues to scale down into the deep submicron realm. Because a WCFP design system does not have to determine global interconnection of functional devices, a simple and effective design methodology may be used. Furthermore, the simplicity of the architecture and the design system does not come at the cost of performance of the generated pipeline designs.
Related Work
Although the CFP was proposed as an asynchronous organization for general-purpose processors [25], there has also been a proposal for a synchronous version [22]. However, this work adds significant hardware structures to the original design to get good performance on a wide variety of applications. In our work, we customize CFPs to a single application to get high performance without introducing expensive new features such as explicit register renaming. We keep the WCFP extensions simple to retain the advantages of the original CFP.
There has been much interest in automated design of ASIPs because of the increasing importance of high-performance and quick turn-around in embedded systems. ASIP techniques typically address two broad problems: instruction set and microarchitecture synthesis. Instruction set synthesis attempts to discover micro-operations in a program (or set of programs) that can be combined to create instructions [15]. The synthesized instruction set is optimized to meet design goals such as minimum program size and execution latency. Microarchitecture synthesis derives a processor from an application (or set of applications). Some ASIP techniques use a co-processor strategy to synthesize custom logic for a portion of an application and to integrate the custom hardware with an embedded processor core [10,13]. Another approach tailors a single processor to the resource requirements of the target application [9,12]. Many co-design systems unify instruction set and microarchitecture synthesis in a single framework [14].
Summary
This paper describes a novel application of the CFP to automatic ASIP design. In the paper, we extend the original CFP to a wide-issue CFP that is easily customized to the resource and data flow requirements of embedded kernel loops. The paper describes our enhancements to the original counterflow pipeline to better exploit instruction-level parallelism, including wide instructions, predicated execution, result packing, and register caching. We demonstrate that custom WCFPs have performance that is up to 4 times better than custom pipelines based on the original CFP.
