We challenge the widespread assumption that an embedded system's functionality can be captured in a single specification and then partitioned among software and custom hardware processors. The specification of some functions in software is very different from the specification of the same function in hardware, too different to conceive of automatically deriving one from the other. We illustrate this concept using a digital camera example. We introduce the idea of codesign-extended applications to deal with the situation, wherein critical functions are written in multiple versions and integrated such that simple compiler/synthesis flags instantiate a particular version along with the necessary control and communication behavior. By capturing a specification as a codesign-extended application, a designer enables smooth migration among platforms with increasing amounts of on-chip configurable logic.
INTRODUCTION
Hardware/software partitioning has been shown to provide excellent performance as well as power and/or energy improvements compared to software-only implementations in embedded computing systems [4] [7] [30]. Most approaches to hardware/software codesign assume that a designer initially describes the behavior of an embedded system using one (or possibly more than one) executable language.
Languages proposed for such purposes include C [10], C++ and Java [14], as well as Statecharts [12], Esterel [8], SpecC [24] [31], and SystemC [26]. Most hardware/software partitioning approaches assume in particular that the main functions of the system are each described once in the specification. Those partitioning approaches then consider the tradeoffs between compiling each function to software versus synthesizing to a custom hardware processor. The goal of such partitioning is to make best use of existing hardware to improve the performance and/or energy of the system.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CODES'02.
In our investigations of the performance and energy advantages of partitioning for single-chip platforms, we have found that the assumption that each function can be described using one algorithm in a specification, from which software or hardware implementations can be derived, does not apply to many functions. In some cases, the algorithm we would use to implement the function in software is very different from the algorithm we would use for a hardware implementation.
To cope with this situation, we propose the idea of codesign-extended applications. In short, a designer finds the most frequently executed functions, and then writes two versions of those functions, one for software, the other for hardware. An automated partitioning tool then chooses between the versions.
In this paper, we illustrate the different specifications of some functions in software and hardware by using a digital camera example. We then describe the concept of a codesign-extended application and highlight our particular implementation of the concept. We show how using codesign-extended applications can enable smooth migration from all-software implementations to hardware/software implementations using evolving single-chip platforms.
DIFFERENT ALGORITHMS IN SOFTWARE AND HARDWARE
A system's specification typically consists of a set of functions, where each function's granularity is that of perhaps tens or hundreds of lines of sequential program statements, corresponding roughly to an algorithm. Hardware/software partitioning seeks to map each function either to software executing on a microprocessor or to a custom hardware processor. Most approaches seek to keep as many functions as possible in software, due to software's low cost and flexibility, while gaining as much speedup as possible by mapping certain functions to hardware, subject to hardware size constraints.
Some functions give good speedups in hardware, due perhaps to more concurrency, more efficient bit-level processing, and/or less instruction fetch and decode overhead. A good hardware synthesis tool will maximally exploit existing hardware by transforming a function's algorithm to expose parallelism. Such transformations may include loop unrolling, subroutine inlining, subroutine cloning, and even process extraction. Thus, the resulting hardware algorithm may look very different from the original algorithm in the specification. Nevertheless, the algorithms are fundamentally the same, achieved through a straightforward series of transformations.

As a second example, consider a digital camera chip, whose main functions are illustrated in Figure 1. The complete functionality could be described in a single software specification, with the following functions. The CCD pre-processor reads the charge-coupled device and communicates data to the controller. The DCT component performs a discrete cosine transform. The Huffman Encoder performs Huffman encoding. The Controller is the main controller of the system. The Communication component transmits and receives data to and from other devices. To take a picture, the controller would signal the CCD pre-processor to gather data from the CCD, signal the DCT unit to transform the data, signal the encoder to encode the data, and then store the data. At a later time, the controller may upload or download data with other devices, like a personal computer.
We assume the communication method could be RS-232, USB, wireless, or some other method, and that whichever method is used may rely on a CRC (Cyclic Redundancy Check [21]) for error checking.
The CRC performed during communication is a time-consuming function and thus is a good candidate for hardware implementation. If hardware is not available, the CRC can be done in software. However, the standard algorithm for a software CRC is radically different from that for a hardware CRC. A standard software CRC, taken from [21], is shown in Figure 2.

[Figure 2: Standard software CRC in C, from [21]. The surviving fragments show the LOBYTE(x) and HIBYTE(x) macros and the signature unsigned short icrc(unsigned short crc, unsigned char *bufptr, unsigned long len, short jinit, int jrev).]
This code uses the first function (icrc1) to create a table of the CRCs of all 256 characters. It then uses this table to calculate the CRC of an array of characters passed to icrc. This relies heavily on array lookups, a task that is easy in software but not efficient in hardware. For a hardware CRC, bit operations can be executed in parallel. Thus, a hardware CRC consists of numerous bit-wise exclusive-OR operations. Figure 3 illustrates a hardware CRC in VHDL, created automatically by the CRC generator tool in [6] (another tool can be found at [1]). Notice how different the hardware CRC algorithm is from the software CRC algorithm, even though the functions give the same result.
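The contrast between the two CRC styles can be sketched in C. This is a minimal illustration, not the code of Figure 2 or Figure 3: it assumes the CCITT polynomial 0x1021 with a zero initial value, and the names crc16_bitwise and crc16_table_driven are hypothetical. The bit-serial routine performs one shift and conditional XOR per input bit, which is the structure a hardware generator can flatten into parallel XOR equations, while the table-driven routine replaces the inner loop with an array lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC16_POLY 0x1021u /* assumed: CCITT polynomial x^16 + x^12 + x^5 + 1 */

/* Bit-serial update: one shift and conditional XOR per input bit. */
static uint16_t crc16_update_bitwise(uint16_t crc, uint8_t byte)
{
    crc ^= (uint16_t)((uint16_t)byte << 8);
    for (int bit = 0; bit < 8; bit++) {
        if (crc & 0x8000u)
            crc = (uint16_t)((crc << 1) ^ CRC16_POLY);
        else
            crc = (uint16_t)(crc << 1);
    }
    return crc;
}

/* Hardware-style algorithm: nothing but shifts and XORs, no memory. */
uint16_t crc16_bitwise(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++)
        crc = crc16_update_bitwise(crc, buf[i]);
    return crc;
}

/* Software-style algorithm: precompute the CRC of every byte value once. */
static uint16_t crc16_table[256];
static int crc16_table_ready = 0;

static void crc16_build_table(void)
{
    for (unsigned v = 0; v < 256u; v++)
        crc16_table[v] = crc16_update_bitwise(0, (uint8_t)v);
    crc16_table_ready = 1;
}

uint16_t crc16_table_driven(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (!crc16_table_ready)
        crc16_build_table();
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        /* One lookup per byte replaces eight shift/XOR steps. */
        uint8_t index = (uint8_t)((crc >> 8) ^ buf[i]);
        crc = (uint16_t)((crc << 8) ^ crc16_table[index]);
    }
    return crc;
}
```

Both routines return the same value for any input; they differ only in how the work is organized, which is the point of the comparison above.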
As another example of a function with different software and hardware algorithms, consider the DCT function, which is one of the most time-consuming functions during picture taking. The DCT is thus a good candidate for acceleration using hardware. It is also a popular hardware unit, and there are publicly available cores [20]. A major part of the DCT is the matrix multiplication of an input matrix by a constant matrix. A simple implementation of a DCT in C is shown in Figure 4. Some functions have been left out in the interest of brevity. When implementing a DCT in hardware, a key change made to the algorithm is to utilize an algorithm based on fixed-point rather than floating-point numbers. Thus, some precision is typically traded off for hardware efficiency. In addition, more than one process could be used to control dataflow. We omit a hardware description of the DCT for conciseness.
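The fixed-point trade-off can be sketched in C for one output of an 8-point, unnormalized 1-D DCT. This is an illustration under stated assumptions rather than the code of Figure 4: the coefficients are cos(n*pi/16) scaled by 2^15 (Q15 format), and the names COS_Q15, cos_q15, and dct8_coeff_fixed are hypothetical. All arithmetic is integer multiply-accumulate, and the final division by 2^15 discards the fractional part, which is where the quantization error enters.

```c
#include <assert.h>
#include <stdint.h>

/* cos(n*pi/16) scaled by 2^15 for n = 0..8 (Q15 fixed point). */
static const int32_t COS_Q15[9] = {
    32768, 32138, 30273, 27245, 23170, 18204, 12539, 6392, 0
};

/* Fold any angle index onto the quarter-wave table using cosine symmetry. */
static int32_t cos_q15(unsigned n)
{
    n %= 32u;
    if (n > 16u)
        n = 32u - n;                        /* cos(2*pi - a) =  cos(a) */
    return (n > 8u) ? -COS_Q15[16u - n]     /* cos(pi - a)   = -cos(a) */
                    : COS_Q15[n];
}

/* Output u of an unnormalized 8-point DCT:
 *   sum over x of samples[x] * cos((2x + 1) * u * pi / 16)
 * computed with integer arithmetic only. */
int32_t dct8_coeff_fixed(const int16_t samples[8], unsigned u)
{
    assert(u < 8u);
    int64_t acc = 0;                        /* wide accumulator avoids overflow */
    for (unsigned x = 0; x < 8u; x++)
        acc += (int64_t)cos_q15((2u * x + 1u) * u) * samples[x];
    return (int32_t)(acc / 32768);          /* drop the Q15 scale, truncating */
}
```

For a single sample of 1000 at x = 0, output u = 1 evaluates to 980, whereas the exact value 1000*cos(pi/16) is about 980.79; the lost fraction is the kind of precision a hardware implementation gives up.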
Again, the two functions are quite different from each other in appearance and execution. This time, the results will be different, as the VHDL code will introduce quantization noise due to the conversion to fixed-point arithmetic.

To quantitatively observe the difference between software and hardware algorithms, we examined the CRC further. The results of translating the CRC from the software description to a hardware description are found in Figure 5. Only the main process is shown to give a general idea of how it could be done.
This segment of code illustrates how the body of the loop from icrc is translated into a hardware process that reads the arrays (which would have to be initialized in advance) from memory and outputs the resulting CRC. This preserves the sequential execution from the software algorithm. The VHDL version of the software algorithm uses an external memory that must be loaded with the arrays that are stored in icrctb and rchr in the C version. If these were implemented directly in the hardware, they would represent 768 bytes of memory that would take even more area. We assume a best-case scenario, where memory can be read in one cycle.

[Figure 4 residue: rows of the constant coefficient matrix from the C DCT, with entries of magnitude 32768, 32138, 30273, 27245, 23170, 18204, 12539, and 6392; the rows are incomplete in this copy.]
CODESIGN-EXTENDED APPLICATIONS
We define a codesign-extended application as a software description of an application, extended with additional versions of key functions using hardware algorithms, and using macros (or some other means) to enable existing compiler and synthesis tools to automatically generate a complete working implementation of the system using any of the function versions.
In order to facilitate a codesign-extended application, we need a way to implement certain functions in software or in hardware.
To do this, we have developed a standard method that allows different partitioning schemes to be compiled/synthesized by only choosing different compiler flags and synthesis flags. This gives the designer an efficient way to test his or her codesign-extended application with several different configurations, and then on chips that have varying amounts of programmable logic available. This method consists of putting macros around sections of code that can be implemented in hardware. After these macros are in place, a section of code is added that implements a handshaking routine to signal the hardware section to run. The complements of the macros that were placed around the original code are placed around the handshaking routine. For example, a designer could place #ifdefs around the original code, and place #ifndefs around the handshaking routine. The handshaking routine could be as simple as setting a bit and then waiting on a different bit to be set by the hardware, or it could set a bit and put the microprocessor into a sleep or idle state and wait for an interrupt to be asserted by the hardware. Figure 6 illustrates this using the CRC as an example.
The VHDL that would be written would have to be aware of the addresses of the external data variables "crc_start_flag" and "crc_done_flag." It would then use the start flag as an enable and notify the processor of its completion using the done flag.
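The macro-and-handshake pattern can be sketched as follows. This is a minimal illustration, not the code of Figure 6: the macro name HW_CRC, the result variable crc_hw_result, and the function compute_crc are hypothetical, the software branch assumes a CCITT-style CRC-16, and plain globals stand in for flags that would sit at fixed external-data addresses on the real platform. The macro here means "hardware version present", so the software code sits under #ifndef and the handshake under its complement; the opposite polarity works equally well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flags shared with the hardware block (assumed layout, for illustration). */
volatile uint8_t  crc_start_flag = 0;
volatile uint8_t  crc_done_flag  = 0;
volatile uint16_t crc_hw_result  = 0;   /* hypothetical result register */

uint16_t compute_crc(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
#ifndef HW_CRC
    /* Software version: compiled when HW_CRC is not defined. */
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
#else
    /* Handshake version: compiled when HW_CRC is defined. The hardware is
     * assumed to fetch the buffer itself and to set crc_done_flag. */
    (void)buf;
    (void)len;
    crc_done_flag = 0;
    crc_start_flag = 1;            /* enable the hardware CRC */
    while (!crc_done_flag) {
        /* busy-wait; a sleep state plus interrupt is the alternative */
    }
    crc_start_flag = 0;
    return crc_hw_result;
#endif
}
```

Built without HW_CRC, the function runs entirely in software; building with the macro defined (for example through a compiler define flag) swaps in the handshake without touching any caller. On a workstation with no hardware behind the flags, the handshake branch would spin forever, so it is only meaningful on the target platform.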
Using this methodology, we have implemented a simple codesign-extended application on two different platforms. The first platform was a two-chip solution consisting of a standard 8051 and a Xilinx FPGA. The second implementation was a Triscend system-on-a-chip, the TE520. This chip contains a "turbo" 8051 and an array of configurable system logic.
Our codesign-extended application was a signal processing example containing three functions that were described for both software and hardware. The VHDL files were written separately, but with a simple tool, we were able to merge the files and associate the different functions with different enable signals. This way, we could compile the C program with a certain set of compile-time macros set, and then choose and synthesize the VHDL file that complemented the functions that were left out of the software. For example, to compile the program with the matrix multiply function in the configurable logic, the command used would be: c51 three.c -df (hw_matrix_multiply).
Depending on the design and the tools used, the synthesis and place and route could take anywhere from a few minutes to several hours. Some of the combinations could not be placed and routed onto the TE520 chip. This is to be expected, and gives motivation to experiment with higher-capacity chips to determine how much performance could be increased for a given increase in price. The metric we used to rate the different implementations was energy consumption. The energy savings are shown in Table 3. Energy was determined by using a digital multimeter that communicated with the workstation and having the workstation time the execution of the given programs. Therefore, the workstation could calculate the energy, knowing the voltage, current, and time.
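The energy calculation amounts to multiplying voltage, current, and time. A minimal sketch in C follows, under assumptions the text does not state: the supply voltage is taken as constant, and the multimeter is taken to report current samples at a fixed interval. The function name energy_joules and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Energy in joules from a constant supply voltage and n current samples
 * (in amperes) taken every dt_seconds: E = V * sum(I_k * dt). */
double energy_joules(double volts, const double *amps, size_t n,
                     double dt_seconds)
{
    assert(amps != NULL || n == 0);
    double charge = 0.0;                /* accumulated coulombs */
    for (size_t k = 0; k < n; k++)
        charge += amps[k] * dt_seconds;
    return volts * charge;
}
```

With a single averaged current reading, this reduces to volts * amps * elapsed time, which matches the description above.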
CODESIGN-EXTENDED APPLICATION METHODOLOGY
A design methodology incorporating codesign-extended applications requires more designer effort up front, but that effort may pay off in the long run. A designer can start by writing an all-software version of the application, something typically done today anyway. The designer can then determine the most critical functions of the application, either through his or her own estimation (in many cases, the critical functions are well known), or through profiling of the application with expected input vectors. For each critical function, the designer can determine if a unique hardware algorithm is necessary, in which case the designer can write hardware-suited code for that algorithm.
Although earlier we mentioned that the hardware-suited code might be captured in a hardware description language, like VHDL or Verilog, this is not absolutely necessary. If the language used for the all-software version can be compiled to either software or hardware, then the hardware-suited code can simply be written in the same language. For example, if the designer is using a hardware/software capable environment based on C, such as Proceler's environment [22] or SystemC [26], then the hardware-suited algorithm can still be written in C.
Notice that the codesign-extended application idea can be extended to support more than two versions of the same function. Likewise, if two functions were both to be implemented in hardware, the idea can be extended to allow a special combined version of those two functions that might perform better than the two separate hardware versions. Of course, extensions like these can quickly cause codesign-extended applications to become unwieldy, so they must be used sparingly.
CONCLUSIONS
The basic assumption that hardware and software can be derived from the same specification does not apply in some cases, because the hardware algorithm for a function can be very different from the software algorithm for the same function. We propose the concept of codesign-extended applications to deal with this situation. A designer determines the most critical functions and implements two versions of them, one for software and one for hardware. The designer uses a standard modeling approach to enable a compiler and synthesis tool to automatically include or exclude versions, thus generating unique hardware/software partitions without requiring code rewriting or a sophisticated partitioner that can parse the software and hardware languages. Codesign-extended applications enable graceful evolution of an application onto evolving platforms that include faster processors and/or additional on-chip configurable logic. In the future, we plan to generalize the concept of codesign-extended application to include more than two versions of a function, with multiple software and hardware versions that trade off performance, size, and power.
ACKNOWLEDGMENTS

