Abstract: Modern DSP and graphical applications require fast and efficient implementations to meet the high demands of modern systems. The CORDIC process is a powerful candidate for these systems due to its ability to perform many basic functions using the same hardware. Recent research has led to the development of a daunting number of implementation options, each with its own application restrictions and operating characteristics. This paper introduces the development of a behavioral synthesis tool designed to automate and simplify the design and use of CORDIC processors in general applications. At present, the tool produces 576 unique versions of a CORDIC processing element in the Verilog HDL. These processing elements, or modules, can then be instantiated for use in embedded systems or System-on-a-Chip applications.
I. INTRODUCTION
The Coordinate Rotation Digital Computer (CORDIC) algorithm has been implemented in a wide variety of applications, ranging from general-purpose arithmetic logic units (ALUs) to application-specific digital signal processing (DSP) filters and matrix equation solvers. The strength of the CORDIC algorithm is that it can evaluate a wide range of elementary functions, including trigonometric functions, natural logarithms and natural exponentials, using a single set of generic equations and a small set of basic operations such as shifting and adding.
To achieve design reuse, modules for inclusion in embedded systems or System-on-a-Chip designs need to be developed to satisfy a wide range of possible applications. The overall functionality of a module remains unchanged, but applications may have different requirements, such as operating speed, data path width, and input and output signals, which must be accounted for. Using a module that does not completely meet the design goals results in reduced performance and/or wasted power and die area. A well-designed module, therefore, allows various implementation details to be specified and is scalable to the application requirements. One method of achieving this is an emerging computer-aided design method, termed 'behavioral synthesis', in which design implementations reflect specific parameters and constraints applied to a common design description.
In this paper the development of a behavioral synthesis tool is presented. Using the behavioral synthesis approach, this tool produces a wide range of CORDIC processing elements (PEs) suitable for use in a large number of embedded systems and SoC applications.
II. BACKGROUND
The basic CORDIC algorithm was first described by Henry Briggs in 1624 in Arithmetica Logarithmica, as stated in [1], although the acronym comes from the modern presentation of a COordinate Rotation DIgital Computer (CORDIC) by Jack Volder in 1959 [2]. Volder developed the algorithm for use in a real-time digital computer for the solution of the trigonometric relationships of navigation equations and coordinate transformations. His computer could be controlled to solve either of the following sets of equations:
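$$x' = K\,(x\cos\theta - y\sin\theta), \qquad y' = K\,(y\cos\theta + x\sin\theta) \tag{1}$$

$$R = K\sqrt{x^2 + y^2}, \qquad \theta = \tan^{-1}(y/x) \tag{2}$$

where $K$ is a constant.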
The first set of equations is used for the rotation mode of operation, where the coordinate components of a vector ($x$ and $y$) and a desired angle of rotation ($\theta$) are provided and the components of the rotated vector are computed. The second set of equations is used for the vectoring mode of operation, where the coordinate components of a vector are provided and the magnitude and angular argument of the vector are computed.
The CORDIC algorithm is performed by decomposing a rotation into a sequence of $n$ so-called un-normalized micro-rotations over the base angles $\alpha_i$, with $i \in \{0, \ldots, n-1\}$ and $0 < \alpha_i < \pi/2$. These base angles are chosen in such a way that the micro-rotations are easily implemented in hardware.
The general recursion for the micro-rotations is given by:

$$x_{i+1} = x_i - s_i\, y_i \tan\alpha_i, \qquad y_{i+1} = y_i + s_i\, x_i \tan\alpha_i \tag{3}$$
where $s_i$ is the direction of rotation, which is 1 for counterclockwise and -1 for clockwise rotation. The inputs and outputs of the recursion are:

$$x_0 = x_{in}, \quad y_0 = y_{in} \qquad \text{and} \qquad x_{out} = x_n, \quad y_{out} = y_n \tag{4}$$

The use of un-normalized micro-rotations in Eq. 3 causes an increase in vector length, or scaling, by a factor $(\cos\alpha_i)^{-1}$ with every iteration. To normalize the final answer, $x_n$ and $y_n$ must be divided by $K$, where $K$ is given by:

$$K = \prod_{i=0}^{n-1} (\cos\alpha_i)^{-1} = \prod_{i=0}^{n-1} \sqrt{1 + \tan^2\alpha_i} \tag{5}$$

In Volder's original work, the division by $K$ required to compensate for the scaling factor was not performed. It is for this reason that the factor $K$ is found in Eq. 1 and Eq. 2.
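To make the scaling factor concrete, the short Python sketch below evaluates $K$ for a given shift sequence. It assumes the traditional single-term choice $\tan\alpha_i = 2^{-S(i)}$ with $S(i) = i$ and $n = 16$; the function name `scaling_factor` is illustrative only.

```python
import math

def scaling_factor(shifts):
    """K = product of 1/cos(alpha_i) for tan(alpha_i) = 2**-S(i)."""
    k = 1.0
    for s in shifts:
        t = 2.0 ** -s                  # tan(alpha_i) for one shifted term
        k *= math.sqrt(1.0 + t * t)    # 1/cos(alpha_i) = sqrt(1 + tan^2(alpha_i))
    return k

K = scaling_factor(range(16))          # assumed: S(i) = i, n = 16
print(f"K = {K:.6f}")                  # prints K = 1.646760
```

As $n$ grows, $K$ approaches roughly 1.6468, which is not a power of 2, so compensating for it directly would require a true multiplication by $1/K$.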
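In vectoring mode, the vector is forced to rotate towards the positive x-axis such that $y_n \approx 0$. The value of $s_i$ is determined, in every step of the recursion, from the sign of the intermediate result $y_i$:

$$s_i = -\operatorname{sign}(y_i) \tag{6}$$

$$x_n = K\sqrt{x_0^2 + y_0^2}, \qquad y_n \approx 0 \tag{7}$$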
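A floating-point sketch of the vectoring-mode recursion follows. It assumes $S(i) = i$, an input with $x_0 > 0$, and an added angle accumulator so that both the magnitude and the angle of Eq. 2 are returned. The function name and the default of 24 iterations are hypothetical choices; $K$ is divided out at the end only to make the result easy to check.

```python
import math

def cordic_vectoring(x, y, n=24):
    """Vectoring mode for x > 0; returns the magnitude and angle of (x, y)."""
    z = 0.0                                  # angle accumulator (assumed addition)
    for i in range(n):
        t = 2.0 ** -i                        # tan(alpha_i), assumed S(i) = i
        s = -1 if y >= 0 else 1              # s_i = -sign(y_i): drive y_i toward 0
        x, y = x - s * y * t, y + s * x * t  # un-normalized micro-rotation (Eq. 3)
        z -= s * math.atan(t)                # record the angle removed from the vector
    k = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n))  # scaling factor K
    return x / k, z                          # x_n equals K times the magnitude

mag, ang = cordic_vectoring(3.0, 4.0)
print(mag, ang)
```

Because $y_n$ is driven toward zero, the whole vector length ends up in $x_n$, which is why only the x output needs the $1/K$ correction here.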
In rotation mode, the vector is rotated based on the input angle $\theta$ and the intermediate values of a z-recursion:

$$z_{i+1} = z_i - s_i\,\alpha_i \tag{8}$$
The input and output of the recursion are:

$$z_0 = z_{in} = \theta \qquad \text{and} \qquad z_{out} = z_n \tag{9}$$
The direction of rotation $s_i$ is determined, in every step of the recursion, such that the sign of $s_i$ matches the sign of the intermediate result $z_i$. In this way the vector is rotated such that $z_n \approx 0$.
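The rotation-mode recursion maps naturally onto shifts and adds. The following Python sketch works on integers to mimic a fixed-point datapath. It assumes 16 fractional bits, $n = 16$, the traditional shift sequence $S(i) = i$, and a quantized table of $\alpha_i = \tan^{-1} 2^{-i}$. For simplicity it pre-scales $x_0$ by $1/K$, which is not the double-shift compensation used in this work.

```python
import math

FRAC = 16                                   # assumed: 16 fractional bits
N = 16                                      # assumed: n = 16 micro-rotations, S(i) = i
ONE = 1 << FRAC
ATAN = [round(math.atan(2.0 ** -i) * ONE) for i in range(N)]   # alpha_i, fixed point
K = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N))     # scaling factor

def cordic_rotate(x, y, z):
    """Rotation mode on integers using only shifts, adds and sign tests."""
    for i in range(N):
        s = 1 if z >= 0 else -1             # s_i takes the sign of z_i
        x_sh, y_sh = x >> i, y >> i         # x * 2^-i and y * 2^-i as arithmetic shifts
        x, y = x - s * y_sh, y + s * x_sh   # micro-rotation of Eq. 3
        z -= s * ATAN[i]                    # z_{i+1} = z_i - s_i * alpha_i
    return x, y, z

# cos/sin of 0.5 rad: pre-scaling x_0 by 1/K cancels the gain of the iterations
x, y, z = cordic_rotate(round(ONE / K), 0, round(0.5 * ONE))
print(x / ONE, y / ONE)
```

Python's `>>` on negative integers rounds toward minus infinity, which matches an arithmetic right shift on a two's complement datapath.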
In the traditional CORDIC algorithm the angles $\alpha_i$ are chosen such that:

$$\tan\alpha_i = 2^{-S(i)} \tag{10}$$
where $S(i)$ is termed the shift sequence. This selection simplifies the calculation of Eq. 3, since the second operand can be formed by a simple shift operation. In the work presented in this paper, a double-shift extension to the CORDIC algorithm, originally proposed by Deprettere et al. [3], was implemented. An extra degree of angular freedom is obtained by allowing the angles to be represented by two shifted quantities according to:

$$\tan\alpha_i = 2^{-S(i)} + \eta_i\, 2^{-S'(i)} \tag{11}$$
where $\eta_i \in \{1, 0, -1\}$ and the two shift sequences are $S(i)$ and $S'(i)$. The purpose of this extension is to allow an angle selection that produces a $K$ that is a power of 2. This enables the scaling factor to be compensated with a shift operation instead of an expensive multiplication. Other methods exist for compensating for the scaling factor; however, this method was found to introduce the least latency and additional hardware. In general, only the first few angles use two shifted terms, as they have the greatest effect on the scaling factor. The remaining angles use only one shifted term, as in the traditional CORDIC.
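The following hypothetical search illustrates the idea behind the double-shift extension. It tries second terms $\eta_i 2^{-S'(i)}$ on the first three angles and keeps the combination whose $K$ is closest to a power of 2, while rejecting angle sets that fail the standard CORDIC convergence condition $\alpha_i - \sum_{j=i+1}^{n-1}\alpha_j \le \alpha_{n-1}$. The limits ($n = 16$, three double-shift angles, $S(i) < S'(i) \le S(i) + 8$) are assumptions, and the result is not the angle set used by Deprettere et al. or by this tool.

```python
import itertools
import math

N = 16          # assumed: n = 16 micro-rotations with S(i) = i
N_DOUBLE = 3    # assumed: only the first 3 angles may use a second shifted term

def tangents(choice):
    """tan(alpha_i) = 2^-S(i) + eta_i * 2^-S'(i); choice[i] is (eta_i, S'(i)) or None."""
    t = [2.0 ** -i for i in range(N)]
    for i, c in enumerate(choice):
        if c is not None:
            eta, s2 = c
            t[i] += eta * 2.0 ** -s2
    return t

def converges(t):
    """Standard CORDIC convergence test on the base angles alpha_i = atan(t_i)."""
    a = [math.atan(v) for v in t]
    if sum(a) < math.pi / 2:                 # must still reach rotations up to pi/2
        return False
    return all(a[i] - sum(a[i + 1:]) <= a[-1] for i in range(N))

def log2_error(t):
    """Distance of log2(K) from the nearest integer; 0 means K is a power of 2."""
    k = math.prod(math.sqrt(1.0 + v * v) for v in t)
    return abs(math.log2(k) - round(math.log2(k)))

def search():
    """Try no second term, or eta_i in {1, -1} with S(i) < S'(i) <= S(i) + 8."""
    options = [[None] + [(eta, s2) for eta in (1, -1) for s2 in range(i + 1, i + 9)]
               for i in range(N_DOUBLE)]
    best = None
    for choice in itertools.product(*options):
        t = tangents(choice)
        if converges(t):
            err = log2_error(t)
            if best is None or err < best[0]:
                best = (err, choice)
    return best

err, choice = search()
print(f"|log2 K - round(log2 K)| = {err:.2e} with second terms {choice}")
```

Once $K = 2^m$, compensating for the scaling factor reduces to an $m$-bit shift or, in a floating-point format, to subtracting $m$ from the exponent.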
III. BEHAVIORAL SYNTHESIS TOOL
Behavioral synthesis allows a design to be described at the algorithmic level of abstraction, without implementation details. This permits a greater degree of freedom and speeds up the exploration of the design space. Multiple implementations of the design can be generated by setting different constraints for the behavioral synthesis process, as opposed to writing new implementation descriptions as would be required with traditional RTL synthesis. However, the present generation of commercial behavioral synthesis tools is not suited to the design of well-developed algorithms that have no need for resource optimization [4]. The implementations obtained from behavioral synthesis for the rotation of vectors were not ideal and did not exhibit the optimizations originally developed by Volder. To gain the advantages of behavioral synthesis for the CORDIC process, it was decided to develop the CORDIC Behavioral Synthesis Tool.
The tool was written in C++ using object-oriented techniques and uses a graphical user interface for input. A designer selects the attributes of the desired version of the CORDIC PE, and the tool then generates the Verilog RTL design. To be as versatile as possible, the tool does not use proprietary technology library formats and is not linked to a specific technology or minimum feature size. In traditional behavioral synthesis tools, designers set constraints such as timing and area, and the information in the technology library is then used in conjunction with these constraints to produce an implementation-level design. Because the CORDIC tool is separated from any library, designers must instead provide it with general information about the desired implementation architecture. To assist in making these decisions, characterization data is provided for a commercial 0.35 µm library. The characterization data enables a designer to quickly target the proper area of the design space, and it can be generated by a designer for any other technology or library that is required.

IV. TOOL OPERATION
The operation of the CORDIC Behavioral Synthesis Tool is shown in Fig. 1. The first task performed is input parameter verification. If a user has failed to make a choice in a field, an error message is displayed and the code generation process is halted. If all inputs have been selected, a validation check is performed to ensure that they form valid combinations. In case of error, the process is halted until valid selections have been made. The next step is setting the variables that control the code generation process. These are, for the most part, directly coupled with the input selections. An exception is the requirement for a z datapath, which depends on the mode of operation and the choices made for the angle input and angle output formats.
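The two checks and the derivation of the z-datapath control variable can be sketched as follows. Everything here is hypothetical: the field names, the option strings, and in particular the reading of the two angle representations as either an angle value or the sequence of rotation directions $s_i$. The actual tool is written in C++ with a graphical front end.

```python
from dataclasses import dataclass
from typing import Optional

MODES = {"rotation", "vectoring"}
ANGLE_FORMATS = {"value", "directions", "none"}   # hypothetical angle representations
PIPELINES = {"parallel", "serial_parallel"}

@dataclass
class Selections:                                 # hypothetical GUI input fields
    mode: Optional[str] = None
    angle_in: Optional[str] = None
    angle_out: Optional[str] = None
    pipeline: Optional[str] = None

def validate(sel):
    """Check 1: every field was chosen. Check 2: the combination makes sense."""
    missing = [name for name, value in vars(sel).items() if value is None]
    if missing:
        raise ValueError("no selection made for: " + ", ".join(missing))
    if (sel.mode not in MODES or sel.pipeline not in PIPELINES
            or {sel.angle_in, sel.angle_out} - ANGLE_FORMATS):
        raise ValueError("unknown option")
    if sel.mode == "rotation" and sel.angle_in == "none":
        raise ValueError("rotation mode needs an angle input")
    if sel.mode == "vectoring" and sel.angle_in != "none":
        raise ValueError("vectoring mode takes no angle input")

def control_variables(sel):
    """Derive code-generation switches; most map directly onto one input."""
    validate(sel)
    # The exception: the z datapath depends on the mode and both angle formats
    z_datapath = ((sel.mode == "rotation" and sel.angle_in == "value")
                  or (sel.mode == "vectoring" and sel.angle_out == "value"))
    return {"z_datapath": z_datapath,
            "serial": sel.pipeline == "serial_parallel"}

print(control_variables(Selections("rotation", "value", "none", "parallel")))
```

Under this assumed reading, a PE that receives or emits the rotation directions directly never needs to compute $z_i$, which is what allows the z hardware to be dropped.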
With the control variables set, generation of the desired RTL code can begin. Whether specific lines and sections of the HDL code are required is determined from the control variables in a distributed fashion.
Throughout all stages of code generation, conditional statements are executed which ensure that the CORDIC PE is properly generated and its parts correctly interconnected. The first module to be written is the highest-level definition of the architecture, as shown in Fig. 2. The required preprocessor, pipeline stages, and postprocessor are instantiated along with the required pipeline registers and control signals. The modules for the double- and single-shift pipeline stages, as well as the pre- and post-processor, are written next. The last section of the code produced is generic, having no dependency on the input parameters. It contains generic parts and the simulation code required by library components. When code generation is finished, a message is displayed to indicate successful completion.
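As a rough illustration of conditional code generation, this Python sketch emits a top-level Verilog module that chains a preprocessor, the double- and single-shift stages, and a postprocessor, adding the z signals only when the control variable asks for them. The module names (`cordic_pre`, `cordic_stage_double`, `cordic_stage_single`, `cordic_post`), the port names and the `STAGE` parameter are invented for the example; pipeline registers, exponents and control signals are omitted.

```python
def top_level(name, n_double, n_single, width, z_datapath):
    """Emit a hypothetical top-level Verilog module chaining the PE parts."""
    w = f"[{width - 1}:0]"
    sigs = ["x", "y"] + (["z"] if z_datapath else [])   # z only when required
    stages = (["cordic_stage_double"] * n_double
              + ["cordic_stage_single"] * n_single)
    n = len(stages)

    def link(src, dst):
        # Port connections for one part; src/dst are templates filled in per signal
        ins = [f".{s}i({src.format(s=s)})" for s in sigs]
        outs = [f".{s}o({dst.format(s=s)})" for s in sigs]
        return ", ".join(ins + outs)

    ports = (["input clk"] + [f"input {w} {s}in" for s in sigs]
             + [f"output {w} {s}out" for s in sigs])
    lines = [f"module {name} ({', '.join(ports)});",
             f"  wire {w} " + ", ".join(f"{s}_w [0:{n}]" for s in sigs) + ";"]
    pre = link("{s}in", "{s}_w[0]")
    lines.append(f"  cordic_pre pre (.clk(clk), {pre});")
    for i, mod in enumerate(stages):
        conn = link(f"{{s}}_w[{i}]", f"{{s}}_w[{i + 1}]")
        lines.append(f"  {mod} #(.STAGE({i})) s{i} (.clk(clk), {conn});")
    post = link(f"{{s}}_w[{n}]", "{s}out")
    lines.append(f"  cordic_post post (.clk(clk), {post});")
    lines.append("endmodule")
    return "\n".join(lines)

print(top_level("cordic_pe", 2, 3, 16, z_datapath=True))
```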
V. CORDIC PE ARCHITECTURE
The generic architecture of the CORDIC PE developed for the synthesis tool is shown in Fig. 2. The input/output signals for the two angle representations are marked with an asterisk (*), as they are not present in all implementations. The Preprocessor aligns the floating-point input data and selects a working exponent for the iteration computations. The iterations are performed in fixed-point (or block floating-point) format to reduce the overhead associated with floating-point operations. The data then enters the iteration pipeline, where the CORDIC algorithm is performed. Finally, a Postprocessor compensates for the scaling factor $K$ and returns the data to floating-point format. The Postprocessor can also optionally normalize the results.

Two versions of the Preprocessor are available, one with activity scaling and one without. In cases where the exponent difference between $x_{in}$ and $y_{in}$ exceeds the bit width of the mantissa, no shifting of the mantissa is required, as one operand is insignificant compared with the other. In these cases the shift fabric can be disabled to save power, since the lesser mantissa can simply be forced to zeros. However, the additional control logic and hardware required for activity scaling consume power themselves and reduce the overall savings. Where activity scaling is unlikely to occur, based on the input data characteristics, this additional power exceeds the savings, and the non-activity-scaled version of the Preprocessor is more appropriate.
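The alignment step and its activity-scaled shortcut can be sketched on integer mantissas as follows. The representation (a signed `width`-bit mantissa `m` with value m·2^e), the function name `align`, and the returned flag reporting whether the shifter was used are assumptions made for illustration.

```python
def align(mx, ex, my, ey, width):
    """Align mx*2**ex and my*2**ey (signed `width`-bit mantissas) to one exponent.

    Returns (ax, ay, e_work, shifter_used).
    """
    e_work = max(ex, ey)                    # working exponent: the larger of the two

    def shift(m, e):
        d = e_work - e
        if d == 0:
            return m, False                 # already aligned
        if d >= width:                      # activity scaling: operand is insignificant,
            return 0, False                 # force zeros and leave the shifter idle
        return m >> d, True                 # arithmetic right shift by the difference

    ax, used_x = shift(mx, ex)
    ay, used_y = shift(my, ey)
    return ax, ay, e_work, used_x or used_y

print(align(0x4000, 20, 0x7FFF, 0, 16))
```

A non-activity-scaled Preprocessor would always take the shift path; the comparison against `width` stands for the extra control logic whose own power cost has to be weighed against the savings.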
The micro-rotation pipeline stages have been designed to use either a full parallel or a serial/parallel pipeline strategy. For full parallel pipelines, each stage performs either one micro-rotation using two shifted terms, as in Eq. 11, or two micro-rotations using one shifted term, as in Eq. 10. Mapping two iterations per pipeline stage for the one-shifted-term micro-rotations balances the computation times of the two types of pipeline stages. Minimizing the number of iterations performed per pipeline stage allows the stages to be easily mapped to the varying number of iterations, $n$, required for a wide range of implementations. For the serial/parallel pipeline strategy, feedback registers and control logic were added to each micro-rotation stage so that multiple successive iterations can be performed in a single stage before the intermediate results are sent to the next pipeline stage. In this form, the stages perform $p$ and $2p$ iterations per stage for the two- and one-shifted-term iterations, respectively, where $p$ is the degree of serialization. The z-recursion hardware is not needed in all implementations of the PE, and it is eliminated whenever possible to realize significant area and power savings.
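The mapping of iterations to stages described above reduces to a small calculation. The helper below is a hypothetical sketch: it assumes that a final stage of either kind may be only partly used when the iteration count does not divide evenly, and that $p = 1$ represents the full parallel pipeline. The iteration counts in the example are illustrative.

```python
import math

def pipeline_stages(n_double, n_single, p=1):
    """Number of micro-rotation stages; p = 1 models the full parallel pipeline.

    A double-shift stage performs p iterations and a single-shift stage 2*p;
    the last stage of each kind may be only partly used.
    """
    if p < 1:
        raise ValueError("degree of serialization p must be at least 1")
    return math.ceil(n_double / p) + math.ceil(n_single / (2 * p))

print(pipeline_stages(3, 13), pipeline_stages(3, 13, p=2))   # prints 10 6
```

Serialization trades stage count, and hence area, against throughput, since each serialized stage iterates several times on one sample before passing it on.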
Two versions of the Postprocessor have been developed, one which normalizes the floating-point results and another which does not. Normalization is not required in all applications, such as a systolic array of CORDIC PEs, where the normalized output data would then be the input to another module that performs an alignment operation. Both operations consume power but counteract each other. Performing the normalization or not has no effect on the absolute accuracy of the results, so the choice is a question of limiting power consumption.

Various PEs produced by our synthesis tool were characterized with a 0.35 µm standard cell library from TSMC using the Synopsys suite of design tools. Versions of the PE with and without a z datapath, with various input/output data widths, and using both pipeline strategies were compiled for: 1) maximum speed, 2) minimum area, and 3) a compromise between area and speed. The characterization data measured area, throughput ($10^6$ samples/s), latency, energy, energy-area product and energy-delay product. Examples of the characterization data for the three compiles are presented in Table 1 and Table 2 for area and throughput, respectively. It was found that the tool produced PEs covering a large design space and offering suitable trade-offs to help the designer meet specific design constraints.
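Because the double-shift angle selection makes $K$ a power of 2, the scaling compensation in the Postprocessor can be an exponent adjustment rather than a multiplication. The following hypothetical sketch shows that step together with the optional normalization of the two Postprocessor versions described earlier in this section. It assumes a signed `width`-bit mantissa `m` with value m·2^e, and `k_log2` stands for $\log_2 K$.

```python
def postprocess(m, e, k_log2, width, normalize=True):
    """Return (mantissa, exponent) for (m * 2**e) / K with K = 2**k_log2.

    m is a signed `width`-bit mantissa. Dividing by a power-of-2 K only
    lowers the exponent; optional normalization then shifts a nonzero
    mantissa left until its magnitude reaches the top value bit.
    """
    e -= k_log2                              # scaling compensation: no multiplier
    if normalize and m != 0:
        top = 1 << (width - 2)               # smallest normalized magnitude
        while abs(m) < top:
            m <<= 1                          # shift the mantissa left ...
            e -= 1                           # ... and compensate in the exponent
    return m, e

print(postprocess(0x100, 5, 1, 16), postprocess(0x100, 5, 1, 16, normalize=False))
```

Skipping normalization leaves only the exponent adjustment, which suits the systolic-array case above, where the next PE re-aligns its inputs anyway.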
VII. SUMMARY
In this paper, we have outlined the development of a CORDIC Behavioral Synthesis Tool for the generation of CORDIC modules suitable for use in embedded systems or System-on-a-Chip applications. The processing elements generated from the tool meet the requirements of a large number of graphical and DSP applications.
