Abstract-As a result of intense competition in the system-on-chip industry, current system-level design flows focus on automated tool construction and verification, behavioral modeling and code generation. This paper presents a design methodology for generating synthesizable HDL for digital signal processing architectures. All functional units in contemporary DSP architectures are composed of fundamental building blocks such as adders, multipliers and MACs. Given a processor described in BPDL, our system composes optimized code for its functional units, storage elements and control unit, which can then be passed to standard synthesis tools. The effectiveness of our system has been demonstrated by generating correct HDL for several complex architectures. With the proposed system, processor design time was drastically reduced compared to manual RT-level design in HDL.
I. INTRODUCTION
Contemporary advancements in consumer electronics and their growing use in wireless and multimedia devices have set new benchmarks for the evolution of System-on-Chip (SoC) technologies. Digital signal processors are used extensively in areas ranging from speech and image processing to controls and communications. This diversity of use has given rise to a new class of architectures called application-specific instruction set processors (ASIPs). One main objective of an ASIP is a short turnaround time, which is determined by the ability to translate a given processor specification into hardware. Nevertheless, the design of a complete ASIP is an extremely complex undertaking and includes the following development phases: design constraint specification, design space exploration, architecture implementation, tools implementation, system integration/verification, final system synthesis and silicon implementation.
With increased time-to-market pressure, the architecture and its tools are developed concurrently, from scratch. This is a highly error-prone undertaking and may require formidable resources. Even if development is completed on time, considerable consistency problems may appear between the hardware implementation and the software development tools. Furthermore, a small change in design requirements or a bug found during later stages can result in a daunting amount of rework.
In order to address all of these issues, it is now essential to employ automation at all levels. This approach gives designers more room for early Design Space Exploration (DSE) and for later requirement changes.
The proposed methodology attempts to provide an efficient framework for the design cycle of a new DSP and its associated tools. In the following sections, we present the hardware code generator of BURAQ [10], a design platform for hardware/software co-simulation. The BPDL [11] description of an ASIP in BURAQ is passed to the micro-code state machine generator, which then generates the complete Verilog code for the design. This includes the data path, control logic and functional units. Afterwards, the HDL code can be verified and synthesized for an FPGA or an ASIC. BURAQ attempts to address all of the difficulties described above in DSP description and hardware/software co-simulation. This allows technology independence and function parameterizability in designing complex digital signal processing architectures.
II. RELATED WORK AND OUR CONTRIBUTION
Considerable work has already been reported on automatic HDL code generation from machine description languages. The language nML [7] was developed in Germany and supports retargetable simulation and code generation; however, it could not feasibly model and generate architectures with complex pipelining mechanisms and instruction-level parallelism. nML has also been adopted in the CHESS framework as its key processor definition language. The Instruction Set Description Language (ISDL) [6] was developed at MIT and is an extended version of nML. ISDL allows the description of VLIW-based architectures and their associated tools, including a C compiler, assembler, linker and simulator. However, no significant work was reported on the HDL generator. The language EXPRESSION [8] allows architectural exploration through compiler/simulator retargetability; however, no results were published on the modeling of real-life architectures or on hardware synthesis, so the designer has to invest considerable effort in the HDL implementation and simulation/verification of the target ASIP. The Language for Instruction Set Architectures (LISA) [1] was an ADL whose main characteristic was the operation-level description of the DSP. Its HDL generation was also published, but it was used mainly to generate a cycle-accurate simulator, motivated by behavioral modeling of the pipeline. LISA did not support the description and HDL generation of processors with complex data paths. LPDP [1] was a step towards automated generation of tools and synthesizable HDL code from the machine description language LISA. However, the cost, design and rigidity constraints of ASIC solutions necessitate generating weakly programmable, highly optimized architectures for a particular application. MIMOLA is an ADL that captures the netlist-level details of the DSP. One obvious advantage of this approach is that the HDL translation is quite easy. Nevertheless, the process of creating a simulator becomes correspondingly more difficult.
Our work addresses two well-defined problem areas. The first is the hardware-software partitioning problem, which has been the subject of extensive study [5], [9]. The second is the optimal generation of hardware components and tools from the scheduling and partitioning information computed in the first step. This paper mainly focuses on the latter contribution.
III. BASIC BUILDING BLOCKS
System level design is a very complex problem. We restrict our attention to the design of embedded systems and in particular application specific systems. Our target architecture consists of a micro-coded programmable processor with a custom data path and a controller.
In most hardware code generators, an efficient design depends heavily on the definition of the target architecture and its parameters. This should be combined with a highly specialized rule base that translates a higher-level abstraction to a netlist. Various other elements contribute significantly to the final architecture. For instance, factors such as different connection styles and the number of memory ports and buses can increase chip area greatly.
The digital design process consists of finding one of the optimal tradeoffs among the competing area-time-power (ATP) design objectives. The current optimization methodology selects, at a relatively fine-grained level, the architecture that best suits the design. In the following sections, we describe the Verilog code generation scheme for several building blocks.
A. Storage Elements
The storage elements comprise program/data memory units, register files and registers. VLIW architectures typically need multi-ported register files in order to cater to the needs of multiple, concurrently executing functional units. Multi-ported register files can be very expensive in terms of area, and one unique aspect of our system is that we generate an optimal number of read and write ports by taking into account the exclusion constraints among functional units. Other factors under consideration include memory access timings and bit-true widths and ranges.
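As a rough illustration of how exclusion constraints can reduce the port count, the following Python sketch sizes the read ports of a register file as the largest combined demand of any set of units that may execute in the same cycle. The unit names, per-unit port demands and the brute-force search are assumptions made for this example only; they are not taken from BURAQ.

```python
from itertools import combinations

def required_ports(demand, exclusive):
    """Largest total port demand over any set of units that may run concurrently.

    demand    -- dict: unit name -> ports the unit needs in one cycle (assumed values)
    exclusive -- set of frozenset pairs of units that never execute in the same cycle
    """
    units = sorted(demand)
    best = 0
    for size in range(1, len(units) + 1):
        for group in combinations(units, size):
            # Skip any group that contains a mutually exclusive pair.
            if any(frozenset(pair) in exclusive for pair in combinations(group, 2)):
                continue
            best = max(best, sum(demand[u] for u in group))
    return best

# Hypothetical functional units and read-port demands.
read_demand = {"ALU": 2, "MAC": 3, "SHIFTER": 2}
# Assumption: the MAC and the shifter are never issued in the same instruction.
exclusive = {frozenset({"MAC", "SHIFTER"})}

naive = sum(read_demand.values())               # one port per operand, no sharing
shared = required_ports(read_demand, exclusive)  # ports needed once exclusion is used
```

The exhaustive search grows exponentially with the number of units, which is acceptable only for the handful of functional units in a typical data path; a production tool would likely use a smarter formulation.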
B. Functional Units
Since BPDL allows a processor to be described in a natural hierarchical way, the EXECUTION_UNIT blocks are translated directly into high-speed functional units, which are composed of adders, MACs, multipliers, compressors, etc. The system performs these mappings according to the amount of parallelism available in the instruction set specification.
We now illustrate our HDL generation methodology applied to a functional unit with five operations. These operations along with their behavioral description in BPDL are indicated as follows:
Fig. 1. Circuit realization of a complex functional unit
From its parameterized library of basic building blocks, our tool combines the first two operations into a single adder that can perform both addition and subtraction. Afterwards, all the operations are realized as their respective circuits, which are illustrated in Fig. 1.
The hardware descriptions of these operations are then combined to produce Verilog code for the functional unit. In the depicted case, the total number of gates is reduced by introducing multiplexers in ADD2 (which is clearly the most complicated operation) and using two adders, one shifter and one OR gate to build the functional unit. This technique is also called functional unit sharing. Fig. 2 shows the final design of the unit.
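The sharing of one adder between ADD and SUB can be sketched with a small bit-level model in Python. The 16-bit width, the function names and the control encoding are assumptions chosen for illustration; the model relies only on the standard two's-complement identity a - b = a + ~b + 1. ADD2 is left out of the model.

```python
WIDTH = 16                  # assumed word width, for illustration only
MASK = (1 << WIDTH) - 1

def add_sub(src1, src2, sub):
    """Model one shared adder: sub=False gives src1 + src2, sub=True gives src1 - src2."""
    operand = (src2 ^ MASK) if sub else src2     # XOR stage inverts src2 when subtracting
    carry_in = 1 if sub else 0                   # carry-in supplies the "+1" of two's complement
    return (src1 + operand + carry_in) & MASK

def functional_unit(op, src1, src2):
    """Dispatch an operation to the shared adder, the shifter or the OR gate."""
    src1 &= MASK
    src2 &= MASK
    if op in ("ADD", "SUB"):
        return add_sub(src1, src2, op == "SUB")  # ADD and SUB share one adder
    if op == "SHL":
        return (src1 << src2) & MASK
    if op == "OR":
        return src1 | src2
    raise ValueError(f"unsupported operation: {op}")
```

In hardware terms, the `sub` flag would drive both the operand inverter and the adder's carry input, so subtraction costs a row of XOR gates rather than a second adder.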
There are numerous types of binary adders available. These include the carry-ripple adder, carry-lookahead adder, carry-select adder, conditional-sum adder and others. All of these perform binary addition, but they differ in chip area, power requirements and performance. Similarly, the addition of partial products in the multiplier can be achieved through the following five reduction techniques: 1. Carry save, 2. Dual carry save, 3. Wallace tree, 4. Dadda tree and 5. Vuillemin. Multiplier recoding refers to the process of reducing the effective number of partial products, and there are several ways to accomplish this. Our goal is to allow the user to explore all the architectural trade-offs and to minimize the area (using the above techniques) such that the timing constraints are satisfied.
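To give a feel for the timing side of this trade-off, the Python sketch below counts the number of 3:2 compressor stages needed to reduce n partial products to two rows, for a linear carry-save array and for a Wallace-style tree. It is a simplified row-counting model (the final carry-propagate adder is not counted), and the function names are assumptions of this example rather than part of the described tool.

```python
def carry_save_array_stages(n):
    """Stages in a linear carry-save array: each stage absorbs one more operand."""
    return max(n - 2, 0)

def wallace_tree_stages(n):
    """Stages in a Wallace-style tree of 3:2 compressors reducing n rows to 2."""
    stages = 0
    while n > 2:
        # Every full group of three rows becomes two rows; leftovers pass through.
        n = 2 * (n // 3) + n % 3
        stages += 1
    return stages

# Compare the two schemes for a few operand counts.
comparison = {n: (carry_save_array_stages(n), wallace_tree_stages(n)) for n in (8, 16, 32)}
```

For 8 partial products the array needs 6 stages and the tree needs 4; for 16 the figures are 14 and 6. The tree's logarithmic depth is what makes it attractive when the timing constraint is tight, at the cost of less regular wiring.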
C. Control Unit
BPDL describes the behavior of the control unit in the PCU block. This block consists of a description of the program counter, the instruction pipeline and the special-purpose registers. All increments to the program counter are performed from within the operation. However, the HDL generator has to synthesize the code related to the program counter inside the PCU. It does this by scanning through all operations and generating special code in the PCU for instructions that involve special PC operations such as branching or offset increments. Since the core has a VLIW-type kernel, the program counter is incremented after each instruction by an amount equal to the instruction width. As an example, we show in Fig. 3 the control unit of a processor that has a 4-stage pipeline and a parallel datapath.
The processor kernel has a Harvard architecture and consists of a bus for the instruction memory, a bus for each data memory, a register file for the ALU and a register file for the AGU. The VLIW core does not have circuitry to detect data and pipeline hazards; it is assumed that the compiler handles these intricacies.
Likewise, the instruction decoder, the program sequencer and the dispatch unit are generated from the instruction set model in BPDL.
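The scan over operations described for the program counter can be sketched as follows. This Python fragment emits a Verilog always block with one case arm per PC-modifying operation and a default arm that advances by the instruction width. The operation list, the action labels and the signal names (clk, reset, opcode, target, offset, pc) are hypothetical and are not taken from BPDL or from the generated code.

```python
def generate_pc_logic(operations, instr_width):
    """Emit a Verilog always block for the program counter.

    operations  -- list of (name, pc_action) pairs; pc_action is None for ordinary
                   operations, "absolute" for a branch to a target address, or
                   "relative" for an offset added to the PC (assumed encoding)
    instr_width -- default PC increment after each VLIW instruction
    """
    lines = [
        "always @(posedge clk) begin",
        "  if (reset) pc <= 0;",
        "  else case (opcode)",
    ]
    for name, action in operations:
        if action == "absolute":
            lines.append(f"    OP_{name}: pc <= target;")
        elif action == "relative":
            lines.append(f"    OP_{name}: pc <= pc + offset;")
        # Operations without a PC action are covered by the default arm.
    lines.append(f"    default: pc <= pc + {instr_width};")
    lines += ["  endcase", "end"]
    return "\n".join(lines)

# Hypothetical operation set: only JMP and BRA touch the program counter.
ops = [("ADD", None), ("JMP", "absolute"), ("BRA", "relative"), ("OR", None)]
pc_verilog = generate_pc_logic(ops, instr_width=8)
```

Only the operations flagged as PC-modifying produce dedicated case arms, which mirrors the idea that ordinary instructions need no special code in the PCU.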
D. Data Path
Using the optimization algorithms mentioned in the previous section, we embed the data path directly in the execute stage of the pipeline. The data path description consists of three parts. The first part is the component declaration of the resources used. The second part is the signal declaration of each connection. The last part is the component instantiation for all resources. The data path of the example processor in Fig. 3 was generated in this way. It has a 16 x 64-bit register file. Two read and write ports are used for simultaneous reading of operands and writing of results. The datapath contains two functional units, which can operate in parallel. One is the high-performance adder also shown in Fig. 2. The other is a multiplier-accumulator unit. The pipeline registers are shown as dark boxes. These registers forward information about the instruction, such as the instruction opcode and operand registers.
In the final step of the datapath synthesis, we instantiate each register file, the functional units and the appropriate read and write ports. Finally we generate the interconnection in the data path and between the data path and the instruction decoder.
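The structure of the generated data path description can be sketched with a small Python emitter. Verilog has no separate component declaration, so this sketch covers only the signal declarations and the instantiations. The module names, instance names, port names and signal names are hypothetical; the 64-bit width follows the register file of the example processor.

```python
def generate_datapath(resources, signals, width=64):
    """Emit a Verilog module that declares wires and instantiates resources.

    resources -- list of (module_type, instance_name, {port: signal}) tuples
    signals   -- names of the internal connections to declare as wires
    width     -- bit width of every declared wire (a simplifying assumption)
    """
    lines = ["module datapath (input clk);"]
    for sig in signals:
        # Signal declaration part: one wire per connection.
        lines.append(f"  wire [{width - 1}:0] {sig};")
    for module_type, inst, ports in resources:
        # Instantiation part: named port connections for each resource.
        port_list = ", ".join(f".{port}({sig})" for port, sig in ports.items())
        lines.append(f"  {module_type} {inst} ({port_list});")
    lines.append("endmodule")
    return "\n".join(lines)

# Hypothetical resources: a register file feeding an adder/subtractor.
signals = ["rd_a", "rd_b", "alu_out"]
resources = [
    ("regfile", "rf0", {"clk": "clk", "rd0": "rd_a", "rd1": "rd_b", "wr0": "alu_out"}),
    ("addsub", "fu0", {"src1": "rd_a", "src2": "rd_b", "dst": "alu_out"}),
]
datapath_verilog = generate_datapath(resources, signals)
```

Keeping the resource list and the signal list as plain data makes it straightforward to regenerate the interconnection whenever the port allocation changes.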
ADD src1, src2, dst {dst = src1 + src2;}
SUB src1, src2, dst {dst = src1 - src2;}
SHL src1, src2, dst {dst = src1 << src2;}
OR {dst = src1 | src2;}
ADD2 src1, src2, dst {dst = ((src1 
IV. EXPERIMENTS AND RESULTS
In this section, processor generation results are examined. The first step was to generate a processor that implements a subset of StarCore's SC140. In the second step, an instruction subset of the TMS320C62x was developed. Both were synthesized for the Xilinx XC3000 family.
In these experiments, 32 instructions of the SC140 and 20 instructions of the C62x were modeled. Load/store instructions, arithmetic and logic operations, and branch and jump instructions can be generated. Several advanced instructions, such as MAX2VIT in the SC140, are not supported since they are beyond the scope of the proposed system. The SC140 description took nearly 741 lines and its realization took nearly two weeks. The results of the logic synthesis are shown in Table 1.
At present, the proposed system cannot be applied to multicycle functional units, because all the building blocks are implemented using combinational logic. Nevertheless, the results demonstrate that general RISC instructions can be mapped easily and that design time can be reduced considerably. We also see that the design flow and algorithms can be improved by further reducing the cost of multiplexers and buses. Some of the initial results of our code generation scheme are indicated in Table 1.
V. CONCLUSION
In this paper, we have proposed a retargetable hardware code generator and its algorithmic foundations. The system generates synthesizable HDL from the machine description language BPDL. The focus of this work is on obtaining high-performance designs while at the same time reducing the hardware area.
Our current research addresses several aspects of hardware/software partitioning and pipelining. We plan to solve the partitioning problem mathematically over the area-time-power (ATP) design space. The results obtained by solving the problem with a linear integer programming (IP) approach would help in suggesting the best design alternative, one that meets the timing constraints while minimizing the overall area of the chip.
