Abstract
complexity datapath (coprocessor) attached to a configurable embedded processor. We have investigated a second coprocessor configuration which includes a private register file. Results indicate that the new configuration is superior to the previously reported method.
II. LPAS-BASED SPEECH CODERS
The G723.1 and G729A standards belong to the category of Linear-Prediction Analysis-by-Synthesis (LPAS) [21] speech coders. They produce low bit-rate, high-quality speech using a combination of analysis-by-synthesis techniques, where the encoder (analysis) includes the decoder (synthesis) to determine the initial excitation signal, and linear prediction techniques to determine the coefficients of the speech synthesis filter. The G723.1 standard specifies a dual-rate speech coder that can operate at 5.3 or 6.3 kbps, while G729A operates at a fixed rate of 8 kbps. Quality improves with higher bit rates, although the overall performance of G723.1 at 6.3 kbps and G729A is similar. A clear difference between these coders is their algorithmic delay: the total one-way delay of 25 ms for G729A compares favorably with the 67.5 ms of G723.1. Technically, G723.1 at 6.3 kbps differs from G729A and G723.1 at 5.3 kbps in the excitation model for the synthesis filter. G723.1 at 6.3 kbps uses multi-pulse excitation with a maximum likelihood quantizer (MP-MLQ), while G723.1 at 5.3 kbps and G729A use code excited linear prediction (CELP) [21].
CELP coders are based on a codebook that stores possible excitation sequences for the synthesis filter. This is the most common realization of the LPAS paradigm and its dataflow is depicted in figure 1. In the figure, the original input speech is used to perform linear prediction analysis and calculate the coefficients of a tenth-order synthesis filter. The filter order models the number of resonant frequencies, or formants, of the transfer function of the human vocal tract. The excitation signal to the synthesis filter is obtained from two codebooks that model the initial stages of the human sound production system. An adaptive codebook is used to model the pitch structure of voiced sounds originating in the vibrating vocal cords, and a fixed codebook is used to model unvoiced sounds such as nasal or plosive sounds.
The residual error between the reconstructed speech produced by the synthesis filter and the original input speech is then further processed by a perceptual weighting filter. The output signal from this process is matched against the adaptive codebook elements to determine the codebook index and gain that best approximate the residual signal. The adaptive codebook contribution is removed from the residual and the same process is repeated using the fixed codebook. The indices and gains for both codebooks are assembled, together with the synthesis filter coefficients, into the bitstream transmitted to the decoder. This processing is done for every 10 ms frame of the voice signal. The G729A decoder dataflow is illustrated in figure 2. The received bitstream is disassembled to obtain the filter coefficients and the codebook parameters. The excitation is constructed by adding the adaptive and fixed codebook vectors scaled by their gains. The excitation is then filtered through the same synthesis filter as during encoding. Additional post-processing of the speech signal is performed to enhance its quality.
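The codebook matching and decoder reconstruction steps described above can be illustrated with a minimal floating-point sketch. It is not the bit-exact ITU-T procedure: the function names are hypothetical, the search is a generic least-squares match of a target against candidate vectors (in a real coder the candidates would first pass through the weighted synthesis filter), and the filter convention A(z) = 1 + sum of a_i z^-i is an assumption of this example.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def search_codebook(target, codebook):
    """Return (index, gain) minimising ||target - gain * codebook[index]||^2.

    For a fixed vector y the optimal gain is <x,y>/<y,y>, and the error is
    smallest for the vector that maximises <x,y>^2 / <y,y>.
    """
    best_index, best_gain, best_score = 0, 0.0, -1.0
    for index, vec in enumerate(codebook):
        energy = dot(vec, vec)
        if energy == 0.0:
            continue  # an all-zero vector cannot approximate anything
        corr = dot(target, vec)
        score = corr * corr / energy
        if score > best_score:
            best_index, best_gain, best_score = index, corr / energy, score
    return best_index, best_gain


def build_excitation(adaptive_vec, fixed_vec, gain_pitch, gain_code):
    """Decoder side: sum of the two codebook vectors scaled by their gains."""
    return [gain_pitch * a + gain_code * c
            for a, c in zip(adaptive_vec, fixed_vec)]


def synthesis_filter(excitation, lpc, memory=None):
    """All-pole filter 1/A(z): s[n] = e[n] - sum_i lpc[i-1] * s[n-i]."""
    history = list(memory) if memory is not None else [0.0] * len(lpc)
    output = []
    for e in excitation:
        s = e - dot(lpc, history)       # history[0] holds s[n-1]
        output.append(s)
        history = [s] + history[:-1]    # shift the filter memory
    return output
```

For example, `search_codebook([2.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])` selects index 1 with gain 2.0, and a first-order filter with coefficient -0.5 turns a unit impulse into a decaying sequence.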
III. PROBLEM FORMULATION
This research identifies architecture and microarchitecture requirements for the efficient implementation of the G729A and G723.1 speech coders on high-performance, low-cost, configurable microprocessors. The workloads were initially executed and profiled in native mode (Linux x86): Table 1 shows the relative amount of time spent outside the DSP emulation instructions. In order to investigate the potential acceleration of the algorithms when executing on an embedded microprocessor, the workload was recompiled for the Simplescalar instruction set architecture (ISA) [15]. Table 2 illustrates the simulated processor profiling results. As expected, the workloads spend a significant fraction of their time and instructions executing the DSP emulation functions. It is clear that efficient implementation of the DSP emulation instructions on a configurable, extensible microprocessor can lead to a very high-performance, targeted architecture for the particular workloads. The small form factor and reduced power consumption of the proposed solution make it a very attractive candidate for replication and integration in an SoC ASIC.
IV. MICROARCHITECTURE
We have investigated two microarchitectures: one that uses the main CPU register file and another that utilizes its own. Both microarchitectures make use of the RISC memory subsystem (L1 data cache) and are designed to be attached to a Sparc-V8 compliant SoC subsystem distributed under LGPL [10]. We chose to connect the coprocessors to the integer unit pipeline directly, instead of designing them as AHB-compliant masters [11], for performance reasons: stand-alone AHB coprocessors are very effective when working on medium to large blocks of streaming data. Although the workloads perform a lot of work on blocks of data (samples), there were many more instances where we had to insert custom assembly code into irregular (non-iterative) blocks. As a result, we opted for a very tightly-coupled configuration which accommodates both cases efficiently. High-level views of the two microarchitectures are depicted in figures 4 and 6 respectively. This section discusses a number of design parameters:
A. Coprocessor Interface
The open-source embedded RISC processor lacked detailed microarchitecture documentation. Initial experimentation with the existing coprocessor interface was inconclusive as to its ability to operate in a pipelined fashion. A non-pipelined interface would have had a detrimental effect on the performance of the coprocessors, and it was therefore decided to implement a new, pipelined coprocessor interface. The newly developed coprocessor port can handle two coprocessors and is able to deliver an instruction on every cycle. External coprocessors provide flow control to the main processor through a dedicated stall signal.
The diagram of figure 3 shows a coprocessor data operation on cycle 1 followed by a host-to-coprocessor register transfer on cycle 2. In cycle 3, a coprocessor register is requested by the RISC processor but due to internal stall conditions, data are made available one cycle later than the expected time (cycle 5 instead of cycle 4). During that time, the main processor is held with the holdn signal. Finally, a second read operation, this time directed to Coprocessor 1, is initiated in cycle 6. Results are made available to the main pipeline in cycle 7.
B. Microarchitecture 1: Using the main RISC CPU Register File
This is the simplest microarchitecture since it makes use of the main RISC processor register file. This type of approach has been adopted by configurable microprocessor vendors [18] [22] and it is effectively a side-datapath with associated control, attached to the main CPU as depicted in figure 4. In this case, the coprocessor consists of the Datapath and the Control Pipeline. Starting at the IFETCH stage, the main RISC processor fetches one instruction word from a multi-way set-associative instruction cache and clocks it into the instruction register.
RISC and coprocessor decoding take place concurrently at the DECODE stage, with the main RISC register file accessed at the falling edge of the clock. Due to the significant number of multiply-add operations in the workload, a third read port was added to the main CPU register file to accommodate single-cycle addition (RF3). This port is depicted as an embedded SRAM block, instantiated in the coprocessor hierarchy and clocked at the falling edge of the DECODE stage. Finally, all result bypassing takes place in this stage. The EXEC stage is the main processing stage for both the RISC processor and the coprocessor. During this stage all non-arithmetic operations are computed in the coprocessor. In addition, the 16-bit signed multiplication is performed. All transfers between the main RISC pipeline and the internal coprocessor state take place in this stage. Coprocessor results are pipelined into the EXEC2 stage, where the add part of the multiply-add operation is performed along with saturation. During this stage, the L1 data cache is accessed and one 32-bit word is returned to the main RISC pipeline from the load path, as depicted in the diagram. It is this stage that qualifies state updates on the coprocessor side, since all possible exception conditions have been resolved. Finally, results are clocked into a staging register prior to committing to the RISC register file on the falling edge of the clock.
C. Microarchitecture 2: Using private Register File
This microarchitecture differs considerably from the previous one because it uses a separate 16x32-bit register file in addition to a more elaborate control mechanism. The coprocessor state is fully accessible from the RISC CPU and is shown in figure 5. It consists of sixteen 32-bit registers and a sticky overflow bit. Bi-directional transfer instructions between the host RISC processor and the coprocessor were added to compensate for the lack of Move-to-coprocessor/Move-from-coprocessor instructions in the Sparc V8 architecture [17]. The high-level schematic of the coprocessor with its own register file is depicted in figure 6. In this case, the coprocessor pipeline is segmented into three major sections: Front-end, Control pipeline and Datapath. Starting from the top, the main CPU reads an instruction from the multi-way set-associative instruction cache and clocks it into the instruction register. The latched command is then decoded, both at the RISC processor and the coprocessor front-end, and register-file read addresses are extracted. In parallel, the coprocessor decoding logic computes a number of control fields that are sent to the control pipeline. During the EXEC/READ stage, the register file is accessed, followed by operand bypassing. The resolved operands opr1, opr2 and opr3 are clocked into the operand registers, where they are utilized during the first execution stage (EXEC1). In DMEM/EXEC1, all shifting, normalization and miscellaneous operations are performed. In addition, the signed multiplier is accessed if the command specifies it. Results are passed to EXEC2 for the second stage of execution, where all arithmetic and saturation take place. The configuration of figure 6 permits the pipelined execution of all the commands with a latency of 1 cycle. The only exceptions are the multiply-add and multiply-subtract with saturation, which span both execution stages and have a latency of 2 cycles.

Figure 6: high-level microarchitecture
The following sections discuss in more detail the microarchitecture blocks common to both coprocessors. These include the EXEC1 and EXEC2 stages and lower hierarchical blocks.
1) EXEC1 Stage
EXEC1 includes datapath logic to perform 16x16-bit signed multiplication, all ITU shift operations, and a miscellaneous block responsible for handling all opcodes not falling into the previous categories. These are depicted in figure 7.
a) Multiplier
This is the signed, 16-bit multiplier. Due to the highly configurable nature of the RISC processor and the portability requirements of this work, HDL constants are used to select whether the multiplier is inferred in the RTL code or instantiated. In the latter case, a Booth-encoded, Wallace-tree multiplier [20] is utilized due to its higher pipelined performance when compared to the implementations chosen by the synthesis tools.
Figure 7: EXEC1 stage (shift_unit, misc_unit and signed 16-bit multiplier)
Table 4 depicts the unpipelined and two-stage pipelined maximum operating frequency of the 16x16 signed multiplier in a high-performance 0.13 µm process. Our timing budget allows for the use of a non-pipelined multiplier, thus simplifying the coprocessor pipeline design.
b) Shift Unit
The shift unit implements the 16- and 32-bit ITU shift operations. A particular characteristic of these operations is the ability to specify negative shift amounts, resulting in a positive shift in the opposite direction. The high-level schematic of the shift unit is depicted in figure 8.
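The negative-shift behavior can be illustrated with a small software model of the 16-bit case. This is a sketch written in the spirit of the ITU-T `shl` and `shr` basic operators rather than a transcription of the reference code or of the RTL; returning the overflow indication as a second tuple element (instead of updating a global flag) is a simplification made for this example.

```python
MAX_16 = 0x7FFF
MIN_16 = -0x8000


def shl(var1, var2):
    """Saturating 16-bit arithmetic left shift of var1 by var2 positions.

    A negative var2 is executed as a right shift by -var2.
    Returns (result, overflow).
    """
    if var2 < 0:
        return shr(var1, -var2)
    result = var1 << var2            # Python integers never wrap
    if result > MAX_16:
        return MAX_16, True
    if result < MIN_16:
        return MIN_16, True
    return result, False


def shr(var1, var2):
    """16-bit arithmetic right shift of var1 by var2 positions.

    A negative var2 is executed as a (saturating) left shift by -var2.
    Returns (result, overflow).
    """
    if var2 < 0:
        return shl(var1, -var2)
    # Python's >> on negative integers is an arithmetic (sign-filling)
    # shift, so large shift counts settle at 0 or -1.
    return var1 >> var2, False
```

With this model `shl(16, -2)` and `shr(16, 2)` both yield 4, while `shl(0x4000, 1)` saturates to 0x7FFF and reports an overflow.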
2) EXEC2 Stage
This stage performs the add part of the MAC instruction as well as all arithmetic and saturation. Results commit to the private register file at the end of this cycle or return to the host pipeline during stage DMEM. The common EXEC2 high-level schematic is shown in figure 9.
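The split of a saturating multiply-add across the two execution stages can be sketched as a behavioral model. The arithmetic follows the usual ITU-T fractional convention (the 16x16 product is doubled, then accumulated with 32-bit saturation); the function names are hypothetical, and placing the doubling in the second stage is an assumption of this example, not something the text specifies.

```python
MAX_32 = 0x7FFFFFFF
MIN_32 = -0x80000000


def saturate32(value):
    """Clamp value to the signed 32-bit range. Returns (result, overflow)."""
    if value > MAX_32:
        return MAX_32, True
    if value < MIN_32:
        return MIN_32, True
    return value, False


def exec1_multiply(var1, var2):
    """First stage: plain signed 16x16 -> 32-bit product."""
    return var1 * var2


def exec2_accumulate(acc, product):
    """Second stage: fractional doubling, addition and saturation.

    Assumption: the doubling is modelled here rather than in the first
    stage. The overflow result is the OR of both saturation events.
    """
    doubled, ovf_mult = saturate32(product * 2)
    total, ovf_add = saturate32(acc + doubled)
    return total, ovf_mult or ovf_add


def l_mac(acc, var1, var2):
    """Two-stage saturating multiply-add: acc + 2 * var1 * var2."""
    return exec2_accumulate(acc, exec1_multiply(var1, var2))
```

The only product that overflows the doubling step is (-32768) x (-32768), so `l_mac(0, -32768, -32768)` returns the maximum positive value with the overflow indication set.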
V. RESULTS
Results were obtained for both coprocessors at the architectural level, with the baseline architecture being the Simplescalar ISA. The workloads were compiled and all ITU test vectors were validated on the standard architecture simulator (sim-profile). Tables 5 and 6 depict the number of simulated processor instructions required for each workload, for the G723.1 and G729A algorithms respectively. The workloads were then modified to include custom assembly instructions, and a new architecture-level simulator (sim-coproc), based on the existing profiling simulator, was designed. The test vectors were again simulated and the algorithmic complexity was measured and compared to that obtained in the previous run. Full compliance with the ITU-T test vectors was maintained at all times.
A. Coprocessor without register file results
Tables 7 and 8 depict the average (over all test vectors) relative algorithmic complexity for both the coder and decoder of the G729A and G723.1 standards respectively, when compiled and simulated for a coprocessor using the RISC processor register file. The tables illustrate the fractional complexity reduction as extension instructions are added, one by one, for both coder and decoder. For G729A, an average architectural reduction in algorithmic complexity of 49% (coder) and 47.1% (decoder) is achieved. The G723.1 standard achieves similar figures, with 45.7% and 49% complexity reduction for the coder and the decoder respectively. These improvement figures do not take into account cycle effects such as cache misses, prefetching or the possibility of multi-issue.
B. Coprocessor with private register file results
Tables 9 and 10 show the average (over all test vectors) relative algorithmic complexity of the G723.1 and G729A coders respectively, for a coprocessor with a private register file and utilizing all the defined instructions of table 3 (except division). Further substantial gains are observed: the G723.1 coder demonstrates an average complexity reduction of 65% relative to the unmodified standard, an improvement of 35.6% over the previous architecture, whereas the G729A coder achieves a reduction of 69%, an improvement of 39.3% over the previous architecture. It is clear that the introduction of the coprocessor register file provided significant benefit by reducing register pressure compared to the previous method. In addition, a significant number of Load/Store operations were eliminated, since transient values are now cached in the dedicated register file.
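The coder figures for the two configurations can be related with a short calculation. The sketch below assumes that the 65% and 69% figures for the private register file denote complexity reductions relative to the unmodified standard; this is an interpretation, chosen because it reproduces the quoted 35.6% and 39.3% improvements to within rounding of the published percentages. The function name is hypothetical.

```python
def relative_improvement(reduction_old, reduction_new):
    """Improvement of the new configuration over the old one.

    Both arguments are fractional complexity reductions relative to the
    unmodified coder; the result compares the remaining complexities.
    """
    remaining_old = 1.0 - reduction_old
    remaining_new = 1.0 - reduction_new
    return 1.0 - remaining_new / remaining_old


# Coder figures quoted in the text (old: shared register file,
# new: private register file).
g723_coder = relative_improvement(0.457, 0.65)   # about 0.355
g729a_coder = relative_improvement(0.49, 0.69)   # about 0.392
```

Under this reading the G723.1 coder drops from roughly 54% to 35% of the original instruction count, and the G729A coder from 51% to 31%.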
VI. SOC SUBSYSTEM
Architecture research demonstrated the superiority of the coprocessor with a private register file. This microarchitecture is currently being implemented in RTL VHDL as a tightly-coupled coprocessor for the Leon Sparc-V8 CPU. Detailed microarchitecture analysis followed by trial synthesis confirmed that all instructions can fit in a single high-frequency cycle, resulting in a latency of 1 and an initiation rate of 1. Exceptions to this are the multiply-add/subtract instructions and the short divide, with latency/initiation rates of 2/1 and 17/17 respectively. In particular, it was decided that, due to the very low improvement it offers, the iterative divider block would not be utilized. The CPU/coprocessor attaches to a 32-bit AHB system which connects to an external host via an AHB-PCI bridge. This is depicted in figure 10. The optimized speech coder and the frames to be processed are transferred by DMA from the host PC to the SDRAM memory of the RISC/coprocessor FPGA board. After that, the RISC CPU/coprocessor combination processes the frames and stores the compressed frames in local memory (SDRAM). The compressed frames are transferred back to the PC memory for comparison with the ITU-T test vectors.
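The multi-cycle nature of the short divide follows from its iterative definition. The sketch below is a software model of a 15-step shift-and-subtract fractional division, patterned on the ITU-T `div_s` basic operator; it is not the divider RTL, and how the 15 iterations map onto the 17-cycle latency quoted above is not stated in the text.

```python
def div_s(var1, var2):
    """Fractional division var1 / var2 producing a Q15 result.

    Requires 0 <= var1 <= var2 and var2 > 0. One quotient bit is
    produced per iteration, most significant bit first.
    """
    if var2 <= 0 or var1 < 0 or var1 > var2:
        raise ValueError("div_s requires 0 <= var1 <= var2 and var2 > 0")
    if var1 == 0:
        return 0
    if var1 == var2:
        return 0x7FFF                 # 1.0 is not representable in Q15
    numerator, quotient = var1, 0
    for _ in range(15):
        quotient <<= 1
        numerator <<= 1
        if numerator >= var2:         # subtract when the divisor fits
            numerator -= var2
            quotient += 1
    return quotient
```

For instance, `div_s(1, 2)` yields 16384, which is 0.5 in Q15.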
VII. SYSTEM VERIFICATION
Significant effort was spent validating the system at both block and system level [16]:
A. Block-level verification
The reference code DSP emulation instructions were instrumented to produce human-readable files of their input operands, the state of the global Overflow flag and output results. These vectors were subsequently fed into the individual datapath blocks and their functionality validated on a per-workload basis.
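The vector-based flow can be sketched as a record-and-replay loop. Everything specific in this example is hypothetical: the one-line text format, the function names, and the use of a Python function as a stand-in for a datapath block are illustrative assumptions, not the format or testbench used in this work.

```python
def format_vector(op, operands, overflow_in, result, overflow_out):
    """Render one operation as a human-readable line (hypothetical format)."""
    fields = " ".join(str(x) for x in operands)
    return f"{op} {fields} {int(overflow_in)} -> {result} {int(overflow_out)}"


def parse_vector(line):
    """Inverse of format_vector."""
    lhs, rhs = line.split("->")
    fields = lhs.split()
    op = fields[0]
    operands = [int(x) for x in fields[1:-1]]
    overflow_in = bool(int(fields[-1]))
    result, overflow_out = rhs.split()
    return op, operands, overflow_in, int(result), bool(int(overflow_out))


def replay(lines, block_models):
    """Feed recorded vectors to the block models; return mismatching lines."""
    mismatches = []
    for number, line in enumerate(lines, start=1):
        op, operands, ovf_in, exp_result, exp_ovf = parse_vector(line)
        result, ovf = block_models[op](*operands, ovf_in)
        if (result, ovf) != (exp_result, exp_ovf):
            mismatches.append((number, line))
    return mismatches


def add16_model(a, b, overflow_in):
    """Stand-in block: saturating 16-bit add with a sticky overflow flag."""
    total = a + b
    if total > 32767:
        return 32767, True
    if total < -32768:
        return -32768, True
    return total, overflow_in
```

Replaying a file through `replay(lines, {"add": add16_model})` returns an empty list when every recorded result and overflow state is reproduced by the block under test.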
B. System level verification
In parallel with block-level verification, system verification involved the design of a DMA controller to transfer the embedded processor binary and frames from the host memory into the FPGA board SDRAM. The RISC processor, without the coprocessor, executed the workload and agreement with the ITU-T test vectors was obtained.
VIII. CONCLUSIONS AND FUTURE WORK
We utilized a combination of techniques to profile and optimize the ITU-T G729A and G723.1 speech coders. A further significant source of optimization lies in exploiting the data-level parallelism available in the workloads. Our group is currently investigating vector architectures for the efficient execution of the speech coders.
Additional insight into the cycle effects will be provided through the cycle-accurate modeling of both coprocessors when attached to a more generic RISC CPU with limited dual-issue ability. This is portrayed in figure 11, which describes a high-performance scalar RISC processor with 8 pipeline stages and limited dual-issue capability (one scalar, one coprocessor). This will allow for exploration of the processor/coprocessor design space and provide insight into the microarchitecture requirements for the efficient execution of the workloads. Finally, we are building the RTL model of the microarchitecture of figure 6 in the context of the system of figure 10.
