Index Terms: Application-specific instruction-set processor (ASIP), DC-DC power conversion, digital control, programmable control.
I. INTRODUCTION

INCREASING numbers of multi-rail loads and the more commonplace application of multi-phase buck converter topologies have placed an increased computational burden on the digital controllers that implement the current or voltage mode control algorithms in switching mode power converter (SMPC) systems. This additional computational burden is a result of the requirement that a control algorithm must be executed for each voltage rail and the algorithm must be executed multiple times per switching period for multi-phase converters. The use of switching and sampling frequencies beyond 1 MHz and the application of multi-sampling techniques [1] - [3] are similarly increasing the required execution rate of the control algorithms. Furthermore, three-pole, three-zero and adaptive control algorithms are now being implemented for power conversion applications in order to meet the increasingly stringent performance and efficiency requirements [4] - [8] ; however, these algorithms require the execution of a larger number of instructions compared with the implementation of the previously standard fixed-coefficient PID algorithms [9] , [10] .
ASIC-based digital controllers serve their purpose well in low-cost single rail SMPCs where look-up tables (LUTs) and finite state machines (FSMs) are used to minimize computational hardware requirements [11] - [13] . Programmable digital signal processors (DSPs) are more suitable for a wide variety of multi-rail applications, because they can be time-multiplexed to execute multiple control algorithms and thereby control multiple power converters, as illustrated in Fig. 1 . They also provide flexibility through their ability to be programmed to execute a different control algorithm for each power converter [14] . In addition to their computational features, general purpose DSPs usually contain other hardware that is required in the digitally controlled switching mode power supply (SMPS) system, for example, analog-to-digital converters and pulse width modulators.
In spite of the availability of high performance digital signal processors that make use of high clock frequencies and deep pipelines, little research attention has been given to developing an optimized architecture to meet the changing multi-rail SMPC system requirements. Many existing general purpose DSPs used for implementing power control algorithms are not strictly optimized for the needs of the SMPCs to which they are being applied. Aside from the fact that these DSPs contain superfluous peripheral hardware, they are unable to execute multiple complex control algorithms within the time frame dictated by the multi-rail DC-DC converter application. This is primarily because they have insufficient computational elements in the datapath and switch context slowly between the execution of algorithms. The growing complexity of the digital control algorithms being applied has meant that the existing DSPs [15] - [17] , which were originally designed for implementing digital filtering and fast Fourier transform (FFT)-type algorithms, are struggling to meet the constraints of high switching frequency SMPCs due to excessive computational delays [18] , [19] . The supplementing of existing DSPs and microcontrollers with hardware accelerators and co-processors has been proposed as a quick way of keeping pace with the rapid development of a wide range of advanced power control techniques [20] - [22] . Although such architectures can temporarily address deficiencies in existing processors, they do not take into account the idiosyncrasies of digital power control systems at a basic level. A further problem with existing DSPs is that the switching frequencies of the individual SMPCs being controlled in a multi-rail system are usually restricted to being identical or integer multiples of each other [23] .
There is a need therefore to implement an application-specific processor architecture that takes account of the constraints in the multi-rail DC-DC converter system in order to yield improved regulation and performance in switching mode power supplies. This paper proposes and develops a new processor architecture that can act as the computational engine of future intelligent SMPS systems. Its features include a dual multiplier-accumulator datapath and a fast context switching controller that enables efficient use of its computational resources over time. The proposed ASIP thus inherits the flexibility of a general purpose DSP using sufficient but not excessive resources while exhibiting improved computational performance. These traits allow the processor to implement higher order and adaptive control algorithms within the constraints of the multi-rail power conversion application.
The paper is organized as follows. Section II provides an overview of the proposed ASIP architecture and gives an insight into the reasons behind the development of the main features of the architecture. Section III focuses on the problems associated with the multiplexed control of non-integer switching frequency ratio systems and proposes a method of improving controller performance in such applications. Section IV details the experimental verification of the ASIP in multi-rail DC-DC converter applications, where improved performance is demonstrated when compared with a conventional DSP-based approach.
II. PROPOSED PROCESSOR ARCHITECTURE
A. Overview
Fig. 2 illustrates the main hardware blocks of the proposed custom dual multiply-accumulate (MAC) processor, which is not based on any specific existing processor architecture. The 16-bit datapath is divided into two identical datapaths, each containing the necessary computational elements to execute MAC operations. Other computational elements that are also required in most power control algorithms but are less frequently used throughout the execution of the algorithms are shared by the individual datapaths to avoid unnecessary duplication and underutilization of resources [24] .
The processor has a Harvard-based memory architecture, similar to the majority of modern DSPs, whereby separate memory banks and buses are dedicated to data and program storage in order to reduce the execution time of operations which require multiple memory accesses. The primary data memory consists of multiple register banks that form a single register file. Data can be transferred between the register banks and external memory using load and store operations.
The program controller contains all of the housekeeping functionality of the processor, which includes the program counter, instruction decoder, interrupt controller, and the mode configuration registers. The interrupt controller contains a finite state machine that determines the exact mode of operation or context of the processor. This element is vital in executing multiple algorithms in a time-efficient manner and ensuring a fast response time when controlling multiple power converter rails.
B. Dual MAC Datapath
The datapath component effectively consists of two interconnected datapaths which receive data from a common register file and share a number of functional elements, as illustrated in Fig. 3 . Data is transferred between the datapath and the peripheral elements of the digital controller system through the data in and data out ports.
Each of the interconnected datapaths has a MAC unit to execute multiplication, addition, subtraction, rounding or combined multiply-accumulate operations in a single clock cycle. These operations are the main operations found in power control algorithms. The execution time of the control algorithms is therefore reduced by carrying out two such operations simultaneously. Data movement operations may also be executed in parallel with computational operations for updating filter delay lines.
The sequencing of shifting and saturation operations in power control algorithms means that more than one of these operations never needs to be executed simultaneously. A barrel shifter and a saturator can therefore be shared between the two datapaths. The internal saturation logic limits the output of the accumulator registers to prevent errors due to overflow. Other less frequently required functional elements are also shared between the two datapaths, for example the arithmetic logic unit (ALU), which can perform bitwise logical AND, OR, exclusive-OR, and inversion operations. The datapath also includes a separate limiter unit that can limit input, output, and register data values to any given threshold in a single clock cycle operation. This facilitates limiting of the error from the analog-to-digital converter (ADC) or limiting of the duty cycle that is applied to the digital pulse width modulator (DPWM) in a straightforward manner.
The datapath uses fractional fixed-point two's complement arithmetic, similar to the majority of commercial digital power controllers [25] . The limited range of values being computed in power control algorithms means that the wide dynamic range of a floating-point DSP and its associated hardware overhead is not required. A standard 16-bit wordlength is preferred for the datapaths to facilitate straightforward interfacing with external digital hardware, though this resolution could be reduced if there were strict area requirements.
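The fractional fixed-point arithmetic described above can be sketched in a few lines of Python. This is only an illustration, assuming a Q15 format (one sign bit and fifteen fractional bits in a 16-bit word) and round-to-nearest behavior; the function names are hypothetical and do not correspond to instructions of the processor.

```python
FRAC_BITS = 15                    # assumed Q15 format: 1 sign bit, 15 fractional bits
Q15_MIN = -(1 << FRAC_BITS)       # -32768 represents -1.0
Q15_MAX = (1 << FRAC_BITS) - 1    # 32767 represents just under +1.0


def saturate(value, lo=Q15_MIN, hi=Q15_MAX):
    """Clamp a result to the representable range instead of letting it wrap."""
    return max(lo, min(hi, value))


def to_q15(x):
    """Convert a real value in [-1, 1) to its Q15 integer representation."""
    return saturate(int(round(x * (1 << FRAC_BITS))))


def q15_mul(a, b):
    """Multiply two Q15 values, round the Q30 product back to Q15, and saturate."""
    product = a * b                                        # Q30 intermediate result
    rounded = (product + (1 << (FRAC_BITS - 1))) >> FRAC_BITS
    return saturate(rounded)


def limit(value, threshold):
    """Symmetric limiter, e.g. for the ADC error or the duty cycle command."""
    return saturate(value, -threshold, threshold)
```

The only product that overflows in this format is (-1.0) x (-1.0), which the saturation step clamps to the largest positive value.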
The correct resolution choices of the inputs and outputs of the processor are vital for stable operation of the closed-loop digitally controlled power converter system. In order to avoid creating limit cycle oscillations in the output voltage of the power converter the DPWM must have greater resolution than the ADC [26] , usually by one or two bits. Additionally by keeping the internally fed-back duty cycle signal at a higher resolution than the duty cycle applied to the DPWM, the effects of limit cycling due to quantization errors are also reduced. This dithering of the duty cycle is achieved by maintaining a higher resolution in the duty cycle delay line compared with that in the voltage error delay line of a linear compensator. The dual MAC architecture simplifies the implementation of this, whereby operations in the duty cycle delay line are executed on one MAC unit while voltage error delay-line operations are executed on the other MAC unit. In contrast this would require more operations to be executed sequentially on a single MAC architecture, thus leading to an overall increased algorithm execution time.
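To make the two delay lines concrete, a third-order linear compensator of the kind used later in the experimental section (3P3Z) can be written in the following general form. The coefficient names $a_k$, $b_k$ and the sign convention are assumptions made for illustration; they are not taken from the paper.

```latex
d[n] \;=\; \underbrace{\sum_{k=1}^{3} a_k\, d[n-k]}_{\text{duty cycle delay line}}
      \;+\; \underbrace{\sum_{k=0}^{3} b_k\, e[n-k]}_{\text{voltage error delay line}}
```

Here $d[n]$ is the duty cycle command and $e[n]$ is the sampled voltage error. Under this reading, one MAC unit accumulates the first sum at the higher duty cycle resolution while the other accumulates the second sum.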
C. Multiple-Banked Register File Memory
The register file memory, consisting of four banks of thirty-two 16-bit registers, is the primary storage location of the operands required during the execution of multiple control algorithms. It provides fast access to the coefficients, delay-line values, and intermediate results and facilitates multiple parallel read and write operations in each clock cycle. The register file memory architecture eliminates the need to frequently access data from an external memory source, which would reduce the time available to execute computational operations.
The processor has a concise addressing scheme in spite of the large volume of data that needs to be accessed. This is achieved by dividing the register file memory into multiple register banks and restricting access to only a single pre-selected bank in any single instruction. Thus by limiting access to a bank of thirty-two registers, a register address length of only five bits is required in each instruction, which contributes towards minimizing the size of the program memory and instruction decoding hardware. Three of the register banks are dedicated to the storage of data associated with control algorithms, whereas the fourth stores data pertaining to the execution of background code. Using greater than four register banks would lead to increased multiplexer delays and therefore an increase in the critical path of the processor.
Although allocating sixteen registers per bank with a 4-bit register address would provide sufficient storage for the data associated with the execution of two-pole, two-zero algorithms, it would not be sufficient to store the significantly larger quantity of data associated with the execution of more complex algorithms. For this reason the next largest register bank size of thirty-two registers is employed, corresponding to a 5-bit register address, which is adequate for use with more data-intensive control algorithms. If the number of registers per register bank far exceeds the number required by a single algorithm, then the data for more than one algorithm can be assigned to a single register bank.
Usually only the data associated with one power converter needs to be accessed while the algorithm for that power converter is being executed. The proposed processor actually allows a different bank to be selected automatically before executing the algorithm for the next power converter by means of the context switching functionality of the interrupt controller. Extra delays are incurred in existing DSPs when multiple DC-DC converters are being controlled because separate sets of coefficients need to be manually selected for the individual power converters in each iteration of the algorithms.
The register file memory can provide up to four operands per clock cycle. Two write ports and four read ports are needed to accommodate the memory accesses when two multiply-accumulate-with-update (MACU) operations are executed simultaneously. The MACU instruction requires reading two operands and writing one of the operands to the next location in the memory file. Two register-to-register-move operations can also be executed simultaneously whereby data can be written from one register to another register in the same register file.
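The MACU operation and its pairing across the two MAC units can be sketched as follows. This is a behavioral illustration only, assuming a 3P3Z compensator with hypothetical coefficient lists `b` (error taps) and `a` (duty cycle taps), plain Python integers in place of 16-bit registers, and no rounding or saturation.

```python
def macu(acc, coeffs, line, k):
    """One multiply-accumulate-with-update step on tap k of a delay line.

    Two operands are read (coeffs[k] and line[k]); line[k] is then written to
    the next location so the delay line is shifted as a side effect.
    """
    acc += coeffs[k] * line[k]
    if k + 1 < len(line):
        line[k + 1] = line[k]
    return acc


def compensator_3p3z_step(error, b, a, e_line, d_line):
    """Run one 3P3Z iteration, issuing up to two MACU operations per cycle.

    e_line holds [e[n], e[n-1], e[n-2], e[n-3]]; d_line holds
    [d[n-1], d[n-2], d[n-3]]. Taps are processed oldest first so that the
    update never overwrites a value that is still needed.
    """
    e_line[0] = error
    acc_e = 0          # accumulator of the first MAC unit (error delay line)
    acc_d = 0          # accumulator of the second MAC unit (duty delay line)
    cycles = 0
    for i in range(4):
        acc_e = macu(acc_e, b, e_line, 3 - i)
        if 2 - i >= 0:
            acc_d = macu(acc_d, a, d_line, 2 - i)
        cycles += 1
    duty = acc_e + acc_d
    d_line[0] = duty   # newest duty cycle enters its delay line
    return duty, cycles


if __name__ == "__main__":
    b, a = [4, 3, 2, 1], [1, 2, 3]          # illustrative integer coefficients
    e_line, d_line = [0, 0, 0, 0], [0, 0, 0]
    for sample in (1, 2, 0):
        print(compensator_3p3z_step(sample, b, a, e_line, d_line))
```

In this sketch the seven MACU operations complete in four paired cycles, whereas issuing them one at a time would take seven; each paired cycle reads four operands and writes two, which matches the four read ports and two write ports described above.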
D. Instruction Set
An assembly language-based instruction set was created for the processor containing only instructions that are relevant to the execution of power control algorithms. The instruction set comprises thirty-seven instructions, resulting in a 6-bit instruction opcode, though a smaller instruction set of thirty-two instructions or less and a 5-bit opcode would be sufficient to facilitate the execution of standard digital control algorithms and simple background tasks. A 6-bit opcode was chosen in this case in order to implement additional instructions to accommodate the execution of a wider range of background tasks.
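The field widths quoted for the opcode and the register address follow from the number of items each field must distinguish. A minimal sketch of that arithmetic is given below; the helper name `field_bits` is hypothetical.

```python
def field_bits(n_items):
    """Smallest number of bits that can encode n_items distinct values."""
    return max(1, (n_items - 1).bit_length())


if __name__ == "__main__":
    print(field_bits(37))  # opcode width for thirty-seven instructions
    print(field_bits(32))  # opcode width for a reduced set, or address width for a 32-register bank
    print(field_bits(16))  # address width for a 16-register bank
```

Thirty-seven instructions exceed the thirty-two codes available with five bits, which is why the sixth opcode bit is needed.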
Multiplication, addition and MAC instructions are the essential operations of the instruction set. Shifting is also required for scaling of results. Other operations which feature in the instruction set to deal with the limitations of digital representation include saturation and rounding operations. A compare instruction is also required in constraining the output duty cycle before passing it on to the DPWM. Data move instructions permit data to be moved between registers for temporary storage of variables and also to transfer data between the processor's register file memory and external memory or peripherals including the ADC and DPWM.
In order to take advantage of the flexibility of the dual MAC datapath architecture, the two datapaths can perform different operations in a single instruction cycle. Each instruction word may consist of the opcodes and operands for two independent operations, which are executed concurrently on the separate datapaths. Using instructions that can perform a number of operations in parallel in a single clock cycle therefore minimizes the number of clock cycles taken to execute control algorithms. Operations that involve the use of shared datapath elements may only be specified in one of the two operations that form a single instruction. For example, two shifting operations cannot be performed in parallel because there is only one shifter in the main datapath of the DSP. A number of operations are also included in the instruction set that may not be executed in parallel with any other operations. These include branch or program flow operations and also operations that have an immediate data value or memory location in the instruction word. It should be noted that there are no conditional branch instructions in the instruction set because these are not necessary in the execution of standard power control algorithms and simple background tasks, but they could be included in an alternative implementation if there were a requirement to execute tasks depending on certain conditions, for example in the execution of multi-mode control algorithms.
E. Context-Switching Interrupt Controller
The interrupt control hardware governs how the computational resources of the processor are time-multiplexed to execute control algorithms for multiple independent power converters. A separate interrupt signal is assigned to each power converter, which triggers the processor to execute the control algorithm for the assigned power converter as part of the corresponding interrupt service routine. The processor has eight interrupt inputs, Int0 to Int7, thus eight individual power converters can be controlled. Each of the interrupts is of the form of a periodic pulse, which is active for only one clock cycle for each pulse-width modulated switching cycle. The pulse occurs at a preconfigured offset from the beginning of the switching cycle corresponding to when a new sample has been acquired by the ADC. In general, only one sample is processed per power converter switching cycle.
Context switching delay is an important factor in the interrupt-based execution of multiple control algorithms when switching between the program code being interrupted and the interrupt service routine (ISR). This delay is caused by carrying out data movement operations which involve saving the data contained in the register set so that it will not be overwritten by new data during the execution of the ISR. The data must be restored to the appropriate registers of the register set after execution of the ISR. In this case, a finite state machine controls context switching operations in order to minimize context switching delay. The processor has four pre-defined contexts (C1, C2, C3, and BG), with a register bank associated with each one of them. After executing an algorithm for the control of one power converter, the processor is prepared for control of the next power converter by the FSM, a simplified version of which is illustrated in Fig. 4 . The input interrupt signals modify the output of the FSM to determine the context of the processor. The FSM thus selects the appropriate bank of the register file before any other operation of the ISR is executed, whereby two ISRs share each register bank. The FSM also independently determines the program memory address of the first instruction of the control algorithm to be executed based on the status of the interrupt signals.
When no control algorithm is being executed by the processor, it operates in background mode (BG), which provides the option of executing monitoring and communications instructions in a loop so that it does not remain idle. A return from interrupt (RETI) instruction at the end of an ISR causes the processor to enter background mode if no algorithm needs to be executed immediately. ISRs triggered by Int3 or Int7 also use the background mode register bank. In normal operation the interrupt controller does not permit an ISR to be interrupted by another interrupt signal, hence the ISR must run to completion before the next ISR can begin. Interrupts that occur simultaneously are serviced according to their fixed priority setting. Int0 has the highest priority whereas Int7 has the lowest priority. Lower priority pending interrupts are serviced when the higher priority ISR has been completed. This allows SMPCs with different switching frequencies to be controlled by the processor, as explored in more detail in the next section.
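The priority and context selection behavior described above can be sketched as a small lookup. The text states that two ISRs share each register bank and that Int3 and Int7 use the background bank; the remaining pairings used here (Int0/Int4 with C1, Int1/Int5 with C2, Int2/Int6 with C3) are an assumption made for illustration, as are the function names.

```python
CONTEXTS = ("C1", "C2", "C3", "BG")   # one register bank per context


def highest_priority(pending):
    """Return the lowest-numbered pending interrupt (Int0 has top priority)."""
    return min(pending) if pending else None


def select_context(pending):
    """Pick the next ISR and the register bank selected before it executes.

    Assumed mapping: interrupt number modulo 4, so Int3 and Int7 land on BG.
    With nothing pending the processor stays in background mode.
    """
    irq = highest_priority(pending)
    if irq is None:
        return None, "BG"
    return irq, CONTEXTS[irq % 4]


if __name__ == "__main__":
    print(select_context({5, 2}))   # Int2 is serviced first
    print(select_context(set()))    # background mode
```

Because the bank is chosen from the interrupt number alone, no register save or restore instructions are needed in the ISR itself.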
III. CONTROL IN NON-INTEGER SWITCHING FREQUENCY RATIO SYSTEMS
A. Interrupt-Triggered Control
The interrupt signals in a multi-rail SMPC system are interleaved so that each control algorithm has its own fixed time slot; however, this method is only practical when all algorithms are either executed at the same frequency or at different frequencies that are integer multiples of each other. When the execution frequencies have non-integer ratios it is impossible to simply assign time slots to each of the algorithms so that they do not overlap.
In a typical DSP, control algorithms are not interrupted during their execution to ensure that the duty cycle is calculated as fast as possible. This is usually achieved by disabling the interrupt nesting mode. If an interrupt occurs when a control algorithm is already being executed, as in the case where the interrupt frequencies have a non-integer ratio, the interrupt is not serviced until the execution of the algorithm has completed. This results in a delay in the calculation and updating of the duty cycle of the pending interrupt service routine. When multiple interrupt signals occur simultaneously, the algorithms are typically executed according to their pre-defined priority, where each interrupt has a different priority level. Thus an extra delay is again introduced between ADC-sampling and duty-cycle updating for the power converter controlled by the lower priority interrupt. This is illustrated in Fig. 5 , where Int0 has higher priority than Int1 and the interrupt frequencies have a non-integer ratio of 3/2.
Consequently, the delay between ADC-sampling and duty-cycle-updating can vary each time an interrupt is triggered, depending on whether or not multiple interrupts have occurred simultaneously or if an algorithm is already being executed. If the duty cycle has not been calculated by the beginning of the switching cycle, the DPWM will apply the duty cycle from the previous cycle, as illustrated in Fig. 6 , where the execution of ISR2 has been delayed by the higher priority Int0 and Int1 interrupts. This behavior is undesirable because a fixed loop delay is assumed when designing the compensator for the closed-loop system. The power converter could also become unstable if the delay occurs for a number of consecutive cycles.
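The variation in delay can be reproduced with a small simulation of non-nested, fixed-priority interrupt servicing. The sketch below is illustrative only: time is measured in arbitrary ticks, every ISR is assumed to take the same time, and the periods of 20 and 30 ticks are hypothetical values chosen to give a 3/2 frequency ratio.

```python
def simulate_nonpreemptive(periods, exec_time, horizon):
    """Return, per interrupt, the wait before each ISR starts.

    periods[i] is the period of Int i in ticks; a lower index means a higher
    priority. A running ISR is never interrupted (nesting disabled).
    """
    arrivals = sorted(
        (t, irq) for irq, p in enumerate(periods) for t in range(0, horizon, p)
    )
    waits = {irq: [] for irq in range(len(periods))}
    pending = []
    now = 0
    idx = 0
    while idx < len(arrivals) or pending:
        # Latch every interrupt that has fired up to the current time.
        while idx < len(arrivals) and arrivals[idx][0] <= now:
            pending.append(arrivals[idx])
            idx += 1
        if not pending:
            now = arrivals[idx][0]      # idle (background code) until next pulse
            continue
        # Service the highest priority pending interrupt to completion.
        job = min(pending, key=lambda item: (item[1], item[0]))
        pending.remove(job)
        arrival, irq = job
        waits[irq].append(now - arrival)
        now += exec_time
    return waits


if __name__ == "__main__":
    print(simulate_nonpreemptive([20, 30], exec_time=12, horizon=60))
```

With these values neither interrupt sees a constant wait from one iteration to the next, which is the behavior that violates the fixed loop delay assumed in the compensator design.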
To overcome the aforementioned problems, the delay can be fixed at its maximum possible value for each iteration of each algorithm. This is achieved by setting the interrupt instant at a sufficient offset from the beginning of the next switching period, such that when the maximum number of interrupts occurs simultaneously, the duty cycle will be calculated just in time for the beginning of the next switching cycle. Conversely, when only one interrupt occurs there will be an idle interval between when the duty cycle is calculated and the beginning of the next switching cycle if no other interrupt occurs during that time. The execution point of the lower priority algorithms thus jitters within a permitted time interval.
A problem with using the maximum fixed delay is that it is excessive and prohibits the use of wide bandwidth compensators. The performance of the voltage regulator is therefore degraded due to a much slower response to load transients [27] . Improved performance can be obtained through a reduction of this delay [28] , i.e., reducing the time between when the ADC is sampled and when the calculated duty cycle is applied. Fig. 7 shows the maximum ADC-sample to duty-cycle-update delays, $t_d$, in the situation where three interrupt signals coincide, where $t_{adc}$ is the ADC conversion delay, $t_{calc}$ is the duty cycle calculation time and $t_{pre}$ is the pre-calculation time of a particular control algorithm. The pre-calculation time is the interval during which all operations for the next iteration of the algorithm that do not require knowledge of the next voltage error sample from the ADC are executed. This facilitates a short duty cycle calculation time, $t_{calc}$, in the next iteration, where only operations involving the new sample need to be executed, thus minimizing the ADC-sample to duty-cycle-update delay. Each of the interrupts has a separate priority level whereby Int0 has the highest priority, followed by Int1 and so on. It is also assumed that the ADC hardware can convert multiple inputs in parallel. By examining Fig. 7 , an equation for $t_d$ for any interrupt can be determined as

$$t_d(n) = t_{adc} + (n+1)\,t_{calc} + n\,t_{pre} \tag{1}$$

where $n$ is the index for the particular interrupt and also indicates the number of interrupts with higher priority than that interrupt. For example, the value of $t_d$ for Int1 is given by

$$t_d(1) = t_{adc} + 2\,t_{calc} + t_{pre}. \tag{2}$$

Although the $t_{calc}$ and $t_{pre}$ values are equal for each of the algorithms in Fig. 7 , it should be noted that (1) is also valid for different values of $t_{calc}$ and $t_{pre}$, if different control algorithms are executed in each of the ISRs.
B. Modified Interrupt Control
In order to avoid the effects of variable processor delays by fixing the delay at its maximum, while keeping that maximum acceptable, it is proposed to modify the standard interrupt controller so that the maximum delay $t_d$ is reduced. Fig. 8 illustrates the resulting delays if all duty cycle calculations for coinciding interrupts are executed before any pre-calculations for the next iteration are carried out. By postponing the pre-calculations until after duty-cycle-updating, the total ADC-sample to duty-cycle-update delay, as given in (1), is reduced. The reduced delay value, $t_d'$, for each of the algorithms can be obtained from

$$t_d'(n) = t_d(n) - \Delta t_d(n) \tag{3}$$

where the reduction in delay, $\Delta t_d$, is given by

$$\Delta t_d(n) = n\,t_{pre} \tag{4}$$

with $n$ the number of higher priority interrupts and $t_{pre}$ the pre-calculation time.
The modified interrupt controller can achieve the reduced delays of Fig. 8 by performing the following tasks. If multiple interrupts occur simultaneously, the highest priority control algorithm is selected first, all other interrupts are disabled and a dedicated counter is loaded with a pre-configured duty cycle calculation time. Counting is subsequently enabled and the execution of the control algorithm commences. After the counter determines that the duty cycle calculation time has elapsed, all interrupts are re-enabled. At this stage the DPWM should have been updated with the newly calculated duty cycle value. Before the pre-calculations can begin, execution is interrupted by the highest priority pending interrupt. Again all other interrupts are disabled, the counter is reloaded, and counting is enabled. The same applies for the next priority interrupt and so on. After no further interrupts are pending, the processor continues with the execution of the pre-calculations for each of the algorithms that were interrupted.
If no interrupts are pending after the duty cycle calculation time counter expires, the execution of the control algorithm can continue immediately with the execution of the pre-calculations for the next iteration of the control algorithm. The pre-calculations may be interrupted at any stage if another interrupt signal becomes active before they have been completed. It should be noted that the duty cycle calculation time is configurable for each of the algorithms in order to provide the flexibility to execute a different algorithm for each individual power converter.
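The effect of postponing the pre-calculations can be checked with a short timeline model of coinciding interrupts. This sketch assumes that the delay for each interrupt is the ADC conversion time plus all processing that precedes the completion of its duty cycle calculation; the argument names and the numerical values in the usage example are hypothetical.

```python
def coincident_delays(t_adc, t_calc, t_pre, modified):
    """ADC-sample to duty-cycle-calculated delay for each coinciding interrupt.

    t_calc[i] and t_pre[i] are the duty cycle calculation and pre-calculation
    times of the algorithm on Int i (Int0 has the highest priority).
    Standard controller: each ISR runs to completion, so lower priority
    algorithms also wait for the pre-calculations ahead of them.
    Modified controller: all duty cycle calculations run first and the
    pre-calculations are postponed until nothing is pending.
    """
    delays = []
    now = t_adc
    for calc, pre in zip(t_calc, t_pre):
        now += calc
        delays.append(now)
        if not modified:
            now += pre
    return delays


if __name__ == "__main__":
    # Hypothetical timings in nanoseconds for three identical algorithms.
    standard = coincident_delays(100, [270] * 3, [90] * 3, modified=False)
    reduced = coincident_delays(100, [270] * 3, [90] * 3, modified=True)
    print(standard, reduced)
    print([s - r for s, r in zip(standard, reduced)])
```

In this model the saving for each interrupt equals the sum of the pre-calculation times of the higher priority algorithms, so the highest priority interrupt gains nothing and the lowest priority interrupt gains the most.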
The substantial benefits of the modified interrupt scheme can be achieved by augmenting a conventional interrupt controller with minimal additional hardware, as illustrated in Fig. 9 . The highlighted area of Fig. 9 indicates the additional components that were added to the standard interrupt controller. The main enhancement is a counter to determine when to re-enable interrupts. Some extra registers are also required. A ret_adr register is required for each of the interrupts to store the return address for the ISR if its execution is interrupted by a higher priority interrupt. The duty-cycle calculation times must also be stored for each algorithm in terms of the number of instructions required, in order to be accessed by the interrupt disable counter. These values should be loaded into special function int_cnt registers during the initialization section of the program code.
Applying this method to a multi-rail power supply system provides the designer with flexibility in choosing arbitrary switching frequencies and also optimal component values for the individual SMPCs, thereby allowing optimization of the efficiency and performance of the individual power converters.
IV. EXPERIMENTAL RESULTS
A. Implementation
The processor was implemented using the Verilog hardware description language and synthesized using the integrated synthesizer in the Quartus II design software from Altera, targeting implementation on a Cyclone II field programmable gate array (FPGA) device. The two 16 × 16 bit multipliers in the datapath of the processor were implemented using the embedded multipliers of the FPGA and the program memory was implemented using the embedded memory blocks. In order to maximize execution speed, the speed optimization synthesis option was selected. The synthesis process yielded a maximum achievable clock frequency of 64 MHz due to the critical path from the register file memory, through the datapath to the accumulator register. The multiplier and adder were found to be the main sources of latency in the datapath, though a significant proportion of the delay was also contributed by the register file memory.
The processor uses twenty-five percent of the total logic elements of the Cyclone II FPGA. Table I compares the main constituent elements of the processor in terms of the quantity of FPGA resources they require and their percentage contribution toward the total quantity of resources required by the processor. It should be noted that the contribution of the embedded memory components that are used to implement the program memory is not included in the table. For the purposes of Table I, the multipliers of the processor are implemented using the logic elements of the FPGA rather than the embedded multipliers by selecting the relevant synthesis option. This gives a clearer indication of the contribution of the computational elements to the overall resource utilization.
The table indicates that a major proportion of the logic utilized is allocated to the register file memory block. Read access to four registers and write access to two registers is required in each clock cycle. The input de-multiplexers and output multiplexers therefore contribute to a significant proportion of the overall logic requirements. The computational elements consume the next largest quantities of resources, with the remainder consumed by program control hardware and other miscellaneous components. As expected, the majority of the registers are required by the register file, with the remainder needed as configuration and pipelining registers in the program controller.
B. Experimental System
The experimental platform used to evaluate the processor in the multi-rail SMPC application consists of two interconnected parts as illustrated in Fig. 10 . The first part is a commercial FPGA evaluation board, which includes the Cyclone II FPGA, on which the processor and all other digital hardware is implemented. The second part is a printed circuit board (PCB) featuring the buck converter, load, ADCs, and sensing circuitry. The synthesized processor design was combined with the necessary digital interface hardware to allow data acquisition from the voltage-sampling ADCs. An existing DPWM design was also interfaced to the processor. Fast on-line programming of the processor was achieved using a UART connection from the FPGA board to a PC. Additional logic was required to implement clock synchronization and soft-start functionality. The system clock frequency of 33 MHz was derived from the 50 MHz oscillator on the FPGA board using one of the embedded phase-locked loops on the FPGA.
The prototype power supply system featured on the PCB consists of three identical single-phase 12 V to 1.5 V synchronous buck converters each with a 500 kHz switching frequency. Other parameters associated with the buck converter are listed in Table II . The processor was programmed using assembly language instructions to execute a standard three-pole, three-zero (3P3Z) control algorithm for application to each of the buck converters. This third order algorithm is typical of the type of control algorithm being implemented by commercial digital controllers and being reported in the literature to meet the performance requirements of modern power converters [11] , [29] .
C. Performance Comparisons
The application of the dual MAC processor to the test system allowed an evaluation of the operation and performance of the processor to be undertaken. The application of a single MAC processor to the same test system, again using a 33 MHz clock frequency, facilitated a direct comparison with the dual MAC processor in terms of control algorithm execution time and impact on power converter performance. The architecture of the single MAC processor was derived from the existing dual MAC processor design whereby most architectural features remained the same, apart from the number of MAC units in the datapath. The results of the comparison are particularly important because the computational power of the single MAC processor core is representative of the performance level of typical commercial DSP-based controllers. The modified interrupt controller proposed in Section III is also compared with a standard interrupt controller in terms of the resulting power converter performance. Fig. 11 illustrates the time-multiplexing of the single MAC processor to execute the three identical 3P3Z control algorithms, where each control algorithm is executed in 600 ns. The bus signal at the bottom of Fig. 11 indicates the interrupt service routine or control algorithm that is currently being executed. In between the execution of the algorithms, background code is executed, which is indicated by '3' in the bus signal. The time between the activation of the interrupt signal and the calculation of the duty cycle is 330 ns.
The dual MAC processor was also used to execute three 3P3Z control algorithms, which were identical to those used by the single MAC core. Fig. 12 illustrates the time-multiplexing of the three control algorithms on the dual MAC core. Although the single MAC processor can successfully execute the same algorithms for the multi-rail power converter system, it can be seen by comparing Fig. 11 with Fig. 12 that the single MAC processor does not execute each algorithm as quickly as the dual MAC processor. The dual MAC processor only requires 360 ns to execute the same 3P3Z algorithm, which is 60% of the execution time required by the single MAC processor. This results in much less background code being executed in the same time interval by the single MAC processor, which is clearly visible in Fig. 11 . The interval between when the interrupt signal is activated and when the duty cycle is calculated is 270 ns, which is also shorter than the corresponding interval for the single MAC processor. Table III presents a comparison of the execution time of the 3P3Z algorithm on the single MAC and dual MAC architectures. The table also includes the execution time required by the single MAC C28x CPU core from Texas Instruments to execute the 3P3Z algorithm. A 50 MHz clock frequency is assumed in each case. The additional computational power of the dual MAC architecture compared with the single MAC architecture is reflected in the values for the maximum switching frequency of a single power converter that can be controlled by each of the processors and the maximum number of 500 kHz rails that can be controlled. The C28x CPU requires significantly more clock cycles to execute the 3P3Z algorithm primarily due to delays associated with context switching. 
The results in Table III suggest that the dual MAC processor can provide sufficient computational performance for multi-rail SMPC applications without the increased power consumption associated with using a higher clock frequency.
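As a rough cross-check of the measured timings quoted above (and not a reproduction of Table III, which assumes a 50 MHz clock), the sketch below converts the 600 ns and 360 ns execution times into approximate clock cycles at the 33 MHz experimental clock and bounds the number of 500 kHz rails that could be serviced within one switching period. The function names are illustrative, the clock is taken as a nominal 33 MHz, and the bound ignores background code, interrupt latency and all other overheads.

```python
def cycles(exec_time_ns, f_clk_mhz):
    """Approximate clock cycles implied by an execution time at a given clock."""
    return round(exec_time_ns * f_clk_mhz / 1000.0)


def max_rails(exec_time_ns, f_sw_khz):
    """Upper bound on rails serviced once per switching period (no overheads)."""
    period_ns = 1e6 / f_sw_khz
    return int(period_ns // exec_time_ns)


if __name__ == "__main__":
    for name, t_exec_ns in (("single MAC", 600), ("dual MAC", 360)):
        print(name, cycles(t_exec_ns, 33), "cycles,",
              max_rails(t_exec_ns, 500), "rails at 500 kHz")
    print("execution time ratio:", 360 / 600)  # 0.6, the 60% quoted in the text
```

Under these assumptions, the measured times correspond to roughly 20 and 12 clock cycles, and to upper bounds of three and five 500 kHz rails, respectively.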
By sampling the output voltage as close as possible to the end of the switching cycle, the controller can react quickly to any changes in the load. The proximity of the sampling instant to the end of the switching cycle is limited by the time needed to calculate the duty cycle value once the voltage sample is available to the processor. Thus, the longer calculation time required by the single MAC processor results in the sampling instant being further away from the end of the switching cycle compared with the dual MAC processor. As a consequence of the longer delay between sampling and calculation of the duty cycle, the single MAC processor requires a longer time to react to changes in the output voltage of the DC-DC converter. Depending on the relative locations of the interrupt triggers and the load step in the switching cycle, this can lead to either a larger voltage deviation or, if the additional delay is not accounted for in the compensator design, an oscillatory or unstable response.
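The constraint described above can be illustrated with a short calculation. The sketch below assumes a single rail, takes the interrupt-to-duty-cycle intervals of 330 ns (single MAC) and 270 ns (dual MAC) reported earlier as the calculation time, and neglects ADC conversion time and any waiting caused by other interrupts; the function name is illustrative.

```python
def latest_sample_instant_ns(f_sw_khz, t_calc_ns):
    """Latest sampling instant, measured from the start of the switching period,
    that still lets the duty cycle be ready by the end of that period."""
    period_ns = 1e6 / f_sw_khz
    if t_calc_ns >= period_ns:
        raise ValueError("calculation does not fit within one switching period")
    return period_ns - t_calc_ns


if __name__ == "__main__":
    for name, t_calc_ns in (("single MAC", 330), ("dual MAC", 270)):
        instant = latest_sample_instant_ns(500, t_calc_ns)
        fraction = t_calc_ns / (1e6 / 500)
        print(f"{name}: sample no later than {instant:.0f} ns "
              f"({fraction:.1%} of the period before its end)")
```

With a 500 kHz switching frequency (a 2000 ns period), this places the latest sampling instant at 1670 ns for the single MAC timing and 1730 ns for the dual MAC timing under the stated assumptions.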
The regulation of the output voltages of the three buck converters to 1.5 V is illustrated in Fig. 13, where the dual MAC processor has been used to execute the 3P3Z control algorithm. Rails 0 and 2 are subjected to identical simultaneous 3 A positive load current steps, while a 3 A negative step is applied simultaneously to Rail 1. The difference in voltage drop between the responses of Rails 0 and 2 is due to the location of the interrupt trigger signals (Int 0 and Int 2) relative to the location of the load step in the switching cycle. Fig. 14 illustrates the output voltage response of Rail 1 where the single MAC processor has been used to execute the 3P3Z control algorithm and a positive load step of 3 A has been applied. For Rail 0 and Rail 2, the additional delay of the single MAC core led to an unstable response when using the same 3P3Z compensator. The voltage response of Rail 1 is representative of what would be obtained using a standard commercial DSP-based controller with only one MAC element. Executing the same 3P3Z control algorithm on the dual MAC processor, for the same DC-DC converter and the same 3 A load step, results in the output voltage response illustrated in Fig. 15. Comparing Fig. 14 with Fig. 15, it can be observed that the transient performance is considerably improved in Fig. 15 in terms of both voltage drop and settling time. The shorter computational delay of the dual MAC core compared with the single MAC core results in close to a 10% reduction in the output voltage drop, from 316 mV to 285 mV, which in turn leads to a 40% reduction in settling time, from 100 µs to 60 µs. The faster dual MAC response also has less overshoot and is therefore a more stable and more favorable response than that obtained using the single MAC core. The improvement in transient response obtained using the dual MAC processor is thus significant in terms of power converter performance.
In order to evaluate the performance improvement provided by the modified interrupt controller, it has been compared with the standard interrupt control method. Rail 0 and Rail 2 were configured to operate at a switching frequency of 500 kHz, whereas Rail 1 was configured to operate at a switching frequency of 495 kHz. Thus, the ratio of the Rail 1 switching frequency to the Rail 0 switching frequency is 0.99. The control algorithm for Rail 0 has the highest interrupt priority, followed by Rail 1 and then Rail 2. The Verilog code was synthesized separately using the standard and the modified interrupt controllers in order to compare the two techniques. Table IV summarizes the delays measured for each of the voltage rails for the 3P3Z compensator. It also includes the percentage reduction in delay provided by the proposed interrupt scheme. Although the modified interrupt method does not provide any reduction in delay for the highest priority interrupt, it significantly reduces the delay for all other interrupts. It should be noted that the delay for Rail 1 corresponds to the delay that would occur if the switching frequencies of the individual rails had an integer ratio.
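To illustrate why a non-integer frequency ratio is problematic for a fixed-priority interrupt scheme, the sketch below computes how the 495 kHz trigger drifts relative to the 500 kHz trigger. It is a deliberately simplified model and not the controller of Section III: both rails are assumed to start aligned at time zero, the higher-priority routine is assumed to occupy the processor for 360 ns after each of its triggers, a lower-priority trigger arriving in that window is assumed simply to wait until it ends, and all function names are illustrative.

```python
from fractions import Fraction


def trigger_offsets_ns(f_ref_khz, f_other_khz, n_triggers):
    """Offset of each trigger of the other rail from the most recent
    reference-rail trigger, in ns, using exact rational arithmetic."""
    t_ref = Fraction(10**6, f_ref_khz)
    t_other = Fraction(10**6, f_other_khz)
    return [(k * t_other) % t_ref for k in range(n_triggers)]


def blocked_time_ns(offset_ns, t_exec_ns):
    """Wait imposed on a lower-priority trigger that arrives while the
    higher-priority routine is still running (simplified model)."""
    return max(Fraction(0), t_exec_ns - offset_ns)


if __name__ == "__main__":
    offsets = trigger_offsets_ns(500, 495, 100)
    print("drift per Rail 1 period (ns):", float(offsets[1]))
    print("pattern repeats after", offsets[1:].index(0) + 1, "Rail 1 periods")
    waits = [blocked_time_ns(o, 360) for o in offsets[:99]]
    print("blocked triggers per repeat:", sum(1 for w in waits if w > 0))
    print("longest wait (ns):", float(max(waits)))
```

In this model the Rail 1 trigger slips by about 20.2 ns per period and the pattern repeats every 99 Rail 1 periods (200 µs), so the wait experienced by the lower-priority rail varies from cycle to cycle rather than remaining fixed, as it would with an integer frequency ratio.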
In the test system, identical positive 3 A load current steps were applied to the outputs of the DC-DC converters while using both the standard and modified interrupt control methods. The restriction imposed by the long delay of the standard method results in a slow response to the load current step, as illustrated in Fig. 16. The modified interrupt controller provides better performance and, with the wider bandwidth compensator, enables a faster response with less overshoot to be obtained, as Fig. 17 shows.
V. CONCLUSION
A detailed specification of an application-specific processor for multi-rail SMPC applications has been presented. The successful implementation of the ASIP as a digital controller in a multi-rail SMPC system has also been demonstrated. The dual MAC architecture has been shown to have significant performance advantages over an equivalent single MAC architecture. Fast access to relevant data, automatic context switching and the dual MAC datapath contribute to the reduction in the number of instructions required to execute control algorithms, which therefore reduces algorithm execution times compared with general purpose DSP implementations. This allows the processor to be applied in multi-rail systems that require multiple complex control algorithms to be executed, such as the demonstrated 3P3Z algorithm which consists of additional operations compared with standard PID-type compensators. It also allows more advanced power management features to be implemented in the background code and enables the processor to be used in applications where there is a requirement for the duty cycle to be calculated as fast as possible in order to react to rapid changes in the load current. Multi-rail systems where there is a non-integer ratio between the switching frequencies of the SMPCs have also been examined and a modified interrupt controller that performs significantly better than standard DSPs has been proposed and verified.
