Abstract-The cerebellar model articulation controller (CMAC) neural network has the advantages of fast convergence speed and low computation complexity. However, it suffers from a low storage space utilization rate on weight memory. In this paper, we propose a direct weight address mapping approach, which can reduce the required weight memory size with a utilization rate near 100%. Based on such an address mapping approach, we developed a pipeline architecture to efficiently perform the addressing operations. The proposed direct weight address mapping approach also speeds up the computation for the generation of weight addresses. Besides, a CMAC hardware prototype used for color calibration has been implemented to confirm the proposed approach and architecture.
I. INTRODUCTION
THE cerebellar model articulation controller (CMAC) was proposed to formulate the processing characteristics of the cerebellum [1], [2]. This model can learn nonlinear relationships from a broad category of functions. Unlike the global weight updating scheme used in backpropagation-based neural networks, CMAC is characterized by a local weight updating scheme with the advantages of fast convergence speed and low computation complexity.
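The local weight updating scheme can be illustrated with a minimal one-dimensional sketch. The class name `TinyCMAC`, the parameters `levels`, `C`, and `beta`, and the error-sharing update rule are assumptions made for illustration (the rule is the commonly used CMAC training rule, not a formula taken from this paper).

```python
class TinyCMAC:
    """Illustrative 1-D CMAC: each input level activates C adjacent weights."""

    def __init__(self, levels, C, beta=0.5):
        self.C = C                            # number of active cells per input (assumed name)
        self.beta = beta                      # learning rate (assumed name)
        self.w = [0.0] * (levels + C - 1)     # one weight per association cell

    def active(self, s):
        # Indices of the C association cells activated by quantized input s.
        return range(s, s + self.C)

    def output(self, s):
        # The output is the plain sum of the active weights.
        return sum(self.w[k] for k in self.active(s))

    def train(self, s, target):
        # Local update: only the C active weights change, sharing the error equally.
        err = target - self.output(s)
        for k in self.active(s):
            self.w[k] += self.beta * err / self.C
```

In this sketch a single training step touches only `C` weights, and neighboring inputs generalize in proportion to the number of cells they share with the trained input.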
As shown in Fig. 1 , CMAC approximates a nonlinear function by using two primary mappings [2] where is a continuous -dimensional input space, is an -dimensional association space, and is a one-dimensional output space. For the systems with multiple output dimensions, this CMAC model can be extended directly. The function maps each point in the input space onto an association vector that has nonzero elements. For a standard CMAC model, the association vector contains only binary elements, either 
To date, most of the literature on the CMAC model has concentrated on applications that simplify the control of robots [3]-[5]. Other papers have addressed learning convergence [6], [7] and new learning methodologies [8], [9] for the CMAC model. However, little of the literature discusses how to realize the CMAC model efficiently from the aspect of hardware design [10]-[12].
From the aspect of hardware implementation, CMAC can be viewed as a special look-up table structure. Generally, a CMAC hardware system consists of a weight memory module, a weight address generator, and an output computation unit. In addition, a learning unit is necessary if an on-chip learning feature is required. The weight memory module stores the weights that are used to evaluate output values. The weight address generator implements the first mapping (from the input space to the association space), that is, it evaluates the set of weight addresses corresponding to the activated association cells. The output computation unit performs the second mapping (from the association space to the output space). Two dominant factors in the hardware implementation of CMAC should be carefully handled. First, CMAC requires a huge address space in weight memory, but the weight memory usually has a very low utilization ratio. Hence, reducing the size of weight memory is an important issue. Second, the speed of weight address generation is the key factor that influences the performance of CMAC, so an architecture with a good performance/cost ratio should be developed for weight address generation. In this paper, these issues are addressed by employing a direct weight address mapping approach. Based on this approach, a multiple weight memory bank structure and a pipeline weight cell addressing mechanism are adopted, which remove the redundant weight cells and result in fast address mapping.
II. WEIGHT ADDRESS MAPPING OF CMAC MODEL
Fig. 2 illustrates the detailed operation flow of a standard CMAC model. Conventionally, the scalar output is obtained through a sequence of operations, which are described in the following paragraphs.
A. Input Quantization
To reduce the complexity of the problem, the input space is usually quantized into discrete states in a CMAC model, and the resolution of the th input dimension equals , where is the maximum quantization value over the th input dimension space. Hence, the primary input vector is quantized to a quantization vector , where represents the quantization value to which belongs. The resolution determines the precision of the CMAC model. There is a tradeoff among the accuracy of the CMAC output, the required weight memory size, and the learning effort. Finer resolution results in better accuracy but requires a larger memory.
B. Address Segment Mapping
Table I presents the mapping relation between each input quantization level and the corresponding address segment vector . When the value of is shifted from two to three, all address segments are kept at the same values except that is changed from two to six. This means that there are (three in this case) overlapped address segments for two adjacent input quantization intervals on a certain input dimension space. Besides, every address segment always occupies the same position when it appears in different address segment vectors.
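The behavior described for Table I can be reproduced with a short sketch. The function name `address_segments` and the closed form `s + ((j - s) mod C)` are assumptions, chosen because they match the example in the text (four segments, with the segment at position 2 changing from two to six when the quantization value moves from two to three); the table itself is not reproduced here.

```python
def address_segments(s, C):
    """Return the C address segments for quantization value s.

    Assumed form: the segment at position j is the smallest value >= s
    that is congruent to j modulo C, so each value always sits at the
    same position in every vector where it appears.
    """
    return [s + ((j - s) % C) for j in range(C)]


if __name__ == "__main__":
    for s in range(6):
        print(s, address_segments(s, 4))
```

Under this assumption, two adjacent quantization values always share `C - 1` segments, which is the overlap property stated above.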
C. Address Segment Concatenation
This stage performs a concatenation operation to construct the active association vector which is used to locate the active association cells. An active association cell can be located by concatenating a set of address segments where each address segment vector of different input dimensions contributes one address segment. Moreover, all the address segments being concatenated as a logical weight address are at the same position in their respective address segment vector. Such a concatenation, in fact, denotes the main-diagonal addressing scheme of weight cells, which selects the permutation of address segments in the same mapping order of each input dimension. Of course, there are other addressing schemes that might be adopted to retrieve a larger number of weights to produce output [12] , [13] .
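The main-diagonal concatenation can be sketched as follows. The helper `address_segments` and its closed form are illustrative assumptions (consistent with the Table I example in the text), and `active_cells` is a hypothetical name; each logical weight address is represented as a tuple rather than a bit string.

```python
def address_segments(s, C):
    # Assumed segment form: smallest value >= s congruent to position j modulo C.
    return [s + ((j - s) % C) for j in range(C)]


def active_cells(q, C):
    """Concatenate same-position segments of every input dimension.

    q is the quantization vector (one value per input dimension).
    The result holds C logical addresses, one per segment position.
    """
    per_dim = [address_segments(s, C) for s in q]
    return [tuple(segs[j] for segs in per_dim) for j in range(C)]
```

For a two-dimensional input this yields exactly `C` active cells, one per position, rather than the `C * C` combinations that a full permutation of segments would produce.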
D. Physical Weight Address Mapping
Each association cell represents the logical address of a weight associated with the computation of output values. This stage maps the association cells into the physical addresses of activated weight cells. The mapping depends on the real structure of the weight memory module. The simplest mapping strategy is to concatenate the address segments directly in binary bit-vector format. For example, given , we can easily get the values of and through the mappings in Table I. However, such a physical address mapping scheme results in a very sparse distribution of weight cells over the entire weight space. Therefore, the utilization ratio of the weight memory would be very small.
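The sparsity of the bit-concatenation scheme can be checked numerically on a toy configuration. This is a sketch under stated assumptions: the segment form in `address_segments`, the function names, and the choice of 8 quantization levels with 2 input dimensions are illustrative and not taken from the paper.

```python
from itertools import product


def address_segments(s, C):
    # Assumed segment form: smallest value >= s congruent to position j modulo C.
    return [s + ((j - s) % C) for j in range(C)]


def concat_address(cell, bits):
    # Concatenate the segments of one association cell as fixed-width bit fields.
    addr = 0
    for seg in cell:
        addr = (addr << bits) | seg
    return addr


def utilization(levels, n_dims, C):
    """Return (used addresses, address space size) for bit concatenation."""
    bits = (levels - 1 + C - 1).bit_length()   # width of the largest segment
    used = set()
    for q in product(range(levels), repeat=n_dims):
        segs = [address_segments(s, C) for s in q]
        for j in range(C):
            used.add(concat_address([seg[j] for seg in segs], bits))
    return len(used), 1 << (bits * n_dims)
```

Even in this small case only a small fraction of the concatenated address space is ever addressed, and the fraction shrinks as the number of input dimensions grows.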
E. Output Function
The output function simply accumulates the weights addressed by the corresponding active weight addresses (2)
III. DIRECT WEIGHT ADDRESS MAPPING
In this paper, the whole weight memory for one output dimension is divided into logical memory banks. Basically, with this memory bank structure, all weights related to the computation of an output value can be retrieved in parallel. Moreover, we propose a direct weight addressing approach which integrates the three mapping stages (address segment generation, address segment concatenation, and physical weight address mapping) into a single mapping stage, so that we can speed up the computation of weight cell addresses and remove redundant weight cells. To describe the direct mapping scheme, we first generate the weight addresses 's by assuming the weight memory is still a continuous one-dimensional array. The corresponding evaluation formula is given in the following:
where , and denotes the number of input dimensions. Fig. 3 gives an example to illustrate the effect of the direct address mapping scheme. The first column shows the value of input quantization levels. The other columns represent the corresponding generated weight addresses on weight memory banks, respectively. In each memory bank, the column shows the concatenated address segments which will be used to locate a specific weight cell for any input quantization vector. The column records the weight cell addresses derived from (3). As shown in Fig. 3 , it is obvious that there are many unused memory locations, for example, the locations of , and . Based on the memory bank structure, (4) can be used to evaluate the necessary size of each memory bank after the unused weight cells are removed (4) where is the bank index, and is the operator of mathematical ceiling function.
By applying (4), we can analyze the utilization ratio of the one-dimensional weight memory. The whole required address space equals , Also, the utilization ratio of (5) can be evaluated as follows:
By means of Theorem 1, we can easily estimate the upper bound of weight memory utilization ratio. Table II illustrates the actual and estimated utilization ratios of weight memory with different input quantization schemes. In this table, we assume that the resolution of each input dimension is the same and . Obviously, an efficient weight address generation strategy is needed to overcome the severe waste problem of weight memory. Such a strategy could be based on the idea: separating weight memory into independent memory banks and assigning those really used weight cells with continuous addresses in each bank. In the following, we will explain the approach to reach the goal.
Equation (3) is the formula that maps an association cell to its corresponding weight address. Since the 's are the one-dimensional memory addresses of weight cells, we have to further transform each into the corresponding address, , in the independent address space of the memory bank storing the desired weight cell. Equation (6) gives the mapping function for evaluating the physical weight cell addresses from an input quantization vector (6) By (6) , all the weight cells are distributed in individual memory banks. The weight cells are continuously allocated in each memory bank. Those unused weight spaces have been removed so that the total weight memory size can be maintained as small as possible for physical hardware implementation.
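The idea of allocating the used weight cells contiguously within each bank can be sketched as below. This is not a transcription of (6), which is not reproduced in this text; it is one dense mixed-radix addressing that matches the description, assuming every input dimension has the same number of quantization levels and that the per-dimension block index of bank `j` is the rounded-up quotient of `(s - j) / C`. All names are hypothetical.

```python
from itertools import product


def ceil_div(a, b):
    # Ceiling division for integers (b > 0), valid for negative a as well.
    return -(-a // b)


def bank_dims(levels, C, j):
    # Number of distinct block indices along one dimension in bank j (assumed form).
    return ceil_div(levels - 1 - j, C) + 1


def bank_address(q, levels, C, j):
    """Dense address inside bank j for quantization vector q (illustrative)."""
    addr = 0
    for s in q:
        addr = addr * bank_dims(levels, C, j) + ceil_div(s - j, C)
    return addr


if __name__ == "__main__":
    levels, C = 8, 4
    for j in range(C):
        cells = {bank_address(q, levels, C, j)
                 for q in product(range(levels), repeat=2)}
        print("bank", j, "size", bank_dims(levels, C, j) ** 2, "used", len(cells))
```

In this toy configuration every address of every bank is used, so the banks together hold only the cells that are actually addressed.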
Through Fig. 4, we can explain how (6) can be derived. In Fig. 4, each circle represents a specific weight cell where meaningful information is stored. For a given quantized input pattern , a unique set of weight cells enclosed within a hypercube is addressed to compute the corresponding output value. The location of the hypercube is determined according to the following basic rules: first, the hypercube originates at the grid where the quantized input pattern is located; second, the width of the hypercube along the direction of each input dimension is equal to quantization intervals. Since the logical address of an active weight cell is obtained by concatenating the corresponding address segments in each , all the active weight cells are distributed on the main diagonal line of the hypercube which originates from the grids , where ( mod ). Therefore, such an addressing scheme is also called the main diagonal addressing scheme. As illustrated in Fig. 4, the main diagonal addressing scheme results in a sparse distribution of actually used weight cells over the entire weight memory space. In other words, only a small subset of memory cells is really used to store the weights that contribute to the CMAC operation.
The proposed direct weight cell addressing formula can remove those redundant memory cells (the grids without circles on them) so that both the memory size and hardware cost can be reduced. The reason is as follows. First, the weight cells on the same row in memory bank are shifted grids to the left-hand side, so that each shifted weight cell is aligned to the same column as the nearest weight cell in Bank 0. Second, the weight space is squeezed into a dense one so that only the circled weight cells exist. As a consequence, the weight memory space can be implemented in memory banks, where each memory bank is composed of a sequence of blocks. The block size of each memory bank is equal to the number of weights on the same row, , as illustrated in Fig. 6. The number of blocks in the th bank is equal to , for . All the weights in any memory bank are continuously allocated and can be easily and efficiently addressed. For example, given the input pattern located in the quantized region (2, 4), the corresponding hypercube is marked with a rectangular window. There are four active weight cells in this hypercube. By employing the direct weight addressing formula, we can identify the addresses of associated weights as and
IV. ARCHITECTURE DESIGN FOR DIRECT WEIGHT ADDRESS MAPPING
For a digital, off-chip learning CMAC implementation, the key design effort lies in realizing the direct weight cell address mapping formula [(6)], which generates the address of the associated weight for each weight memory bank. Since the resolution of each input dimension is a constant value for a specific implementation, the values can be precalculated and stored in a two-dimensional matrix to reduce the hardware computation complexity. The alternative form of (6) is shown below (7) where .
Since , and are all constants when evaluating the weight address to retrieve weight from a specific weight memory bank, the data flow for evaluating (7) can be further refined to optimize the CMAC hardware architecture. The expression for calculating is rearranged into another equivalent recurrence form for (8) and
The new expressions can be treated as performing a series of cascaded operations. Fig. 5 illustrates the data flow for evaluating (8) . Such a computation flow reduces the hardware design complexity and cost. In addition, the procedures for optimizing the operation and the operation are presented in the following sections.
A. Optimizing the Operation
As shown in Fig. 5, the subtract-divide-unconditional-round (SDUR) operation needs one subtraction, one division, and one unconditional round operation to accomplish the expected function. From the viewpoints of time and area, it is too expensive to map these operations directly into physical hardware components. Therefore, we establish some design rules to implement the SDUR operation efficiently in digital hardware. According to the above two corollaries, the SDUR operation can be carried out by one integer division and one increment operator. Since the divisor is a constant value, if , where is a positive integer, the integer division operation can be implemented without any physical hardware components.
is equal to the higher bits of , and is equal to the lower bits of . Hence, for a given which is a power of two, only one increment operator affords the computation of SDUR operation.
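The power-of-two simplification can be sketched in software as follows. The sketch assumes that SDUR computes the rounded-up quotient of `(s - j) / C` for a non-negative quantization value `s` and a bank index `j` with `0 <= j < C`; that reading, the function names, and the explicit comparison on the low bits are assumptions, since the corollaries themselves are not reproduced in this text.

```python
def sdur_reference(s, j, C):
    # Subtract, divide, and round up, written with integer arithmetic.
    return -(-(s - j) // C)


def sdur_shift(s, j, m):
    """Same result for C = 2**m without a divider (illustrative).

    The quotient is the high bits of s and the remainder is the low m bits;
    the quotient is incremented when the remainder exceeds the bank index j.
    """
    C = 1 << m
    q = s >> m          # high bits: integer quotient
    r = s & (C - 1)     # low m bits: remainder
    return q + (1 if r > j else 0)
```

In hardware terms, the shift and mask are just wire selections, which is why no physical divider is needed when the divisor is a power of two.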
B. Unifying and Optimizing the Operation
As shown in Fig. 5 , all the multiplication operations can be represented as a generalized notation , where is a variable value and is a constant value for a specific pair of and . If the constant operands , for all 's and 's, equal a certain common value, , it would be preferred to implement a specific and combinational multiplier. Through the analysis for the architectural parameters of CMAC as described in the following paragraphs, we find that it would be possible and reasonable for having all the constant operands equal either a single value or two values and . If the maximum quantization values of all input dimensions equal the same constant , i.e., , (8) can be further simplified to the following form: for (9) and where , In comparison with (8), the constants for all 's are equal to the same value . As a consequence, all the multiplication operations have a common operand while evaluating bank 's weight cell address.
As defined in (9), is equal to . Since the parameters and are constants which affect the performance . Hence, the operators for evaluating different 's will be different if the implementation of the multiplication is simplified with respect to the specific operand , and they cannot be shared with one another when allocating resources. If the computations of are designed under the resource sharing paradigm, for example, the pipeline architecture, the and operators have to be unified into a sharable operator. This problem can be solved by reformulating and into and . The unified operator, which can be shared while evaluating the 's, has one more adder than the original operators. However, such a modification does not add any timing delay because the critical path (passing through one multiplication operator and one addition operator) of the unified operator is the same as that of the original operators.
C. Alternative Architectures for Implementing Weight Address Generator
As shown in Fig. 6, the optimized DFG for evaluating is simpler. Only SDUR operators and operators are needed if the elements of the input quantization vector are available at the same time. If all 's are to be evaluated in parallel, a total of SDUR operators and operators are needed to generate weight addresses for retrieving the activated weights. This architecture, shown in Fig. 7, is referred to as the parallel-in/parallel-out (PI-PO) architecture.
Such an architecture can be implemented with a fully combinational circuit. When the input vector is applied to the circuit, all weight addresses will be available when the signal propagation time of the critical path has elapsed. Then weights can be simultaneously retrieved, where denotes the number of output dimensions, in one memory read cycle. Finally, the retrieved weights are summed up to calculate the output value. The PI-PO architecture exhibits the features of design simplicity and good performance. However, it has some drawbacks: 1) all operators are invoked exhaustively; 2) the circuit size is rather large; and 3) the number of input-output (I/O) pins is so large that the cost of integrated circuit (IC) packaging is excessively high. Since most of I/O pins are used to access the weights, such an architecture is inappropriate for accessing off-chip weight memory.
In the PI-PO architecture, the critical path is composed of one quantization operator QUANT, one SDUR operator, The key task in constructing the pipeline PI-PO architecture lies in how to get an even partition for the data flow graph. An even partition means that the pipeline architecture will have shorter latency and higher throughput. The latency depends on not only the clock period (usually limited by the execution time of the slowest stage) but also the number of stages. The throughput depends on both the clock period and stage utilization. In our design, the pipeline stages are designed to be fully utilized, so the key factor that influences the performance of pipeline PI-PO would be how to partition the data flow graph (especially the critical path) of weight address generation to get a shortest clock period but do not sacrifice the latency.
Since partitioning result of critical path will dominate the final performance of the pipeline architecture, we adopt the execution time of the slowest operation as clock period of the desired pipeline. Then, we allocate as many operators to each stage as possible. By this partition strategy, throughput and latency would be balanced in our design. Now assume the weight memory access is the slowest operation among all the operators in the CMAC architecture. Then, the ratios of the execution time of the other operators with respect to the weight retrieving cycle time are labeled as follows:
: ratio of the execution time of QUANT operator with respect to the weight retrieving cycle; : ratio of the execution time of SDUR operator with respect to the weight retrieving cycle; : ratio of the execution time of operator with respect to the weight retrieving cycle; : ratio of the execution time of adder with respect to the weight retrieving cycle; By applying such an allocation strategy, at most cascaded operators and cascaded adders, which are in the critical path, can be allocated into one pipeline stage, respectively. Besides, we can also allocate the QUANT operator and the SDUR operators into the same pipeline stage if the value of is less than one. As a result, the number of pipeline stages can be determined by the following formula:
no. of pipeline stages
For the PI-PO and pipeline PI-PO architectures, all the weights are retrieved in just one memory access cycle. If the weight memory is realized with off-chip memories, the pin count of the CMAC chip will be very large. Therefore, the PI-PO and pipeline PI-PO architectures become unsuitable when off-chip weight memory is adopted. To reduce the pin count and packaging cost, the parallelism among weight access operations must be broken. All memory access operations have to be scheduled serially so that redundant pins can be removed. An alternative SI-APLPO architecture is developed to avoid this high packaging cost. In summary, the SI-APLPO architecture has these characteristics: input values are fed in serially, weight addresses are generated in a pipeline fashion, and output values become available simultaneously. The overall DFG of the SI-APLPO pipeline architecture is shown in Fig. 8. The operations in the SI-APLPO CMAC architecture are partitioned into three overlapped phases: preprocessing phase, pipelining phase, and summing-up phase.
The preprocessing phase performs the quantization operation and the SDUR operation on all the input variables. In this phase, only one quantization operator QUANT and SDUR operators are necessary. The weight address generation part for one weight memory bank invokes one SDUR to manipulate the quantization values of all input variables. Every time the SDUR completes its computation, it issues the result immediately to the latch of destination operator. Therefore, although the weight address generation is executed in a pipeline scheme, we need only one SDUR operator.
The pipelining phase is constituted by an ( )-stage pipeline, which is used to compute the weight addresses. In this phase, each weight address generation part contains operators, where each stage consists of one operator. All address generation parts are arranged in a delayed pipeline scheme, as shown in Fig. 8 , so that the operators can be shared. In other words, the whole architecture needs operators totally. By the delayed pipeline scheme, all weight addresses 's are generated sequentially. As a consequence, successive weight access cycles are needed for retrieving weights from the weight memory. In each access cycle, only weights are retrieved.
All the output values are calculated in the summing-up phase by accumulating the retrieved weights. In this phase, pipelining is not adopted to produce the output values in parallel. Each output variable invokes one adder to accumulate the weights. Therefore, we need only adders. The execution time for the three phases are , , and stages, respectively. If they are operated in a nonoverlapped scheme, the performance will be lowered. For the SI-APLPO architecture, there are overlapped stages between the execution of preprocessing phase and pipelining phase, and overlapped stages between the execution of pipelining phase and summing-up phase. Therefore, the total execution time for the SI-APLPO architecture is equal to the execution time of , i.e., , stages. As a result, the latency time for the SI-APLPO architecture equals stages for generating one output vector . The throughput of SI-APLPO is bounded by the delayed pipeline mechanism because each generation of output vector requires successive weight retrieving cycles. If , the SI-APLPO architecture generates one output vector per clock cycles because the first pipeline stage has been idle before next input vector arrives. On the other hand, if , the SI-APLPO architecture generates one output vector per clock cycles because the first pipeline stage is still busy in performing the remaining tasks allocated to it. The next input vector can not be processed until the first pipeline stage is idle. Therefore, the throughput of SI-APLPO architecture is equal to , where is the cycle time for accessing the weight memory.
The comparisons about resource utilization for the proposed architectures are listed in Table III . The data path is composed of I/O Pin, QUANT, SDUR, , and ADDER operators. The control path is composed of multiplexers and latches. Since the parallelism of data flow in PI-PO architecture is higher than the others, the number of operators in its data path is the largest. On the other hand, the complexity of control flow is in an opposite situation. The number of multiplexers and latches in SI-APLPO architecture is the largest since it needs more control mechanisms to handle the operations on data path.
The performance of these architectures is also analyzed and shown in Table IV. The row of OP Stages records the number of operation stages in the alternative architectures. The row of Acc. Cycles records the number of weight retrieving cycles for each computation of an output vector. The row of WT/Cycle records the number of weights retrieved in each weight retrieving cycle. No matter how the architecture is organized, the total number of weights to be retrieved from weight memory is always equal to . The latency time of the PI-PO architecture is the smallest of the three. As described previously, the pipeline PI-PO architecture is designed to raise the throughput while limiting the increase in latency time. Hence, the latency time of the pipeline PI-PO architecture is just slightly larger than that of the PI-PO architecture. The throughput of the pipeline PI-PO architecture is the best among the three architectures. One output vector will be generated in each time interval. In summary, 1) the SI-APLPO architecture uses fewer operators and I/O pins to implement the CMAC chip; 2) the control path of the SI-APLPO architecture is more complex than those of the PI-PO and pipeline PI-PO architectures; 3) the latency time of the SI-APLPO architecture is longer than that of the PI-PO architecture; 4) the weight memory structure of SI-APLPO is simpler than that of the PI-PO architecture; and 5) the pipeline PI-PO architecture exhibits the best performance. Which architecture is the best, however, depends on the operating requirements and the design constraints of the CMAC chip.
To implement the circuit of the CMAC chip, we employ a high-level synthesis methodology. Very high-speed hardware description language (VHDL) is used to establish the behavior model of the CMAC chip [10]-[12].
The SI-APLPO CMAC computation architecture presented in the previous section has been translated to VHDL descriptions. As a case study, a CMAC chip with off-chip weight memory for color calibration has been modeled and synthesized with the Synopsys VHDL-based synthesis tool. The specifications of this chip are: the number of input and output dimensions and are three; the weight memory is logically split into four banks, i.e., , for each output dimension; the resolutions of all input dimensions are the same and equal 45, i.e., ; the valid values of input dimensions and weights are represented in eight-bit binary data format, i.e., . For the current prototype, the SI-APLPO architecture is implemented.
A. Color Calibration with CMAC Model
The color image reproduction system based on an embedded CMAC controller is shown in Fig. 9 . For a color image reproduction system, the original color image has to be processed by a series of complex transformations between the scanning and printing devices, so the reproduced image will not equal the original image at least in both resolution and colors. The goal of the CMAC-based color calibration is to recover the colors on the reproduced image to what they should be, so that the resemblance in colors for the original image and the calibrated image can be maintained.
B. Synthesis Results of the CMAC Chip
The CCL 0.8 μm SPDM technology library, which is a standard cell library in CMOS technology, is used as the target library when using the Synopsys HLDA tool to synthesize the CMAC's VHDL descriptions. The synthesis results are summarized as follows [12]: There are 741 primitive cells invoked in this prototype chip and the equivalent gate count is 3074. The delay time of the critical path is 18.4 ns, and thus the maximum clock rate of this chip is about 50 MHz if the access time of the weight memory is less than 20 ns. The prototype chip is fabricated in an 84-pad PLCC package. The dimensions of this prototype chip are 2.5 mm × 2.8 mm for the core size and 4.6 mm × 4.6 mm for the die size.
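The quoted clock rate follows from simple arithmetic on the two delays given above. The symbols for the critical path delay, the memory access time, and the maximum clock rate are introduced here for illustration only, and the working assumes that the clock period is set by the slower of the critical path and the weight memory access.

```latex
\[
\frac{1}{t_{\mathrm{cp}}} = \frac{1}{18.4\ \mathrm{ns}} \approx 54.3\ \mathrm{MHz},
\qquad
f_{\max} = \frac{1}{\max(t_{\mathrm{cp}},\, t_{\mathrm{mem}})}
         = \frac{1}{20\ \mathrm{ns}} = 50\ \mathrm{MHz}
\quad \text{for } t_{\mathrm{mem}} = 20\ \mathrm{ns}.
\]
```

Under this reading, the logic alone could run slightly faster than 50 MHz, and the stated figure corresponds to a 20 ns clock period that also accommodates the weight memory access.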
VI. CONCLUSIONS AND DISCUSSIONS
Hardware implementation of neural networks remains an important research issue. From this work, some conclusions can be drawn about the hardware implementation of CMAC with respect to reliability, cost, and performance.
For the standard CMAC model, digital techniques can be applied to implement the hardware. Therefore, good reliability can be expected because digital very large scale integration (VLSI) manufacturing technology is mature.
In terms of cost, the size of the weight memory has been sharply reduced by the proposed direct address mapping approach. Besides, although a digital controller synthesized by means of high-level synthesis usually requires a higher gate count, the pipeline SI-APLPO architecture and several optimization approaches proposed here efficiently reduce the gate count needed to implement a CMAC controller.
In terms of hardware performance, the pipeline architecture increases the throughput. Besides, the optimizations also result in a shorter critical path in the circuit. The timing for retrieving weight cells in parallel or in series has also been optimized.
