Constant multipliers are widely used in signal processing applications to implement the multiplication of signals by a constant coefficient. However, in some applications, this coefficient remains fixed only during an interval of time, after which its value changes to adapt to new circumstances. In this article, we present a self-reconfigurable constant multiplier suitable for LUT-based FPGAs that is able to reload the constant at runtime. The pipelined architecture presented is easily scalable to any multiplicand and constant sizes, for unsigned and signed representations. It can be reprogrammed in 16 clock cycles, equivalent to less than 100 ns in current FPGAs. This value is significantly smaller than FPGA partial configuration times. The presented approach is more efficient in terms of area and speed when compared to generic multipliers, achieving up to 91% area reduction and up to 102% speed improvement for the case-study circuits tested. The power consumption of the proposed multipliers is in the range of that of the slice-based multipliers provided by the vendor.
INTRODUCTION
FPGA devices offer high computational power and reconfiguration capabilities that allow for complete customization and runtime adaptation of the circuit functionality. Many applications require these features in order to adjust the design to particular parameters at any given time step. Reconfiguration is a thriving field, but the long reconfiguration times and storage or generation of the new bitstream are still important issues [Compton et al. 2002; Dandalis and Prasanna 2005; Kalra and Lysecky 2010] .
In signal processing, it is usual that one of the multiplier operands is a constant coefficient. Thus, it is possible to optimize its hardware structure to improve area usage and computation speed with respect to conventional blocks [Gustafsson 2007; Xu et al. 2008]. However, in some applications, these constants may change in certain time steps, which prevents the use of standard constant multipliers [Bosí et al. 1999; Bouganis et al. 2009; Huang et al. 2008; Shoufan et al. 2010].

This work was partially supported by projects P07-TIC-02630 (Junta of Andalucía), TIN2006-01078 (Ministry of Education and Science of Spain), and USP-BS PPC05/2010 (University CEU San Pablo and Banco Santander). Authors' addresses: J. Hormigo, Department of Computer Architecture, University of Málaga, Málaga, Spain; email: hormigo@ac.uma.es; G. Caffarena (corresponding author), Department of Information Technologies, University CEU San Pablo, Spain; email: gabriel.caffarenafernandez@ceu.es; J. P. Oliver, Facultad de Ingeniería, Universidad de la República, Uruguay; E. Boemo, Universidad Autónoma de Madrid, Spain. © 2013 ACM 1936-7406/2013. DOI: http://dx.doi.org/10.1145/2490830
Some researchers [Chen and Chang 2009; Demirsoy et al. 2007; Turner and Woods 2004] have addressed this problem when the constant changes among several predefined values, as it does in FFT, DCT, filters, and many others. Other authors have proposed reconfigurable architectures for specific applications (e.g., FIR filters [Mahesh and Vinod 2010; Park et al. 2004]) that enable the use of a priori unknown constants. In this article, we propose a new reconfigurable constant multiplier that combines the advantages of the previous works, covering a wider range of applications (e.g., adaptive filters, neural networks, channel equalization, gain control, and cryptography).
There are many approaches that tackle constant multiplier design, but not all of them are suitable for runtime reconfiguration to any constant value. For instance, the shift-and-add method produces efficient designs in most cases, although the resulting architectures and their features (e.g., area and delay) strongly depend on the constant value [Gustafsson et al. 2006; Nguyen and Chattejee 2000]. Other ideas are based on storing the results of partial products in lookup tables (LUTs), which are added to compose the final multiplication value [Chapman 1996; Meher 2010; Wirthlin 2004]. In Wirthlin [2004], the application of such methods to FPGA designs is thoroughly studied. The main advantage of this method is that the architecture is fixed regardless of the value of the constant: only the values of the tables change. Thus, such multipliers are easier to design or to generate automatically, and they can be straightforwardly reconfigured if a different constant is required. Also, their area-time-power figure is known, since the same multiplier structure is used independently of the constant value. In this article, we select this approach as the starting point to implement an on-the-fly reconfigurable constant multiplier.
Our proposed circuit enhances the regular LUT-based constant multiplier from Wirthlin [2004], enabling real-time reconfiguration of the constant with no restrictions on the possible constant values. The latter point is achieved by dedicating a portion of the multiplier to compute the contents of the LUTs, given a particular constant value. Thus, the multiplier is capable of reconfiguring itself, and it does so without using any standard FPGA reconfiguration technique. Therefore, it can replace conventional multipliers in those applications where one of the operands changes its value only at particular time steps, being constant during long intervals of time (i.e., n clock cycles or n data computations). Domains well suited to this situation include cryptography, gain control, and channel equalization.
In this article, we use the mid-range Xilinx Spartan-3 and Virtex-4 FPGA families as the technology framework for validating the proposed ideas. Nevertheless, most of the concepts can be easily extended to current high-end FPGAs, which have a different LUT size (see Section 3.3). The work is divided into the following parts: Section 2 explains the fundamental ideas about LUT-based constant-coefficient multipliers optimized for FPGA devices. Section 3 deals with the architecture of the proposed self-reconfigurable multiplier. In Section 4, the implementation results and the comparison to other approaches are shown. Finally, conclusions are drawn in Section 5.
LUT-BASED CONSTANT MULTIPLIERS
In this section, we present the mathematical modeling which supports the design of LUT-based constant multipliers. Additionally, the specific architecture for FPGAs proposed in Wirthlin [2004] is shown, since it has been chosen as a starting point to develop our self-reconfigurable multiplier.
Theory
First, let us assume that data is composed of natural numbers. All the concepts explained can be easily applied to fixed-point real numbers. The main idea is to decompose the nonconstant m-bit multiplicand (A) into digits of the same size (i.e., q bits each):

$$A = \sum_{i=1}^{\lceil m/q \rceil} d_i \cdot 2^{q(i-1)}, \tag{1}$$

where each digit complies with $0 \le d_i < 2^q$. Therefore, the product of A by the constant K is expressed as

$$K \cdot A = \sum_{i=1}^{\lceil m/q \rceil} (K \cdot d_i) \cdot 2^{q(i-1)}. \tag{2}$$
The values of the partial products of all possible digits by the constant (i.e., from $K \cdot 0$ to $K \cdot (2^q - 1)$) are stored in tables. The final result is obtained by adding the partial products $K \cdot d_i$ after applying the corresponding shift, as shown in Equation (2). The parameter q allows a trade-off between the amount of memory and the number of additions: as q increases, fewer additions are required, but more memory is needed. For FPGA implementations, q is usually chosen to match the number of inputs of the FPGA LUT.
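The table-lookup scheme described above can be sketched in a few lines of Python. This is only a behavioral illustration, not the hardware design: the function names `build_table` and `lut_multiply` are hypothetical, and the loop index starts at zero for the least significant digit.

```python
def build_table(K, q):
    """Partial products K*0 .. K*(2**q - 1), indexed by the digit value."""
    return [K * d for d in range(1 << q)]


def lut_multiply(A, K, m, q):
    """Multiply the unsigned m-bit value A by the constant K via q-bit digit lookups."""
    table = build_table(K, q)
    num_digits = -(-m // q)              # ceil(m / q)
    mask = (1 << q) - 1
    product = 0
    for i in range(num_digits):          # i = 0 is the least significant digit
        d_i = (A >> (q * i)) & mask      # extract the i-th q-bit digit
        product += table[d_i] << (q * i) # shifted partial product K * d_i
    return product


print(lut_multiply(45, 5, m=6, q=3))
```

A larger q shortens the loop (fewer additions) but doubles the table size with each extra bit, which is the trade-off the text mentions.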
The extension of this method to signed constants is direct. The partial products must be stored in two's complement (TC) format, and the addition must use sign extension. If the multiplicand is represented in TC format, then only the most significant partial product lookup table has to be changed, since all digits ($d_i$) are unsigned except for the most significant one ($d_{\lceil m/q \rceil}$). Thus, for this digit, the partial products from $-2^{q-1} \cdot K$ to $(2^{q-1} - 1) \cdot K$ have to be stored in the table. The correct storing order is from $0 \cdot K$ to $(2^{q-1} - 1) \cdot K$, followed by $-2^{q-1} \cdot K$ to $-1 \cdot K$. Table I shows an example of the values stored for unsigned and TC digits, supposing $q = 3$.
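As an illustration of the storing order for the most significant digit, the following sketch builds both tables and uses them for a TC multiplicand. The helper names are hypothetical, and the sketch assumes that m is a multiple of q so that the sign bit is the top bit of the last digit.

```python
def digit_tables(K, q):
    """Return (unsigned_table, signed_table) of partial products for q-bit digits."""
    half = 1 << (q - 1)
    unsigned = [d * K for d in range(1 << q)]
    # Patterns 0 .. half-1 keep their value; patterns half .. 2**q-1 mean d - 2**q.
    signed = [d * K for d in range(half)] + \
             [(d - (1 << q)) * K for d in range(half, 1 << q)]
    return unsigned, signed


def tc_multiply(a_bits, K, m, q):
    """a_bits is the m-bit two's complement pattern; m is assumed to be a multiple of q."""
    unsigned, signed = digit_tables(K, q)
    n_digits = m // q
    mask = (1 << q) - 1
    total = 0
    for i in range(n_digits):
        d = (a_bits >> (q * i)) & mask
        table = signed if i == n_digits - 1 else unsigned  # only the top digit is signed
        total += table[d] << (q * i)
    return total


unsigned, signed = digit_tables(5, 3)
print(signed)
```

For q = 3 the signed table is ordered as 0·K, 1·K, 2·K, 3·K, −4·K, −3·K, −2·K, −1·K, which matches the order stated in the text.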
Architecture
There are several adder architectures able to perform the partial product summation. In this article, we focus on a cascaded array of carry-propagate adders. It provides a good trade-off between area and speed when implemented in FPGAs, and it can be directly pipelined. For the sake of clarity, the technique is illustrated for generic four-input LUTs (LUT4). The value of q is set to 3 instead of 4, since this allows for an important area reduction, as we will explain next. Figure 1 shows the content of the table for K = 5 (0101). There are a total of m/3 stages, each with an (n + 3)-bit adder and a lookup table. Note that the three least significant bits of the accumulated sum Ac(i) are directly connected to the final result (due to the shifting), and the remaining bits are the inputs of the next stage, E(i + 1). Therefore, the bit-width of the constant K determines the size of each stage E(i), whereas the size of the multiplicand determines the number of stages required.
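The cascaded organization can be modeled behaviorally as follows. This is a sketch under stated assumptions: `cascaded_multiply` is a hypothetical name, the constant is unsigned, and pipelining registers are not modeled.

```python
def cascaded_multiply(A, K, m, q=3):
    """Stage-by-stage model: each stage adds one table entry to the running sum."""
    table = [K * d for d in range(1 << q)]
    mask = (1 << q) - 1
    stages = -(-m // q)                  # ceil(m / q) stages
    acc = 0                              # upper bits handed over by the previous stage
    result = 0
    for i in range(stages):
        d_i = (A >> (q * i)) & mask
        s = acc + table[d_i]             # the (n + q)-bit addition of this stage
        result |= (s & mask) << (q * i)  # the q LSBs go straight to the output
        acc = s >> q                     # the remaining bits feed the next stage
    result |= acc << (q * stages)        # the last stage contributes its upper bits
    return result


print(cascaded_multiply(300, 5, m=9))
```

Because only the upper bits of each sum move forward, every stage needs an adder whose width depends on the constant and not on the multiplicand.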
An efficient FPGA implementation of the previous scheme, which reduces the overall size of the multiplier by 33%, is presented in Wirthlin [2004]. The author combines the table and the addition operation corresponding to one bit within a single logic cell. A LUT4 is used to implement a three-input table plus a half-addition, as shown in Figure 2 for bit position j. The three-input table outputs bit j of the partial product corresponding to digit $d_i$, which is added to the previous accumulated sum ($Ac(i)_j$). The operation is completed by means of another half-addition (corresponding to the carry signal $cin_j$) implemented by using the specific FPGA carry-logic resource. It generates the next accumulated sum $Ac(i+1)_j$ and the carry $cout_j$. As an example, Figure 2 shows the values for the least significant bit (LSB), supposing K = 5 (0101).
Hence, in order to efficiently implement the architecture shown in Figure 1(b), each stage E(i) is composed of n + 3 of the circuits shown in Figure 2 working in parallel. All these circuits are interconnected through the carry chain and have the same input $d_i$. Each LUT4 stores bit j of the addition corresponding to all possible partial products and the previous accumulated sum. It must be stressed that all stages E(i) are identical. The described architecture is valid for any constant value. Changing this parameter only involves modifying the data stored in the tables.
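A bit-level model may help to see how one LUT4 per bit plus the carry chain adds a partial product to the accumulated sum. The sketch below is an assumption-laden illustration: the function names are hypothetical, the LUT address is taken as (d << 1) | Ac bit, and the carry logic is modeled as a simple multiplexer that forwards the carry-in when the LUT output is 1.

```python
def lut4_contents(K, j, q=3):
    """16-entry truth table for bit position j; address = (d << 1) | ac_bit."""
    return [(((d * K) >> j) & 1) ^ ac
            for d in range(1 << q) for ac in (0, 1)]


def stage(d, ac_in, K, n, q=3):
    """One stage E(i): returns ac_in + d*K using n + q LUT4 cells and a carry chain."""
    luts = [lut4_contents(K, j, q) for j in range(n + q)]
    carry, out = 0, 0
    for j in range(n + q):
        ac_j = (ac_in >> j) & 1
        half = luts[j][(d << 1) | ac_j]   # LUT output: table bit XOR Ac bit
        out |= (half ^ carry) << j        # second half-addition with the carry-in
        carry = carry if half else ac_j   # carry multiplexer of the carry chain
    return out


print(lut4_contents(5, 0))
print(stage(7, 9, K=5, n=4))
```

Changing K only changes the lists returned by `lut4_contents`; the structure of `stage` stays the same, which is the property the self-reconfigurable design relies on.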
RUNTIME SELF-RECONFIGURATION
In this section, we propose a modification of the previous architecture to enable runtime self-reconfiguration. Two basic tasks must be carried out. First, a local mechanism to change the LUT4s that contain the partial products tables is included. Thus, the long transactional time involved in conventional FPGA reconfiguration is avoided. Second, the values to be stored in the LUT4s must be computed on-the-fly.
The first task can take advantage of the fact that in Xilinx FPGAs, the slices allow K-LUTs to be used also as shift registers (SRL-$2^K$ blocks). For example, in Spartan-3 and Virtex-4, it is possible to reprogram the LUT4 by simply shifting the register 16 times. In order to add runtime reconfiguration capabilities to the constant multiplier just described, we use the SRL16E primitive instead of the LUT4 primitive. The SRL16E has some extra inputs: shift enable (CE), serial input (D), and a clock signal (clk). The LUT4 is loaded through the serial input by activating the shifting. After the new constant is entered, the shifting is disabled and the block acts as a constant multiplier. Self-reconfiguration can be achieved if a method for automatically obtaining the sequence of values for reconfiguration is developed. This is dealt with in the following sections.
Unsigned Multiplicand
As we explained in Section 2.2, each LUT4 should be configured to implement the half-addition (i.e., the XOR) of the partial product $d_i \cdot K$ and the accumulated sum Ac(i) (see Figure 2). Hence, since $0 \le d_i < 8$, the partial products from $0 \cdot K$ to $7 \cdot K$ have to be stored if Ac(i) = 0. Otherwise, these values are stored negated (i.e., bitwise complemented). The position in the LUT of all these values depends on how the inputs Ac(i) and $d_i$ are connected. The easiest sequence to be serially generated occurs if Ac(i) is connected to the least significant of the four inputs of the LUT4. This sequence corresponds to the partial products sorted in increasing order, negating those in even positions, as shown in Table II. The generation of the programming sequence previously described is easily achieved by the following recurrence, where $P_k$ is the k-th partial product and $pv(t)$ is the word produced in programming cycle t:

$$P_0 = 0, \qquad P_{k+1} = P_k + K, \quad k = 0, \ldots, 6, \tag{3}$$

$$pv(2k) = P_k, \qquad pv(2k+1) = \overline{P_k}, \quad k = 0, \ldots, 7. \tag{4}$$

Figure 3 shows the circuit proposed for implementing Equations (3) and (4). Partial products (register P) are obtained serially by adding the constant (K) to the previously computed partial product, starting from the value 0, which corresponds to $0 \cdot K$. A new partial product is generated every two clock cycles until $7 \cdot K$ is reached. The XOR gates allow for selectively negating the partial products at each clock cycle. When a constant K is introduced and the signal new is activated, the programming sequence is generated and signal pe (program enable) is kept active during the 16 cycles. As an example, Figure 3 shows the programming values (pv) generated for K = 5 (0101). In each clock cycle, one column is produced, from left to right. Each row corresponds to the programming sequence ($pv_j$) associated with one LUT4 according to its bit position (j). It can be seen that the highlighted row (j = 0) coincides with the programming values in Figure 2.
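The behavior of the programming circuit can be sketched as a simple cycle loop. This is a behavioral model only: `programming_sequence` is a hypothetical name, and the choice of emitting the plain partial product first and its complement second within each pair is an assumption taken from the ordering described in the text.

```python
def programming_sequence(K, n, q=3):
    """Return the 2**(q+1) words emitted on bus pv, one per clock cycle."""
    width = n + q
    mask = (1 << width) - 1
    P = 0                                # register P, reset to 0 * K
    words = []
    for cycle in range(2 << q):          # 16 cycles for q = 3
        invert = cycle & 1               # every second cycle emits the complement
        words.append(P ^ (mask if invert else 0))
        if invert:
            P = (P + K) & mask           # next partial product every two cycles
    return words


words = programming_sequence(5, n=4)
row0 = [w & 1 for w in words]            # serial stream for the LUT4 at bit position 0
print(row0)
```

Each bit position j of the emitted words forms the serial stream for one cell, so a single adder and a row of XOR gates serve all cells of all stages.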
Figure 1 displays in gray how this element is connected to the constant multiplier in order to perform runtime self-reconfiguration. Now, each stage E(i) is constructed using n + 3 SRL16E primitives instead of the LUT4s of Section 2 (fixed constant case from Figure 2). As in the fixed case, these elements are arranged in parallel, with the carry signals connected from the least to the most significant one. The same $d_i$ is shared by all of them. Signal pe (program enable) is connected in each stage E(i) to all the SRL16E elements in order to allow the shift operation and, consequently, the reprogramming of the LUT4s. Bus pv also goes to all the stages, and each of its bits ($pv_j$) drives the serial input of the SRL16E at the corresponding bit position (j).

It must be noted that the SRL16E block shifts its bits from the least to the most significant ones. Thus, the programming sequence is generated in inverse order by the circuit just described. It is not difficult to design a circuit which generates the sequence in the correct order: it should start from the value $8 \cdot K$ (obtained by shifting the constant three bits to the left) and subtract K every two cycles. Figure 4 shows the architecture for producing the programming sequence in direct order. In the architecture shown in Figure 3, the register P is initialized to zero by using a reset signal, whereas in the new direct architecture (Figure 4), this register should be initialized to $8 \cdot K$. Thus, the reset signal cannot be used, and it is necessary to add a multiplexer at the input of register P. As a consequence, as we will show in Section 4, the circuit implemented using this approach is slower than the one based on the inverse method. For this reason, we use the inverse architecture and negate all the bits of the multiplicand before indexing the LUT, which makes both approaches equivalent.
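The order reversal caused by the shift register can be seen with a minimal model. The class below is a hypothetical stand-in for a 16-bit addressable shift register in which address 0 holds the most recently shifted-in bit. For simplicity, the sketch inverts the full four-bit address; how the Ac input is treated relative to the multiplicand bits is not modeled here.

```python
class SRL16:
    """Toy model of a 16-bit shift register with an addressable output."""

    def __init__(self):
        self.bits = [0] * 16              # bits[0] is the most recent value

    def shift_in(self, b):
        self.bits = [b & 1] + self.bits[:-1]

    def read(self, addr):
        return self.bits[addr & 0xF]


stream = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0]  # arbitrary test pattern
srl = SRL16()
for b in stream:
    srl.shift_in(b)

# The first bit shifted in ends up at address 15, the last one at address 0.
print([srl.read(a ^ 0xF) for a in range(16)] == stream)
```

Inverting the address bits is free in terms of sequencing logic, whereas generating the stream in the opposite order requires a different initialization of the partial product register.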
Signed Multiplicand
As mentioned in Section 2, only the table corresponding to the most significant digit of the multiplicand is affected, since in TC, all bits have a positive weight except for the most significant one (the sign bit), which has a negative weight. Thus, a different partial product sequence should be computed, ranging from $-4 \cdot K$ to $3 \cdot K$ and following the order described in Section 2.
In principle, this implies designing a different programming circuit for the last stage. As an alternative, we propose adding a correction step at the end of the signed multiplier to correct the result provided by the unsigned unit. Figure 5 shows the architecture used to build a signed multiplier based on the unsigned one. When the multiplicand is positive, the result from the unsigned unit does not require any correction. On the contrary, when the multiplicand is negative, its most significant bit is one, and the value $K \cdot 2^{m-1}$ has already been added to the final product of the unsigned multiplier. This value should have been subtracted instead of added, since in TC representation this bit has a negative weight. Hence, the quantity $K \cdot 2^m$ (i.e., two times $K \cdot 2^{m-1}$) is subtracted from the output of the unsigned multiplier to obtain the correct result. Figure 5 depicts how the correction step is implemented by a final subtracter which is driven by the output of the unsigned multiplier (input A) and $2^m \cdot K$ or zero (input B), depending on the sign bit of the multiplicand.
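The correction step can be checked numerically with a short sketch. The function name is hypothetical, and an ordinary integer product stands in for the unsigned LUT-based multiplier.

```python
def signed_const_multiply(A, K, m):
    """A is a signed integer in [-2**(m-1), 2**(m-1)); returns A * K via correction."""
    a_bits = A & ((1 << m) - 1)            # m-bit two's complement pattern of A
    unsigned_product = a_bits * K          # stands in for the unsigned multiplier
    sign = a_bits >> (m - 1)               # sign bit of the multiplicand
    correction = (K << m) if sign else 0   # input B of the final subtracter
    return unsigned_product - correction


print(signed_const_multiply(-3, 13, m=8))
```

The subtraction of $K \cdot 2^m$ follows from the fact that the unsigned reading of a negative m-bit pattern exceeds its TC value by exactly $2^m$.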
Extension to Different FPGA Families
The ideas explained so far can be applied to any FPGA containing configurable logic blocks that can be used as lookup tables as well as shift registers. For instance, this capability is supported by Xilinx FPGA families such as Spartan-3, Virtex-4, Spartan-6, Virtex-6, and Virtex-7. We would like to stress that our approach is not limited to LUT4-based FPGAs and that 6-LUT-based devices can also be used. For these devices, the design presented in Section 2.2 can be applied by selecting q as 5 instead of 3. In Xilinx devices, the LUT6 is implemented through two five-input LUTs (LUT5) combined by a multiplexer, and the shift register of M-slices is only associated with one of the LUT5s. Thus, if runtime reconfiguration is desired, q should be fixed to 4, and only one of the LUT5s would be used. In this case, the only difference with the architectures previously presented is that 32 cycles are required to complete the configuration of a new constant value. The runtime self-reconfiguration scheme could also be applied to other LUT-based FPGA devices, provided that a mechanism for changing the content of the LUT at runtime is available.
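The cycle counts quoted above follow from the table size: each cell takes the q digit bits plus one accumulated-sum bit as inputs, so it holds $2^{q+1}$ entries that must be shifted in one per clock cycle. As a short check, using only the values given in the text:

```latex
N_{\text{cycles}} = 2^{\,q+1}, \qquad
q = 3 \;\Rightarrow\; N_{\text{cycles}} = 16, \qquad
q = 4 \;\Rightarrow\; N_{\text{cycles}} = 32 .
```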
IMPLEMENTATION RESULTS AND COMPARISON

Area and Time Results
The advantages of the ideas proposed in this article are validated by implementing a parametric self-reconfigurable constant multiplier as a VHDL module. This block allows setting the size of both the constant and the multiplicand (N × M). Also, signed or unsigned operation can be selected. The pipeline depth is controlled by the number of partial product levels contained within each pipeline stage (E). For instance, E = 1 implies the finest-grain pipeline, and E = m/3 leads to a combinational version. The module has been synthesized with Xilinx ISE 9.2 for Spartan3E-5 devices using a wide range of parameter values (N × M × E). The results have been condensed in several tables where the least relevant parameters have been omitted.
In Table III, we present the maximum frequency of operation for the pipelined multiplier depending on the constant size (N) and the pipeline depth (E). The width of the multiplicand is omitted since it only affects the total number of stages, not the frequency, except for a combinational design. For the unsigned case, only results for fully-pipelined designs are shown, since the others are very similar to those of signed multipliers. Table IV shows the occupied area (slices) for different sizes and signed/unsigned operands. We only show the extreme cases, that is, fully-pipelined and combinational, since the differences in area for other values of E are relatively small. The area values shown in Table IV cover the Wirthlin LUT-based multiplier [Wirthlin 2004] as well as the newly added blocks: the programmer and the correction stage for the signed case. The area increase with respect to the nonreconfigurable constant multiplier proposed by Wirthlin is shown in Table V (the value of M does not affect this quantity). Thus, the efficient architecture proposed by Wirthlin for performing constant multiplication has been modified to also support constant reconfiguration with a reasonable area increase.
Regarding delay, our proposal does not modify the critical path of the constant multiplier proposed by Wirthlin, since the hardware for reconfiguration is independent of the processing data path. The only time penalty is the one required for reprogramming the LUTs when it is necessary to change the constant. The configuration time is shown in Table VI for several sizes of the constant and the two programming schemes (i.e., direct or reverse, see Section 3). The direct scheme requires about 20% more time for reconfiguration than the reverse one. It can be seen that the reconfiguration process is extremely fast, on the order of 100 ns.

As a last step, we compare the proposed design (OURS) to standard multipliers implemented using Xilinx Coregen v10. Multipliers based on both slices (SLICES) and MULT blocks (MULT) are considered. Note that one of the multiplier's inputs is registered in order to use the operator as a programmable constant multiplier.
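The relation between the reconfiguration time and the clock frequency of the programming circuit is a direct proportion. The worked values below are illustrative arithmetic only; the 200 MHz figure is a hypothetical operating point, not a measured result from Table VI.

```latex
T_{\text{reconf}} = \frac{16}{f_{ck}}, \qquad
T_{\text{reconf}} < 100\ \text{ns} \;\Longleftrightarrow\; f_{ck} > 160\ \text{MHz}, \qquad
f_{ck} = 200\ \text{MHz} \;\Rightarrow\; T_{\text{reconf}} = 80\ \text{ns}.
```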
Tables VII and VIII contain the relative increase of the maximum operation frequency (decrease for negative values) obtained using the architecture proposed in this article instead of each of the standard multipliers, for signed and unsigned operands, respectively. Except when comparing with MULT for signed operands, where there are some negative values, the use of our proposal always increases the maximum operation frequency. In general, this improvement is bigger when the size of the operand (M) increases. The speed improvements with respect to SLICES are very significant, ranging from 5.9% to 102.3% for signed, and from 5.2% to 102.7% for unsigned, with a mean of 31.8% and 39.6%, respectively. Compared to MULT, the embedded blocks are faster in approximately half of the signed cases and in less than a third of the unsigned ones. It must be noted that for values of N and M bigger than or equal to 18 bits, only 20% of MULT perform better than OURS for the signed case; as for the unsigned case, all of OURS are faster. The speedup ranges from −22.3% to 48.9% with a mean of 4.9% for the signed case, and from −16.7% to 60.3% with a mean of 12.9% for the unsigned case. This improvement, although smaller than that over SLICES, is very valuable, since MULT uses specialized multipliers directly implemented on silicon, whereas OURS uses general logic resources. Also, it must be noted that MULT beats our approach in terms of speed mainly for small-size multipliers.
In the second experiment, we have synthesized each type of multiplier unit with 49 combinations of N × M, different pipeline depths, and 26 frequencies ranging from 50 MHz to 300 MHz. An XC3S1600E-5 device is utilized as the technological framework. The comparison is made by selecting the number of pipeline stages which produces multipliers with minimum areas that comply with a certain operation frequency constraint. For the sake of fairness, the area comparison is made by using the metric that accounts for the maximum number of multiplier units that fit the device [Bouganis et al. 2009; Caffarena et al. 2009] , that is, the total number of resources available in the device divided by the number of resources used in the design, where the resources could be slices or dedicated multipliers.
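The selection procedure and the units-per-device metric can be sketched as follows. The function name and all the numbers in the usage example are hypothetical placeholders, not synthesis results, and returning zero for an unreachable frequency is a convention of this sketch.

```python
def max_units(candidates, f_target_mhz, device_resources):
    """candidates: list of (fmax_mhz, resources_used), one per pipeline depth.

    Picks the smallest design that meets the frequency constraint and returns
    how many copies of it fit in the device (0 if no candidate is fast enough).
    """
    feasible = [used for fmax, used in candidates if fmax >= f_target_mhz]
    if not feasible:
        return 0
    return device_resources // min(feasible)


# Hypothetical data: deeper pipelines are faster but larger.
designs = [(120.0, 40), (180.0, 55), (240.0, 70)]
for f in (100, 150, 300):
    print(f, max_units(designs, f, device_resources=1000))
```

Expressing area as the number of units that fit in the device allows slice-based and embedded-multiplier designs to be compared even though they consume different kinds of resources.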
Figures 6 and 7 display the maximum number of multiplier units versus frequency for several combinations of parameters. When an architecture is not able to achieve the required frequency, a zero value is plotted. For the signed case, we can see in Figure 6 that OURS clearly outperforms SLICES in area-time for the majority of cases, and it requires far fewer resources than MULT implementations. The unsigned case in Figure 7 shows a similar trend with slightly better results. The experiments cover a total of 1,274 $f_{ck} \times N \times M$ scenarios for each type of operand (signed and unsigned). For signed multipliers, there are 84 cases where OURS is the only feasible implementation for a given frequency and 56 cases where MULT is the only candidate. This never happens for SLICES implementations, since they are the slowest in all experiments. OURS has the smallest area in 647 cases, MULT in 56 cases, and SLICES in 245 cases corresponding to small-size multipliers. Considering unsigned multipliers, there are 148 cases where OURS is the only option and 18 cases where MULT is the only candidate. OURS has the smallest area in 784 cases, MULT in 18 cases, and SLICES in 184 cases corresponding to small-size multipliers.

The overall area results for signed multipliers are condensed in Table IX. It displays the average area reduction in comparison to SLICES and MULT when our approach is used. For each combination of N × M, the mean of the area reduction achieved over all frequencies tested is shown. To calculate this mean, only results corresponding to frequencies which are reached by both compared designs are used. The numbers show that our approach outperforms the other implementations substantially for bitwidths equal to or bigger than 12 bits. For the signed case, the mean area reduction when compared to SLICES ranges from −41.12% to 34.11%, with a mean for all sizes of 3.92%. However, if only multipliers with bitwidths equal to or bigger than 12 are considered, it can be observed that OURS outperforms SLICES for most cases, and the overall area reduction rises to 14.9%. For bitwidths from 18 bits, the mean area improvement is 21.6%. The area improvement with respect to MULT is more significant, and it ranges from 55.95% to 91.42%, with an overall mean of 76.32%. That means that the number of multiplier units that could be implemented in one device increases by a factor of 2× to 11× when using OURS instead of MULT. These improvements are slightly greater for the unsigned case, as shown in Table X.
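The step from an area reduction to a gain in the number of units can be made explicit. Assuming that the reduction r is measured as the relative decrease in the fraction of the device occupied by one unit (an interpretation of the metric, ignoring rounding to whole units), the gain is the reciprocal of the remaining fraction:

```latex
\frac{U_{\text{OURS}}}{U_{\text{MULT}}} = \frac{1}{1 - r}, \qquad
r = 0.5595 \;\Rightarrow\; \frac{1}{0.4405} \approx 2.3, \qquad
r = 0.9142 \;\Rightarrow\; \frac{1}{0.0858} \approx 11.7 .
```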
Power Consumption Results
In the previous section, the area-time properties of the proposed multiplier were presented, so in order to complete the characterization of the multiplier, in this section, power consumption is analyzed.
The proposed design (OURS) is compared to standard multipliers implemented using Xilinx Coregen. The comparison only includes slice-based multipliers (SLICES), since MULT blocks have a reduced power consumption. The power consumption is obtained through experimental measurements using a Digilent Spartan 3 Board with a Xilinx Spartan 3 XC3S200-FT256 FPGA device. This board is not specifically designed to perform power measurements; therefore, some modifications were made in order to measure the internal core power consumption (IO power was not measured). The on-board 1.2 V regulator was removed and replaced by a circuit that includes an external regulator and a series shunt. A calibration procedure of the shunt resistor and the measurement probes was performed. The voltage across the shunt resistor was measured with a Tektronix TDS3052C oscilloscope and with a Fluke 45 multimeter, with a relative measurement error of less than 1.5%. Figure 8 shows the block diagram of the experimental setup, which is based on connecting the circuit under test (block 1) to an on-chip test vector generator (block 2). The circuit under test is a signed integer multiplier with registered inputs and outputs. The test vector generator is composed of a digital clock manager (DCM), a linear feedback shift register (LFSR) that generates pseudorandom data, and a parity function that limits the number of FPGA outputs to one. This experimental setup minimizes the use of inputs and outputs and their influence on the design [Oliver and Boemo 2011; Wilton et al. 2004].
The board has an external oscillator of 50 MHz, but the DCM output was fixed to 100 MHz for all the experiments. The multiplier input is changed every clock cycle, and the constant is changed every 1,024 clock cycles. The proposed multipliers (OURS) are fully pipelined to meet the clock frequency, while the Coregen multipliers (SLICES) have five pipeline stages, as recommended by the Coregen tool.
The power consumption of the DCM, the LFSR, and the parity generator was measured separately and then subtracted from the total measured power. Table XI shows the power consumption of the circuits under test.
The results show that the power consumption of the proposed multiplier is in the range of that of the multipliers provided by Xilinx.
CONCLUSIONS
Standard optimized constant multipliers are not suitable for applications where constants change, which forces the use of generic multipliers. This article presents, as an alternative, a constant multiplier able to reconfigure itself at runtime to change the constant value with no restriction. Thus, this design could substitute generic multipliers in such cases. The configuration time of the proposed architecture is shorter than the partial reconfiguration times required by FPGA devices. It does not use any storage for programming data, and its size is easily parameterizable. Compared to generic multipliers based on slices, it clearly outperforms those implementations in terms of area and speed and shows similar power consumption. Compared to embedded blocks, our approach is faster in most cases (especially for unsigned operands), and it allows for implementing many more multiplier units in the same device.
We regard the application of the proposed reconfigurable constant multiplier to floating-point arithmetic as an interesting future research line. A floating-point multiplier is composed of a fixed-point multiplier with added hardware blocks to deal with the exponents of the operands as well as with the normalization and rounding of the final results. Thus, the multiplier proposed in this article could replace the fixed-point multiplier in the floating-point architecture. It would only be necessary to provide the value of the constant in a floating-point format that carries both mantissa and exponent. The mantissa value would be used as the constant to program the fixed-point architecture proposed in this article, and the exponent would be stored to be added to the exponent of the variable input.
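The mantissa and exponent split outlined above can be illustrated with a simplified numerical sketch. It is not a floating-point unit: the function names are hypothetical, only positive values are handled, and sign handling, normalization of the result, and rounding are left out.

```python
import math


def split(x, bits):
    """Split x > 0 into (integer mantissa, exponent) with x ~= mant * 2**exp."""
    frac, e = math.frexp(x)               # x = frac * 2**e with 0.5 <= frac < 1
    mant = round(frac * (1 << bits))
    if mant == 1 << bits:                 # rounding overflowed the mantissa width
        mant >>= 1
        e += 1
    return mant, e - bits


def fp_const_multiply(x, mant_k, exp_k, bits):
    """Integer mantissa product plus exponent addition, as outlined in the text."""
    mant_x, exp_x = split(x, bits)
    product = mant_x * mant_k             # the job of the fixed-point constant multiplier
    return math.ldexp(product, exp_x + exp_k)


mant_k, exp_k = split(5.0, 8)             # the constant is split once, when it is loaded
print(fp_const_multiply(3.0, mant_k, exp_k, 8))
```

In this view, only `mant_k` would be loaded into the reconfigurable multiplier, while `exp_k` would be kept in a register for the exponent adder.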
