Abstract. Using adiabatic CMOS logic instead of the more traditional static CMOS logic can lower the power consumption of a hardware design. However, the characteristic differences between adiabatic and static logic, such as a four-phase clock, have a far-reaching influence on the design itself. These influences are investigated in this paper by adapting a systolic array of CORDIC devices to be implemented adiabatically.
Functional Simulation of Adiabatic Logic
Adiabatic logic families such as Positive Feedback Adiabatic Logic (PFAL, Vetuli et al., 1996) use voltage ramps to charge and discharge the capacitances in an energy-efficient way. In contrast, traditional static CMOS charges and discharges the capacitances with steep voltage slopes. In addition to the two phases in which the clock is high or low, respectively, adiabatic logic uses two more phases of the same duration in which the clock transitions in a ramp (Fig. 1, signals $\varphi_1$ and $\varphi_2$). See Fischer et al. (2003) for a more detailed description.
The optimum operating frequency of adiabatic logic tends to be lower than that of static CMOS, leading to the desire for highly parallel designs. We will present such a design for linear signal processing in the sequel.
Due to the way basic functional blocks such as NOT and AND gates, or larger entities like a half-adder, are implemented, they are inherently synchronized with the power clock: they sample their inputs in the rising phase of their clock, and their outputs are valid during the immediately following high phase.
Correspondence to: M. Vollmer (marius.vollmer@udo.edu)

When basic blocks are connected, they automatically form a pipeline. For example, when two blocks are connected serially, they form a pipeline that can consume one input every clock cycle and produces one output every clock cycle with a delay of two phases. The second block needs to sample its inputs while the outputs from the first block are valid. Thus, its clock needs to be in the rising phase while the clock of the first block is in its high phase. In general, an adiabatic circuit therefore needs to provide four synchronized global power clocks such that in every phase all four phase kinds (low, rising, high, falling) are available. Figure 1 shows this situation for two inverters. It also shows the dual-rail encoding that adiabatic logic families use: for every input signal, one also needs to provide the logically inverted signal. For a logic one, a signal follows its associated power clock; for a logic zero, it stays at ground.
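The phase relations described above can be illustrated with a small functional model. The following Python sketch is only an illustration under stated assumptions, not the VHDL conventions of this paper: the names `clock_state` and `run_inverter_chain` are hypothetical, time is counted in phases, block number i is assumed to use the power clock shifted by i phases, and dual-rail encoding is ignored.

```python
PHASES = ("low", "rising", "high", "falling")


def clock_state(offset, t):
    """Phase kind at phase step t of the power clock shifted by `offset` phases."""
    return PHASES[(t - offset) % 4]


def run_inverter_chain(bits, n_blocks):
    """Push one bit per clock cycle through a serial chain of inverters.

    Each block samples its input while its own clock is rising; its output is
    valid in the following high phase. Returns a list of (phase step, value)
    pairs for the valid outputs of the last block.
    """
    latched = [None] * n_blocks  # value sampled by each block, None = no data
    outputs = []
    for t in range(4 * len(bits) + n_blocks + 1):
        for i in range(n_blocks):
            state = clock_state(i, t)
            if state == "rising":
                if i == 0:
                    k = t // 4  # block 0 consumes input k in cycle k
                    src = bits[k] if k < len(bits) else None
                else:
                    # The predecessor is in its high phase, so its output is valid.
                    src = latched[i - 1]
                latched[i] = None if src is None else 1 - src
            elif state == "high" and i == n_blocks - 1 and latched[i] is not None:
                outputs.append((t, latched[i]))
    return outputs


# Two inverters in series: the first input is sampled at phase step 1 and the
# corresponding output is valid at phase step 3, i.e. two phases later.
print(run_inverter_chain([1, 0, 1], 2))
```

In this model, the second block is rising exactly when the first block is high, which is why four clocks shifted by one phase each are sufficient for a chain of any length.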
To verify the lower energy dissipation of adiabatic logic as compared to static logic, one needs to perform transient simulations on the transistor and wire level with, e.g., SPICE. When designing larger circuits, like the array of CORDICs in the sequel, it is advantageous to first concentrate only on the functional aspects. This allows a simulation to complete much faster and thus mistakes can be made and corrected more quickly.
We have therefore created a simple set of conventions for VHDL that allow the description of logic blocks that are implicitly synchronized with a four-phase clock. Figure 2 shows the simplifications relative to Fig. 1, and Fig. 3 shows simulated waveforms. The dual-rail encoding is not modeled, and there is only one global clock net. Signals are valid during two phases since VHDL events happen at phase transitions and the sampled signal must be stable at this point. Nevertheless, the phase relations of two blocks can be observed in simulated waveforms and misalignments can be detected. Figure 4 shows the VHDL code for an adiabatic inverter and Fig. 5 shows how to instantiate the two inverters of Fig. 2. The phase generic of a component aligns it with one of the four phases of the global clock.
Building on these conventions, a parameterizable array of CORDIC cells has been developed that can be programmed to carry out a number of signal processing tasks.

Figure 5. VHDL code for instantiating two inverters.

Figure 6. The adiabatic architecture.
Systolic Architecture
The presented architecture is an array of locally connected CORDIC devices that resembles the familiar triangular array for computing a QR decomposition (Haykin, 1996). It is depicted in Fig. 6 together with its inputs and outputs. Such a device is able to find a transformation $\Theta$ and apply it to matrices $V_1$, $V_2$, $W_1$, $W_2$ of real numbers such that

$$\Theta \begin{bmatrix} V_1 & V_2 \\ W_1 & W_2 \end{bmatrix} = \begin{bmatrix} V_1' & V_2' \\ 0 & W_2' \end{bmatrix}. \tag{1}$$
In this equation, $V_1$, which must be upper triangular, and $V_2$ represent the values in the internal registers of the architecture. The matrices $W_1$ and $W_2$ represent the input values. The transformation $\Theta$ is chosen such that the $W_1$ matrix is annihilated. After applying $\Theta$, the matrices $V_1'$ (again upper triangular) and $V_2'$ represent the new values of the internal registers, and the matrix $W_2'$ represents the outputs. The CORDICs can be programmed at run-time to construct transformations $\Theta$ with different additional properties, see the next section. A circular cell in Fig. 6 represents a vector CORDIC: it finds an elementary 2×2 transformation that annihilates a single element of $W_1$. A square cell represents a rotation CORDIC that applies the elementary transformation found by the circular cell in its row.
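The division of labor between circular and square cells can be sketched with the textbook CORDIC iteration for circular rotations. This is a floating-point illustration under assumptions, not the bit-true cell of the architecture: the stage count `N_STAGES`, the function names and the explicit gain compensation are hypothetical choices. The vectoring function plays the role of a circular cell and returns the rotation directions as control information; the rotation function plays the role of a square cell and replays them.

```python
import math

N_STAGES = 16  # assumed number of micro-rotation stages

# Compensates the growth in vector length caused by the micro-rotations.
GAIN = math.prod(1.0 / math.sqrt(1.0 + 4.0 ** -i) for i in range(N_STAGES))


def cordic_vector(x, y):
    """Circular cell: rotate (x, y) onto the x axis, i.e. annihilate y.

    Returns the rotated pair and the control information (a sign flip and
    the direction of every micro-rotation).
    """
    flip = x < 0  # bring the vector into the convergence range
    if flip:
        x, y = -x, -y
    dirs = []
    for i in range(N_STAGES):
        d = -1 if y > 0 else 1  # always rotate towards the x axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        dirs.append(d)
    return x * GAIN, y * GAIN, (flip, dirs)


def cordic_rotate(x, y, control):
    """Square cell: apply the rotation found by the circular cell of its row."""
    flip, dirs = control
    if flip:
        x, y = -x, -y
    for i, d in enumerate(dirs):
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x * GAIN, y * GAIN


r, residual, control = cordic_vector(3.0, 4.0)   # r is close to 5, residual close to 0
a, b = cordic_rotate(1.0, 0.0, control)          # close to (0.6, -0.8)
```

Only shifts, additions and one sign decision per stage are needed, which is why each stage maps naturally onto one pipelined adder.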
The CORDIC cells have an internal feedback loop as depicted in Fig. 7. In a way, these loops create the internal registers that store $V_1$ and $V_2$. In a static design, there will be a real register to form the loop, but in an adiabatic design, the adders themselves are pipelined and behave inherently as registers.
The more micro-rotation stages there are in the CORDICs, the longer the pipeline inside each cell becomes. Waiting for the result to be fed back to the input creates large pipeline bubbles. Figure 9 depicts this effect. It shows simulated waveforms of the signals between the stages of one CORDIC. It can be seen that one must wait for a value to slowly ripple through all stages until the next meaningful computation can be started.
To get around this unfortunate situation, one can observe that a given stage in a given CORDIC cell can do the work of its right neighbor during the time it would otherwise sit idle. So instead of distributing the control information to the right, a circular cell can feed this information back to itself and assume the role of its neighbors in subsequent cycles. Those neighbors can be removed from the design. Figure 8 shows the result: a single column of CORDIC devices that can be in both the vector and rotation modes. Figure 10 confirms that the bubbles have disappeared. The length of the original bubble determines how many columns can be collapsed into one.
Linear Transformations for Signal Processing
As mentioned in the previous section, properties of the transformation matrix $\Theta$ in Eq. (1) can be controlled at run-time, which gives the array a number of operating modes. The modes and their possible applications are listed below.
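- In the orthogonal mode, $\Theta$ is chosen to be orthogonal, such that the following relationships can be established:

$$ {V_1'}^H V_1' = V_1^H V_1 + W_1^H W_1, \qquad {V_1'}^H V_2' = V_1^H V_2 + W_1^H W_2. $$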
Thus, this mode can be used to carry out a QR decomposition $X = QR$ of an arbitrary matrix by letting $V_1 = 0$ and $W_1 = X$. Then we will find $V_1' = R$. Additionally, we will find $V_2' = Q^H W_2$, which allows us to solve the least-squares problem $\min_w \|Xw - y\|$ by setting $W_2 = y$ and using a subsequent transformation in the linear mode (see below).
This mode can also be used to perform a QR updating step, which is most concisely expressed as starting from an upper triangular $R_1$ such that $R_1^H R_1 = X_1^H X_1$ and efficiently finding an upper triangular $R_2$ such that $R_2^H R_2 = X_1^H X_1 + X_2^H X_2$. This can be achieved by letting $V_1 = R_1$ and $W_1 = X_2$, leading to $V_1' = R_2$. $V_2$ and $W_2$ can be used in the same manner as above to update the right-hand side such that the solution to a least-squares problem can be updated.
Note that the QR updating step can be used repeatedly to compute an upper triangular $R$ such that $R^H R = \sum_i X_i^H X_i$ without having to access the internal registers, apart from the initialization $V_1 = 0$. In fact, the device effectively computes the QR decomposition of $X$ by successively updating the solution one row at a time until all rows of $X$ have been accounted for.
Channel estimation, channel equalization, data detectors and adaptive filters can use the methods mentioned above, for example.
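- In the linear mode, $\Theta$ has the form

$$\Theta = \begin{bmatrix} I & 0 \\ -W_1 V_1^{-1} & I \end{bmatrix},$$

leading to the computation of the Schur complement

$$W_2' = W_2 - W_1 V_1^{-1} V_2.$$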
This can be used to compute matrix-matrix multiplications, matrix inverses, solutions to systems of linear equations, and combinations thereof. For example, the first step of solving the least-squares problem $\min_w \|Xw - y\|$ has left the device in the state $V_1 = R$ and $V_2 = Q^H y$ (see above). The solution can be completed with a transformation in the linear mode by letting $W_1 = -I$ and $W_2 = 0$. Then $W_2' = R^{-1} Q^H y = (X^H X)^{-1} X^H y$. By letting $X$ be square, this method can clearly be used to solve arbitrary systems of linear equations with an arbitrary number of right-hand sides.
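To compute an arbitrary matrix-matrix product $AB$, one can start with $V_1 = 0$, $V_2 = 0$ and input $W_1 = I$, $W_2 = B$ in an orthogonal run, leading to $V_1' = I$, $V_2' = B$. A subsequent linear run with $W_1 = A$, $W_2 = 0$ will yield $W_2' = W_2 - W_1 V_1^{-1} V_2 = -AB$, that is, the product up to its sign; inputting $W_1 = -A$ instead gives $AB$ directly.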
This mode is useful for filters and other signal transformations such as an FFT, for example.
- In the hyperbolic mode, $\Theta$ fulfills

$$\Theta^H J \Theta = J, \qquad J = \begin{bmatrix} I & 0 \\ 0 & -I \end{bmatrix}.$$
This mode can be used to compute a QR downdating step, which can reverse an updating step. Similar to the updating step described above, an application of the device in hyperbolic mode with $V_1 = R_1$ and $W_1 = X_2$ will give us the upper triangular $V_1' = R_2$ such that $R_2^H R_2 = R_1^H R_1 - X_2^H X_2$.
The hyperbolic mode can also be used to carry out one step of the Schur algorithm and can thus be used to efficiently compute the QR decomposition of a matrix with a Toeplitz-derived structure (Kailath and Chun, 1994). These matrices appear in time-invariant single- and multi-user systems (Vollmer et al., 1999, 2001).
- The set mode is provided to initialize the array. It performs assignments in which $W_1 \in \mathbb{R}^{1 \times n}$ is a row vector whose elements are put on the diagonal of $V_1$.
Initializing the diagonal of $V_1$ is useful to compute the best linear estimator in white noise, for example, which is similar to a least-squares solution and is given by the formula

$$w = (\sigma^2 I + X^H X)^{-1} X^H y.$$
Setting $V_1 = \sigma I$, $V_2 = 0$ via a run in the set mode, and then inputting $W_1 = X$, $W_2 = y$ for a run in the orthogonal mode will lead to $V_1' = R$ such that $R^H R = \sigma^2 I + X^H X$ and $V_2' = R^{-H} X^H y$. This can be transformed into the desired solution with a final run in the linear mode: as previously, $W_1 = -I$ and $W_2 = 0$ will lead to $W_2' = V_1^{-1} V_2 = (\sigma^2 I + X^H X)^{-1} X^H y$.
- Finally, the copy mode can be used to retrieve $V_2$ in case it is needed, such as with the Schur algorithm or when the Q factor of a QR decomposition is needed explicitly. It copies $V_2$ to the output $W_2'$.
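The interplay of the orthogonal and linear modes listed above can be checked numerically. The following Python sketch is a behavioral model under assumptions, not the CORDIC array itself: it processes one input row at a time in floating point, uses exact Givens rotations for the orthogonal mode and elementary eliminations for the linear mode, and the function names are hypothetical.

```python
import math


def orthogonal_step(V1, V2, w1, w2):
    """Orthogonal mode for one input row: annihilate w1 against V1.

    V1 (upper triangular) and V2 are updated in place; the output row is returned.
    """
    w1, w2 = list(w1), list(w2)
    n = len(V1)
    for k in range(n):
        r = math.hypot(V1[k][k], w1[k])
        if r == 0.0:
            continue  # nothing to annihilate in this row
        c, s = V1[k][k] / r, w1[k] / r
        for j in range(k, n):  # circular cell (j == k) and the V1 part of the row
            V1[k][j], w1[j] = c * V1[k][j] + s * w1[j], -s * V1[k][j] + c * w1[j]
        for j in range(len(w2)):  # square cells holding V2
            V2[k][j], w2[j] = c * V2[k][j] + s * w2[j], -s * V2[k][j] + c * w2[j]
    return w2


def linear_step(V1, V2, w1, w2):
    """Linear mode for one input row: returns w2 - w1 * inv(V1) * V2."""
    w1, w2 = list(w1), list(w2)
    n = len(V1)
    for k in range(n):
        f = w1[k] / V1[k][k]
        for j in range(k, n):
            w1[j] -= f * V1[k][j]
        for j in range(len(w2)):
            w2[j] -= f * V2[k][j]
    return w2


def least_squares(X, y):
    """Solve min ||Xw - y|| with an orthogonal run followed by a linear run."""
    n = len(X[0])
    V1 = [[0.0] * n for _ in range(n)]
    V2 = [[0.0] for _ in range(n)]
    for row, yi in zip(X, y):  # orthogonal mode: W1 = X, W2 = y
        orthogonal_step(V1, V2, row, [yi])
    # Linear mode: W1 = -I, W2 = 0, one row of -I per step.
    return [
        linear_step(V1, V2, [-1.0 if j == i else 0.0 for j in range(n)], [0.0])[0]
        for i in range(n)
    ]


w = least_squares([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 2.0, 2.0])
print(w)  # approximately [1.1667, 0.5], the solution of the normal equations
```

Feeding the rows of $X$ one after the other is exactly the repeated QR updating step described for the orthogonal mode; the registers are only touched once, for the initialization with zeros.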
Example: Adaptive RLS Filter
An adaptive RLS filter alternates between estimating and equalizing the transmission channel. The filter that equalizes the channel is modeled as an FIR filter, and the estimation is performed while a known training sequence $y_1$ is transmitted. The coefficients $w$ of the equalization filter are chosen such that $\|X_1 w - y_1\|$ is minimized, where $X_1$ is the convolution matrix of the signal received during the training period. The convolution matrix $X$ of a sequence $\{\ldots, x_{i-1}, x_i, x_{i+1}, \ldots\}$ has a Toeplitz structure: each row contains a window of consecutive samples, shifted by one sample relative to the row above.
The equalization is then performed by computing $y_2 = X_2 w$, where $X_2$ is the convolution matrix of the received signal during the payload period. The CORDIC device presented above is well suited to carry out this task. The estimation phase first sets $V_1 = 0$, $V_2 = 0$ and then carries out the QR decomposition of $W_1 = X_1$ and $W_2 = y_1$ as explained above, giving $V_1' = R$ and $V_2' = Q^H y_1$. The equalization phase runs a subsequent linear mode transformation with $W_1 = -X_2$, $W_2 = 0$, computing $W_2' = X_2 R^{-1} Q^H y_1 = X_2 w = y_2$.
The convolution matrices X 1 and X 2 are constructed implicitly by connecting the outputs of a delay-line to the inputs of the device. The device can be simply switched from the training mode to the filter mode by inputting zeros instead of the training sequence and switching the mode from orthogonal to linear.
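How a delay line produces the rows of a convolution matrix can be shown in a few lines of Python. This is a sketch under assumptions: the delay line is taken to start out filled with zeros and to present the newest sample first, and the names `convolution_rows` and `fir_output` are hypothetical.

```python
def convolution_rows(samples, n_taps):
    """Yield the rows of the convolution matrix of `samples`.

    Each row is the content of a delay line with n_taps outputs, newest
    sample first, after one more sample has been shifted in.
    """
    delay_line = [0.0] * n_taps
    for x in samples:
        delay_line = [x] + delay_line[:-1]  # shift by one sample
        yield list(delay_line)


def fir_output(rows, w):
    """Compute X w row by row, i.e. the output of an FIR filter with taps w."""
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in rows]


rows = list(convolution_rows([1.0, 2.0, 3.0], 2))
# rows == [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]: constant along each diagonal
y = fir_output(rows, [0.5, 0.25])
# y == [0.5, 1.25, 2.0]
```

Because every row is available at the delay-line outputs in the cycle in which its newest sample arrives, the matrix never has to be stored; the rows can be fed to the array directly as $W_1$.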
In order to allow the filter to gradually forget the past, it is customary to change the scaling factor of each CORDIC such that each elementary orthogonal rotation reduces the length of the involved vector by a factor of, say, 0.97.
Conclusions
The well-known systolic QR array can be generalized to also compute a wide variety of linear signal processing tasks. Implementing this generalized array with adiabatic logic offers opportunities for significant low-level optimizations that find uses for hardware resources that would otherwise sit idle. The array has been simulated with a bit-true and phase-true VHDL model by making use of a general VHDL package that allows the description of adiabatic logic on a functional level.
The result is a highly parallel, highly efficient data flow processor that can compute matrix-matrix products, matrix inverses, solutions to systems of linear equations, QR decompositions, least-squares solutions to overdetermined systems of equations, QR up- and down-dating steps, the core tasks of the Block-Schur algorithm, and Best Linear (Unbiased) Estimates.
