Abstract-We present a method and organization of an arithmetic array processor for solving tridiagonal systems of linear equations. The method uses online arithmetic approach which allows parallel computation of the result digits of the solution vectors. The basic operators are digit-vector by digit multiplication and redundant addition which results in precisionindependent cycle time. The method takes about m carry-free cycles to obtain m digits of the solutions. Details of a processor array organization implementing the method and a comparison with a conventional approach are discussed.
INTRODUCTION
Tridiagonal (TD) systems are frequently used in numerical methods [2] , [7] . Examples include finite-difference approximations to partial differential equations and interpolation with splines. Besides using sequential algorithm such as the Thomas' algorithm [11] , many parallel TD system solvers have been developed for supercomputers of array and pipeline type, the main approaches being cyclic odd-even reduction and recursive doubling methods [8] , [1] , [9] , [10] . All these solvers were implemented in software. Various scheme were proposed to alleviate corresponding communication problems between processors and memories. This paper presents a direct hardware-oriented approach for solving TD systems intended for application-specific architectures or accelerator coprocessors. Of particular interest are reconfigurable architectures implemented with FPGAs and softcore processors allowing efficient implementation of the primitive arithmetic operations used by the proposed method.
I. THE PROPOSED APPROACH
We solve the linear diagonally-dominant system L L : A · y = b (1) iteratively using MSDF serial arithmetic [3] , [6] , known as the E-method. The coefficient matrix A of order N is
For convergence, the coefficient matrix has to be diagonally dominant as indicated above, i.e., j =i |a i,j | < 1 with more specific constraints given later in the paper.
The solution vector is y = (y 1 , . . . , y N ) and the right-hand side vector is
The algorithm for solving the system L is
where:
• The output digit selection function uses rounding of vk [j] , an estimate of vk [j] as argument, truncated to one fractional bit:
For radix 2 case, the selection function reduces to
• The result digit-vector at step j is
where digit dk j ∈ {−1, 0, 1} is the j-th digit of the k-th solution component
• CON V ERT performs on-the-fly conversion [6] of the redundant result digits produced serially into a conventional digit-parallel form. We now discuss the convergence conditions of the algorithm. For simplicity, we focus on radix-2 iterations. Adaptation to higher radices is straightforward. The iterations converge to the desired result if vector w[j] is bounded. Define constants ξ, α and Δ (the overlap between the selection intervals 0 ≤ Δ < 1) such that
for each row of the coefficient matrix A and
Since |d k(j−1) − wk[j − 1]| ≤ 1/2 due to the rounding rule of the selection function, from the residual recurrence we find
To achieve this bound, we must assure that a suitable choice of d kj in {−1, 0, 1} is possible. This implies that |wk[j]| should not be larger than 3/2, giving immediately the following condition
For Δ = 0 (no overlap, non-redundant residual), the bound on the sum of off-diagonal elements (absolute value) in a row is α = 1/4. For Δ = 1/4, α = 1/8. Since |w k [0]| must not be larger than 3/2, we get for the bound on the initial values
The E-method has the following features:
• Each solution component y k is evaluated on a separate MSDF multiply-add module.
• The module uses a digit-vector × digit multiplication and a redundant addition, i.e., carry-save or signed-digit form.
• The method eliminates the use of explicit division.
• For higher radix r, which has stricter convergence conditions, suitable prescaling is used.
A. Basic Module
The basic module implements the residual recurrence
(for i = 1) and the corresponding digit selection function
The module organization is shown in Figure 1 . It consists of a [4:2] adder, two multiple generators which are implemented as digit vector by digit multipliers, and four registers. To avoid carry-propagation in multiple generators, the output can be obtained in a redundant form (e.g., sumcarry vectors) and the residual adder becomes a [6:2] adder. The output digit is selected as mentioned above. There are also four registers storing two coefficients, and a sum-carry representation of the residual. In the case of radix 2, the multiple generators are reduced to multiplexers. Figure 2 depicts a general organization of a TD solver: for an N -the order system, N basic modules are used. These modules have two digit-serial inputs, connected to the left and right adjacent modules (except the first and the last module), one digit-serial output, and two parallel ports per module to load the coefficients. The output from each module is in a redundant form which, if needed, can be converted using onthe-fly converters into a conventional 2's complement parallel form. There is no extra delay of this conversion, i.e., it is performed in parallel with the main digit recurrence.
B. General Scheme for a TD System Solver
The approach allows easy extension of the order of the system being solved or/and of the precision. If modules are augmented by a 2k-register FIFO, the N module scheme can implement a solver for a TD system of order kN , or, alternately, handle precision k times the precision of the basic module. In either case, the recurrence cycle cycle is k times longer than the basic module cycle. Fig. 2 . General Scheme for a TD Solver.
parallel serial Initialization inputs not shown

II. EXAMPLE 1: ONLINE IMPLICIT METHOD FOR SOLVING PDES
As an example of the application, we present the use of the proposed approach in implementing the implicit method for solving partial differential equations, originally developed in [12] . The parabolic equation with appropriate initial and boundary conditions
can be solved by implicit methods in which derivatives with respect to time are approximated by backward differences. These methods are always computationally stable. However, such a method requires solving a sparse system of linear equations. We show how to use the E-method in solving this system. The finite difference equation for system (8) at the point (i, t) is
Arranging (9) leads to the following finite difference system for the E-method
where
Δt and 0 ≤ t ≤ N Δt. This system corresponds to the following tridiagonal system of linear equatuions suitable for the application of the Emethod:
where Φ b1 , Φ bn are the boundary conditions. For the convergence of the E-method it was established that p ≤ 0.2,which can be achieved by Δt ≤ kΔx 2 /3 [12] . The expressions on the right-hand side are computed using two left-to-right carry-free (LRCF) multipliers [4] and online adder [6] , combined into OMAU module with a total online delay δ RHS = 3+2 for r = 2. The TD solver uses a modified E-method in which the right-hand side elements are used in the MSDF manner [12] , [5] . Its online delay is δ E = 2. The total online delay for one time sweep is Δ = 7 (plus 1 cycle to output the result digit). For a fully unfolded implementation which has the lowest latency, we need m/8 × n OMAUs and E-method basic modules. Figure 3 illustrates the overall scheme. 
Fig. 3. Unfolded Implementation of Parabolic Equation Solver
The cycle time t cycle of the recurrence loop of the Emethod, measured in full-adder delays (t F A ), is estimated as
is larger than the OMAU cycle. The component delays are:
• t SEL = 1.5t F A is the delay of the selection function, • t MG = 0.5t F A is the delay of the multiple generator network, • t [4:2] = 1.5t F A is the delay of the redundant adder,
The overall delay is estimated as
The time of a sequential implementation of the Thomas' algorithm for solving tridiagonal systems [11] , assuming the same cycle time for all operations, is
Consequently, the proposed scheme has a speedup O(mn) with respect to the Thomas' sequential algorithm. Clearly, the cost of the proposed scheme is much higher than that of a sequential implementation. Details of the cost comparison will not be discussed here. 
III. EXAMPLE 2: SOLVING TRIDIAGONAL SYSTEMS WITH SPECIAL COEFFICIENT MATRICES
The residual recurrence (shown for i = 1)
is very simple and leads to a greatly simplified implementation of the basic module: a residual in a nonredundant (2's complement) form is sufficient; the adder for radix 2 k is k + 1 + 3 bits wide. For r = 2, the adder can be replace by an 8-input, 2-output combinational network.
The simplified module implementation for radix 2 case is shown in Figure 4 . We estimate the cycle time as
and the total time to solve the TD system as
The cost of a module is roughly
and the total cost of the solver for N = 6 shown in Figure 5 is
which is much less than a cost of a single m × m multiplier. The cost of on-the-fly converters is about 6 × mC F A . The TD solver for N = 6 is shown in Figure 5 .
IV. SUMMARY
We have investigated online algorithms for a highly parallel TD solver which consists of a linear array of basic modules. Each basic module implements an online multiply-add operator with a cycle time independent of precision due to the redundancy in representation of the residuals. is m cycles for m digits of precision, i.e., equivalent to the latency of a serial-parallel multiplier. The basic module has a simple implementation and it communicates using digitserial method. This is advantageous in physical realization. The proposed approach allows also simple mapping of larger TD systems or/and larger precision on a fixed array of given precision. The approach is suitable for variable precision as well as for compound algorithms. The work in progress include mapping of a TD solver to FPGA platforms with soft cores which allow custom instruction sets. Such a set of special instructions implementing primitive operations of the E-method would simplify the use of the proposed arithmetic processing approach.
