ARCHITECT: Arbitrary-precision Hardware with Digit Elision for Efficient
  Iterative Compute by Li, He et al.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1
ARCHITECT: Arbitrary-precision Hardware with
Digit Elision for Efficient Iterative Compute
He Li, Student Member, IEEE, James J. Davis, Member, IEEE, John Wickerson, Senior Member, IEEE,
and George A. Constantinides, Senior Member, IEEE
Abstract—Many algorithms feature an iterative loop that
converges to the result of interest. The numerical operations in
such algorithms are generally implemented using finite-precision
arithmetic, either fixed- or floating-point, most of which operate
least-significant digit first. This results in a fundamental problem:
if, after some time, the result has not converged, is this because
we have not run the algorithm for enough iterations or because
the arithmetic in some iterations was insufficiently precise? There
is no easy way to answer this question, so users will often over-
budget precision in the hope that the answer will always be to run
for a few more iterations. We propose a fundamentally new ap-
proach: with the appropriate arithmetic able to generate results
from most-significant digit first, we show that fixed compute-
area hardware can be used to calculate an arbitrary number of
algorithmic iterations to arbitrary precision, with both precision
and approximant index increasing in lockstep. Consequently,
datapaths constructed following our principles demonstrate ef-
ficiency over their traditional arithmetic equivalents where the
latter’s precisions are either under- or over-budgeted for the
computation of a result to a particular accuracy. Use of most-
significant digit-first arithmetic additionally allows us to declare
certain digits to be stable at runtime, avoiding their recalculation
in subsequent iterations and thereby increasing performance
and decreasing memory footprints. Versus arbitrary-precision
iterative solvers without the optimisations we detail herein, we
achieve up-to 16× performance speedups and 1.9× memory
savings for the evaluated benchmarks.
Index Terms—Arbitrary-precision arithmetic, hardware archi-
tecture, online arithmetic, field-programmable gate array.
I. INTRODUCTION
IN numerical analysis, an algorithm executing on the realnumbers, R, is often expressed as a conceptually infinite
iterative process that converges to a result. This is illustrated
in a general form by the equation
x(k+1) = f
(
x(k)
)
,
in which the computable real function f ∈ (RN → RN) is
repeatedly applied to an initial approximation x(0) ∈ RN .
The true result, x∗, is obtained as k approaches infinity, i.e.
x∗ = lim
k→∞
Π
(
x(k)
)
,
where the operator Π denotes projection of the variables
of interest since the result may be of lower dimensionality
than N . Examples of this template include classical iterative
methods such as Jacobi and successive over-relaxation, as
The authors are with the Department of Electrical and Elec-
tronic Engineering, Imperial College London, London, SW7 2AZ,
United Kingdom. E-mail: {h.li16, james.davis, j.wickerson,
g.constantinides}@imperial.ac.uk.
Algorithm 1 Generic finite-precision iterative algorithm.
Require: xˆ(0) ∈ FPNP , fˆ ∈
(
FPNP → FPNP
)
1: for k = 0 to K − 1 do
2: xˆ(k+1) ← fˆ
(
xˆ(k)
)
3: end for
Assert:
∥∥∥Π(xˆ(K))− x∗∥∥∥ < η
well as others including gradient descent methods, the key
algorithms in deep learning [1].
In practice, these calculations are often implemented using
finite-precision approximations such as that shown in Algo-
rithm 1, wherein FPP denotes some finite-precision datatype,
P is a measure of its precision (usually word length), fˆ is
a finite-precision approximation of f and η is an accuracy
bound. The problem with this implementation lies in the
coupling of P and iteration limit K. Generally, this algorithm
will not be able to ensure that its assertion passes, and when
it fails we are left with no knowledge as to whether K should
be increased or if all computations need to be thrown away
and the algorithm restarted with a higher P instead.
As a simple demonstration of this problem, suppose we wish
to compute the toy iteration
x(k+1) = 1/4− 1/6 · x(k)
starting from zero.
When performing this arithmetic using a standard approach
in either software or hardware, we must choose a single,
fixed precision for our calculations before beginning to iterate.
Fig. 1a shows the order in which digits are calculated when
the precision is fixed to six decimal places: approximant-by-
approximant, least-significant digit (LSD) first. Choosing the
right precision a priori is difficult, particularly with respect to
hardware implementation. If it is too high, the circuit may be
unnecessarily slow and power-hungry, while, if it is too low,
the criterion for convergence may never be reached.
Our proposal, illustrated in Fig. 1b, avoids the need to
answer the aforementioned question entirely. The digits are
calculated in a zig-zag pattern, sweeping through approximants
and decimal places simultaneously. The longer we compute,
the more accurate our result will be; the computation can
terminate whenever the result is accurate enough. This avoids
the need to fix the precision beforehand, but requires the
ability to calculate from most-significant digit (MSD) first:
a facility provided through the use of online arithmetic [2].
ar
X
iv
:1
91
0.
00
27
1v
1 
 [c
s.A
R]
  1
 O
ct 
20
19
2 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
x(1):
x(2):
x(3):
x(4):
x(5):
x(6):
x(7):
x(8):
.
.
.
.
.
.
.
.
0 2 5 0 0 0 0
0 2 0 8 3 3 3
0 2 1 5 2 7 7
0 2 1 4 1 2 0
0 2 1 4 3 1 3
0 2 1 4 2 8 1
0 2 1 4 2 8 6
0 2 1 4 2 8 5
(a) Conventional arithmetic, approxi-
mants calculated LSD first.
.
.
.
.
.
.
.
.
0 2 5 0 0 0 0 0 0
0 2 0 8 3 3 3 3 3
0 2 1 5 2 7 7 7 7
0 2 1 4 1 2 0 3 7
0 2 1 4 3 1 3 2 7
0 2 1 4 2 8 1 1 2
0 2 1 4 2 8 6 4 7
0 2 1 4 2 8 5 5 8
(b) Our proposal, MSD first.
.
.
.
.
.
.
.
.
0 2 5 0 0 0 0 0 0
0 2 0 8 3 3 3 3 3
0 2 1 5 2 7 7 7 7
0 2 1 4 1 2 0 3 7
′′ 2 1 4 3 1 3 2 7
′′ ′′ 1 4 2 8 1 1 2
′′ ′′ 1 4 2 8 6 4 7
′′ ′′ ′′ 4 2 8 5 5 8
(c) Our proposal, MSD first with don’t-change digit
elision. ′′-marks indicate don’t-change digits, with
the solid line representing the region’s boundary.
Fig. 1. Digit-calculating strategies for the solution of x(k+1) = 1/4− 1/6 · x(k) starting from x(0) = 0. Arrows show the order of digit generation.
While general-purpose processors featuring traditional, LSD-
first arithmetic units exhibit inefficiency for the realisation
of online arithmetic, field-programmable gate arrays (FPGAs)
represent excellent platforms for the implementation of such
MSD-first operations [3]–[5].
As originally formulated, this method is somewhat ineffi-
cient since the triangular shape traced out results in the com-
putation of more digits than is actually needed. In the bottom-
left corner lie high-significance digits of later approximants;
these generally become stable over time, so we call them
don’t-change digits. By detecting the presence and avoiding
the recomputation of these digits, we arrive at a digit pattern
such as that shown in Fig. 1c. This increases efficiency while
having no bearing on the chosen iterative method’s ability to
reach a result of any accuracy.
The proposed architecture, coined ARCHITECT (for
Arbitrary-precision Constant-hardware Iterative Compute), is
the first to allow the runtime adaption of both precision
and iteration count for iterative algorithms implemented in
hardware. We make the following novel contributions:
• The first fixed-compute-resource hardware for iterative
calculation capable of producing arbitrary-precision re-
sults after arbitrary numbers of iterations.
• An optimised mechanism for digit-vector storage based
on a Cantor pairing function to facilitate simultaneously
increasing precision and iteration count.
• Theoretical analysis of MSD stability within any online
arithmetic-implemented iterative method, enabling the
runtime elision of don’t-change digits to obtain perfor-
mance speedups and increase memory efficiency.
• Exemplary hardware implementations of our proposals
for the computation of both linear (Jacobi method) and
nonlinear (Newton) iterations.
• Qualitative and quantitative performance and scalability
comparisons against traditional and state-of-the-art online
arithmetic FPGA implementations.
An earlier version of this work appeared in the proceedings
of the 16th International Conference on Field-programmable
Technology (FPT) [5]. This article combines that paper’s
material with the don’t-change digit proposal taken from our
24th IEEE Symposium on Computer Arithmetic (ARITH)
publication [6], extending both. In particular:
• We add an arbitrary-precision divider to our available
operators, enabling the construction of datapaths for the
calculation of irrational results with Newton’s method.
• Changes to our digit elision technique, originally de-
signed for linear-convergence algorithms, are made to suit
the Newton method’s quadratic convergence.
• To complement the new digit elision strategy, we pro-
pose an enhanced memory-addressing scheme, leading
to greater performance and higher achievable result ac-
curacy for a given memory budget.
• Finally, we exploit digit-parallel online addition to de-
crease datapath latency.
These optimisations allow us to obtain significant increases in
throughput and memory efficiency over previous designs.
The implementations presented and evaluated in this article
are fixed-point. ARCHITECT’s principles are, however, generic,
and could be employed for the construction of floating-point
operators supporting arbitrary-precision mantissas.
II. BACKGROUND
In scientific computing, machine learning, optimisation and
many other numerical application areas, methods of iterative
calculation are particularly popular and interest in their ac-
celeration with FPGAs is growing [7]. Recent studies have
demonstrated that FPGAs are promising platforms for the
acceleration of the Jacobi [8], Newton’s [9], conjugate gra-
dient [10] and MINRES methods [11]. However, implemen-
tations relying on traditional arithmetic—whether digit-serial
or -parallel—enforce compile-time determination of precision;
for digit-parallel designs this affects their area and input/output
bandwidth requirements, while for digit-serial it is one of
the factors affecting algorithm runtime. Runtime tuning of
precision in iterative calculations was enabled through the use
of online arithmetic in recent work [4], however unrolling
was necessary in order to implement the algorithm’s loop;
area therefore scaled with the desired number of iterations.
As shown in Table I, ARCHITECT stands apart from these
alternatives by enabling the runtime selection of both factors
affecting result accuracy while keeping compute area constant.
LI et al.: ARCHITECT: ARBITRARY-PRECISION HARDWARE WITH DIGIT ELISION FOR EFFICIENT ITERATIVE COMPUTE 3
TABLE I
COMPARISON OF ITERATIVE ARITHMETIC PARADIGMS.
Area scales with Runtime scales with
Prec. Iter. limit Prec. Iter. limit
LSD-first, parallel 4 8 8 4 unbounded
LSD-first, serial 8 8 4 bounded 4 unbounded
Zhao et al. [4] 8 4 4 unbounded 8
ARCHITECT 8 8 4 unbounded 4 unbounded
A. Arbitrary-precision Computing
Applications requiring very high precisions have become
increasingly popular in recent years [12]. For example, today,
hundreds of digits of precision are required in atomic system
simulations and electromagnetic scattering theory calculations,
while Ising integrals and elliptic function evaluation need
thousands of digits [13]. In experimental mathematics, Pois-
son equation computations frequently require results to tens
or hundreds of thousands of digits of precision [14]. Stan-
dard numeric datatypes, such as double- or even quadruple-
precision floating-point, are therefore no longer sufficient in
an increasing number of scenarios.
Many software libraries have been developed for arbitrary-
precision arithmetic [15]–[17]. The de facto standard is MPFR,
which guarantees correct rounding to any requested number
of bits, selected before each operation is executed. Interest
in the hardware acceleration of high-precision operations, in
particular those within iterative algorithms, is growing [7].
FPGAs provide flexibility not available on other platforms,
allowing for the implementation of bespoke designs with
many precision and performance tradeoffs. Libraries including
FloPoCo [18] and VFLOAT [19], alongside proprietary vendor
tools, facilitate the creation of custom-precision arithmetic
units. These provide designers with many options to suit
particular frequency, latency and resource requirements. Sun et
al. proposed an FPGA-based mixed-precision linear solver: as
many operations as possible are performed in low precision be-
fore switching to a slower, higher-precision mode for the later
iterations [8]. A dual-mode (double- and quadruple-precision)
architecture based on Taylor series expansion has also been im-
plemented [20]. Zhao et al.’s work enables arbitrary-precision
computation but, as mentioned earlier, necessitates compile-
time determination of iteration count [4].
With the exception of Zhao et al.’s architecture [4],
each of the aforementioned proposals requires precision—or
precisions—to be determined a priori. In many cases, this is
not a trivial task; making the wrong choice often means having
to throw the calculations already done away and starting from
scratch with higher precision, wasting both time and energy
in doing so. In our work, we are particularly interested in
hardware architectures which allow precision to be increased
over time without having to restart computation or modify the
circuitry. Table II presents a side-by-side comparison of these
techniques and their features with ARCHITECT, the only entry
supporting the determination of result precision and iteration
count after each calculation has commenced.
TABLE II
COMPARISON OF ARBITRARY-PRECISION TECHNIQUES.
Level Prec. setper calc.
Iter. limit set
per calc.
MPFR [16] Software Before During
FloPoCo [18], etc. Hardware Before During
Mixed precision [8], [20] Hardware Before During
Zhao et al. [4] Hardware During Before
ARCHITECT Hardware During During
B. Online Arithmetic
Achieving arbitrary-precision computation with fixed hard-
ware requires MSD-first input consumption and output gen-
eration. A suitable proposal for this, widely discussed in the
literature, is online arithmetic [2]. By employing redundancy
in their number representation, allowing less-significant digits
to correct errors introduced in those of higher significance,
all online operators are able to function in MSD-first fashion.
Online operators are classically serial, however efficient digit-
parallel (unrolled) implementations targetting FPGAs have
been developed as well [3]. We make use of both digit-serial
and -parallel online operators in this work, employing the
de facto standard radix-2 signed-digit number representation,
wherein the ith digit of a number x, xi, lies in {−1, 0, 1}.
In hardware, each xi corresponds to a pair of bits, x+i and
x−i , selected such that xi = x
+
i − x−i . Data can be efficiently
converted between non-redundant and redundant forms using
well known on-the-fly conversion techniques [2].
Of particular significance to the material presented in this
article is the concept of online delay. When performing an
online operation, the digits of its result are generated at the
same rate as its input digits are consumed, but the result is
delayed by a fixed number of digits, denoted δ. That is, the
first (i.e. most-significant) q digits of an operator’s result are
wholly determined by the first q + δ digits within each of
its operands [2]. The value of δ is operator-specific, but is
typically a small integer. When chaining operators to form
a datapath, as we do, the total online delay is the highest
cumulative delay through the complete circuit [4].
1) Addition: A classical online adder makes use of full
adders and registers to add digits of inputs x and y presented
serially as xin and yin, as shown in Fig. 2 (left), from most to
least significant [2]. Digits of z start to appear at serial output
zout after two clock cycles, hence δ+ = 2. Duplication of
such a serial adder P times and removal of its registers leads
to the creation of a P -digit parallel online adder devoid of
online delay, as shown in Fig. 2 (right). Crucially, while carry
digits are presented at the least-significant end of the adder and
generated at the most, there is no carry chain; independent of
its word length, the critical path lies across two full adders.
This demonstrates the adder’s suitability for the construction of
more complex online operators and indicates that its maximum
frequency is independent of precision.
2) Multiplication: Algorithm 2 illustrates classical radix-
2 online multiplication [2]: a process that operates in serial-
4 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
x+in x
−
in y
+
in y
−
in
z+out z
−
out
Full adder
s c
Full adder
s c
> >
>
x+0 x
−
0 y
+
0 y
−
0 x
+
1 x
−
1 y
+
1 y
−
1
. . . x+P−1x
−
P−1y
+
P−1 y
−
P−1 c
+
in c
−
in
c+out c
−
out z
+
0 z
−
0 z
+
1
. . . z−P−2 z
+
P−1z
−
P−1
Full adder
s c
Full adder
s c
Full adder
s c
Full adder
s c
Full adder
s c
Full adder
s c
Fig. 2. Radix-2 online adders [2]. Left: Serial. Right: Parallel.
Algorithm 2 Radix-2 online multiplication [2].
Inputs: serially presented multiplicand x, multiplier y
1: x,y,w ← 0
2: for j = 0 to P + 2 do
3: y ← y ‖ yj
4: v ← 2w + 2−3(xyj + yxj)
5: zj−3 ← sel×(v)
6: w ← v − zj−3
7: x← x ‖ xj
8: end for
Output: serially generated product z
in, serial-out fashion. So-called digit vectors x and y are
assembled from digits of multiplicand x and multiplier y over
time from the most significant first; ‖ represents concatenation
performed such that
x =
j∑
i=0
xi2
−i−1, y =
j∑
i=0
yi2
−i−1
during cycle j. Digit-selection function sel× serves to deter-
mine the digits of product z. This is defined to be
sel×(v) =

1 if v ≥ 1/2
0 if − 1/2 ≤ v < 1/2
−1 otherwise.
zj is produced at cycle j + 3 since δ× = 3. Note that ‘digits’
zj for j < 0 are ignored. P -digit online addition lies at the
heart of the algorithm; due to its fixed width, hardware that
implements Algorithm 2 can multiply to a precision of at most
P digits, which must be fixed in advance.
3) Division: The process of classical radix-2 online division
is shown in Algorithm 3, in which dividend x and divisor y are
used to produce quotient z. In contrast to Algorithm 2, division
requires the formation of digit vector z since all prior output
digits are needed for the calculation of v, while updates to
w require the full history of y. Online division therefore has
more complex computation dependencies than multiplication.
Its digit-selection function, sel÷, is
sel÷(v) =

1 if v ≥ 1/4
0 if − 1/4 ≤ v < 1/4
−1 otherwise.
zj is produced at cycle j + 4 since δ÷ is 4.
Algorithm 3 Radix-2 online division [2].
Inputs: serially presented dividend x, divisor y
1: y,w, z ← 0
2: for j = 0 to P + 3 do
3: y ← y ‖ yj
4: v ← 2w + 2−4(xj − zyj)
5: zj−4 ← sel÷(v)
6: w ← v − zj−4y
7: z ← z ‖ zj−4
8: end for
Output: serially generated quotient z
III. PROPOSED ARCHITECTURE
Using classical online operators as a starting point, we now
describe the construction of constant-compute-resource hard-
ware capable of performing iterative computation to increasing
precision over time. We call this concept ARCHITECT.
A. Digit-vector Storage
Classical online operators make use of registers to store digit
vectors. When implementing Algorithm 2 in hardware, for
example, P -digit registers are needed for x and y. To compute
to an arbitrary precision p instead, this is unsuitable; we must
use random-access memory (RAM) for digit-vector storage to
avoid both under- and over-budgeting register resources. We
break p into two dimensions: one fixed, U , that determines the
RAM width, and a second variable, n = dp/Ue, representing
the number of these ‘chunks’ that constitute each p-digit
number. For digit index i, where 0 ≤ i < p, we define chunk
index c = bi/Uc and chunk digit index u = i mod U such
that i = Uc + u. When performing iterative calculations,
independent digit vectors exist for each step, thus their in-
dexing requires three variables: c ∈ [0, n), u ∈ [0, U) and
approximant index k.
Since ARCHITECT requires k and i to both vary non-
monotonically as time progresses, it is necessary to uniquely
encode a one-to-one mapping from two-dimensional approxi-
mant and chunk index pair (k, c) into one-dimensional time.
We use a Cantor pairing function (CPF) [21], a bijection from
N2 onto N, for this purpose, defined to be
cpf(k, c) =
(k + c) (k + c+ 1)
2
+ c. (1)
The function’s bijectivity is crucial for ARCHITECT. Unlike
classical row- or column-major indexing, the injectivity of
the CPF allows both dimensions to grow without bound
while providing a unique result for every (k, c). Its operation
is demonstrated visually in Fig. 3; what is conceptually a
three-dimensional array indexed as (k, c, u) becomes a two-
dimensional array indexed by (cpf(k, c), u) instead, thereby
suiting the ‘flat’ nature of RAM. The function’s surjectivity
ensures that every cpf(k, c) is produced by some (k, c), thus
enabling the most efficient use of the available memory.
B. Arbitrary-precision Operators
1) Multiplication: We are now in a position to rewrite
Algorithm 2 such that it can compute results to arbitrary
LI et al.: ARCHITECT: ARBITRARY-PRECISION HARDWARE WITH DIGIT ELISION FOR EFFICIENT ITERATIVE COMPUTE 5
u
k
c
x[0][0]
x[1][0]
x[2][0]
x[3][0]
··
·
x[0][0]
x[0][1]
x[0][2]
x[0][3]
· ·
·
0
1
2
3
· · ·
0
1
2
3
··
· ·
··
· · ·
RAM width
R
A
M
depth
0 1 · · · U−1
x[cpf(0,0)]
x[cpf(1,0)]
x[cpf(0,1)]
x[cpf(2,0)]
··
·
0
1
2
3
··
·
Fig. 3. Operation of our Cantor pairing function, showing the transformation of
a three-dimensional array growing with both approximant and chunk indices
k and c to a structure growing only in a single dimension.
Algorithm 4 Radix-2 ARCHITECT multiplication.
Inputs: serially presented multiplicand x, multiplier y; ap-
proximant index k, precision p
1: x,y,w ← 0
2: for j = 0 to p+ 2 do
3: y[cpf(k, bj/Uc)][j mod U ]← yj
4: for c = bj/Uc to 0 do
5: v[cpf(k, c)]← 2w[cpf(k, c)]+
2−3(x[cpf(k, c)]yj + y[cpf(k, c)]xj)
6: if c > 0 then
7: w[cpf(k, c)]← v[cpf(k, c)]
8: end if
9: end for
10: zj−3 ← sel×(v[cpf(k, 0)])
11: w[cpf(k, 0)]← v[cpf(k, 0)]− zj−3
12: x[cpf(k, bj/Uc)][j mod U ]← xj
13: end for
Output: serially generated product z
precision. These transformed steps are shown in Algorithm 4.
Most importantly, a new loop has been introduced; this iterates
over the n pairs of p-digit numbers’ chunks, most significant
first, to facilitate arbitrary-precision multiplication with a U -
digit online adder. Digit vectors x, y, v andw are now indexed
in two dimensions, corresponding to standard RAM addressing
denoted as [word][digit]. Where a digit index is not given, all
U digits of that word are accessed simultaneously.
2) Division: The equivalently transformed version of Algo-
rithm 3 is shown in Algorithm 5. Mirroring the increased com-
plexity of classical online division over multiplication, here,
two accumulation loops are needed: one for the calculation
of v, as for multiplication, and a second for w. Consequently,
n−1 more cycles are required for the computation of an output
digit in ARCHITECT division than multiplication.
Particular care is required for digit alignment in online
division since input operands need to be bounded such that
the output range is (−1, 1) [2]. The normalisation of quo-
tients following online division ordinarily necessitates variable
δ÷ [22]. To avoid this, we can maintain a fixed online
delay by bounding divisor magnitude within [1/r, 1) [23]. For
Algorithm 5 Radix-2 ARCHITECT division.
Inputs: serially presented dividend x, divisor y; approximant
index k, precision p
1: y,w, z ← 0
2: for j = 0 to p+ 3 do
3: y[cpf(k, bj/Uc)][j mod U ]← yj
4: for c = bj/Uc to 0 do
5: v[cpf(k, c)]←2w[cpf(k, c)]+
2−4(xj − z[cpf(k, c)]yj)
6: end for
7: zj−4 ← sel÷(v[cpf(k, 0)])
8: for c = bj/Uc to 0 do
9: w[cpf(k, c)]← v[cpf(k, c)]− zj−4y[cpf(k, c)]
10: end for
11: z[cpf(k, bj/Uc)][j mod U ]← zj−4
12: end for
Output: serially generated quotient z
experimentation, we can guarantee alignment across iterations
through the appropriate selection of initial inputs.
C. Digit Computation Scheduling
Given a generic online delay δ made up of latencies from
a pipeline (or replicated pipelines operating in parallel) of
one or more operators implementing the body of an iterative
algorithm, restrictions are imposed on the order in which digits
can be calculated. δ impacts us in two ways:
• Calculation of the first output digit requires the prior input
of the first δ+1 input digits. Thereafter, each subsequent
output digit requires one additional input digit in order
to be computed.
• The ith output digit is generated δ cycles after the ith input
digit is presented.
In general, digits of the same approximant can be calcu-
lated indefinitely, while those across iterations must be se-
quenced such that they obey these δ-imposed limitations.
When scheduling digit z(k)i ’s generation, we must ensure that
t
(
z
(k)
i+1
)
> t
(
z
(k)
i
)
, t
(
z
(k+1)
i
)
> t
(
z
(k)
i+δ
)
for all approximant indices k ≥ 1 and digit indices i ≥ 0,
where t is the time at which a generation event occurs.
While we have the freedom to trade off between iteration
count and precision within the bounds of these dependencies,
we always assume a mapping from current to next digit of
the form depicted in Fig. 4. The groups of digits shown,
each δ in size, are processed ‘downwards’ and ‘leftwards,’
with slope dependent on δ and control snapping back to the
first approximant once digit position i = 0 has been reached.
Fixing the granularity of digit generation to δ allows for
control path simplification—as will be elaborated upon in Sec-
tion III-E—and limits transitions between approximants. The
latter is beneficial since, as will be explained in Section III-G,
switching between approximants leads to the incursion of
performance penalties under some circumstances.
6 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
0 δ 2δ 3δ i
1
2
3
•
•
• • • • • • •
• • • •
• •
k
Fig. 4. Proposed digit generation pattern without don’t-change digit elision
for generic iterative computation using online operators.
q δ
x(k−2):
x(k−1):
x(k):
Fig. 5. A proof sketch showing why it is sound to omit don’t-change digits. If
the two hatched regions contain the same q+δ digits, the three thick boxes are
guaranteed to contain the same q digits, hence x(k)’s calculation can begin
from digit index q.
D. Don’t-change Digit Elision
Thanks to the use of online arithmetic, when advancing
downwards in our iteration-precision space, we can avoid the
recalculation of don’t-change digits, i.e. those of later approx-
imants that have stabilised. This is generally not possible in
LSD-first architectures, in which carries can propagate from
LSD to MSD. Don’t-change digit elision is guaranteed to be
an error-free transformation: it induces no approximation.
The concept behind this optimisation is straightforward.
Before beginning to calculate the digits of approximant k, we
examine the digits of the previous two approximants. If these
approximants are equal in their most-significant q + δ digits,
it is guaranteed that approximant k will be equal to its two
predecessors in its first q digits. Hence, we do not need to
calculate them; we can skip directly to digit q’s generation.
The soundness of this optimisation can be justified by
appealing to the digit dependencies of online arithmetic. Fig. 5
provides some graphical intuition. Given that each approxi-
mant depends only on the value of its immediate predecessor,
and recalling the definition of online delay from Section II-B,
we emphasise that the first q digits of one approximant depend
only upon the first q+δ digits of the previous approximant [2].
Hence, if approximants k−2 and k−1 are equal in their first
q+ δ digits, approximant k is guaranteed to be equal to them
in its first q digits.
During the generation of approximant k, we compare digits
on the fly with those generated for approximant k − 1, previ-
ously stored in RAM. Based on the number of digits found to
be equal, we store a pointer indicating whence approximant
k + 1’s, i.e. the next approximant’s, generation should begin.
Pointer storage requires a small amount of extra memory but,
as will be elaborated upon in Section V-F, this overhead is
small and amortised out the more RAM is instantiated for
• • • •• • • •
• • • •
× ×
i
k
0 δ 2δ 3δ
1
2
3
Fig. 6. Digit generation pattern with don’t-change digit elision. Groups of
digits in the shaded region were found to be identical at runtime, allowing
computation of the first group to be skipped in the subsequent iteration.
Dashed lines are scheduled paths not taken and ×s are digits therefore elided.
storing digit vectors. Since we have elected to process digits
in groups of δ, it makes sense to also limit our don’t-change
digit elision to this granularity. We thus avoid the processing
of entire groups of digits, where possible.
As a result of the introduction of don’t-change digit elision,
our scheduling pattern becomes dynamic. Fig. 6 shows an
example. This is similar to Fig. 4, but, due to the identification
of the third approximant’s first group of MSDs as stable, we
can advance into the iteration-precision space more quickly
than had we not elided them, increasing compute efficiency.
Along with increased performance, the elision of don’t-
change digits also enables us to increase memory efficiency.
Defining ψ as the number of digits guaranteed not to
have changed within the current approximant, as determined
through the runtime comparison of MSDs within the preceding
two approximants, we can substitute
cpf(k, cˆ) =
(k + cˆ) (k + cˆ+ 1)
2
+ cˆ,
for (1), where cˆ = b(i−ψ+1)/Uc. By doing so, stable digits
no longer need to be recomputed or stored. In common with
its predecessor, this optimised storage strategy guarantees no
memory wastage through the surjectivity of its mapping from
approximant and chunk indices to memory addresses.
E. Control Logic
Given a particular (k, i), we can compute the subsequent
(k, i), (k′, i′), needed to realise scheduling patterns such as
those shown in Figs 4 and 6 with the finite-state machine
(FSM) depicted in Fig. 7. Therein, we present state transition
conditions both with and without don’t-change digit elision
functionality. When elision is enabled, the conditions shown
in boxes are evaluated in addition to those outside.
The states’ functionality is as follows.
• Digit generation: Manages the propagation and storage of
δ-digit groups across iterations. When remaining within
this state, only digit index i must be evaluated to deter-
mine changes needed to k and i without don’t-change
digit elision. When enabled, ψ must also be considered.
• Accumulation: Assuming that the constructed datapath
contains at least one multiplier or divider, we must
account for the variable latency of those operators. The
throughput of the datapath as a whole is determined by
LI et al.: ARCHITECT: ARBITRARY-PRECISION HARDWARE WITH DIGIT ELISION FOR EFFICIENT ITERATIVE COMPUTE 7
Start
Digit
generation Accumulation
/ k ← 1, i← 0
i mod δ = δ − 1 ∧ i ≥ δ ∧ i− ψ 6= δ − 1 / i← i− 2δ + 1, k ← k + 1
i mod δ 6= δ − 1 / i← i+ 1
i = δ − 1 ∨ (i mod δ = δ − 1 ∧ i ≥ δ ∧ i− ψ = δ − 1) / k ← 1, i← i+ kδ + 1
γ > 0 / γ ← γ − 1
i ≥ U / γ ← αbi/Uc − 1
γ = 0
Fig. 7. FSM for digit computation scheduling. Transition edges are labelled with conditions and actions separated by slashes (/). If the datapath consists only
of adders, the accumulation state is never entered. Otherwise, α = 2 if the datapath contains one or more dividers, and is 1 in all other cases. Boxed conditions
apply only when don’t-change digit elision is active; they are otherwise ignored. Termination occurs either on demand or following memory exhaustion.
the slowest operator. Since ARCHITECT’s multiplication
and division operators have dissimilar accumulation func-
tionality, as was explained in Section III-B, the number
of clock cycles consumed by each is different. If the
datapath contains at least one divider, advancement must
be inhibited for 2bi/Uc − 1 cycles per generated digit. If
it does not, but does contain at least one multiplier, this
factor is bi/Uc−1 instead. Counter γ sequences the return
to the digit generation state. Since i is variable, this loop
cannot be unrolled. In the case that the datapath contains
only adders, entry into this state never occurs.
F. Accuracy Bounds
Let us assume the existence of a target result defined by its
approximant index and precision (K,P ). To reach it, we are
required to compute for at least K iterations and to at least
P -digit precision. We emphasise that ARCHITECT does not
necessitate its users to specify K or P up-front, while other
approaches require either one or both of these—usually P—to
be determined before beginning to iterate. Since don’t-change
digits are identified at runtime, the analysis herein applies to
ARCHITECT without digit elision. Enabling this optimisation
will therefore increase the bounds that follow.
As shown in Fig. 8, we define the number of iterations
resulting from computation to target (K,P ) as Kres and the
final precision of the first approximant—always the most
precise—as Pres. Kres is bounded to no more than Kmax,
while Pres is similarly bounded by Pmax, both of which are
determined by the size of the available memory. The latter
therefore determines the maximum approximant index and
precision—and consequently accuracy—that can be reached
through the use of our approach. Thus, if higher accuracy is
required, more memory must be instantiated.
Upon termination, the precision of approximant k will be
p(k) =

δ
(⌈
P
δ
⌉
+K − k) if k < K
P if k = K
δ(Kres − k) otherwise,
•
•
•
••• iP Pres Pmax
K
Kres
Kmax
k
×
Target
Fig. 8. How the final precision and iteration count (Kres, Pres) are constrained
by the desired result (K,P ) and the available memory (Kmax, Pmax).
where Kres can be geometrically deduced to be
Kres =
{⌈
P
δ
⌉
+K − 1 if P > δ
K otherwise
and Pres = p(1).
For each arbitrary-precision digit vector to be stored, Kmax
and Pmax are fixed by RAM depth D (in U -digit words).
Analysis of our pairing function in (1) allows us to derive
Pmax = U
(
1 +
⌊
3/2
(√
1 + 8/9D − 1
)⌋)
,
Kmax =
{
Pmax
U + 1 if D ≥
(
Pmax
U + 1
)
Pmax
2U
Pmax
U otherwise.
G. Compute Time
Given a particular target (K,P ), and hence a certain Kres
and Pres, we can calculate the number of clock cycles required
to compute the desired result. Let us first assume that don’t-
change digit elision is disabled. This total time T can be
broken down into the following three components such that
T = T1 + T2 + T3.
8 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
• Initial online delay: We must wait δ clock cycles before
each approximant’s result begins to appear, thus the delay
across all iterations is simply
T1 = δKres.
• Digit generation: Across all iterations performed, the total
time for digit generation is either
T2 =
Kres−1∑
k=0
p(k)
(
2n(k) − 1
)
− Un(k)
(
n(k) − 1
)
− δ,
if the datapath contains one or more dividers, or
T2 =
Kres−1∑
k=0
n(k)
(
p(k) − U
(
n(k) − 1)
2
)
− δ
if it contains one or more multipliers. n(k) =
⌈
p(k)/U
⌉
and represents the number of chunks within the given
approximant upon termination of the algorithm. In the
case that the datapath contains only one or more adders,
T2 =
Kres−1∑
k=0
p(k) − δ.
p(0) and n(0) are the numbers of digits and chunks,
respectively, that must be read from the initial guess.
• Digit-serial addition: Recall that a serial online adder
has δ+ = 2. When switching between iterations, adders,
if present, require two cycles to recalculate the preced-
ing approximant’s residuals in order to produce a new
digit [2]. This ensures that the calculated digit aligns with
its truncated digit vectors. For this,
T3 = β
(
K2res −Kres + 2K − 2
)
,
where β is the number of serial adders present along the
highest-online delay path within the circuit.
When enabled, don’t-change digit elision generally allows
computation time to be reduced below T . Since the amount
of achievable reduction is input-dependent, however, it is not
practicable to determine such reductions analytically.
H. Digit-parallel Addition Optimisation
It is possible to eliminate the final T component in Sec-
tion III-G, resulting in T3 = 0, by using three-digit parallel
online adders in place of serial ones. We store consecutive
digit-vector words in alternating memory banks for speed. By
ensuring that RAM width U > 1, i.e. that each word contains
at least two digits, we can always read the three contiguous
digits required by these adders in a single cycle. No additional
memory is needed for this optimisation.
IV. BENCHMARKS
In order to evaluate ARCHITECT, we implemented two
widely used iterative algorithms—the Jacobi method (to solve
systems of linear equations) and Newton’s method (for the
solution of nonlinear equations)—in hardware following the
aforementioned principles. We chose Jacobi and Newton to
exemplify a large class of iterative methods with linear and
quadratic convergence properties, respectively. Except where
otherwise stated, ARCHITECT implementations featured all
of the previously described optimisations: don’t-change digit
elision, its related memory-addressing and digit-scheduling
schemes and serial-to-parallel online adder substitutions.
A. Jacobi Method
The Jacobi method seeks to solve the system of N linear
equations Ax = b. If A is decomposed into diagonal and
remainder components such that A = D + R, x can be
computed through the repeated evaluation of
x(k+1) = D−1
(
b−Rx(k)
)
,
or, expressed in element-wise fashion,
x
(k+1)
i =
1
aii
bi − ∑
j 6=i∈[0,N)
aijx
(k)
j
 ∀i ∈ [0, N) ,
where k is the approximant index. Since D’s only non-zero
elements lie along its diagonal, D−1 is trivial to calculate.
Note that x(k+1) relies only upon the previously computed
value of x; the calculation can therefore be parallelised by
computing each x(k+1)i independently. A convergence crite-
rion,
∥∥Ax(k) − b∥∥ < η, can be used in order to determine if
the solution has been found to great enough accuracy.
Such a system is guaranteed to be soluble when A is strictly
diagonally dominant, i.e. if the condition |aii| >
∑
j 6=i |aij |
holds for all i. Although strict diagonal dominance is not a
necessity in every case, we assume this condition to always
be satisfied for simplicity.
A metric used to quantify the sensitivity of a particular
linear system to error is the condition number ofA [24], where
κ(A) = ‖A‖ ∥∥A−1∥∥.
Perturbations in x(k), caused by rounding, lead to errors in
x(k+1) whose magnitude is dependent, in part, on κ(A); a
high condition number indicates that A is sensitive to error
and therefore ill-conditioned [25]. We can expect to need at
least ω additional digits of precision in order to compute a
system with κ(A) = 2ω than required if κ(A) were 1 [26].
Without loss of generality, the datapath we developed to
solve systems with dimensionality N = 2 is depicted in
Fig. 9a, featuring ARCHITECT numerical operators as de-
scribed in Section III-B. Jacobi solvers with N > 2 could
have been built with additional multipliers and adders, but
this is not the emphasis—demonstrating arbitrary-accuracy
iterative calculation—of this work. Note that runtime division
is unnecessary since A and b are constants and that simple
rearrangement transforms subtraction into addition.
B. Newton’s Method
Newton’s method is a root-finding algorithm, commonly
employed to approximate the zeroes of a real-valued function
f . The iterative process is
x(k+1) = x(k) − f
(
x(k)
)
f ′
(
x(k)
) ,
LI et al.: ARCHITECT: ARBITRARY-PRECISION HARDWARE WITH DIGIT ELISION FOR EFFICIENT ITERATIVE COMPUTE 9
× ×
+ +
RAM
−a01
a00
b0
a00
−a10
a11
b1
a11
22
22
22
6 6 6 6
x0(k, i)
x1(k, i)
x0(k
′, i′) x1(k
′, i′)
(a) Jacobi method (δ = 3).
× ÷
+
RAM
−1
2
− 3
2a
2
2
22
6 6
x(k, i)
x(k′, i′)
(b) Newton’s method (δ = 4).
Fig. 9. ARCHITECT benchmark datapaths. Adders, multipliers and dividers
are arbitrary-precision radix-2 signed-digit online operators. Use of three-digit
adders reduces online delay by 2 over their serial equivalents.
where f ′ is the first derivative of f . Assuming that f(x) = 0
is soluble and f ′(x) is Lipschitz continuous, convergence is
quadratic if x(0) is sufficiently close to the solution [27].
We implemented the datapath shown in Fig. 9b, again with
ARCHITECT operators, as a second case study. This can solve
equations of the form f(x) = ax2 − 3 = 0:
x(k+1) =
x(k)
2
+
3
2ax(k)
.
Since the solution of f(x) = 0 is irrational for some choices of
a (e.g. 1), we consider this to be a particularly good showcase
of ARCHITECT’s arbitrary-precision capabilities.
V. EVALUATION
We conducted theoretical analysis and performed experi-
ments to investigate how ARCHITECT scales and performs
versus competing arithmetic implementations, both traditional
(LSD-first) and online, using the Jacobi and Newton’s methods
as benchmarks. Performance is evaluated in terms of latency,
which, for all implementations considered in this article, is the
multiplicative inverse of throughput.
The closest study to this work is that presented by Zhao
et al. [4], which we compare against directly. For comparison
against traditional arithmetic, we chose to implement parallel-
in serial-out (PISO) operators since ARCHITECT operates in a
similar digit-serial fashion. PISO sits at the midpoint between
fully serial (SISO) and parallel (PIPO) in terms of area and
performance [28]. With increase in precision P—which, for
traditional arithmetic, can solve problems requiring precision
TABLE III
COMPLEXITIES OF ITERATIVE SOLVER IMPLEMENTATIONS.
Area Memory Solve time
PISO O(N2P ) O(NP ), O(P )1 O(log(N)KP )
Zhao et al. [4] O(N2K) O(N2KP ) O(P (log(N)K + P ))
ARCHITECT O(N2) O(N2(K + P )2) O( (log(N)K+P )3log(N) )
1 N -dimensional Jacobi method, N th-order Newton’s method.
up to P—PISO suffers less from area growth and operating
frequency fmax degradation than PIPO [29] while also being
dramatically faster than SISO [30]. While we focus exclusively
on hardware implementations, the limitations revealed for
PISO apply equally to software libraries since precision must
be chosen prior to iterative algorithmic commencement.
A. Complexity Analysis
In Table III, we present the results of asymptotic complexity
analysis—in terms of circuit size, memory requirements and
latency—performed for ARCHITECT and its competitors. For
PISO, we assume the repeated evaluation of an iterative
expression using datapaths composed of standard numeric
operators. For each arithmetic, we further assume latency-
optimal datapath implementations featuring minimal-depth
adder (for Jacobi) and multiplier (Newton) trees. Complexities
for Zhao et al.’s implementations were derived from analytical
expressions provided by the authors [4].
Since we have chosen to analyse latency-optimised datap-
aths, area scales with the required number of multipliers (New-
ton) and adders (Jacobi), which themselves grow quadratically
with N . For PISO, area also scales linearly with the width of
its input operands, controlled by P , while the size of Zhao et
al.’s implementations instead scales linearly with the number
of iterations to be performed, K. The area of an ARCHITECT
implementation, however, scales with neither K nor P , since
the same arithmetic operators compute every approximant, to
any precision, for the chosen iterative method.
As with area, a PISO implementation’s memory footprint
scales linearly with P ; for the Jacobi method, scaling is
also linear in N due to the size of the computed vector.
Both Zhao et al.’s implementations and ARCHITECT require
residue storage within their multipliers and dividers; memory
occupancy therefore scales with area for the arbitrary-precision
architectures. For the former, use of memory also scales with
P as residues are stored to the same precision as its input
data. Since ARCHITECT effectively collapses approximant and
precision indices into a single dimension via its CPF, the
memory requirements for each operator are determined by
the maximum value of (1) during computation to the target
(K,P ). They thus scale quadratically with K + P .
PISO’s latency grows linearly with K and P , but logarith-
mically with N due to our aforementioned choice of adder
(and multiplier) structures. Zhao et al.’s speed is bottlenecked
by the growth of precision—quadratically—as well as the fre-
quency of pipeline flushes, which grows asO(log(N)KP ) [4].
For ARCHITECT, given that each datapath’s highest cumulative
10 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
Iterative algorithm
IP cores: LSD-first,
existing online or
ARCHITECT operators
Software
implementation
Digit
scheduling
Datapath State machine
Hardware
implementation
Verification
Performance
evaluation
HDL
FPGA
CPU
Fig. 10. Experimental setup for the evaluation of ARCHITECT.
online delay δ is logarithmically related to N , its latency com-
plexity can be determined by solving for T2 in Section III-G.
Note that T2 dominates T1 in all cases and T3 = 0 since we
assume the use of digit-parallel adders.
At first glance, it appears that ARCHITECT behaves more
poorly than its competitors in terms of memory use and
solve time when scaled. We emphasise, however, that these
complexities represent worst-case scenarios for ARCHITECT:
optimisations including digit elision do not factor into its
asymptotic behaviour but do significantly improve the average
case. They also do not take fundamental limitations of the
alternatives into account. In particular, exact computation to
a given (K,P ) is rarely possible with P -digit LSD-first
arithmetic due to rounding errors introduced in earlier approx-
imants; only MSD-first architectures are capable of producing
exact results for every approximant. Additionally, they do not
account for ARCHITECT’s unique ability to compute results
to any required accuracy, effectively allowing the necessary
(K,P ) to be determined, on a problem-by-problem basis, at
runtime. In contrast, a PISO implementation’s precision is
always bounded, while the same is true of iteration count for
Zhao et al.’s proposal. In the remainder of this section, we
empirically explore the implications of these issues.
B. Experimental Particulars
We targetted a Xilinx Virtex UltraScale FPGA (XCVU190-
FLGB2104-3-E) for all experiments detailed henceforward,
with implementation performed using Vivado 16.4. The cor-
rectness of results obtained in hardware was verified via
comparison against those produced by golden models executed
in software. Fig. 10 captures our experimental process.
C. Qualitative Performance Comparison
To evaluate performance for the Jacobi method, we consid-
ered systems in which
Am =
 1 1− 2−m
1− 2−m 1
 , b =
b0
b1
 , x(0) = 0,
with b0 and b1 randomly selected from a uniform distribution
in the range [0, 1). As m increases, condition number κ(Am)
also increases, indicating that higher precision P will be
required to generate a result of great-enough accuracy. We
set accuracy bound η = 2−6 and experimentally determined
that the most ill-conditioned matrix requiring P = 32, a
commonly encountered traditional arithmetic data width, to
solve the associated system was that with m = 25, so we
limited our experiments to m ∈ [0, 25]. We postulate that
ARCHITECT should ‘win,’ i.e. compute the required result in
less time, versus PISO either when the latter’s precision P
is high and Am is well conditioned or when P is too low
for an ill-conditioned Am to allow convergence at all. For
ARCHITECT, we used RAM size (U,D) =
(
8, 210
)
. Latencies
were calculated using frequencies taken from Section V-E.
Fig. 11a captures the latency ratio between ARCHITECT and
PISO with a fixed precision of 32 bits (LSD-32) necessary to
compute results for matrices with low m. Here, PISO can
be said to have over-budgeted precision; results take longer
to compute than had a smaller P been chosen. For the most
well conditioned matrices (m ≤ 0.15), ARCHITECT takes less
time to reach the target accuracy. For larger m, however, the
opposite is true: lower-indexed approximants are computed to
greater precision than those of PISO, taking more time. Had
a lower choice of P been made for PISO, ARCHITECT would
have been at a disadvantage for the more well conditioned
matrices, but it would also have been able to compute the
results of systems featuring ill-conditioned matrices that PISO
could not. As shown in Fig. 11c, with P = 8 (LSD-8),
ARCHITECT can solve systems with m > 2, where PISO’s
precision is under-budgeted; here, even if PISO ran indefinitely
it would never be able to converge to an accurate-enough
solution. We can conclude that ARCHITECT requires less time
to generate results either when P is small and convergence is
fast or when P is too large for PISO to ever converge.
For Newton’s method, we experimented with a ∈ [1, 231].
As a increases, 3/2a decreases, thus greater precision will
be required for its representation. We calculated under ter-
mination condition
∣∣f(x(k))∣∣ < η, with η again set to 2−6.
a ∈ [1, 231] was chosen since, to solve f(x) with a = 231,
the worst-case precision requirement was again P = 32.
Figs 11b and 11d show the performance of our ARCHITECT-
based Newton’s method benchmark versus 32-bit and 8-bit
PISO in the same form as Figs 11a and 11c, respectively. The
results achieved for Newton’s method are broadly similar to
those for Jacobi. ARCHITECT requires a ≤ 2.8 to beat LSD-32
in terms of compute time, while only our proposed iterative
solver can solve systems with a > 8 when PISO has P =
8. Identical conclusions regarding under- and over-budgeted
precisions can therefore be drawn for Newton’s method.
D. Area & Frequency Scalability
Implementational results are presented in Fig. 12 for our
Jacobi and Newton’s method benchmarks, including area and
maximum operating frequency fmax. Each of the four plots
features D, the RAM depth used for storage of each digit
vector, on the x-axis, and RAM width U was 8 in all cases.
LI et al.: ARCHITECT: ARBITRARY-PRECISION HARDWARE WITH DIGIT ELISION FOR EFFICIENT ITERATIVE COMPUTE 11
0.01 0.1 1
1
2
5
10
m
=
0
.1
5
(a)
L
at
en
cy
ve
rs
us
L
SD
-3
2
(×
) Jacobi
20 21 22 23 24
1
2
3
a
=
2.
8
(b)
Newton
0.01 0.1 1 10
1
2
5
10
20
40
(c)
m
L
at
en
cy
ve
rs
us
L
SD
-8
(×
)
20 21 22 23 24 25 26
1
2
4
8
(d)
a
Fig. 11. Performance comparisons of our proposal against LSD-first arithmetic
for the Jacobi and Newton’s methods. (a) and (b) show how the conditioning of
input matrix Am (Jacobi) and input value a (Newton) affect the solve time of
our proposal compared to LSD-32. ARCHITECT computes more quickly than
LSD-32 when m ≤ 0.15 for Jacobi and a ≤ 2.8 for Newton. (c) and (d) show
that, even though our proposal leads to a slowdown compared to LSD-8, there
are nevertheless points—at m > 2 (Jacobi) and a > 8 (Newton)—whence
LSD-8 does not converge at all, hence our speedup is effectively infinite.
Lookup table (LUT) and flip-flop (FF) use are not shown
since the numbers are insignificant compared to those of on-
chip block RAM (BRAM)—from 0.17% to 0.66% for LUTs
and 0.045% to 0.21% for FFs for the smallest (D = 210)
and largest (D = 219) Jacobi designs implemented, and from
0.22% to 0.86% (LUTs) and 0.040% to 0.23% (FFs) for the
Newton datapath. Memory use grows with D, as expected;
the higher Kres and Pres one wishes to be able to reach,
the more RAM must be instantiated. With 90% and 77%
of BRAMs allocated for the Jacobi and Newton methods,
respectively, we can reach Kmax = 1023 and Pmax = 8184: the
maxima for our targetted FPGA with power-of-two choices of
D. The small increases in non-memory resources noted can
be attributed to the additional control logic and multiplexing
required to address larger memories. The fmax plots show that
our implementations are able to run at between 120 MHz, for
the smallest D tested, to around 50 MHz for the largest of both
benchmarks. Subtle increases are due to compilation noise.
ARCHITECT gives its users the freedom to trade off area
and computation time directly by varying RAM width U .
When U is changed, so are the widths of the parallel online
adders used in the datapath. While a design with narrower
adders is just as able to compute a particular result as one
capable of performing wider additions, it will also consume
more clock cycles in return for demanding lower resource use.
Comparisons between U = 8 and U = 64 with the same D,
0
20
40
60
80
100
(a)
B
R
A
M
s
(%
)
Jacobi
0
20
40
60
80
100
(b)
Newton
2
10
2
11
2
12
2
13
2
14
2
15
2
16
2
17
2
18
2
19
0
50
100
(c)
f m
ax
(M
H
z)
2
10
2
11
2
12
2
13
2
14
2
15
2
16
2
17
2
18
2
19
0
50
100
(d)
RAM depth D (words)
Fig. 12. Resource use and maximum clock rate of ARCHITECT Jacobi and
Newton benchmarks versus RAM depth D. Area is reported in terms of
BRAMs only; LUT and FF use were below 1% for all design points.
TABLE IV
AREA-SPEED TRADEOFF VIA SELECTION OF RAM WIDTH U .
U LUTs FFs BRAMs
fmax
(MHz)
Accumulation
latency (cycles)
Ja
co
bi 8 1827 964 28 121
⌈
p(k)/8
⌉
64 6964 2551 88 93
⌈
p(k)/64
⌉
N
ew
to
n 8 2316 866 26 120 2
⌈
p(k)/8
⌉
− 1
64 6102 1710 83 95 2
⌈
p(k)/64
⌉
− 1
in this case 210, are shown in Table IV to exemplify this for
both of our benchmarks. Note that the accumulation latency for
Newton’s method is higher than Jacobi’s due to the former’s
use of division; as was explained in Section III-B2, division
requires more cycles to produce each output digit than are
needed for multiplication. Table V shows the area breakdown
and minimum clock period (critical path delay) for each of
our individual arithmetic components for an example (U,D).
TABLE V
ARCHITECT OPERATOR FEATURES WITH RAM SIZE (U,D) =
(
8, 210
)
.
LUTs FFs BRAMs Minimum clockperiod (ns)
+ 4 3 – 2.0
× 250 141 4 5.0
÷ 255 93 6 5.6
12 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
E. Quantitative Area & Frequency Comparison
In order to compare the resource use and fmax of ARCHI-
TECT against its competitors, we now assume that we wish
to compute to particular (K,P ) targets. We emphasise that,
since ARCHITECT iterates exactly while LSD-first arithmetic-
based solvers do not, latency cannot be fairly compared when
considering computation to a particular (K,P ).
We chose to set targets of
(
100, 211
)
(for the Jacobi
method) and
(
10, 211
)
(Newton). Thus, at their 100th and 10th
iterations, respectively, we wish to obtain a result with 2048-
digit precision. Fewer iterations were targetted for Newton’s
method due to its quadratic convergence. Using U = 8, for
ARCHITECT, the resultant iteration counts and precisions for
the two methods are (Kres, Pres) = (509, 2545) (Jacobi) and
(351, 2106) (Newton). To successfully perform computation
to (K,P ), we must ensure that Kmax ≥ Kres and Pmax ≥ Pres.
We can determine that, by setting RAM depth D = 217, we are
able to reach Kmax = 512 and Pmax = 4088, which satisfies
these requirements for both benchmarks.
Fig. 13 presents a side-by-side comparison of the architec-
tures implemented following the principles presented herein
and those using PISO operators as well as the online im-
plementation published by Zhao et al. [4]. Most strikingly,
the latter demonstrates area inefficiency, with resource use
scaling linearly with iteration count K; ARCHITECT consumes
57× fewer LUTs and 59× fewer FFs than Zhao et al.’s
proposal requires in order to execute 100 iterations of the
Jacobi method. When executing 10 iterations of Newton’s
method, these factors are 8.4 and 13, respectively. fmax is
comparable between the two since the underlying arithmetic is
largely equivalent, although ARCHITECT’s is slightly inferior
principally due to reductions caused by the introduction of
don’t-change digit elision logic. For PISO, we can see that,
while its fmax is initially much higher—over 300 MHz for
P = 24—than ARCHITECT’s, it falls as P increases; the
crossover occurs at P ≈ 1400. Taking Newton’s method as
an example, with a high precision requirement, such as 211
digits, ARCHITECT is able to outperform its PISO counterpart
in terms of fmax by a factor of 1.5. Corresponding decreases
in LUT and FF use were also found: when computing to
P = 210, again for Newton, ARCHITECT consumes 1.8× and
3.3× fewer of each than PISO, while for 211 these factors
increase to 3.6 and 6.5. Similar conclusions can be made for
our implementation of the Jacobi method. Since the proposed
designs are able to calculate to any K ≤ Kmax and P ≤ Pmax,
their area and fmax are constant.
F. Performance Improvement Breakdown
We conducted further analysis to investigate how the eli-
sion of don’t-change digits and use of parallel online adders
individually improve the performance and memory efficiency
of ARCHITECT. Overall, Figs 14a and 14b show that solve
time can be significantly reduced when enabling these op-
timisations. As expected, don’t-change digit elision leads to
the majority of our design’s efficiency savings over ‘vanilla’
ARCHITECT (that without digit elision or parallel addition).
0
5
10
15 (a)
L
U
T
s
(%
)
Jacobi
0
1
2
(b)
Newton
0
2
4 (c)
FF
s
(%
)
0
0.2
0.4
0.6
0.8
(d)
25 26 27 28 29 210211
0
100
200
300 (e)
f m
ax
(M
H
z)
25 26 27 28 29 210211
0
100
200
300 (f)
Precision P (digits)
Fig. 13. Resource use and performance comparison of Jacobi and Newton’s
method implementations using Zhao et al.’s ( ), PISO ( ) and our
( ) approaches versus required result precision P .
The gaps between the don’t change-plus-parallel online addi-
tion and parallel addition-only lines widen as η is reduced,
indicating that consideration of don’t-change digits becomes
more important with higher accuracy requirements. The subtle
jump present in Fig. 14a is due to the δ-digit granularity of
elision. With respect to using parallel online addition only,
performance is improved for higher η since it leads to clock
cycle savings when switching between iterations. For higher-
accuracy cases, this optimisation does not contribute much
to solve time speedup, however. This makes sense since, as η
falls, more iterations are required to achieve convergence, thus
more cycles are required for the production of each new digit.
This also affords much greater opportunity for don’t-change
digit elision, however, hence the high overall speedups seen
on the right-hand side of, in particular, Fig. 14b.
The speedups we observed for Newton’s method were
far more significant than those for Jacobi: up to 16× for
the former. Relatively low performance improvements were
expected for the Jacobi benchmark due to the method’s linear
convergence. Far fewer don’t-change digits are detected and
elided during computation than for the quadratic-convergence
Newton’s method, hence the less-significant latency reductions
seen in Fig. 14a than Fig. 14b.
Figs 14c and 14d show the memory efficiency improvements
afforded through the use of don’t-change digit elision for both
LI et al.: ARCHITECT: ARBITRARY-PRECISION HARDWARE WITH DIGIT ELISION FOR EFFICIENT ITERATIVE COMPUTE 13
1
1.2
1.3
(a)
Sp
ee
du
p
(×
)
Jacobi
0
5
10
15
20
(b)
Newton
2
−8
2
−16
2
−32
2
−64
2
−12
8
2
−25
6
2
−51
2
0
0.5
1
1.5
2
(c)
M
em
or
y
re
du
ct
io
n
(×
)
2
−16
2
−32
2
−64
2
−12
8
2
−25
6
2
−51
2
2
−10
24
2
−20
48
2
−40
96
0
0.5
1
1.5
2
(d)
Accuracy bound η
Fig. 14. Solve time speedup for (a) the Jacobi and (b) Newton’s methods using
both don’t-change digit elision and parallel online addition ( ) and parallel
addition only ( ) versus ARCHITECT with both optimisations disabled. (c)
and (d) show the corresponding memory requirement reductions for Jacobi
and Newton, respectively, facilitated through digit elision.
benchmarks. We present these as the ratio of the number of
BRAM blocks that must be instantiated on our targetted FPGA
for the solution of equations to particular accuracies with and
without digit elision enabled. The jaggedness of these plots
is due to the granularity of memories, for which we only
used whole numbers of BRAMs. For lower-accuracy cases,
both pairs of designs require approximately the same amount
of memory, although that considering don’t-change digits is
slightly inferior due to the overheads involved in comparison
and subsequent elision. However, don’t-change digit elision
allows our optimised Newton design to use the same amount
of memory for η ≤ 2−512, while vanilla ARCHITECT starts to
consume more memory when η = 2−449. For the test cases
we evaluated, we observed up-to 1.5× and 1.9× memory
savings for the Jacobi and Newton’s methods, respectively.
Beyond those shown in Fig. 14, there are particularly high-
accuracy cases—η ≥ 2−874 for Jacobi and η ≥ 2−7169 for
Newton—vanilla ARCHITECT cannot reach before it exhausts
its available memory, while that with digit elision can. The
advantages of this scheme and its efficient memory addressing
therefore come to the fore with higher accuracy requirements.
VI. CONCLUSION & FUTURE WORK
In this article, we proposed the first hardware architecture
capable of executing iterative algorithms to produce results
of arbitrary accuracy by combining increasing iteration count
with precision while using constant compute resources. We
named this technique ARCHITECT. Our proposal employs
online arithmetic to generate its results MSD first and a
Cantor pairing function within its digit-storage mechanism to
facilitate the simultaneous growth of iteration count and pre-
cision. Using digit dependency analysis, we identified stable
‘don’t-change’ digits across iterations, excluding them from
calculation. This technique holds for any iterative method
implemented using online arithmetic and was realised in
hardware using simple runtime detection and digit-scheduling
logic. We also proposed the replacement of serial online adders
within iterative datapaths with parallel equivalents, facilitating
latency reduction and consequent improvements in throughput.
We evaluated ARCHITECT on FPGAs using the Jacobi and
Newton’s methods in order to verify its accuracy and establish
its scalability and efficiency. These benchmarks showcased the
key advantage of our approach: removing the burden of having
to determine and fix the precision of arithmetic operators in
advance. By doing so, we showed that datapaths constructed
from ARCHITECT operators are superior to their traditional
arithmetic equivalents in scenarios where the latter’s precision
is either overly high for the problems being solved or too low
for results to converge at all. A single ARCHITECT datapath
is able to compute results to any accuracy, with the only limit
being imposed by the size of the available RAM.
Our experiments revealed 12× LUT and 24× FF reductions
over 2048-bit conventional parallel-in serial-out arithmetic,
along with 57× LUT and 59× FF decreases versus the state-
of-the-art online arithmetic implementation, when executing
100 Jacobi iterations. For Newton’s method run for 10 iter-
ations, these factors were 3.6, 6.5, 8.4 and 13, respectively.
Versus ARCHITECT with the proposed don’t-change and par-
allel addition optimisations disabled, we were able to achieve
up-to 16× decreases in solve time.
While we prototyped our designs on FPGAs owing to
the costs and lead times associated with full-custom im-
plementation, we note that these devices are optimised for
the implementation of conventional arithmetic operators. In
particular, FPGAs’ hardened carry chains suit the construction
of fast LSD-first adders. Our proposals cannot take advantage
of such structures at present. We are confident that, should AR-
CHITECT see application-specific integrated circuit (ASIC) im-
plementation, however, much more competitive performance
would be achievable. Higher-radix (r > 2) online arithmetic
could instead (or additionally) be employed to exploit high-
performance adders, including on FPGAs, which we anticipate
would also lead to speedups. We leave the exploration of such
optimised implementations to future work.
Beyond this, we will extend our benchmarking to cover
additional iterative algorithms, including Krylov subspace
methods such as conjugate gradient descent. Finally, we en-
visage that the arbitrary-precision computation enabled by
ARCHITECT can be combined with high-level synthesis to
enable faster hardware specialisation.
ACKNOWLEDGEMENTS
The authors are grateful for the support of the United King-
dom EPSRC (grant numbers EP/P010040/1, EP/R006865/1
and EP/K034448/1), Imagination Technologies, the Royal
Academy of Engineering and the China Scholarship Council.
14 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
Supporting data for this article are available online at
https://doi.org/10.5281/zenodo.3378800.
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521,
no. 7553, 2015.
[2] M. D. Ercegovac and T. Lang, Digital Arithmetic. Elsevier, 2004.
[3] K. Shi, D. Boland, and G. A. Constantinides, “Efficient FPGA Imple-
mentation of Digit Parallel Online Arithmetic Operators,” in Interna-
tional Conference on Field Programmable Technology (FPT), 2014.
[4] Y. Zhao, J. Wickerson, and G. A. Constantinides, “An Efficient Imple-
mentation of Online Arithmetic,” in International Conference on Field
Programmable Technology (FPT), 2016.
[5] H. Li, J. J. Davis, J. Wickerson, and G. A. Constantinides, “ARCHI-
TECT: Arbitrary-precision Constant-hardware Iterative Compute,” in
International Conference on Field Programmable Technology (FPT),
2017.
[6] ——, “Digit Elision for Arbitrary-accuracy Iterative Computation,” in
IEEE Symposium on Computer Arithmetic (ARITH), 2018.
[7] M. Benzi, T. M. Evans, S. P. Hamilton, M. L. Pasini, and S. R.
Slattery, “Analysis of Monte Carlo-accelerated Iterative Methods for
Sparse Linear Systems,” Numerical Linear Algebra with Applications,
vol. 24, no. 3, 2017.
[8] J. Sun, G. D. Peterson, and O. O. Storaasli, “High-performance Mixed-
precision Linear Solver for FPGAs,” IEEE Transactions on Computers,
vol. 57, no. 12, 2008.
[9] Q. Liu, R. Sang, and Q. Zhang, “FPGA-based Acceleration of Davidon-
Fletcher-Powell Quasi-Newton Optimization Method,” Transactions of
Tianjin University, vol. 22, no. 5, 2016.
[10] A. Roldao-Lopes, A. Shahzad, G. A. Constantinides, and E. C. Kerrigan,
“More Flops or More Precision? Accuracy Parameterizable Linear Equa-
tion Solvers for Model Predictive Control,” in IEEE International Sym-
posium on Field-programmable Custom Computing Machines (FCCM),
2009.
[11] D. Boland and G. A. Constantinides, “An FPGA-based Implementation
of the MINRES Algorithm,” in International Conference on Field-
programmable Logic and Applications (FPL), 2008.
[12] G. Constantinides, A. Kinsman, and N. Nicolici, “Numerical Data
Representations for FPGA-based Scientific Computing,” IEEE Design
& Test of Computers, vol. 28, no. 4, 2011.
[13] D. H. Bailey, R. Barrio, and J. M. Borwein, “High-precision Com-
putation: Mathematical Physics & Dynamics,” Applied Mathematics
Computation, vol. 218, no. 20, 2012.
[14] D. H. Bailey and J. M. Borwein, “High-precision Arithmetic in Mathe-
matical Physics,” Mathematics, vol. 3, no. 2, 2015.
[15] B. Serpette, J. Vuillemin, and J.-C. Herve´, “BigNum: A Portable
and Efficient Package for Arbitrary-precision Arithmetic,” Digital Paris
Research Laboratory, Tech. Rep., 1989.
[16] MPFR, “The GNU MPFR Library,” http://www.mpfr.org, 2017.
[17] F. Johansson, “Arb: Efficient Arbitrary-precision Midpoint-radius Inter-
val Arithmetic,” IEEE Transactions on Computers, vol. 66, no. 8, 2017.
[18] F. de Dinechin and B. Pasca, “Designing Custom Arithmetic Data Paths
with FloPoCo,” IEEE Design & Test of Computers, vol. 28, no. 4, 2011.
[19] X. Fang and M. Leeser, “Open-source Variable-precision Floating-
point Library for Major Commercial FPGAs,” ACM Transactions on
Reconfigurable Technology and Systems, vol. 9, no. 3, 2016.
[20] M. Jaiswal and H. So, “Area-efficient Architecture for Dual-mode Dou-
ble Precision Floating Point Division,” IEEE Transactions on Circuits
and Systems I: Regular Papers, vol. 64, no. 2, 2017.
[21] P. Ce´gielski and D. Richard, “Decidability of the Theory of the Natural
Integers with the Cantor Pairing Function and the Successor,” Theoret-
ical Computer Science, vol. 257, no. 1-2, 2001.
[22] P. Tu and M. D. Ercegovac, “Design of On-line Division Unit,” in IEEE
Symposium on Computer Arithmetic (ARITH), 1989.
[23] S. F. Obermann and M. J. Flynn, “Division Algorithms and Implemen-
tations,” IEEE Transactions on Computers, vol. 46, no. 8, 1997.
[24] E. K. Miller, “A Computational Study of the Effect of Matrix Size and
Type, Condition Number, Coefficient Accuracy and Computation Preci-
sion on Matrix-solution Accuracy,” in IEEE Antennas and Propagation
Society International Symposium, 1995.
[25] A. H.-D. Cheng, “Multiquadric and its Shape Parameter—A Numerical
Investigation of Error Estimate, Condition Number, and Round-off
Error by Arbitrary Precision Computation,” Engineering Analysis with
Boundary Elements, vol. 36, no. 2, 2012.
[26] E. Cheney and D. Kincaid, Numerical Mathematics and Computing.
Nelson Education, 2012.
[27] C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations.
Society for Industrial and Applied Mathematics, 1995.
[28] K. Javeed, X. Wang, and M. Scott, “Serial and Parallel Interleaved
Modular Multipliers on FPGA Platform,” in International Conference
on Field-programmable Logic and Applications (FPL), 2015.
[29] M. R. Meher, C. C. Jong, and C.-H. Chang, “A High Bit Rate
Serial-serial Multiplier With On-the-fly Accumulation by Asynchronous
Counters,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 19, no. 10, 2011.
[30] A. Landy and G. Stitt, “Revisiting Serial Arithmetic: A Performance and
Tradeoff Analysis for Parallel Applications on Modern FPGAs,” in IEEE
International Symposium on Field-programmable Custom Computing
Machines (FCCM), 2015.
He Li is a PhD student in the Department of
Electrical and Electronic Engineering at Imperial
College London. He received the MS degree from
the Department of Microelectronics at Tianjin Uni-
versity in 2016. His main research interests are
FPGA arithmetic, custom computing and hardware
security. He received the Best Paper Presentation
Award at FPT 2017.
James J. Davis is a Research Fellow in the De-
partment of Electrical and Electronic Engineering’s
Circuits and Systems group at Imperial College Lon-
don. He received a PhD in Electrical and Electronic
Engineering from Imperial College London in 2016.
His research is focussed on the exploitation of FPGA
features for cutting-edge applications, driving up
performance, energy efficiency and reliability. Dr
Davis serves on the technical programme commit-
tees of the four top-tier reconfigurable computing
conferences (FPGA, FCCM, FPL and FPT) and is a
multi-best paper award recipient. He is a Member of the IEEE and the ACM.
John Wickerson received a PhD in Computer Sci-
ence from the University of Cambridge in 2013. He
is a Lecturer in the Department of Electrical and
Electronic Engineering at Imperial College London.
His research interests include high-level synthesis,
the design and implementation of programming lan-
guages and software verification. He is a Senior
Member of the IEEE and a Member of the ACM.
George A. Constantinides received the PhD de-
gree from Imperial College London in 2001. Since
2002, he has been with the faculty at Imperial
College London, where he is currently Professor of
Digital Computation and Head of the Circuits and
Systems research group. He was General Chair of
the ACM/SIGDA International Symposium on Field-
programmable Gate Arrays in 2015. He serves on
several programme committees and has published
over 200 research papers in peer-refereed journals
and international conferences. Prof. Constantinides
is a Senior Member of the IEEE and a Fellow of the British Computer Society.
