Real-time estimation of zero crossings of sampled signals for timing using cubic spline interpolation by Aliaga, Ramón J.
 
Document downloaded from: 
 

























Aliaga, RJ. (2017). Real-time estimation of zero crossings of sampled signals for timing
using cubic spline interpolation. IEEE Transactions on Nuclear Science. 64(8):2414-2422.
https://doi.org/10.1109/TNS.2017.2721103
https://doi.org/10.1109/TNS.2017.2721103
Institute of Electrical and Electronics Engineers
Real-time estimation of zero crossings of sampled
signals for timing using cubic spline interpolation
Ramón J. Aliaga
Abstract—A scheme is proposed for hardware estimation of
the location of zero crossings of sampled signals with sub-sample
resolution for timing applications, that consists in interpolating
the signal with a cubic spline near the zero crossing and
then finding the root of the resulting polynomial. An iterative
algorithm based on the bisection method is presented that obtains
one bit of the result per step and admits an efficient FPGA
implementation using fixed-point representation. In particular,
the root estimation iteration involves only two additions, and
the initial values can be obtained from FIR filters with certain
symmetry properties. It is shown that this allows online, real-time
estimation of timestamps in free-running sampling detector sys-
tems with improved accuracy with respect to the more common
linear interpolation. The method is evaluated with simulations
using ideal and real timing signals and estimates are given for
the resource usage and speed of its implementation.
Index Terms—Digital arithmetic, digital circuits, digital timing,
FPGA, interpolation, signal processing algorithms, splines, time
estimation, time resolution.
I. INTRODUCTION
T IME pick-off, i.e. the process of assigning an univocallydetermined time mark to detected events, is one of
the fundamental functions of electronic detector and data
acquisition systems [1]. Time marks usually correspond to
captured signals crossing certain predetermined values. The
simplest procedure, known as Leading Edge Discrimination
[2], consists in determining the time when an input signal s(t),
usually a pulse with a fast edge, crosses a certain threshold
value v; equivalently, the assigned time mark is the zero
crossing of y(t) = s(t) − v. A more refined method is
Constant Fraction Discrimination (CFD), where the time mark
corresponds to the zero crossing of a bipolar signal generated
from the original signal s(t) and a delayed copy of it [3].
These methods have been traditionally implemented with
analog electronics, using fast comparators to generate digital
pulses as the crossings take place. However, the current trend
is to digitize detector signals as early as possible in order to
preserve signal integrity and implement an increasing part of
the required signal processing in digital devices. Established
time pick-off methods have been translated to the digital
domain, using a discrete signal that is obtained from the input
Manuscript received Feb 14, 2017; revised May 21, 2017 and Jun 20, 2017;
accepted Jun 26, 2017.
This work was partially supported by the Generalitat Valenciana, Spain,
under Grant PROMETEOII/2014/019.
The author is with the Instituto de Fı́sica Corpuscular (CSIC-UV), C/
Catedrático José Beltrán 2, 46980 Paterna, Spain, and with the Instituto
Universitario de Matemática Pura y Aplicada, Universitat Politècnica de
València, Camino de Vera S/N, 46022 Valencia, Spain.
E-mail: raalva@upvnet.upv.es.
samples and may depend on reconfigurable parameters such
as those in digital CFD [4].
Digital timestamping can thus be interpreted as the problem
of recovering the zero crossing of a continuous signal y(t)
from its sampled values yn. The location of the zero crossing
can be narrowed down to a single sampling interval by
looking for sign changes between consecutive samples, but
applications such as PET [4] and others based on time-of-flight
measurements between detectors in the same array [5] usually
demand a precision better than the typical sampling periods of
4–10 ns, and require methods for sub-sample resolution. The
simplest option is linear interpolation between the samples at
the edges of the selected interval to estimate the zero location
[6]. This procedure can theoretically achieve good timing
resolution [7] and is simple enough that it can be implemented
online on FPGA for real-time processing [8], [9].
A more sophisticated option is to use a higher order
polynomial to interpolate the signal in the interval of interest
and then estimate the location of the root of the resulting
polynomial. This has been shown to achieve better results,
especially for low sampling frequencies that result in loss
of fine signal details like inflection points [10]. In particular,
cubic spline interpolation may be employed; it has been used
for image processing for a long time [11] and the interpolation
procedure i.e. computation of the polynomial coefficients has
been shown to admit efficient hardware implementations [12],
[13]. However, the zero search procedure is more complex
and is generally implemented on software [5], [14]. Mixed
approaches are possible such as [15], where an upsampled
spline interpolation is obtained first and the zero crossing is
determined by linear interpolation on the upsampled signal.
In this paper, we present an efficient algorithm for esti-
mation of the zero crossing of the cubic interpolation. Each
iteration yields one bit of the result and only requires two
additions and bit shift operations, making it particularly suit-
able for real-time implementation in FPGA with very low
computational complexity and limited area requirements. In
timing applications, this allows fast online computation of fine
timestamps of detected events very early in the data acquisition
chain, so they can be used for early trigger and filtering
decisions. After reviewing the more common method of linear
interpolation, the extension to the cubic case is analyzed and
its properties are described. The method is then evaluated
using simulations with ideal and real timing signals, and area
and speed estimates of its possible hardware implementations
are obtained. Some theoretical properties of the algorithm are
stated without proof; their proofs have been obtained but have
not been included in the paper for the sake of brevity.
II. LINEAR INTERPOLATION
We consider a continuous-time signal y(t) and the problem
of estimating the time t0 of its zero crossing from its samples
yn = y(n). We assume that the samples have been obtained
with an ADC with F bit precision and uniform period which
we normalize to 1 and that, after proper scaling, the samples
are quantized and represented digitally as numbers yn ∈
[−1, 1], using two’s complement fixed-point representation
with one sign bit, F − 1 fractional bits and no integer bits,
so that their values are integer multiples of ulp = 2−(F−1),
where “ulp” stands for Unit in the Last Place [16]. We also
assume that the sampling is dense enough that each interval
between samples contains at most one zero crossing of y and
so zero crossings can be put into one-to-one correspondence
with those intervals where y changes its sign, i.e. where
sign yn 6= sign yn+1.
The first step of the solution consists in locating the first
sign change in the sample stream in order to constrain t0 to a
sampling interval; we will assume that this happens between
samples y0 and y1. The second, more complex step then
involves estimating the location of t0 within [0, 1]. To this
end, a model f(t) of the original signal y(t) is constructed
that is intended to be a good approximation of it on [0, 1],
and the zero crossing t? of f(t) is obtained instead as an
estimation of the true zero crossing t0. We restrict ourselves
to the case where f interpolates y at the interval end-points,
i.e. it is constrained to f(0) = y0 and f(1) = y1.
The simplest option is to consider a linear approximation
f(t), which is completely determined by the two known end-








This value may be obtained online using just an adder and
a divider circuit, but here we examine a different approach:
estimating t? using the well-known bisection method for root
finding [17]. This method consists in iteratively evaluating the
continuous function f at the midpoint of an interval where
f changes its sign, and using signs to determine the half-
interval where the root is located. More specifically, let a0 = 0
and b0 = 1, and suppose we begin the n-th iteration with an
interval [an−1, bn−1] such that sign f(an−1) 6= sign f(bn−1).




(an−1 + bn−1). (2)
If sign f(µn) = sign f(bn−1) then a zero crossing must
exist within [an−1, µn], otherwise the root is known to lie in
[µn, bn−1]; the appropriate subinterval is selected for the next
iteration. This procedure is repeated for M iterations until an
interval [aM , bM ] of length 2−M that contains t? is obtained;
aM is then an estimation of t? with a precision of 2−M .
Application of the bisection method to a linear function
f admits an efficient digital implementation for two reasons.
First, evaluation of f at the midpoint of an interval only




[f(an−1) + f(bn−1)]. (3)
Algorithm 1 Iterative linear interpolation
1: Fixed data: Number of iterations/bits M
2: Input: Stream yi of signal samples
3: Detect a zero crossing as sign y0 6= sign y1
4: Initialize a0 ← 0, F0(a0)← y0, F0(b0)← y1
5: for n = 1 to M do
6: Fn(µn)← Fn−1(an−1) + Fn−1(bn−1)
7: if signFn(µn) = signFn−1(an−1) then




12: an ← {an−1, 0}
13: Fn(an)← 2Fn−1(an−1)
14: F (bn)← Fn(µn)
15: end if
16: end for
17: Output: aM , estimation of t
⋆
1. It only requires an (F +1)-bit adder for the computation of
Fn(µn) and two (F + 1)-bit registers for storage of Fn(an),
Fn(bn), besides an M -bit register to store the estimation an.
Note that the adder output is F + 2 bits wide, but one of
them can be safely discarded as guaranteed by bound (6). All
registers in Fig. 1 are shift registers: an shifts every cycle,
while the Fn either shift or load a parallel value depending on
the sign bit of Fn(µn). Multiplexers are included that allow
loading the initial values. The control circuitry and signals are
very simple and are omitted for clarity.
III. CUBIC SPLINE INTERPOLATION
A. Cubic splines
Cubic approximations f(t) are considered now. The most
obvious choice is probably the Lagrange polynomial inter-
polating y at the four points y−1, y0, y1 and y2 [18, §3.1].
However, this approach carries important drawbacks in that the
interpolating polynomials are prone to introducing oscillations,
which is known as Runge’s phenomenon [20]. A preferred
approach is to use cubic spline approximations [18, §3.5], i.e.
piecewise polynomial functions f that are defined as a dif-
ferent cubic polynomial in each interval between interpolation
nodes in such a way that f , f ′ and f ′′ are continuous in the
whole definition interval. Splines do not suffer from Runge’s
phenomenon as they are essentially local approximations; in
fact, they are the smoothest possible approximations of third
degree in a certain sense [11], [12]. More ve , they provide
uniform approximations not only of y but also of its derivatives
[21].
The case will be considered where the spline interpolation is
obtained for a set of 2N consecutive samples yk, −(N−1) ≤
k ≤ N centered at the zero crossing interval, for N ≥ 2
(N = 1 is just the linear case). A different cubic polynomial is
then defined for each interval Ik; however, only the polynomial
f(t) = p0(t) = c3t
3 + c2t
2 + c1t+ c0 (7)












Fig. 1. Hardware implementation of the zero crossing estimator using linear
bisection. Control signals are not shown.
The definition of cubic splines given so far does not deter-
mine them completely, as there are still two free parameters.
The choice of different additional conditions defines different
types of cubic spline. Two of them that do not involve
additional data are considered here: natural splines, for which
f ′′ = 0 is imposed at the two extreme nodes, and parabolically
terminated splines, such that the polynomials at the first and
last interval are quadratic instead of cubic. For both types of
splines, the coefficients ci of f may be obtained as fixed linear
combinations of the set of samples yk under consideration.
More specifically, if the notation
c =
[





y−(N−1) y−(N−2) . . . yN
]T
(9)
is used, then we have:
Theorem 1: For natural and parabolically terminated
splines, there is a 4 × 2N matrix M, which depends only
on N , such that c = My. Moreover, there is an integer D
such that DM consists of integer elements.
This implies that, if N is fixed beforehand, the ci can be
computed by applying appropriate FIR filters with constant
coefficients (i.e. the rows mi of M) to the stream of input
samples. Moreover, if the coefficients Dci of the polynomial
g(t) = Df(t) are computed instead, then the resulting FIR
filters have only integer coefficients. The root finding proce-
dure can then be applied to g instead of f , as they have the
same roots. The formula (39) for M is derived in the proof
of Theorem 1 included in the Appendix.
B. Bisection based implementation
We will now see that efficient computation of f(µn) is still
possible when f is cubic, so the bisection method can still be
applied. To this end, we apply the expansion formula





2 + c1ξ) (10)
to µn = an−1+1/2n and bn−1 = an−1+1/2n−1, in order to
evaluate the error incurred when approximating f(µn) as (4).






(c2 + 3c3µn) (11)
Fig. 1. Implementation of the zero crossing estimator using linear bisection.
Hence, it is only necessary to keep track of the values of
f(an) and f(bn) after each iteration, and computation only
requires an addition. Second, the comparison of signs and
choice of subinterval in the n-th iteration directly yields the
n-th fractional bit of t?: it is a 0 if t? is located in the left half
[an−1, µn] or a 1 otherwise. The binary representation of t?
thus consists of the (possibly inverted) sign bits of successive
values of f(µn), an is built one bit at a time until the first
M bits are obtained.
One possible hardware realization of the algorithm may be
based on storing f(an) and f(bn) in registers with fixed-point
representation, using more than the initial F bits, and then
iteratively computing f(µn) and updating the f(an) and f(bn)
accordingly. However, this is inefficient due to the variation
over time in the dynamic range of the values of f ; specifically,
f(an) and f(bn) are O(2−n). This implies that the least
significant fractional bits in these registers would be unused
initially, i.e. equal to zero, then start being taken into account
progressively as the iteration number increases at a rate of
one bit per iteration. As an and bn converge to t?, the most
significant bits would then become unused eventually.
Instead, it makes more sense to store the value of Fn = 2nf ,
which is equivalent to a “moving-point” representation where
the radix point is shifted one position to the left at a time. The
iteration step (3) then takes the form
Fn(µn) = Fn−1(an−1) + Fn−1(bn−1). (4)
In particular, this involves no division or shift and no trun-
cation takes place; hence, the operation introduces no finite
precision errors. Moreover, if xn denotes one of the abscissae
an, bn or µn, then |xn − t?| ≤ 2−n and the bound
|Fn(xn)| = 2n |f(xn)| = 2n |∆| |xn − t?| ≤ |∆| ≤ 2 (5)
is btained. Hence, only one additional integer bit is needed
per register in order to store Fn(an) and Fn(bn) without risk
of overflow. This representation with minimal register width
has the benefits of reduced area and power consumption and
increased operating frequency by shortening the adder size and
the critical data path [18].
The hardwa e implem ntatio of this procedure is outlined
in Fig. 1. It only requires an (F + 1)-bit adder for the
computation of Fn(µn) and two (F + 1)-bit registers for
storage of Fn(an), Fn(bn), plus an M -bit register to store
the estimation an. Note that the adder output is F + 2 bits
wide, but one of them can be safely discarded as guaranteed
by bound (5). All registers in Fig. 1 are shift registers: an shifts
every cycle, while the Fn either shift or load a parallel value
depending on the sign bit of Fn(µn). Multiplexers are included
that allow loading the initial values. The control circuitry and
signals are very simple and are omitted for clarity.
III. CUBIC SPLINE INTERPOLATION
A. Cubic Splines
Cubic approximations of the form
f(t) = c3t
3 + c2t
2 + c1t+ c0 (6)
are considered now. Since f(0) and f(1) are fixed, there are
two degrees of freedom for their selection. The most obvious
choice is the cubic polynomial interpolating y at the four
points y−1, y0, y1 and y2, which has been used previously
e.g. in [10]; however, this is prone to introducing unwanted
oscillations [19]. An alternative approach is to use cubic spline
approximations [17], i.e. piecewise polynomial functions f
that are defined as a different cubic polynomial in each interval
between interpolation nodes in such a way that f , f ′ and
f ′′ remain continuous. Splines are the smoothest possible
approximations of third degree in a certain sense [11], [12] and
provide uniform approximations not only of y but also of its
derivatives [20]. The case will be considered where the spline
interpolation is obtained for a set of 2N consecutive samples
yk, −(N −1) ≤ k ≤ N centered at the zero crossing interval,
for N ≥ 2. A different cubic polynomial is then defined for
each interval, but only the polynomial (6) corresponding to
[0, 1] is of interest to the root-finding procedure.
The definition of cubic splines given so far does not deter-
mine them completely, as there are still two free parameters.
The choice of different additional conditions defines different
types of cubic spline. Two of them that do not involve
additional data are considered here: natural splines, for which
f ′′ = 0 is imposed at the two extreme nodes, and parabolically
terminated splines, such that the polynomials at the first and
last interval are quadratic instead of cubic.
For both types of splines, the coefficients ci of f may be
obtained as fixed linear combinations of the set of samples yk
under consideration. More specifically, if the notation
c =
[





y−(N−1) y−(N−2) . . . yN
]T
(8)
is used, then there is a 4×2N matrix M, which depends only
on N , such that c = My. This matrix can be expressed as




ω 1 0 · · · 0 0
1 4 1 · · · 0 0







0 0 0 · · · 4 1






−ζ ζ 0 · · · 0 0
−3 0 3 · · · 0 0







0 0 0 · · · 0 3







z 2 −2 z
z −3 3 z
z 0 0 z
z 1 0 z

 , V =


z 1 1 z
z −2 −1 z
z 1 0 z





0 . . . 0
]
is a vector of N − 1 zeros, with
parameters ω = 2, ζ = 3 for natural splines and ω = 1,
ζ = 2 for parabolically terminated splines. This implies that
if N is fixed beforehand then the coefficients ci = mi ·y can
be computed by applying appropriate FIR filters with constant
coefficients, given by the rows mi of M, to the stream of input
samples. Moreover, there is an integer D = detA such that
DM consists of integer elements; hence, if the coefficients
Dci of the polynomial g(t) = Df(t) are computed instead,
then the resulting FIR filters have only integer coefficients.
The root finding procedure can then be applied to g instead
of f , as they have the same roots.
B. Bisection Based Implementation
We now show that efficient computation of f(µn) is pos-
sible when f is cubic so the bisection method can still be
applied. To this end, we apply the expansion formula





2 + c1ξ) (11)
to µn = an−1 + 1/2n and bn−1 = an−1 + 1/2n−1, in order to
evaluate the error incurred when approximating f(µn) as (3).






(c2 + 3c3µn) (12)
is obtained.
Two changes are now applied with the digital implementa-
tion in mind. First, we perform bisection on g = Df instead of
f , so that we can use integer filters. Second, the issue discussed
in Section II regarding storage of values of g of decreasing
magnitude still applies in this case, and suggests basing the
algorithm on Gn = 2ng instead. We thus multiply (12) with
2nD to obtain the iteration formula
Gn(µn) = Gn−1(an−1) +Gn−1(bn−1) +Kn−1 (13)




(Dc2 + 3Dc3µn+1). (14)
This term can be computed iteratively, too, by noticing that




where the sign is − or + if the bit of t? obtained in the n-th


























(Kn−1 ± Ln−1) (16)
Algorithm 1 Iterative cubic interpolation
1: Fixed data: Number of iterations/bits M ; number of
nodes 2N ; coefficients D, k and l
2: Input: Stream yn of signal samples
3: Detect a zero crossing as sign y0 6= sign y1
4: Initialize K0 ← k · y, L0 ← l · y
5: Initialize a0 ← 0, G0(a0)← Dy0, G0(b0)← Dy1
6: for n = 1 to M do
7: Gn(µn)← Gn−1(an−1) +Gn−1(bn−1) +Kn−1
8: if signGn(µn) = signGn−1(an−1) then
9: an ← {an−1, 1}




14: an ← {an−1, 0}




19: Ln ← Ln−1/4
20: end for






This results in the procedure described in Algorithm 1,
where the notation {a, β} stands for concatenation of bit
β ∈ {0, 1} to the right of a, and the circuit depicted in Fig.
2, which is an extension of the one used for linear bisection.
It requires registers for Gn(an), Gn(bn), Kn and Ln plus
the result an; except for Kn, all of them are shift registers
(Ln shifts two bits per iteration). Their optimal widths are
shown; they will be obtained in Section III-D. The circuit also
requires a three input adder for Gn(µn) using (13) and a two
input adder for Kn using (16). Note that the second addition
is dependent on the first one, because the sign bit of Gn(µn)
determines whether Ln is added or subtracted in (16). Since
both adders have comparable size, an adequately pipelined
version of the circuit is obtained by registering the partial
result Gn(µn) (indicated by a dashed box). This increases its
operating frequency, but also reduces its throughput from one
estimation every M clock cycles to one every 2M cycles.
C. Computation of Initial Values
Function values are initialized as G0(a0) = g(a0) = Dy0
and G0(b0) = g(b0) = Dy1, and coefficients are initialized as
K0 = k · y and L0 = l · y where
k = − 12Dm2 − 34Dm3 (18)
l = − 38Dm3 (19)
according to (14) and (17), since µ1 = 12 ; recall that m3 and
m2 are the first two rows of M. It is thus only necessary to
implement FIR filters with the coefficients in k and l instead
of all the mi. These filters turn out to have certain symmetry
Algorithm 2 Iterative cubic interpolation
1: Fixed data: Number of iterations/bits M ; number of
nodes 2N ; coefficients D, k and l
2: Input: Stream yi of signal samples
3: Detect a zero crossing as sign y0 6= sign y1
4: Initialize K0 ← ky, L0 ← l y
5: Initialize a0 ← 0, G0(a0)← Dy0, G0(b0)← Dy1
6: for n = 1 to M do
7: Gn(µn)← Gn−1(an−1) +Gn−1(bn−1) +Kn−1
8: if signGn(µn) = signGn−1(an−1) then
9: an ← {an−1, 1}




14: an ← {an−1, 0}




19: Ln ← Ln−1/4
20: end for
21: Output: aM , estimation of t
⋆
is finally obtained.
Two changes are now applied with the digital implementa-
tion in mind. First, we perform bisection on g = Df instead
of f , so that we can use integer filters. Second, the problems
discussed in section II regarding inefficient use of resources by
storing values of g of decreasing magnitude also arise in this
case, and suggests basing the algorithm on Gn = 2
ng instead.
Hence, we multiply (11) with 2nD to obtain the iteration
formula
Gn(µn) = Gn−1(a −1) +Gn−1(bn−1) +Kn−1 (12)




(Dc2 + 3Dc3µn+1). (13)
These equations show that it is possible to compute Gn(µn)
iteratively using only one multiplication per iteration i.e.
(3Dc3) × µn+1, keeping track only of the values of Gn(an)
and Gn(bn) after each iteration.
The multiplication operation can be removed altogether by
computing Kn iteratively, too. In order to do so, notice that




where the sign is − or + if the bit of t⋆ obtained in the n-th
















































Fig. 2. Hardware implementation of the zero crossing estimator using cubic
bisection. The circuit can be optionally pipelined by registering Gn(µn).






This results in the procedure described in Algorithm 2 and
the circuit depicted in Fig. 2, which is an extension of the
one used for linear bisection. It requires registers for Gn(an),
Gn(bn), Kn and Ln plus the result an; except for Kn, all of
them are shift registers (Ln shifts two bits per iteration). Their
optimal widths are shown; they will be derived in subsection
III-D. The circuit also requires a three input adder for Gn(µn)
using (12) and a two input adder for Kn using (15). Note
that the second addition is dependent on the first one, because
the sign bit of Gn(µn) determines whether Ln is added or
subtracted in (15). Since both adders have comparable size,
an adequately pipelined version of the circuit is obtained by
registering the partial result Gn(µn) (indicated by a dashed
box); this increases its operating frequency, but also its latency
to 2M clock cycles instead of M .
C. Computation of initial values
Function values are initialized as G0(a0) = g(a0) = Dy0
and G0(b0) = g(b0) = Dy1, and coefficients are initialized as
K0 = ky and L0 = l y where
k = − 12Dm2 − 34Dm3 (17)
l = − 38Dm3 (18)
according to (13) and (16), since µ1 =
1
2 ; here m3 and m2 are
the first two rows of M. It is thus only necessary to implement
FIR filters with the coefficients in k and l instead of all the mi.
These filters turn out to have certain symmetry properties that
allow very efficient implementations. Before stating them, let
us fix some notation: if x is a vector, x× denotes the vector
obtained by reversing the order of elements of x, and x is
said to be even (or symmetrical) if x = x×, and odd (or
antisymmetrical) if x = −x×.
Fig. 2. Implementation of the zero crossing estimator using cubic bisection.
The circuit can be optionally pipelined by registering Gn(µn).
properties that allow efficient implementations. In order to
state them, we fix some notation: if v is a vector, v× denotes
the vector obtained by reversing the order of elements of v,
and v is said to be even (or symmetrical) if v = v×, and odd
(or antisymmetrical) if v = −v×. Then it is possible to prove
that k is even and l is odd. Moreover, their coefficients are
related and they can be expressed as
k = AK ·
[
α β β α×
]
(20)
l = AL ·
[
α γ −γ −α×
]
(21)
where α is an integer vector, β and γ are integers, and AK
and AL are integers divided by powers of two.
Table I shows the values of D, k and l in the form (20),
(21) for N ≤ 5 and both types of splines. In each case, the
smallest integer value of D is given that results in integer
vectors except for a negative power of two, which has no
impact on hardware implementation. Only half of k and l
are shown; they are meant to be extended evenly and oddly,
respectively. Note that coefficient values grow with N and are
impractical for N > 5.
Different implementations of the FIR filters k, l have been
considered. The most obvious choice is the so-called direct or
tapped delay line form shown in Fig. 3, where samples are
stored in a shift register and the filter outputs are computed
directly as weighted sums [21]. Filter symmetry allows a
reduction in the amount of constant coefficient multipliers
by adding or subtracting pairs of samples before weighting.
However, the fact that k and l exhibit different kinds of
symmetry prevents us from taking advantage of their common
co f cients α. Moreover, this structure includes a very long
combinational path featuring adder trees with N inputs, whose
delay increases with N .
These disadvantages are typically overcome by imple ent-
ing the transposed form of the filters [21]. In that case, the
current sample is weighted by each coefficient separately and
a shift register with partial filter sums is employed instead.
This allows sharing the multipliers between both filters, and
also reduces the critical path; moreover, the resulting structure
TABLE I
INTEGER VECTORS FOR INITIAL CUBIC BISECTION COEFFICIENTS
N Natural splines Parabolically terminated splines
2











1 −3 . . .
]











1 −3 . . .
]
3











1 −6 13 . . .
]











1 −7 16 . . .
]
4











1 −6 24 −49 . . .
]











1 −7 30 −62 . . .
]
5











1 −6 24 −90 183 . . .
]



































+ + + ×
AK
K0




Fig. 3. Direct (tapped delay line) FIR filter structure for simultaneous



















Fig. 4. Bidirectional transposed FIR filter structure, for N = 4.
Theorem 2: For natural and parabolically terminated
splines, k is even and l is odd. Moreover, they have the form
k = AK ·
[
α β β α×
]
(19)
l = AL ·
[
α γ −γ −α×
]
(20)
where α is an integer vector, β and γ are integers, and AK
and AL are integers divided by powers of two.
Table I shows the values of D, k and l in the form (19), (20)
for N ≤ 5 and both types of splines. In each case, the smallest
integer value of D is given that results in integer vectors
except for a negative power of two, which has no impact on
hardware implementation. Only half of k and l are shown;
they are meant to be extended evenly and oddly, respectively.
Coefficients are much smaller for parabolically terminated
splines (not taking powers of two into account), owing to the
fact that detA tends to be divisible by high powers of two
in that case; this suggests that their implementation may be
more efficient. Note that coefficient values grow with N and
are already impractical for N > 5.
Different implementations of the FIR filters k, l have
been considered. The most obvious choice is the so-called
direct or tapped delay line form shown in Fig. 3, where
samples are stored in a shift register and the filter outputs are
computed directly as weighted sums. Filter symmetry allows
a reduction in the amount of constant coefficient multipliers
by adding or subtracting pairs of samples before weighting.






























Fig. 5. Custom filter structure for simultaneous computation of the initial
bisection values.
symmetry prevents us from taking advantage of their common
coefficients α. Moreover, this structure includes a very long
combinational path featuring adder trees with N inputs, whose
delay increases with N .
These disadvantages are typically overcome by implement-
ing the transposed form of the filters [22, §9.3]. In that case,
the current sample is weighted by each coefficient separately
and a shift register with partial filter sums is employed instead.
This allows sharing the multipliers between both filters, and
also reduces the critical path; moreover, the resulting structure
featuring cascaded adder-register elements is particularly effi-
cient for FPGA implementations. The drawback in that case is
that separate register chains are needed for k and l, and thus
more registers are needed, which moreover need to be wider
as they store multiplication-accumulation results instead of the
raw samples as in the direct case.
This problem can be partially solved by using a custom
structure that takes advantage of the common coefficients α
and the symmetry simultaneously as follows: express the initial
values K0, L0 as
K0 = k · y = AK · [d+ β(y0 + y1) + r] (21)
L0 = l · y = AL · [d+ γ(y0 − y1)− r] (22)
where
d = α ·
[
y−(N−1) y−(N−2) · · · y−1
]T
= α0y−(N−1) + α1y−(N−2) + . . .+ αN−2y−1, (23)
r = α× ·
[
y2 y3 · · · yN
]T
= αN−2y2 + αN−3y3 + . . .+ α0yN (24)
are the result of applying the same filter α to the first and
last N − 1 samples in y, first in direct order (d) and then
in reverse order (r). The transposed FIR filter structure can
be modified to allow for bidirectional operation as shown in
Fig. 4, where a common signal controls all multiplexers and
determines whether the data flow is from left to right (direct)
or vice versa (reverse); the same circuit can then be used
to compute d and r, as they correspond to non-overlapping
windows of the input signal. The full, custom filter structure
is depicted in Fig. 5; control signals are omitted for clarity.
D. Error analysis and dimensioning
In the implementation of linear interpolation proposed in
section II there is no error due to finite precision effects. That
Fig. 3. Direct (tapped delay li e) FIR filter structure for simultaneous























+ + + ×
AK
K0




Fig. 3. Direct (tapped d lay line) FIR filter structure for simultaneous



















Fig. 4. Bidirectional transposed FIR filter structure, for N = 4.
Theorem 2: For natural and parabolically terminated
splines, k is ven and l is od . ver, they have the form





l = AL ·
[
α γ γ −α×
]
(20)
where α is an integer vector, β and γ are integers, and AK
and AL are integers divided by powers of two.
Table I shows the values of D, k and l in the form (19), (20)
for N ≤ 5 and both types of splines. In each case, the smallest
integer value of D is given that results in integer vectors
except for a negative power of two, which has o impact on
hardwa e implementation. Only half of k and l are shown;
they are eant to be extended evenly and oddly, respectively.
Coefficients are much smaller for parabolically terminated
splines (not taking powers of two into account), owing to the
fact that detA tends to be divisible by high powers of two
in that case; this suggests that their implementation may be
more efficient. Note that coefficient values grow with N and
are already impractical for N > 5.
Different implementations of the FIR filters k, l have
been considered. The most obvious choice is the so-called
direct or tapped delay line form shown in Fig. 3, where
sample are stored in a shift register and the filter outputs are
computed directly as we ghted sums. Filter symmetry allows
a reduction in the amount of constant coefficient multipliers
by adding or subtracting pairs of samples before weighting.






























Fig. 5. Custom filter structure for simultaneous computation of the initial
bisection values.
symmetry prevents us from taking advantage of their common
coefficients α. Moreover, this structure includes a very long
combinational path featuring adder trees with N inputs, whose
delay increases with N .
These disadvantages are typically overcome by implement-
ing the transposed form of the filters [22, §9.3]. In that case,
the current sample is weighted by each coefficient separately
and a shift regi ter with partial filter sums is employed instead.
This allows sharing the multipliers between both filters, and
also reduces the critical path; moreover, the resulting structure
featuring cascaded adder-register elements is particularly effi-
cient for FPGA implementations. The drawback in that case is
that separate register chains are needed for k and l, and thus
more registers are needed, which moreover need to be wider
as they store multiplication-accumulation results instead of the
raw samples as in the direct case.
This problem can be partially solved by using a custom
struct re that takes advantage of the common coefficients α
and the symmetry simultaneously as follows: expre s the initial
values K0, L as
K0 = k · y = AK · [d+ β(y0 + y1) + r] (21)
L0 = l · y = AL · [d+ γ(y0 − y1)− r] (22)
where
d = α ·
[
y−(N−1) y−(N−2) · · · y−1
]T
= α0y−(N−1) + α1y−(N−2) + . . .+ αN−2y−1, (23)
r = α× ·
[
y2 y3 · · · yN
]T
= αN−2y2 + αN−3y3 + . . .+ α0yN (24)
are the result of applying the same filter α to the first and
last N − 1 samples in y, first in direct order (d) and then
in reverse order (r). The transposed FIR filter structure can
be modified to allow for bidirectional operation as shown in
Fig. 4, where a common signal controls all multiplexers and
determines whether the data flow is from left to right (direct)
or vice versa (reverse); the same circuit can then be used
to compute d and r, as they correspond to non-overlapping
windows of the input signal. The full, custom filter structure
is depi ted in Fig. 5; control signals are omitte f r clarity.
D. Error analysis and dimensioning
In the implementation of linear interpolation proposed in
section II there is no error due to fin te precision effects. That
Fig. 4. Bidirectional transposed FIR filter structure, for N = 4.
featuring cascaded adder-register elements is particularly effi-
cient for FPGA implementations. The drawback in that case is
that separate register chains are needed for k and l, and thus
more registers are needed, which moreover need to be wider
as they store multiplication-accumulation results instead of the
raw samples as in the direct case.
This problem can be partially solved by using a custom























+ + + ×
AK
K0




Fig. 3. Direct (tappe delay line) FIR filter structure for simultaneous

















Fig. 4. Bidirectional transposed FIR filter structure, for N = 4.
Theorem 2: For natural and parabolically terminated
splines, k is even and l is odd. Moreover, they have the form
k = AK ·
[
α β β α×
]
(19)
l = AL ·
[
α γ −γ −α×
]
(20)
where α is an integer vector, β and γ are integers, and AK
and AL are integers divided by powers of two.
Table I shows the values of D, k and l in the form (19), (20)
for N ≤ 5 and both types of splines. In each case, the smallest
integer value of D is given that results in integer vectors
exc pt for a egative pow r of two, which has no impact on
hardware impl mentation. Only half of k and l are shown;
they are meant to be extended evenly and oddly, respectively.
Coefficients are much smaller for parabolically terminated
splines (not taking powers of two into account), owing to the
fact that detA tends to be divisible by high powers of two
in that case; this suggests that their implementation may be
more efficient. Note that coefficient values grow with N and
are alr ady impractical for N > 5.
Different implementations of the FIR filters k, l have
b en consi ered. The most bvious choice is the s -called
direct o tapp delay line form sh wn i Fig. 3, wh re
sampl s are stored in a s ift register and the filter outputs are
computed directly as weighted sums. Filter sym etry allows
a re uction in the amount of constant coeffici nt multipliers
by adding or subtrac ing pairs of samples before weighting.






























Fig. 5. Custom filter structure for simultaneous computation of the initial
bisection values.
symmetry prevents us from taking advantage of their common
coefficients α. Moreover, this structure includes a very long
ombinational path featu ing adder tre s with N i puts, whose
delay increases with N .
These disadvantages are typically overcome by implement-
ing the transposed form of the filters [22, §9.3]. In that case,
the current sample is weighted by each coefficient separately
and a shift register with partial filter sums is employed instead.
This allows sharing the multipliers between both filters, and
also reduces the critical path; moreover, the resulting structure
featuring casca ed add -register elements is particularly effi-
cient for FPGA implementations. The drawback in that case is
that separate register chains are needed for k and l, and thus
more registers are needed, which moreover need to be wider
as they store multiplication-accumulation results instead of the
raw samples as in the direct case.
This problem can be partially solved by using a custom
structure that takes advantage of the common coefficients α
and the symmetry simultaneously as follows: express the initial
values K0, L0 as
K0 = k · y = AK · [d+ β(y0 + y1) + r] (21)
L0 = l · y = AL · [d+ γ(y0 − y1)− r] (22)
where
d = α ·
[
y−(N−1) y−(N−2) · · · y−1
]T
= α0y−(N−1) + α1y−(N−2) + . . .+ αN−2y−1, (23)
r = α× ·
[
y2 y3 · · · yN
]T
= αN−2y2 + αN−3y3 + . . .+ α0yN (24)
are the result of applying the same filter α to the first and
last N − 1 samples in y, first in direct order (d) and then
in reverse order (r). The transposed FIR filter structure can
be modified to allow for bidirectional operation as shown in
Fig. 4, where a common signal controls all multiplexers and
determines wh ther the data flow is from left to right (direct)
or vice vers (rever e); the same circuit ca th n be used
to compute d and r, as they correspond to non-overlapping
windows of the input signal. The full, custom filter structure
is depicted in Fig. 5; control signals are omitted for clarity.
D. Error analysis and dimensioning
I the implementation of linear interpolation proposed in
section II there is no error due to finite precision effects. That
Fig. 5. Custom filter structure for simultaneous computation of the initial
bisection values.
and the symmetry simultaneously as follows: express the initial
values K0, L0 as
K0 = k · y = AK · [d+ β(y0 + y1) + r] (22)
L0 = l · y = AL · [d+ γ(y0 − y1)− r] (23)
where
d = α ·
[
y−(N−1) y−(N−2) · · · y−1
]T
= α0y−(N−1) + α1y−(N−2) + . . .+ αN−2y−1, (24)
r = α× ·
[
y2 y3 · · · yN
]T
= αN−2y2 + αN−3y3 + . . .+ α0yN (25)
are the result of apply ng the same filter α to the first and
last N − 1 sa ples in y, first in direct order (d) and then
in reverse order (r). The transpos d FIR filter structure can
be modified to allow for bidirectional operation as shown in
Fig. 4, where a common signal controls all multiplexers and
determines whether the data flow is from left to right (direct)
or vice versa (reverse); the same circuit can then be used
to compute d and r, as they correspond to non-overlapping
indows of the input signal. The full, custom filter structure
is depicted in Fig. 5; control signals are omitted for clarity.
D. Error Analysis and Dimensioning
Unlike the linear case, values stored in the proposed cubic
method converge to 0 at different speeds, so truncation errors
appear eventually. An analysis of finite precision effects is thus
necessary for proper circuit dimensioning. It seems reasonable
to store the values Gn(an), Gn(bn), Kn and Ln in registers
with the same precision, as they are added with each other
directly in (13) and (16). Suppose that they use ulp = 2−Q,
where Q is allowed to be smaller than F −1 or even negative,
with the meaning that the lowest −Q integer bits are discarded
in the latter case. Then it is possible to prove that the absolute
error in the evaluation of g is bounded by 1/2Q−1, and that
the algorithm always converges to some t̂? such that
∣∣g(t̂?)
∣∣ ≤
1/2Q−1. An appropriate design choice is then to choose Q
such that the evaluation error can be neglected by forcing it to
be smaller than the quantization error in g, which is bounded
by D/2F−1. This criterion yields a minimum value
Q = F − blog2Dc (26)
where bxc denotes the greatest integer not exceeding x. Notice
that Q decreases as N and D increase: since the samples are
amplified by a factor of D before bisection, the ulp may be
increased while maintaining the same relative precision.
Maximum values of Kn, Ln and Gn(xn) also need to be
estimated for register sizing. It is easy to see that Kn and
Ln are both bounded by S = max {‖k‖1 , ‖l‖1}, where ‖v‖1
denotes the sum of the absolute values of the elements of
vector v; the values of S are included in Table I, and it
can be observed that they are always bounded by 2D for
all values of N under consideration. Regarding Gn, it can
be proven rigorously that its value is always bounded by
8D if one assumes that g is monotonic in the approximation
interval; we conjecture that the stronger bound 3
√
3D ≈ 5.2D
actually applies and that the hypothesis of monotonicity is not
necessary, but we were not able to prove it. We remark that
the assumption that f is monotonic is reasonable in timing
applications: linearization around the zero crossing shows that




≈ σy|f ′(t?)| (27)
where σy is the noise in the input signal. Hence, proper design
will try to force the timing signals to cross the zero level with
the steepest possible slope. If the sampling is dense enough,
the signal will therefore be monotonic on the zero crossing
interval; in fact, its derivative should not vary much and may
be estimated as ∆.
It follows from these bounds that the amount of integer
bits for the registers holding Gn and Kn/Ln need not be
higher than 3 + dlog2De and 1 + dlog2De, respectively, to
avoid overflow; here, dxe denotes the smallest integer which
is not smaller than x. Combining this with (26), the optimal
register widths are F + 5 and F + 3 bits for Gn and Kn/Ln,
respectively. Hence the size of the zero crossing estimator
circuit in Fig. 2 is completely independent of N : it only
depends on the bit resolution F of the original signal yn.
IV. ALGORITHM EVALUATION
The validity of the proposed algorithms was evaluated by
applying them to CFD signals offline using commercial soft-
ware (MATLAB 2014b, The Mathworks, Inc., Natick, MA).
Ideal, simulated detector pulses and digitized signals from a
real detector setup were both tested. A database of captured
timing pulses was used which had been acquired using the
experimental setup and DAQ system described in detail in
[9]. The setup consisted of a pair of continuous scintillating
crystals coupled to position-sensitive photomultiplier tubes
working in coincidence, with a 511 keV point source placed
close to one of them. The last dynode signals from both PMTs
were shaped with a pole-zero cancellation filter and sampled at
156.25 MHz with 12-bit ADCs. A set of 28000 pairs of timing
signals corresponding to position-, energy- and coincidence-
filtered events was used for the tests.
Simulated noiseless detector pulses with similar parameters
as the real ones were used first, for two reasons. First,
this provided us with the whole waveform instead of just
the captured samples, and thus the exact zero crossing for
algorithm error evaluation. Second, it allowed us to generate
a larger amount of simulated signals. Pulses were modeled as
s(t) =
{
0 , t < 0
A · t2e−t/τsh , t ≥ 0 (28)
as proposed in [22]. The shaping constant τsh was considered
to be uniformly distributed in [1, 1.5], which corresponds
approximately to the filter used for acquisition of the real
signals. The CFD signals
y(t) = s(t− τ)− α · s(t) (29)
were generated with fixed parameters τ = 4 and α = 0.5; am-
plitudes A were chosen so that max |y(t)| would be distributed
uniformly on [0.2, 0.95]. The signals were then sampled as
yn = y(n − δ) with a delay δ that was uniformly distributed
on [0, 1] in order to model arbitrary sampling phase, and
quantized with F = 12 bit precision (i.e. with ulp = 2−11).
We simulated 107 different pulses. For each one, the true
zero crossing t0 of the signal was obtained from the continuous
signal y(t). All 9 proposed bisection methods (linear, and both
types of spline for N between 2 and 5) were then applied to
the samples yn to get the estimated zero crossings t? with a
precision of M = 10 bits, using the fixed point precisions Q
given in Table II; they correspond to the optimum values (26)
for each case. An example of the simulated pulses is pictured
in Fig. 6 (a), together with the 6-node parabolically terminated
spline interpolation around the zero-crossing interval.
The estimation error |t0 − t?| was evaluated, and the mean
and maximum errors obtained for each method are shown
in Table II; for comparison, the precision is around 10−3.
The mean error is reduced by roughly one third when using
cubic instead of linear interpolation, with both types of spline
yielding equivalent results, and no significative change is
observed when increasing N beyond 3. Fig. 6 (a) includes
a close-up of the actual zero crossing and the linear, 4-node
and 6-node interpolation, and shows how the interpolated zero
crossings get closer to the real one as N increases.
TABLE II
ALGORITHM PERFORMANCE ON SIMULATED SIGNALS
Type N Q Mean error Max. error Reg. bound
Linear 4.08 · 10−2 1.82 · 10−1 43.5%
Natural 2 9 3.01 · 10−2 1.29 · 10−1 11.8%
3 5 2.65 · 10−2 1.10 · 10−1 11.7%
4 1 2.67 · 10−2 1.09 · 10−1 11.7%
5 −3 2.65 · 10−2 1.08 · 10−1 11.7%
Parabolically 2 13 3.03 · 10−2 1.37 · 10−1 11.6%
terminated 3 10 2.72 · 10−2 1.14 · 10−1 11.7%
4 5 2.66 · 10−2 1.09 · 10−1 11.7%
5 3 2.66 · 10−2 1.08 · 10−1 11.7%
TABLE III
ALGORITHM PERFORMANCE ON CAPTURED SIGNALS
Type N FWHM coincidence Reg. bound
resolution
Linear 3.46 ns 39.8%
Natural 2 3.25 ns 10.3%
3 3.17 ns 10.1%
4 3.17 ns 10.1%
5 3.18 ns 10.1%
Parabolically 2 3.28 ns 10.3%
terminated 3 3.19 ns 10.2%
4 3.17 ns 10.1%
5 3.18 ns 10.1%
For each real stored signal, a digital CFD signal was
obtained using (29) with the same parameters τ = 4 and
α = 0.5; an example is pictured in Fig. 6 (b) that shows
a similar behaviour to the simulated pulses. For each of
the proposed algorithms, the zero crossing timestamps were
estimated for each timing signal, their pairwise differences
were histogrammed, and a gaussian fit was used to estimate the
coincidence resolution. The results are summarized in Table
III and reveal a similar pattern as observed in the simulated
signals, although the performance improvement is lower due
to the presence of noise and distortion in the samples.
The maximum values stored in registers are also shown in
Tables II and III as a fraction of the theoretical bound, i.e.
max |Fn(xn)| /2 for linear and max |Gn(xn)| /8D for cubic
interpolation. No overflow occurred, and in fact the observed
extrema are significantly lower than the theoretical bounds, so
that register widths may safely be shortened.
V. HARDWARE IMPLEMENTATION
We synthesized all of the proposed variations of the cubic
interpolation circuit on FPGA technology, i.e. for natural and
parabolically terminated splines, 2 ≤ N ≤ 5, and using
both FIR filter structures (direct and customized), in order
to analyze the trade-offs involved in varying the amount of
interpolation nodes. The linear bisection circuit was also syn-
thesized for reference, as well as the standalone zero crossing
estimators. The HDL code for each circuit was simulated first
in order to check its correctness of operation. A Stratix IV
EP4SGX110 device was chosen for implementation, the same
one that is present in the DAQ system employed for acquisition


















Fig. 6. Illustration of the method on simulated and real timing signals. Each
plot includes the original detector signal (gray, dashed), the CFD signal (black,
dashed), and the 6-node spline interpolation (black, solid). For the simulated
signals, a close-up of the zero crossing is shown that includes, from left to
right: linear interpolation (dotted), 4-node spline (dot-dashed), 6-node spline
(dashed) and CFD signal (solid). All splines are parabolically terminated.
of the waveforms used in Section IV [9]. In all cases, we
collected the area occupation and the delay of the critical
register to register path in the placed and routed circuit. The
results are summarized in Table IV; the area occupation is
specified in terms of Adaptive Logic Modules (ALMs), the
basic building blocks of the Stratix IV that consist of two
LUTs, two registers, and additional adder and chain logic. All
ALMs involved in the circuits are considered, even if they are
only partially populated.
In general, the circuits become larger and slower when N
increases as the amount and size of the filter coefficients
grow, and are always smaller for parabolically terminated
splines. The customized filter structure yields vastly improved
performance in the FPGA implementations compared to the
direct structure. The main reason is that the FPGA architecture
is optimized for two-input adders such as those found in the
custom filters, rather than the combinational adder trees that
are necessary for multiple operand addition in the direct filter.
Pipelining the zero crossing estimator decreases its critical
path from 6.30 ns to 4.36 ns but doubles its latency so it results
in an overall slower circuit that is only needed if a particularly
high operating frequency is mandatory. The critical path is
located in the FIR filter in most cases; hence, the figures
corresponding to the pipelined version have been omitted.
Like most modern FPGAs, the Stratix IV hosts a number
of DSP blocks containing 18 × 18 bit hardware multipliers.
Circuits were generated with and without allowing the use
of these blocks in order to evaluate their impact. Using the
TABLE IV
IMPLEMENTATION RESULTS ON FPGA
w/ mults w/o mults
Type N Area Mults Delay Area Delay
(ALMs) (DSPs) (ns) (ALMs) (ns)
Linear (ZC only) 27 0 3.13
Linear (full) 32 0 4.45
Cubic (ZC only) 55 0 6.30
Direct FIR filter
Natural 2 160 0 9.45
3 236 1 13.22 295 13.76
4 344 2 15.02 500 14.95
5 452 2 15.13 729 14.95
Parabolically 2 121 0 7.99
terminated 3 216 0 11.99
4 305 2 15.15 445 15.03
5 471 2 15.32 580 15.29
Custom FIR filter
Natural 2 161 0 6.65
3 202 1 6.86 227 7.36
4 254 2 8.96 369 8.91
5 297 2 8.08 510 9.94
Parabolically 2 133 0 6.81
terminated 3 186 0 5.73
4 201 2 7.52 341 8.01
5 276 2 9.75 431 8.45
dedicated hardware multipliers results in large area gains but
does not have a significant impact on circuit speed. However,
we remark that no attempt was made to optimize the FIR filters
by fine tuning the DSP block configuration options.
VI. CONCLUSIONS AND DISCUSSION
A method has been presented for real-time estimation of the
zero crossing locations of sampled signals consisting in two
steps: interpolation by a cubic spline, and iterative estimation
of its root. The interpolation step requires only two small FIR
filters, and a custom filter structure has been proposed for
it that benefits from symmetry properties. The zero location
algorithm only requires two additions per iteration and lends
itself to a very efficient hardware implementation. Optimal
number representations have been determined that make the
precision error negligible. The exposition has focused on
FPGA implementation using fixed-point representation, but the
algorithm can also improve the speed of zero estimation in
other settings such as DSP or software since it does not require
divisions or multiplications by variable factors.
Different algorithm configurations have been studied. A
preliminary evaluation on simulated and real data suggests
that the choice between natural and parabolically terminated
splines does not have a significant impact and that there is no
real gain from using more than 6 nodes for interpolation, so
this amount should be used since the circuits become less
efficient as the number of nodes increases. Note that this
value has also been used without justification in previous
work such as [5]. The results also suggest that this method
might improve time resolution by a setting-dependent factor
between 10 % and 30 % with respect to linear interpolation.
A more comprehensive evaluation of this method and other
tuning options, such as the variation of CFD parameters, in
an experimental setup is currently planned.
APPENDIX
PROOFS OF THE THEORETICAL RESULTS
Theorem 1: For natural and parabolically terminated
splines, there is a 4 × 2N matrix M, which depends only
on N , such that c = My. Moreover, there is an integer D
such that DM consists of integer elements.
Proof: Let the spline s consist of the polynomials pk :
[0, 1] → R that interpolate y at the interval Ik = [k, k + 1],







−(N−2) · · · y′N
]T
. (30)
Notice that pk can be expressed in terms of its values and
those of its first derivative at its end points as
pk(t) = (2yk − 2yk+1 + y′k + y′k+1)t3
+ (−3yk + 3yk+1 − 2y′k − y′k+1)t2 + y′kt+ yk. (31)
In particular, for k = 0 the relationship c = Uy + Vy′ is
obtained with the 4 × 2N matrices U and V given by (10).
Thus, it suffices to show that y′ = Ey for a 2N × 2N matrix
E.
The continuity of s′′ implies coincidence of the second
derivatives at all interpolation nodes except the first and last







k−1 = 3yk+1 − 3yk−1 (32)
for −(N − 2) ≤ k ≤ N − 1, by substitution into (31). Two
conditions remain to be imposed. In the case of natural splines,
s′′ = 0 at the extreme nodes, resulting in
y′−(N−2) + 2y
′
−(N−1) = 3y−(N−2) − 3y−(N−1) (33)
2y′N + y
′
N−1 = 3yN − 3yN−1. (34)
These two equations, together with (32), can be written in
matrix form as Ay′ = By, where A and B are the 2N ×2N
matrices (9) with parameters ω = 2 and ζ = 3. Hence we
finally obtain
M = U + VE = U + VA−1B (35)
where E = A−1B and the expressions of matrices A, B, U
and V are given in (9) and (10).
Similarly, for parabolically terminated splines, the cubic




−(N−2) = 2y−(N−2) − 2y−(N−1) (36)
y′N−1 + y
′
N = 2yN − 2yN−1 (37)
which result in the same equations Ay′ = By but with
parameters ω = 1 and ζ = 2. The same formula (35) is thus
obtained for M.
Finally, notice that taking D = detA ∈ Z results in DM
having only integer elements because it is obtained as addition
and product of integer matrices, since the elements of DA−1
are minors of A which is itself integer. 
Theorem 2: For natural and parabolically terminated
splines, k is even and l is odd and they have the form (20)
and (21), where α is an integer vector, β and γ are integers,
and AK and AL are integers divided by powers of two.
Proof: Let us generalize the notation in the text as follows:
for a m× n matrix X, denote by X× the m× n matrix that
is obtained by reversing the order of its rows and columns,
i.e. such that its (i, j)-th element is x×ij = xm+1−i,n+1−j ,
and let us call a matrix X ×-symmetric or ×-antisymmetric if
X× is equal to X or −X, respectively. Notice that both types
of symmetry are closed under addition and multiplication by
scalars. Notice also that if X is m×n, Y is n×p and Z = XY,
then













so that (XY)× = X×Y×. In particular, if X is square and
invertible then I = I× = (XX−1)× = X×(X−1)× and so
(X−1)× = (X×)−1; hence, both types of symmetry are also
preserved by matrix inversion.
Since A is ×-symmetric and B is ×-antisymmetric, E =
A−1B is ×-antisymmetric. Hence, if eN and eN+1 are the
N -th and (N + 1)-th rows of E then eN+1 = −e×N so that
eN + eN+1 is odd and eN − eN+1 is even. Now, (10) and
(35) imply that
k = − 12Dm2 − 34Dm3 = 14D(eN − eN+1) (39)
is even, whereas
l = − 38Dm3
=
[
z − 34D 34D z
]
− 38D(eN + eN+1) (40)
is a sum of odd vectors and therefore odd itself.
For the second part of the theorem, we will use the following






= detX · detY (41)

















where I are identity matrices of appropriate size. We will also




1 0 0 · · · 0 0
4 1 0 · · · 0 0







0 0 0 · · · 1 0






4 1 0 · · · 0 0
1 4 1 · · · 0 0







0 0 0 · · · 4 1




and denote rn = detHn−1/ detHn; note that detGn = 1.
Let Q = A−1, and express the first N elements in its two


































for m ≤ N , where (41) has been used. It follows that
qN+1,m/qN,m = −rN does not depend on m, and thus the
left halves of rows N and N + 1 of A−1 are proportional.
Since B is tridiagonal, the last N elements in its first N − 1
columns are all zero, and it follows that eN+1,m/eN,m = −rN
for m < N , i.e. the first N − 1 elements in eN and eN+1 are
proportional. By (40) and (39), so are the first N−1 elements
in k and l, as was to be proved. The fact that the elements are
integers except possibly for a negative power of two follows
immediately from (18), (19) and Theorem 1. 
Theorem 3: The absolute error in the evaluation of g is




Proof: Let Ĝn, K̂n, L̂n denote the values of Gn, Kn, Ln
estimated by the algorithm implementation. Then L̂n is Ln
truncated to a multiple of 2−Q and thus
∣∣∣L̂n − Ln
∣∣∣ ≤ 1/2Q
for all n. Similarly,
∣∣∣K̂0 −K0
∣∣∣ ≤ 1/2Q. We now prove that∣∣∣K̂n −Kn
∣∣∣ ≤ 1/2Q−1 by induction: assume it to be true for







L̂n−1 − εn (46)
where εn represents the LSB of the sum which is discarded
when shifting one bit to the right to divide by two; thus, it is

























as was to be shown.








For n = 0, G0 is just a truncated value of g so ε0 ≤ 1/2Q.
For higher n, one of an, bn is equal to µn and the other one



















In particular, for n = 1
∣∣∣Ĝ1(µ1)−G1(µ1)





whereas 2ε0 ≤ 1/2Q−1, so (49) implies ε1 ≤ 3/2Q.




(2n+1 − 1) (52)
for all n by induction. This has already been shown for n = 0
and n = 1. Let n ≥ 2 and assume that (52) holds for n−1 and
n− 2. Notice that one of an−1, bn−1 is equal to µn−1 while
the other one holds its value from the (n− 2)-th iteration; for
the former, the evaluation error is bounded by εn−1, but for
the latter, the stronger bound 2εn−2 applies. Then (50) yields
∣∣∣Ĝn(µn)−Gn(µn)









(2n+1 − 1). (53)
On the other hand
2εn−1 ≤ 2 ·
1
2Q
(2n − 1) < 1
2Q
(2n+1 − 1). (54)
Equations (49), (53) and (54) show that (52) holds for n and
complete its proof. Finally, since g(µn) = 2−nGn(µn), the ab-
solute error in its evaluation is bounded by 2−nεn < 1/2Q−1
for all n, and the first part of the theorem is proved.
For the second part, notice that the algorithm always yields
intervals [an, bn] ⊂ [an−1, bn−1] such that sign ĝ(an) 6=
sign ĝ(bn). If evaluation errors never cause ĝ(µn) to have the
wrong sign, then subintervals are always selected correctly
and the algorithm converges to t?. Otherwise, sign ĝ(µm) 6=
sign g(µm) for some iteration m and t? lies outside of [an, bn]
for all n ≥ m. Even in this case, the algorithm still converges
to some t̂? = lim an = lim bn because an is non-decreasing,
bn is non-increasing, and bn − an = 2−n → 0. Since t? is
the only root of g in [0, 1], sign g(an) = sign g(bn) must hold
for all n ≥ m, so one of g(an), g(bn) undergoes a change
of sign when evaluated; hence, its absolute value cannot be
higher than 1/2Q−1. The continuity of g then implies that∣∣g(t̂?)
∣∣ ≤ 1/2Q−1. 
Theorem 4: If g is monotonic in [0, 1], then |Gn(an)|,
|Gn(bn)| and |Gn(µn)| are all bounded by 8D for all n.
Proof: The proof is based on the following classical result
by Bernstein (Theorem 15.5.3 in [23]): Let p be a polynomial









for all t ∈ [−1, 1]. In our case, n = 3 and the polynomial
is rescaled to [0, 1] so the derivative is doubled and g′(t) ≤
8 max |g| ≤ 8D for all t ∈ [0, 1]. Now, if xn is one of an, bn
or µn, then |xn − t?| ≤ 2−n and so







≤ 2n |xn − t?| ·max |g′| ≤ 8D.  (56)
REFERENCES
[1] G. F. Knoll, Radiation detection and measurement, 3rd ed. John Wiley
& Sons, 2000, pp. 659–679.
[2] T. J. Paulus, “Timing electronics and fast timing methods with scintil-
lation detectors,” IEEE Trans. Nucl. Sci., vol. 32, no. 3, pp. 1242–1249,
Jun. 1985.
[3] D. A. Gedcke and W. J. McDonald, “A constant fraction of pulse height
trigger for optimum time resolution,” Nucl. Instr. Meth. A, vol. 55, pp.
377–380, 1967.
[4] J. M. Monzó, R. Esteve, C. W. Lerche, N. Ferrando, J. Toledo, R. J.
Aliaga, V. Herrero, and F. J. Mora, “Digital signal processing techniques
to improve time resolution in Positron Emission Tomography,” IEEE
Trans. Nucl. Sci., vol. 58, no. 4, pp. 1613–1620, Aug. 2011.
[5] V. Modamio, J. J. Valiente-Dobón, G. Jaworski, T. Hüyük, A. Triossi,
J. Egea, A. Di Nitto, P.-A. Söderström, J. Agramunt Ros, G. de Angelis,
G. de France, M. N. Erduran, S. Ertürk, A. Gadea, V. González,
J. Kownacki, M. Moszynski, J. Nyberg, M. Palacz, E. Sanchı́s, and
R.Wadsworth, “Digital pulse-timing technique for the neutron detector
array NEDA,” Nucl. Instr. Meth. A, vol. 775, pp. 71–76, Mar. 2015.
[6] M. A. Nelson, B. D. Rooney, D. R. Dinwiddie, and G. S. Brunson,
“Analysis of digital timing methods with BaF2 scintillators,” Nucl. Instr.
Meth. A, vol. 505, pp. 324–327, 2003.
[7] E. Delagnes, “What is the theoretical time precision achievable using a
dCFD algorithm?” arXiv:1606.05541, 2016.
[8] P. Guerra, J. E. Ortuño, G. Kontaxakis, M. J. Ledesma-Carbayo, J. J.
Vaquero, M. Desco, and A. Santos, “Real-time digital timing in positron
emission tomography,” IEEE Trans. Nucl. Sci., vol. 55, no. 5, pp. 2531–
2540, Oct. 2008.
[9] R. J. Aliaga, V. Herrero-Bosch, J. M. Monzo, A. Ros, R. Gadea-Girones,
and R. J. Colom, “Evaluation of a modular PET system architecture with
synchronization over data links,” IEEE Trans. Nucl. Sci., vol. 61, no. 1,
pp. 88–98, Feb. 2014.
[10] L. Bardelli, G. Poggi, M. Bini, G. Pasquali, and N. Taccetti, “Time
measurements by means of digital sampling techniques: a study case of
100 ps FWHM time resolution with a 100 MSample/s, 12 bit digitizer,”
Nucl. Instr. Meth. A, vol. 521, pp. 480–492, 2004.
[11] H. S. Hou and H. C. Andrews, “Cubic splines for image interpolation and
digital filtering,” IEEE Trans. Acoust., Speech, Signal Process., vol. 26,
no. 6, pp. 508–517, Dec. 1978.
[12] M. Unser, “Splines: a perfect fit for signal and image processing,” IEEE
Signal Process. Mag., vol. 16, no. 6, pp. 22–38, Nov. 1999.
[13] Y. Huang and Y. Ding, “Cubic B-spline approximation in scalers: effi-
cient design and implementation,” in 11th IEEE Singapore International
Conference on Communication Systems (ICCS), Nov. 2008, pp. 342–
346.
[14] N. Brekke, D. Röhrich, K. Ullaland, and R. Gruner, “Trigger per-
formance simulation of a high speed ADC-based TOF-PET read-out
system,” IEEE Trans. Nucl. Sci., vol. 59, no. 5, pp. 1910–1914, Oct.
2012.
[15] H. Semmaoui, M.-A. Tétrault, R. Lecomte, and R. Fontaine, “Signal
deconvolution concept combined with cubic spline interpolation to
improve timing with phoswitch PET detectors,” IEEE Trans. Nucl. Sci.,
vol. 56, no. 3, pp. 581–587, Jun. 2009.
[16] I. Koren, Computer Arithmetic Algorithms, 2nd ed. A. K. Peters, 2002.
[17] R. L. Burden and J. D. Faires, Numerical analysis, 8th ed. Thomson,
2005, pp. 46–52 and 137–164.
[18] D. Lee, R. C. C. Cheung, W. Luk, and J. D. Villasenor, “Hardware
implementation trade-offs of polynomial approximations and interpola-
tions,” IEEE Trans. Comput., vol. 57, no. 5, pp. 686–701, May 2008.
[19] J. F. Epperson, “On the Runge example,” American Mathematical
Monthly, vol. 94, no. 4, pp. 329–341, Apr. 1987.
[20] C. A. Hall and W. W. Meyer, “Optimal error bounds for cubic spline
interpolation,” Journal of Approximation Theory, vol. 16, no. 2, pp. 105–
122, Feb. 1976.
[21] J. G. Proakis and D. G. Manolakis, Digital signal processing: principles,
algorithms and applications, 4th ed. Prentice Hall, 2007, pp. 565–601.
[22] J. M. Monzó, R. J. Aliaga, V. Herrero, J. D. Martı́nez, F. Mateo,
A. Sebastiá, F. J. Mora, J. M. Benlloch, and N. Pavón, “Accurate
simulation testbench for nuclear imaging systems,” IEEE Trans. Nucl.
Sci., vol. 55, no. 1, pp. 421–428, Feb. 2008.
[23] Q. I. Rahman and G. Schmeisser, Analytic theory of polynomials.
Oxford University Press, 2002.
(c) 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be 
obtained for all other users, including reprinting/ republishing this material for advertising or 
promotional purposes, creating new collective works for resale or redistribution to servers or 
lists, or reuse of any copyrighted components of this work in other works. 
Final version available at http://dx.doi.org/10.1109/TNS.2017.2721103 
This preprint includes an appendix with the proofs of all theoretical results that are stated 
throughout the text. Please note that the appendix is not included in the final version of the 
paper. 
 
 
