FPGA-Embedded Linearized Bregman Iterations Algorithm for Trend Break
  Detection by Calliari, Felipe et al.
1FPGA-Embedded Linearized Bregman Iterations
Algorithm for Trend Break Detection
Felipe Calliari, Gustavo C. Amaral, Michael Lunglmayr Member, IEEE
Abstract
Detection of level shifts in a noisy signal, or trend break detection, is a problem that appears in several research fields, from
biophysics to optics and economics. Although many algorithms have been developed to deal with such problem, accurate and low-
complexity trend break detection is still an active topic of research. The linearized Bregman Iterations have been recently presented
as a low-complexity and computationally-efficient algorithm to tackle this problem, with a formidable structure that could benefit
immensely from hardware implementation. In this work, a hardware architecture of the Linearized Bregman Iterations algorithm
is presented and tested on a Field Programmable Gate Array (FPGA). The hardware is synthesized in different sized FPGAs and
the percentage of used hardware as well as the maximum frequency enabled by the design indicate that an approximately 100 gain
factor in processing time, with respect to the software implementation, can be achieved. This represents a tremendous advantage
in using a dedicated unit for trend break detection applications.
Index Terms
Linearized Bregman Iterations, Trend Break Detection, FPGA.
I. INTRODUCTION
TREND BREAK detection in the presence of noise is a broad problem that can be found across different research fields[1]–[5]. For that reason, several different methodologies have been proposed in the literature [6]–[8], with the ones that
make use of `1 regularization to counter the problem’s inherent high-dimensionality arguably figuring as the most successful
ones [9], [10]. Such an approach is required for highly reliable estimation results [8]. Even though such regularization allows
the problem to be solved in a computationally efficient manner (usually associated to a complexity which is proportional to
a polynomial function of the number of inputs), the fact that a computer can solve the problem does not necessarily mean
that the result is achieved quickly, practically speaking. In certain contexts, achieving elapsed algorithm times in the order of
seconds as opposed to minutes may yield a substantial impact on the application [11], [12].
It is a widespread notion that certain problems, despite their complexity, may be accelerated depending on the implementation;
parallel programming, in which several parts of the same procedure are processed independently and simultaneously, is one of the
most celebrated examples [13]. Field Programmable Gate Arrays (FPGAs) are extremely versatile hardware structures that offer
[14]–[16]: great flexibility to design high speed high-density digital hardware; easiness of programability and reconfiguration;
energy efficiency; high resource utilization; low cost; and the possibility to combine parallel processing structures with serial
control units. FPGAs haven been used as a versatile computing platform accelerating algorithms through dedicated and carefully
designed architectures in a wide range of fields such as cryptography [17], image processing[18], and machine learning [19].
Recently, linearized Bregman Iterations (LBI), a class of implementation-efficient and low-complexity algorithms, has been
presented as an extremely attractive solution for trend break detection [1]. In particular, both the structure of the trend break
detection problem and of the LBI algorithm’s allow for simple hardware units, relying mainly on adders and efficient memory
management, to conduct the core procedure, thereby avoiding hardware-complex multiplication and division operations [20].
In this work, the hardware implementation of the LBI algorithm is studied in depth and is simulated and synthesized for
different FPGAs. A novel hardware architecture is presented and its main processing units are discussed. VHDL simulation
environments enables a step-by-step comparison and validation of the processing stages referenced by the computer algorithm
implementation [1]. Hardware synthesis results allow determination of both device usage with different FPGA sizes and
maximum clock frequency; the former, combined with the average number of clock cycles per iteration loop, make total
processing time calculation possible for different problem instance sizes. A reduction factor on the elapsed algorithm time of
approximately 100 is achieved, which represents a substantial upgrade and warrants usage of dedicated hardware for trend
break detection.
F. Calliari is with the Center for Telecommunications Studies, Pontifical Catholic University of Rio de Janeiro, RJ, Brazil (e-mail:
felipe.calliari@opto.cetuc.puc-rio.br).
M. Lunglmayr is with the Institute of Signal Processing, Johannes Kepler University, Linz, Austria (e-mail: michael.lunglmayr@jku.at).
G. C. Amaral is with the Center for Telecommunications Studies, Pontifical Catholic University of Rio de Janeiro, RJ, Brazil and with the QC2DLab, Kavli
Foundation, Technical University of Delft, The Netherlands (e-mail: gustavo@opto.cetuc.puc-rio.br).
Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained
from the IEEE by sending a request to pubs-permissions@ieee.org.
ar
X
iv
:1
90
2.
06
00
3v
1 
 [e
es
s.S
P]
  1
5 F
eb
 20
19
2Due to the limited memory resources available in an embedded processing unit as opposed to a standard personal computer,
different data formatting was employed in the hardware implementation. This not only allows for increased data length
manipulation inside the embedded unit (with estimated ≥ 60000 data points for a mid-sized FPGA), but also avoids the
usage of complex and often slower arithmetic units to handle floating point representation [21]. The comparison between
results using the 20-bit fixed-point (SFIXED) [22] format and standard 64-bit double representation have been conducted in
this work and negligible discrepancies were observed.
The paper is divided as follows. In Section II, a brief review of the LBI algorithm for trend break detection is performed,
including the structure of the candidate matrix and the pseudocode based on which the hardware architecture is developed.
Section III presents the digital hardware architectural concept as well as focused descriptions of its main units; the estimated
number of clock cycles until the algorithm elapses is derived based on this architecture. In Section IV, comparative results
between the simulated hardware implementation and the Julia code of the LBI are discussed. Synthesis parameters for two
target FPGAs (ALTERA CYCLONE V and ALTERA STRATIX V) are also reported. Case studies and implementation results are
discussed in Section V, and Section VI concludes the paper.
II. THE LINEARIZED BREGMAN ITERATIONS ALGORITHM FOR TREND BREAK DETECTION
Under the assumption that the trend break detection problem is a sparse one, i.e., the number of candidate vectors that
describe the signal of interest is much smaller than the number of observations, it can be cast into the combined `1/`2 problem
of the form [1]:
min
β
λ||β||1 + 1
2
||β||22 s.t. Aβ = y, (1)
where A is the dictionary, with each candidate vector stored in a column, β is the vector containing the coefficients of the
weighted linear combination of dictionary vectors that will approximate the signal of interest represented by the data series y,
and λ is a parameter that adjusts the weight of the `1 versus the `2 norm. Adaptation of the Linearized Bregman Iterations
algorithm to trend break detection has been presented in [1] in a context where a linear trend is also expected in the signal of
interest. In order to simplify and generalize the implementation, this linear trend is not considered in the current implementation.
Incorporating the linear trend in the proposed architecture can, however, be easily done.
Throughout the manuscript, the length, in data points, of the signal of interest y will be defined as N , i.e., y and β are
N-dimensional vectors and A is an N × N matrix. The Linearized Bregman Iterations algorithm has a periodic structure,
involving, in a single iteration, an approximate gradient descent (AGD) followed by a non-linear shrink function of the form:
shrink (v, λ) = max (|v| − λ, 0) · sign (v) [23]. Due to the special structure of the candidate dictionary matrix A for the trend
break detection problem, namely:
A =

1 0 0 · · · 0 0
1 1 0 · · · 0 0
1 1 1 · · · 0 0
...
...
...
. . .
...
...
1 1 1 · · · 1 0
1 1 1 · · · 1 1

, (2)
its storage is not necessary for the AGD calculation, as the latter can be rewritten as
v(i+1) = v(i) +
ak
||ak||22
(
yk − aTkβ(i)
)
= v(i) +
ak
||ak||22
(
yk −
k+1∑
s=1
βs
) (3)
where the ak represent columns of the candidate matrix, the superscripted i represents the iteration index, and the index
k ∈ [1 : N ] controls the cyclic re-use of rows of A as the iteration index evolves, i.e., k = mod ((i− 1) , N) + 1.
The ak, in turn, have an interesting structure that allows the AGD to be further optimized and the calculation to be performed
only for those indices where ak,j 6= 0. In other words (and also considering the fact that ||ai||22 = k),
v
(i+1)
j =
{
v
(i)
j +
1
k
(
yk −
∑k+1
s=1 βs
)
, ak,j = 1
v
(i)
j , ak,j = 0
, (4)
which, considering computational implementation, translates into accessing and manipulating only those values of vector v(k)
up to index j. A final observation of the structure of matrix A (namely, the fact that it is a square matrix) reveals that a single
index k is sufficient to control an iteration of the algorithm. The resulting procedure, presented as a pseudocode in Algorithm
1, efficiently solves the trend break detection problem with low memory usage.
3Algorithm 1 Linearized Bregman Iteration for Trend Break Detection
Input: Measurement vector y, λ, βstart, vstart, L
Output: Estimated βˆ
1: β(0) ← βstart
2: v(0) ← vstart
3: i← 1
4: while i < L do
5: k ← mod ((i− 1), N) + 1 . cyclic re-use of rows of A
6: µk ← 1k
7: e←
(
yk −
∑k+1
s=1 β
(i)
s
)
. instantaneous error with inner product
8: d← µke
9: for j = 1..k do
10: v(i+1)j ← v(i)j + d
11: β(i+1)j ← shrink
(
v
(i+1)
j , λ
)
12: end for
13: i← i+ 1
14: end while
It is important to note that only the computation-heavy part of the algorithm is depicted in the pseudocode of Algorithm 1.
Its purpose is to identify the relevant non-zero values of the βˆ vector that compose the output or, in other words, reduce the
dimension of the detection space focusing on the subspace spanned by the relevant candidate vectors. After this procedure, it
is usual to perform an Ordinary Least Square (OLS) in this reduced subspace in order to remove any biasing introduced by
the algorithm; operating on the reduced subspace found by the LBI drastically reduces the complexity of the OLS. This step,
which involves matrices transposing and inverting, can be efficiently conducted in a standard personal computer and, even
though this could also be implemented in the same hardware structure that contains the core algorithm [1], the goal of this
work is to present the latter and the OLS step is left as a post-processing to be performed in a different processing unit.
Also left as a pre-processing step is the scaling of the data vector y, which is necessary to ensure the correct behaviour
of step 7 in Algorithm 1 when using the 20-bit fixed-point format; namely, that no overflow of the arithmetic dynamic range
is observed when performing the summation of β values. The scaling is intimately connected to the available arithmetical
dynamic range, which, in turn, is connected to the memory resources of the FPGA board, thereby constituting a design-related
compromise relationship: in case of over-scaling, the arithmetic dynamic range will be hindered; to overcome this, a higher
number of bits can be assigned to the data points, which then increases the resources necessary in the FPGA board; on the
other hand, in case of under-scaling, the results may overflow, creating errors that can jeopardize the algorithm’s convergence.
A scaling factor consistent with the algorithm’s convergence can be determined according to the following considerations.
The major source for overflows is the sum calculation of the β values in line 7 of Algorithm 1. Empirical tests conducted
based on the testbench developed in [1] indicated that a scaling based on dividing the data vector y by its maximum value
allowed to obtain the results shown in Sect. IV without harming overflow effects. This is due to the firmly non-expansive
property [24] of the shrink function as well as the negative feedback of the error between
∑k+1
s=1 β
(i)
s and yk (line 7 of Algorithm
1). To clarify the negative feedback effect, one could multiply both sides of line 7 by -1, i.e., −e = yk −
∑k+1
s=1 β
(i)
s . This
procedure would require µk, in line 10, to also be multiplied by -1.
The algorithm can be then be interpreted as a stabilizing loop on the values of v(i)j with the mentioned negative feedback
on the deviation between the sum of β values (functions of v(i)j ) and the corresponding yk, which causes overshoots of the
sum of β values to be immediately corrected in the next iterations. This leads to the fact that, even in worst case scenarios
(multiple up and down trend breaks in the measurements), the sum of β values scarcely goes above unit, considering the
above mentioned normalization procedure. Moreover, even though its ratio of occurrence is negligible, in the case an overshoot
occurs, the excess value would be small thereby not compromising the convergence of the algorithm, as the performance of
the quantized version in Sect. IV demonstrates. The closeness of these results to the ones obtained using double precision
floating point shows that, practically, harming effects due to overflow can be neglected.
Another important observation is that the stopping criterion developed in [1] has been presently neglected, and a fixed
number of iterations has been considered; the reason for that is twofold. First, the context in [1] was that of fiber fault location,
and the stopping criterion was developed based on a specific phenomenological observation that cannot be extended to more
general trend break detection problems. Secondly, with a fixed number of iterations, the standard 64-bit double results can be
compared to those of our 20-bit fixed point implementation in a fair scenario. In other words, this quantization could potentially
influence the effect of the stopping criterion making the comparison biased.
4Fig. 1. Hardware implementation of a single BRAM slice in the LBI core structure. The parallel structures are pictorially depicted in three-dimensional depth.
The Pipelined Multiplexer Tree (PMT) is synchronized to the PAT such that, after a summation, the correct value of y is ready for subtraction. The value of
µk – refer to Algorithm 1 – is calculated in a pipelined CORDIC structure.
III. FPGA ARCHITECTURE
Even though the iterative nature of the Linearized Bregman Iterations algorithm does not allow for parallelization over
the iterations, two core operations that permit parallel pipelining can be identified within a single iteration, as presented in
Algorithm 1: the summation of k entries of the vector β; and the processing (including update, shrinkage, and storage) of
vectors v and β. By instantiating parallel memory structures, both operations, that represent computational bottlenecks of the
algorithm’s iterations, can be optimized. On one hand, the summation can be efficiently performed in a so-called parallel adder
tree (PAT) (logarithmic number of time steps) given that the data can be accessed in parallel. On the other, parallel processing
of the data in vectors v and β can also be accelerated if storage can be performed in parallel. Since the algorithm relies on
the computation of several iterations to converge, optimizing these two procedures allows for substantial gains in processing
time.
A. Memory Structure
In order to harness the parallel speedup of the PAT, the entries of vector β must also be accessed in parallel, which can be
accomplished through the instantiation of parallel Block RAMs (BRAMs). The data storage is structured as follows:
BRAM(1) BRAM(2) BRAM(M)
t
x

β[TM + 1] β[TM + 2] · · · β[TM +M ]
...
... · · · ...
β[2M + 1] β[2M + 2] · · · β[2M +M ]
β[M + 1] β[M + 2] · · · β[M +M ]
β[1] β[2] · · · β[M ]
 ,
−−−−−→
m
(5)
where M is the number of parallel BRAMs available in the FPGA. In such a structure, a single arbitrary BRAM, say m, will
contain entries:
[m+ tM ]∀t ∈ [0;T ],m ∈ [1;M ] : T = ⌈NM ⌉,
where the ceiling operator is denoted as d·e.
5Vector β, however, is not the only vector stored throughout processing: vectors y and v are also necessary. Since all these
contain the same number N of entries, the data is sectioned such that the address depth of each BRAM is divided in three
slices with address pointers (ap) associated with β (βap), y (yap), and v (vap); βap is arbitrarily set to zero. Under this rationale,
entries of vectors v and y would appear at addresses t+vap and t+yap, respectively, even though, for simplicity, only entries
of vector β are shown in Eq. 5. Using this data storage structure, all positions [tM + 1 : tM +M ] of either vectors can be
accessed from parallel BRAMs within a clock cycle; such data segment will henceforth be referred to as a parallel row, with
t the parallel row pointer following its definition in Eq. 5. A block diagram of the digital hardware architecture depicting a
single BRAM and including the major structures of the LBI algorithm hardware implementation is presented in detail in Fig. 1.
Apart from the PAT, a Pipelined Multiplexer Tree (PMT) is used to select the specific value of the data series y, namely
y (k), from which the result of the beta sum is subtracted from (refer to line 7 of Algorithm 1). The architecture of the PMT
is such that the number of stages meets that of the PAT, so synchronization between the two outputs is naturally ensured.
Furthermore, the selection key that acts on each stage of the PMT is derived from the cyclic iteration index, k. The value of
µk =
1
k , which involves a computation-heavy division, has been delegated to a pipelined CORDIC structure; the pipeline’s
stages are pre-filled before the iterations are started and stage propagation is enabled at each new iteration, ensuring that the
correct value is always available.
Based on this memory structure, the amount of clock cycles necessary to complete the calculation of d depends both on the
number of data points and on the depth of the PAT, which, in turn, depends on the number of instantiated (or available) parallel
BRAMs in the hardware structure. For an arbitrary iteration cycle, with cyclic index k, the equation that relates these values
to the total number of clock cycles is C ′r =
⌈
k
M
⌉
+ dlog2Me, where the subscript refers only to the reading and processing
of β values up to the output of the PAT. Taking into account also the subsequent subtraction and multiplication steps – refer
to Fig. 1 –, each taking one clock cycle, the total number of clock cycles amounts to Cr = tˆ+ dlog2Me+2, where tˆ =
⌈
k
M
⌉
denotes the maximum value of t during an iteration.
B. PAT Input Control
Even though a parallel row is accessible at each clock cycle due to the parallel instantiation of the BRAMs, clearly not all
values in the row will be used during a given iteration with index k. For that reason, a multiplexer (PAT input MUX in Fig. 1)
is connected immediately after the BRAM ouput with its remaining input connected to a null value. Due to the additive identity
property of 0, the output of the multiplexer can be directed to the PAT without corruption of the result while accomodating
the parallel storage structure.
The selection signal that controls the PAT input MUX is derived based on the fact that replacing BRAM outputs by 0 is only
necessary during the last parallel row access, i.e., when t =
⌈
k
M
⌉
= tˆ. Selection is thus based on an auxiliary counter that
records the aforementioned value and on a so-called unary code (or thermometer code), which encodes the last column index
that contributes to the sum. Fig. 2 depicts the control unit responsible for handling the BRAM input address and selection of
PAT input.
C. β and v Storage
An indispensible step of the algorithm is the correct storage of the vectors β and v after processing. According to Algorithm
1, all the elements of vector β are processed by the shrink function right after processing of the vector v. As previously pointed
out, acceleration of the storage procedure tackles one of the algorithm’s bottlenecks: the number of clock cycles taken by the
storage procedure scales the total number of clock cycles necessary for the algorithm to elapse. Both the fact that the BRAMs
allow for writing and reading from two independently addressed ports and that if λ is set to zero in the shrink function it
implements the identity transformation have been harnessed to perform data storage optimization, as it is detailed as follows.
One of the BRAM’s ports (taken as B without loss of generality in Fig. 1) is responsible for reading the values of v from
the memory while the other port (A) is responsible for storing the values of β and v. The addresses are controlled such that,
on the first clock cycle, values of v in a parallel row are read (through port B), processed in the 20-bit full adder, and sent to
the shrink function with λ = 0. Therefore, at the following clock cycle, the stable value of v can be stored (through port A)
at the same time as the value of λ is changed in the shrink function and processes the values of v being read (through port
B). In the third clock cycle, a stable value of β is stored (through port A) while the values of v from the following parallel
row are accessed (through port B), initiating a new storage cycle for a subsequent parallel row.
D. Total Clock Cycle Estimation
The net amount of clock cycles per parallel row storage is, thus, two if one does not compute the very first and last accesses;
therefore, the number of clock cycles necessary at an arbitrary iteration with cyclic index k is Cs = 2
⌈
k
M
⌉
+ 2 = 2tˆ+ 2. Two
extra clock cycles are also necessary for the hand-shaking protocol between the iteration control unit (presented in Fig. 2) –
whose control over the BRAMs address is releaved – and the writing unit that takes over control and stores vectors β and v,
i.e., C ′s = 2tˆ + 4. Combining this value with the previously determined Cr for the reading and processing of β values in the
PAT, the total number of clock cycles spent in an arbitrary iteration with index k is CT = 3(tˆ+ 2) + dlog2Me.
6Fig. 2. Iteration control architecture. A NEW ITER flag generated by a higher-level unit and the clock signal are the necessary inputs. The cyclic iteration
index counter k is implemented through a simple counter with parallel load dependent on the comparison with the signal length N . The unary code propagates
at each new iteration and, when the M + 1th stage is reached, it auto-resets while also incrementing the tˆ counter. The unary code acts on PAT input MUX
when t = tˆ; only the selection for m = 1 is depicted for clarity. The different address pointers are combined with the counter t to produce the correct BRAM
address.
The total number of clock cycles taken by the algorithm to elapse can be easily derived from this equation by summation
over L, the total number of iterations:
C=F+
L∑
i=1
[
3
(⌈
((i− 1)%N) + 1
M
⌉
+2
)
+dlog2Me
]
. (6)
The factor F in Eq. 6 accounts for pre- and post-processing instructions performed by the control unit such as: master resets;
granting control over the BRAMs; and, most importantly, preemptively filling up the pipelined CORDIC that calculates µk.
However, as will be described in the next Section, the value of F is much smaller than the total number of clock cycles taken
by the core procedure.
Fig. 3 presents the dependence of the total number of clock cycles until the algorithm elapses with both the number of
available parallel BRAMs for a fixed number of data points and with the number of data points for a fixed number of available
BRAMs. In both cases, the iterations per data point (defined as L/N ) is fixed at 650, a realistic value which will be discussed
further in the next Section. Considering a realistic scenario, where the maximum clock frequency achievable in the target FPGA
is around 100 MHz, a 10000-point data series would be processed in less than two seconds, which represents an approximately
100 gain factor when compared to the Julia implementation reported in [1].
IV. VALIDATION AND SYNTHESIS
Comparison between the software-defined hardware implementation of the Linearized Bregman Iterations algorithm using the
architecture presented in the previous Section and its software implementation counterpart [1] permits validating the former. In
order to provide a bit-true validation, the SFIXED standard used in the VHDL simmulation was implemented in Julia allowing
one to accompany, step-by-step, the evolution of the algorithm on both platforms and identify any discrepancies. Due to the
fact that the rounding procedure is the same for both, no such discrepancies were observed; the fixed point Julia simulation
code outputs exactly the same values as of the hardware implementation. The validation of the hardware implementation and
the demonstration of its equivalence to the Julia SFIXED implementation creates a versatile tool to estimate the performance
of the FPGA results on a software environment.
7Fig. 3. Estimate of total number of clock cycles necessary for the algorithm to elapse considering the presented architecture. (a) Dependence with respect to
number of available parallel BRAMs for a fixed number of 10000 data points and 650 iterations per data point. The gray-shaded areas highlight the transition
between powers of 2, which manifests as sharp increases in the calculated value of C. (b) Dependence with respect to the number of data points for a fixed
number of available BRAMs and 650 iterations per data point. The highlighted curve corresponds to 2000 BRAMs, which is the maximum available for the
largest target FPGA studied here.
For the simulation of the hardware implementation, the MODELSIM VHDL simulation environment was employed. In such
an environment, both the evolution of the algorithm as well as the number of clock cycles necessary to run each iteration can be
extracted, so the results of Eq. 6 can also be ascertained. Even though an extremely reliable and versatile tool, VHDL simulation
offers a drawback in terms of running time: simulating a high number of BRAMs or a large dataset can be extremely time-
consuming. For this reason, a predetermined set of parameters (data points, iterations, and number of BRAMs) were chosen
to showcase the validity of the hardware implementation.
Table I contains the information regarding the simulation of the hardware structure under the different parameter conditions,
where B stands for the number of BRAMs, and L and N follow the previously defined notation. The estimated number of
clock cycles based on Eq. 6 that appear in Table I take into account the required F = 21 extra clock cycles for initialization
8and control. The asterisk in the last column indicates that 2048 BRAMs is actually above the 2000 maximum available number
of BRAMs with 20 bit-wide data entries in the target ALTERA STRATIX V FPGA.
TABLE I
VHDL SIMULATION – VALIDATION
N
=
10
B
=
4
L
=
10
N
=
10
B
=
4
L
=
10
0
N
=
10
0
B
=
4
L
=
10
0
N
=
10
0
B
=
4
L
=
10
00
N
=
10
00
B
=
4
L
=
10
00
C (Eq. 6) 155 1361 4721 47021 384521
Clk. Cyc. 155 1361 4721 47021 384521
N
=
10
00
B
=
12
8
L
=
10
00
N
=
50
00
B
=
10
24
L
=
50
00
N
=
10
00
0
B
=
10
24
L
=
10
00
0
N
=
15
00
0
B
=
10
24
L
=
15
00
0
N
=
15
00
0
B
=
20
48
∗
L
=
15
00
0
C (Eq. 6) 26269 124301 321781 592461 442989
Clk. Cyc. 26269 124301 321781 592461 442989
The results of Table I are in excellent agreement with the expectations, which translate into: validation of the hardware
implementation as well as demonstration of its equivalence to the Julia SFIXED software implementation; and verification of
the validity of Eq. 6, which, in turn, is a validation of the results of Fig. 3. The concluding step of this Section is, then, to
synthesize the hardware so that the maximum achievable clock frequency can be extracted. As previously commented, the
clock frequency, combined with the total number of clock cycles necessary for the algorithm to elapse, can be used to estimate
the amount of time the algorithm will take to execute. Furthermore, as a by-product of the synthesis results, it is possible
to assess the percentage of FPGA resources occupied by the architecture, which, in turn, provides the means for selecting
the target FPGA for hardware implementation. The results are summarized in Table II, where the INTEL QUARTUS PRIME
synthesis software was used.
TABLE II
TARGET FPGAS SYNTHESIS AND TIMING RESULTS
PROCESSING TIMES CALCULATED FOR N = 10000 AND WITH L = 6.5× 106
Stratix V: 5SGSMD5K2F40C2 Cyclon V: 5CSXFC6D6F31C6
BRAMs ALMs Registers Memory [Bits] Max. Clk. Proc. ALMs Registers Memory [Bits] Max. Clk. Proc.
out of 172,600 out of 41,246,720 Freq. [MHz] Time [s] out of 41,910 out of 5,662,720 Freq. [MHz] Time [s]
1024 135,571(79%) 88,938 20,971,520(51%) 109.9 1.91 −1 −1 −1 −1 −1
512 68,135(39%) 45,595 10,485,760(25%) 166.97 1.78 −1 −1 −1 −1 −1
256 36,098(21%) 25,147 5,242,880(13%) 173.28 2.78 36,548(87%) 24,741 2,621,440(46%) 81.13 5.94
128 19,196(11%) 13,839 2,621,440(6%) 182.12 4.70 20,178(48%) 13,475 1,310,720(23%) 95.37 8.98
16 4,982(3%) 4,012 327,680(< 1%) 182.35 33.83 5,000(12%) 3,928 163,840(3%) 92.94 66.37
4 3,435(2%) 2,925 81,920(< 1%) 187.86 130.08 3,427(8%) 2,929 40,960(< 1%) 91.73 266.40
1 : Design too large to fit into device.
Up to 1024 BRAMs could be instantiated in the STRATIX V, with as high as 109 MHz maximum clock frequency yielding
a 1.91 second processing time for 10000-long data series considering 650 iterations per sample (overall, 6.5 million iterations).
This result represents a 100 speedup factor in processing time with respect to software implementations under the same data
conditions but running on a INTEL XEON CPU E5-2690 V4 at 2.6 GHz and 512 GB RAM, a major achievement, which
advocates for the dedicated hardware solution for trend break detection problem. It is also interesting to note that, for a smaller
FPGA, the CYCLON V, a ∼ 16 gain factor with respect to the software implementation was achieved, which is interesting in
the sense that smaller FPGAs exhibit, generally, significantly lower costs, but could still deliver processing times in the range
of a few seconds.
The results from Table II also indicate that a compromise between instatiation of higher number of BRAMs (which reduces
the total number of clock cycles necessary for the algorithm to elapse as determined by Eq. 6) and the maximum achievable
clock frequency exists. In fact, the processing time for 512 instantiated BRAMs was lower than that of 1024 because the gain
in clock frequency superseeded that of the reduction of clock cycles. It should be mentioned, however, that the place and
9routing problem is an extremely complex one and the algorithms that solve it may not always reach the best possible solution,
so the clock frequency values obtained should be interpreted as lower bounds. Finally, to put the results into an application
proned perspective, fiber profiles as long as 50 km could be analyzed in search for breaks in under 10 seconds [1].
V. CASE STUDY RESULTS
Validation of the software Julia SFIXED implementation performed in Section IV allows one to investigate aspects of
the hardware implementation in a more suitable simulation environment. This is important due to the amount of simulation
workload necessary to yield statistically relevant results. Point in fact is the investigation of the quality of estimation as L, the
total number of iterations, increases, for which the results of over 15000 different testbench simulated profiles, analyzed with
the hardware-validated bit-true Julia SFIXED implementation, is presented in Fig. 4.
The objective of this extensive study is twofold: firstly, as it was already mentioned, to empirically determine the interval of
number of iterations per sample for which the estimation quality reaches a reliable level; and, secondly, to compare the quality
of estimation between the 64-bit double implementation, and the 20-bit SFIXED implementation for profiles with different
number of data points and with different number of iterations per sample. The profiles analyzed with the two versions of the
algorithm (64-bit double and 20-bit SFIXED) were created following the rationale of the testbench profile generation presented
in [1], i.e., noiseless sparse vectors with randomly sorted magnitudes (βideal) are used to create profiles using the candidate
matrix A, to which white gaussian noise is added. The result of the estimation is βest, which is then compared to βideal using
the squared error norm; the closer to zero error between the estimated and ideal vectors, the better is the estimation. This
procedure is similar, but not equivalent, to the one described in [1] since no slopes are considered and the noise parcel is not
derived based on any phenomenological observations.
From the results of Fig. 4, it is possible to conclude that the differences in estimation between the SFIXED and 64-
bit double implementations are negligible, i.e., the hardware implementation will have no problems achieving comparable
estimation accuracy as the software version, e.g. in [1]. Furthermore, it becomes clear that, after 450 iterations per sample, the
accuracy stabilizes, with a averaged squared error norm value in the order of 0.5. This indicates that the results from Fig. 3
with 650 iterations per sample are indeed realistic. Finally, this result is also useful when interpreted along with those of Fig.
3: a compromise between the total number of clock cycles before the algorithm elapses and the quality of the estimation can
be found and, in specific cases, one of these can be sacrificed (increasing the processing time or allowing for a worst estimate)
to boost the other (faster results or extremely precise estimation).
VI. CONCLUSIONS
Trend break detection, or level-shift detection, is a problem that permeates several science fields, and an efficient, accurate,
and highly reliable processing unit to solve it is desirable. Combining the flexible hardware design tools of Field Programmable
Gate Arrays and the efficient Linearized Bregman Iterations algorithm allowed the development of such unit. The manipulation
of the data storage structure as well as the algorithm flow and control in hardware yielded an up to 100 times gain in
processing time when compared to a personal computer while maintaining all the observed qualities of the algorithm, such as
low estimation error and high level-shift detection precision.
The proposed hardware architecture can be implemented in different sized-FPGAs, with the main distinctions being the
amount of available dual-block RAMs and maximum achievable clock frequency, characteristics which are hardware-dependent.
On a middle-sized chip such as the ALTERA CYCLONE V, the hardware supports up to 256 parallel BRAMs with a maximum
clock frequency of 81 MHz and a total processing time of 6 seconds. Such processing prowess can be directed towards on-line
data supervision such as optical fiber monitoring, which constitutes an exciting future point of investigation. Furthermore,
incorporating advanced signal processing techniques into the the hardware design in order to eliminate any pre-processing step
while increasing the convergence speed is also a sought-after goal for future studies.
ACKNOWLEDGMENT
Financial support from brazilian agency CNPq is acknowledged by F. Calliari. This work has been supported by the COMET-
K2 “Center for Symbiotic Mechatronics” of the Linz Center of Mechatronics (LCM) funded by the Austrian federal government
and the federal state of Upper Austria.
REFERENCES
[1] M. Lunglmayr and G. C. Amaral, “Linearized bregman iterations for automatic optical fiber fault analysis,” IEEE Transactions on Instrumentation and
Measurement, vol. –, no. –, p. to appear, 2018.
[2] M. Basseville and A. Benveniste, “Design and comparative study of some sequential jump detection algorithms for digital signals,” Acoustics, Speech
and Signal Processing, IEEE Transactions on, vol. 31, no. 3, pp. 521–535, 1983.
[3] L. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286,
1989.
[4] M. Souto, J. D. Garcia, and G. C. Amaral, “l1 adaptive trend filter via fast coordinate descent,” in Sensor Array and Multichannel Signal Processing
Workshop (SAM), 2016 IEEE. IEEE, 2016, pp. 1–5.
10
Fig. 4. Estimation results for both SFIXED and floating point implementations for different profile lengths and different numbers of iterations per sample;
L is simply this value multiplied by the number of points (N) in the profile. The results are divided into two panels for ease of visualization.
[5] D. A. Lorenz, S. Wenger, F. Scho¨pfer, and M. Magnor, “A sparse kaczmarz solver and a linearized bregman method for online compressed sensing,” in
Image Processing (ICIP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1347–1351.
[6] W. S. Rea, M. Reale, C. Cappelli, and J. A. Brown, “Identification of changes in mean with regression trees: an application to market research,”
Econometric Reviews, vol. 29, no. 5-6, pp. 754–777, 2010.
[7] M. Storath, A. Weinmann, and L. Demaret, “Jump-sparse and sparse recovery using potts functionals,” IEEE Transactions on Signal Processing, vol. 62,
no. 14, pp. 3654–3666, 2014.
[8] J. P. von der Weid, M. H. Souto, J. D. Garcia, and G. C. Amaral, “Adaptive filter for automatic identification of multiple faults in a noisy otdr profile,”
Journal of Lightwave Technology, vol. 34, no. 14, pp. 3418–3424, 2016.
[9] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky, “`1 Trend Filtering,” SIAM review, vol. 51, no. 2, pp. 339–360, 2009.
[10] M. Lunglmayr and M. Huemer, “Efficient linearized bregman iteration for sparse adaptive filters and kaczmarz solvers,” in Sensor Array and Multichannel
Signal Processing Workshop (SAM), 2016 IEEE. IEEE, 2016, pp. 1–5.
[11] F. Calliari, L. E. Y. Herrera, J. P. von der Weid, and G. C. Amaral, “High-dynamic and high-resolution automatic photon counting otdr for optical
fiber network monitoring,” in Proceedings of the 6th International Conference on Photonics, Optics and Laser Technology - Volume 1: PHOTOPTICS,,
INSTICC. SciTePress, 2018, pp. 82–90.
[12] L. E. Herrera, F. Calliari, J. D. Garcia, G. C. do Amaral, and J. P. von der Weid, “High resolution automatic fault detection in a fiber optic link via
photon counting otdr,” in Optical Fiber Communication Conference. Optical Society of America, 2016, pp. M3F–4.
[13] D. E. Rumelhart and J. L. McClelland, “Parallel distributed processing: explorations in the microstructure of cognition. volume 1. foundations,” 1986.
[14] J. Xie, P. K. Meher, M. Sun, Y. Li, B. Zeng, and Z.-H. Mao, “Efficient fpga implementation of low-complexity systolic karatsuba multiplier over gf
(2m) based on nist polynomials,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64, no. 7, pp. 1815–1825, 2017.
[15] P. Greisen, M. Runo, P. Guillet, S. Heinzle, A. Smolic, H. Kaeslin, and M. Gross, “Evaluation and fpga implementation of sparse linear solvers for
video processing applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 8, pp. 1402–1407, 2013.
[16] L. Cong and W. Xiaofu, “Design and realization of an fpga-based generator for chaotic frequency hopping sequences,” IEEE Transactions on Circuits
and Systems I: Fundamental Theory and Applications, vol. 48, no. 5, pp. 521–532, 2001.
[17] J. Hu, W. Guo, J. Wei, and R. C. C. Cheung, “Fast and generic inversion architectures overGF(2m)using modified itoh-tsujii algorithms,” IEEE
Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 4, pp. 367–371, April 2015.
11
[18] F. Cardells-Tormo and P. . Molinet, “Area-efficient 2-d shift-variant convolvers for fpga-based digital image processing,” IEEE Transactions on Circuits
and Systems II: Express Briefs, vol. 53, no. 2, pp. 105–109, Feb 2006.
[19] C. Zhang, , , and and, “Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks,” in 2016 IEEE/ACM
International Conference on Computer-Aided Design (ICCAD), Nov 2016, pp. 1–8.
[20] U. Meyer-Baese, Digital signal processing with field programmable gate arrays. Springer Verlag, 2004.
[21] J. Hormigo and J. Villalba, “Hub floating point for improving fpga implementations of dsp applications,” IEEE Transactions on Circuits and Systems
II: Express Briefs, vol. 64, no. 3, pp. 319–323, 2017.
[22] D. Bishop, “Fixed point package user’s guide,” available online: http://web.archive.org/web/20151124101646/http://www.vhdl.org/ fphdl/Fixed ug.pdf
accessed 23-November-2018.
[23] M. Lunglmayr, B. Hiptmair, and M. Huemer, “Scaled linearized bregman iterations for fixed point implementation,” in 2017 IEEE International
Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.
[24] H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, Fixed-Point Algorithms for Inverse Problems in Science and
Engineering. Springer Publishing Company, Incorporated, 2013.
