XBioSiP: A Methodology for Approximate Bio-Signal Processing at the Edge by Prabakaran, Bharath Srinivas et al.
XBioSiP: A Methodology for Approximate Bio-Signal
Processing at the Edge
Bharath Srinivas Prabakaran, Semeen Rehman, Muhammad Shafique
Vienna University of Technology (TU Wien), Vienna, Austria
{bharath.prabakaran,semeen.rehman,muhammad.shafique}@tuwien.ac.at
ABSTRACT
Bio-signals exhibit high redundancy, and the algorithms for
their processing are inherently error resilient. This property
can be leveraged to improve the energy-efficiency of IoT-Edge
(wearables) through the emerging trend of approximate com-
puting. This paper presents XBioSiP, a novel methodology for
approximate bio-signal processing that employs two quality
evaluation stages, during the pre-processing and bio-signal
processing stages, to determine the approximation parameters.
It thereby achieves high energy savings while satisfying the
user-determined quality constraint. Our methodology achieves
up to 19× and 22× reduction in the energy consumption of a
QRS peak detection algorithm for 0% and < 1% loss in peak
detection accuracy, respectively.
KEYWORDS
Approximate Computing, Arithmetic Units, Internet of Things,
Energy, Adders, Multipliers, Biosignal, ECG, Hardware Design,
Edge Computing.
1 INTRODUCTION
IoT-health sensor nodes [10], wearable patches like sweat sensor-
arrays [6], and electronic devices like Fitbit [3], Apple Watch [11],
portable monitors, etc. are used to collect data related to var-
ious physiological signals such as skin temperature, heart-
rate, pulse and respiration rate, electrocardiogram (ECG), elec-
tromyogram (EMG), and electroencephalogram (EEG). Due to
their small form-factor, these edge nodes are severely resource
and energy constrained, and might not be suitable to perform
highly compute-intensive operations. Furthermore, these kinds
of battery-operated sensor nodes are expected to have long
lifespans and perform real-time data processing, which con-
sumes a lot of energy. Therefore, reduced energy consumption
and extreme energy-efficiency are desirable objectives in IoT
edge devices.
The re-emerging approximate computing paradigm can be
seen as a possible solution for building highly energy-efficient
systems. Error-resilient applications have the ability to produce
acceptable output despite having incorrect or approximate
intermediate computations [15]. This is due to four major
factors: (i) noise and redundancy in the real-world datasets
being processed, (ii) error attenuating computational patterns
of applications, (iii) varying levels of perception by different
users, and (iv) non-existence of a unique golden output for
several use-cases [2]. Prior works have shown such resilience
for different applications like pattern recognition, machine
learning, big data mining, and image and video processing.
This resilience towards errors can be leveraged to introduce
approximations across the computing stack [20][21] at various
hardware [8][9][12][19] and software layers [1][4][13][14] to
achieve high efficiency gains in terms of area, power, latency,
and energy consumption.
Therefore, this paper aims at answering the following fun-
damental questions:
(1) If and how can approximate computing be employed to
significantly reduce the energy consumption in IoT edge
devices for bio-signal processing in healthcare applications?
(2) What should be the impact of such approximations on the
output of bio-signal processing applications, which are typ-
ically considered sensitive in nature?
Towards this, we make the following Novel Contributions:
• We study the error-resilience of bio-signal processing
applications using a motivational analysis on ECG pro-
cessing (see Section 2).
• We propose XBioSiP, a novel two-stage, quality-evaluation-based
approximation methodology that maximizes the
energy-efficiency of bio-signal processing applications
while satisfying user-defined quality constraints (see Sec-
tion 4).
• We propose a novel three-phase design generation methodology
that traverses the design space and generates possible
approximate processing units that offer high energy
reductions while satisfying the user-defined quality con-
straints (see Section 4.3).
• We evaluate the efficacy of our novel methodology using
an ECG processing application as a case study. We use
the Pan-Tompkins algorithm for QRS Peak detection as
the target application [17] (see Section 4.2).
We illustrate the area, latency, power and energy reductions
obtained by synthesizing the approximate processing unit using
an ASIC tool-flow. We achieve energy reductions of ~19.7×
with 0% loss in peak detection accuracy (see Section 6). The RTL
and behavioral models of these approximate adders and multipliers,
including a VHDL implementation of the key stages
present in the Pan-Tompkins algorithm, are released as an open-source
library at https://xbiosip.sourceforge.io/. This will help
researchers reproduce our work and enable further
research and development in this field.
2 MOTIVATIONAL ANALYSIS
In this section, first, we analyze the total energy consumption
of the bio-signal monitoring sensor nodes to study the poten-
tial for minimizing overall energy consumption. These sensor
nodes are responsible for performing three main functions,
arXiv:1902.02649v1 [eess.SP] 5 Feb 2019
Accepted for publication at the Design Automation Conference 2019 (DAC’19), Las Vegas, Nevada, USA
namely, (i) sensing and collecting the bio-signal data, (ii) pro-
cessing (targeted region) the real-world data, and (iii) communi-
cating the data to the next network layer for long-term storage.
Fig. 1 illustrates the sensing energy and total energy consump-
tion of five bio-signal monitoring nodes (adapted from studies
presented in [16]). As can be observed, the sensing energy is
at least six orders of magnitude less than the total energy con-
sumed by the device. Furthermore, previous studies have shown
that on-sensor processing energy constitutes 40%−60% of the
total energy consumed by the sensor nodes [18]. In this work,
we focus on minimizing the energy consumption of on-sensor
processing in order to extend the device lifetime. Next, we ana-
lyze the error-resilience of a target application and the energy
reductions obtained by leveraging the application’s inherent
error-resilience.
[Figure 1 plot. x-axis: Heart Rate, Oxygen Saturation, Temp., ECG, EEG. y-axis: Energy Consumed Per Day [J], logarithmic scale. Series: Total Energy, Sensing Energy. Annotation: Processing Energy is 40-60% of Total Energy Consumption [18].]
Figure 1: Energy Consumption of Five Bio-signal
Measuring Sensor Nodes (adapted from [16][18]).
Target Application: Physiological bio-signals like heart-
rate, blood pressure, skin temperature, ECG, EEG, EMG, etc.
are analog in nature and are converted to the digital domain
for storage and processing. These bio-signals are noisy and per-
meated with sparsity and redundancy. Furthermore, the digital
signal processing applications used to filter and extract viable
information from this data are inherently error resilient due to
their computational patterns that deploy filters and aggregators,
which attenuate errors. Hence, these signal-processing applica-
tions are amenable to approximations that can be exploited to
increase the energy efficiency of the system by approximating
the underlying arithmetic blocks like adders and multipliers
used in these applications. To understand this error-resilience
behavior of bio-signal processing systems and the potential of
approximate computing, we perform the following experimen-
tal study on the Pan-Tompkins algorithm [17].
Case Study: Low Pass Filter. First, we analyze the initial
stage of the Pan-Tompkins QRS peak detection algorithm (pre-
sented in Section 3) to quantify and evaluate its error resilience.
It employs a 10th order, 11-tap, Low Pass Filter (LPF) that com-
prises 10 adders, 11 multipliers, and 10 registers. In this work,
we focus on approximating the arithmetic operators (i.e., the
adders and multipliers) of the filter to reduce energy consumption.
In this experiment, we use the low-power approximate 1-bit
full-adder, Approximate Adder-5, proposed
by Gupta et al. [8][9] and the approximate elementary
2 × 2-multiplier module proposed by Rehman et al. [19], along
with their accurate counterparts, to construct higher bit-width
approximate adder and multiplier blocks. An overview of the
elementary adder and multiplier modules, along with their area
and power properties, is presented in Section 4.1.
The results of this experiment are illustrated in Fig. 2. On the
x-axis, we denote the number of output LSBs approximated in
the LPF. Note, the number of LSBs approximated decides which
of the computationally accurate 1-bit full-adder and elementary
2 × 2 multiplier modules are replaced with their approximate
counterparts. The y-axes denote the achieved reductions in
terms of area, latency, power, and energy compared to the
accurate case, as well as the peak detection accuracy and output
signal quality (in terms of structural similarity index: SSIM).
[Figure 2 plot. x-axis: LSBs Approximated (0 to 16). Left y-axis: Magnitude Reductions [×1] (Area/Power/Energy/Latency). Right y-axes: Structural Similarity Index (SSIM) and Peak Detection Accuracy [×10^2]. Series: Energy, Power, Area, Latency, SSIM, Peak Detection Accuracy. Annotation: Threshold for Error-Resilience.]
Figure 2: Error Resilience of the Low Pass Filter Stage.
From these experiments, we make the following key obser-
vations:
• Increasing the number of approximated LSBs decreases the
area, energy, power, and/or latency.
• The peak detection accuracy is consistently 100% for an increas-
ing number of LSBs approximated, exhibiting high tolerance
towards intermediate approximation errors.
• The error resilience threshold for this stage is 14 LSBs, after
which the peak detection accuracy falls to zero, which implies
that the application is significantly resilient to approximation
errors.
• The output signal quality, obtained after the second stage of
the application (used by physicians for health monitoring and
diagnosis), drastically decreases when approximating more
than 2 LSBs, as illustrated by the SSIM metric. However, if 50%
loss in signal quality can be tolerated, we can approximate
up to 10 LSBs while achieving ~4× energy reductions.
Similarly, the subsequent four stages of the algorithm are
error-resilient as well (see Section 4.2), which can be lever-
aged to further reduce energy consumption at each stage of
the bio-signal processing application. While previous works
have focused on deploying aggressive voltage scaling in memo-
ries [5] to achieve ~5× energy reductions, in this work, we focus
on deploying functional approximations in the processing ele-
ments, thereby limiting the maximum error. Before proceeding
to the main technical contributions of this paper, we provide
a brief overview of our target application (the Pan-Tompkins
algorithm for QRS Peak detection in ECG signals), to the level
of detail necessary for understanding our technique.
3 BACKGROUND: PAN-TOMPKINS
ALGORITHM
The QRS peak detection algorithm proposed by Jiapu Pan and
Willis Tompkins in 1985 [17] still serves as a basic standard for
QRS detection in a wide range of hospital or portable Holter
monitors and wearable electronic devices. The algorithm is
used to determine the number of heartbeats in the sampled
ECG signal by detecting the QRS complex present in each
correct heartbeat (cardiac arrhythmias produce incorrect ECG
waveforms). The analog ECG signal is sampled at a frequency
of 200 Hz, using a 16-bit ADC. The Pan-Tompkins algorithm is
composed of five key stages (see Fig. 3):
(A) Low Pass Filter: First, to eliminate high frequency noise
due to muscle movement and electrical interference, a low
pass filter is used to eliminate frequencies above 12 Hz.
(B) High Pass Filter: Next, to remove low frequency noise
components from the input, such as respiration and base-
line wander, and obtain signals within the desired range
of 5-12 Hz, a high pass filter with a cut-off frequency (fc )
of 5 Hz is used.
(C) Differentiator Stage: After the initial data pre-processing
and noise filtering, a five-tap digital differentiator is used
to determine the QRS complex slope information.
(D) Squarer Stage: The signal is then squared point-by-point,
which nonlinearly amplifies the output while emphasizing
the higher (ECG) frequencies and renders all data points
positive.
(E) Moving Window Integration: Finally, to extract the
waveform feature information and the slope of the R-wave
(see Fig. 3), a moving average filter is deployed.
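Stages (C) to (E) above can be sketched in a few lines of Python. This is a minimal illustration, not the released VHDL or MATLAB models. The differentiator uses coefficient magnitudes 1 and 2, as the text states later, but the exact tap ordering, the 1/8 scaling, and the 30-sample integration window (150 ms at 200 Hz) are assumptions taken from common formulations of the Pan-Tompkins algorithm. The function names are hypothetical.

```python
def differentiate(x):
    """Causal five-tap derivative (assumed form):
    y[n] = (x[n] + 2*x[n-1] - 2*x[n-3] - x[n-4]) / 8."""
    def tap(n, k):
        # Samples before the start of the record are treated as zero.
        return x[n - k] if n - k >= 0 else 0

    return [
        (tap(n, 0) + 2 * tap(n, 1) - 2 * tap(n, 3) - tap(n, 4)) / 8
        for n in range(len(x))
    ]


def square(x):
    """Point-by-point squaring: Y[i] = (X[i])^2, so every sample is non-negative."""
    return [v * v for v in x]


def moving_window_integrate(x, window=30):
    """Moving average over the last `window` samples (window length is an assumption)."""
    y, running = [], 0.0
    for n, v in enumerate(x):
        running += v
        if n >= window:
            running -= x[n - window]  # drop the sample that left the window
        y.append(running / window)
    return y


def feature_signal(filtered_ecg, window=30):
    """Chain stages (C), (D), and (E) on an already band-pass-filtered signal."""
    return moving_window_integrate(square(differentiate(filtered_ecg)), window)
```

A unit-slope ramp is a convenient sanity check: once the five-tap history is filled, the differentiator returns a constant slope of 1.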
In the next section, we describe XBioSiP, our novel methodology
for approximating bio-signal processing at the edge.
4 OUR XBIOSIP METHODOLOGY FOR
APPROXIMATE BIO-SIGNAL
PROCESSING
An overview of our methodology for approximating arithmetic
blocks in bio-signal processing algorithms is presented in Fig. 4.
It is composed of four key steps, starting with the design and
evaluation of the elementary (approximate) arithmetic mod-
ules. Next, we analyze the error-resilience of each stage in the
application by varying the number of LSBs approximated to
[Figure 4 flowchart. Blocks: Design & Evaluation of Elementary Approximate Adders and Multipliers (Approximate Adder Library, Approximate Multiplier Library; Energy-sort: AddList, MultList); Error Resilience Analysis of Application Stages S1, S2, ..., Sn (Energy vs. Quality: LSBList, EnergySavings); Approximations in Pre-processing (Design Generation Methodology, check "Quality > SignalConstraint?", output: Approximate Pre-Processing Unit); Approximations in Signal Processing (Design Generation Methodology, check "Quality > FinalConstraint?", output: Approximate Bio-signal Processor); Implementation & Energy Characterization of Designs.]
Figure 4: Overview of XBioSiP Methodology.
extract its energy-quality trade-off. Using the previous analyses,
we propose a design generation methodology that explores
a limited number of points in the design space to generate
approximate designs that offer maximum energy reductions
while satisfying the user-defined quality constraints.
Bio-signal processing algorithms are essentially composed
of two sections. The first section is data pre-processing, which
comprises techniques for signal reconstruction, noise filter-
ing, data transformation, reduction and compression, etc. The
second section is responsible for efficiently extracting the rel-
evant information from the pre-processed data. We utilize
the proposed design generation methodology in the data pre-
processing and the main bio-signal processing sections to gen-
erate approximate processing units. We propose to evaluate the
quality of output signals at two stages to ensure fine-grained
quality-control, and to provide a medical physician access to
the user’s accurate health history in case of emergencies. The
key difference between the two stages is the use of a differ-
ent quality metric for evaluating the constraint. The output
obtained from data pre-processing is generally in the form of a
signal whose quality can be estimated using metrics like PSNR
and/or SSIM, whereas the final metric depends on the output
of the bio-signal application targeted for approximation. In our
case, the Pan-Tompkins algorithm is used to estimate the num-
ber of QRS peaks in the signal because of which we consider
peak detection accuracy as the final metric.
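The intermediate signal-quality check can be illustrated with a small PSNR routine. The paper does not spell out its PSNR definition, so the sketch below assumes the common form 10·log10(peak² / MSE) and takes the peak as the largest absolute sample of the accurate reference signal. Both the function name and that choice of peak are assumptions.

```python
import math


def psnr(reference, approximate):
    """PSNR of an approximate signal against its accurate reference.

    Assumption: peak = max |reference sample|; the paper does not state
    which peak value it uses.
    """
    if len(reference) != len(approximate) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    mse = sum((r - a) ** 2 for r, a in zip(reference, approximate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    peak = max(abs(r) for r in reference)
    return 10 * math.log10(peak ** 2 / mse)
```

Under this definition, a design would pass the pre-processing check when `psnr(accurate_output, approximate_output)` is at least the user-defined signal constraint.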
4.1 Elementary Approximate Adders and
Multipliers
In this paper, we use the low-power approximate 1-bit full-
adders (FAs) proposed by Gupta et al. [8] [9] and the approxi-
mate 2 × 2 multiplier modules proposed by Kulkarni et al. [12]
[Figure 3 block diagram. Raw Signal (I/P), Low Pass Filter (fc = 12 Hz), High Pass Filter (fc = 5 Hz), Differentiator Stage (QRS slope), Squarer Stage (Y[i] = (X[i])^2), Moving Window Int., Adaptive Thresholding, Peak Detected.]
Figure 3: Overview of the Pan-Tompkins QRS Peak Detection Algorithm for ECG Bio-signals.
[Figure 5 schematics. 2 × 2 multipliers (inputs A(0), A(1), B(0), B(1); outputs Out(0) to Out(3)): AccMult, AppMultV1, AppMultV2. 1-bit full adders (inputs A, B, Cin; outputs Sum, Cout): AccAdd, ApproxAdd1, ApproxAdd2, ApproxAdd3, ApproxAdd4, ApproxAdd5.]
Figure 5: Approximate Multipliers and
Adders [8][9][12][19].
and Rehman et al. [19]. This library of elementary approximate
arithmetic modules, along with their accurate counterparts,
is illustrated in Fig. 5. These arithmetic modules are used to
construct the larger bit-width approximate adder and multi-
plier blocks. Fig. 6 presents a ripple-carry adder architecture
that is used to construct larger bit-width approximate adders
by replacing the accurate 1-bit FA modules with approximate
ones. We restrict ourselves to deploying approximations at the
LSBs to limit the error magnitude. Similarly, larger bit-width
multiplier blocks are recursively constructed from elementary
multipliers and adders, similar to the structure shown in Fig. 7.
For example, a 16 × 16 multiplier is recursively partitioned into
four smaller 8 × 8 multiplier blocks, whose outputs are accumulated
using three 32-bit adders. Similarly, each 8 × 8 multiplier
is sub-partitioned into four smaller 4 × 4 multiplier blocks, each
of which is further sub-partitioned into four elementary 2 × 2
multipliers.
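The recursive partitioning described above can be sketched behaviorally as follows. This is an illustrative model rather than the released library: the approximate cell shown returns 7 instead of 9 for 3 × 3, in the style of the underdesigned 2 × 2 multiplier of [12], and how it maps to AppMultV1 or AppMultV2 is not stated here. Python's `+` stands in for the accumulation adders, and all function names are hypothetical.

```python
def accurate_2x2(a, b):
    """Exact elementary 2x2 multiplier (operands in 0..3)."""
    return a * b


def underdesigned_2x2(a, b):
    """Stand-in approximate cell: 3 x 3 yields 7 instead of 9 (assumed behavior)."""
    return 7 if (a == 3 and b == 3) else a * b


def recursive_mul(a, b, n_bits=16, cell=accurate_2x2):
    """Multiply two n_bits-wide unsigned operands from elementary 2x2 cells.

    A x B = AH*BH * 2^N + (AH*BL + AL*BH) * 2^(N/2) + AL*BL
    """
    if n_bits == 2:
        return cell(a, b)
    half = n_bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    ll = recursive_mul(a_lo, b_lo, half, cell)
    hl = recursive_mul(a_hi, b_lo, half, cell)
    lh = recursive_mul(a_lo, b_hi, half, cell)
    hh = recursive_mul(a_hi, b_hi, half, cell)
    # Three additions accumulate the four partial products.
    return ll + ((hl + lh) << half) + (hh << n_bits)
```

With the accurate cell the function reproduces the exact product. Swapping in the approximate cell for every sub-block gives the most aggressive variant, whereas the methodology in this paper limits approximation to the blocks that feed the chosen output LSBs.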
[Figure 6 diagram. Ripple-carry chain from Cin to Cout with k approximate bit positions (XA cells, starting at inputs A0, B0 and output S0) and [N-k] accurate bit positions (FA cells).]
Figure 6: Approximate Larger Bit-Width Ripple Carry
Adder.
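The structure in Fig. 6 can be modeled in software as a ripple-carry loop that selects an approximate cell for the k least significant positions. The sketch below is an assumption-laden illustration: the approximate cell simply wires Sum = B and Cout = A, which is consistent with the zero area and power reported for ApproxAdd5 in Table 1, but the actual truth tables are defined in [8][9]. Function names are hypothetical.

```python
def accurate_fa(a, b, cin):
    """Exact 1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def wire_only_fa(a, b, cin):
    """Assumed gate-free approximate cell: Sum = B, Cout = A (carry-in ignored)."""
    return b, a


def ripple_carry_add(x, y, n_bits=32, k_approx=0, approx_fa=wire_only_fa):
    """Add two unsigned n_bits-wide values, approximating the k_approx LSBs."""
    carry, result = 0, 0
    for i in range(n_bits):
        a = (x >> i) & 1
        b = (y >> i) & 1
        fa = approx_fa if i < k_approx else accurate_fa
        s, carry = fa(a, b, carry)
        result |= s << i
    return result  # final carry-out is dropped, so the sum wraps modulo 2**n_bits
```

For example, `ripple_carry_add(175, 81, 32, 4)` returns 257 rather than the exact 256: the four approximate positions copy the low bits of the second operand and forward bit 3 of the first operand as the carry. With this cell the error stays below 2^(k+1), which is why restricting approximation to the LSBs bounds the error magnitude.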
To obtain the area, latency, power, and energy properties,
we synthesize these arithmetic modules using an ASIC tool-
flow (see Section 5). The corresponding results are presented
in Table 1. The elementary arithmetic modules are listed in
descending order of energy consumption, which is considered
[Figure 7 diagram. The N-bit operands A and B are each split into (N/2)-bit halves (AH, AL and BH, BL). The 16×16 (A×B) multiplier is built from four 8×8 blocks (AL×BL, AH×BL, AL×BH, AH×BH); each 8×8 block from four 4×4 blocks (L×L, H×L, L×H, H×H); each 4×4 block from four 2×2 multipliers.]
Figure 7: Larger Bit-Width Recursive Multiplier
Designs.
Table 1: Synthesis Results of Our Elementary
Approximate Adder and Multiplier Library.
1-bit full adders
Module       Area [µm²]   Delay [ns]   Power [µW]   Energy [fJ]
Accurate     10.08        0.18         2.27         0.409
ApproxAdd1    8.28        0.11         1.34         0.147
ApproxAdd2    3.96        0.08         0.61         0.049
ApproxAdd3    3.60        0.06         0.41         0.025
ApproxAdd4    3.24        0.06         0.33         0.020
ApproxAdd5    0.00        0.00         0.00         0.000

2 × 2 multipliers
Module       Area [µm²]   Delay [ns]   Power [µW]   Energy [fJ]
Accurate     14.40        0.16         1.80         0.288
AppMultV1    11.52        0.13         1.67         0.167
AppMultV2     9.72        0.06         1.37         0.137
while generating the approximate designs as our primary goal
is to maximize the energy reductions.
4.2 Error Resilience Analysis of Application
Stages
In this step, we quantify the error resilience of each applica-
tion stage by varying the number of LSBs approximated with
the least energy consuming arithmetic adder and multiplier
modules. We analyze the energy-quality trade-offs obtained
for each application stage, similar to the low-pass filter anal-
ysis presented in Section 2. Fig. 8 (a)-(d) illustrate the error
resilience and energy reductions of the remaining stages in
the application for a varying number of LSBs approximated.
We make the following key observations for each application
stage:
• Low Pass Filter: By applying approximations at this stage, ~5×
energy reductions with 0% loss in peak detection accuracy
can be achieved, when up to 14 LSBs are approximated, as
shown in Fig. 2. Approximating up to 8 LSBs can also achieve
energy reductions of ~3× while tolerating less than 50% loss
in output signal quality (SSIM).
[Figure 8 plots. Panels: (a) High Pass Filter, (b) Differentiator Stage, (c) Squarer Stage, (d) Moving Window Integration. x-axis: LSBs Approximated. Left y-axis: Magnitude Savings [×1] (Area/Power/Energy/Latency). Right y-axes: Structural Similarity Index and Peak Detection Accuracy. Series: Energy, Power, Area, Latency, SSIM, Peak Detection Accuracy. Annotations: Threshold for Error-Resilience, Extreme Error Tolerance.]
Figure 8: Error Resilience Analysis of Pan-Tompkins Application Stages.
• High Pass Filter: Due to the number of adders and multipliers
in this stage (31 and 32, respectively), approximating just 8
LSBs can achieve energy reductions of ~60×, with 0% loss
in peak detection accuracy, as shown in Fig. 8(a). However,
the output signal quality (SSIM) drastically decreases after
approximating merely 2 LSBs.
• Differentiator: The magnitude of filter-coefficients at this
stage is very small (2 and 1), and approximating more than
4 LSBs truncates all active paths, effectively connecting the
outputs to either the inputs or to logic ‘0’. Moreover, due to
the importance of this stage in extracting the QRS slope in-
formation, and the limited number of adders and multipliers,
applying approximations in this stage is ineffective and leads
to limited energy reductions, as shown in Fig. 8(b).
• Squarer: This stage only requires a large bit-width multiplier.
Hence, approximating just a few bits can lead to significant
degradation in output quality as illustrated in Fig. 8(c). There-
fore, the approximation potential for this stage is low.
• Moving Window Integration: The final stage is composed
solely of adder blocks and, as illustrated by Fig. 8(d), is ex-
tremely error-resilient, tolerating approximations of up to
16 LSBs, while achieving ~12× energy reductions.
4.3 Approximations in Data Pre-Processing
and Signal Processing
Using the initial analyses and energy reports available from the
earlier steps, we propose a novel design generation method-
ology that effectively explores the design space to generate
designs that offer high energy reductions while satisfying the
user-defined quality constraints. The pseudo-code of our design
generation methodology is presented in Algorithm 1. It
is used to generate potential approximate processing units for
the target application. All the information gathered from the
previous stages, i.e., the number of elementary adders and mul-
tipliers, maximum number of LSBs that can be approximated
at each stage, quality loss, energy reductions obtained for a
varying number of approximate LSBs, etc. are provided as an
input to the design generation methodology.
The first phase of the methodology starts with initializing
two empty arrays to store the designs that satisfy the quality
constraint. Afterwards, the application stages currently being
approximated are sorted in ascending order, based on the
maximum energy reductions obtained at each individual stage
(line 3). Next, we start evaluating possible approximations in
the first stage present in StageList, starting from the maximum
possible number of LSBs that can be approximated in the given
stage, using the least energy-consuming adder and multiplier
modules. We construct a behavioral model of the current stage
given the approximation parameters, LSB, Mult, and Add, to
evaluate the output quality of the current design. If the quality
constraint is satisfied by the current approximate design, we
store the current design as a potential approximate architecture
for the current application stage (lines 4−16).
Note, our main goal is to find an approximate design that
offers maximum energy reductions while satisfying the quality
constraint. This is achieved by evaluating a limited number of
points in the design space to reduce the exploration time.
Algorithm 1 Design Generation Methodology
Input: {EnergySavings, LSBList} ∀ Stages, StageList
Input: AddList, MultList
Constraints: QualConst.
Output: Stages[Architecture]
1: Stage1 = Array[];
2: Stage2 = Array[];
3: AscendingSort(StageList, EnergySavings);
4: Stage = StageList(1); ▷ Initialize First Phase
5: for LSB in Stage[LSBList] do
6:   for Mult in MultList do
7:     for Add in AddList do
8:       Design = Stage[Architecture, LSB, Mult, Add];
9:       OutputQuality = Evaluate(Design);
10:      if OutputQuality ≥ QualConst. then
11:        Stage1.append(Design);
12:        go to 17
13:      end if
14:    end for
15:  end for
16: end for ▷ Second Phase
17: for i ← 2 to size(StageList) do
18:   Stage = StageList(i);
19:   for LSB in Reverse(Stage[LSBList]) do
20:     for Mult in Reverse(MultList) do
21:       for Add in Reverse(AddList) do
22:         Design = Stage[Architecture, LSB, Mult, Add];
23:         OutputQuality = Evaluate(Design);
24:         if OutputQuality < QualConst. then
25:           go to 32
26:         else
27:           Stage2.append(Design);
28:         end if
29:       end for
30:     end for
31:   end for ▷ Third Phase
32:   while StageList(i − 1)[Architecture, LSB] ≥ 2 do
33:     for Mult in MultList do
34:       for Add in AddList do
35:         LSB1 = StageList(i − 1)[Architecture, LSB] − 2;
36:         LSB2 = StageList(i)[Architecture, LSB] + 2;
37:         Design1 = StageList(i − 1)[Architecture, LSB1, Mult, Add];
38:         Design2 = StageList(i)[Architecture, LSB2, Mult, Add];
39:         OutputQuality = Evaluate(Design1, Design2);
40:         if OutputQuality ≥ QualConst. then
41:           Stage1.append(Design1);
42:           Stage2.append(Design2);
43:         end if
44:       end for
45:     end for
46:   end while
47:   StageList(i)[Architecture] ← Best(Stage2, Energy);
48:   StageList(i − 1)[Architecture] ← Best(Stage1, Energy);
49:   Stage1 = Stage2;
50:   Stage2 = Array[];
51: end for
Therefore, our methodology does not evaluate all
points in search of a Pareto-optimal design.
The next phase starts by considering the next stage from
StageList, and iterates over the inverted lists of the approxima-
tion parameters, ranging from least-to-highest approximation.
Since the maximum error magnitude has already been achieved
in the previous stage, iterating over the default lists increases
the number of explored points and might effectively lead to the
same design. Similar to the first phase, the approximation pa-
rameters are used to construct a behavioral model of the current
stage in order to evaluate the output quality. However, since
we are traversing from the lower end of the approximation
spectrum, we only continue if we find any potential for further
maximizing energy reductions. That means, we store the design
only if the quality constraint is satisfied, or we break the loop
and move to the next phase of the methodology (lines 17−31).
Although we have identified the approximation parameters
for the two stages currently under consideration {i, (i − 1)},
there is a possibility that we have missed out on another de-
sign offering better energy reductions compared to the current
designs as we have not explored the higher end approxima-
tion spectrum of the second stage. We address this in the third
and final phase of the methodology by traversing diagonally
across the number of LSBs approximated, i.e., we reduce the
number of LSBs approximated by 2 for the (i − 1)th stage and
increase the number of LSBs approximated by 2 for the (i)th
stage. We reconstruct the behavioral models for these designs
and evaluate their output quality. If the quality constraint is
satisfied, the new designs are stored as a potential approximate
processing unit in the two arrays. This is repeated until the
number of LSBs approximated for the (i − 1)th stage becomes
zero. We then evaluate the energy reductions of each design
in the array, to determine the one which offers the maximum
energy reductions. This design is considered to be the best
approximate unit for the current stage. The second and third
phases are continuously repeated until all stages in StageList
are considered for approximation (lines 32−46).
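The three phases described above can be condensed into the following Python sketch. It is a simplified reading of Algorithm 1, not the authors' implementation. Designs are plain tuples, `evaluate` and `evaluate_pair` are hypothetical callbacks that would wrap the behavioral model and return a quality score, `lsb_list` is assumed to be ordered from most to fewest approximated LSBs, and the third phase varies only the LSB counts while holding the adder and multiplier choice fixed.

```python
def phase_one(lsb_list, mult_list, add_list, evaluate, qual_const):
    """Most-to-least aggressive search; return the first design that passes."""
    for lsb in lsb_list:
        for mult in mult_list:
            for add in add_list:
                design = (lsb, mult, add)
                if evaluate(design) >= qual_const:
                    return design
    return None  # no design in the explored space meets the constraint


def phase_two(lsb_list, mult_list, add_list, evaluate, qual_const):
    """Least-to-most aggressive search; stop at the first design that fails."""
    accepted = []
    for lsb in reversed(lsb_list):
        for mult in reversed(mult_list):
            for add in reversed(add_list):
                design = (lsb, mult, add)
                if evaluate(design) < qual_const:
                    return accepted
                accepted.append(design)
    return accepted


def phase_three(lsb_prev, lsb_cur, max_lsb, evaluate_pair, qual_const, step=2):
    """Diagonal walk: move `step` LSBs from stage i-1 to stage i per iteration
    and keep every (lsb_prev, lsb_cur) pair that still meets the constraint."""
    passing = []
    while lsb_prev >= step and lsb_cur + step <= max_lsb:
        lsb_prev -= step
        lsb_cur += step
        if evaluate_pair(lsb_prev, lsb_cur) >= qual_const:
            passing.append((lsb_prev, lsb_cur))
    return passing
```

The final selection step, choosing the stored design with the largest energy reduction, would then be a `max` over the accepted candidates using the energy figures gathered during the error-resilience analysis.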
5 EXPERIMENTAL SETUP
Fig. 9 presents an overview of our experimental tool-flow. The
RTL models (implemented in VHDL) of the different approxi-
mate adders (32-bit) and multipliers (16×16) along with the five
stages (FIR filters) in the Pan-Tompkins algorithm are synthe-
sized using the Synopsys Design Compiler ASIC tool-flow for a
65nm technology library. We generate detailed area, power, la-
tency, and energy reports for analyzing the resource utilization
of the proposed approximate processing units which are then
compared with respect to the accurate design. We also imple-
ment behavioral models of the elementary arithmetic modules
in MATLAB, so that they can be deployed in the target applica-
tion (Pan-Tompkins QRS Peak Detection) for quality evaluation
purposes. The output signal observed after the approximated
High Pass Filtering is compared with the accurate output to
quantify the quality loss, either in terms of PSNR or SSIM, to
ensure the output quality of the intermediate signal. The final
output of the target application needs to be considered while
approximating the arithmetic blocks present in the different
application stages. The number of peaks detected in the sample
duration, or the peak detection accuracy, is the final metric used
for quality evaluation in our ECG case study. We use the MIT-
BIH Normal Sinus Rhythm Database (NSRDB) made available
online via PhysioNetDB [7].
[Figure 9 tool-flow. Behavioral Modelling (MATLAB): Behavioral Models of the Approx. Adder Library and Approx. Multiplier Library, Target Application Evaluation, Test Cases from PhysioNetDB NSRDB. Hardware Synthesis: VHDL Application Processing Units, ModelSim, Synopsys Design Compiler, producing Area, Latency, and Power Reports. The two flows are linked by Cross-Validation.]
Figure 9: Overview of Our Experimental Setup.
6 RESULTS & DISCUSSION
In this section, we illustrate the effectiveness of XBioSiP by
applying it to the Pan-Tompkins QRS peak detection algorithm.
Before moving on to illustrating the benefits of our proposed
methodology, we first discuss the differences in output signal
quality when using accurate and approximate processing units
in the Pan-Tompkins Algorithm. Fig. 10 illustrates the differ-
ences in signal quality when processing the ECG data-points
using accurate and approximate arithmetic blocks with 4 LSBs
approximated at all five stages. Considering the accurate High
Pass Filtered signal as a reference, the approximated signal has
a PSNR of 19.24, with 100% peak detection accuracy for the
sample duration. This approximate design requires ~7× less
energy than its accurate counterpart, which significantly
increases the lifetime of battery-operated IoT edge
devices.
[Figure 10 plots. Panels: Raw ECG Signal; High Pass Filtered Signal (11 Peaks Detected); High Pass Filtered Signal (11 Peaks Detected).]
Figure 10: Differences in Output Quality Between
Accurate and Approximate Processing Units in
Pan-Tompkins.
6.1 Approximations in Data Pre-Processing
In this subsection, we evaluate the effectiveness of the
approximations deployed in the Low Pass and High Pass Filtering
stages of the Pan-Tompkins peak detection algorithm. For
simplicity, we currently restrict the design space of
adders and multipliers to ApproxAdd5 and AppMultV1,
respectively [9][19]. We also performed an exhaustive exploration of
Accepted for publication at the Design Automation Conference 2019 (DAC’19), Las Vegas, Nevada, USA
Table 2: PSNR and Energy Reductions of the Designs Obtained Using the Design Space Generation
Methodology for the Pan-Tompkins Data Pre-processing Stage.
Columns: HPF 0, HPF 2, HPF 4, HPF 6, HPF 8, HPF 10, HPF 12, HPF 14, HPF 16, each reporting PSNR and Energy.
Each row lists its populated (PSNR, Energy) cells in column order.
Row      PSNR    Energy
LPF 0    12.12   11.78
LPF 2    11.24    8.58
LPF 4    15.50   23.98
LPF 6    15.48   33.44
LPF 8    15.00   34.98
LPF 10   14.98   27.75
LPF 12   14.22    9.24
LPF 14   17.70    1.03
LPF 14   17.30    5.63
LPF 14   11.96    8.84
LPF 16   12.12    1.04
Phase labels: Phase-I, Phase-II, Phase-III
all 9 × 9 = 81 possible combinations. The corresponding results
are presented in Table 2. Each simulation processes an ECG
recording obtained from the MIT-BIH Normal Sinus
Rhythm Database (NSRDB) [7]. Filtering and processing an ECG
recording of 20,000 samples takes around 300 seconds.
Therefore, an exhaustive exploration of all 81 scenarios
takes roughly seven hours. In contrast, our design generation
methodology generates and evaluates only
11 designs (in ~1 hour), of which five satisfy the quality constraint.
We consider a PSNR value of 15 as the user-defined quality
constraint for the signal filtering stage. Among the designs that
satisfy this constraint, we choose the one that offers the maximum
energy reduction of ~35×, as shown in Table 2.
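The selection step and the exploration-time estimates above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the (PSNR, energy reduction) pairs are copied from Table 2, the row labels with a letter suffix are introduced here to distinguish the three cells of the LPF 14 row, and the constant names are hypothetical.

```python
# (label, PSNR, energy reduction) for the 11 generated designs in Table 2.
# The "a/b/c" suffixes are illustrative labels for the three LPF 14 cells.
designs = [
    ("LPF 0", 12.12, 11.78),
    ("LPF 2", 11.24, 8.58),
    ("LPF 4", 15.50, 23.98),
    ("LPF 6", 15.48, 33.44),
    ("LPF 8", 15.00, 34.98),
    ("LPF 10", 14.98, 27.75),
    ("LPF 12", 14.22, 9.24),
    ("LPF 14a", 17.70, 1.03),
    ("LPF 14b", 17.30, 5.63),
    ("LPF 14c", 11.96, 8.84),
    ("LPF 16", 12.12, 1.04),
]

PSNR_CONSTRAINT = 15.0   # user-defined quality constraint from the text
SECONDS_PER_RUN = 300    # approximate time to process one 20,000-sample recording


def exploration_hours(num_designs, seconds_per_run=SECONDS_PER_RUN):
    """Estimated wall-clock time to simulate num_designs designs, in hours."""
    return num_designs * seconds_per_run / 3600


# Keep the designs that meet the constraint, then maximize the energy reduction.
feasible = [d for d in designs if d[1] >= PSNR_CONSTRAINT]
best = max(feasible, key=lambda d: d[2])

print(len(feasible), "designs satisfy the constraint; best:", best)
print("Exhaustive (81 designs):", exploration_hours(81), "hours")
print("Generated (11 designs):", exploration_hours(len(designs)), "hours")
```

Under these numbers the exhaustive search comes to 6.75 hours and the 11 generated designs to roughly 55 minutes, consistent with the "roughly seven hours" and "~1 hour" figures quoted above.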
Next, we evaluate the effectiveness of our approach
(Algorithm 1) in reducing the execution time of the design space
search by deploying it on the data pre-processing stage of our
target application (LPF and HPF). The exhaustive search
varies all three major parameters, namely, (i) the number
of LSBs approximated (0 to 16), (ii) the adder variant
(AccAdd to AppAdd5), and (iii) the multiplier variant
(AccMult to AppMultV2), to explore the entire design space
and identify the design offering maximum energy
savings while satisfying the signal quality constraint. To reduce
the overall size of the design space, we impose thresholds
and limitations on all the major parameters, such as using
the same elementary approximate adder and multiplier module
throughout the entire design; we refer to this baseline as the heuristic.
[Figure 11 plot: Exploration Duration [Hrs.] for the Heuristic and Algorithm 1; Estimated Exploration Time of the Exhaustive search, shown as Exhaustive Exploration Duration [Yrs.] on a Logarithmic Scale (1 to 1E+220). Annotations: "On Average, 23.6× Reductions in Execution Time Compared to the Heuristic"; "29×".]
Figure 11: Exploration Time Analysis of Algorithm 1.
This also includes restricting the number of approximated LSBs to
multiples of 2. Fig. 11 compares the exploration time of our
proposed approach with the exhaustive and heuristic
baselines. As can be observed, our approach reduces the
exploration time by ~23.6×, on average, compared to the
heuristic baseline.
6.2 Approximations in Signal Processing
Considering 0% quality loss during the data pre-processing
stage, we evaluate the output quality by deploying
approximations in the signal processing stages. We restrict the design
space by limiting the number of approximable LSBs to 4, 8, and
16 for the differentiator, squarer, and moving average stages,
respectively. As in the previous stage, for simplicity,
we restrict the elementary approximate adders
and multipliers to ApproxAdd5 and AppMultV1. We exhaustively
evaluated all 135 possible combinations to extract the energy
reduction and accuracy of each design (in ~11 hours).
Our proposed design generation methodology is equally
applicable in this case to reduce the exploration time and generate
design points that offer high energy reductions. We obtain two
Pareto-optimal points from the design space by extracting the
Pareto-frontier. Similarly, we extract four Pareto-optimal
designs for the data pre-processing stage. The output quality and
energy results of these Pareto-optimal designs are presented
in Fig. 12. We also perform a design space exploration of the
designs obtained from the data pre-processing and signal
processing stages to analyze the percentage loss in output quality
[Figure 12 plot: Peak Detection Accuracy [%] and Energy Reduction [×] per Hardware Configuration (A1, A2, B1 to B14), with a 95% Quality Threshold; A1 annotated at 10^-7.]
Number of approximated LSBs per stage for each hardware configuration:
Config  LPF  HPF  DER  SQR  SWI
A1      Raspberry Pi B+ (ARMv8)
A2      Zero LSBs Approximated
B1      10   8    0    0    0
B2      10   12   0    0    0
B3      12   8    0    0    0
B4      12   12   0    0    0
B5      0    0    2    8    16
B6      0    0    4    8    16
B7      10   8    2    8    16
B8      10   8    4    8    16
B9      10   12   2    8    16
B10     10   12   4    8    16
B11     12   8    2    8    16
B12     12   8    4    8    16
B13     12   12   2    8    16
B14     12   12   4    8    16
Figure 12: Energy-Quality Evaluation of the
Approximate Designs Proposed for the Pan-Tompkins
Algorithm.
[Figure 13 panels: Accurate Peak Detection and Approximate Peak Detection (B10), each showing QRS on ECG Signal, QRS on HPF Signal, and QRS on MWI Signal. Legend: Actual QRS Peak; Misclassified QRS Peak. Annotations: "Approximation errors cause a new peak before the actual QRS complex"; "Misalignment of peaks between HPF and MWI"; "Peak Omitted (Heartbeat Missed)".]
Figure 13: Heartbeat Misclassification Analysis of the Approximate Processing Unit (B10).
and the obtained energy reductions of an end-to-end system. The
results of this design space exploration are presented in Fig. 12.
The energy reductions are reported with respect to
the accurate hardware design with zero approximations
(denoted by A2). We also evaluate the energy consumption of the
application by executing it on the Raspberry Pi 3 B+ (ARM v8)
with HDMI and WiFi switched off (denoted by A1). The energy
consumption of A1 is ~7 orders of magnitude higher than
that of A2. Design B9 reduces the energy consumption
by ~19.7× while detecting all the peaks present in
the database. Design B10 reduces the energy consumption by
~22× while tolerating a loss of less than 1% in peak detection accuracy.
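The design-space sizes in this subsection can be checked with a short enumeration, and the Pareto-frontier extraction can be sketched generically. The code below is an illustration under stated assumptions, not the authors' tooling: it assumes that LSBs are varied in steps of 2 (which reproduces the 135 combinations), that each run takes about 300 seconds as in the pre-processing experiments, and that a design is Pareto-optimal when no other design is at least as good in both accuracy and energy reduction. The points passed to `pareto_front` are made-up values used only to exercise the function.

```python
from itertools import product

LSB_STEP = 2            # assumption: LSBs are explored in multiples of 2
SECONDS_PER_RUN = 300   # assumption: same per-run time as the filtering stage

# Upper bounds on approximable LSBs for the signal processing stages.
limits = {"DER": 4, "SQR": 8, "SWI": 16}
options = [range(0, limit + 1, LSB_STEP) for limit in limits.values()]
signal_space = list(product(*options))   # 3 * 5 * 9 combinations
signal_hours = len(signal_space) * SECONDS_PER_RUN / 3600


def pareto_front(points):
    """Return the (accuracy, energy_reduction) points not dominated by another point."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical points, for illustration only (not results from the paper).
toy_front = pareto_front([(100.0, 5.0), (99.0, 9.0), (99.0, 7.0), (90.0, 8.0)])

# End-to-end candidates: every pre-processing Pareto design (B1 to B4, as
# LPF/HPF LSBs) paired with every signal processing Pareto design (B5 and B6,
# as DER/SQR/SWI LSBs), matching the configurations listed with Fig. 12.
pre_processing = [(10, 8), (10, 12), (12, 8), (12, 12)]
signal_processing = [(2, 8, 16), (4, 8, 16)]
end_to_end = {
    f"B{7 + i}": pre + sig
    for i, (pre, sig) in enumerate(product(pre_processing, signal_processing))
}

print(len(signal_space), "signal processing designs,", signal_hours, "hours")
print("Toy Pareto front:", toy_front)
for name, lsbs in end_to_end.items():
    print(name, lsbs)
```

Pairing four pre-processing designs with two signal processing designs yields the eight end-to-end configurations B7 to B14, so only eight additional evaluations are needed instead of a joint search over both stages.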
We also analyze the output signal of design B10 and compare
it with the output of A2 to understand why less than 1% of the
heartbeats were missed. Fig. 13 illustrates the differences in the
output signals of the accurate (A2) and approximate (B10)
processing units. The errors introduced by the approximate
arithmetic blocks create a new peak before the actual QRS
complex, which the algorithm misclassifies as a peak. Because the
resulting misalignment of peaks between the HPF and MWI signals
is larger than a preset threshold, the detected
peak is omitted as a classification error, and the heartbeat
is missed.
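The omission mechanism described above can be illustrated with a simplified alignment check. This is a minimal sketch and not the decision logic of the Pan-Tompkins algorithm or of the authors' implementation, which also involves adaptive thresholds: the function name `accept_peaks`, the peak positions, and the `max_offset` value are all hypothetical.

```python
def accept_peaks(hpf_peaks, mwi_peaks, max_offset):
    """Split MWI peak positions into (accepted, omitted) lists.

    A peak is accepted only if some HPF peak lies within max_offset samples.
    """
    accepted, omitted = [], []
    for m in mwi_peaks:
        nearest = min(hpf_peaks, key=lambda h: abs(h - m), default=None)
        if nearest is not None and abs(nearest - m) <= max_offset:
            accepted.append(m)
        else:
            omitted.append(m)
    return accepted, omitted


# Hypothetical sample positions of detected peaks.
mwi_peaks = [110, 310, 510]
accurate_hpf_peaks = [100, 300, 500]
# Approximation error produces an early peak (450) instead of the one at 500.
approximate_hpf_peaks = [100, 300, 450]

print(accept_peaks(accurate_hpf_peaks, mwi_peaks, max_offset=20))
print(accept_peaks(approximate_hpf_peaks, mwi_peaks, max_offset=20))
```

In the approximate case the third peak pair is 60 samples apart, which exceeds the assumed offset of 20 samples, so that peak is omitted and the corresponding heartbeat would be counted as missed.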
7 CONCLUSION
We presented a novel approximate bio-signal processing
methodology, XBioSiP, for achieving energy reductions in
energy-constrained, sensory IoT edge devices and wearable
electronics. After compiling a library of elementary approximate
modules, we evaluate the error resilience of all application
stages to determine the upper bound of the approximation
parameters. We then use our proposed design generation
methodology to develop approximate stages of the target application
that satisfy the quality constraint. We propose to evaluate the
quality constraint twice, after an intermediate stage (data
pre-processing) as well as after the final stage (signal processing),
to ensure fine-grained quality control of the intermediate signal. We
evaluate the effectiveness of XBioSiP using an ECG processing
application, the Pan-Tompkins algorithm. We reduce
the energy consumption by ~19.7× for
0% loss in peak detection accuracy, and by ~22× for less than
1% quality loss. Furthermore, to enable further research and
development in this field, we have open-sourced the behavioral
and RTL implementations of our approximate modules
at https://xbiosip.sourceforge.io/. In the future, we plan to
extend our work to include diagnostic techniques and algorithms
across multiple bio-signal processing domains, such as ECG-based
arrhythmia detection and EEG-based seizure prediction.
ACKNOWLEDGEMENTS
The authors would like to thank Florian Kriebel for his feedback
and detailed technical comments, which helped us improve the
quality of this work.
REFERENCES
[1] Woongki Baek and Trishul M Chilimbi. 2010. Green: a framework for sup-
porting energy-conscious programming using controlled approximation.
In ACM Sigplan Notices, Vol. 45. ACM, 198–209.
[2] Vinay K Chippa et al. 2013. Analysis and characterization of inherent
application resilience for approximate computing. In DAC. ACM.
[3] Keith M Diaz et al. 2015. Fitbit®: an accurate and reliable device for
wireless physical activity tracking. International journal of cardiology 185
(2015), 138–140.
[4] Hadi Esmaeilzadeh et al. 2012. Architecture support for disciplined approx-
imate programming. In ACM SIGPLAN Notices, Vol. 47. ACM, 301–312.
[5] D. Bortolotti et al. 2014. Approximate compressed sensing: ultra-low
power biosignal processing via aggressive voltage scaling on a hybrid
memory multi-core processor. In Proceedings of the 2014 ISLPED. ACM,
45–50.
[6] Wei Gao et al. 2016. Fully integrated wearable sensor arrays for multi-
plexed in situ perspiration analysis. Nature 529, 7587 (2016), 509.
[7] Ary L Goldberger et al. 2000. PhysioBank, PhysioToolkit, and PhysioNet:
components of a new research resource for complex physiologic signals.
Circulation 101, 23 (2000), e215–e220.
[8] Vaibhav Gupta et al. 2011. IMPACT: imprecise adders for low-power
approximate computing. In ISLPED. IEEE Press, 409–414.
[9] Vaibhav Gupta et al. 2013. Low-power digital signal processing using
approximate adders. IEEE TCAD 32, 1 (2013), 124–137.
[10] Robert SH Istepanian et al. 2011. The potential of Internet of m-health
Things “m-IoT” for non-invasive glucose level sensing. In EMBC. IEEE.
[11] Alaa Khushhal et al. 2017. Validity and reliability of the Apple Watch for
measuring heart rate during exercise. Sports Medicine International Open
1, 06 (2017), E206–E211.
[12] Parag Kulkarni et al. 2011. Trading accuracy for power with an under-
designed multiplier architecture. In VLSI Design (VLSI Design), 2011 24th
International Conference on. IEEE, 346–351.
[13] Michael A Laurenzano et al. 2016. Input responsiveness: using canary
inputs to dynamically steer approximation. ACM SIGPLAN Notices 51, 6
(2016), 161–176.
[14] Asit K Mishra et al. 2014. iACT: A software-hardware framework for
understanding the scope of approximate computing. In Workshop on
Approximate Computing Across the System Stack (WACAS).
[15] Sparsh Mittal. 2016. A survey of techniques for approximate computing.
ACM Computing Surveys (CSUR) 48, 4 (2016), 62.
[16] Arsalan Mohsen Nia, Mehran Mozaffari-Kermani, Susmita Sur-Kolay,
Anand Raghunathan, and Niraj K Jha. 2015. Energy-efficient long-term
continuous personal health monitoring. IEEE Transactions on Multi-Scale
Computing Systems 1, 2 (2015), 85–98.
[17] Jiapu Pan and Willis J Tompkins. 1985. A real-time QRS detection algo-
rithm. IEEE Trans. Biomed. Eng 32, 3 (1985), 230–236.
[18] Tifenn Rault. 2015. Energy-efficiency in wireless sensor networks. (Économie
d’énergie dans les réseaux de capteurs sans fil). Ph.D. Dissertation.
University of Technology of Compiègne, France.
https://tel.archives-ouvertes.fr/tel-01470489
[19] Semeen Rehman et al. 2016. Architectural-space exploration of approxi-
mate multipliers. In ICCAD. ACM, 80.
[20] Muhammad Shafique et al. 2016. Cross-layer approximate computing:
From logic to architectures. In DAC. ACM, 99.
[21] Swagath Venkataramani et al. 2015. Approximate computing and the
quest for computing efficiency. In DAC. ACM, 120.
