Synthesizing Compact Hardware for Accelerating Inference from Physical
  Signals in Sensors by Tsoutsouras, Vasileios et al.
Vasileios Tsoutsouras1, Max Vigdorchik2, and Phillip Stanley-Marbell3
1,2,3Department of Engineering, University of Cambridge, Cambridge CB3 0FA, UK.
This manuscript was compiled on February 5, 2020
We present dimensional circuit synthesis, a new method for gener-
ating digital logic circuits that improve the efficiency of training and
inference of machine learning models from sensor data. The hard-
ware accelerators that the method generates are compact enough
(a few thousand gates) to allow integration within low-cost miniatur-
ized sensor integrated circuits, right next to the sensor transducer.
The method takes as input a description of physical properties of rel-
evant signals in the sensor transduction process and generates as
output a Verilog register transfer level (RTL) description for a circuit
that computes low-level features that exploit the units of measure of
the signals in the system.
We implement dimensional circuit synthesis as a backend to the
compiler for Newton, a language for describing physical systems.
We evaluate the backend implementation, and the hardware it
generates, on descriptions of 7 physical systems. The results show
that our implementation of dimensional circuit synthesis generates
circuits of as few as 1662 logic cells / 1239 gates for the systems
we evaluate.
We synthesize the designs generated by the dimensional cir-
cuit synthesis compilation backend for a low-power miniature FPGA
targeted by its manufacturer at sensor interface applications. The
circuits which the method generated use as little as 27% of the re-
sources of the 2.15×2.5 mm FPGA. We measure the power dissipa-
tion of the FPGA’s isolated core supply rail and show that, driven
with a pseudorandom signal input stream, the synthesized designs
use as little as 1.0 mW and no more than 5.8 mW. These results show
the feasibility of integrating physics-inspired machine learning meth-
ods within low-cost miniaturized sensor integrated circuits, right
next to the sensor transducer.
Sensors | In-Sensor Machine Learning | RTL | FPGAs | Compilers.
1. Introduction
Sensor integrated circuits are at the forefront of the data
pipeline feeding the recent revolution in machine learning sys-
tems. Sensors transduce a physical signal such as acceleration,
temperature, or light, into a voltage which is then converted
by analog-to-digital converters (ADCs) into a numeric
representation, for input to computation. Digital preprocessing
within sensor integrated circuits, or the software that consumes
their output, then applies appropriate calibration constants and
scaling to convert these digitized voltages into a scaled and
dimensionally meaningful representation of the signal (e.g.,
acceleration in m/s²).
Figure 1 shows how contemporary sensor-driven computing
systems move the digitized data at the output of signal
conversion circuits through many transmission and storage steps
before the data are used in training a model or in driving an
inference, typically on a server far removed from the sensing
process. This data movement costs time and energy. When
ever-greater volumes of data will enable new applications and
inference models, it will be valuable to perform the necessary
computations as close to the signal acquisition and transduction
process as possible: ideally, in the sensor integrated circuit
itself (labeled “➊” in Figure 1).
[Figure 1: block diagram. A sensor (sensor transducer and ADC)
connects over I2C/SPI to a sensor hub, microcontroller, DSP, or
microprocessor running in situ signal processing, training, or
inference algorithms, which in turn sends data to a server (EC2,
Lambda) for training or inference. The stages carry the labels ➊,
➋, and ➌.]
Fig. 1. Existing sensing systems typically send data to servers for training models
and generating inferences. Moving data both within a system and over networks
adds latency and costs energy.
However, since these sensor integrated circuits are typically
required to be low cost (often under 10 USD), have small die area
(often less than 4 mm²), and use minimal power (typically under
1 mW), it is challenging to integrate even the most efficient
and compact traditional learning and inference methods into
these devices themselves.
A. Physics constrains signals from sensors. The values
taken on by data from sensors are constrained by the laws
of physics and by the dynamics of the structures to which
sensors are attached. Most physical laws and the governing
equations for most system dynamics take the form of sums
of product terms with each product term comprising powers
of the system’s variables (1–3). Because there are a bounded
number of irreducible relative powers in these product terms
whose units of measure result in meaningful units for the whole
expression, it is common in many engineering disciplines to
use information on units of measure (dimensional analysis)
to derive candidate relations for experimentally-observed phe-
nomena (3, 4). Recent work (5) has used this observation to
prune the hypothesis set of functions considered during ma-
chine learning to achieve significant improvements in both
training and inference, improving training latency by 8660×
and reducing the arithmetic operations in inference over 34×.
In this work, we build on these results to develop a new
backend for the Newton compiler (6). The backend generates
register transfer level (RTL) hardware designs for accelerating
the execution of the required pre-inference parts of analytic
models relating the signals in a multi-sensor system according
to the specifications of the physics of a system. Because our
method uses information from dimensional analysis and units
of measure to synthesize hardware to accelerate inference from
sensors, we call the method dimensional circuit synthesis.
PSM and VT conceived the idea. VT performed the compiler RTL-generation backend
implementation. VT and MV performed the experimental evaluation. All authors
contributed to the writing.
1To whom correspondence should be addressed. E-mail: vt298@cam.ac.uk
Tsoutsouras et al.
arXiv:2002.01241v1 [cs.AR] 4 Feb 2020
1 include "NewtonBaseSignals.nt"
2
3 v0 : constant = 0 (meter*second**-1);
4
5 UAVglider: invariant(
6 h: distance,
7 v: speed,
8 m: mass) =
9 {
10 h ~ {v, m, v0, kNewtonUnithave_AccelerationDueToGravity}
11 }
Fig. 2. Newton language specification, for a sensor-instrumented unpowered glider.
The specification only states the physical signals/quantities relevant to the system,
their units of measure, and the fact that they are related to each other and to a
constant kNewtonUnithave_AccelerationDueToGravity (line 10).
B. Dimensional circuit synthesis: physics-derived
pre-inference processing in sensors. Dimensional circuit
synthesis is a compile-time method to generate digital logic cir-
cuits for performing pre-inference processing on sensor signals.
Dimensional circuit synthesis takes as its input a specification
of the signals that can be obtained from the sensors in a system
and their units of measure. Using these specifications, dimen-
sional circuit synthesis generates hardware to compute a set
of physically plausible expressions relating the signals in the
system. The logic circuits which the method generates repre-
sent sets of monomial expressions which form dimensionless
groupings of sensor signals (i.e., products whose units cancel
out). The synthesized hardware takes as input digital represen-
tations of sensor readings and generates the computed value
of the dimensionless expressions as its output. A machine
learning training or inference process then uses these dimen-
sionless products as its inputs and prior work has shown that
this preprocessing can significantly improve both the latency
and accuracy of inference. An on-device (in-sensor) inference
engine will integrate the generated RTL that performs pre-
processing with either custom RTL or a programmable core
implementing the inference using, e.g., a neural network. We
evaluate the generated RTL on the Lattice iCE40, a state-of-
the-art, ultra-miniature FPGA that meets the size and power
consumption constraints of in-sensor processing.
C. Contributions. This article makes two main contributions
to on-device and in-sensor inference:
• We present dimensional circuit synthesis (Section 2), a
new method to generate RTL hardware for pre-processing
sensor data prior to inference, thereby improving latency
and reducing overhead.
• We evaluate the generated RTL on the Lattice iCE40 ultra-
miniature FPGA (Section 3) and show that the generated
RTL is fast enough to allow real-time processing, while
consuming minimal power.
[Figure 3: block diagram of the in-sensor inference hardware. Sensor
signals #1 to #k feed the RTL hardware blocks for Π1, Π2, ..., ΠN;
the resulting products Π1 ... ΠN feed a predictive model (e.g.,
neural net) that produces the inference output.]
Fig. 3. The hardware generated by dimensional circuit synthesis preprocesses k
sensor signals to obtain N < k dimensionless products Π1 ... ΠN. A predictive
model takes these products as input and generates an inference output.
2. Background and Methodology
Dimensional circuit synthesis takes as input descriptions of the
units of measure of the sensor signals in a system. Figure 2
shows an example Newton description for an unpowered UAV
(i.e., a glider). Let a physical system for which we want to
construct an efficient predictive model have k symbols corre-
sponding to physical constants or sensor signals. From the
Buckingham Π-theorem (3), we can form N ≤ k dimensionless
products, Π1 ... ΠN, and these dimensionless products are
the roots of some function Φ, where
    Φ(Π1, Π2, ..., Πi, ..., ΠN) = 0.    [1]
Wang et al. (5) use Equation 1 as the basis for generating a
preprocessing step of offline training and inference of mod-
els of physical systems and propose an automated framework
for generating these dimensionless products. In a subsequent
calibration step, they learn a model for the function Φ and
demonstrate that learning Φ from the Π1 . . .ΠN can be both
significantly more efficient and more accurate than learning a
function from the original k sensor signals directly. The dimen-
sionless products Π1 . . .ΠN are essential to both training and
inference and to achieving the orders-of-magnitude speedup.
In this work, we present a method for generating hardware
to efficiently compute these Πs. Doing so close to the sensor
transducer also reduces the data sensing systems must transmit
from the sensor transducer to either a sensor hub, microcon-
troller, or other component performing on-device training and
inference, potentially improving system efficiency and perfor-
mance. Figure 3 shows how the generated hardware for Π
computation fits within an on-device inference system.
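To make the construction of the Π products concrete, the following is a minimal Python sketch of the textbook approach: each dimensionless product is an exponent vector in the null space of the system's dimensional matrix. It is not the Newton compiler's implementation. The function name `nullspace`, the short symbol names, and the choice of length, mass, and time as base dimensions are assumptions made for illustration; the symbols themselves follow the glider description of Figure 2.

```python
from fractions import Fraction

def nullspace(matrix):
    """Return a basis of the rational null space of `matrix` (a list of rows)."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0])
    pivots = []  # pivot column of each reduced row, in row order
    r = 0
    for c in range(n_cols):
        # Find a row at or below r with a non-zero entry in column c.
        p = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        pivot = rows[r][c]
        rows[r] = [x / pivot for x in rows[r]]
        # Eliminate column c from every other row (reduced row echelon form).
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    # Each free (non-pivot) column yields one null-space basis vector.
    basis = []
    for fc in (c for c in range(n_cols) if c not in pivots):
        vec = [Fraction(0)] * n_cols
        vec[fc] = Fraction(1)
        for row_idx, pc in enumerate(pivots):
            vec[pc] = -rows[row_idx][fc]
        basis.append(vec)
    return basis

# Columns: h (distance), v (speed), m (mass), v0 (speed), g (acceleration).
symbols = ["h", "v", "m", "v0", "g"]
dim_matrix = [
    [1,  1, 0,  1,  1],   # exponent of length in each symbol
    [0,  0, 1,  0,  0],   # exponent of mass
    [0, -1, 0, -1, -2],   # exponent of time
]
products = nullspace(dim_matrix)  # one exponent vector per Π product
```

For this example the sketch yields two exponent vectors, corresponding to v0/v and h·g/v². Mass does not appear in either, since no other symbol carries a mass dimension to cancel it.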
A. Dimensional Circuit Synthesis. Figure 3 shows how
hardware blocks generated by dimensional circuit synthesis
calculate the values of the Π products. The inputs to these modules
are the sensor signals corresponding to the physical parameters
specified as the input to the dimensional circuit synthesis
analysis in the Newton specification language (see, e.g., Figure 2).
The calculated Π product values correspond to the output of
the pre-processing step of the inference function and they feed
into any existing method for classification or regression. This
final step could be a programmable low-power core such as
the 32-bit RISC cores now integrated into some state-of-the-art
sensor integrated circuits or a low-power machine learning
accelerator such as Marlann (7), implemented in either RTL or
in a miniature FPGA like we use in our evaluation in Section 3.
[Figure 4: flow diagram with four steps. ➊ User description:
physical system description and target physical parameter.
➋ Compilation: Newton compiler with dimensional analysis and the
dimensional circuit synthesis backend (this work), producing
uncalibrated dimensional functions Φ and Π calculation RTL modules.
➌ Inference engine calibration and synthesis: offline dimensional
functions calibration (training). ➍ Run-time in-sensor inference at
edge: in-sensor inference hardware comprising sensor data, Π RTL
modules (FPGA), and a CPU core (e.g., RISC-V).]
Fig. 4. Proposed dimensional circuit synthesis framework. In Step ➊, the user provides the specification and target inference parameter of the physical system under examination.
The Newton compiler, including our backend, is executed in Step ➋. In Step ➌, the framework translates the generated RTL modules to an FPGA bitstream and,
in parallel, we manually calibrate the dimensional functions (5) (box with dashed border). In Step ➍, the generated software and hardware modules are downloaded to the
in-sensor inference engine.
A.1. Approach and implementation. Figure 4 shows the four
steps which make up our implementation of dimensional circuit
synthesis. We implemented these steps as a new backend of the
Newton compiler (6), but the techniques in Figure 4 could in
principle be applied to any specification for physical systems
that contains information on units of measure.
In Step ➊, a user of dimensional circuit synthesis creates a
Newton language description such as that in Figure 2, specifying
the physical signals that describe the target physical system
and from which a machine learning model will eventually be
trained.
Next, in Step ➋, the user invokes the Newton compiler with
our new dimensional circuit synthesis backend activated. Because
the method of constructing dimensionless groups can
result in multiple dimensionless products (Πi in Equation 1),
the user specifies which of the physical signals in the input
physical system description will be the target variable of a
machine learning model for the function Φ from Equation 1.
Our new dimensional circuit synthesis backend identifies the
group of dimensionless products where the target parameter
appears in only one of the dimensionless products. The outputs
of Step ➋ are: (i) a function Φ, defined in terms of the
dimensionless products Πi, but whose form has not yet been fully
defined; (ii) RTL descriptions for hardware to compute the
dimensionless products Πi, including RTL descriptions of the
functional units (multipliers and dividers) that will perform the
arithmetic operations of the dimensionless product monomials.
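One way to obtain a group in which the target parameter appears in only one product is sketched below in Python. It is an illustrative assumption, not the backend's documented algorithm. Each product is represented by its vector of exponents; multiplying or dividing dimensionless products (adding or subtracting their exponent vectors) gives another dimensionless product, so the target can be eliminated from all but one of them. The function name `isolate_target` and the example vectors are hypothetical.

```python
from fractions import Fraction

def isolate_target(products, target):
    """Recombine exponent vectors so that column `target` is non-zero in exactly one."""
    products = [[Fraction(e) for e in p] for p in products]
    pivot_idx = next((i for i, p in enumerate(products) if p[target] != 0), None)
    if pivot_idx is None:
        raise ValueError("target does not appear in any dimensionless product")
    pivot = products[pivot_idx]
    result = []
    for i, p in enumerate(products):
        if i == pivot_idx or p[target] == 0:
            result.append(p)
        else:
            # Divide out the right power of the pivot product to cancel the target.
            factor = p[target] / pivot[target]
            result.append([a - factor * b for a, b in zip(p, pivot)])
    return result

# Columns: h, v, m, v0, g. Both input products contain the target h (column 0):
# h*g/v**2 and h*g/(v*v0).
group = isolate_target([[1, -2, 0, 0, 1], [1, -1, 0, -1, 1]], target=0)
```

In this example the second product becomes v/v0, leaving h in the first product only.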
Because floating-point operations can be expensive in both
resources and execution latency on energy-constrained on-device
training and inference systems, we use a signed fixed-point
approximate real number representation (8) for the signals in the
dimensionless products computed in the synthesized hardware.
Each real number is represented by 32 bits, using 1 bit for the
sign, 16 bits for the integer part and 15 bits for the fractional
part (i.e., a Q16.15 fixed-point representation). This choice
leads to fast and lightweight multiplication and division units,
at the cost of giving up the ability to use an arbitrary-precision
floating-point representation. The compiler backend is fully
parametric with respect to the length of the fixed-point
representation as well as the precision of the fractional part and
can generate hardware with arbitrary fixed-point representation
sizes. This will allow future designs to tailor the precision of
the compute modules to the requirements of the inference
algorithms (9).
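The arithmetic that such multiplier and divider units perform can be modeled in a few lines of Python. The sketch below is a software model under stated assumptions, not the generated Verilog. It uses a two's-complement 32-bit word with 15 fractional bits; the saturation on overflow and the rounding toward negative infinity are assumptions, since the text does not specify either. All function names are hypothetical.

```python
FRAC_BITS = 15   # fractional bits, from the Q16.15 format described above
WORD_BITS = 32   # 1 sign bit + 16 integer bits + 15 fractional bits
RAW_MAX = (1 << (WORD_BITS - 1)) - 1
RAW_MIN = -(1 << (WORD_BITS - 1))

def saturate(raw):
    """Clamp to the 32-bit signed range (assumed overflow behavior)."""
    return max(RAW_MIN, min(RAW_MAX, raw))

def to_fixed(x):
    """Convert a Python float to a raw Q16.15 integer."""
    return saturate(round(x * (1 << FRAC_BITS)))

def to_float(raw):
    """Convert a raw Q16.15 integer back to a Python float."""
    return raw / (1 << FRAC_BITS)

def fx_mul(a, b):
    """Multiply two Q16.15 values: full-width product, then drop 15 bits."""
    return saturate((a * b) >> FRAC_BITS)

def fx_div(a, b):
    """Divide two Q16.15 values: pre-shift the dividend to keep 15 fractional bits."""
    if b == 0:
        raise ZeroDivisionError("fixed-point division by zero")
    return saturate((a << FRAC_BITS) // b)

# Evaluate the monomial h*g/v**2 serially, one operation at a time.
h, g, v = to_fixed(10.0), to_fixed(9.8125), to_fixed(5.0)
pi_raw = fx_div(fx_mul(h, g), fx_mul(v, v))
pi_value = to_float(pi_raw)   # close to the real-valued result 3.925
```

Changing `FRAC_BITS` and `WORD_BITS` models other fixed-point formats, mirroring the parametric word length and fractional precision described above.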
In Step ➌, we can train the uncalibrated dimensional function
offline on values of the dimensionless groups Πi computed
offline, as done in prior work (5).
Finally, in Step ➍, the outputs of the hardware blocks computing
the dimensionless products feed the models trained
offline to generate inferences. Alternatively, a system could
also use the values of the dimensionless products to feed in situ
training of models implemented in a processor core or in a
hardware neural network accelerator (7).
3. Experimental Evaluation
We evaluated the hardware generated by the dimensional cir-
cuit synthesis backend using a Lattice Semiconductor iCE40
FPGA. The iCE40 is a low-power miniature FPGA in a minia-
ture wafer-scale 2.15 mm×2.50 mm WLCSP package and is
targeted at sensor interfacing tasks and at on-device machine
learning. We used the fully open-source FPGA design flow,
comprising the Yosys (10) synthesis tool (version 0.8+456) for
synthesis and nextpnr (10) (version git sha1 5344bc3) for
placing, routing, and timing analysis.
We performed our measurements on an iCE40 Mobile Development
Kit (MDK), which includes a 1 Ω sense resistor in
series with each of the supply rails of the FPGA (core, PLL,
I/O banks). We measured the current drawn by the FPGA core
by measuring the voltage drop across the sense resistor of the
FPGA core supply rail (1.2 V) using a Keithley DMM7510, a
laboratory-grade 7½-digit multimeter that can measure voltages
down to 10 nV, and from this we computed the power dissipated by
the FPGA core for each configured RTL design. We used a
pseudorandom number generator to feed the Π computation
circuit modules under evaluation with random input data.
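The conversion from a measured sense-resistor voltage to core power follows from Ohm's law, as the short Python sketch below shows. The function name and the example reading are hypothetical. The sketch also assumes that the nominal 1.2 V rail voltage is the voltage at the FPGA core, ignoring the small drop across the sense resistor itself.

```python
def core_power_watts(v_sense, r_sense=1.0, v_rail=1.2):
    """Power drawn by the core: rail voltage times current through the sense resistor."""
    current = v_sense / r_sense   # Ohm's law, in amperes
    return v_rail * current       # in watts

# Hypothetical reading: 1.667 mV across the 1 Ω resistor gives about 2.0 mW.
example_power = core_power_watts(1.667e-3)
```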
A. Results. We evaluated dimensional circuit synthesis on
seven different physical systems described in the Newton
specification language. Table 1 provides a brief description of the
inputs as well as a summary of the measurement results. The
table also includes the target parameter for each respective
execution of the Newton compiler. For example, for the physical
description of the pendulum the target parameter was its
oscillation period, while for the physical description of a fluid
in a pipe the target parameter was the velocity of the fluid. The
value of this parameter is inferred at run-time by the machine
learning model that is fed with the output of the Π computation.
The results in Table 1 show the FPGA resource utilization, as
well as the resource utilization when mapped to CMOS gates, of
each generated Π computation module, including the fixed-point
arithmetic modules that implement the required arithmetic
operations.
Table 1. Experimental evaluation on iCE40 FPGA of dimensional circuit modules generated from the description of 7 physical
systems.
Name                  | Description                                     | Target parameter  | LUT4 cells | Gate count | Maximum frequency | Execution latency | Avg. power at 12 MHz | Avg. power at 6 MHz
Beam                  | Cantilevered beam model, excluding mass of beam | Beam deflection   | 2958       | 2590       | 16.88 MHz         | 115 cycles        | 3.5 mW               | 1.8 mW
Pendulum, static      | Simple pendulum excluding dynamics and friction | Osc. period       | 1402       | 1239       | 17.07 MHz         | 115 cycles        | 2.0 mW               | 1.1 mW
Fluid in Pipe         | Pressure drop of a fluid through a pipe         | Fluid velocity    | 4258       | 3752       | 15.65 MHz         | 188 cycles        | 5.8 mW               | 3.0 mW
Unpowered flight      | Unpowered flight (e.g., catapulted drone)       | Position (height) | 1930       | 1865       | 16.44 MHz         | 81 cycles         | 2.3 mW               | 1.2 mW
Vibrating string      | Vibrating string                                | Osc. frequency    | 2183       | 1787       | 16.67 MHz         | 183 cycles        | 2.5 mW               | 1.3 mW
Warm vibrating string | Vibrating string with temperature dependence    | Osc. frequency    | 3137       | 2718       | 16.77 MHz         | 269 cycles        | 1.9 mW               | 1.0 mW
Spring-mass system    | Vertical spring with attached mass              | Spring constant   | 1419       | 1240       | 16.67 MHz         | 115 cycles        | 3.4 mW               | 1.8 mW
The execution latency column lists the cycles required for
completing the calculation of each of the generated RTL modules.
We obtained the number of cycles by simulating the
execution of the RTL modules for pseudorandom inputs generated
by an LFSR. In each RTL module, the calculation of
different Π products is parallelized but the required operations
per Π product are executed serially. As a result, designs with
larger resource usage in the table, such as the hardware for the
unpowered flight model, can finish sooner than smaller designs
such as the static pendulum. All modules require fewer than 300
cycles. As a result, for both 6 MHz and 12 MHz clocks, the generated
hardware can handle sample rates of over 10k samples/second,
permitting real-time operation.
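The sample-rate claim can be checked directly from the latencies in Table 1, as in the Python sketch below. The sketch assumes that a module accepts a new sample only after the previous computation has finished (no pipelining), so that the sustainable sample rate is the clock frequency divided by the cycle count. The variable and function names are illustrative.

```python
# Execution latency in cycles, taken from Table 1.
latency_cycles = {
    "Beam": 115,
    "Pendulum, static": 115,
    "Fluid in Pipe": 188,
    "Unpowered flight": 81,
    "Vibrating string": 183,
    "Warm vibrating string": 269,
    "Spring-mass system": 115,
}

def max_sample_rate(clock_hz, cycles):
    """Samples per second if each sample occupies the module for `cycles` cycles."""
    return clock_hz / cycles

rates = {
    clock_hz: {name: max_sample_rate(clock_hz, c) for name, c in latency_cycles.items()}
    for clock_hz in (6_000_000, 12_000_000)
}

# The slowest case is the 269-cycle design on the 6 MHz clock.
worst_case_rate = min(rates[6_000_000].values())
```

Under this assumption even the slowest design at 6 MHz sustains roughly 22 thousand samples per second, comfortably above the 10k samples/second figure quoted above.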
The last two columns of Table 1 show the measured power
dissipation of each design running in the iCE40 FPGA at 12 MHz
and 6 MHz. In all cases, the power dissipation is less than 6 mW
and as low as 1 mW, demonstrating the suitability of our method for
small-form-factor, battery-operated on-device inference at the edge.
4. Insights and Conclusions
Dimensional circuit synthesis is a new method for generating
digital logic circuits that improve the efficiency of training
and inference of machine learning models from sensor data.
The method complements prior work on dimensional function
synthesis, a new method for learning models from sensor data
that enables orders of magnitude improvements in training and
inference on physics-constrained signal data. Dimensional
circuit synthesis, which we present in this paper, implements
preprocessing steps required by dimensional function synthe-
sis in hardware. This article presented the principle behind
the methods and the design and implementation of a compiler
backend that implements dimensional circuit synthesis for the
iCE40, a low-power miniature FPGA in a miniature wafer-
scale 2.15 mm×2.50 mm WLCSP package which is targeted
at sensor interfacing tasks and at on-device machine learn-
ing. The hardware accelerators that the method generates are
compact (fewer than four thousand gates for all the examples
investigated) and low power (dissipating less than 6 mW even
on a non-optimum FPGA). These results show, for the first
time, that it could be feasible to integrate physics-inspired ma-
chine learning methods within low-cost miniaturized sensor
integrated circuits, right next to the sensor transducer.
1. Feynman R (1967) The character of physical law. 1965. Cox and Wyman Ltd., London.
2. Feynman RP, Leighton RB, Sands M (1965) The Feynman lectures on physics; vol. I. American
Journal of Physics 33(9):750–752.
3. Buckingham E (1914) On physically similar systems; Illustrations of the use of dimensional
equations. Physical Review 4(4):345–376.
4. Mahajan S (2010) Street-fighting mathematics: the art of educated guessing and opportunis-
tic problem solving. (MIT Press).
5. Wang Y, Willis S, Tsoutsouras V, Stanley-Marbell P (2019) Deriving Equations from Sensor
Data Using Dimensional Function Synthesis. ACM Trans. Embed. Comput. Syst. 18(5s).
6. Lim J, Stanley-Marbell P (2018) Newton: A language for describing physics. arXiv preprint
arXiv:1811.04626.
7. Symbiotic EDA (2019) MARLANN—A simple FPGA Machine Learning Accelerator.
https://github.com/SymbioticEDA/MARLANN.
8. Parhami B (2000) Computer arithmetic: Algorithms and hardware designs. (Oxford University
Press).
9. Micikevicius P, et al. (2017) Mixed precision training. arXiv preprint arXiv:1710.03740.
10. Shah D, et al. (2019) Yosys+nextpnr: an open source framework from Verilog to bitstream
for commercial FPGAs in 2019 IEEE 27th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM). (IEEE), pp. 1–4.
