LogicNets: Co-Designed Neural Networks and Circuits for
  Extreme-Throughput Applications by Umuroglu, Yaman et al.
Yaman Umuroglu*, Yash Akhauri*, Nicholas J. Fraser and Michaela Blott
Xilinx Research Labs
Dublin, Ireland
Abstract—Deployment of deep neural networks for applications
that require very high throughput or extremely low latency
is a severe computational challenge, further exacerbated by
inefficiencies in mapping the computation to hardware. We
present a novel method for designing neural network topologies
that directly map to a highly efficient FPGA implementation. By
exploiting the equivalence of artificial neurons with quantized
inputs/outputs and truth tables, we can train quantized neural
networks that can be directly converted to a netlist of truth tables,
and subsequently deployed as a highly pipelinable, massively
parallel FPGA circuit. However, the neural network topology
requires careful consideration since the hardware cost of truth
tables grows exponentially with neuron fan-in. To obtain smaller
networks where the whole netlist can be placed-and-routed
onto a single FPGA, we derive a fan-in driven hardware cost
model to guide topology design, and combine high sparsity with
low-bit activation quantization to limit the neuron fan-in. We
evaluate our approach on two tasks with very high intrinsic
throughput requirements in high-energy physics and network
intrusion detection. We show that the combination of sparsity and
low-bit activation quantization results in high-speed circuits with
small logic depth and low LUT cost, demonstrating competitive
accuracy with less than 15 ns of inference latency and throughput
in the hundreds of millions of inferences per second.
I. INTRODUCTION
Deep Neural Networks (DNNs) have a wide application
scope beyond the highly-popular computer vision use cases,
promising to replace manual algorithmic implementations
in many application domains. Certain applications, such as
data collection from particle physics experiments [1], line-
rate filtering of packets for network intrusion detection [2]
and wireless communications [3], have stringent real-time
requirements and extremely high data-rates. Replacing parts
of such extreme-throughput applications with machine-learned
components requires highly specialized DNN implementations
that offer inference rates reaching hundreds of millions of
samples per second with sub-microsecond latency. This is the
key challenge we try to address in this work: how can we build
DNN implementations that are able to meet the performance
and latency constraints for extreme-throughput applications?
Popular hardware platform options for accelerating DNN
inference include GPGPUs, FPGA overlays and specialized
tensor processors [4]. Most of these alternatives apply the
traditional paradigm in computer design where hardware and
software are designed separately, bridged by a compiler that
generates instructions to schedule the required computation
onto available hardware. However, this flexibility typically
comes at the cost of performance overheads, making it difficult
to apply this approach to extreme-throughput applications.
Prior work [1], [5]–[7] demonstrated that specialized co-design
approaches are able to produce FPGA DNN implementations
that yield increased throughput, while still offering the ability
to reconfigure to address changing requirements.
*Equal contribution
In this paper, we present a novel method named LogicNets
for co-designing DNN topologies that map directly to an
efficient FPGA implementation for extreme-throughput applications.
Our scheme is based on the observation that artificial
neurons with quantized inputs and outputs can be converted to
truth tables. However, an efficient FPGA implementation for a
truth table is generally only possible when the number of inputs
is small. By limiting neuron fan-in using activation quantization
and sparsity, we show how we can design DNN topologies
that are still trainable using standard backpropagation, and
can be mapped directly to an equivalent hardware circuit with
small combinatorial depth. DNNs designed and trained in this
manner result in fast and efficient FPGA implementations that
can fulfill the performance requirements for extreme-throughput
applications. Extending on our abstract in [8], this paper makes
the following contributions:
• We describe LogicNets, a DNN-hardware co-design
methodology that allows trained quantized networks to
be directly converted to an equivalent hardware netlist of
truth tables.
• We exploit sparsity and quantization to reduce the neuron
fan-in, and provide an analytical cost model to quickly
estimate the required FPGA resources to guide topology
design.
• We develop a PyTorch library to train sparse, quantized
topologies and convert them to Verilog netlists.
• We empirically evaluate our approach on two extreme-
throughput tasks, demonstrating FPGA implementations
with competitive accuracy and throughput in the hundreds
of millions of inferences per second.
II. BACKGROUND
In this section, we briefly describe prior work relating to
DNN inference acceleration on FPGAs, and schemes previously
used to construct quantized and sparsely-connected DNNs.
arXiv:2004.03021v1 [eess.SP] 6 Apr 2020
(a) LUTNet-style [2], [7] with weights in LUT equations. An explicit
accumulation and activation datapath is still present.
(b) LogicNets and NullaNet [14] with all operations packed into LUTs.
Fig. 1. Two architectural alternatives for specialized FPGA inference.
A. Sparse and Quantized Neural Networks
In a sparse neural network, each layer of neurons receives
inputs from only a few connections of the previous layer of
activations, in contrast to dense networks where all previous
layer activations are inputs to each neuron of the next layer.
Numerous techniques to build sparse DNN topologies exist,
including learned sparsity [9], pruning techniques [10] and a
priori fixed sparsity [11] which we utilize in this work due to
its relative simplicity. Quantization involves restricting weights,
activations or both to a set of discrete values. To preserve
DNN accuracy with low-bit quantization (e.g. ≤ 4 bits) it is
typically necessary to use specialized techniques during training,
such as using the Straight-Through Estimator (STE) [12] to
propagate gradients through non-differentiable quantizers and
learned scale factors to reduce approximation error. We refer
the reader to the survey by Guo et al. [13] for further details.
B. Prior Work on FPGA DNN Inference
DNN inference typically consists of applying a sequence
of multiply-accumulates (MACs), followed by a nonlinear
activation function. Many alternatives exist when mapping
these computations to the FPGA fabric and a large body of
prior work exists on FPGA DNN inference; here, we only cover
the works closely related to ours and refer the reader to a recent
survey by Zhao et al. [15] for further reading. We organize our
discussion of prior work according to the presence or absence
of explicit weight storage, MAC and activation datapaths in
the proposed architecture, as illustrated in Figure 1.
Weights in LUT equations. Figure 1a shows architectures
with network weights “baked into” LUT equations, but
parts of the MAC and activation datapaths still exist. This
enables greater performance with fully unrolled (non-time-
multiplexed) implementations without a control path, but is less
flexible. Wang et al. [7] introduce LUTNet, a LUT-optimized
FPGA inference scheme which achieves high LUT density.
[7] take a pruned version of ReBNet [16] in which some of
the XNOR-popcount operations are mapped more effectively
to k-input LUTs, although explicit popcount and thresholding
datapaths are still present and occupy significant resources.
Murovic et al. [2] implement binarized networks which have
been fully unrolled and implemented directly into LUTs of
a small FPGA, although explicit accumulation and activation
datapaths are still present. This unrolling allows the synthesis
optimization tool to potentially simplify significant portions
of the compute logic. Duarte et al. [1] present a package
called hls4ml which generates high-level synthesis-based FPGA
designs and supports two axes of folding, but can also
generate fully-unrolled designs. hls4ml [1] operates at higher
bit-widths (8 or 16-bit) and maps significant portions of the
compute to the DSP blocks.
Fig. 2. A 6:1 NEQ with batch normalization that maps to a 6:1 LUT.
No explicit datapath. Figure 1b shows the architecture
proposed in this work, where all operations and weights are
packed into a truth table and no explicit MAC or activation
datapath is present. Nazemi et al. [14] introduce NullaNet,
which proposes converting activation-quantized neural networks
into large truth tables in a similar fashion to this work. Their
stated goal is reducing the number of memory accesses, whereas
LogicNets aims to co-design DNNs that can yield FPGA
circuits that offer extremely high throughput and low latency.
Key differences between NullaNet and this work are as follows:
• NullaNet only considers densely connected networks and
suffers from high fan-in; we use sparse topologies to avoid
this problem (Section III-C).
• NullaNet uses a lossy truth table sampling method to
overcome the fan-in problem, which gives an approximation
of the DNN, whereas we use a lossless method
(Section III-A).
• NullaNet only considers binary quantization of activations;
we consider several low-precision variants, which is key
to achieving higher accuracy (Section V-B).
III. BUILDING DNNS FROM SMALL TRUTH TABLES
In this section, we explain the core concepts that form
the foundations for LogicNets. As the performance demands
for extreme-throughput applications are so high, the available
hardware must be utilized to its full potential. We aim to
answer the following question: given that the building blocks
of FPGA hardware are small truth tables that can implement
any function, how can we design DNN topologies that map
well to these building blocks?
A. Equivalence of Neurons and Truth Tables
The foundation of LogicNets is the equivalence of artificial
neurons with quantized inputs/outputs and truth tables. Consider
an artificial neuron with Cin different inputs, where each input
is β bits wide, and let the neuron produce a single β-bit
output. Let X be the total number of input bits or fan-in, i.e.
$X = \sum_{i=1}^{C_{in}} \beta = C_{in} \cdot \beta$, and let Y = β be
the total number of output bits.
Regardless of the internal complexity of the neuron, we can
always implement its functionality with an X-input, Y-output
(denoted X : Y) truth table by enumerating all of its possible
$2^X$ different inputs, observing the output, and recording it into
the truth table.
Fig. 3. Example sparse network and 6:1 LUT cost calculation.
[Figure: a dataset feeds three layers of 12:2 NEQs (Layer 0: 64 NEQs,
Layer 1: 32 NEQs, Layer 2: 32 NEQs), with 6 × 2-bit connections per
neuron. Each NEQ takes 6 × 2-bit inputs, uses floating point weights
and produces 1 × 2-bit output. One 12:2 NEQ = 170 6:1 LUTs, so the
layers cost 64 × 170 = 10.8k, 32 × 170 = 5.4k and 32 × 170 = 5.4k
6:1 LUTs respectively.]
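The enumeration procedure of Section III-A can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_neuron`, its weights and the clamp-and-round quantizer are hypothetical stand-ins, not the NEQ definition used in LogicNets.

```python
from itertools import product


def quantize(value, bits):
    """Clamp and round a real value to an unsigned integer code of `bits` bits."""
    levels = (1 << bits) - 1
    return max(0, min(levels, round(value)))


def toy_neuron(inputs, weights, bias, out_bits):
    """Hypothetical neuron: weighted sum followed by unsigned quantization."""
    acc = sum(w * x for w, x in zip(weights, inputs)) + bias
    return quantize(acc, out_bits)


def neuron_to_truth_table(neuron, c_in, beta):
    """Enumerate all 2^X input combinations (X = c_in * beta) and record the outputs."""
    table = {}
    for inputs in product(range(1 << beta), repeat=c_in):
        # Pack the c_in beta-bit inputs into one X-bit address (first input is most significant).
        address = 0
        for x in inputs:
            address = (address << beta) | x
        table[address] = neuron(inputs)
    return table


# Assumed example: 3 inputs of 2 bits each, so X = 6 and Y = 2.
weights, bias = [0.5, -0.25, 0.75], 0.1
table = neuron_to_truth_table(
    lambda xs: toy_neuron(xs, weights, bias, out_bits=2), c_in=3, beta=2
)
print(len(table))  # one entry per input combination
```

The table depends only on the input and output bit counts, so the same routine would apply unchanged if the neuron also contained batch normalization or any other internal component.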
In LogicNets, we refer to the Verilog implementation of an
X : Y truth table as a Hardware Building Block (HBB), and
any trained artificial neuron that can be converted into an HBB
as a Neuron Equivalent (NEQ). Since the internal complexity
of NEQs does not matter as long as the number of input and
output bits is fixed, we can add components to the NEQ that make
the DNN training process easier. Figure 2 illustrates
this for a 6:1 NEQ with floating point weights and batch
normalization, with six FP32 values for the weights and four
for the batch normalization parameters, requiring 10 ·32 = 320
bits of storage. The equivalent HBB only requires 64 bits to
store the truth table, which is 5× smaller.
B. The FPGA LUT Cost of Large Truth Tables
To implement NEQs that have either more inputs or more
bits per input, we need HBBs with larger X values, which can
be implemented by combinations of smaller FPGA LUTs. For
instance, the outputs of two 6:1 LUTs can be fed into a third 6:1
LUT acting as a multiplexer, implementing a 7:1 LUT using
three 6:1 LUTs. For higher output bitwidth Y the cost scales
linearly, if each additional output bit is produced by adding
a copy of the table. Generalizing this, we can analytically
estimate the number of 6:1 FPGA LUTs required to implement
an X : Y HBB using Equation 1.
$$\mathrm{LUTCost}(X, Y) = \frac{Y}{3} \cdot \left(2^{X-4} - (-1)^{X}\right) \qquad (1)$$
However, scaling up X in this manner is expensive: the number
of LUTs needed grows exponentially¹ with X. For instance,
implementing a single 32 : 1 NEQ would require close to a
hundred million 6:1 LUTs, which is much larger than even the
largest FPGAs available today. Thus, it is critical to keep the
fan-in X small enough so that each NEQ in the topology has
a reasonably small LUT cost.
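To make the exponential trend concrete, Equation 1 can be evaluated for a few fan-in values. The sketch below is a direct transcription of the formula; the function name `lut_cost` and the guard for X < 4 are choices made here for illustration.

```python
def lut_cost(x, y):
    """Equation 1: estimated number of 6:1 LUTs for an X:Y truth table."""
    if x < 4:
        raise ValueError("this sketch assumes X >= 4 so the exponent is non-negative")
    # 2^(X-4) - (-1)^X is always divisible by 3, so integer division is exact.
    return y * (2 ** (x - 4) - (-1) ** x) // 3


for fan_in in (6, 7, 12, 14, 32):
    print(fan_in, lut_cost(fan_in, 1))
# 6 -> 1, 7 -> 3, 12 -> 85, 14 -> 341, 32 -> 89478485
```

The 7:1 case reproduces the three-LUT multiplexer construction described above, and the 32:1 case gives roughly 89 million LUTs, in line with the "close to a hundred million" figure.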
¹ Although special FPGA capabilities such as F7MUX/F8MUX and heuristic
logic minimization decrease the LUT cost, the exponential trend remains.
C. Designing Topologies with Restricted Fan-In
Modern DNN topologies do not necessarily restrict neuron
fan-in, which makes them impractical for direct mapping to
HBBs. In LogicNets, we propose to co-design DNN topologies
with awareness of the FPGA hardware implementation cost by
restricting the fan-in of each artificial neuron. Recalling that
the fan-in is computed as $X = \sum_{i=1}^{C_{in}} \beta$, we observe
that we have two key topological parameters to control it:
• Number of inputs Cin: A DNN may have hundreds
of neurons per layer, and connecting a neuron’s inputs
to every neuron in the previous layer will result
in an intractably large fan-in. We take inspiration from
the developments in sparse topologies (Section II-A),
connecting each neuron to only γ neurons in the previous
layer, with γ much smaller than the previous layer size.
• Bitwidth of inputs β: The bitwidth of the activations
from the previous layer also has a large impact on fan-in
X . We apply training-time techniques developed in prior
work (Section II-A) to quantize the activations to ≤ 4-bits,
which reduces the fan-in substantially.
With these fan-in restrictions in place, we can explore
topologies with different number of neurons and layers while
avoiding the exponential growth in LUT cost, using the
cost model to guide the exploration. To compute the total
FPGA resource cost for a network, we can simply sum
the resources taken up by each NEQ in the topology. We
provide an example for the network illustrated in Figure 3.
Here, each neuron output is quantized to β = 2 bits, and
is connected to γ = 6 outputs from the previous layer as
its input. Thus, X = 6 · 2 = 12 and Y = 2. According to
Equation 1, the cost for a 12-bit-input, 2-bit-output NEQ is
$\mathrm{LUTCost}(12, 2) = \frac{2}{3} \cdot (2^{12-4} - (-1)^{12}) = 170$ 6:1 LUTs.
Since there are 128 such NEQs in total, the estimated cost for
this network is 128 · 170 = 21760 6:1 LUTs.
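The per-network estimate above is a sum of per-neuron costs, which can be sketched as follows. The helper `network_lut_cost` is a hypothetical name and assumes every layer shares the same β and γ, as in the Figure 3 example.

```python
def lut_cost(x, y):
    """Equation 1: estimated number of 6:1 LUTs for an X:Y truth table (X >= 4)."""
    return y * (2 ** (x - 4) - (-1) ** x) // 3


def network_lut_cost(neurons_per_layer, beta, gamma):
    """Sum the NEQ costs of a network where every neuron has fan-in X = beta * gamma."""
    x = beta * gamma  # total input bits per neuron
    y = beta          # output bits per neuron
    return sum(n * lut_cost(x, y) for n in neurons_per_layer)


# Figure 3 example: layers of 64, 32 and 32 neurons, beta = 2, gamma = 6.
print(lut_cost(12, 2))                                   # 170
print(network_lut_cost([64, 32, 32], beta=2, gamma=6))   # 21760
```

Because the estimate needs only the layer sizes, β and γ, it can be computed before any training takes place.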
IV. THE LOGICNETS DESIGN FLOW AND IMPLEMENTATION
Having explored the foundational ideas in Section III, we
now present LogicNets as a three-step design flow to train
DNNs that map directly and efficiently to FPGAs:
1) Define Hardware Building Blocks (HBB) and Neuron
Equivalents (NEQ)
2) Define and train a DNN of NEQs in PyTorch then convert
to netlist of HBBs
3) Postprocess the netlist, synthesize to obtain a bitfile
Figure 4 illustrates the steps in the LogicNets design flow.
We have implemented a prototype of this design flow with
a PyTorch library in order to enable faster design space
exploration. In the following sections, we describe each step
in greater detail.
A. Define HBBs and NEQs
As described in Section III-A, NEQs and HBBs constitute
the building blocks of LogicNets on the PyTorch and Verilog
side, respectively. The first step in the design flow is to identify
the range of X : Y values that yield HBBs with reasonable
LUT cost, and to define corresponding NEQs in PyTorch that
can be trained with the sparsity and activation quantization
restrictions in place. These can be NEQs that map to single 6:1
or 5:2 FPGA LUTs, or a generic X:Y truth table that will be
implemented by the synthesis tool. In this work, we add batch
normalization followed by uniform quantization with learned
scale factors using Brevitas [17] for better accuracy and easier
training. The first step needs to be performed only once per
FPGA device family, since the identified building blocks can
be used to construct multiple different topologies.
Fig. 4. Three steps in the LogicNets design flow.
(a) Define HBBs and NEQs. [Figure: Neuron Equivalents (NEQs) in
PyTorch (6:1, 5:2 and X:Y NEQs with FP32 weights, batch norm and
quantized ReLU, taking X total bits of input and producing a Y-bit
output) are converted by NEQtoHBB(..) into Hardware Building Blocks
(HBBs) in Verilog (6:1, 5:2 and X:Y LUTs, the latter synthesized
from 6:1 or 5:2 LUTs by Vivado).]
(b) Train and convert a network of NEQs. [Figure: a network of
trained NEQs (in PyTorch) is converted by NEQtoHBB(..) into a
netlist of configured HBBs (in Verilog).]
(c) Postprocessing and deployment. [Figure: the netlist of HBBs plus
registers goes through synthesis and P&R to give a placed and routed
circuit with an optimized netlist.]
B. Train and convert a network of NEQs
The second step is followed for each new DNN that
needs to be trained and deployed, and takes place in the
PyTorch machine learning framework. Using the available
NEQs identified in Step 1, a deep neural network topology
is constructed by instantiating NEQs and connecting them
together. We use the approach described in [11] to provide fixed
random sparsity with the desired per-neuron fan-in. To guide
topology design, our library implements the analytical model
from Equation 1 to estimate the required FPGA resources prior
to training. Once the topology is defined, the DNN is trained in
PyTorch using standard DNN optimizers and backpropagation.
Methods applied to improve standard DNN training such as
knowledge distillation and ensembling can also be applied here.
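A fixed random sparsity pattern with a given per-neuron fan-in can be illustrated as below. This is a minimal sketch assuming uniform sampling without replacement; it is not the exact scheme of [11] nor the code of the LogicNets library, and `fixed_random_connections` is a hypothetical name.

```python
import random


def fixed_random_connections(n_prev, n_next, gamma, seed=0):
    """For each of n_next neurons, pick gamma distinct inputs from the n_prev previous outputs."""
    if gamma > n_prev:
        raise ValueError("gamma cannot exceed the size of the previous layer")
    rng = random.Random(seed)  # fixed seed: the pattern is chosen once, before training
    return [sorted(rng.sample(range(n_prev), gamma)) for _ in range(n_next)]


# Assumed example: 32 neurons, each connected to 6 of the 64 previous-layer outputs.
connections = fixed_random_connections(n_prev=64, n_next=32, gamma=6, seed=0)
print(connections[0])  # indices of the six inputs feeding neuron 0
```

During training the pattern stays fixed, so only the weights on the selected connections are learned and every neuron keeps the fan-in assumed by the cost model.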
Finally, the trained network is converted into a Verilog netlist
of HBB instances and their (sparse) connections. To convert
NEQs into HBBs, we follow the enumeration-based procedure
in Section III-A to evaluate each input combination and add an
entry into the HBB truth table. The truth tables are expressed as
read-only memories (ROMs) with a case statement returning
the evaluated constant for each input combination, and we
leave it to the synthesis tool to map the ROMs to FPGA LUTs.
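The ROM-style expression of a truth table can be sketched as a small Verilog generator. The module name, the port names `addr` and `data`, and the exact formatting are assumptions made for this illustration and do not reflect the output of the LogicNets library.

```python
def truth_table_to_verilog(name, table, in_bits, out_bits):
    """Emit a combinational Verilog module with one case entry per input combination."""
    lines = [
        f"module {name} (input [{in_bits - 1}:0] addr, output reg [{out_bits - 1}:0] data);",
        "  always @(*) begin",
        "    case (addr)",
    ]
    for address in range(1 << in_bits):
        # Each entry returns the constant recorded for this input combination.
        lines.append(f"      {in_bits}'d{address}: data = {out_bits}'d{table[address]};")
    lines += ["    endcase", "  end", "endmodule"]
    return "\n".join(lines)


# Assumed example: a 3:1 truth table implementing a majority vote.
majority = [1 if bin(a).count("1") >= 2 else 0 for a in range(8)]
verilog = truth_table_to_verilog("majority3_hbb", majority, in_bits=3, out_bits=1)
print(verilog)
```

Since every input combination is listed, the case statement is complete and describes purely combinational logic that the synthesis tool is free to map and minimize.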
C. Postprocessing, Synthesis and Deployment
Any optimization admitted by a netlist can be applied as the
postprocessing step. For instance, a heuristic logic minimizer
can be applied to the network to use fewer LUTs, pipeline
registers can be inserted between the layers to increase the clock
frequency, or the netlist can be split up into chunks for mapping
to a smaller FPGA with dynamic partial reconfiguration, one
chunk at a time. In this paper, we focus on single-FPGA
implementations for extreme-throughput applications and only
consider register insertion between layers for postprocessing,
as shown in Figure 4c. After any postprocessing is complete,
the final netlist is processed by synthesis and place-and-route
algorithms to yield an FPGA bitfile. As our results in Section V-C
indicate, synthesis-time optimizations such as heuristic
logic minimization can yield significant hardware cost savings,
effectively pruning the network further to make it more sparse.
V. EVALUATION
In order to evaluate LogicNets we picked the following tasks
with extreme-throughput requirements from two domains:
• Jet Substructure Classification (JSC): Large-scale physics
experiments such as those at CERN produce terabytes
of instrumentation data every second, which is processed
by a hierarchy of triggers to filter out the interesting
results. Recent work by Duarte et al. [1] successfully
applied DNNs to the Jet Substructure Classification
(JSC) task. This task targets the FPGA-based triggers of
the CERN ATLAS and CMS experiments, which must
be pipelined to handle a data rate of 40 MHz and limit
response latency to less than a microsecond. We use the
formulation from Duarte et al. [1] for JSC as a 16-input,
5-output classification task, and refer the reader to their
work for a more detailed explanation of the task.
• Network Intrusion Detection (NID): FPGAs are commonly
used for implementing high-performance packet processing
systems that still provide a degree of programmability
[18]. An advantage of such systems is their ability to
facilitate stronger network security by detecting malicious
or suspicious network packets, which may be implemented
using DNNs [19]. To avoid introducing bottlenecks on
the network, the DNN implementation must be capable of
detecting malicious packets at line rate, which can be millions
of packets per second, and is expected to increase further
as next-generation networking solutions provide increased
throughput. To assess LogicNets in this domain we use
the UNSW-NB15 dataset [20], which provides example
packets labeled as bad (0) or normal (1) with 49 generated
input features extracted from simulated modern intrusion
attacks. We follow the approach by Murovic et al. [2] in
terms of dataset preprocessing and feature conversion.
TABLE I
HIGHLIGHTS FROM LOGICNETS RESULTS ON THE CHOSEN TASKS.
Name   Neurons per Layer      β  γ  Accuracy  Model LUT  Synth LUT  FF     Reported Fmax  Remarks
JSC-S  64, 32, 32, 32         2  3  67.8%     330        214        244    1,585 MHz
JSC-M  64, 32, 32, 32         3  4  70.6%     42,075     14,428     440    599 MHz
JSC-L  32, 64, 192, 192, 16   3  4  71.8%     303,285    37,931     810    427 MHz        βi = 4, βo = 7, γo = 5
NID-S  593, 100               2  7  83.88%    473,308    3,586      1,320  811 MHz
NID-M  593, 256, 128, 128     2  7  91.30%    754,292    15,949     1,274  471 MHz
NID-L  593, 100, 100, 100     3  5  88.68%    1,021,175  25,050     1,421  418 MHz        βi = 2, γi = 7
We trained a number of sparsely-connected, activation-quantized
MLPs on the chosen tasks using the LogicNets
PyTorch library. All layers use the same γ and β, except when
γi, γo, βi and βo are used to specify the first and last layers’
fan-in and bitwidth, respectively. Based on the feedback from
our cost model, we limit X = β · γ ≤ 15 in our exploration
to focus on LogicNets implementable on a single FPGA. All
networks presented here are trained for 1000 epochs, with
a mini-batch size of 1024, using the ADAM optimizer and
a step decay learning rate schedule starting from 0.1. Once
the training is complete, we generate Verilog from the trained
network as described in Section IV-B. We use Xilinx Vivado
2018.3 in out-of-context mode with the default settings for
synthesis and Flow_PerfOptimized_high for place and
route without any manual placement constraints, targeting the
xcvu9p-flgb2104-2-i FPGA part. We insert registers
at the network input, output and between every layer, and
constrain the clock to 1 ns to achieve the highest possible
frequency. The correctness of the generated hardware is verified
by performing post-synthesis simulation and ensuring the same
results as the original PyTorch network are returned.
A. Overview of Key Results
Table I presents several networks obtained using LogicNets
on the JSC and NID tasks, picked to illustrate interesting points
from our partial design space exploration. The table names each
datapoint by indicating which task the network was trained
on, specifies the network topology and quantization, the test
accuracy, and FPGA resources from both the analytical model
and post-synthesis results. We make the following observations:
Scalable resource footprint. LogicNets-style models expose
many topological knobs to control the size of the network,
which translates into FPGA resource savings at the cost of
some accuracy. This is reflected in the variety of neurons per
layer, β and γ in Table I. For instance, JSC-S is able to achieve
close to 68% accuracy using only 214 LUTs, while JSC-L
offers 4% points better accuracy at the cost of 37.9k LUTs.
Simple, high-frequency circuits. LogicNets yield simple
circuits by design, with as little as a single level of LUTs
between registers when neuron fan-in is constrained to be
β · γ ≤ 6. This is the case for JSC-S, which has a reported
Fmax of over 1.5 GHz. Although this frequency cannot be achieved
in practice due to limitations on the global clock network of
the FPGA, the positive slack yielded by the simplicity of
LogicNets-style netlists is still quite significant and would
facilitate timing closure when integrating LogicNets-style
DNNs into other high-speed designs. Larger LogicNets designs
such as NID-L with β · γ = 15 require more levels of logic
and resources and are more challenging to place and route, but
are still capable of achieving high clock rates over 400 MHz.
Fig. 5. Accuracy impact of β with model LUTs. [Plot: accuracy (%,
84 to 90) against model LUTs (log scale, 10^2 to 10^7), with one
series each for β = 1, β = 2 and β = 3.]
Competitive accuracy with sparsity and quantization. We are
able to obtain datapoints which offer accuracy over 70% for the
JSC task and over 90% for the NID task, which is competitive
with the accuracy reported by related work (Section V-E). We
expect accuracy to further increase by using better sparsity
methods compared to fixed random sparsity, such as magnitude
pruning and sparse momentum, but this investigation is left
for future work.
B. Impact of Activation Bitwidth
To better understand the impact of activation bitwidth (β)
on accuracy, we train a variety of topologies with different
number of neurons, fan-ins (γ) and activation bitwidths (β),
plotting the accuracy on the JSC task against the analytical LUT
cost in Figure 5, grouping the datapoints by β. We observe a
general increase in accuracy for the larger models, and some
interesting behavior regarding β. Although binary activations
generally have lower LUT cost, we observe that higher β can
yield higher accuracy at similar LUT cost. For instance, the
lowest-cost β = 2 datapoint offers 88.7% accuracy with 2120
LUTs, whereas the highest-accuracy binary result provides
85.3% with 7392 LUTs. A similar trend is observed for β = 3,
although the accuracy improvement over β = 2 is smaller.
Figure 6 presents this from another perspective, where
LogicNets with the same number of neurons but different β
and γ are clustered in columns. Training the same number of
neurons with greater activation bitwidth yields greater accuracy.
Increasing the number of neurons improves the results slightly
for β = 1 but brings little to no benefit for β > 1. In general,
it is difficult to estimate the effect of DNN hyperparameters
on accuracy, but our experiments indicate that the activation
bitwidth β is vital to achieving good accuracy for LogicNets.
Fig. 6. Accuracy impact of β with number of neurons. [Plot: accuracy
(%, 84 to 90) against total neurons (0 to 2,000), with one series
each for β = 1, β = 2 and β = 3.]
Fig. 7. Post-synthesis resource breakdown for NID-S.
C. Lossless Pruning via Synthesis Optimizations
There are notable differences in the LUT counts yielded by
the analytical model and post-synthesis results in Table I, where
the post-synthesis LUT counts are on average 39× smaller than
the LUT cost predicted by the analytical model. To understand
why, we take a closer look at the post-synthesis utilization data
for NID-S as reported by Vivado. Figure 7 illustrates the pre-
and post-synthesis resource usage per layer, and the registers
between layers. The extent of pruning is evident in both LUT
and FF counts, indicating that many neurons are removed
entirely, while others are implemented with much smaller cost.
The drastic reduction in resources can be attributed to two
effects. Firstly, many NEQs have no path to output owing to a
combination of fixed random sparsity, the chosen γ and neuron
counts. This is reflected in the post-synthesis results: synthesis
removes many unconnected neurons and registers from the
over-provisioned topology, resulting in fewer LUTs than
predicted by our basic analytical model owing to the effective
reduction of X and of the number of neurons. The JSC models
are less over-provisioned and exhibit smaller post-synthesis
savings of 1.5–8×. The second effect is the actual cost of
single HBBs, which is 682 LUTs for the 14:2 HBB according
to the analytical model. Examining the per-neuron resource
costs, we observe LUT costs ranging from 10 to 262, as well as
instances of F7MUX and F8MUX primitives. This indicates that
significant savings of 2.6–68× can be achieved by applying
heuristic logic minimization on the learned HBB functions. We
leave a more in-depth study of LogicNets logic minimization
for future work.
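The reduction factors quoted in this section can be reproduced from the Model LUT and Synth LUT columns of Table I. The text does not state how the average was formed; the sketch below assumes an arithmetic mean of the six per-network ratios, which comes out at roughly 39×.

```python
# (model LUTs, post-synthesis LUTs) copied from Table I
table_i = {
    "JSC-S": (330, 214),
    "JSC-M": (42_075, 14_428),
    "JSC-L": (303_285, 37_931),
    "NID-S": (473_308, 3_586),
    "NID-M": (754_292, 15_949),
    "NID-L": (1_021_175, 25_050),
}

# Per-network reduction factor: analytical estimate divided by synthesized LUT count.
ratios = {name: model / synth for name, (model, synth) in table_i.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

# Per-HBB savings for the 14:2 HBB (682 LUTs in the model, 10 to 262 observed).
hbb_savings = (682 / 262, 682 / 10)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
print(f"mean: {mean_ratio:.1f}x")
print(f"per-HBB savings: {hbb_savings[0]:.1f}x to {hbb_savings[1]:.1f}x")
```

The NID networks dominate the mean, which matches the observation that they are more over-provisioned than the JSC networks.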
D. Limitations
Based on our evaluation, we identify two key limitations of
our current approach. The first is the inability to map dense
networks with high fan-in neurons commonly used in machine
learning, instead requiring custom topologies that apply fan-in
restrictions via sparsity and activation quantization. The second
is the large design space with combinations of layer and neuron
counts, activation bitwidths and fan-ins that must be explored
to find suitable networks of sufficiently high accuracy and low
resource cost. Although these may be acceptable limitations
for extreme-throughput applications that commonly require
specialized design effort, we hope to address these limitations
in future work to broaden the scope of LogicNets.
TABLE II
COMPARING LOGICNETS TO RELATED WORK WITH II=1.
Work   Accuracy  Fclk     Latency  Resources
JSC-L  71.8%     384 MHz  13 ns    37.9k LUT
[1]    75%       200 MHz  50 ns    88k LUT, 1k DSP
NID-M  91.30%    471 MHz  10.5 ns  15.9k LUT
[2]    90.1%     51 MHz   19.6 ns  51k LUT
E. Comparison to Related Work
We compare LogicNets to two prior works on extreme-
throughput, fully-unfolded (II=1) FPGA implementations on
these domains, according to the metrics presented in Table II.
For the JSC task, we take the work by Duarte et al. [1] as
our baseline, which reports an accuracy of 75% and presents
a fully-unrolled FPGA implementation of a pruned, 16-bit
quantized, 30% sparse neural network. To focus on
the implementation of the core neural network part itself, we
exclude the 5-cycle softmax execution from the latency. Our
JSC-L implementation is able to offer 1.9× higher throughput
and 3.8× lower latency at the cost of 3.2% points lower
accuracy. Since JSC-L uses less than half of the LUTs
compared to [1] and no DSP slices, the accuracy could be
potentially increased further by using a larger LogicNets-style
network and more resources. For the NID task, we compare
against Murovic et al. [2], who implement a dense binarized
neural network achieving 90.1% accuracy as a fully-unfolded,
combinatorial circuit without any registers. The LogicNets
NID-M implementation outperforms their solution in terms of
all metrics, offering 1.2% points higher accuracy, 9.2× higher
throughput and 1.9× lower latency using 3.2× fewer LUTs.
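The ratios quoted in this comparison follow directly from Table II. The sketch below assumes that with II=1 one inference completes per clock cycle, so the throughput ratio equals the ratio of clock frequencies; the dictionary layout and the `compare` helper are illustrative choices.

```python
# Values copied from Table II (LUT counts rounded as in the table).
jsc_l = {"fclk_mhz": 384, "latency_ns": 13.0, "luts": 37_900, "acc": 71.8}
duarte = {"fclk_mhz": 200, "latency_ns": 50.0, "luts": 88_000, "acc": 75.0}
nid_m = {"fclk_mhz": 471, "latency_ns": 10.5, "luts": 15_900, "acc": 91.30}
murovic = {"fclk_mhz": 51, "latency_ns": 19.6, "luts": 51_000, "acc": 90.1}


def compare(ours, baseline):
    """Return improvement factors of `ours` over `baseline` (higher is better, except acc_delta sign)."""
    return {
        "throughput_x": ours["fclk_mhz"] / baseline["fclk_mhz"],  # II=1: one inference per cycle
        "latency_x": baseline["latency_ns"] / ours["latency_ns"],
        "lut_x": baseline["luts"] / ours["luts"],
        "acc_delta": ours["acc"] - baseline["acc"],               # percentage points
    }


jsc = compare(jsc_l, duarte)
nid = compare(nid_m, murovic)
print({k: round(v, 1) for k, v in jsc.items()})
print({k: round(v, 1) for k, v in nid.items()})
```

For JSC this gives about 1.9× throughput and 3.8× lower latency at 3.2 points lower accuracy; for NID it gives about 9.2× throughput, 1.9× lower latency and 3.2× fewer LUTs at 1.2 points higher accuracy.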
VI. CONCLUSION AND FUTURE WORK
In this work, we have investigated how DNN topologies
that map well to FPGA building blocks can be constructed
to obtain efficient implementations for extreme-throughput
applications. Noting that quantized neurons with limited fan-in
can be converted into small truth tables, we have proposed
a flow to design sparse, quantized topologies that map to
highly efficient FPGA implementations. On two tasks with
extreme-throughput requirements, we were able to demonstrate
implementations with competitive accuracy, low latency and
throughput in the hundreds of millions of inferences per second.
LogicNets opens up a wide array of possible future work
owing to its cross-stack nature which spans machine learning,
compilation and FPGA synthesis. On the machine learning
side, we plan to explore training-time methods to increase
the accuracy of sparse and quantized topologies, to investigate
sparse convolutions, and to mix
LogicNets-style layers with more conventional ones in the
same topology in order to increase accuracy and apply the approach to
more difficult problems. On the tooling and hardware synthesis
side, we hope to further study the benefits of heuristic logic
minimization to design more accurate analytical cost models,
as well as techniques to synthesize larger LogicNets models
quickly and efficiently.
REFERENCES
[1] J. Duarte, S. Han, P. Harris, S. Jindariani, E. Kreinar, B. Kreis,
J. Ngadiuba, M. Pierini, R. Rivera, N. Tran et al., “Fast inference
of deep neural networks in FPGAs for particle physics,” Journal of
Instrumentation, vol. 13, no. 07, p. P07027, 2018.
[2] T. Murovič and A. Trost, “Massively parallel combinational binary neural
networks for edge processing,” Elektrotehniški Vestnik, vol. 86, no. 1/2,
pp. 47–53, 2019.
[3] Y. Shi, K. Davaslioglu, Y. E. Sagduyu, W. C. Headley, M. Fowler, and
G. Green, “Deep learning for RF signal classification in unknown and
dynamic spectrum environments,” 2019 IEEE International Symposium
on Dynamic Spectrum Access Networks (DySPAN), pp. 1–10, 2019.
[4] M. Blott, L. Halder, M. Leeser, and L. Doyle, “QuTiBench: Benchmarking
neural networks on heterogeneous hardware,” 2019.
[5] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre,
and K. Vissers, “FINN: A framework for fast, scalable binarized neural
network inference,” in Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp. 65–74.
[6] S. Tridgell, M. Kumm, M. Hardieck, D. Boland, D. Moss, P. Zipf,
and P. H. W. Leong, “Unrolling ternary neural networks,” ACM Trans.
Reconfigurable Technol. Syst., vol. 12, no. 4, Oct. 2019. [Online].
Available: https://doi.org/10.1145/3359983
[7] E. Wang, J. J. Davis, P. Y. Cheung, and G. A. Constantinides, “LUTNet:
Rethinking inference in FPGA soft logic,” in 2019 IEEE 27th Annual
International Symposium on Field-Programmable Custom Computing
Machines (FCCM). IEEE, 2019, pp. 26–34.
[8] Y. Umuroglu, Y. Akhauri, N. J. Fraser, and M. Blott, “High-throughput
DNN inference with LogicNets.” IEEE, 2020.
[9] M. Wortsman, A. Farhadi, and M. Rastegari, “Discovering neural wirings,”
ArXiv, vol. abs/1906.00586, 2019.
[10] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep
neural networks with pruning, trained quantization and Huffman coding,”
arXiv preprint arXiv:1510.00149, 2015.
[11] A. Prabhu, G. Varma, and A. Namboodiri, “Deep expander networks:
Efficient deep networks from graph theory,” in Proceedings of the
European Conference on Computer Vision (ECCV), 2018, pp. 20–35.
[12] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating
gradients through stochastic neurons for conditional computation,” arXiv
preprint arXiv:1308.3432, 2013.
[13] Y. Guo, “A survey on methods and theories of quantized neural
networks,” CoRR, vol. abs/1808.04752, 2018. [Online]. Available:
http://arxiv.org/abs/1808.04752
[14] M. Nazemi, G. Pasandi, and M. Pedram, “NullaNet: training deep
neural networks for reduced-memory-access inference,” arXiv preprint
arXiv:1807.08716, 2018.
[15] R. Zhao, S. Liu, H.-C. Ng, E. Wang, J. J. Davis, X. Niu, X. Wang,
H. Shi, G. A. Constantinides, P. Y. Cheung et al., “Hardware compilation
of deep neural networks: An overview,” in 2018 IEEE 29th International
Conference on Application-specific Systems, Architectures and Processors
(ASAP). IEEE, 2018, pp. 1–8.
[16] M. Ghasemzadeh, M. Samragh, and F. Koushanfar, “ReBNet: Residual
binarized neural network,” in 2018 IEEE 26th Annual International
Symposium on Field-Programmable Custom Computing Machines (FCCM).
IEEE, 2018, pp. 57–64.
[17] Xilinx, “Training-aware quantization in PyTorch,” January 2020, [Online;
retrieved 2020-01-13]. [Online]. Available: https://github.com/Xilinx/brevitas
[18] ——, “Software defined specification environment for networking
(SDNet),” March 2014, [Online; retrieved 2020-01-03]. [Online].
Available: https://www.xilinx.com/support/documentation/backgrounders/sdnet-backgrounder.pdf
[19] B. G. HB, P. Poornachandran, S. KP et al., “Deep-net: Deep neural
network for cyber security use cases,” arXiv preprint arXiv:1812.03519,
2018.
[20] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for
network intrusion detection systems (unsw-nb15 network data set),” in
2015 Military Communications and Information Systems Conference
(MilCIS). IEEE, 2015, pp. 1–6.
