ResNet-like Architecture with Low Hardware Requirements by Limonova, Elena et al.
ResNet-like Architecture with Low Hardware
Requirements
Elena Limonova
FRC CSC RAS,
Smart Engines Service LLC,
Moscow, Russia
Email: limonova@smartengines.com
Daniil Alfonso
JSC MCST,
Moscow, Russia
Email: alfonso d@mcst.ru
Dmitry Nikolaev
Smart Engines Service LLC,
Institute for Information Transmission
Problems RAS, Moscow, Russia
Email: dimonstr@iitp.ru
Vladimir V. Arlazarov
FRC CSC RAS,
Smart Engines Service LLC,
Moscow, Russia
Email: vva@smartengines.com
Abstract—One of the most computationally intensive parts
in modern recognition systems is an inference of deep neural
networks that are used for image classification, segmentation,
enhancement, and recognition. The growing popularity of edge
computing makes us look for ways to reduce its time for mobile
and embedded devices. One way to decrease the neural network
inference time is to modify a neuron model to make it more
efficient for computations on a specific device. The example of
such a model is a bipolar morphological neuron model. The
bipolar morphological neuron is based on the idea of replacing
multiplication with addition and maximum operations. This
model has been demonstrated for simple image classification
with LeNet-like architectures [1]. In the paper, we introduce
a bipolar morphological ResNet (BM-ResNet) model obtained
from a much more complex ResNet architecture by converting
its layers to bipolar morphological ones. We apply BM-ResNet to
image classification on MNIST and CIFAR-10 datasets with only
a moderate accuracy decrease from 99.3% to 99.1% and from
85.3% to 85.1%. We also estimate the computational complexity
of the resulting model. We show that for the majority of ResNet
layers, the considered model requires 2.1-2.9 times fewer logic
gates for implementation and 15-30% lower latency.
I. INTRODUCTION
Machine vision is becoming very popular and finds many
practical applications [2]–[7]. The proliferation of complex
technical systems leads to the emergence of very different
requirements for the computational algorithms used. The con-
cept of edge computing is gaining popularity, which means that
calculations are performed as close as possible to the end-user.
In this case, it is often impossible to use powerful hardware,
and the methods must work quickly and accurately enough
on a device with limited resources. Therefore, the task of
increasing the computational efficiency of pattern recognition
algorithms is becoming more and more critical.
A widely used approach to pattern recognition relies on
neural network-based algorithms. The choice of specific neural
network architecture can be pretty tricky and depends on
the problem, the desired accuracy, and suitable computa-
tional efficiency [8]. Cutting edge neural network architectures
significantly differ from each other but have one thing in
common. They all strongly rely on convolutional layers. These
layers perform the convolution of the input signal with one
or more filters. The convolution operation has several features
that are very important for visual recognition. The first feature
is the fact that the result does not depend on the spatial
position of the image object. The relatively small size of the
filter provides an analysis of a small space region of image
(receptive field) and allows us to select elementary features,
for example, corners. These elementary features are, in turn,
analyzed by subsequent layers. The aspects of convolutional
neural networks repeat the properties of receptive neurons.
However, the computational operation performed inside the
neuron is not limited to a weighted sum of input signals,
followed by non-linearity. For example, retinal neurons are ON
/ OFF neurons; that is, ON-neurons only work in the presence
of light, and OFF-neurons activate in the dark. Although we
can model a similar effect using activation functions, modern
neural network architectures typically include only a positive
data processing pathway via Rectified Linear Unit (ReLU)
activation. However, in [9] Kim et al. show that processing
negative information and using separate signal processing
paths can improve recognition accuracy. Thus, there is a reason
to believe that the traditional model of the convolutional
neural network can be improved, considering the biological
mechanisms of perception.
The new bipolar morphological neuron model [1] has all
these features of perception, including separate processing
paths for positive and negative data, that emulate excitation
and inhibition inside the neuron. There is one more impor-
tant thing to mention: it does not use multiplication in the
operation of convolution, and uses addition and maximum
instead. Therefore, there is reason to believe that such a
model can be significantly more computationally efficient.
Although it utilizes quite difficult activation functions outside
the convolution, we can implement them efficiently for cases
when we can use approximations. However, the model has
ar
X
iv
:2
00
9.
07
19
0v
1 
 [c
s.C
V]
  1
5 S
ep
 20
20
been demonstrated only in image classification with LeNet-
like networks. Compact lightweight networks are handy for
mobile and embedded recognition, but there are a lot more
tasks which require deep models to reach state-of-the-art
quality.
In this paper, we demonstrate a bipolar morphological
network of ResNet architecture [10], [11]. ResNet is based
on residual blocks that allow stacking to obtain a deep neural
network. The resulting network can solve many recognition
problems with high quality and is very scalable to keep
the desired balance between inference speed and accuracy.
We show bipolar morphological ResNet22 accuracy in image
classification on CIFAR-10 and MNIST datasets. Thus, we
demonstrate for the first time that bipolar morphological
neurons can be used in deep neural networks. We also analyze
the computational complexity of the network and estimate the
number of logic gates required for implementation.
The structure of the paper is as follows. Section 2 gives
related work on network structures and speedup methods. In
Section 3, we describe the bipolar morphological model, and
its complexity in terms of computing operations, logic gates
and clock cycle latency. Section 4 shows our experimental re-
sults on performance for the ResNet-like model on MNIST and
CIFAR-10 image classification problems. Finally, in Section 5,
we summarize our paper.
II. RELATED WORK
Researchers pay more and more attention to various approx-
imations of neural network structures and the development of
alternative computational models. The main problem solved in
this way is the acceleration of the network inference on spe-
cific devices, which can vary significantly in their architecture
and requires different approaches to optimize efficiency.
For example, a universal method is to reduce the redundancy
of neural network architecture. Low-rank approximations of
computations are very popular in order to reduce the com-
putational complexity of convolutional layers. They are based
on various decompositions of convolutional filters (SVD vari-
ations, for example) [12]–[14], including depth-wise separable
convolution and its modifications that divide the channels of
the input image into groups and process them separately [15],
[16]. These approaches are still being improved. For example,
in [17], authors propose separable convolutional eigen-filters,
which reduce computational complexity and demonstrate the
complete absence of loss in recognition quality. In [18],
authors speed up convolutional layers by using depth-wise sep-
arable convolution and show a way to find the optimal channel
configuration in groups to prevent a quality decrease. Modern
methods also combine different approaches, for example, low-
rank approximations and sparse filters, and reduce the number
of parameters of modern architectures by about 30%, leaving
accuracy almost at the same level [19].
Low precision integer approximations are also very popular.
For example, in [20], authors propose an end-to-end 8-bit
integer model without internal conversions to floating-point
data types. As a result, this model can provide fast infer-
ence on mobile and embedded devices with relatively small
accuracy losses. The research on increasing the accuracy of
the quantized network is also underway [21]–[24]. Among
such approximations, binary neural network models [25], [26]
occupy a special place. They can be very computationally
efficient on special devices and use less memory. However,
they are still inferior to floating-point models in terms of
recognition accuracy. Fast implementations are also of interest
[27], [28].
Recently, the idea of multiplication-free neural networks
is gaining popularity again. The first model without multi-
plications was a morphological neural network, which was
proposed back in 1996 by Ritter [29] and then complemented
with dendrites [30]. It did not find extensive use due to the
insufficiently high accuracy for complex recognition tasks.
Recently, however, new models have been proposed that
restrict the use of multiplications, for example, DeepShift [31],
where Elhoushi et al. replace multiplications by an effective
bit shift. In [32], Chen et al. propose to use the L1-norm
in convolutional layers, while preserving the multiplications
in the batch-normalization layer. To train the network, the
authors used sign gradient in backpropagation. The results
demonstrate a slight decrease in recognition accuracy on the
MNIST, CIFAR-10, and CIFAR-100 datasets.
III. BIPOLAR MORPHOLOGICAL NETWORKS
Bipolar morphological (BM) neuron presented in [1] per-
forms multiplication-free approximation of a classical neuron
to increase its computational efficiency for specialized devices.
A. BM neuron model
A classical neuron performs the following operation:
y(x,w) = σ
(
N∑
i=1
wixi + wN+1
)
, (1)
where x is an input vector of length N , w is weight vector
of length N + 1 and σ(·) is a nonlinear activation function.
Calculations in the bipolar morphological neuron can be
expressed as:
yBM (x, V, v) = σ
(
exp
N
max
j=1
(lnx+j + V
+
j )−
− exp Nmax
j=1
(lnx+j + V
−
j )− exp
N
max
j=1
(lnx−j + V
+
j )+
+ exp
N
max
j=1
(lnx−j + V
−
j ) + v
)
,
(2)
x+j =
{
xj , xj ≥ 0,
0, xj < 0,
(3)
x−j =
{
−xj , xj < 0,
0, xj ≥ 0,
(4)
where x is an input vector of length N , V +, V − are weight
vectors of size N , v is bias, σ(·) is a nonlinear activation
function. We define ln 0 = −∞ and replace it by a big enough
negative value for actual computations.
Since the neuron processes positive and negative parts of
input x in a quite similar manner, we can interpret it as two
identical computational paths responsible for excitation and
inhibition.
We consider V +j and V
−
j as separate weights and train them
independently.
B. Training
The training method for networks with bipolar morpholog-
ical layers is shown in [1]. The problem with the straightfor-
ward construction of the BM network and training it using
standard gradient methods is that there is only one non-zero
gradient element due to max operation, and only one weight is
updated at each iteration. Thus it may show poor quality. Some
weights can never be updated and never fire after training, thus
giving redundancy to the network.
Instead, we can use incremental layer-by-layer conversion
from the standard layer to the BM layer. The approach is in
training standard network and modifying convolutional and
fully-connected layers from the first to the last and training
new partly BM network. It can be summarized as:
1) Train classical network using conventional gradient
descent-based methods;
2) For each convolutional and fully-connected layers: re-
place neurons of type (1) with weights w by the BM-
neurons with weights {V +, V −, v}, where:
V +j =
{
lnwj , if wj > 0,
−∞, otherwise,
V −j =
{
ln |wj |, if wj < 0,
−∞, otherwise,
v = wN+1.
(5)
3) Perform additional training of the network after conver-
sion of each layer using same method as in 1.
C. Computational complexity
Since neurons in neural network models are organized into
layers, we consider a BM layer. The layer uses many addition
and maximum operations instead of multiplications. However,
the exact number depends on the computation organization.
Although the length of the x+ and x− is N for each of them,
together, they have N non-zero terms precisely. Only these
terms will contribute to the result. So, we can say that the
number of log operations is only N and does not consider
those zero terms.
The standard convolutional layer with input IL×M×C and
output OL×M×F does the following:
O(l,m, f) = σ
(
C∑
c=1
K−1∑
∆l=0
K−1∑
∆m=0
I(l + ∆l,m+ ∆m, c)·
·w(∆l,∆m, c, f) + b(f)
)
, f = 1, F , l = 1, L,m = 1,M
(6)
Here F is the number of filters, C is the number of input
channels, K ×K is the spatial dimensions of the filter, input
image size is L×M × C, w is a set of convolutional filters,
b is the bias. We suppose I is padded properly for the result
to be of the same size.
The standard fully-connected layer with input I(p) and
output OQ does:
O(q) = σ
(
P∑
p=1
I(p) · w(p, q) + b(q)
)
, q = 1, Q (7)
Here P is the number of inputs, Q is the number of neurons
in the layer, w is a set of fully-connected weights, b is set of
biases.
The number of operations for the standard and BM convo-
lutional layers is shown in Table I. In Table II, we show the
number of operations for standard and BM fully-connected
layers.
TABLE I
THE NUMBER OF OPERATIONS IN THE CONVOLUTIONAL (CONV) LAYER
OF BM AND STANDARD MODELS. F IS THE NUMBER OF FILTERS, C IS THE
NUMBER OF INPUT CHANNELS, K ×K IS THE SPATIAL DIMENSIONS OF
THE FILTER, INPUT IMAGE SIZE IS L×M × C .
Op Standard conv BM conv
σ(·) FLM FLM
Exp 0 4FLM
Log 0 CLM
Add FK2CLM 2F (K2C + 2)LM
Max 0 2F (K2C − 1)LM
Mul FK2CLM 0
TABLE II
THE NUMBER OF OPERATIONS IN THE FULLY-CONNECTED (FC) LAYER OF
BM AND STANDARD MODELS. P IS THE NUMBER OF INPUTS, Q IS THE
NUMBER OF NEURONS IN THE LAYER.
Op Standard fc BM fc
σ(·) Q Q
Exp 0 4Q
Log 0 P
Add QP 2Q(P + 2)
Max 0 2Q(P − 1)
Mul QP 0
D. Hardware implementation complexity
In general, in order to compare the computational efficiency
of such structures, we need to understand what kind of
computing device is involved and know the characteristic
latency and throughput of multiplier and adder. On modern
x86-64 and ARM architectures, for example, general-purpose
Arithmetic Logic Units (ALUs) are used. Execution time for
the multiplication does not differ from the time for addition for
floating-point data and only slightly differs for integer vector
data (see Table III). Therefore, it will be extremely difficult to
obtain inference acceleration implementing a BM network on
a CPU, even with the coefficients and input signals converted
to integers, since the total number of operations is more than
in the standard layer. For this reason, the proposed model is
primarily aimed at FPGA and ASIC projects.
TABLE III
THE LATENCY AND AVERAGE THROUGHPUT OF VECTOR ARITHMETIC
OPERATIONS FOR 32-BIT PACKED VALUES IN A VECTOR [33], [34].
Op latency throughput
Intel Skylake-X, floating-point 128-bit vector
add 4 0.5
max 4 0.5-1
mul 4 0.5-1
Intel Skylake-X, integer 128-bit vector
add 1 0.33
max 1 0.5
mul 5 0.5
mul+add 5 0.5
ARM Cortex-A57, floating-point 128-bit vector
add 5 2
max 5 2
mul 5 2
ARM Cortex-A57, integer 128-bit vector
add 3 2
max 3 2
mul 5 1
mul+add 5 1
In this case, it is possible to make an efficient implemen-
tation with parallel execution of 4 computational paths with a
specialized adder and maximum blocks, and not the general-
purpose ALU. An inevitable delay will be caused by the
multiplexing mechanism, which is necessary to direct the input
signal to the desired computational pathway.
Let us estimate the number of logic gates and clock cycle
latency required for the arithmetical operations involved in the
computations. We have used Verilog HDL to get register trans-
fer level description of addition, multiplication and maximum
computation arithmetic units conforming to IEEE 754 floating-
point standard, and Synopsys Design Compiler to implement
it at gate-level and obtain logic gate complexity and latency
characteristics. For the logarithm and exponent operation,
we have used software approximation through addition and
multiplication to evaluate the hardware complexity. These
values for single-precision data type are shown in Table IV.
Our custom implementation of the log function gives the
precision of 4 decimal digits but is faster than the full-precision
one. Let us describe it.
The floating-point numbers of single-precision in IEEE 754
are represented as the sign s, mantissa b = b0b1b2, . . . , b22
and exponent e:
x = 2e−127 · 1.b22b21, . . . , b0 (8)
So,
log2 x = e− 127 + log2(1 + b22b21, . . . , b0), (9)
where b22b21, . . . , b0 is in [0, 1). It means that we only need
to approximate log2(1 + y) for y ∈ [0, 1).
The approximation we construct is polynomial and has the
5th order:
f(y) = log2(1 + y)→ f˜(y) =
5∑
i=0
Ciy
i (10)
We obtain the coefficients Ci by solving the system of linear
equations. For 3 points 0, 0.5 and 1 we equate the values of
f(y) to the values f˜(y). The same we do with the values
of f ′(y) and f˜ ′(y). The resulting Ci are {0, 1.44269504,
−0.71249131, 0.42046732, −0.1955884, 0.04491735}, and
the approximation has a maximum error of about 7 · 10−5
in [0, 1). When computed with Horner’s method, it uses only
5 multiplications and 6 additions (including the one to get
e − 127) and bit manipulations to get s, e, and b, which are
free for hardware.
For the exponent we used a reference implementation for
approximated exponent from Intel [35].
TABLE IV
THE ESTIMATE NUMBER OF GATES AND LATENCY FOR ARITHMETICAL
OPERATIONS
Op Gates Latency, clock cycles
add 16048 3
max 1464 2
mul 35345 4
log 154179 35
exp 256965 21
Knowing the number of single operations (see Section III-C)
and assuming that all four terms in (2) are computed in
parallel (twice less operation for add and max for one thread
and 4 times less for exp) we obtain the approximate gate
complexity of the circuit and its latency. The ratios of the
gate numbers and the clock cycle latencies for the standard
and BM convolutional layers are presented in Table V. These
ratios demonstrate that for core layers inside the network
with quite a large number of input channels, we can get 2.1-
2.9 times fewer gates and 15-30% lower latency for the BM
layer. Figure 1 illustrates the total gate number required for
all convolutional layers for the ResNet-22 for CIFAR-10 (see
Section IV-B) depending on how many standard layers are
replaced by BM ones. Of course, the real ASICs do not require
a separate unit for each layer, and we only demonstrate the
general complexity decrease of the BM network.
However, there are several things that we can improve in
the BM layer model for practical purposes. For example,
we can use less precise logarithm approximation. Even more
promising is using quantized BM layers to perform more
calculations in integers.
IV. BM-RESNET
The ResNet [10], [11] is a modern deep neural network
architecture. The core idea is to use an identity shortcut con-
nection to skip some layers to deal with vanishing gradients’
TABLE V
THE APPROXIMATE GATE NUMBER AND LATENCY RATIOS FOR STANDARD
AND BM CONVOLUTIONAL LAYERS.
F C K Gates
standard/BM
Latency
standard/BM
16 1 1 0.16 0.22
16 16 1 1.14 0.80
32 1 1 0.17 0.23
32 32 1 1.64 1.02
64 1 1 0.17 0.23
64 64 1 2.11 1.18
128 1 1 0.17 0,23
128 128 1 2.45 1.28
256 1 1 0.17 0.23
256 256 1 2.67 1,34
512 1 1 0.17 0,23
512 512 1 2.80 1.37
16 1 3 1.02 0.87
16 16 3 2.50 1.29
32 1 3 1.03 0.89
32 32 3 2.70 1.34
64 1 3 1.03 0.89
64 64 3 2.81 1.37
128 1 3 1.04 0.91
128 128 3 2.87 1.39
256 1 3 1.04 0.90
256 256 3 2.9 1.39
512 1 3 1.04 0.90
512 512 3 2.92 1.40
Fig. 1. The total gate number in convolutional layers of the ResNet-22
network depending on the number of BM convolutions. The rest of the
network still uses standard convolutional layers.
problem. It allows us to stack convolutional layers and obtain
better recognition quality than shallower models. So, ResNet
is a scalable and accurate model that founds wide application
in practice.
Our goal was to replace all classical neurons in convolu-
tional layers by BM-neurons, keeping the network structure the
same. We conducted training according to the approach from
Section III-B. We trained 22-layer ResNet-v2 (the architecture
is briefly shown in Fig. 2) at first with standard convolutional
layers (step 1). Then we converted the convolutions layer by
layer to BM convolutions according to (step 2) and trained the
Fig. 2. The ResNet architecture with 22 convolutional layers used in
experiments. Batch normalization and activations are omitted for simplicity.
network for the 50 epochs (step 3). The layers were converted
sequentially from the first to the last. After the conversion of
all layers, the whole network was trained until the accuracy
stopped improving. The experiment is aimed at optimizing
training time because the modelling time of BM-neuron using
standard frameworks is quite slow. We performed training with
a standard Adam optimizer [36] minimizing cross-entropy
loss.
A. MNIST
MNIST is a database of gray handwritten digits consisting
of 60000 gray images of size 28 × 28 pixels for train and
10000 images for test [37]. We used 10% of the training set
for validation and the rest for training. A few samples from
the dataset are shown in Fig. 3.
Fig. 3. Sample images from MNIST dataset.
Fig. 4. Accuracy on MNIST after conversion before additional training.
The accuracy of the original ResNet was 99.3%. Evolution
of the accuracies in the process of layer conversion to BM is
shown in Fig. 4-5. The final macro-average precision and recall
are shown in Fig. 6 and Fig. 7. We can see that, in general,
the network was able to preserve the original accuracy. Some
fluctuations can be associated with not enough training time
for each conversion step for the network to converge to the best
accuracy. However, these layers were affected during further
training, so it should not affect the final result with all BM
convolutions.
The BM-ResNet demonstrated only slight accuracy decrease
to 99.1% (from 99.3%) and can still be considered suitable for
practical usage.
B. CIFAR-10
CIFAR-10 is a database with 60000 32 × 32 color im-
ages [38] for train and 10000 images for test. These images
show objects of 10 different classes. A few samples from the
dataset are shown in Fig 8. We used standard preprocessing
(normalized the pixel values of each sample to be in a range
[0, 1], and subtracted a mean image over the whole training
database from each sample). We also used data augmentation,
which included random horizontal and vertical shifts and
random horizontal flips.
Evolution of the accuracies in the process of layer conver-
sion to BM is shown in Fig. 9-10. The final macro-average
precision and recall are shown in Fig. 11 and Fig. 12. The
accuracy of the standard ResNet was 85.3%. The BM-ResNet
a)
b)
Fig. 5. Accuracy on MNIST after conversion and additional training for
accuracy range 0.95-1.00 (a) and 0.989-0.995 (b).
Fig. 6. Macro-average precision on MNIST after conversion and additional
training.
Fig. 7. Macro-average recall on MNIST after conversion and additional
training.
Fig. 8. Sample images from CIFAR-10 dataset.
Fig. 9. Accuracy on CIFAR-10 after conversion before additional training.
Fig. 10. Accuracy on CIFAR-10 after conversion and additional training.
with all the convolutional layers converted demonstrated accu-
racy decrease to 77.7%, which is a significant accuracy drop.
However, with 16 BM-convolutional layers, its accuracy was
85.1%, and with 18 BM-convolutional layers, it was 83.9%.
So, the BM-ResNet with 16 converted layers is nearly as good
as the original ResNet.
V. CONCLUSION
In this paper, we managed to obtain a 22-convolutional
layer ResNet-like with bipolar morphological convolutions.
We converted and trained the network to classify MNIST and
CIFAR-10 datasets. The accuracy on MNIST was 99.3% with
Fig. 11. Macro-average precision on CIFAR-10 after conversion and addi-
tional training.
Fig. 12. Macro-average recall on CIFAR-10 after conversion and additional
training.
standard convolutional layers and 99.1% with morphological
ones. The accuracy on CIFAR-10 was 85.3% with standard
convolutional layers and 85.1% with 16 BM-convolutional
ones.
Since FPGA implementation of our BM networks requires
about 2.1-2.9 times fewer gates and gets 15-30% lower latency
for large enough layers, our results show that it is possible to
create a neural processing unit with lower power consumption,
higher inference speed and reasonable accuracy compared to
units for standard networks.
Our experimental setup for training can be found at: https:
//github.com/SmartEngines/bipolar-morphological-resnet
ACKNOWLEDGMENT
The authors would like to thank A.B. Merkov, PhD, for his
helpful advice and comments. This work is partially supported
by Russian Foundation for Basic Research (projects 17-29-
03240, 18-07-01384).
REFERENCES
[1] E. Limonova, D. Matveev, D. Nikolaev, and V. V. Arlazarov, “Bipolar
morphological neural networks: convolution without multiplication,”
in Twelfth International Conference on Machine Vision (ICMV), vol.
11433, International Society for Optics and Photonics. SPIE, 2020,
pp. 962 – 969. [Online]. Available: https://doi.org/10.1117/12.2559299
[2] Y. S. Chernyshova, A. V. Sheshkus, and V. V. Arlazarov, “Two-step cnn
framework for text line recognition in camera-captured images,” IEEE
Access, vol. 8, pp. 32 587–32 600, 2020.
[3] M.-J. Chang, J.-T. Hsieh, C.-Y. Fang, and S.-W. Chen, “A vision-based
human action recognition system for moving cameras through deep
learning,” in Proceedings of the 2019 2nd International Conference on
Signal Processing and Machine Learning (SPML), 2019, pp. 85–91.
[4] Z. Zhang and H. Peng, “Deeper and wider Siamese networks for
real-time visual tracking,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4591–
4600.
[5] X. Li, F. Yin, and C. Liu, “Page object detection from PDF document
images by deep structured prediction and supervised clustering,” in 2018
24th International Conference on Pattern Recognition (ICPR), 2018, pp.
3627–3632.
[6] P. V. Bezmaternykh, D. A. Ilin, and D. P. Nikolaev, “U-Net-bin: hacking
the document image binarization contest,” Computer optics, vol. 43,
no. 5, pp. 825–832, 2019, dOI: 10.18287/2412-6179-2019-43- 5-825-
832.
[7] A. Bokovoy, “Automatic control systems architecture for group of small
unmanned aerial vehicles,” ITiVS, no. 1, pp. 68–77, 2018.
[8] Y. V. Vizilter, V. S. Gorbatcevich, and S. Y. Zheltov, “Structure-
functional analysis and synthesis of deep convolutional neural networks,”
Computer Optics, vol. 43, no. 5, pp. 886–900, 2019.
[9] J. Kim, O. Sangjun, Y. Kim, and M. Lee, “Convolutional neural network
with biologically inspired retinal structure,” Procedia Computer Science,
vol. 88, pp. 145–154, 2016.
[10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition (CVPR), 2016, pp. 770–778.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep
residual networks,” in European conference on computer vision (ECCV).
Springer, 2016, pp. 630–645.
[12] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting
linear structure within convolutional networks for efficient evaluation,” in
Advances in Neural Information Processing Systems 27, Z. Ghahramani,
M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, Eds. Curran
Associates, Inc., 2014, pp. 1269–1277.
[13] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional
neural networks with low rank expansions,” in Proceedings of the British
Machine Vision Conference (BMVC). BMVA Press, 2014.
[14] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua, “Learning separable
filters,” in 2013 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2013, pp. 2754–2761.
[15] F. Chollet, “Xception: Deep learning with depthwise separable convo-
lutions,” in 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017, pp. 1800–1807.
[16] E. Limonova, A. Sheshkus, A. Ivanova, and D. Nikolaev, “Convolutional
neural network structure transformations for complexity reduction and
speed improvement,” Pattern Recognition and Image Analysis, vol. 28,
no. 1, pp. 24–33, 2018.
[17] S. Scheidegger, Y. Yu, and T. McKelvey, “Separable convolutional eigen-
filters (SCEF): Building efficient CNNs using redundancy analysis,”
arXiv preprint arXiv: abs/1910.09359, 2019.
[18] X. Wang, M. Kan, S. Shan, and X. Chen, “Fully learnable group
convolution for acceleration of deep neural networks,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2019, pp. 9049–9058.
[19] K. Guo, X. Xie, X. Xu, and X. Xing, “Compressing by learning in a
low-rank and sparse decomposition form,” IEEE Access, vol. PP, pp.
1–1, 10 2019.
[20] Y. Yao, B. Dong, Y. Li, W. Yang, and H. Zhu, “Efficient implemen-
tation of convolutional neural networks with end to end integer-only
dataflow,” in 2019 IEEE International Conference on Multimedia and
Expo (ICME), July 2019, pp. 1780–1785.
[21] Y. Choukroun, E. Kravchik, and P. Kisilev, “Low-bit quantization of neu-
ral networks for efficient inference,” arXiv preprint arXiv:1902.06822,
2019.
[22] M. Pietron, R. Casas, M. Wielgosz et al., “Methodologies of com-
pressing a stable performance convolutional neural networks in image
classification,” Neural Processing Letters, pp. 1–23, 2019.
[23] Q. Sun, F. Shang, K. Yang, X. Li, Y. Ren, and L. Jiao, “Multi-precision
quantized neural networks via encoding decomposition of {-1,+ 1},” in
Proceedings of the AAAI Conference on Artificial Intelligence (AAAI),
vol. 33, 2019, pp. 5024–5032.
[24] D. A. Ilin, E. E. Limonova, V. V. Arlazarov, and D. P. Nikolaev, “Fast
integer approximations in convolutional neural networks using layer-by-
layer training,” in ICMV 2016, D. P. N. W. Z. J. Z. Antanas Verikas,
Petia Radeva, Ed., vol. 10341, no. 103410Q. Bellingham, Washington
98227-0010 USA: SPIE, July 2017, pp. 1–5, dOI: 10.1117/12.2268722.
[25] L. Mocerino and A. Calimera, “TentacleNet: A pseudo-ensemble tem-
plate for accurate binary convolutional neural networks,” arXiv preprint
arXiv:1912.10103, 2019.
[26] J. Li, Y. Wang, B. Liu, Y. Han, and X. Li, “Simulate-the-hardware:
Training accurate binarized neural networks for low-precision neural
accelerators,” in Proceedings of the 24th Asia and South Pacific Design
Automation Conference (ASP-DAC), ser. ASPDAC 19. New York,
NY, USA: Association for Computing Machinery, 2019, p. 323328.
[Online]. Available: https://doi.org/10.1145/3287624.3287628
[27] E. E. Limonova, M. I. Neyman-Zade, and V. L. Arlazarov, “Special
aspects of matrix operation implementations for low-precision neural
network model on the elbrus platform,” Bulletin of the South Ural State
University. Ser. Mathematical Modelling, Programming & Computer
Software, pp. 1–12, 2020.
[28] R. Morcel, H. Hajj, M. A. R. Saghir, H. Akkary, H. Artail, R. Khanna,
and A. Keshavamurthy, “FeatherNet: An accelerated convolutional
neural network design for resource-constrained FPGAs,” ACM Trans.
Reconfigurable Technol. Syst., vol. 12, no. 2, 2019. [Online]. Available:
https://doi.org/10.1145/3306202
[29] G. Ritter and P. Sussner, “An introduction to morphological neural
networks,” Proceedings of 13th International Conference on Pattern
Recognition (ICPR), vol. 4, pp. 709–717 vol.4, 1996.
[30] G. X. Ritter, L. Iancu, and G. Urcid, “Morphological perceptrons with
dendritic structure,” in The 12th IEEE International Conference on Fuzzy
Systems (FUZZ), 2003. FUZZ ’03., vol. 2, May 2003, pp. 1296–1301
vol.2.
[31] M. Elhoushi, F. Shafiq, Y. Tian, J. Y. Li, and Z. Chen,
“DeepShift: Towards multiplication-less neural networks,” arXiv
preprint arXiv:1905.13298, 2019.
[32] H. Chen, Y. Wang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu, “AdderNet:
Do we really need multiplications in deep learning?” arXiv preprint
arXiv:1912.13200, 2019.
[33] A. Fog, “Instruction tables: Lists of instruction latencies, throughputs
and micro-operation breakdowns for intel, AMD and via CPUs,” Techni-
cal University of Denmark, http://www.agner.org/optimize/ instruction
tables.pdf , 2017.
[34] Cortex-A57 Software Optimization Guide, http:// infocenter.arm.
com/help/ topic/com.arm.doc.uan0015b/Cortex A57 Software
Optimization Guide external.pdf .
[35] Reference Implementations for Intel Architecture Approximation
Instructions VRCP14, VRSQRT14, VRCP28, VRSQRT28, and VEXP2,
https:// software.intel.com/en-us/articles/reference-implementations-
for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-
vexp2.
[36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[37] THE MNIST DATABASE of handwritten digits, http://yann.lecun.com/
exdb/mnist/ .
[38] CIFAR-10 dataset, https://www.cs.toronto.edu/∼kriz/cifar.html.
