This paper presents a structural design of a hardware-efficient module that implements the basic operation of a convolutional neural network (CNN) with reduced implementation complexity. For this purpose we use a modification of Winograd's minimal filtering method together with computation vectorization principles. The module calculates the inner products of two consecutive segments of the original data sequence, formed by a sliding window of length 3, with the elements of a filter impulse response. A fully parallel structure that computes these two inner products by the naïve method requires 6 binary multipliers and 4 binary adders. Winograd's minimal filtering method makes it possible to construct a module structure that requires only 4 binary multipliers and 8 binary adders. Since a high-performance convolutional neural network can contain tens or even hundreds of such modules, this reduction can have a significant effect.
Introduction
Artificial intelligence, deep learning, and neural networks are powerful and highly effective machine-learning techniques used to solve many scientific and practical problems. Applications of deep neural networks to machine learning are diverse and rapidly developing, reaching into various fields of fundamental science, technology, and real-world practice. Among the various types of deep neural networks, convolutional neural networks (CNNs) are the most widely used [1]. The basic and most time-consuming operation in a CNN is two-dimensional convolution. Several methods have been proposed to accelerate the calculation of convolution, including the reduction of arithmetic operations via the fast Fourier transform (FFT) and the use of hardware accelerators based on FPGAs, GPUs, and ASICs [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16]. The FFT-based method of computing convolution is traditionally used for large filters, but state-of-the-art CNNs use small filters. In this situation, one of the most effective algorithms for computing a small-length two-dimensional convolution is Winograd's minimal filtering algorithm, which has been used intensively in recent years [17]. The algorithm computes linear convolution over small tiles with minimal complexity, which makes it more effective with small filters and small batch sizes. In fact, this algorithm calculates two inner products of neighboring vectors, formed by a sliding time window from the current data stream, with the impulse response of a 3-tap finite impulse response (FIR) filter.
Many publications have been devoted to the implementation of computations in networks based on Winograd's minimal filtering method [17] [18] [19] [20]. However, the principles of organizing the structure of the module that implements the filtering algorithm have not yet been considered in detail. Our publication is intended to fill this gap.
Preliminaries
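As already noted, the basic operation of convolutional neural networks is a sliding inner product of vectors formed by a moving time window from the current data stream with the impulse response of an M-tap FIR filter. It can be described by the following formula:

$$y_i = \sum_{m=0}^{M-1} x_{i+m}\, w_m, \qquad i = 0, 1, \dots,$$

where $x_i$ are the samples of the data stream and $w_m$ are the filter coefficients. For $M = 3$, the direct computation of two consecutive outputs,

$$y_0 = x_0 w_0 + x_1 w_1 + x_2 w_2, \qquad y_1 = x_1 w_0 + x_2 w_1 + x_3 w_2,$$

requires 6 multiplications and 4 additions. The idea of Winograd's minimal filtering method is to compute these two filter outputs in the following way [17]:

$$y_0 = \mu_0 + \mu_1 + \mu_2, \qquad y_1 = \mu_1 - \mu_2 - \mu_3, \tag{1}$$

where

$$\mu_0 = (x_0 - x_2)\, w_0, \quad \mu_1 = (x_1 + x_2)\, \frac{w_0 + w_1 + w_2}{2}, \quad \mu_2 = (x_2 - x_1)\, \frac{w_0 - w_1 + w_2}{2}, \quad \mu_3 = (x_1 - x_3)\, w_2.$$

If the values $(w_0 + w_1 + w_2)/2$ and $(w_0 - w_1 + w_2)/2$ are calculated in advance, this method requires 4 multiplications and 8 additions, that is, 12 arithmetic operations against 10 for the direct method. However, since multiplication is a much more complicated operation than addition, Winograd's minimal filtering method is more efficient than the direct method of computation.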
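The operation counts quoted above can be checked with a minimal sketch. The symbol names (`x` for four data samples, `w` for three filter taps, `s` for the precomputed filter-side values) and the function names are assumptions made for this illustration; the formulas are the standard F(2,3) form from [17]. Exact rational arithmetic is used so that the division by 2 introduces no rounding.

```python
from fractions import Fraction


def direct_f23(x, w):
    """Direct method: 6 multiplications and 4 additions."""
    x0, x1, x2, x3 = x
    w0, w1, w2 = w
    y0 = x0 * w0 + x1 * w1 + x2 * w2
    y1 = x1 * w0 + x2 * w1 + x3 * w2
    return y0, y1


def precompute_taps(w):
    """Filter-side values, computed once before any data arrive."""
    w0, w1, w2 = (Fraction(v) for v in w)
    return w0, (w0 + w1 + w2) / 2, (w0 - w1 + w2) / 2, w2


def winograd_f23(x, s):
    """Winograd form: 4 multiplications and 8 additions per pair of outputs."""
    x0, x1, x2, x3 = x
    s0, s1, s2, s3 = s
    # Four data-side additions/subtractions.
    a0 = x0 - x2
    a1 = x1 + x2
    a2 = x2 - x1
    a3 = x1 - x3
    # Four multiplications.
    m0 = a0 * s0
    m1 = a1 * s1
    m2 = a2 * s2
    m3 = a3 * s3
    # Four output-side additions/subtractions.
    y0 = m0 + m1 + m2
    y1 = m1 - m2 - m3
    return y0, y1
```

For the same inputs both functions return the same pair of outputs; the Winograd version only redistributes the work from multipliers to adders.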
The above expressions exhaustively describe the entire set of mathematical operations needed for the computation, but they disclose neither the order and organization of the computation nor the structure of the processor module that implements these operations.
Structural synthesis of Winograd's minimal filtering module
Entries of the matrix $\operatorname{diag}(\mathbf{S}_4)$ can be computed with the help of the following procedure: [...] $\mathbf{S}$ in accordance with the procedure (3).
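As an illustration of how such a module can be organized in matrix-vector terms, the sketch below uses the transform matrices commonly given for F(2,3) in [17]. The names `G`, `BT`, `AT`, `filter_entries` and `winograd_matrix_form` are assumptions of this sketch and are not taken from the paper's own procedure; the four filter-side values `s` play the role of the diagonal entries that are stored before the data arrive.

```python
from fractions import Fraction

# Filter-side transform: maps three taps to four precomputed values (assumed name G).
G = [
    [Fraction(1), 0, 0],
    [Fraction(1, 2), Fraction(1, 2), Fraction(1, 2)],
    [Fraction(1, 2), Fraction(-1, 2), Fraction(1, 2)],
    [0, 0, Fraction(1)],
]

# Data-side transform: only additions and subtractions (assumed name BT).
BT = [
    [1, 0, -1, 0],
    [0, 1, 1, 0],
    [0, -1, 1, 0],
    [0, 1, 0, -1],
]

# Output transform: combines the four products into two outputs (assumed name AT).
AT = [
    [1, 1, 1, 0],
    [0, 1, -1, -1],
]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def filter_entries(w):
    """Diagonal entries derived from the taps; done once, offline."""
    return matvec(G, w)


def winograd_matrix_form(x, s):
    """Two filter outputs from four samples using four multiplications."""
    t = matvec(BT, x)                       # data-side pre-additions
    p = [si * ti for si, ti in zip(s, t)]   # multiplication by the diagonal matrix
    return matvec(AT, p)                    # output-side post-additions
```

Because `BT` and `AT` contain only 0 and ±1, their products reduce to adders in hardware, and only the element-wise product with `s` needs multipliers.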
In the design of low-power application-specific integrated circuits (ASICs), optimization must be done primarily at the level of transistor count. From this point of view, a multiplication requires far more hardware resources than an addition: a binary multiplier occupies much more area and consumes much more power than a binary adder. This is because the implementation complexity of a fully parallel multiplier grows quadratically with operand length, while that of an adder grows linearly. Therefore, an algorithm containing as few multiplications as possible is preferable from the point of view of ASIC design. Fig. 4 shows the structure of a processing module for an ASIC-oriented implementation of the basic operation of Winograd's minimal filtering. The module contains four two-input and two three-input algebraic adders, four multipliers, and a register memory for storing the values $s_i$. It is assumed that these values can be precomputed and written to the register memory before the calculations begin. Depending on the required speed of calculation, the modules can be cascaded and combined into clusters.

Today, a better alternative to ASICs is the field-programmable gate array (FPGA), an integrated circuit designed to be configured by a customer or a designer. Whereas early FPGAs contained only small embedded multipliers, more recent FPGAs contain DSP blocks that include not only multipliers but also internal adders, so that part of the additions in (1) can also be computed inside the DSP blocks. However, even when DSP blocks contain embedded multipliers, their number is always limited. This means that if the implemented scheme has a large number of multiplications, the designed processor may not fit into the chip, and the problem of minimizing the number of multipliers remains relevant.
This applies fully to Stratix II FPGAs, which contain DSP blocks, each of which includes just 4 multipliers as well as three adders at the block input and three adders at the block output. Such a block structure allows the hardware resources of the chip to be used with maximum efficiency. It is easy to see that a fully parallel implementation of the direct calculation, which needs 6 multipliers, does not fit within one Stratix II DSP block. Fig. 5 shows the structure of a processing module that implements the basic operation of Winograd's minimal filtering on an Altera Stratix II high-speed FPGA chip. The bulk of the computation is performed inside the DSP block, but the adders outlined by the dash-dotted line on a gray background are implemented using external logic gates. Note that not all outputs of the DSP block are used in the proposed solution (see Fig. 5). Depending on the performance required of the neural network, the number of processor modules implementing the basic operation of Winograd's filtering can be quite large.
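The trade-off between multipliers and adders discussed in this section can be illustrated with a toy cost model. It assumes only what is stated above: multiplier complexity grows quadratically and adder complexity linearly with operand length, and one DSP block offers 4 multipliers. The unit costs, the function names, and the idea of counting DSP blocks by multipliers alone are hypothetical simplifications, not measured figures.

```python
import math

MULTS_PER_DSP_BLOCK = 4  # number of multipliers per DSP block quoted in the text

DIRECT = {"multipliers": 6, "adders": 4}
WINOGRAD = {"multipliers": 4, "adders": 8}


def relative_area(ops, bits, mult_unit=1.0, add_unit=1.0):
    """Toy area estimate: multiplier ~ bits**2, adder ~ bits (unit costs are hypothetical)."""
    return ops["multipliers"] * mult_unit * bits ** 2 + ops["adders"] * add_unit * bits


def dsp_blocks_needed(ops, modules=1):
    """Lower bound on DSP blocks, counting multipliers only."""
    return math.ceil(ops["multipliers"] * modules / MULTS_PER_DSP_BLOCK)


if __name__ == "__main__":
    for bits in (2, 8, 16):
        print(bits, relative_area(DIRECT, bits), relative_area(WINOGRAD, bits))
    print(dsp_blocks_needed(DIRECT), dsp_blocks_needed(WINOGRAD))
```

Under these unit costs the two structures break even at 2-bit operands, and the Winograd structure becomes cheaper as the operand length grows; a single module needs two DSP blocks in the direct form and one in the Winograd form.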
Conclusion
This work examines some issues in the structural design of a hardware-efficient module that implements the CNN basic operation using Winograd's minimal filtering method. This method reduces the number of multipliers at the expense of an increased number of adders. Given the relative hardware complexity of a multiplier and an adder, this trade-off is desirable. The calculations demonstrate the effectiveness of the proposed solutions and their impact on the different types of CNN layers as well as on the principles of network operation.
