Realization of NumPy Tensordot using the field programmable gate array for embedded machine learning applications by Grout, Ian
 
Realization of NumPy Tensordot using the Field 
Programmable Gate Array for Embedded Machine 
Learning Applications 
 
Ian Grout  
Department of Electronic and Computer Engineering 
University of Limerick 
Limerick, Ireland 
Ian.Grout@ul.ie 
Abstract—Today, Machine Learning (ML) and Deep 
Learning (DL) functions are embedded into electronic systems 
enabling the inclusion of levels of system “intelligence” that 
otherwise could not be included using non-ML/DL approaches 
due to design considerations such as the required data 
processing times. Underlying the ML and DL operations are the 
necessary processing requirements, data storage (memory) and 
data structures (the format of the data). In addition, the manner 
in which the data is processed can be software based, hardware 
based, or a combination of software and hardware operations. 
In this paper, the Field Programmable Gate Array (FPGA) is 
considered as the platform for an implementation of NumPy 
Tensordot, the Python function that computes the tensor dot 
product along specified axes for arrays of one or more 
dimensions. The functionality is implemented in software on an 
embedded Xilinx MicroBlaze processor targeting the Xilinx 
Artix-7 FPGA.  
Keywords—tensor, Tensordot, FPGA, embedded systems 
I. INTRODUCTION 
As the interest in, and ability to implement, advanced 
Machine Learning (ML) and Deep Learning (DL) [1] 
functions into everyday computing increases, the applications 
and possible physical implementations (in software and 
hardware) increase. Applications can range from large-scale 
operations (big data) through to smaller, mobile, embedded 
system applications. The rapid expansion of IoT (Internet of 
Things) [2] services, such as wireless sensor systems, means 
that there is a need to design efficient hardware and software 
solutions to maximize device functionality whilst minimizing 
cost, size and power consumption. Most solutions developed 
today are based on software programs/scripts running on a 
microprocessor based system. Such systems incorporate high 
processor speeds and a large amount of available memory, 
with memory ranging from on-processor Cache through to 
external volatile and non-volatile memory. 
An increasingly important design aspect is in the selection 
of the right mix of technologies to implement data processing, 
data storage (memory) and data acquisition. In many solutions 
today, a processor based approach is used where a suitably 
written software program/script implements the required 
operations on a predefined hardware architecture. However, 
where this approach cannot fulfil the required functionality, a 
custom hardware or hardware/software co-design approach 
must be considered. In such an approach, the designer can 
select which operations are performed in software and which 
operations are performed in hardware. In addition, this leads 
to the ability to implement sequential and concurrent 
operations to reduce the overall processing time. For advanced 
hardware architectures, either the Field Programmable Gate 
Array (FPGA) or Application Specific Integrated Circuit 
(ASIC) approach is used. The FPGA is particularly interesting 
as it provides for a lower cost entry for design prototyping as 
well as for final application than when considering an ASIC 
approach. This means that it can be cost-effective to use 
multiple FPGAs as co-processing elements within an 
application as depicted in Fig. 1. 
 
Fig. 1. FPGA based co-processing 
In this paper, a custom hardware processor based design 
using the Xilinx MicroBlaze processor and targeting the 
Xilinx Artix-7 FPGA with the computations implemented in 
software is presented. The design aims to implement, within 
an FPGA, the different tensor products [3] that are supported 
by NumPy Tensordot [4]. It supports the inclusion of 
co-processing capabilities using an FPGA connected to a main 
processor. 
Within an FPGA, a digital circuit/system can be implemented 
in hardware only, in software only, or as a mixture of hardware 
and software operations: 
• Custom hardware can be created using a 
Hardware Description Language (HDL) design 
description and can be developed using either 
VHDL [5] or Verilog HDL [6]. 
• A software program can be run on an embedded 
processor, where the processor is either a soft 
core (available as HDL code) or a hard core (a 
physical processor core fabricated within the 
FPGA). 
This paper is structured as follows. Section II will 
introduce NumPy Tensordot and give an example use that will 
be mapped to the FPGA using an embedded processor. 
Section III will introduce the MicroBlaze processor 
implementation within the Xilinx Artix-7 FPGA. Section IV 
will present and discuss the system operation and Section V 
will present the conclusions. 
II. TENSORDOT FUNCTION IMPLEMENTATION 
CONSIDERATIONS 
In ML and DL applications, the key considerations for 
implementing the design requirements considered here are the 
capabilities of the underlying hardware and software for: 
1. Data processing. 
2. Data storage and retrieval. 
3. Data structures (data format). 
Considering the data structures, values such as sensor data 
samples in an embedded sensor system can be stored in 
different formats, ranging from single scalar values through 
to complex, multi-dimensional arrays. Computations on such 
data structures are handled well when the data structures are 
treated as tensors. Tensors are thought of here as essentially 
multi-dimensional arrays, and tensor computations implement 
mathematical operations on multi-dimensional arrays using 
tensor calculus. Hence, in a sensor 
system, the sampled data can be stored in different formats 
that suit the available memory (size and arrangement) and the 
data processing requirements. Tensors are realized in a 
programming/scripting language as scalar values or multi-
value arrays (one dimensional arrays or higher dimensions). 
The computations can be formed as user created 
functions/methods in a software programming/scripting 
language and many languages, such as MATLAB [7], R [8] 
and Python [9], provide library/module support for this type 
of arithmetic. Python is now the language of choice in many 
engineering and scientific applications and tensor 
computation is supported in modules that include NumPy [10] 
and Google TensorFlow [11]. NumPy and TensorFlow 
support their own implementation of a function called 
Tensordot [4, 12]. Considering NumPy, Tensordot takes two 
input tensors, a and b, and an axes argument that is either an 
integer or an array-like object containing two array-like 
objects. The computation is the sum of the products of a’s and 
b’s elements over the axes specified by axes. As an example, 
consider the following Python script that creates two input 
arrays (two-dimensional arrays as 4x4 matrices), a and b, and 
runs Tensordot with axes set to the scalar values 0, 1 and 2. 
The shape, number of dimensions and array values are then 
printed to the screen, or other standard output device. This is 
the basic use, with axes set to an integer, but axes can also 
be given as arrays of integers: 
 =  1 2 3 42 3 4 53 4 5 64 5 6 7 
 =  5 6 7 86 7 8 90 1 2 34 3 2 1 
 
import numpy as np 
 
a = np.array( [ [ 1, 2, 3, 4 ], 
                [ 2, 3, 4, 5 ], 
                [ 3, 4, 5, 6 ], 
                [ 4, 5, 6, 7 ] ] ) 
 
b = np.array( [ [ 5, 6, 7, 8 ], 
                [ 6, 7, 8, 9 ], 
                [ 0, 1, 2, 3 ], 
                [ 4, 3, 2, 1 ] ] ) 
c1 = np.tensordot( a, b, 0 ) 
c2 = np.tensordot( a, b, 1 ) 
c3 = np.tensordot( a, b, 2 ) 
 
print( 'a  shape >> ' + str(a.shape) ) 
print( 'b  shape >> ' + str(b.shape) ) 
print( 'c1 shape >> ' + str(c1.shape) ) 
print( 'c2 shape >> ' + str(c2.shape) ) 
print( 'c3 shape >> ' + str(c3.shape) ) 
print( 'a  dims. >> ' + str(a.ndim) ) 
print( 'b  dims. >> ' + str(b.ndim) ) 
print( 'c1 dims. >> ' + str(c1.ndim) ) 
print( 'c2 dims. >> ' + str(c2.ndim) ) 
print( 'c3 dims. >> ' + str(c3.ndim) ) 
 
print( a ) 
print( b ) 
print( c1 ) 
print( c2 ) 
print( c3 ) 
 
The shape (shape) and dimensions (ndim) of the arrays can 
be seen for each of the axes defined:  
a  shape >> (4, 4) 
b  shape >> (4, 4) 
c1 shape >> (4, 4, 4, 4) 
c2 shape >> (4, 4) 
c3 shape >> () 
a  dims. >> 2 
b  dims. >> 2 
c1 dims. >> 4 
c2 dims. >> 2 
c3 dims. >> 0 
 
This shows that the results from running the Tensordot 
computation differ depending on the axes setting, and the user 
can select the value of axes to suit their computation 
needs: 
• axes = 0: Tensor product: a ⊗ b 
• axes = 1: Tensor dot product: a · b 
• axes = 2: Tensor double contraction: a : b 
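The three settings can be cross-checked against other NumPy operations. The following is a minimal sketch that reuses the arrays a and b from the script above. The comparison operations (np.multiply.outer, the @ operator and np.sum) and the explicit axes pairs are illustrative choices and are not taken from the case study design.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [2, 3, 4, 5],
              [3, 4, 5, 6],
              [4, 5, 6, 7]])
b = np.array([[5, 6, 7, 8],
              [6, 7, 8, 9],
              [0, 1, 2, 3],
              [4, 3, 2, 1]])

# axes = 0: no axes are summed, so every element of a multiplies every element of b.
c1 = np.tensordot(a, b, 0)
same_as_outer = np.array_equal(c1, np.multiply.outer(a, b))

# axes = 1: the last axis of a is summed against the first axis of b (matrix product).
c2 = np.tensordot(a, b, 1)
same_as_matmul = np.array_equal(c2, a @ b)
same_as_pair_1 = np.array_equal(c2, np.tensordot(a, b, axes=([1], [0])))

# axes = 2: the last two axes of a are summed against the first two axes of b.
c3 = np.tensordot(a, b, 2)
same_as_sum = bool(c3 == np.sum(a * b))
same_as_pair_2 = np.array_equal(c3, np.tensordot(a, b, axes=([0, 1], [0, 1])))

print(same_as_outer, same_as_matmul, same_as_pair_1, same_as_sum, same_as_pair_2)
```

Passing axes as a pair of lists names the contracted axes explicitly, which is the form needed when the axes to be summed are not the trailing axes of a and the leading axes of b.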
For this example on two 4x4 input matrices, when axes = 
0, the tensor product (c1 = a ⊗ b) of the two matrices is 
computed, resulting in a 4x4x4x4 array (c1) whose elements 
are those of the Kronecker Product [13], which is usually 
written as a 16x16 matrix. For axes = 1, the result 
(c2 = a · b) is a 2-D matrix (4x4 array) 
that implements matrix multiplication producing: 
2 =  · ) = 33 35 37 3948 52 56 6063 69 75 8178 86 94 102 
 
For axes = 2, the result (c3 = a : b) is a scalar number 
(262). For the above PC/laptop based software 
implementation, the user does not need to consider how the 
arrays are actually created, where they are stored in memory, 
or how much memory they require. A typical PC/laptop has 
access to microprocessor cache memory, GBytes of 
external Random Access Memory (RAM) and Hard Disk 
Drive (HDD) storage. However, for an embedded 
application, the memory requirements need careful 
consideration, particularly when the amount of available 
memory is limited. This needs to account for the 
amount of data to be stored as well as the memory (number of 
bytes) required for the different data types. 
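To illustrate the memory consideration, the storage needed for the example arrays can be estimated from the element count and the element size. The sketch below assumes three candidate element types (8-bit integer, 32-bit integer and 64-bit floating point). These widths are assumptions made for illustration, and the helper name total_bytes is hypothetical.

```python
import numpy as np

# Shapes of the two inputs and the three results from the example above.
shapes = {"a": (4, 4), "b": (4, 4), "c1": (4, 4, 4, 4), "c2": (4, 4), "c3": ()}

def total_bytes(dtype):
    """Bytes needed to hold a, b and all three result arrays for one element type."""
    itemsize = np.dtype(dtype).itemsize
    # np.prod(()) evaluates to 1, so the scalar result c3 counts as one element.
    return sum(int(np.prod(shape)) * itemsize for shape in shapes.values())

for dtype in (np.int8, np.int32, np.float64):
    print(np.dtype(dtype).name, total_bytes(dtype), "bytes")
```

Of the 305 elements in total, 256 belong to the axes = 0 result, so the choice of which results to keep resident matters more than the size of the inputs.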
III. MICROBLAZE IMPLEMENTATION 
The system developed and presented in this paper is based 
on the Artix-7 FPGA configured with a MicroBlaze soft 
processor core (version 11.0) and interfacing with an external 
system using a Universal Asynchronous Receiver Transmitter 
(UART). The connection to the external system (PC/laptop) 
was implemented using a USB (Universal Serial Bus) 
interface. For development and demonstration purposes, the 
slow speed of the serial communications was not considered 
an issue: a high-speed scenario would use a higher speed 
communications bus arrangement, whereas the USB/UART 
connection allowed for ease of programming, 
debugging and runtime instruction/data communications. A 
desktop PC/laptop interface developed in Python allowed a 
user to control the computation parameters, to upload the input 
arrays (a and b), to run the FPGA based computation and to 
retrieve the computation results. The basic principle of 
operation is shown in Fig. 2.  
 
Fig. 2. System operation overview 
The FPGA based system design is based on an 
implementation of the MicroBlaze 32-bit RISC (Reduced 
Instruction Set Computer) architecture soft processor core 
with peripherals as shown in Fig. 2 connected to the processor 
via the AXI (Advanced eXtensible Interface) bus. The 
processor operates on a master clock frequency of 100 MHz. 
The top level block diagram for the design developed in Xilinx 
Vivado (version 2018.3.1) [14] is shown in Fig. 3. 
 
Fig. 3. MicroBlaze processor system block design in Xilinx Vivado 
Hence, the design is a custom hardware architecture based 
on the MicroBlaze processor with the array computation 
functionality implemented in software on the processor. A 
user can upload instructions and array data to the FPGA and 
download array data and runtime statistics from the FPGA. A 
simple communications protocol enables the instructions to 
control the system operation. The MicroBlaze uses an instance 
of local memory (32K x 32-bits) for holding program code and 
data. For this hardware arrangement with the Artix-7 FPGA, 
only a small proportion of the available hardware resources 
was used. Fig. 4 shows the utilization report identifying the 
available and utilized device resources. 
 
Fig. 4. MicroBlaze processor device utilization (Xilinx Vivado report) 
Xilinx Vivado was used to create the hardware design and 
FPGA configuration bitstream and the design was exported to 
Xilinx SDK (Software Development Kit) [15], version 
2018.3, for C program development and device configuration 
(hardware) and programming (software). 
IV. SYSTEM OPERATION 
A user can communicate with the FPGA using a PC/laptop 
COM port with the FPGA connected using a USB cable. For 
ease of development, the Digilent Arty-35T [16] Artix-7 
FPGA Development Board was used. 
The user can upload instructions and data, and retrieve 
data and runtime statistics. Table I shows the program core 
instructions and their meaning. The arrays considered for 
initial development purposes were 1-D, 2-D and 3-D arrays 
which are Rank 1, 2 and 3 tensors respectively. The arrays 
were considered to hold integer type data and the array sizes 
were initially considered for demonstration purposes as 1x4 
(1-D vectors), 4x4 (2-D matrices) and 4x4x4 (3-D cubes). 
However, it would be common to consider significantly larger 
arrays, and to consider floating point numbers in addition to 
integers. This would introduce additional design 
complexity (hardware and software) in terms of: 
1. Computation time. 
2. Memory requirements, with the use of external 
memory connected to the FPGA. 
However, the underlying approach would remain the same 
and the additional design complexity would determine where 
data is stored, what external memory devices are required and 




















TABLE I.  CASE STUDY DESIGN: SUPPORTED INSTRUCTIONS 
Instruction No. | Instruction | Meaning 
0 | Initialize | Initialize the contents of the input and results arrays in the FPGA to default values 
1 | Parameter | Set the FPGA Tensordot computation parameters (instruction followed by parameter data) 
2 | Upload | Upload new contents to the input arrays (instruction followed by the input array data) 
3 | Run | Run the FPGA Tensordot computation 
4 | Read | Download array contents from the FPGA (instruction followed by array data read) 
5 | Statistics | Obtain runtime statistics on the arrays and computation operations (instruction followed by data read) 
 
The interface is intended to be used as follows: 
1. The main process running on the PC/laptop decides to 
offload a tensor computation to the connected 
FPGA. 
2. The arrays are initialized to their default values. 
3. The computation parameters are uploaded to the 
FPGA. 
4. New input array (a and b) data are uploaded to the 
FPGA. 
5. The computation is run. 
6. The results array data are downloaded for use and 
analysis. 
7. Computation run statistics (array shapes, dimensions 
and computation time) are downloaded for analysis. 
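The sequence above can be sketched on the host side as follows. The paper does not give the wire format, so everything about the encoding here is hypothetical: the framing (a one-byte instruction number followed by little-endian 32-bit integers), the names Instruction, frame and build_session, and the choice to send the axes value as a single integer. The sketch only builds the byte strings. Sending them over the COM port and reading the replies is left out.

```python
import struct
from enum import IntEnum

class Instruction(IntEnum):
    """Instruction numbers from Table I."""
    INITIALIZE = 0
    PARAMETER = 1
    UPLOAD = 2
    RUN = 3
    READ = 4
    STATISTICS = 5

def frame(instruction, values=()):
    """Hypothetical framing: one opcode byte, then each value as a little-endian int32."""
    return struct.pack("<B", instruction) + struct.pack(f"<{len(values)}i", *values)

def build_session(axes, a, b):
    """Return the frames for steps 2 to 7, in the order they would be sent."""
    flat = [x for row in a for x in row] + [x for row in b for x in row]
    return [
        frame(Instruction.INITIALIZE),
        frame(Instruction.PARAMETER, [axes]),
        frame(Instruction.UPLOAD, flat),
        frame(Instruction.RUN),
        frame(Instruction.READ),
        frame(Instruction.STATISTICS),
    ]

a = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]]
b = [[5, 6, 7, 8], [6, 7, 8, 9], [0, 1, 2, 3], [4, 3, 2, 1]]
frames = build_session(1, a, b)
print([len(f) for f in frames])
```

Keeping the instruction numbers in one enumeration means the host script and Table I can be compared line by line if the instruction set is extended.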
In this manner, the FPGA acts as an attached processor to 
the main (PC/laptop) processor and the FPGA can run 
computations in parallel (concurrently) with the main 
PC/laptop processor. This provides additional processing 
power and flexibility in how computations can be performed 
based on different requirements. In addition, the approach can 
be adapted to different hardware/software configurations. The 
computations are implemented in software running on the 
MicroBlaze and this gives the additional potential to vary the 
system architecture (see Fig. 3) for different requirements as 
well as developing code to access and process array data using 
different algorithms. For example, for axes = 0 (for two 4x4 
matrix inputs only), a basic C source code would resemble the 
following: 
  /* axes = 0: every element of a multiplies every element of b */
  if ( parameter == 0 ) 
  { 
    for ( i = 0; i < 4; i++ ) 
    { 
      for ( j = 0; j < 4; j++ ) 
      { 
        for ( k = 0; k < 4; k++ ) 
        { 
          for ( m = 0; m < 4; m++ ) 
          { 
            c1[i][j][k][m] = a[i][j] * b[k][m]; 
          } 
        } 
      } 
    } 
  } 
This implements the algorithm using for loops on the input 
arrays a[4][4] and b[4][4] to produce the result array 
c1[4][4][4][4]. In this software implementation, the 
computations (multiplications and additions) are performed 
sequentially. It is also possible to implement the 
functionality in hardware rather than software, which would 
allow the results to be obtained in a reduced time through the 
use of concurrent rather than sequential operations. This 
would then rely on the availability of suitable hardware 
within the FPGA and external to the FPGA. 
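For comparison, the loop structure for the other two axes settings can be sketched in the same style. The sketch below is written in Python with nested lists rather than C, purely to show the loop nests. It is an illustration and not the code running on the MicroBlaze.

```python
a = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]]
b = [[5, 6, 7, 8], [6, 7, 8, 9], [0, 1, 2, 3], [4, 3, 2, 1]]
N = 4

# axes = 1: sum over the shared index k (matrix multiplication).
c2 = [[0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        for k in range(N):
            c2[i][j] += a[i][k] * b[k][j]

# axes = 2: sum over both indices (double contraction), giving a scalar.
c3 = 0
for i in range(N):
    for j in range(N):
        c3 += a[i][j] * b[i][j]

print(c2)
print(c3)
```

Each pass through an innermost loop performs one multiplication, so for 4x4 inputs the three settings need 256 (axes = 0), 64 (axes = 1) and 16 (axes = 2) multiplications respectively. The axes = 0 case has no accumulation, so its 256 products are independent of each other.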
V. CONCLUSIONS 
This paper has presented and discussed an FPGA based 
solution acting as a co-processor to implement functions of 
NumPy Tensordot within an embedded processor. The paper 
discussed the rationale for the work along with architecture 
and operation of a case study design. The system allows for 
different tensor computations to be performed on two input 
arrays. Such co-processing systems allow for parallel 
(concurrent) processing operations to be implemented. 
REFERENCES 
[1] Ahmad Shawahna, Sadiq M. Sait and Aiman El-Maleh, "FPGA-Based 
Accelerators of Deep Learning Networks for Learning and 
Classification: A Review", IEEE Access, pp 7823 - 7859, 28th 
December 2018 
[2] The Institute of Electrical and Electronics Engineers (IEEE), IEEE 
Internet of Things Journal. Internet: http://ieee-iotj.org/ [9th November 
2019] 
[3] Daniel Fleisch, A Student's Guide to Vectors and Tensors (Student's 
Guides) 1st Edition, Cambridge University Press, 14th November 2011, 
ISBN-10: 0521171903, ISBN-13: 978-0521171908 
[4] The SciPy community, numpy.tensordot. Internet: 
https://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html 
[9th November 2019] 
[5] The Institute of Electrical and Electronics Engineers (IEEE), IEEE 
1076-2008 - IEEE Standard VHDL Language Reference Manual 
[6] The Institute of Electrical and Electronics Engineers (IEEE), IEEE 
1364-2005 - IEEE Standard for Verilog Hardware Description 
Language 
[7] The Mathworks Inc., MATLAB. Internet: 
https://uk.mathworks.com/products/matlab.html [9th November 2019] 
[8] The R Foundation, The R Project for Statistical Computing. Internet: 
https://www.r-project.org/ [9th November 2019] 
[9] Python Software Foundation, Python. Internet: 
https://www.python.org/ [9th November 2019] 
[10] NumPy developers, NumPy. Internet: https://numpy.org/ [9th 
November 2019] 
[11] Google, TensorFlow. Internet: https://www.tensorflow.org [9th  
November 2019] 
[12] Google, tf.tensordot, TensorFlow Core r2.0. Internet: 
https://www.tensorflow.org/api_docs/python/tf/tensordot [9th 
November 2019] 
[13] D.S.G. Pollock, "On Kronecker Products, Tensor Products and Matrix 
Differential Calculus". Internet: 
https://www.le.ac.uk/economics/research/RePEc/lec/leecon/dp14-02.pdf 
[9th November 2019] 
[14] Xilinx, Vivado Design Suite HLx Editions. Internet: 
https://www.xilinx.com/products/design-tools/vivado.html [9th 
November 2019] 
[15] Xilinx, Xilinx Software Development Kit (XSDK). Internet: 
https://www.xilinx.com/products/design-tools/embedded-software/sdk.html 
[9th November 2019] 
[16] Digilent, Arty. Internet: 
https://reference.digilentinc.com/reference/programmable-logic/arty/start 
[9th November 2019] 
 
