Evolutionary Cell Aided Design for Neural Network Architectures by Colangelo, Philip et al.
Evolutionary Cell Aided Design for Neural Network
Architectures
Philip Colangelo∗1,3, Oren Segal†2, Alexander Speicher‡2, and Martin Margala§1
1Department of Computer Engineering, University of Massachusetts Lowell
2Department of Computer Science, Hofstra University
3Intel PSG
Abstract
Mathematical theory shows that multilayer feedforward Artificial Neural Networks (ANNs) are universal
function approximators, capable of approximating any measurable function to any desired degree of
accuracy. In practice, designing practical and efficient neural network architectures requires significant
effort and expertise. We present a novel software framework called Evolutionary Cell Aided Design (ECAD)
meant to aid in the exploration and design of efficient Neural Network Architectures (NNAs) for
reconfigurable hardware. Given a general neural network structure and a set of constraints and fitness functions,
the framework explores both the space of possible NNAs and the space of possible hardware designs
using evolutionary algorithms, and attempts to find the fittest co-design solutions according to a
predefined set of goals. We test the framework on an image classification task, using the MNIST data set
of handwritten digits with an Intel Arria 10 GX 1150 device as our target platform. We design and
implement a modular and scalable 2D systolic array with enhancements for machine learning that can
be used by the framework for the hardware search space. Our results demonstrate the ability to pair
neural network design and hardware development using an evolutionary algorithm while removing
traditional human-in-the-loop development tasks. By running various experiments on the fittest solutions
for neural network and hardware searches, we demonstrate the full end-to-end capabilities of the ECAD
framework.
1 Introduction
The difficulty of designing performant NNAs has brought a recent surge of interest in the automated design of
NNAs. The focus of the existing body of research has been on optimizing NNA design for accuracy [1][2][3].
Optimizing NNAs is typically a difficult process, in part because of the vast number of hyperparameter
combinations that exist; when a combination is not optimal, performance suffers. Many deep
learning frameworks such as TensorFlow [4] and Keras [5] offer support for hyperparameter tuning, but these
are typically Bayesian optimizations that treat the ANN as a black box and are used for the neural network
training process. Research shows that the parameters of a network can directly influence the accuracy,
throughput, and energy consumption of that model in deployment [6].
Once an accurate NNA has been found, the next step is to fit it onto existing hardware, i.e., a
CPU, a GPU, or a custom-built but general-purpose neural network hardware device such as a TPU [7]. None
of these hardware solutions offers network-specific specialization. ECAD addresses the gap between the two
optimizations: it searches for an optimal hardware/NNA co-design by exploring the design space on both the
NNA side and the hardware side, and it allows a custom hardware solution to be implemented for a specific
NNA model using reconfigurable hardware.
∗philip.colangelo@intel.com
†oren.segal@hofstra.edu
‡aspeicher1@pride.hofstra.edu
§Martin Margala@uml.edu
arXiv:1903.02130v3 [cs.NE] 11 May 2019
2 Related Work
In the past several years we have seen great strides made in the performance of ML algorithms on complex
tasks using deep neural networks [8][9]. Manually designing state-of-the-art deep neural network architectures
for ML requires a significant amount of time and labor [2][10]. Automating NNA search has been an
ongoing effort for the past few decades but is becoming a focus of the NNA research community because of
the difficulty of designing deep networks, which are ever growing in complexity [2][3][10]. Automatic Artificial
Neural Network Architecture Search (NAS) can be conducted using different strategies such as random
search, evolutionary algorithms, Reinforcement Learning (RL), Bayesian optimization, and gradient-based
methods [10]. Using Evolutionary Algorithms (EAs) to search for performant architectures has been
investigated extensively [11][12] over the years. Some recent results indicate that evolutionary algorithms offer
better results than random search and reinforcement learning [3]. Recently, there has been growing interest
in NAS for deep neural networks that specialize in image recognition [1][2][13].
As deep and complex neural networks became increasingly popular and with the realization that existing
hardware architectures are not specifically optimized for such computation, new forms of specialized archi-
tectures have been proposed and designed [7] to help increase performance and energy efficiency. Designing
such static new architectures can be prohibitively expensive and risky since the field of neural computing is
evolving so rapidly. Optimizing hardware for neural networks is a research topic that is constantly evolving
and correlating with the ever-changing network structures that are being developed. Recent publications
have shown new architectures carving their niche in deep learning by offering unique methods for accelerating
the workloads of neural network applications [14][15][16]. Specifically, these reconfigurable architectures are
a popular platform for both research and deployment due to their ability to change their fabric routing and
resource structures to fit various workloads and optimizations by leveraging the resiliency of neural networks
through low-numeric precision [17][18] and sparsity [19][20]. Other accelerator designs use reconfigurable
fabrics to change the logic routing to preprocess data and offload to a specialized ASIC [21].
Multiple tool flows exist for optimizing fixed NNA designs for reconfigurable hardware (FPGAs). The
majority of available tool flows target image recognition tasks. A recent survey of available tool flows is
given in [22].
The body of work on NAS concentrates on accuracy as the main measure of performance, though optimizing
through NAS can lead to simpler NNAs that could in turn simplify and optimize hardware designs
[3][10]. On the other hand, optimizing for hardware performance parameters (latency/throughput/power)
is normally done on an existing NNA design, with no attempt to modify the NNA itself (layers, neurons,
etc.) [22].
Combining NAS and hardware optimizations could potentially close the loop between design and
implementation of NNAs [22].
To the best of our knowledge, ECAD is the first framework capable of conducting NAS and hardware
co-optimization. It is capable of working on both the NNA level (neurons, layers, etc.) and the hardware
level (LUTs, DSPs, etc.) at the same time, i.e., given a general NNA structure it will evolve and search both
spaces (NNA/hardware) in tandem.
3 ECAD Software
ECAD is intended to create an NNA that is optimized towards specific design goals. At the heart of the
software side lie an evolutionary algorithm and a vector of fitness functions. Fitness functions currently
include measurements of accuracy, speed, energy efficiency, and throughput. ECAD allows the user to select the
importance (weight) given to each of the fitness functions and, by doing so, to guide the evolutionary process
towards the required fitness goals. The result is a neural network design optimized towards the goals
specified in the fitness functions.
3.1 ECAD Software Flow
Figure 1 shows an overview of the ECAD flow. The next sections provide detail for each of the stages in the
flow.
3.1.1 The ECAD Configuration File
The process starts with a user generated description of the desired neural network. It includes a description
of the structure, constraints, fitness goals and weight of each goal. Those values are stored in the ECAD
configuration file in a textual JavaScript Object Notation (JSON) format. An example of the configuration
file can be seen in Listing 1. Lines 11 to 16 define population values such as initial and maximum population
size, mutation rate, etc. Lines 17 to 22 specify the worker types that will be used to evaluate our goals
and their parameters. Lines 33 to 90 declare cell types and their parameters, such as the range of legal values,
the mutation rate, and user-defined functions to be called upon a change event such as a mutation. Lines 99 to
109 define the hardware we wish to target. Lines 111 to 117 declare a cell array that will hold the general
structure of the network we would like to explore through the evolutionary process.
3.1.2 Population Generation
Initially the system creates a population of neural networks using the base design specified in the
configuration file. The population's initial size, maximum size, and change rate are all controlled using the
configuration parameters. Each auto-generated network instance is a mutated copy of the base design and
therefore differs from it.
As the evolutionary process progresses and once a sufficient number of networks are evaluated according
to all fitness parameters, the process of population generation will repeat itself continuously except that the
mutations will be based on the most fit individuals in the population, selected according to their fitness
scores (see steady-state model in [23]).
3.1.3 Testing Population Fitness
In the next stage each NN instance will be sent to one or more fitness evaluators or workers in ECAD
terminology. The workers are designed as independent processes and can be distributed across a cluster of
computers. They are orchestrated using a Master/Worker parallel computation model [24] running on top
of MPI [25].
Each Worker sits in a separate process, inherits from a common C++ Worker class, and implements a
common software interface. The system is built to be flexible and allow adding different types of workers
easily. The implementation details for each worker can be completely different.
We currently have three types of workers implemented:
1. Simulation Worker, capable of simulating a NN design and returning accuracy and timing results
2. Physical Worker, capable of synthesizing a NN design and returning hardware synthesis results
3. HWDB Worker, capable of accurately estimating NN synthesis results and returning estimated hardware
synthesis results
Once a worker’s fitness evaluation is complete, it returns the results to the master/server process. The
master process collects the results from all the workers that evaluated each NN and applies a combined
score to each NN. Scores are combined using a unique ID that is assigned to each generated NN instance
when it is initially created.
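The merge-by-ID step can be sketched in a few lines of Python. This is a minimal illustration, not the ECAD master process: the metric names, the weights, and the numeric values below are made up for the example, and a plain weighted sum is assumed as the way scores are combined.

```python
from collections import defaultdict

# Hypothetical goal weights: positive values reward a metric, negative values penalize it.
WEIGHTS = {"accuracy": 1.0, "latency_ms": -0.01, "dsp_fraction": -0.2}

def combine_results(worker_results, weights=WEIGHTS):
    """Merge per-worker metric dicts by network ID and return {id: weighted score}."""
    merged = defaultdict(dict)
    for nn_id, metrics in worker_results:
        merged[nn_id].update(metrics)
    return {
        nn_id: sum(weights[name] * value
                   for name, value in metrics.items() if name in weights)
        for nn_id, metrics in merged.items()
    }

# Illustrative results only: each network was evaluated by two different workers.
results = [
    (7, {"accuracy": 0.97, "latency_ms": 2.0}),  # from a simulation worker
    (7, {"dsp_fraction": 0.5}),                  # from a hardware estimator
    (9, {"accuracy": 0.95, "latency_ms": 1.0}),
    (9, {"dsp_fraction": 0.25}),
]
scores = combine_results(results)
```

With these invented numbers, network 9 outranks network 7 even though it is slightly less accurate, which is the kind of trade-off the weights are meant to express.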
3.1.4 Sorting Population by Fitness
In the next stage the ECAD system will sort the NN population according to the score each NN received.
The top NN instances will be selected for mutation.
The mutation process takes the top NN performers and mutates the cell/layer fields randomly, according
to the constraints specified in the configuration file. For example, a dense layer could specify a range for the
allowed number of neurons.
The mutated NNs are introduced to the population as new instances (Section 3.1.2). Note that if the
population size is limited in the configuration file and adding the new instances would cause an overflow, the
worst performers are removed from the population in this step as well.
We now have a new population containing a mix of old and new instances. New instances will be sent to
evaluation and the evaluation process will repeat until the end condition is met.
The end condition could be the number of generations the simulation flow is requested to run or a desired
accuracy.
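The sort, select, mutate, and trim cycle described above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: a genome is a flat dictionary of integer fields, fitness is a single scalar, and the names mutate, evolve, and toy_fitness as well as all default parameter values are hypothetical rather than taken from ECAD.

```python
import random

def mutate(genome, ranges, rate, rng):
    """Copy genome, re-drawing each field from its legal range with probability rate."""
    child = dict(genome)
    for field, (low, high) in ranges.items():
        if rng.random() < rate:
            child[field] = rng.randint(low, high)
    return child

def evolve(base, ranges, fitness, max_pop=20, top_k=5,
           generations=30, rate=0.5, target=None, seed=0):
    rng = random.Random(seed)
    # Initial population: mutated copies of the base design.
    scored = []
    for _ in range(max_pop):
        genome = mutate(base, ranges, rate, rng)
        scored.append((fitness(genome), genome))
    for _ in range(generations):
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if target is not None and scored[0][0] >= target:
            break  # end condition: desired score reached
        # Mutate the top performers; only the new instances are evaluated.
        children = [mutate(genome, ranges, rate, rng) for _, genome in scored[:top_k]]
        scored.extend((fitness(child), child) for child in children)
        # Remove the worst performers if the population overflows.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        del scored[max_pop:]
    return max(scored, key=lambda pair: pair[0])

def toy_fitness(genome):
    # Made-up goal: prefer about 128 neurons and few layers.
    return -abs(genome["neurons"] - 128) - genome["layers"]

ranges = {"neurons": (16, 512), "layers": (1, 4)}
best_score, best = evolve({"neurons": 64, "layers": 2}, ranges, toy_fitness)
```

Because the best individuals are never discarded, the top score cannot decrease from one generation to the next in this sketch.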
During the evaluation process, ECAD outputs the top networks and their scores to a custom database
(EcadDB). At the end of the simulation this database is examined, and the top networks can be extracted for
further investigation and full hardware synthesis. Note that even though it is possible to run full synthesis
during the ECAD flow we use the HW resource estimator option because of resource and time constraints.
A full hardware compile can take several hours per network instance and so it is reserved for top network
candidates that deserve further investigation.
3.2 From Abstract Network Description to Concrete Design
The process by which an abstract network description is transformed to a hardware or simulation design is
depicted in Figure 2. The software layer is built to allow the transformation to be general. For each type of
transformation we wish to make, we design a C++ object of type Actualizer (or a writer) which takes the
abstract design and produces a model that we can test in software or hardware.
For hardware design generation we opted to use high-level OpenCL over a low-level Hardware Description
Language (HDL), since it allows auto-generated designs to be more easily examined and maintained by human
engineers. Because the system is flexible, we can always move from the high level to the HDL level if
the need arises. Software-level simulation of NN designs is used for testing accuracy, verifying results, and
estimating hardware resource usage without going through the costly hardware compilation process. For
software-level accuracy and validation we use a neural network simulator.
3.3 Neural Network Simulator
The Neural Network Simulator (NeuralNetSim) is responsible for testing and verifying the ECAD neural
network architectures. We chose TensorFlow [4] as the machine learning simulation framework because it supports
both CPU and GPU training for deep neural networks and enjoys continuous support and updates in addition
to a high-level Python API. This allows for quick troubleshooting as well as easily adding new features,
such as different types of layers or activation functions, to the simulator. The NeuralNetSim consists of three
classes:
1. ECAD Reader serves as an interface to the evolutionary framework. Its main responsibility is
accepting inputs such as files and training arguments as well as returning data and reports after training
has been completed. It extracts the network info from the ECAD file and passes it to the TF Model
Builder.
2. TF Model Builder holds the actual TensorFlow graph and all other TensorFlow variables. It is respon-
sible for dynamically creating the graph based on the info passed by the reader. It loads and handles
the training and testing data. Finally it also holds the training functions which are called from the
ECAD reader.
3. TF Functions is a collection of TensorFlow API functions that are called by the TF Model Builder
when building the graph. New TensorFlow functions can be added here to expand support for other
types of networks, optimizers, activation functions, etc.
3.3.1 Neural Network Simulation Flow
The ECAD reader is a flexible script responsible for accepting an ECAD file as well as several other arguments
that affect the training of the neural network:
1. ECAD file containing the network's architecture.
2. Destination Directory used to return the report file, weights and biases.
3. Epochs dictates how many times the training set is used for training.
4. Batch size sets the number of samples that are passed into the network at a time during training.
5. SaveWB is an optional argument that tells the simulator to export the weights and biases after training.
6. verboseTF displays potential warning or deprecation messages when the graph is created.
In the first stage, the ECAD neural network architecture is converted into a TensorFlow graph to be
trained. The ECAD network consists of cells, each representing the input, a hidden layer, an activation
function, or the output, which are encapsulated in a cell array. This cell array is sequentially traversed and the
graph is dynamically generated. Whenever a new hidden layer is created, a reference to the layer's weights
and biases is kept, which can later be used to retrieve the values after training. The data type for the
neural network is currently float32, which is used for the input data, the weights, and the biases.
After the full network graph is created, the cost and optimizer are instantiated. Currently the softmax
with cross entropy function is used for the MNIST [26] classification problem, and the optimizer is
the AdamOptimizer [27], set to a learning rate of 0.001. None of these functions are final, and as the
project expands more cost functions and optimizers can be added.
The final step before training can begin is getting the training and testing data. In this case the
MNIST data is imported using tf.keras.datasets and the input data is converted to float32 numpy arrays.
All elements in the inputs are divided by 255 to normalize the values to between 0 and 1. Since ECAD
currently supports only MLP networks, the input samples are reshaped to arrays of 784 elements, which is a
flattened representation of the 28x28 images of the MNIST data set. The training and testing labels are
converted to one-hot arrays so that they can be used with TensorFlow's softmax with cross entropy
function.
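The three preprocessing steps (scaling by 255, flattening to 784 elements, one-hot encoding) can be illustrated without any dependencies. The sketch below uses plain Python lists and two synthetic images in place of the numpy arrays and MNIST data used by the simulator; the function name preprocess is hypothetical.

```python
def preprocess(images, labels, num_classes=10):
    """Flatten 28x28 images to 784 floats in [0, 1] and one-hot encode the labels."""
    flat = [[pixel / 255.0 for row in image for pixel in row] for image in images]
    one_hot = [[1.0 if k == label else 0.0 for k in range(num_classes)]
               for label in labels]
    return flat, one_hot

# Two synthetic "images" (all black, all white) stand in for MNIST samples.
images = [[[0] * 28 for _ in range(28)], [[255] * 28 for _ in range(28)]]
labels = [3, 7]
x, y = preprocess(images, labels)
```

Each row of x has 784 values between 0 and 1, and each row of y has a single 1.0 at the index of its label.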
After all related TensorFlow variables are created, a TensorFlow session is instantiated and the training
can begin. The training data is batched and fed to the network and optimizer for the defined number of
epochs. The optimizer reduces the loss calculated by the given cost function for each batch. After each
full pass of the training data, the epoch's accuracy is noted, and the final accuracy is returned to the ECAD
reader. After the training has been completed, the reader creates a report JSON file storing the network's
name/ID, accuracy, the number of epochs, the full training time, and the batch size. The file is stored in the
destination directory and is used by the evolutionary algorithm to pick the best networks for
further development.
If the saveWB argument is set, the ECAD reader will call the references to the weights and biases
and extract them as float32 arrays. These are then converted to binary files and stored in the destination
directory. For an MLP network two files per layer are created, one weights file and one biases file, and each is
named after the cell from the ECAD file for easy reference. These files can be used for testing the FPGA
designs to further check whether the actual design has benefits on a hardware level. To make the files more usable,
each file contains 4 integers at the beginning which describe the dimensions of the stored data.
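One possible layout for such a file is sketched below with Python's struct module. The text specifies only that four integers describing the dimensions precede the float32 data; the little-endian byte order, the 32-bit integer width, the padding of unused dimensions with 1, and the function names pack_tensor and unpack_tensor are all assumptions made for illustration.

```python
import struct

def pack_tensor(values, shape):
    """Serialize floats as float32 behind a header of four int32 dimensions."""
    dims = list(shape) + [1] * (4 - len(shape))  # assumption: unused dims are 1
    count = 1
    for d in dims:
        count *= d
    if len(dims) != 4 or count != len(values):
        raise ValueError("shape does not match data")
    return struct.pack("<4i", *dims) + struct.pack(f"<{count}f", *values)

def unpack_tensor(blob):
    """Read the four-integer header, then that many float32 values."""
    dims = struct.unpack_from("<4i", blob, 0)
    count = dims[0] * dims[1] * dims[2] * dims[3]
    values = struct.unpack_from(f"<{count}f", blob, 16)
    return dims, list(values)

# A 2x3 weight matrix stored row by row.
blob = pack_tensor([0.5, -1.0, 2.0, 4.0, 0.25, 8.0], (2, 3))
dims, values = unpack_tensor(blob)
```

A self-describing header of this kind lets a hardware test bench allocate buffers before reading the payload, which is presumably why the dimensions are stored first.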
Figure 1: ECAD flow.
Figure 2: ECAD description to design Transformations.
4 Hardware Design
Analysis of deep neural networks (DNNs) shows that while state-of-the-art networks are becoming more
accurate over time, it often comes at the cost of increased parameter size leading to higher computational
complexity and/or power consumption. The authors of [6] show the correlations between the number of
operations required for inference, the time taken to classify an image or batches of images, the power
consumption of various hardware, and the resulting top-1 accuracy of current top performing models. It is
evident that each neural network has a unique structure that may or may not fit in a certain system, i.e.,
there are trade-offs related to system constraints like power consumption, latency, classification accuracy, or
throughput. Benchmarks that are run to arrive at these correlations are typically executed on
instruction-set-based architectures like CPUs or GPUs. Similar instructions being called for each network allows
for a convenient baseline; however, the solution space we are interested in also includes reconfigurable hardware,
which provides a unique pipeline system capable of molding to a network's unique structure.
Field Programmable Gate Arrays (FPGAs) provide a “sea of logic” that can be programmed and
reconfigured to create unique circuits by routing together various primitive building blocks like those found
in Intel®’s Arria 10 [28] FPGA. Intel’s Arria 10 FPGA provides variable precision DSP blocks that can be
configured for either integer based or single precision floating-point multiply and add operations, 8-input
fracturable look-up table based ALMs for implementing various logic functions, and embedded M20K
memory blocks. Leveraging the flexibility of the FPGA, we can find unique solutions for all machine learning
models.
The evolutionary algorithm will guide a search for optimal FPGA hardware designs given a specific
machine learning problem and optimization goals. Accomplishing this requires the evolutionary algorithm
to have hooks into the FPGA solution space. Due to the nature of most DNN operations being vector
and matrix based, we chose a 2-dimensional systolic array of processing elements (PEs) as shown in Figure
3. The evolutionary algorithm can change hardware configuration parameters such as the array
structure's height, width, and PEs so that it may produce a new design for each permutation.
Every unique hardware configuration or permutation the evolutionary algorithm finds will be judged by
a fitness function. Fitness functions are created to give a score so that future generations converge toward
a more optimal design. Scores can be based on anything from power efficiency to throughput or even logic
utilization. Many solutions exist that would be functionally equivalent but have very different hardware.
Some solutions will be computationally faster and others slower yet use less power. The ability to search for
various levels of fitness such as performance, power, or cost allows a single design space to satisfy the needs
of different vertical markets.
Giving the evolutionary algorithm hooks into the hardware design space means providing a way for the
evolutionary algorithm to modify the hardware description. Traditional hardware design done through a
hardware description language such as Verilog allows for complete control over the hardware but does not
provide a modular, software-like paradigm that would make this process straightforward. We address this
issue by using OpenCL [29] which is a high-level, C99-based programming language that can be used with
Intel’s FPGA SDK for OpenCL [30] to target Intel FPGA devices.
4.1 2D Systolic Array Implementation
Systolic arrays [31] are a great fit for FPGAs because of their pipelined data flow and memory bandwidth
tuning capabilities. They can be scaled across one or many devices, allowing for efficient data-parallel and
model-parallel solutions where each array can be uniquely configured. They are also modular.
Each PE is designed to do a portion of the work, typically computing a partial result as a dot-product, and
can be replaced with different processing types, e.g., low-bit, dense, or sparse.
Figure 3 shows the high-level architecture of our systolic array implementation, which includes a couple of
enhancements for machine learning applications. Each module of our design is described in full in the
following sections.
Figure 3: 2D systolic array hardware architecture used in the design space exploration.
Figure 4: Matrix blocking example.
4.1.1 Matrix Blocking and DDR Storage Considerations
Due to the nature of systolic arrays (also referred to as a grid in this text), input matrices are blocked against
the spatial configuration of the architecture as shown in Figure 4. Matrix A blocks have a height that is
defined by the number of rows in the grid multiplied by an interleaving factor. The height of a matrix B
block is the same as the width of a matrix A block and is defined by the product of the vector width and the
scaling factor. The width of a matrix B block is defined as the number of columns in the grid multiplied by an
interleaving factor. Interleaving is a parameter that is adjusted to balance data reuse and ease bandwidth constraints.
The larger the interleaving factor, the more data reuse and the less bandwidth required to feed the PEs,
but as the block size grows it becomes less efficient for mapping to smaller matrix sizes. The scaling factor is a
parameter that is used to enable more efficient global memory access. We treat global memory as a 512-bit
cache line. Every trip to global memory reads in a vector width amount of data, and if the vector width is
less than 512 bits this can lead to sub-optimal bandwidth utilization. The scaling factor tunes the block width
so that each read from global memory is closer to 512 bits and more efficient. Blocks are stored contiguously
in DDR memory to ease the global memory access patterns. Because of this, we transpose matrix B on the
host so that sequential memory accesses traverse its rows.
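The block dimensions follow directly from the grid parameters, and a short Python sketch makes the arithmetic concrete. The function names are hypothetical, the parameter values in the example are arbitrary, and read_bits reflects one reading of the text, namely that a read covers the vector width times the scaling factor in 32-bit elements.

```python
import math

def block_shapes(rows, cols, vec, interleave, scale):
    """Return ((A height, A width), (B height, B width)) in matrix elements."""
    a_block = (rows * interleave, vec * scale)
    b_block = (vec * scale, cols * interleave)
    return a_block, b_block

def blocks_needed(m, k, n, rows, cols, vec, interleave, scale):
    """Blocks of A (m x k) and B (k x n), counting partially filled blocks as full."""
    (ah, aw), (bh, bw) = block_shapes(rows, cols, vec, interleave, scale)
    return (math.ceil(m / ah) * math.ceil(k / aw),
            math.ceil(k / bh) * math.ceil(n / bw))

def read_bits(vec, scale, bits_per_element=32):
    """Assumed size of one global memory read after scaling."""
    return vec * scale * bits_per_element

# Arbitrary example: a 4x4 grid, 8-wide vectors, interleave 2, scale 2.
a_block, b_block = block_shapes(4, 4, 8, 2, 2)
n_a, n_b = blocks_needed(32, 784, 100, 4, 4, 8, 2, 2)
```

In this example an 8-wide float32 vector is only 256 bits, so a scaling factor of 2 brings each read up to the 512-bit line. The padded last column of B blocks (100 is not a multiple of 8) shows the mapping inefficiency mentioned above for matrix sizes that do not divide evenly.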
4.1.2 Loaders
Depicted as A and B in Figure 3, loaders are responsible for reading blocks in from global memory and
sequencing the right data at the right time to the chain of memory modules so that each dot product
computation is doing work towards the correct output matrix block. Part of this sequencing includes notifying
the memory modules when the last block in a row or column has been sent, e.g., blocks A0,2 and
B2,0 in Figure 4. Further, the loaders need to keep data flowing long enough to drain the last output matrix blocks back
to global memory, so we incorporated a flush sequence that both re-initializes all accumulators
and caches to zero and allows enough cycles to write all output data. During this sequence, instead
of reading from global memory, the loaders simply send zeros along the memory modules.
Figure 5: Memory module internals.
4.1.3 Memory Modules
Memory modules (MMods) are a daisy chain of smart double buffers that read in the
next block of data from the loader modules into a local cache while writing the current block to the PEs.
MMods are chained along both the row and column dimensions, with the outermost module connected to a
loader. Following Figure 5, an MMod is made up of an input router whose job is to first direct a block of
input to the write select demux before switching over to sending data down to its neighbor MMod. Once
a cache such as Mem0 is full, the buff sel select will update so that new blocks are written to one memory
while the read mux takes data from the other. Both memories are arbitrated in such a way that no read or
write contention exists. All non-select lines depicted in Figure 5 have a width that supports a vector's worth
of data.
4.1.4 Processing Elements
Each PE is responsible for computing a dot-product. Peripheral PEs (PE0, PE1, PE2 as seen in Figure 3)
get one input from a memory module (MMod) and the other from a neighbor, except for the very first PE
(PE0), which receives both inputs from MMods. Inner PEs (PE3) receive both of their inputs from neighbor PEs.
Each connection to a PE (except for the OMod connections that will be covered shortly) carries a number of data
elements equal to the vectorization parameter of the array, or in other words, the width of the dot product.
Figure 6 shows the internal workings of a PE. Each multiplier and adder in our design computes on 32-bit
single-precision floating point data. The width of the dot product is depicted as n in the diagram. We chose
a reduction tree strategy to make effective use of the DSP blocks and allow for deep pipelining of the design.
PEs also include a small cache shown as shift registers (SR) in Figure 6. The size of the shift registers is
based on the interleaving factor (shown as I). Every cycle, a new vector enters the tree and is accumulated
along with a previous value that is stored in one of the shift registers. The output from this accumulation
is then routed to either the output or to the back of the shift register. The demux selector is based on a
counter that keeps track of how many partial sums have been computed, which signals when a result is ready.
Once the counter rolls over, a drain sequence begins by routing the accumulated result out and starting a
new output block sequence.
Figure 6: PE internals.
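The interplay between the dot product, the shift register, and the counter can be modeled behaviorally in Python. This is a software sketch of the accumulation pattern only: it ignores pipelining and timing, and the function name run_pe, the partial_sums parameter, and the exact counter thresholds are assumptions rather than details taken from the design.

```python
from collections import deque

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def run_pe(stream, interleave, partial_sums):
    """Model one PE. stream yields one (a_vec, b_vec) pair per cycle.

    Cycle t updates accumulator slot t % interleave. Once every slot has
    received partial_sums contributions, the results are routed out and
    the slots restart from zero for the next output block.
    """
    shift_reg = deque([0.0] * interleave)
    drained = []
    counter = 0
    for a_vec, b_vec in stream:
        acc = shift_reg.popleft() + dot(a_vec, b_vec)
        counter += 1
        if counter > interleave * (partial_sums - 1):
            drained.append(acc)     # last partial sum: route to the output
            shift_reg.append(0.0)   # next output block starts at zero
        else:
            shift_reg.append(acc)   # route to the back of the shift register
        if counter == interleave * partial_sums:
            counter = 0             # counter rolls over
    return drained

# Interleave of 2, three partial sums per result, dot product width n = 2.
stream = [([1, 2], [t, t]) for t in range(6)]
outputs = run_pe(stream, interleave=2, partial_sums=3)
```

Here slot 0 accumulates the dot products from cycles 0, 2, and 4, and slot 1 those from cycles 1, 3, and 5, so two independent output elements are computed in an interleaved fashion by the same reduction tree.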
4.1.5 Output Modules and Global Drain
Once a block of data is ready to be saved back to global memory, each PE starts the draining process
by writing the contents of its cache to its neighbor. Results are drained in rows, so each PE drains
along its column. Each PE in the first row is connected to an output module (OMod), and the OMods are connected to
each other in a daisy chain fashion (refer to Figure 3). OMods continue the draining process by propagating
the results along to a global drain whose responsibility is to prepare the data to be written back to global
memory. Data being drained arrives at the output modules in a non-contiguous way, so the global drain has
its own local cache that is used to buffer data back to DDR (see Figure 7). While this is the base design used
for all our experiments, the global drain does require additional memory resources, so when scaling designs,
the reordering of data may need to be done back on the host processor. Reordering data via global memory
addressing is not efficient, so we always write data back in a contiguous fashion. The global drain supports
some additional features specific to the traditional MLP style of compute. Both bias and activation function
support are included and can optionally be bypassed. In the case of bias skipping, we preload the bias
cache with all zeros and never read from global memory. When bias is used, enough bias data is prefetched
into a small local cache to be used on the next drain sequence. The activation function block is bypassed
simply by using the activation mux as shown in Figure 7.
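Functionally, the bias and activation stage of the global drain amounts to an optional add followed by an optional elementwise function. The Python sketch below illustrates that behavior on one row of drained results; the name drain_row is hypothetical, and ReLU is used only as an example activation since the text does not name the function that is implemented.

```python
def relu(value):
    # Example activation only; the design's actual activation is not specified here.
    return value if value > 0.0 else 0.0

def drain_row(results, bias=None, activation=None):
    """Apply optional bias and optional activation to one row of drained results."""
    # Bias skipping is modeled as a bias cache preloaded with zeros.
    bias_cache = bias if bias is not None else [0.0] * len(results)
    biased = [r + b for r, b in zip(results, bias_cache)]
    # The activation mux either applies the function or passes data through.
    if activation is None:
        return biased
    return [activation(v) for v in biased]

with_both = drain_row([1.0, -2.0], bias=[0.5, 0.5], activation=relu)
bypassed = drain_row([1.0, -2.0])
```

Modeling the skipped bias as zeros mirrors the hardware choice of keeping the adder in the data path and changing only what it is fed.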
4.1.6 OpenCL Implementation of the 2D Systolic Array
Each module described in the previous sections was coded in OpenCL as a separate kernel. Kernels were
connected using Intel’s OpenCL channels extension, which implements FIFO style buffers of variable depth
and width. Hooks into the modular design were made available via C99 based macro definitions. This means
that at hardware compile time, the macro definitions shown in Table 1 need to be defined so that the
preprocessor can populate the code according to each macro's purpose. Each macro definition affects what hardware is
generated, and so we can compile various permutations of these macros to generate unique hardware for
the FPGA.
Figure 7: Global drain internals.
Table 1: OpenCL preprocessor macro definitions
Macro        Description
SYS_ROWS     Number of rows the systolic array (grid) contains
SYS_COLS     Number of columns the grid contains
SYS_VEC      Width of the data path and dot product reduction tree
INTERLEAVE   Matrix A block height and matrix B block width
SCALE        Number of vectors in a block
Figure 8 shows the process of generating unique systolic array hardware. ECAD uses an actualizer, or
writer, to update the macro definitions from Table 1. These macros, along with the OpenCL kernel code, are
organized by the hardware worker, which runs the OpenCL compiler. The compiler goes through many steps,
which are detailed in [32], resulting in an aocx file that includes the bitstream to be programmed onto the
FPGA.
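The step from a hardware configuration to a set of preprocessor definitions can be sketched as follows. The macro names come from Table 1, written with underscores as C identifiers require; the `-DNAME=value` form is the conventional C preprocessor syntax and is an assumption about how the definitions are passed, and the search-space values and function names are invented for illustration. ECAD selects configurations through evolution rather than by exhaustive enumeration, so enumerate_configs only shows the size of the space.

```python
from itertools import product

def macro_flags(config):
    """Render one hardware configuration as C preprocessor definitions."""
    return [f"-D{name}={value}" for name, value in config.items()]

def enumerate_configs(search_space):
    """Yield every combination of the listed macro values."""
    names = list(search_space)
    for values in product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))

# Invented search space over the macros of Table 1.
space = {
    "SYS_ROWS": [2, 4],
    "SYS_COLS": [2, 4],
    "SYS_VEC": [8],
    "INTERLEAVE": [2],
    "SCALE": [1, 2],
}
configs = list(enumerate_configs(space))
first_flags = macro_flags(configs[0])
```

Even this tiny space yields eight distinct designs, each of which would require its own hardware compile, which is why a guided search and a resource estimator are attractive.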
Kernels and elements from Figure 3 were connected through Intel’s OpenCL channels extension which
implements FIFO style buffers of variable depth and width. The data path flowing through the systolic array
(not including the OMod and Global Drain connections) was designed to carry a SYS VEC width that was
accomplished through C99 style structs with a member array of SYS VEC elements. The data path for the
output modules and global drain was made to support a single element.
Both loader kernels operate in a similar way, except that the loader for matrix A includes additional code
to handle the instructions for draining and flushing. Further, having two separate loader kernels allows for
two parallel accesses to global memory. In our experiments we had only one bank of DDR memory; however,
on platforms with two or more banks, each loader kernel can be made to access its own bank. This helps
alleviate any potential issues with bank arbitration.
Two different OpenCL kernels were designed for the MMods, one for the SYS_ROWS dimension and
one for the SYS_COLS dimension. Like the loader kernels, the MMods that read in matrix A data needed
additional logic to handle the drain and flush sequences. Figure 9 shows the variable vec1 as a struct with
member variables b (bool) and v (vector). This data flows from the SYS_ROWS MMods, which populate
the bool to signal when a new set of blocks begins the next output sequence. Notice that vec2, which flows
from the MMods along the SYS_COLS dimension, only carries a vector of data d and does not contain any
additional member variables.
Following the suggested method for implementing a modular systolic array as described in the Intel
FPGA SDK for OpenCL Pro Edition: Programming Guide [32], we leveraged the num_compute_units(X, Y)
attribute with the compile-time macro definitions SYS_ROWS and SYS_COLS to effectively stamp out a grid
of processing elements. We also followed the suggestions in [32] for the dot product reduction tree shown
in Figure 9. Sum is initialized with zero or the next value from the local accumulator shift register (refer
to section 4.1.4), then runs through SYS_VEC pipelined DSP blocks, being accumulated along the way to
arrive at the final reduced output. Writing the code this way helps the compiler infer the correct Arria
10 floating-point mode DSP block [33].
Figure 8: OpenCL flow for generating hardware.
Figure 9: OpenCL reduction tree dot product with accumulator.
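The accumulation behaviour described above can be modelled in software. The Python sketch below is only a behavioural model, not the OpenCL kernel from Figure 9: it assumes a single running accumulator instead of the local accumulator shift register, and the function names `dot_accumulate` and `blocked_dot` are hypothetical.

```python
def dot_accumulate(a_vec, b_vec, carry_in=0.0):
    """Chain of multiply-adds: start from carry_in, add one product per lane."""
    if len(a_vec) != len(b_vec):
        raise ValueError("vectors must have the same length")
    total = carry_in
    for a, b in zip(a_vec, b_vec):
        total += a * b  # one pipelined DSP stage per element in hardware
    return total


def blocked_dot(a, b, sys_vec):
    """Dot product computed sys_vec elements at a time, carrying the sum forward."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0.0  # first chunk starts from zero
    for start in range(0, len(a), sys_vec):
        acc = dot_accumulate(a[start:start + sys_vec],
                             b[start:start + sys_vec],
                             carry_in=acc)
    return acc


result = blocked_dot([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8, sys_vec=4)
print(result)
```

The first chunk starts from zero and every later chunk starts from the previously accumulated value, mirroring the two initialization cases mentioned in the text.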
4.2 Hardware Model
Modeling the hardware design allows the framework to search for both constrained and unconstrained designs.
Unconstrained means that the evolutionary algorithm has less knowledge of the target hardware and has
the freedom to search a much larger design space. When constrained, the evolutionary algorithm will only
return configurations that it believes can be synthesized. Further, having a software-based model provides
a much faster means of exploration compared to running through a series of hardware compiles.
The framework uses a hardware worker object that is targeted at a specific design, such as the 2D systolic
array. Neural network description files containing information about cell parameters and connectivity are
sent to the worker, whose job is to return the fitness of that description in the context of the hardware
accelerator. Any accelerator can be used in the search if the targeted hardware model can (1) provide the
necessary hooks to the evolutionary algorithm to allow new permutations and (2) provide the required
fitness metrics.
ECAD currently uses the 2D systolic array hardware design as the sole model in the search.
Inputs to the current model are explained in section 4.1. Any model that is used in ECAD must have
inputs or “hooks” that allow the evolutionary algorithm to permute the design. After a permutation is
created, it is evaluated by the hardware worker, which then provides the following results:
• Total time (ms): the total time in milliseconds that it takes the accelerator to run the provided
network description file.
• Potential giga-operations per second (GOP/s): the maximum performance that can be expected
out of the accelerator, also known as the roofline performance.
• Effective GOP/s: actual performance of the accelerator. This number is derived by mapping the
network description to the potential performance of the accelerator.
• Images per second (img/s): useful for workloads that contain image data sets such as the MNIST
dataset used in most of our experiments.
• Latency (ms): the amount of time in milliseconds before the accelerator provides the first output.
Each result is sent back to the framework and depending on the optimization settings and various
pressures used in the search, the results are evaluated and weighted to compute the overall fitness of that
permutation.
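To make the relationship between these metrics concrete, the Python sketch below derives effective GOP/s and img/s from a total execution time. It assumes two operations per multiply-accumulate and ignores bias and activation work; that counting convention is an assumption, although it does reproduce the modeled 43.84 GOP/s reported in Table 2 for batch size 2048 to within rounding. The layer sizes are the 784-value input and the (196/190/150/10) MLP of section 5.1; the function names are hypothetical.

```python
def macs_per_sample(layer_sizes):
    """Multiply-accumulates for one sample through a dense network."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def throughput_metrics(layer_sizes, batch_size, total_time_ms, ops_per_mac=2):
    """Derive effective GOP/s and img/s from the total execution time."""
    seconds = total_time_ms / 1000.0
    total_ops = ops_per_mac * macs_per_sample(layer_sizes) * batch_size
    return {
        "effective_gops": total_ops / seconds / 1e9,
        "img_per_s": batch_size / seconds,
    }


# 784 inputs followed by the four dense layers; batch 2048 modeled at 20.64 ms.
metrics = throughput_metrics([784, 196, 190, 150, 10],
                             batch_size=2048, total_time_ms=20.64)
print(metrics)
```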
5 Experiments
In the following experiments we use the MNIST [26] dataset to test and validate the complete ECAD
framework flow. The dataset consists of 60,000 training samples and 10,000 test samples and has been used
extensively in the literature to demonstrate the ability of various neural network designs. Each sample is a
28x28 grayscale (1 channel) image of a single digit, with each pixel being an 8-bit value ranging from 0 to
255. Before a sample is input into the dense neural network, the pixel values are scaled to a range
from 0 to 1 and the sample is flattened to a single array of 784 values. To expedite the training process,
these formatted samples are passed in batches to the neural network.
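The preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the images are synthetic stand-ins rather than MNIST data, and the helper names `preprocess` and `batches` are hypothetical.

```python
def preprocess(image):
    """Scale a 28x28 image of 0..255 integers to [0, 1] and flatten to 784 values."""
    if len(image) != 28 or any(len(row) != 28 for row in image):
        raise ValueError("expected a 28x28 image")
    return [pixel / 255.0 for row in image for pixel in row]


def batches(samples, batch_size):
    """Yield consecutive groups of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Five synthetic images standing in for real samples.
images = [[[(r * 28 + c + i) % 256 for c in range(28)] for r in range(28)]
          for i in range(5)]
flat = [preprocess(img) for img in images]
grouped = list(batches(flat, 2))
print(len(flat[0]), [len(g) for g in grouped])
```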
5.1 Verifying the Hardware Model
First, we validated our hardware model to ensure that the results from our HWDBWorker were accurate.
Targeting Intel’s Arria 10 GX 1150 FPGA platform, the model considered various hardware resources including
available DDR blocks, DSPs, and embedded memories. Averaging across several hardware compiles, we
arrived at a target clock frequency of 250 MHz. The hardware model was verified by comparing
the HWDBWorker outputs to the measured results. When describing a particular configuration for the 2D
systolic array, we use the following notation: (rows, cols, vector width, interleave, scale), all of which are
defined in section 4.1.
Our goal was to validate the hardware model by running an actual workload, verifying not only performance
but also the accuracy of the results. For this, we trained a 4-layer MLP with (196/190/150/10) neurons
on the MNIST dataset and compared the resulting classification accuracy with that of the FPGA after running
all 10,000 images. The first hardware design we ran had the configuration (4, 4, 8, 8, 8). First, we created a
configuration file with the hardware specifics and ran it through the HWDBWorker to determine the modeled
performance of the accelerator. Next, we compiled the design and measured the hardware performance; both
results are presented in Table 2. The classification accuracy obtained from running the MLP in
hardware matched the accuracy reported by the simulator. The results show that our model is, on average,
95% accurate for higher batch sizes (64 and above) and 81.7% accurate across all tests. Lower batch sizes for
this design, which can be interpreted as a less efficient workload mapping, have a greater effect on model
accuracy, especially at batch size 1. The smaller the workload, the less time is spent processing inside the
DSP blocks, and performance becomes harder to predict due to various overheads that the model does not
account for. In other words, the variance we see in the measured results becomes less significant with larger
workloads. The model is always initialized with the theoretical ceiling for performance and works its way
down once it considers the workload. Two observations follow from our results: (1) the model provides the
theoretical limit for these workloads, so rather than changing the model to match the inefficiencies of the
hardware, we will instead optimize the hardware further to match the model; and (2) the model
remains consistent across permutations, so that while some absolute performance results may vary (at low
batch sizes, for example), the relationship between permutations will always remain the same.
Table 2: Modeled vs measured performance for configuration (4, 4, 8, 8, 8) running an MLP
             Modeled                                 Measured
Batch size   Effective GOP/s   Execution time (ms)   Effective GOP/s   Execution time (ms)
1            1.16              0.38                  0.65              0.679
16           18.6              0.38                  10.3              0.68
32           37.2              0.38                  20.7              0.68
64           40.3              0.7                   35.81             0.78
128          42                1.35                  39                1.43
256          42.98             2.63                  41.2              2.74
512          43.47             5.2                   41.5              5.45
1024         43.7              10.35                 42.4              10.66
2048         43.84             20.64                 43                21
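One way to read the model accuracy figures quoted in the text is as the mean ratio of measured to modeled effective GOP/s. That definition is an assumption, since the text does not state how the percentages were computed, but applying it to the Table 2 values gives roughly 81.7% over all batch sizes and roughly 95% for batch sizes of 64 and above. The sketch below uses hypothetical names.

```python
# (batch size, modeled effective GOP/s, measured effective GOP/s) from Table 2.
TABLE_2 = [
    (1, 1.16, 0.65),
    (16, 18.6, 10.3),
    (32, 37.2, 20.7),
    (64, 40.3, 35.81),
    (128, 42.0, 39.0),
    (256, 42.98, 41.2),
    (512, 43.47, 41.5),
    (1024, 43.7, 42.4),
    (2048, 43.84, 43.0),
]


def model_accuracy(rows):
    """Mean of measured/modeled ratios (assumed definition of model accuracy)."""
    ratios = [measured / modeled for _, modeled, measured in rows]
    return sum(ratios) / len(ratios)


all_tests = model_accuracy(TABLE_2)
high_batch = model_accuracy([row for row in TABLE_2 if row[0] >= 64])
print(f"all tests: {all_tests:.1%}, batch >= 64: {high_batch:.1%}")
```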
5.2 Design Space Exploration
In the following design experiments, our goal is to find an optimized hardware solution for a given network
structure that consists of an input layer, a single hidden layer, and an output layer (784/N/10). The input
is a 28x28 image and the output is the digit found in the input image.
We start by investigating the individual pressures on the design, running an evolutionary process
on one goal at a time.
5.2.1 Optimizing for Accuracy
To optimize for accuracy, we run the simulator worker and create an evolutionary pressure towards
accuracy by setting the goal to maximize accuracy. The simulator is given different network
designs, simulates each of them using TensorFlow for a given number of epochs (4), and returns the accuracy
results. Figure 10 shows accuracy vs number of neurons as the evolutionary process progresses through the
generations. As can be seen in Figure 10, after 50 generations the number of neurons rises above 700.
As the evolutionary process progresses, it finds that high accuracy is achieved when the number of neurons
lies between 748 and 1024, with the best result at 894. Some support for these results can be found in
the literature: several prior works report competitive results for a single hidden layer containing
800 neurons [26].
5.2.2 Optimizing for Images per Second
In this stage our goal is to force the evolving hardware design to output the most images per second.
This would be a common goal in hardware design exploration, especially when using reconfigurable hardware
with a low overall power envelope. By maximizing the number of images per second (img/s) we could
potentially save money and conserve energy. To create such evolutionary pressure, we use our HWDBWorker
to evaluate potential designs and return information such as the images per second (see section 4.2), which
we can select upon in the evolutionary process. As can be seen in Figure 11, to achieve an optimal number
of img/s, the evolutionary process put pressure on the number of neurons to drop. In fact, to achieve
the optimal performance, the number of neurons tends towards zero. This is because the EA found that
less computation, i.e., fewer neurons, resulted in higher throughput, and because there was no pressure to
maintain accuracy, the number continued to drop. During this particular search, the hardware configuration
started to converge to (4, 8, 8, 16, 18) while the batch size began to rise (to 1024) and the neurons remained
low. This resulted in hardware that was 100% efficient in the “batch” dimension, which is used in every layer.
Since the MNIST dataset requires an input size of 784, the EA needed to find a common dimension through
the vector width and scale parameters that both kept the total execution time of the MLP down and could
process the input efficiently. This dimension mapped to 91% efficiency. The trend of reducing the neurons did
work for the EA, but had pressure been placed on accuracy, it would have found it could utilize the entire
common dimension with neurons.
Figure 10: Optimize for accuracy - accuracy vs number of neurons
Figure 11: Optimize for img/s - images per sec vs number of neurons
Figure 12: Optimize for img/s and min accuracy - images per sec vs number of neurons
Figure 13: Optimize for img/s and min accuracy - convergence of hardware parameters from least performant
hardware architecture to most performant architecture when optimizing for img/s objective
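The two efficiency figures quoted for this search can be reproduced with a simple padding calculation. The sketch below assumes that the batch dimension is blocked in units of rows x interleave and the common dimension in units of vector width x scale; these block sizes are an interpretation of Table 1 rather than something stated explicitly, although they are consistent with Table 2, where batch sizes 1, 16, and 32 share the same modeled execution time for configuration (4, 4, 8, 8, 8). The function name is hypothetical.

```python
import math


def mapping_efficiency(dimension, block):
    """Fraction of useful work after padding dimension up to a multiple of block."""
    padded = math.ceil(dimension / block) * block
    return dimension / padded


# Converged configuration (rows, cols, vector width, interleave, scale).
rows, cols, vec, interleave, scale = 4, 8, 8, 16, 18

batch_eff = mapping_efficiency(1024, rows * interleave)  # batch of 1024, blocks of 64
common_eff = mapping_efficiency(784, vec * scale)        # 784 inputs, blocks of 144
print(f"batch: {batch_eff:.0%}, common: {common_eff:.0%}")
```

Under these assumptions the 784-value input is padded to 864, which is where the roughly 91% figure comes from, while a batch of 1024 divides evenly into blocks of 64.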
5.3 Optimizing for both Accuracy and Images per Second
Now that we understand the different conflicting pressures, we proceed to find an optimal compromise
between the accuracy and the images per second (img/s) metric. If we do not place a constraint on each of the
objectives, our evolutionary process will find some middle-ground compromise that might not be suited to
our needs. Instead, we force the evolutionary process towards a minimum accuracy goal and a maximum
number of img/s goal. Figure 12 shows the resulting evolutionary process and the best img/s found under
this constraint. The minimum accuracy goal is set to 90%. The best result (129,144 img/s) is lower than
that of the version with no accuracy constraint (362,844 img/s), but given a minimum accuracy of 90% this
result could be used in a practical design.
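A constrained, weighted fitness of this kind could be sketched as below. ECAD's actual scoring function is not given in this section, so everything here is an assumption: the field names weight, minValue, maxValue, and allowOverflow are borrowed from the evalTypes entries in Appendix A, the rule that a candidate below any minimum scores zero is one possible way to enforce the constraint, and the 400,000 img/s upper bound is a made-up value.

```python
def goal_score(value, min_value, max_value, allow_overflow=False):
    """Normalize value into [0, 1] between the goal's minimum and maximum."""
    score = (value - min_value) / (max_value - min_value)
    return score if allow_overflow else min(score, 1.0)


def fitness_sum(results, goals):
    """Weighted sum of goal scores; zero if any minimum is not met (assumed rule)."""
    total = 0.0
    for name, goal in goals.items():
        value = results[name]
        if value < goal["minValue"]:
            return 0.0  # hard constraint: candidate is rejected outright
        total += goal["weight"] * goal_score(
            value, goal["minValue"], goal["maxValue"],
            goal.get("allowOverflow", False))
    return total


goals = {
    "accuracy": {"weight": 1.0, "minValue": 0.9, "maxValue": 1.0},
    "img_per_s": {"weight": 1.0, "minValue": 0.0, "maxValue": 400000.0},  # hypothetical bound
}
accurate = fitness_sum({"accuracy": 0.94, "img_per_s": 120000.0}, goals)
fast_only = fitness_sum({"accuracy": 0.85, "img_per_s": 360000.0}, goals)
print(accurate, fast_only)
```

With this rule a fast but inaccurate candidate cannot outrank one that meets the accuracy floor, which is the behaviour the minimum accuracy goal is meant to produce.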
Trends in the hardware configurations that the EA displayed can be visualized in Figure 13. We noted
previously that to get the best accuracy, the algorithm uses more neurons, yet to get the best performance,
fewer neurons are typically desired. This forces the EA to find better results for how well the neurons
map onto the hardware. Table 3 shows the results of the top permutations for the optimization process. The
only parameters that differ are the columns, vector width, and interleaving factor. Rows
seem to have converged to 2 because the input to an MLP without batching is a vector, and the EA decided
that making this dimension small in hardware better fits the solution. Further, batching during inference
does not affect accuracy, so the hardware is better utilized in other dimensions. Scale also converges at 2,
most likely because this parameter can be thought of as quantization along the matrix common
dimension, and a smaller number may result in a better fit. Beyond efficiency mapping, scale also provides
a trade-off with global (off-chip) memory bandwidth, and because our model considers DDR memory reads
and writes, scale will generally tend to be lower to help balance the transactions. The third column, fitness
sum, gives what the EA found to be the best-fit solution given the pressures. We notice that its
results lie between those of the top permutations for accuracy and effective GOP/s. These results show how
small variations in permutations yield large variations in fitness evaluations, and in most cases, top
performers arrive at unique solutions that may never be considered if designed by hand.
Table 3: Top permutations from optimizing accuracy and img/s
                  Top permutations
                  Accuracy      Effective GOP/s   Fitness Sum
Accuracy          0.942         0.933             0.936
HW Config         2,8,16,16,2   2,16,32,32,2      2,8,32,16,2
Effective GOP/s   92.62         200.98            174
Batch size        484           572               508
Neurons           1018          980               852
5.4 Hardware Generation and Measured Results
Now that we have a list of feasible permutations with the desired optimizations, we use the ECAD hardware
flow to synthesize the design. This is achieved by exporting the neural network from the ECAD database into
a network description file that can be fed to the ECAD framework for a partial or complete hardware compile.
A simple command line accepts the exported design and starts the build process.
During search, the HWDBWorker only returns configurations it believes to be valid; however, there
are some cases where a design may not pass hardware compilation due to excess resource utilization. The
recommended next step after search completes is to run a partial compile on the top performers. Partial
compiles provide detailed information on resource utilization, but the process takes a few minutes to complete,
which is why partial compiles are done after search completes. After finding the top performer that passes
the partial compilation stage, it is sent through the full hardware flow, which includes bitstream generation
and testing in hardware. Currently, this process of weeding out designs that cannot produce hardware is
done by hand, but it could be automated as a post-search process.
Performance results from running the top permutations in Table 3 through hardware are presented in
Table 4, and their corresponding logic utilization is presented in Table 5. All the top permutations in Table
3 were valid configurations and passed the full hardware compilation stage. For convenience, the hardware
configuration notation is rows, columns, vector width, interleaving, and scale. The execution time, measured
in milliseconds, is the total time it takes to run a batch of MNIST images through the 2-layer MLP in
hardware. These results further validate the hardware model. Table 5 shows, as a percentage, how many
resources each design uses. Note that each Fmax is slightly lower than the 250 MHz we searched at.
This is due in part to compiling only a single seed; running more compiles could have resulted
in an Fmax closer to that target. The modeled execution time listed in Table 4 was scaled from 250 MHz to
the actual frequency at which the design closed.
Table 4: Hardware performance results compared to model for top permutations
Hardware Configuration   Modeled Execution Time (ms)   Measured Execution Time (ms)
2,8,16,16,2              9.59                          9.64
4,8,8,16,18              4.66                          4.65
2,8,32,16,2              4.64                          5.53
Table 5: FPGA resource utilization for top permutations
                         Hardware Compile                  Partial Compile
Hardware Configuration   Fmax (MHz)   ALM   M20K   DSP     ALM   M20K   DSP
2,8,16,16,2              220.41       37%   36%    26%     47%   38%    26%
4,8,8,16,18              228.26       38%   34%    26%     45%   37%    26%
2,8,32,16,2              212.18       50%   30%    43%     61%   55%    43%
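The frequency scaling mentioned in the text and the gap between modeled and measured times in Table 4 can be expressed with a short Python sketch. It assumes that the cycle count is unchanged, so execution time is inversely proportional to clock frequency; this is a simplification that ignores any portion of the run bound by off-chip memory. The function names and the 4.0 ms example value are hypothetical.

```python
def scale_execution_time(time_ms, from_mhz, to_mhz):
    """Rescale a modeled time from one clock frequency to another (assumed 1/f law)."""
    return time_ms * from_mhz / to_mhz


def relative_error(modeled_ms, measured_ms):
    """Absolute model error as a fraction of the measured time."""
    return abs(measured_ms - modeled_ms) / measured_ms


# Hypothetical example: 4.0 ms modeled at 250 MHz, design closes at 200 MHz.
scaled = scale_execution_time(4.0, from_mhz=250.0, to_mhz=200.0)

# (modeled ms, measured ms) pairs from Table 4.
TABLE_4 = {
    "2,8,16,16,2": (9.59, 9.64),
    "4,8,8,16,18": (4.66, 4.65),
    "2,8,32,16,2": (4.64, 5.53),
}
errors = {cfg: relative_error(*pair) for cfg, pair in TABLE_4.items()}
for cfg, err in errors.items():
    print(f"{cfg}: {err:.1%}")
```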
6 Conclusions
We present a novel software framework called Evolutionary Cell Aided Design (ECAD) that aids in the
exploration and design of efficient Neural Network Architectures (NNAs) for reconfigurable hardware. Given
a general neural network structure and a set of constraints and fitness functions, ECAD uses a
reconfigurable-hardware/NNA co-design approach to optimize designs on both the NNA and the hardware side
and attempts to find the fittest solutions according to a predefined set of goals. We discuss and demonstrate
the ability to optimize for single and multiple objectives. Performance data of the top permutations
for different optimizations, including accuracy, img/s, and both simultaneously, is presented along with the
complete end-to-end ECAD flow targeting an Intel Arria 10 GX 1150 FPGA device. Finally, comparing
the hardware performance of the top permutations against our model served to validate the results of the
evolutionary process.
7 Acknowledgments
A limited portion of this work was previously presented at The 1st International Workshop on FPGAs
for Domain Experts (FPODE-18) [34].
References
[1] Hanxiao Liu et al. “Hierarchical representations for efficient architecture search”. In: arXiv preprint
arXiv:1711.00436 (2017).
[2] Esteban Real et al. “Large-scale evolution of image classifiers”. In: Proceedings of the 34th International
Conference on Machine Learning-Volume 70. JMLR. org. 2017, pp. 2902–2911.
[3] Esteban Real et al. “Regularized evolution for image classifier architecture search”. In: arXiv preprint
arXiv:1802.01548 (2018).
[4] Martin Abadi et al. “Tensorflow: A system for large-scale machine learning”. In: 12th Symposium on
Operating Systems Design and Implementation 16. 2016, pp. 265–283.
[5] Antonio Gulli and Sujit Pal. Deep Learning with Keras. Packt Publishing Ltd, 2017.
[6] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. “An analysis of deep neural network models
for practical applications”. In: arXiv preprint arXiv:1605.07678 (2016).
[7] Norman Jouppi et al. “Motivation for and Evaluation of the First Tensor Processing Unit”. In: IEEE
Micro 38.3 (2018), pp. 10–19.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolu-
tional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105.
[9] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech. rep.
Citeseer, 2009.
[10] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. “Neural architecture search: A survey”. In:
arXiv preprint arXiv:1808.05377 (2018).
[11] Geoffrey F Miller, Peter M Todd, and Shailesh U Hegde. “Designing Neural Networks using Genetic
Algorithms.” In: ICGA. Vol. 89. 1989, pp. 379–384.
[12] Kenneth O Stanley and Risto Miikkulainen. “Evolving neural networks through augmenting topolo-
gies”. In: Evolutionary computation 10.2 (2002), pp. 99–127.
[13] Barret Zoph et al. “Learning Transferable Architectures for Scalable Image Recognition”. In: 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), pp. 8697–8710.
[14] Utku Aydonat et al. “An OpenCL(TM) Deep Learning Accelerator on Arria 10”. In: CoRR abs/1701.03534
(2017). arXiv: 1701.03534. url: http://arxiv.org/abs/1701.03534.
[15] Naveen Suda et al. “Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale
Convolutional Neural Networks”. In: Proceedings of the 2016 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. FPGA ’16. Monterey, California, USA: ACM, 2016, pp. 16–25. isbn:
978-1-4503-3856-1. doi: 10.1145/2847263.2847276. url: http://doi.acm.org/10.1145/2847263.2847276.
[16] Jinhwan Park and Wonyong Sung. “FPGA based implementation of deep neural networks using on-chip
memory only”. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing,
ICASSP 2016, Shanghai, China, March 20-25, 2016. 2016, pp. 1011–1015.
doi: 10.1109/ICASSP.2016.7471828. url: https://doi.org/10.1109/ICASSP.2016.7471828.
[17] Philip Colangelo et al. “Exploration of Low Numeric Precision Deep Learning Inference Using Intel
FPGAs”. In: CoRR abs/1806.11547 (2018). arXiv: 1806.11547.
url: http://arxiv.org/abs/1806.11547.
[18] Yaman Umuroglu et al. “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”.
In: CoRR abs/1612.07119 (2016). arXiv: 1612.07119. url: http://arxiv.org/abs/1612.07119.
[19] Ganesh Venkatesh, Eriko Nurvitadhi, and Debbie Marr. “Accelerating Deep Convolutional Networks
using low-precision and sparsity”. In: 2017 IEEE International Conference on Acoustics, Speech and
Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017. 2017, pp. 2861–2865. doi:
10.1109/ICASSP.2017.7952679. url: https://doi.org/10.1109/ICASSP.2017.7952679.
[20] J. Yinger et al. “Customizable FPGA OpenCL matrix multiply design template for deep neural net-
works”. In: 2017 International Conference on Field Programmable Technology (ICFPT). Dec. 2017,
pp. 259–262. doi: 10.1109/FPT.2017.8280155.
[21] Eriko Nurvitadhi et al. “In-Package Domain-Specific ASICs for Intel® Stratix® 10 FPGAs: A Case
Study of Accelerating Deep Learning Using TensorTile ASIC (Abstract Only)”. In: Proceedings of the
2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. FPGA ’18.
Monterey, California, USA: ACM, 2018, pp. 287–287. isbn: 978-1-4503-5614-5.
doi: 10.1145/3174243.3174966. url: http://doi.acm.org/10.1145/3174243.3174966.
[22] Stylianos I Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. “Toolflows for mapping con-
volutional neural networks on FPGAs: A survey and future directions”. In: ACM Computing Surveys
(CSUR) 51.3 (2018), p. 56.
[23] David E Goldberg and Kalyanmoy Deb. “A comparative analysis of selection schemes used in genetic
algorithms”. In: Foundations of genetic algorithms. Vol. 1. Elsevier, 1991, pp. 69–93.
[24] Thomas Rauber and Gudula Rünger. Parallel programming: For multicore and cluster systems. Springer
Science & Business Media, 2013.
[25] Edgar Gabriel et al. “Open MPI: Goals, concept, and design of a next generation MPI implementation”.
In: European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting. Springer.
2004, pp. 97–104.
[26] Yann LeCun. THE MNIST DATABASE of handwritten digits. 1998.
url: http://yann.lecun.com/exdb/mnist/ (visited on 02/24/2019).
[27] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014.
arXiv: 1412.6980 [cs.LG].
[28] Intel Arria10 Device Overview. 2018.
url: https://www.intel.com/content/www/us/en/programmable/documentation/sam1403480274650.html
(visited on 04/09/2018).
[29] The open standard for parallel programming of heterogeneous systems.
url: https://www.khronos.org/opencl/.
[30] Intel FPGA SDK For OpenCL.
url: https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html.
[31] HT Kung and Charles E Leiserson. “Systolic arrays (for VLSI)”. In: Sparse Matrix Proceedings 1978.
Vol. 1. Society for industrial and applied mathematics. 1979, pp. 256–282.
[32] Intel FPGA SDK for OpenCL Pro Edition: Programming Guide.
url: https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807965224.html.
[33] Intel Arria 10 Native Floating-Point DSP Intel FPGA IP User Guide.
url: https://www.intel.com/content/www/us/en/programmable/documentation/dmi1461148449999.html.
[34] The 1st International Workshop on FPGAs for Domain Experts. Nov. 2018.
url: http://tytra.org.uk/fpode/ (visited on 03/03/2019).
Appendices
Appendix A Sample Configuration File
Listing 1: ECAD configuration file sample
{
  "name": "MLP Example configuration file",
  "comment": "Test cell array for MLP",
  "includes_comment": "list of configuration include files",
  "includes": [ "GlobalSettings.ecad.cfg" ],
  "version": "v0.0.3b",

  "popConfigValues":
  {
    "comment": "population configuration values - these get loaded into the cell network population class in the evolution engine",
    "initialPopSize": 20,
    "maxPopSize": 40,
    "changeRate": 0.20,
    "minIndivEvalCompleteBeforeFitSelect": 10,
    "maxGenerations": 2000,
    "fitnessScoreGoal": 2.0,
    "evalTypes":
    [
      { "type": "simJob", "weight": 1.0, "minValue": 0.9, "maxValue": 1, "active": true, "allowOverflow": false, "epochs": 4, "batchSize": 100},
      { "type": "hwDBJob", "weight": 1.0, "minValue": 0, "maxValue": 1000.0, "active": true, "allowOverflow": false },
      { "type": "physJob", "weight": 1.0, "minValue": 0, "maxValue": 1, "active": false, "allowOverflow": false, "minimize": true}
    ]
  },
  "traitConfigValues":
  {
    "defChangeRate": 0.10
  },

  "cellConfigValues":
  [
  ],

  "cellTypes":
  [
    {
      "comment" : "Input Layer",
      "cell_type" : "input",
      "batch_size" : { "minValue": 2, "maxValue": 1024, "modValue": 2},

      "templateFile": "",
      "mainFuncName": ""
    },
    {
      "comment" : "Dense Layer",
      "cell_type" : "dense",

      "systolic_id" : 0,

      "neurons" : { "minValue": 2, "maxValue": 1024, "modValue": 2, "changeRate": 0.1 },
      "sys_rows" : { "minValue": 2, "maxValue": 64, "modValue": 2, "changeRate": 0.1 },
      "sys_cols" : { "minValue": 2, "maxValue": 64, "changeRate": 0.5, "powValue": 2, "func": "PowFunction"},
      "sys_vec" : { "minValue": 2, "maxValue": 64, "changeRate": 0.5, "powValue": 2, "func": "PowFunction"},

      "sys_intrlv -comment" : "DenseCell mutate sets sys_intrlv = (random pow 2 >= sys_rows+sys_cols)",
      "sys_intrlv" : { "minValue": 2, "maxValue": 256, "modValue": 2},
      "sys_scale" : { "minValue": 2, "maxValue": 256, "modValue": 2},

      "row_blocks" : 0,
      "col_blocks" : 0,
      "vec_blocks" : 0,
      "arows_pad" : 0,
      "acols_pad" : 0,
      "bcols_pad" : 0,

      "enableBias" : { "minValue": 0, "maxValue": 1},

      "HWGenMode_comment": "Hardware Generation Mode - for now options are: Single Systolic Array(SSA), Multi Systolic Array(MSA)",
      "HWGenMode": "SSA",

      "weightsManipulation": "matmul",
      "layerManipulation": "add",

      "templateFile" : "dense_cell.h",
      "mainFuncName" : "dense_cell"
    },
    {
      "comment" : "Relu Layer",
      "cell_type" : "relu",
      "relu_vec" : 1,
      "templateFile" : "relu_cell.h",
      "mainFuncName" : "relu_cell"
    },
    {
      "comment" : "Output Layer",
      "cell_type" : "output",

      "templateFile": "",
      "mainFuncName": ""
    }
  ],

  "netConfig":
  {
    "netType" : "mlp",
    "templateFile" : "",
    "mainFuncName" : ""
  },

  "hwConfig":
  {
    "comment" : "These are the resource limits provided by the device data sheet",
    "deviceType": "Arria10-1150",
    "dsp": 1518,
    "freq": 250,
    "sram": 54260,
    "mem_banks": 1,
    "mem_speed": 2400,
    "mem_rate": 8
  },

  "cellArray":
  [
    {"comment": "input layer", "cell_type": "input", "cell_name": "X", "input": "global", "output": "dense00", "input_size": 784, "fixed": true},
    {"comment": "dense layer", "cell_type": "dense", "cell_name": "dense00", "input": "X", "output": "relu00", "fixed": false},
    {"comment": "relu layer", "cell_type": "relu", "cell_name": "relu00", "input": "dense00", "output": "Y", "fixed": false},
    {"comment": "output layer", "cell_type": "output", "cell_name": "Y", "input": "relu00", "output": "global", "output_size": 10, "fixed": true}
  ]
}
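The trait ranges in Listing 1 suggest how the evolutionary algorithm might draw new values. The Python sketch below is an interpretation, not ECAD's implementation: it assumes that modValue restricts a trait to multiples of that value, that a PowFunction trait is restricted to powers of powValue, and that changeRate is the per-trait probability of mutation, falling back to defChangeRate. The reported values, such as a scale of 18 or 1018 neurons, are at least consistent with multiples of 2. The function names are hypothetical.

```python
import random


def candidate_values(trait):
    """List the values a trait may take under the assumed range semantics."""
    lo, hi = trait["minValue"], trait["maxValue"]
    if trait.get("func") == "PowFunction":
        base = trait["powValue"]
        if base < 2:
            raise ValueError("powValue must be at least 2")
        values, v = [], 1
        while v <= hi:
            if v >= lo:
                values.append(v)
            v *= base
        return values
    step = trait.get("modValue", 1)
    return [v for v in range(lo, hi + 1) if v % step == 0]


def mutate(trait, current, rng, default_rate=0.10):
    """With probability changeRate, replace current by a random allowed value."""
    rate = trait.get("changeRate", default_rate)
    if rng.random() >= rate:
        return current
    return rng.choice(candidate_values(trait))


sys_cols = {"minValue": 2, "maxValue": 64, "changeRate": 0.5,
            "powValue": 2, "func": "PowFunction"}
sys_scale = {"minValue": 2, "maxValue": 256, "modValue": 2}

rng = random.Random(0)
print(candidate_values(sys_cols))
print(mutate(sys_scale, current=2, rng=rng))
```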
