Keywords: Hardware/software co-design, RISC-V, Rocket core, Rocket chip, Cal2Many
Introduction
Today's applications, such as audio/vision processing, wireless communication and machine learning, process massive amounts of data produced by all kinds of sensors. These applications require high computation power in order to provide results in a reasonable amount of time. The demand for higher performance has been present throughout the history of microprocessors. One major answer to this continuous demand has been technology development, which enables higher chip densities and higher clock rates and thereby higher computation power. As a consequence of higher density and clock rates, the power consumption increases, which causes higher chip temperatures. However, there are thermal limitations to the chip temperature [1]. These limitations have led the industry to processors with lower clock rates and higher numbers of cores, which can run in parallel. Eventually, parallelism has become the key to greater performance.

* Corresponding author. E-mail address: suleyman_savas@hotmail.com (S. Savas).
Initially, multicore/manycore architectures were produced by using several identical cores on the same die. As the design space was explored, it was realized that heterogeneous architectures, consisting of different (specialized) types of cores, have the potential to provide higher performance than their homogeneous counterparts [2] [3] [4] [5]. Our results confirm this statement.
The complexity of the architectures grows as the performance and power requirements push them towards parallelism and heterogeneity. As a result, the difficulty of designing and programming these architectures increases. The hardware architects need to design each specialized core separately instead of creating identical copies of a core. Memory coherence, memory accesses and on-chip communication might also become more complicated due to having cores with different characteristics [5]. On the software side, the architects need to know all the details about parallel programming and the target architecture to be able to develop efficient applications.
In addition to the challenges in designing heterogeneous manycore architectures, verification and manufacturing costs are significant obstacles in exploring the design space of these architectures. There are simulators such as HP-Labs COTson [6], Gem5 [7], and SimFlex [8], which provide full system simulations to overcome these obstacles and evaluate architectures with different configurations. However, depending on the system size, the simulation times can be dramatically long [9]. Additionally, simulators do not provide data on hardware resource usage or on the maximum clock rate that the architecture can achieve.
In a prior study [10], we proposed a design method to develop specialized single core architectures with RISC-V cores directly from dataflow applications. The method consisted of 4 steps, of which only the first 3 were implemented. In the first step, the application is developed in a high level dataflow language (CAL actor language [11]) that supports application development with different models of computation including Kahn Process Network (KPN) [12], Synchronous Dataflow (SDF) [13], Dataflow Process Networks (DPN) [14], and Cyclo-Static Dataflow (CSDF) [15]. In the second step, this application is fed to software tools (TURNUS [16] and Cal2Many [17]), where the compute intensive parts (hot-spots) of the application are identified and converted into hardware accelerators implemented in Chisel [18], a hardware description language based on Scala. The rest of the application is converted into software code that can be compiled for the target architecture. In the third step, the hardware accelerators are integrated into a configurable RISC-V core, and a single core system is generated with another tool (rocket chip generator [19]). We developed a hardware library for basic floating-point operations and complex number arithmetic as well as some software tools and combined them with existing open source tools to automate the steps of the design method.
In this study we complete the design method by implementing the fourth step, which is connecting multiple heterogeneous cores to each other with a two-dimensional mesh network-on-chip (NoC). That is, the design method still starts from developing the application in CAL actor language and now ends with generating manycore architectures consisting of a 2D mesh NoC and RISC-V cores with scratchpad memories and different accelerators in the form of instruction extensions targeting hot-spots of the application. The NoC router is developed in VHDL and Chisel separately. The details and comparison of the implementations, performance analysis, area usage on FPGA, and comparison to other studies are given in [20]. In this paper, we extend the rocket chip generator by integrating this router. We develop a new hardware/software code generation back-end to support generation of parallel C code and multiple accelerators. Additionally, we add a communication mechanism to the software generation to support core-to-core communication. Finally, we evaluate the complete design method.
We used two case studies in order to test and evaluate the design method, the tools, and the generated hardware & software implementations. We generated architectures with different configurations (targeting the chosen applications) within these case studies. The applications are executed on cycle accurate, full system emulators generated by the rocket chip generator to collect the performance results. The rocket chip generator generates synthesizable Verilog code as well. We synthesize this Verilog code on a Xilinx Ultrascale FPGA to collect area and timing results.
The first case study implements the autofocus criterion calculation, which is a key component of synthetic aperture radar systems [21]. In our previous work, we implemented this case study in a sequential manner and executed it on a single core. In this study, we implement two different parallel versions of it and run them on 2 cores and 13 cores, respectively. The performance results of this study are compared to the results from Ul-Abdin et al. [21] and Ahlander et al. [22], in which the application is executed on commercial architectures.
The second case study is the first convolution layer of GoogLeNet [23], used as a proof of concept to demonstrate that the design method can generate architectures targeting the machine learning domain. This case study is first executed on a single core architecture, then on 4 and 5 core architectures with accelerators. The other layers can be mapped onto additional cores to obtain a manycore architecture that can execute the whole application. We used on-chip memory to store the data for the first layer; however, the memory requirements of the whole application will lead to the use of external memory, which might harm the overall performance.
The contributions of this study can be summarized as:
• The design method that is proposed in our previous study [10] is extended to generate manycore architectures by integrating a two-dimensional mesh NoC router into the tool chain.
• The hardware library that is used while generating hardware accelerators is extended with a hardware block that performs 7 × 7 convolution.
• A new back-end for hardware & software code generation is developed within the Cal2Many framework, in order to support
  - parallel code generation for multiple cores
  - generation of multiple accelerators per core
  - on-chip communication
  - placement of code and data on the local memory of the corresponding cores
  - large data transfers to the accelerators via direct memory connections
• The tools and the generated architectures are evaluated with two different case studies from different domains.
• The performance of the generated architectures is compared with the performance of commercial architectures.
The rest of the paper is structured as follows: Section 2 presents the related works. Section 3 describes the step-by-step realization of the design method. Section 4 describes the case studies and the details of the implementations. Section 5 presents and discusses the results of the case studies. Finally, Section 6 concludes the study.
Related works
There are a limited number of other studies that perform hardware/software co-design, use frameworks to provide resource usage and timing results, and generate synthesizable hardware implementations.
SystemCoDesigner [24] is an electronic system level (ESL) design tool that supports hardware/software co-design of manycore architectures for streaming applications based on actor models. It takes SystemC implementations, generates hardware and software code, and performs automatic design space exploration to decide which part of the SystemC implementation is converted into hardware and which part into software. In our method, we take this decision during the analysis step right after developing the application. Additionally, our tools support different design options regarding the network-on-chip, memory, and core. SystemCoDesigner uses MicroBlaze softcore processors to execute the software code whereas our tools use the rocket core based on RISC-V. It additionally uses a commercial tool, Forte's Cynthesizer [25], to generate the hardware code. We develop the hardware code generation tool within this study and all our tools are open source.
McPAT [26] is a framework that models power, area and timing of multicore/manycore processors. This framework supports different cache models, NoCs and cores available in 2009 such as Sun Niagara and Intel Nehalem. The framework does not provide performance results itself but offers an XML interface for performance tools. It does not provide any register-transfer-level (RTL) implementation or any other synthesizable output.
OpenPiton [27] is an open source framework for building scalable architectures from single core to manycores. It utilizes the SPARC v9 ISA with a distributed cache coherence protocol and three 2D mesh NoCs. The framework supports FPGA and ASIC synthesis. However, in contrast to the tools in our design method, there is no support for scratchpads or instruction extensions (custom hardware accelerators), and the configuration for the cache sizes is limited to two.
OpenSoC System Architect [28] is another framework that aims to build scalable architectures and synthesize them on ASICs or FPGAs. This framework utilizes the Chisel language and the rocket chip generator and supports RISC-V cores and instruction extensions. Additionally, it uses the OpenSoC Fabric system-on-chip interconnection and routing framework [29] to generate manycore architectures. The framework provides testbenches and verification tools. Our design method and this framework both use the rocket chip generator and therefore have similarities in terms of the generated cores. Both generate rocket cores [30] and support instruction extensions and all the configurations supported by the rocket chip generator. However, our design method starts with application development in a high level language, which raises the abstraction level and facilitates the application development. Additionally, it includes application profiling for hot-spot identification and automatic generation of the custom hardware accelerators, whereas in the OpenSoC framework the accelerators must be identified and implemented manually by the developer.
Opencelerity [31] is an accelerator-centric system-on-chip (SoC) design based on a combination of 3 tiers, namely the general purpose, manycore and specialization tiers. This SoC uses rocket cores and RoCC [32] interfaces, as do the architectures generated by our design tools. The first tier consists of 5 rocket cores capable of running Linux, whereas 506 smaller RISC-V cores reside in tier 2, and accelerators in tier 3. The accelerators are generated using SystemC and high-level synthesis tools. The accelerators and the cores are placed in different tiers and all cores share the accelerator tier. In contrast, we generate the accelerators automatically from the CAL language and each core can have its own tightly-coupled accelerators, making the accelerator an instruction extension to the core. Additionally, our design method starts with application development and uses application requirements to configure the architecture in terms of the number of cores, memory size, and accelerator types, whereas Opencelerity is a single architecture with a configurable accelerator tier.
Hussain et al. [33] propose an architecture with a RISC core and multiple different accelerators, similar to Opencelerity. However, they use a single core and connect it to the accelerators with a network-on-chip. This allows the accelerators to have access to each other and to the RISC core, which means an accelerator can be shared. Our current tools do not allow connecting an accelerator to the network, only to the core directly. Therefore, the accelerators are not shared. We see the impact of this fact in our second case study. However, this support can be added by extending the rocket chip generator.
Design method
In order to find the most suitable architecture for the application in hand, one needs to execute the application on different architectures and compare them. Instead of going through this effort, we propose developing a specific architecture, which shows better performance than the general purpose counterparts, for the application by augmenting base cores with custom accelerators targeting hot-spots of the application. However, developing an architecture manually from scratch is a great challenge. Therefore, we use the open source rocket chip generator for generating the architecture. We integrate the accelerators and a 2D mesh NoC into this generator to support generation of application specific manycore architectures. This process is divided into 4 main steps as follows:
1. Application development
2. Analysis and code generation
3. Accelerator integration
4. System integration
The steps of the realization of the design method are illustrated in Fig. 1 together with the tools and their inputs and outputs. The generic description of all the steps and the realization of the first 3 steps are given in our previous study [10]. Still, we summarize these steps together with the changes and additions made within this study, and focus on the realization of the system integration step.
Application development
In the application development step the developer writes the application code to be fed to the analysis and code generation tools. In the realization of this step we chose the CAL actor language [11] due to its support for parallelism, simplicity, and suitable software tools. A CAL program consists of actors, which are stateful operators. Actors have ports to communicate with each other and actions for computations. The actions are the code blocks that can change the state of the actor, consume inputs, perform computation and produce outputs. The actions can be fired based on the state of the actor and availability or value of the input data. Actors may build a network when several of them are connected to each other.
The number of cores in the generated architecture is determined by the number of actors implemented in this step. The Cal2Many framework supports composition and splitting of actors; however, we do not use this feature of the framework in this study. In certain applications the compute intensive parts may not be obvious. The developer can combine or split certain actions to obtain a visible compute intensive action that can be converted into a hardware accelerator.
Analysis and code generation
In the analysis and code generation step we utilize two different software tools namely TURNUS [16] and Cal2Many [17] for profiling and code generation respectively. TURNUS analyzes the CAL application and provides information such as how many times each action is executed (firing rates), the number of executed operations, the number of communication tokens consumed and produced, and communication buffer utilization. It allows setting weights manually for each operation (both memory and arithmetic). In our case studies, we use the number of cycles taken by each operation on the target (rocket) core as the operation's weight. By combining the weights with the firing rates of the actions and the number of the executed operations, TURNUS provides a bottleneck analysis of the CAL application with respect to the model of the target core. In this analysis, compute intensive (or bottleneck) actions are exposed. Fig. 2 presents the bottleneck analysis for the convolution case study. The weight given in the figure is a product of operation weights, number of executed operations and number of the firings. It is clearly visible that the applyFilter action is the compute intensive part (or the bottleneck) of the actor with almost 50% of all the weights. The framework executes the application to collect the bottleneck analysis data.
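The weight calculation described above can be sketched in a few lines of C. This is only an illustration of the arithmetic (operation weight × executed operations × firings), not TURNUS code; the type `op_profile_t`, the function names, and the assumption that operation counts are given per firing are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical profile entry for one operation type inside an action. */
typedef struct {
    uint64_t cycles_per_op;  /* operation weight: cycles on the target core */
    uint64_t ops_per_firing; /* assumed: executions of this operation per firing */
} op_profile_t;

/* Weight of an action: sum of (operation weight x operation count),
 * multiplied by the number of times the action fires. */
uint64_t action_weight(const op_profile_t *ops, size_t n_ops, uint64_t firings)
{
    uint64_t per_firing = 0;
    for (size_t i = 0; i < n_ops; i++)
        per_firing += ops[i].cycles_per_op * ops[i].ops_per_firing;
    return per_firing * firings;
}

/* Share of an action in the total weight, in percent (rounded down). */
unsigned weight_share_percent(uint64_t action, uint64_t total)
{
    assert(total > 0 && action <= total);
    return (unsigned)((action * 100u) / total);
}
```

An action whose share dominates the others, as applyFilter does in Fig. 2, is the natural candidate for conversion into an accelerator.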
Once the compute intensive action(s) of the application are identified, a prefix ( __acc__ ) needs to be added to the name of each such action for it to be recognized by the code generation tool, Cal2Many. This tool takes a CAL application and generates target specific code depending on the chosen back-end. There are several back-ends generating sequential C, parallel C for the Epiphany architecture [34], aJava and aStruct languages for the Ambric architecture [35], a Scala subset for the ePuma architecture [36], and a combination of C and Chisel. The hardware library consists of basic floating-point operations, complex operations, and the 7 × 7 convolution that is added within this study. This library is enough to generate any accelerator that performs arithmetic operations on integers, floating-point numbers and complex numbers. The back-end is capable of generating the hardware for the data-path automatically; however, it does not yet support converting loops into hardware.
The rest of the application is converted to C that can be compiled for the RISC-V core. The C code includes a communication mechanism for core-to-core communication and custom instruction calls to interact with the accelerator(s).
The new back-end generates a C file and a header file for each actor along with a single main C file. All of these files are compiled and linked together. The main file includes a function to synchronize the cores and another function that calls the entry (scheduler) function of each actor on the corresponding core, based on the core id. The latter function for a 5 core implementation is presented in Listing 1 (splitter, convR, convG, convB, and combiner are separate CAL actors converted into C files). This function is where the mapping of actors onto cores is performed. Each actor must be mapped onto the core that is generated for it. For instance, the convR() function, which appears in the listing, makes use of an accelerator. Therefore it must be executed by a core that has the accelerator. The entry function is executed on all cores; however, based on the core id, each core executes a different part of it, as seen in the switch statement.
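A minimal sketch of what such a dispatch function might look like is given below. The actor names come from the text; the function name `run_actor_on_core`, the core numbering, and the stub bodies (which only record which actor ran) are assumptions made for illustration and do not reproduce the generated code.

```c
#include <assert.h>

/* Stand-ins for the generated actor entry (scheduler) functions.
 * The real functions run the actor; these stubs only record the call. */
int last_actor = -1;
void splitter(void) { last_actor = 0; }
void convR(void)    { last_actor = 1; } /* must run on a core with the accelerator */
void convG(void)    { last_actor = 2; }
void convB(void)    { last_actor = 3; }
void combiner(void) { last_actor = 4; }

/* Executed by every core; each core runs only the actor mapped onto it.
 * Returns 0 on success and -1 if no actor is mapped to core_id. */
int run_actor_on_core(int core_id)
{
    switch (core_id) {
    case 0: splitter(); break;
    case 1: convR();    break;
    case 2: convG();    break;
    case 3: convB();    break;
    case 4: combiner(); break;
    default: return -1; /* assumed behavior for unmapped cores */
    }
    return 0;
}
```

On the target, the core id would be read from the hardware before this function is called; here it is passed as a parameter so that the mapping can be exercised on any host.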
Listings 2, 3 and 4 show an example of converting a CAL action into C code with and without using accelerators. Listing 2 presents the original CAL action to be converted into C. (This action differs from the one in Fig. 11.) The C code in Listing 3 makes use of the accelerator that performs the computations (applying the filter to an image). The C code copies the input data into the source registers of the custom instruction that is forwarded to the accelerator, uses a macro to execute this instruction, and copies the result from the accelerator into a global variable.
Listing 4 shows the version of the C code that does not use an accelerator but performs the operations locally. In both the CAL and the C code, there are global variables declared outside the scope of the code snippets.
The core-to-core communication mechanism is implemented with flags (ready & valid) and FIFO buffers to create unidirectional virtual channels between the connected cores. Fig. 3 illustrates two cores with two virtual channels in opposite directions. Core1 sends data to core2 through channel-1 whereas core2 sends data to core1 through channel-2. One such channel is generated for each connection between the CAL actors. The mechanism uses only write accesses through the network-on-chip and restricts read accesses to the local memory, because read accesses over the NoC cost more clock cycles than write accesses. Currently, the supported buffer size is 1 word, which can be of any variable type. This buffer size is enough for our case studies; if future applications require larger buffers, the communication mechanism can be extended. We used generic communication mechanisms [37, 38] that support variable buffer sizes with the Epiphany architecture in prior studies [39, 40]. However, these mechanisms added a significant overhead to the communication time, and consequently to the overall execution time. Therefore, we keep the communication mechanism developed within this study as simple as possible.
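One way to obtain a channel that writes remotely and reads only locally is sketched below. The text does not specify where each flag resides, so the placement here is an assumption: the data word and the valid flag sit in the receiver's local memory, and the ready flag sits in the sender's local memory. All type and function names are hypothetical, and the calls are non-blocking so the protocol can be exercised on a single host thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sender-side half: assumed to live in the sender's local memory. */
typedef struct {
    volatile uint32_t ready; /* set remotely by the receiver when the buffer is free */
} channel_tx_t;

/* Receiver-side half: assumed to live in the receiver's local memory. */
typedef struct {
    volatile uint32_t data;  /* one-word buffer */
    volatile uint32_t valid; /* set remotely by the sender when data is fresh */
} channel_rx_t;

/* Sender: reads only its local ready flag, writes data and valid remotely.
 * Assumes the NoC delivers the two writes in order. */
bool channel_try_send(channel_tx_t *tx, channel_rx_t *remote_rx, uint32_t word)
{
    if (!tx->ready)
        return false;        /* buffer still occupied */
    tx->ready = 0;           /* local write */
    remote_rx->data = word;  /* remote write */
    remote_rx->valid = 1;    /* remote write */
    return true;
}

/* Receiver: reads only its local buffer, writes the ready flag remotely. */
bool channel_try_recv(channel_rx_t *rx, channel_tx_t *remote_tx, uint32_t *word)
{
    if (!rx->valid)
        return false;        /* nothing to consume */
    *word = rx->data;        /* local read */
    rx->valid = 0;           /* local write */
    remote_tx->ready = 1;    /* remote write */
    return true;
}
```

A blocking send or receive would simply spin on the local flag until the corresponding `try` call succeeds.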
During the code generation each actor is mapped onto a different core, making the number of cores equal to the number of actors. However, the Cal2Many framework supports splitting and combining the CAL actors [41]. Thus, the actors do not need to be mapped onto individual cores. The new hybrid back-end does not make use of this feature yet; for future applications, it can be embedded into the new back-end.
The data and the code are placed in the local memory of the mapped core with the help of the section attribute of the compiler (gcc for RISC-V). The linker description must define the sections that are used with the section attribute. The code snippet in Listing 5 shows how the section attribute is used with a label that is defined in the linker description, given in Listing 6. Despite having a single binary file after linking all the C files together, the data and code of each actor are stored in the corresponding core's memory. The section attribute combined with the aforementioned switch statement allows each core to read the instructions and the data of the matching actor from its own local memory.
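As an illustration of the section attribute, the sketch below places one variable and one function into named sections. The section names `.core1_data` and `.core1_text` and the identifiers are hypothetical; on the target, the linker description would map such sections to the address range of the corresponding core's local memory, whereas a host toolchain simply places them at default locations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical section names; the linker description must define them. */
#define CORE1_DATA __attribute__((section(".core1_data")))
#define CORE1_TEXT __attribute__((section(".core1_text")))

/* Data of the actor mapped onto core 1. */
CORE1_DATA int32_t convR_result[4];

/* Code of the actor mapped onto core 1. */
CORE1_TEXT int32_t convR_sum(const int32_t *v, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```

Data and code go into separate sections here because a section holding both would need conflicting attributes (writable versus executable).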
Accelerator integration
We use the rocket custom co-processor (RoCC) [32] interface within the rocket chip generator to connect the generated hardware accelerators to the rocket core, which executes the RISC-V instruction set. The interface comes with the rocket chip generator and consumes a negligible amount of resources. The rocket core is an in-order scalar processor featuring a five-stage pipeline and an integer ALU. The rocket chip generator can generate systems-on-chip with different configurations such as number of cores, availability of the FPU, availability of accelerators, data memory type (cache or scratchpad), and size of data and instruction memories. The accelerators are integrated into the rocket chip generator where they can be instantiated and bound to a core.
The integration of an accelerator is shown in Listing 7 with a simple example. The generated accelerator is a Chisel module called acc . This module is instantiated in a class ( AccModule ) that extends the RoCC interface ( LazyRoCCModule ). The connections between the core and the accelerator are established in the AccModule class, through the io bundle that consists of several signals for requests from the core and responses from the accelerator.
The instantiation of the AccModule and binding to a custom instruction is performed during instantiation and configuration of a core in the system configurations of the rocket chip generator. The code example in Listing 8 shows an example of instantiating an accelerator that is bound to custom instruction 0.
Through the RoCC interface the hardware accelerators become an extension to the instruction set. The accelerators can be fired with a single instruction call. Based on a bit field in the instruction, the core may or may not halt until the accelerator returns a result. In our implementations we halt the cores while the accelerators perform a computation.
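To make the instruction-level view concrete, the following sketch assembles a 32-bit instruction word in the commonly documented RoCC layout (funct7, rs2, rs1, xd, xs1, xs2, rd, opcode) using the RISC-V custom-0 opcode. The function name is hypothetical, and the reading that the xd bit is the field deciding whether the core waits for a result is our assumption about the bit field mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_CUSTOM0 0x0Bu /* RISC-V custom-0 major opcode (0001011b) */

/* Build a RoCC-style instruction word.
 * xd  = 1: the accelerator writes a result to rd (core waits for it, assumed)
 * xs1 = 1: the value of rs1 is passed to the accelerator
 * xs2 = 1: the value of rs2 is passed to the accelerator */
uint32_t rocc_encode(uint32_t funct7, uint32_t rd, uint32_t rs1, uint32_t rs2,
                     uint32_t xd, uint32_t xs1, uint32_t xs2)
{
    assert(funct7 < 128 && rd < 32 && rs1 < 32 && rs2 < 32);
    assert(xd < 2 && xs1 < 2 && xs2 < 2);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (xd << 14) | (xs1 << 13) | (xs2 << 12) |
           (rd << 7) | OPCODE_CUSTOM0;
}
```

In generated C code such a word is typically emitted through an inline-assembly macro rather than built at run time; the function form is used here only to show the field positions.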
The accelerators have direct access to the local data memory through the RoCC interface. In our previous study [10] there was no need to transfer large amounts of data to the accelerators. Therefore the memory connections were not used. However, in this study, the convolution case study requires large amounts of data to be transferred to the accelerator (in the form of arrays in the software code). In order to support such large data transfers we modified the code generation. Previously, all of the data would be transferred with instruction calls (by providing the source registers in the instruction). Now our back-end supports two types of data transfer. If the data to be transferred is not an array, it is still transferred through instruction calls. However, if the data is an array, the address and the size of the array are transferred to the accelerator with a single instruction call.
The code snippet in Listing 9 shows how the accesses to the memory are performed within the extended RoCC interface. The accesses can be performed in the accelerator as well, if the io signals are forwarded to the accelerator. The upper part of the code shows a read request to the memory and the lower part shows how the response from the memory is read.
It is required to create a Chisel class in rocket chip generator that extends the RoCC interface to instantiate the accelerator and connect the accelerator to the interface. However, if necessary, it can be extended to read the instruction and the source registers from the core, store the input data, access the memory, provide inputs to the accelerator, fire the accelerator and read the outputs. The task of extending the RoCC interface is not automated yet. This task needs to be done once for each accelerator. There is no limitation on the number of classes extending the interface. These classes can be re-used to instantiate and connect the accelerator to a core while generating an architecture.
System integration
The result of the first three steps of the design method is a set of application specific tiles, each with a core and accelerators. In this fourth step, we connect the tiles with a network-on-chip to generate application/domain specific manycore architectures.
Network-on-Chip
The rocket chip generator supports a crossbar network by default. However, this topology does not scale well [42], especially when the number of cores reaches the hundreds. Since every core is connected to every other core, the cost of wires and routing components grows quadratically with the number of cores. Additionally, the length of wires may contribute to the critical path and decrease the maximum clock frequency of the architecture. Therefore, we implemented a new scalable network-on-chip [20] in Chisel and integrated it into the rocket chip generator. Inspired by the Epiphany architecture, which we evaluated in our prior studies [39, 40], we chose the two-dimensional mesh network-on-chip with XY routing for the on-chip communication. This network topology is highly regular, scales well (the cost of wires and network routers increases linearly with the number of cores), and can be made deadlock free with the XY routing protocol [42]. The wire connections are to the nearest neighbour. Therefore, the wire delay does not increase with the number of routers. Additionally, since the connections are regular, they can be implemented with loops in the hardware description language. In this topology, the same router is instantiated for all network sizes. The routing decision is distributed; hence there is no central decision point that could be a bottleneck.
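The distributed routing decision of XY routing can be sketched as a function that each router evaluates on its own: a packet first travels along the X dimension until the column matches, and only then along the Y dimension. The port names and the convention that y grows towards the south are assumptions for illustration; the actual router is described in [20].

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Output port chosen by the router at (cur_x, cur_y) for a packet
 * addressed to (dst_x, dst_y): resolve X first, then Y. */
port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_SOUTH; /* assumed: y grows southwards */
    if (dst_y < cur_y) return PORT_NORTH;
    return PORT_LOCAL;                    /* packet has arrived */
}

/* Number of router-to-router hops on the XY path. */
int xy_hops(int src_x, int src_y, int dst_x, int dst_y)
{
    return abs(dst_x - src_x) + abs(dst_y - src_y);
}
```

Because a packet never turns from the Y dimension back into the X dimension, cyclic channel dependencies cannot form, which is the property behind the deadlock freedom mentioned above.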
Integration of the NoC to the rocket chip generator
By default, the rocket chip generator generates separate instruction and data caches for the rocket core. The core is connected to a memory module that takes care of routing the memory accesses to the local memory (data cache) or to the other components through the crossbar network. The address space is shared by all the cores and other components in the architecture. The default local memory type is cache; however, for data storage we use scratchpad memories in order to avoid dealing with cache coherence, which is highly challenging to maintain with many cores [43]. Additionally, scratchpads allow a simpler interface to the network and avoid other complexities introduced by the caches. The instruction cache remains untouched as it is not possible to replace or remove it in the rocket chip generator. We call the combination of core, memory and/or accelerator a 'tile', as seen in Fig. 5.
In order to integrate the mesh network to the rocket chip generator, we performed the following steps including the modifications to the rocket chip generator:
1. Implementation of our NoC router in Chisel.
2. Implementation of an interface between the core and the memory. This interface forwards the local memory accesses to the memory module and the global memory accesses to the router. The interface can be seen as the 'NoC Interface' in Fig. 5.
3. Generation of a router per tile if the number of tiles is greater than 1.
4. Generation of connections between tiles and routers. The routers get connected to the interfaces developed in the second step.
5. Generation of connections between the routers.
The details of the first step are given in [20] . The connections between the core and the memory module are defined in a class called HellaCacheIO . This class consists of request, response, interrupt, and additional control signals. The network interface uses this class of signals to communicate with both the core and the memory. The connections between the core and the memory (utilizing the HellaCacheIO signals) are shown in Fig. 4 . The request signals are address, data, request type, data size, and return register. The response signals cover the request signals together with a few additional signals to indicate if the response has data, if the response is after a cache miss, etc.
The interface checks the address of each memory request from the core to forward the request either to the local memory or to the network. If the request is to the local memory, the signals from the core are forwarded directly to the memory module, without any modification. However, if the request is to a remote memory, the request signals are converted into a network packet and forwarded to the router. If the request is to an address that is not in the local memory of any core, it is forwarded to a specific core (core0 in the current implementation). On the receiving side, the network interface receives the request from the router, converts it into the HellaCacheIO signals and sends it to the memory. The network interface requires the dimensions of the mesh and the size of the local memories (scratchpads) to generate the destination coordinates from the address and prepare the network packets. The interface arbitrates the memory access between the core and the router in a round-robin fashion. The same method is used (between the core and the memory) for accesses to the router. If an interface receives a request from the router and the local memory is not ready yet, it stores the request until the memory is ready. Similarly, if the local memory returns a response to a request that came from a remote core and the router is not ready, the response is stored in the interface until the router becomes ready. The interface stores only a single request in each direction. If there is already a pending request in the interface, the ready flag is deasserted to indicate that the interface is not ready, and the requester waits until the flag is set again.
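The address-to-coordinate step can be sketched as follows. The text states only that the mesh dimensions and the scratchpad size are needed; the contiguous layout of the scratchpads starting at a base address, the row-major numbering of cores, and all names below are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical configuration of the network interface. */
typedef struct {
    uint32_t base;     /* assumed: address of core 0's scratchpad */
    uint32_t spm_size; /* size of each local memory in bytes */
    uint32_t rows;     /* mesh rows */
    uint32_t cols;     /* mesh columns */
} noc_cfg_t;

typedef struct {
    uint32_t core; /* destination core id */
    uint32_t x;    /* destination column */
    uint32_t y;    /* destination row */
} dest_t;

/* Map an address to the tile that owns it. Addresses outside all local
 * memories are sent to core 0, as in the current implementation. */
dest_t decode_dest(const noc_cfg_t *cfg, uint32_t addr)
{
    dest_t d = {0, 0, 0}; /* default: core 0 at (0, 0) */
    assert(cfg->spm_size > 0 && cfg->cols > 0);
    if (addr >= cfg->base) {
        uint32_t idx = (addr - cfg->base) / cfg->spm_size;
        if (idx < cfg->rows * cfg->cols) {
            d.core = idx;
            d.x = idx % cfg->cols; /* assumed row-major core numbering */
            d.y = idx / cfg->cols;
        }
    }
    return d;
}
```

A request is local when the decoded core id equals the id of the requesting tile; otherwise the coordinates are placed in the header of the network packet.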
A write request from the core to the memory is sent in two cycles. During the first cycle the control signals are sent, and during the next cycle the data is sent. However, the NoC router requires all of the signals in one cycle. Therefore the interface stores the control signals for a cycle and combines them with the data before sending them to the router. This raises an issue because the NoC router requires 4 cycles between two consecutive requests, which means that the core must either check whether the router is ready before sending a request or simply not send another request for the next 4 cycles. Otherwise, the router will just discard the request and the core will not know about it. The first option is not possible because the router needs the whole network packet to determine whether it is ready, whereas the core first sends only the control signals and then sends the data signal if the router is ready. (The NoC router uses the AMBA protocol [44] for the ready and valid signals.) Therefore, the interface deasserts the ready signal to the core for 4 cycles after each request to prevent the core from sending any new requests until the router is ready.
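The effect of holding the ready signal low can be seen in a small cycle-level model. This is a behavioural sketch only; the exact cycle alignment of the hold-off and the function name are assumptions.

```python
def accepted_cycles(n_cycles: int, gap: int = 4) -> list:
    """Cycles in which a core that always wants to send is allowed to send.

    After each accepted request, ready is held low for `gap` cycles.
    """
    accepted = []
    cooldown = 0
    for cycle in range(n_cycles):
        ready = cooldown == 0
        if ready:
            accepted.append(cycle)
            cooldown = gap  # block the core for the next `gap` cycles
        else:
            cooldown -= 1
    return accepted


if __name__ == "__main__":
    print(accepted_cycles(12))  # [0, 5, 10]
```

Under these assumptions a core that tries to send continuously gets one request through every five cycles, so no request reaches the router while it would still discard it.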
The signals between the interface and the router consist of ready (1 bit), valid (1 bit), and data (128 bits) signals. The data signal includes the control signals as well as the actual data. Listing 10 shows how these signals are connected between the routers and the tiles that include the interface. In this code snippet, where outer is the list of tiles and router is the list of routers, one can see the usage of functional programming. Fig. 5 shows the changes in the tile after integrating the 2D-mesh NoC. After these changes, the memory accesses with local addresses are forwarded to the local memory whereas the rest of the accesses are forwarded to the network.
In the third step of the integration, we set the number of rows and number of columns for the network, in the rocket chip generator, and generate the routers. The row and column sizes must be provided to the interface as well, for destination coordinate calculation. During the fourth step, we remove the connections between the core and the memory module, connect the core to the interface in the RocketTile.scala file, and connect the interface to the routers in the RocketSubsystem.scala as seen in Listing 10 . Finally, based on the number of rows, columns and core ids, we connect the routers to each other (wMesh and rMesh networks) to form the 2 dimensional mesh. First the horizontal connections and then the vertical connections are done in loops.
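The two connection loops mentioned above can be sketched as follows. The sketch only enumerates which routers become neighbours; it does not model the wMesh and rMesh signals, and the row-by-row router numbering is an assumption.

```python
def mesh_links(rows: int, cols: int) -> list:
    """Return the neighbour links of a rows x cols mesh as pairs of router ids."""
    links = []
    # Horizontal connections: each router to its east neighbour.
    for r in range(rows):
        for c in range(cols - 1):
            links.append((r * cols + c, r * cols + c + 1))
    # Vertical connections: each router to its south neighbour.
    for r in range(rows - 1):
        for c in range(cols):
            links.append((r * cols + c, (r + 1) * cols + c))
    return links


if __name__ == "__main__":
    print(mesh_links(1, 2))       # [(0, 1)]
    print(len(mesh_links(4, 4)))  # 24
```

Each pair stands for one physical neighbour relation; in the generator every such relation would carry the ready, valid, and data signals in both directions.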
After the integration of the 2D-mesh NoC, manycore architectures can be generated by setting the configurations in the generator. It is possible to generate a matrix of cores with arbitrary dimensions. The structure of the cores does not have to be a full matrix; however, due to the XY routing protocol, the routers are generated in a complete mesh structure.
The crossbar network is partially removed from the generator, however a part of it still remains as the instructions are moved to the instruction cache through this network after the boot-up process. Additionally, during the boot-up, the cores access the components such as the boot-ROM and the debug-ROM, which are on the shared address space, through the crossbar network. These components are not connected to the 2D mesh NoC that we integrated. Therefore, in order to access these components we keep the crossbar network between a core ( core0 ) and these components. All other cores access these components through core0 . In the future, those components might be connected to the 2D mesh network with proper interfaces.
Future automation directions
Being a first proof of concept, the design process still has a few manual tasks, which are necessary to generate an architecture and execute applications. These tasks are as follows:
• The instantiation and configuration of the cores within a new Chisel class that extends the configuration class within the rocket chip generator. The configurations include memory types and sizes, availability of FPU and accelerators, and many other details regarding the caches, memory lanes and other components.
• Configuration of the NoC interface, including setting the row and column sizes of the mesh network and the scratchpad memory size. By default this size is assumed to be the same for every core in the mesh. However, this can be configured.
In order to execute an application in a bare metal fashion on the generated emulators, the following steps are required:
• Development of a new linker description for each different architecture. The code generation uses the section attribute with the section names for each core. Hence, these names should be defined in the new linker description.
• Development of new start-up code if the number of the cores and the memory sizes change.
The configuration classes, linker description, and the start-up code can be generated automatically with an extension to the Cal2Many framework whereas for the NoC interface configuration, one may need to edit the interface to retrieve the configuration data from the configuration class. Nonetheless, the productivity of the developers will increase further with the automation of these tasks. 
Case studies
We have implemented two case studies to evaluate the design method, the software tools and the generated architectures. The case studies are a proof of concept to show that the design method works for different application domains. The first case study is different parallel versions of the autofocus criterion calculation [21] . The first version runs on a single core, the second version runs on 2 cores and the third version runs on 13 cores. The second case study is the first convolution layer of GoogLeNet [23] on 1, 4 and 5 cores. The main goal of this case study is not to achieve world leading performance, but to prove that the method can be used to improve the performance of the base, general purpose core. The applications are developed in CAL actor language and analyzed with TURNUS.
We did not need any constraints while designing the architectures in the case studies. Therefore, it was not needed to explore different configurations for the same CAL implementation. However, our Cal2Many framework supports fusion and fission of actors [41] . This allows the developers to map more than one actor on one core if they want to limit the number of the cores. Additionally, the developers can split a large actor into multiple cores if they want to limit the size of the memory on each core.
In the rest of this section, we will first provide the details of the implementations, followed by the practical details, to be able to generate the architectures and execute the implementations on them.
Autofocus criterion calculation
This application is a part of synthetic aperture radar systems [45] . These systems are usually mounted on flying platforms, where they use the motion of the platform over a target region to create two- or three-dimensional images of underlying objects. The path of the movement is not perfectly linear; however, additional processing can be performed to compensate for this. The typical information used for the compensation is the positioning information from GPS [21] . However, in some cases this data might be insufficient. In such cases, the autofocus criterion can be used. There are different methods for calculating this criterion [46, 47] . The method used in this study tries to find the flight path compensation that results in the best possible match between two images of the contributing subapertures. This requires several flight path compensations to be tested. The matching is checked with a selected focus criterion. The criterion calculations, including interpolations and correlations, are performed many times.
The implementations consist of three main actions performing range interpolation ( ranger ), beam interpolation ( beamer ), and correlation ( correlator ) on 6 × 6 pixel kernels. Pixel information consists of position and color value represented as complex numbers. The real and imaginary parts of these numbers are represented with floating-point numbers. The ranger performs 36 cubic interpolations on each kernel, whereas the beamer performs 18 interpolations. The cubic interpolations are based on Neville's algorithm [48] . Each interpolation takes four input pixels and produces one pixel value in complex number format.
The ranger action takes a 6 × 6 kernel from one of the input images and applies interpolations to the rows. For each interpolation 4 pixels are consumed and a single pixel is produced. Hence the output of this action is a 6 × 3 matrix, which is forwarded to the beamer action as illustrated in Fig. 6 . Later, the same ranger action takes a kernel from the other input image and performs the same operations. The beamer action applies interpolation to the columns of its input. The result of the beam interpolation is a 3 × 3 matrix. Finally, the correlator action receives two consecutive 3 × 3 interpolation results (one for each image) from the beamer and calculates their correlation. After one iteration of computations, a single value is produced for every two input kernels. In other words, the two kernels, consisting of 36 pixels each, are reduced to a single floating-point number.
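The shape reduction from 6 × 6 to 6 × 3 to 3 × 3 can be reproduced with a short sketch of Neville's algorithm on complex pixels. The sliding 4-sample window, the fixed interpolation position 1.5, and all function names are assumptions made for illustration; the real positions depend on the tested flight path compensation, and the correlation step is omitted.

```python
def neville(xs, ys, x):
    """Evaluate at x the polynomial through the points (xs[i], ys[i])."""
    p = list(ys)
    n = len(xs)
    for k in range(1, n):
        for i in range(n - k):
            p[i] = ((x - xs[i + k]) * p[i] + (xs[i] - x) * p[i + 1]) / (xs[i] - xs[i + k])
    return p[0]


def interpolate_rows(matrix, x=1.5):
    """Cubic interpolation over a sliding 4-sample window: row width w becomes w - 3."""
    return [[neville([0, 1, 2, 3], row[j:j + 4], x) for j in range(len(row) - 3)]
            for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# A synthetic 6 x 6 kernel of complex pixels.
kernel = [[complex(r, c) for c in range(6)] for r in range(6)]
ranged = interpolate_rows(kernel)                        # ranger: 6 x 3
beamed = transpose(interpolate_rows(transpose(ranged)))  # beamer: 3 x 3

if __name__ == "__main__":
    print(len(ranged), len(ranged[0]))  # 6 3
    print(len(beamed), len(beamed[0]))  # 3 3
```

The beamer is expressed here as the same row routine applied to the transposed matrix, which mirrors the statement that it interpolates along the columns.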
Now we will describe how we have performed the steps of the design method for this case study.
Application development
The single core version is implemented as a single actor with the main actions ( ranger, beamer, correlator ) and several other helper actions for memory movement and scheduling. The actions are scheduled to run in a loop until all the input kernels are processed. There is no communication involved and the input data is stored within the actor. Fig. 7 illustrates the structure of the actor.
The dual core version is implemented as 2 actors, where the first actor performs the range interpolation and the second actor performs the beam interpolation and correlation as seen in Fig. 8 . The results of the range interpolation are continuously sent to actor2 through the communication channels where the beam interpolation and correlation are applied to the input data. The input data is stored in the first actor as seen in the figure.
In the last version, we instantiated the structure of the dual core implementation 6 times and let the actors run in parallel to increase the data parallelism. The number is chosen based on the input size, which is 6 kernels per image. The application can be parallelized further, however, it is not the focus of this study.
The correlation is a small task and it would be a waste of resources to use a correlation accelerator in each actor that performs the beam interpolation. A single accelerator is enough to perform the correlation efficiently for the entire application. However, the rocket chip generator does not support shared accelerators between cores. Therefore, we implemented the correlation operation as a separate actor and forward the results from the beam interpolators to this actor. This increases the task parallelism by letting the cores perform the correlation and the beam interpolation tasks in parallel. However, it adds a slight communication overhead due to moving the 3 × 3 image kernels, which are the outputs of the beamers and the inputs of the correlator , between the cores. In total 6 actors are instantiated to perform the range interpolations, 6 actors are instantiated to perform the beam interpolations and one actor is instantiated to perform the correlation and sum all the results. Fig. 9 shows the structure of this implementation. The input data is distributed to the actors, which perform range interpolations.
Analysis and code generation
In our previous study [10] , we have already compared the results of single core architectures with and without accelerators. Therefore, in this study we do not generate any accelerators but only C code for the single core implementation. We execute the entire application on a single rocket core to get a reference point for performance comparisons.
The analysis results for the dual core implementation show the same results as the single core implementation and identify the cubic interpolations as the hot-spots of the actors. Therefore, the action that performs the interpolations is marked with the prefix ( __acc__ ) and an accelerator is generated for it. The only other part of the application where floating-point operations are performed is in the correlator. In order to avoid the floating-point unit (FPU) and save hardware resources, we marked the correlation action as a hot-spot and generated a second, smaller accelerator for the second actor to perform the correlation computations. The code generation results in parallel C code, to be compiled by the RISC-V gcc compiler and executed on two cores, and two different accelerators to be integrated to these cores. The generated C code for the second core uses two different custom instructions to communicate with the accelerators.
We apply the same modifications to the correlation action in the implementation with 13 cores. The analysis and code generation processes are identical to the dual core implementation. Each actor is mapped onto an individual core and necessary communication mechanism is generated together with the two accelerators.
Accelerator integration
The single core implementation does not have accelerators. Thus, we do not perform the accelerator integration step for this implementation. However, for the dual core implementation two different accelerators are generated. The first core executes the interpolation and therefore requires only the cubic interpolation accelerator, whereas the second core executes both the interpolation and the correlation. Hence, it requires both the cubic interpolation and the correlation accelerators. The custom instructions are bound to the corresponding accelerators inside the configurations of the rocket chip generator. The RoCC interface is extended to instantiate both accelerators and establish the proper connections between the cores and the accelerators. Both accelerators require more than two 64-bit inputs. Therefore, the custom instructions are called more than once to transfer the inputs. The inputs are stored in the extended interface until the last instruction call, which fires the accelerator, arrives. We use a bit field in the custom instruction to distinguish between the calls. The accelerators for the dual core and 13 core implementations are the same (cubic interpolation and correlation). Since the accelerators are already integrated into the rocket chip generator during the dual core implementation, there is no need to take this step for the 13 core implementation.
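The multi-call operand transfer can be modelled in a few lines. This is a behavioural sketch, not the extended RoCC interface itself: the class name, the boolean `fire` flag standing in for the bit field, and the use of a Python callable as the accelerator are all hypothetical.

```python
MASK64 = (1 << 64) - 1


class OperandBuffer:
    """Collects two 64-bit operands per call and fires the accelerator on the last call."""

    def __init__(self, accelerator):
        self.accelerator = accelerator  # hypothetical stand-in: a function of the operand list
        self.operands = []

    def custom_instruction(self, rs1: int, rs2: int, fire: bool):
        # Each call carries the two source registers of the custom instruction.
        self.operands += [rs1 & MASK64, rs2 & MASK64]
        if not fire:
            return None  # operands are only stored
        result = self.accelerator(self.operands)
        self.operands = []  # ready for the next operation
        return result


if __name__ == "__main__":
    buf = OperandBuffer(sum)
    buf.custom_instruction(1, 2, fire=False)
    print(buf.custom_instruction(3, 4, fire=True))  # 10
```

Only the final call returns a value, which corresponds to the core receiving the accelerator result in the destination register of the last custom instruction.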
System integration
The single core implementation does not require any system integration. However, in order to generate a new architecture with the rocket chip generator, one needs to add a Chisel class to the configurations of this tool, where the cores are instantiated and configured. These configurations include memory sizes, data width between components and availability of components such as the FPU. The class for the single core implementation instantiates a tile consisting of a single core with an FPU, 4 KiB instruction cache and 256 KiB scratchpad memory. The memory sizes for the single core and dual core architectures are not based on any heuristics. They are the default values, which are larger than the minimum amount of required memory sizes, and do not impact the critical path. Hence, they are not modified. The required memory sizes are not generated automatically yet and they need to be determined by the developer.
The dual core implementation requires two tiles to be connected with the NoC. With the extensions described in this paper, the rocket chip generator generates the necessary 2D mesh network structure and connects the cores to the NoC routers. The network for two tiles is generated as a 1 × 2 matrix. The first core that computes the range interpolation is instantiated with a single accelerator. The other core computes both interpolation and correlation. Hence, it is instantiated with two accelerators. All of the floating-point operations computed in the application are executed on the accelerators. Therefore, no FPUs are used in the generated architectures to save resources. The rest of the configurations are kept identical to the single core implementation.
Due to the nature of XY routing, we need to instantiate a 4 × 4 mesh network with 16 routers for the 13 core implementation, even though only 13 of these routers are connected to a core. This architecture is illustrated in Fig. 10 . In the XY routing protocol, the packets travel first on the X axis and then on the Y axis. If the core in the last row sends a packet that is addressed to a core on the east, then the packet will go through routers that are not connected to any core. If these routers were not there, the packets would be discarded.
Fig. 10 . The architecture generated for the 13 core implementation of the autofocus criterion calculation.
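The need for the core-less routers can be checked with a small XY routing sketch. The row-by-row placement of the 13 cores (which puts the last core at column 0 of the fourth row) and the function name are assumptions for illustration.

```python
def xy_route(src, dst):
    """Routers visited from src to dst, moving along X first and then along Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


COLS, CORES = 4, 13
# Coordinates of routers that have a core attached (assumed row-by-row placement).
with_core = {(i % COLS, i // COLS) for i in range(CORES)}

path = xy_route((0, 3), (3, 0))  # core in the last row to the core in the north-east corner
coreless = [hop for hop in path if hop not in with_core]

if __name__ == "__main__":
    print(path)
    print(coreless)  # [(1, 3), (2, 3), (3, 3)]
```

The packet crosses the whole fourth row before turning north, so the three routers without cores in that row are on its path.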
The tiles in Fig. 10 consist of a rocket core, 4 KiB instruction cache, 128 KiB scratchpad memory, an accelerator, and a NoC interface. When compared to the other implementations of this case study, the memory size is decreased in order to decrease the length of the critical path on the FPGA implementation. However, this does not affect the performance results. The tiles are identical except the one in the last row ( Fig. 10 ) . The 12 tiles on the first three rows have the interpolation accelerator, whereas the tile on the fourth row has the correlation accelerator. The range interpolation actors in Fig. 9 are mapped onto the first 6 cores when counted row-wise. The beam interpolation actors are mapped onto the second set of 6 cores and the correlation actor is mapped onto the core in the last row.
The convolution
In this case study we compute the first convolution layer of GoogLeNet [23] . This application consists of 21 convolution layers; however, the first layer is the largest one in terms of convolution (filter) sizes and therefore it requires the largest accelerator. It has 7 × 7 filters, whereas the other layers consist of 1 × 1, 3 × 3, and 5 × 5 filters. This first layer represents roughly 5 to 10% of the overall application. This case study is used as a proof of concept to show that the design method can be used to generate specialized architectures for the machine learning domain.
The convolution size is 7 × 7 and the input image size is 224 × 224 in the RGB color space (padded to 230 × 230). The number of filters is 64 and the stride is 2. The image and the filters are stored as 3 × 230 × 230 and 64 × 7 × 7 matrices, respectively. With this configuration, the number of multiply and accumulate (mac) operations is 118,013,952.
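The mac count follows from the layer parameters, as the short calculation below shows. The function name is hypothetical, and the output size is assumed to be rounded down when the stride does not divide the span evenly.

```python
def conv_macs(padded: int, kernel: int, stride: int, channels: int, filters: int):
    """Return (output dimension, number of mac operations) of a square convolution."""
    out_dim = (padded - kernel) // stride + 1
    macs = out_dim * out_dim * kernel * kernel * channels * filters
    return out_dim, macs


if __name__ == "__main__":
    out_dim, macs = conv_macs(padded=230, kernel=7, stride=2, channels=3, filters=64)
    print(out_dim)  # 112
    print(macs)     # 118013952
```

Each of the 112 × 112 output positions needs 49 macs per color plane, for 3 planes and 64 filters.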
Application development
The convolution is implemented with three different approaches. The first approach is as a single actor with two actions namely applyFilter and accumulate . The applyFilter action performs the convolution using a 7 × 7 filter and a 7 × 7 area from one of the color dimensions of the image. The accumulate action stores the result of the applyFilter action and calculates the indices of the filter and the image matrices for the next convolution. At each iteration, the applyFilter calculates a result for one color dimension of the image. Thus, at every third iteration, the accumulate action accumulates three results of the applyFilter to produce an output pixel. In this first approach, the input image is stored in the actor.
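The interplay of the two actions can be sketched in plain Python. This is an illustration of the data flow only, not generated code: the function names, the nested-list data layout, and the tiny test sizes are assumptions, and a single filter is applied to all three color planes as suggested by the 64 × 7 × 7 filter storage.

```python
def apply_filter(image, filt, ch, row, col):
    """Mac over a k x k window of one color plane (k = 7 in the case study)."""
    k = len(filt)
    return sum(image[ch][row + i][col + j] * filt[i][j]
               for i in range(k) for j in range(k))


def convolve(image, filt, stride):
    """Produce one output map; every third apply_filter result completes a pixel."""
    k = len(filt)
    out_dim = (len(image[0]) - k) // stride + 1
    out = []
    for r in range(out_dim):
        out_row = []
        for c in range(out_dim):
            acc = 0.0
            for ch in range(len(image)):  # accumulate the R, G and B results
                acc += apply_filter(image, filt, ch, r * stride, c * stride)
            out_row.append(acc)
        out.append(out_row)
    return out


if __name__ == "__main__":
    image = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]  # 3 x 4 x 4 image of ones
    filt = [[1.0, 1.0], [1.0, 1.0]]                            # 2 x 2 filter of ones
    print(convolve(image, filt, stride=2))  # [[12.0, 12.0], [12.0, 12.0]]
```

In the accelerated versions only the body of `apply_filter` moves to hardware, while the index bookkeeping and accumulation stay in software.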
In the second approach we use 5 actors to implement the convolution. One actor stores the input image and distributes it to three other actors. Each of these actors receives an area of 7 × 7 from only one of the R, G and B images, performs the convolution, and sends the results to a final actor that accumulates the R, G and B results. The actors that perform the convolution use the applyFilter action, whereas the last actor uses the accumulate action.
In the last approach we store the R, G and B images directly in the actors, which compute the convolution to avoid the distribution of the image and consequently a large amount of on-chip communication. Therefore, we skip the actor that distributes the image and use only 4 actors; 3 actors to perform the convolution and 1 actor to accumulate the results from the convolution actors.
Analysis and code generation
The CAL implementations are fed to the TURNUS tool for the analysis. Predictably, the action that computes the convolution ( applyFilter ) is identified as the compute intensive part of the application. The analysis results of the single actor approach are presented in Table 1 . The weights that are given in the table are products of operation weights, number of executed operations, and number of the firings. It is obvious that the applyFilter action is the compute intensive part of the actor with 98% of all the weight. We added a hardware block that performs convolution on two 7 × 7 input matrices to our hardware library. In order to instantiate this hardware block, we add a soft pragma to the application code. When the code generator goes through the action that will be converted into hardware and finds a call to a function (procedure in CAL ) named conv7 , it instantiates the hardware block and arranges the necessary connections. Therefore, we put the mac operations into the conv7 function and call it within the applyFilter action. Fig. 11 shows how the conv7 pragma is added to the code. The computations within the action are replaced with a procedure call. The necessary calculations can be moved to the procedure. However, the content of the procedure is ignored during the code generation and the pre-defined hardware block is generated. The conv7 procedure requires the filter index (to choose the right filter) and the first element of the image matrix as parameters. These parameters are then used as inputs to the accelerator and forwarded via custom instruction calls. The 64 filters used for the convolution are stored in the hardware block. This block requires the filter index to choose the right filter and the 7 × 7 input image to start the computations. Finally, the applyFilter action is converted into a hardware accelerator that instantiates the hardware block and the rest of the application is converted into C with custom instruction calls.
Fig. 11 . Usage of the pragmas to enable the code generator to generate pre-defined hardware blocks. The left hand side code is changed into the right hand side code.
Using a library of manually implemented hardware blocks increases the efficiency of the hardware and enables re-use of the efficient hardware blocks, while decreasing the complexity and time consumption of the code generation. The hardware code generation is generic, which means it can be used to convert any code into hardware description. However, loops and usage of different data types for the computations are not supported yet.
While converting the multi-actor implementations, the same accelerator is generated for each actor that computes the convolution. For on-chip communication, the communication mechanism, that we integrated to our code generation tools, is used.
Accelerator integration
The difference between the accelerator integration steps of the case studies is the usage of the memory connections. The accelerators of the first case study do not require memory accesses because they receive the input data through the source registers of the custom instructions. However, in this case study, the input data to the accelerator is an index value for the filters and a 7 × 7 image that is relatively large. Therefore, instead of sending the whole image data, the core sends the address of the image data to the accelerator through the custom instruction. The interface, which is implemented manually by extending the RoCC interface, accesses the memory for the image data and forwards the memory response to the accelerator. The accelerator is fired when all the image data has been sent to it. The result from the accelerator is forwarded directly to the core.
The accelerators, which are generated for each approach, are the same. Therefore we perform the accelerator integration once and use the same extended RoCC interface for each core that utilizes an accelerator.
System integration
A single tile is generated for the single actor approach. The tile consists of a core, memories and the accelerator. The scratchpad memory size of this tile is increased to 1 MiB due to the memory requirements. The core retains an FPU because there are floating-point operations outside the accelerated action (in the accumulate action). In future work, these operations can be moved to a small accelerator to avoid the FPU and save hardware resources.
The 5 actor approach results in a 5 tile (or core) architecture. The first tile consists of 1 MiB memory as the image is stored in this tile. It acts as a storage. The other tiles have smaller memories (64 KiB) and only the tile that is generated for the accumulator actor has an FPU. Each tile that is generated for the convolution actors has an accelerator.
The last generated architecture consists of 4 tiles. Since the input image is distributed to the tiles where the convolution is performed, each of them has a 256 KiB memory. These tiles do not have an FPU. The tile that accumulates the results has an FPU but a very small memory (16 KiB).
Results and discussions
In prior studies [39, 40] , we presented the efficiency of software code generation (of Cal2Many) by comparing the performance results of generated software code to hand-written software code. The performance difference between the generated and hand-written implementations was 2 to 30% (in favor of the hand-written implementations). Additionally, our productivity analysis showed that the development effort was reduced by 25 to 50% in terms of development time and source lines of code. In another prior study [10] , we compared the generated hardware code to hand-written hardware code. The generated hardware showed 0 to 12% lower performance while utilizing 0 to 9% more hardware resources.
In this study we present results for generated general purpose architectures along with generated application specific architectures with different numbers of cores. Additionally, we compare the performance of the generated architectures to some commercial architectures. The results that are considered for the evaluation of the design tools and the generated architectures are the performance (execution time), timing (max clock rate), and area (hardware resource usage). The configuration for the core in the rocket chip generator is based on the tiny configuration. We executed the case studies on the cycle accurate emulators for performance results. The timing and area results are provided by Xilinx synthesis tools. The generated Verilog implementations are synthesized on a Xilinx VCU108 evaluation kit that features a Virtex Ultrascale XCVU095 FPGA.
Case study 1 -autofocus criterion calculation
We used the parallel implementations that we developed in this study together with the sequential implementation from a prior study [10] of the autofocus criterion calculation to evaluate the generation of single core, dual core and manycore architectures. The rocket core that is used in this study has a smaller configuration and therefore a smaller size. It achieves higher clock frequency due to shorter critical path, however, shows lower performance when compared to the prior study due to using fewer hardware resources for computation and memory management.
The single core architecture is generated without an accelerator. Two different dual core architectures are generated for the dual core implementation. One of the architectures features an FPU, whereas the other one uses accelerators. The 13 core implementation is executed on cores without FPU and with accelerators. The instruction cache size for all cores is 4 KiB, whereas the scratchpad memory size is 256 KiB for single and dual core architectures and 128 KiB for the 13 core architecture. The implementations are tested with images, each consisting of 6 kernels. A kernel is a 6 × 6 pixel matrix. Half of the kernels overlap with each neighbour kernel both on row and column directions. Hence the input image size is 12 × 9. Table 2 presents the performance results, whereas Table 3 presents the hardware resource usage and timing results for all versions of the implemented architectures. The transition from a single core architecture to a dual core architecture shows the effect of parallelization, which increases the performance by a factor of 1.48 in terms of clock cycles. The resource usage doubles for memory and DSP units; however, some components of the architecture are instantiated only once regardless of the core count. Therefore, the usage of LUTs and FFs does not increase exactly by a factor of 2. The clock frequency decreases due to the size of the architecture. The hardware resources such as block RAMs (BRAM) and DSPs are distributed on the fabric. When more of these are needed, resources from far corners of the fabric might need to be connected. Therefore, the wire delays become longer and contribute to the critical path. This situation re-occurs when the architecture size increases to 13 cores. The difference between the dual core architecture results shows the effect of replacing FPUs with accelerators. This process increases the performance by 3.1 × , while decreasing the LUT and FF usage. The BRAM utilization increases by a small amount while the DSP count increases significantly, which is an expected result of having accelerators.
When the number of cores is increased to 13 and each core is equipped with an accelerator, the performance (in terms of clock cycles) increases by a factor of 26.7 in comparison to the single core architecture. When compared to the dual core with accelerators, despite the increase of 6.5 × in the core count, the performance increases by 5.8 × . There are two main reasons behind not reaching the theoretical speed-up. The first one is the overhead of the communication between the beamer and correlator actors. The second one is the serialization process within the correlator . This actor reads the results of a beamer and performs correlation before reading the results of the next beamer . Since the chain of the ranger and the beamer actors are identical, their computation times are equal. Thus, they produce and send the outputs to the correlator actor at the same time. This creates a small queue and additional waiting time that contributes to the execution time. There can be different trade-off cases with the architecture of this application. One could keep the correlation within the beamer actors to achieve 6.5 × speed-up, however, that would cost additional hardware resources for the correlation accelerator. Another approach to increase the performance could be using 6 cores instead of 1 core for the correlation. However, this would also increase the hardware resource usage, even more than the previous approach.
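The reported factors can be cross-checked with a few lines of arithmetic. The variable names are illustrative; the three factors are the rounded values quoted in the text, so the product only approximately reproduces the overall figure.

```python
single_to_dual = 1.48  # single core -> dual core (both with FPU)
fpu_to_accel = 3.1     # dual core with FPU -> dual core with accelerators
dual_to_13 = 5.8       # dual core with accelerators -> 13 cores with accelerators

core_ratio = 13 / 2    # 6.5 x more cores
overall = single_to_dual * fpu_to_accel * dual_to_13
efficiency = dual_to_13 / core_ratio

if __name__ == "__main__":
    print(round(overall, 1))     # 26.6, close to the reported 26.7
    print(round(efficiency, 2))  # 0.89
```

The chained factors agree with the overall speed-up up to rounding, and the scaling step from 2 to 13 cores reaches roughly 89% of the ideal 6.5 × .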
In terms of resource usage, the increase is less than 6.5 × . This is again due to some of the components being instantiated once, regardless of the core count. Additionally, in the architecture with two cores, one of the cores has two accelerators, whereas in the 13 core architecture each core has a single accelerator. The number of interpolation accelerators increases to 12; however, the number of correlation accelerators does not change in the transition to 13 cores. It is clear that the number of accelerators does not increase by 6.5 × .
To summarize, parallelization increases the performance as well as the hardware resource usage. Specialization increases the performance as well, however, in certain cases it might decrease the hardware resource usage as seen in the dual core implementations.
Comparison to prior studies
The same application was implemented and executed on different platforms in prior studies conducted by members of our group [21, 22]. These platforms are the Epiphany E16G3 and the Intel i7-M260, as seen in Table 4. The table presents 3 prior implementations from [21] and [22], and 2 recent implementations from this study. The Intel results are given as a reference point. In terms of throughput, the Intel and Epiphany platforms outperform our platforms. However, our clock frequency results are from FPGA implementations. If implemented as ASICs, the clock frequencies should increase dramatically, as the major parts of the critical paths are wires. For instance, if our 13 core architecture ran at 1 GHz, the throughput would be 195,312 pixels per second, which is better than the 13 core Epiphany result by a small margin.
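The 1 GHz projection relies on throughput scaling linearly with the clock frequency when the cycle count per pixel is fixed. The sketch below back-calculates the cycles per pixel from the projected figure instead of reading it from Table 4, so that value is an inferred assumption. The 100 MHz frequency in the last lines is purely hypothetical and is not a measured result.

```python
def cycles_per_pixel(clock_hz: float, pixels_per_second: float) -> float:
    """Cycle count per pixel implied by a clock rate and a throughput."""
    return clock_hz / pixels_per_second


def throughput(clock_hz: float, cpp: float) -> float:
    """Pixels per second for a given clock rate and cycles per pixel."""
    return clock_hz / cpp


# Inferred from the projection in the text: 195,312 pixels/s at 1 GHz.
cpp_13_core = cycles_per_pixel(1e9, 195_312)

# Hypothetical clock rate, chosen only to show the linear scaling.
hypothetical = throughput(100e6, cpp_13_core)
print(f"cycles per pixel: {cpp_13_core:.0f}")
print(f"throughput at 100 MHz (hypothetical): {hypothetical:.1f} pixels/s")
```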
A single Epiphany core outperforms our single core platform in cycles per pixel. However, when we replace the FPUs with accelerators, our platform shows slightly better performance than the Epiphany architecture. These results are seen in Table 4, in the rows with 13 cores. The parallel implementation of the application that runs on the Epiphany is slightly more optimized than the implementation executed on our 13 core platform. The performance of our platform can be increased further by optimizing the software implementations.
Case study 2 - convolution
We tested the convolution implementations on four different architectures. The first architecture is a single core with FPU, the second is a single core with FPU and accelerator, the third has 5 cores, and the fourth has 4 cores with different configurations. The first architecture is used as the reference point for comparison. All implementations perform 32-bit floating point calculations and therefore do not aim to compete with other CNN implementations that use reduced precision. Table 5 provides the performance results of these architectures, where one can see that the performance of the single core architecture, while executing the application, increases by a factor of 4 when the accelerator is integrated and utilized.
Convolution of an image area of 7 × 7 (49 MAC operations) takes approximately 800 cycles on the core (ignoring any memory overhead). The same operation takes 98 cycles on the accelerator, including the 75 cycles taken by the 49 memory accesses for the input image. In terms of computation, the accelerator is 8 times faster than the core. However, the application does not consist of computations only. The control operations such as scheduling and function calls (actions in CAL), together with the accumulate function, contribute to the execution time. Additionally, the core needs to perform an address calculation to be able to send the address of the sub-image to the accelerator. The address calculation costs 4 loads, 4 multiplications and 3 additions and takes 23 cycles for the three dimensional input image. This contributes around 55M cycles to the execution time out of a total of 520M, as seen in Table 5. In the future, with changes in the code generation tool, the address calculation can be moved to the accelerator and executed faster.
The weight, which is generated through the analysis as an abstraction of the execution time, is 98% for the applyFilter action according to TURNUS. (This should not be confused with the weights used in convolutional neural networks (CNNs).) The analysis does not take the scheduler into account. However, when the CAL code is converted into C, the actions become functions and the scheduler becomes a state machine with if statements and calls to these functions. The scheduler adds an overhead to the execution time. With all the overheads, the applyFilter action takes around 86% of the execution time in the C implementation. Consequently, despite running the action 8 times faster with an accelerator, the overall speed-up for the whole application is 4×.
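The relation between the accelerated share of the execution time and the overall speed-up can be checked with Amdahl's law. This is a standard model brought in here for illustration, not a calculation taken from the paper; the function name is hypothetical and the inputs are the percentages and the 8× factor quoted above.

```python
def amdahl_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Overall speed-up when only a fraction of the run time is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be within [0, 1]")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / local_speedup)


# applyFilter takes about 86% of the C implementation's execution time
# and runs 8 times faster on the accelerator.
with_overheads = amdahl_speedup(0.86, 8.0)

# For comparison: the 98% weight reported by the analysis, which ignores
# the scheduler.
analysis_only = amdahl_speedup(0.98, 8.0)

print(f"86% accelerated: {with_overheads:.2f}x")
print(f"98% accelerated: {analysis_only:.2f}x")
```

With 86% of the time accelerated, the model gives a speed-up of about 4×, in line with the measured result. With the 98% weight it would be about 7×, which shows how much the scheduler and the other overheads cost.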
In the 5-core architecture, three of the cores are utilized for the actual convolution computation. The impact of the accumulator core is not significant in this case. Therefore, the actual computation power is tripled, and the expectation is a 12× performance improvement compared to the reference architecture. However, the distribution of the input image adds a communication overhead that reduces the performance. Therefore, the speed-up stays slightly below 12× (in terms of clock cycles).
The 4-core implementation was developed mainly to remove the communication overhead. Additionally, by removing the core that acts as storage, we save hardware resources and the clock rate becomes higher. The computation power is the same as in the 5-core implementation; however, the speed-up is higher than 12×. This is mainly due to performing fewer operations for the address calculation while sending the sub-image address to the accelerator. Additionally, the schedulers are smaller because the convolution and the accumulation functions are divided onto different cores.
In the 4-core implementation, the input image is divided into R, G and B images, which are stored on separate cores as two dimensional arrays. Calculating an address within these arrays requires two multiplications and two additions. This is two multiplications and one addition fewer than in the single core implementation.
The clock frequency differs between the single core architectures generated for the different case studies. The architecture generated for the first case study runs at 128 MHz, whereas the architecture for the second case study runs at 113 MHz. The reason for this difference is the utilization of the block RAMs (BRAMs). The first architecture has 256 KiB of scratchpad memory and requires 66 BRAMs, while the second architecture requires 258 BRAMs due to having 1 MiB of scratchpad memory. Since the BRAMs are distributed over the FPGA fabric, utilizing more of them requires longer connections. This results in a longer critical path and consequently a lower clock frequency.
The details of the resource usage results are given in Table 6. The accelerator is larger than the previous accelerators. It features 49 (32-bit) floating-point multipliers, 48 (32-bit) floating-point adders, and 64 filters. Each filter consists of 7 × 7 32-bit floating point numbers. The DSPs are used by the multipliers. The convolution hardware block that is instantiated within the generated accelerator was developed and added to our hardware library within a few days. With the design tools, it can be instantiated instantly.
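The address calculations compared above can be sketched as follows. This is one possible decomposition that matches the quoted operation counts (four multiplications and three additions for the three dimensional image, two multiplications and two additions for a two dimensional plane). The row-major layout, the scaling by the element size, and all names are assumptions made for illustration; the generated C code may arrange the computation differently.

```python
ELEMENT_BYTES = 4  # 32-bit floating point values, as stated in the text


def address_3d(base: int, channel: int, row: int, col: int,
               height: int, width: int) -> int:
    """Byte address in a channel-major 3D image (assumed layout)."""
    plane_offset = channel * height * width  # 2 multiplications
    row_offset = row * width                 # 1 multiplication
    index = plane_offset + row_offset + col  # 2 additions
    return base + index * ELEMENT_BYTES      # 1 multiplication, 1 addition


def address_2d(base: int, row: int, col: int, width: int) -> int:
    """Byte address in a single 2D colour plane (assumed layout)."""
    index = row * width + col                # 1 multiplication, 1 addition
    return base + index * ELEMENT_BYTES      # 1 multiplication, 1 addition


# Hypothetical base address and dimensions, used only to exercise the code.
addr_3d = address_3d(0x1000, channel=2, row=3, col=5, height=10, width=20)
addr_2d = address_2d(0x1000, row=3, col=5, width=20)
print(hex(addr_3d), hex(addr_2d))
```

Storing each colour plane on its own core removes the plane offset from the calculation, which is where the saved two multiplications and one addition come from in this decomposition.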
The distributor core in the 5-core architecture has 1 MiB of scratchpad memory and no FPU. The three computation cores have 64 KiB of memory and an accelerator each. The accumulator core has 64 KiB of memory and an FPU. The memory of the cores can be decreased to reduce the BRAM utilization. However, the storage/distributor core needs around 634,800 bytes just for the input image. The LUT and FF usage of the 5-core architecture is respectively 4 and 6 times higher than that of the reference architecture. The number of block RAMs increases by 70, whereas the DSP usage increases by a factor of 21. These increases are irregular due to the heterogeneous nature of the 5-core architecture. The 4-core architecture behaves similarly in terms of hardware resource usage. In order to optimize this usage and increase the clock rate, we removed the distributor core. We used 256 KiB of memory in each computation core, which sums up to 768 KiB in total (for storing the entire image together with the other data and the code). Additionally, we reduced the memory size of the accumulator core. As a result, the BRAM utilization decreases from 327.5 to 202. The decrease in resource usage allows a higher clock rate.
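The memory sizing of the 4-core architecture can be checked with a short calculation. The image size and the per-core memory size are taken from the text; the equal split of the image across the three computation cores is an assumption based on the R, G and B planes having the same dimensions, and the function name is illustrative.

```python
KIB = 1024


def memory_budget(image_bytes: int, cores: int, per_core_bytes: int) -> dict:
    """Per-core headroom after storing an equal share of the image."""
    image_share = image_bytes // cores  # assumes equally sized planes
    return {
        "total_bytes": cores * per_core_bytes,
        "image_share_bytes": image_share,
        "headroom_bytes": per_core_bytes - image_share,
    }


budget = memory_budget(image_bytes=634_800, cores=3, per_core_bytes=256 * KIB)
for name, value in budget.items():
    print(f"{name}: {value:,}")
```

Under this assumption each computation core holds about 211,600 bytes of image data, leaving roughly 50 KB of its 256 KiB for the other data and the code.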
CNN applications usually use lower precision to represent the numbers. In the future, the implementations and the hardware block for the convolution can be optimized further to use fewer bits per number. This will lead to smaller accelerators, less memory usage, a higher clock rate and higher performance.
Conclusions
The performance requirements of today's applications push computer architectures towards parallelism and specialization. However, designing specialized architectures with multiple cores, or exploring their design space, is a substantial challenge. In this paper we address this challenge and propose a design method that can generate application specific or domain specific manycore architectures with software tools that automate the steps of the method. The tools used in the method allow the generation of application specific accelerators directly from the application code written in a dataflow language, as well as manycore architectures with different configurations including the number of cores, memory types and sizes, and the availability of certain components (FPU, accelerator, etc.).
The proposed design method facilitates the development of specialized (heterogeneous) manycore architectures with different configurations. This allows developers to test their applications on different architectures or to generate a specific architecture for the application at hand. The results show that the specialized architectures perform better than general purpose architectures. In certain cases they might even reduce the area usage of the architecture by removing large, general purpose components such as the FPU. Otherwise, the area of the architectures increases when they are equipped with application specific accelerators. The clock frequency of the FPGA implementations currently decreases when the number of cores increases, due to the placement of the memory resources. However, with an ASIC implementation, where the hardware resources can be placed arbitrarily, the clock frequency should not decrease when the architectures scale up, because the tiles are encapsulated.
We believe that computer architectures will continue to move towards parallelism and heterogeneity. Therefore, a fully automated, open design method with a large library of hardware blocks performing different computations will be an essential tool for exploring the design space for new architectures.
Future works
The future work on the automation of the design method is already given in a previous section. In this section we share further ideas that can be pursued in the future.
We have not used the actor composition and decomposition features of the Cal2Many tool. These can be used when there are constraints on the number of cores or on memory sizes. Therefore, an application together with some constraints can be used as a case study. A tool could even be integrated into the framework to automatically choose the configurations based on the constraints.
We have integrated a two dimensional mesh NoC into the rocket chip generator. Different configurations of the current NoC, such as data bus sizes, number of physical meshes and buffer sizes, can be explored. New NoC implementations with different topologies can be integrated into the rocket chip generator. This tool can be extended further with new cores to provide more design options. A new interface can be added, or the RoCC interface can be extended, to connect the accelerators directly to the NoC. A new front-end can be developed to generate Chisel code from different languages. The hardware generation back-end does not support loops yet; this support can be added. The generated architectures are evaluated on FPGA platforms; however, they can be implemented as ASICs in a further study. In short, the tools and components that are used in the realization of the design method can be extended or replaced to support more configurations and further design space exploration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
