In cloud computing, software is transitioning from monolithic to microservices architectures to improve the maintainability, upgradability and flexibility of applications. Applications are able to request a service with different implementations of the same functionality, including hardware accelerators, depending on cost and performance. This model opens up a new opportunity to integrate reconfigurable hardware, specifically FPGAs, in the cloud to offer such services. Many research works discuss solutions for this problem, but they focus primarily on the high-level aspects of the resource manager, hypervisor or hardware architecture. The low-level physical design choices of the FPGA that maximize the accelerator allocation success rate (called serviceability) are largely untouched. In this paper, we propose a design space exploration algorithm to determine the best configuration of partially reconfigurable regions (PRRs) to host the accelerators. In addition, the algorithm is capable of estimating the actual resources occupied by the PRRs on the FPGA even before floorplanning. We systematically study the effects of having more PRRs on the system in various aspects, i.e., serviceability, waiting time and resource wastage. The experiments show that, at a certain number of PRRs, up to 91% serviceability can be achieved for 12 concurrent users. This is a significant improvement over the 52% achieved without our approach. The average amount of time that each request has to wait to be served is also reduced by 6.3X. Furthermore, the cumulative unused FPGA resources are reduced almost by half.
[Figure 1 labels: Request; Service A; VM0, VM2, ..., VMn; Scale-out Horizontally; Scale-out Vertically]
INTRODUCTION
Software architecture has been evolving from monolithic to modular. In the former case, the software stack is written as a long stretch of tightly coupled code. The functions are highly dependent on each other. Library compatibility between functions also causes difficulties in deploying complicated software with a large code base. In the modular approach, the software is decomposed into smaller, loosely-coupled (if not independent) pieces called microservices [8]. This addresses the aforementioned problems of the monolithic architecture. Another advantage of microservices is scalability. Since the services are deployed independently, they can be replicated quickly to handle growing processing demands. The services can be mixed and matched with different performance and memory characteristics, or even with platforms from several vendors. The purpose is to utilize the resources more efficiently [26], as long as the communication protocols are compatible [25]. The microservices architecture fits well with the use of an FPGA as an accelerator for an application. Conventionally, a software-based implementation of a service is loaded into memory and CPU when requested. Now, a hardware accelerator, in the form of a bitstream, can be configured onto the FPGA instead. It runs just as any other microservice, as shown in Figure 1. The presence of the FPGA is transparent to the users [30]. There are many efforts from both industry and the research community attempting to incorporate FPGAs into the cloud [1, 3, 4, 10, 13, 14, 18, 22, 24, 27, 29, 30, 33, 34]. The most common feature that these works suggest is the use of partial reconfiguration (PR) [32] to provide virtualized FPGA-based computing resources.
The aforementioned works cover various aspects of FPGAs in the cloud, such as (1) the high-level hypervisor that provides the virtualization service; (2) the middleware that manages, selects and reconfigures the PR regions (PRRs); and (3) the system architecture on the FPGA with predefined PRRs. The size of the PRRs, their utilized FPGA resources and their quantity are neither discussed nor analyzed. Those PRRs are mostly fully mapped, i.e., each of them can host all of the existing accelerators (hereafter called PR modules or PRMs). In this case, valuable FPGA resources are wasted if the smaller PRMs are used most of the time. Additionally, the larger these PRRs are, the fewer of them can be implemented on a fixed-size FPGA. The number of concurrent users (or tenants) who can have access to the FPGA becomes lower. In other words, if the number of tenants is kept the same, their access requests will be rejected more often. This results in low serviceability, i.e., a low ratio of serviceable requests. Thus, the number and sizes of the PRRs and the mapping of PRMs to PRRs should be optimized to maximize the serviceability of the system. As a result, the mapping of PRMs to PRRs should be heterogeneous instead of homogeneous.
It is common knowledge that having more PRRs delivers higher parallelism. Nevertheless, as the numbers of PRRs and PRMs increase, it becomes harder, and eventually almost impossible, to analyze and implement the system manually. The vendor system architect (hereafter called the architect) has to simultaneously determine the best number of PRRs and the mapping of PRMs to PRRs while making sure that they can physically fit in the designated FPGA. Regarding the last task, the architect has to do the actual floorplanning for the PRRs on the FPGA based on the PRM-PRR mapping. However, the current PR development flow requires him/her to do that manually [32]. It is especially tedious at the design space exploration (DSE) stage, where the system architecture is not yet finalized. Therefore, in this work, we propose a novel automation tool that considers all of these aspects. Our contributions are as follows.
• An optimization algorithm to perform the DSE to find the best possible number of PRRs and the mapping of the PRMs to those PRRs (one PRM can be mapped to multiple PRRs) based on the existing PRM pool.
• An estimation method to quickly estimate the actual FPGA resources occupied by the PRRs based on the requested resources during the mapping process. We also integrate an automatic floorplanner [19] after mapping to verify that the resulting system can physically fit in the designated FPGA.
• A lightweight simulator to assess the serviceability of the system under test during the DSE.
The contributions presented in this work are not limited to microservice-based applications. They are applicable to all types of PR-based dynamic systems such as [6, 16] . In these systems, the list of hardware accelerators is known at design time. However, their actual usage is only known and determined at runtime. The behavior of those applications at runtime resembles the concurrent users in the cloud environment. Our approach can be used as another layer of optimization during the architectural design phase.
We focus primarily on studying the advantages of having more PRRs and how to map the PRMs to them. Therefore, we need a flexible PR-based system template that can be generated with varying numbers of PRRs. In addition, the resource requirements of the PRMs relative to the FPGA should not be too high, so that a large number of PRRs can be implemented. These requirements are essential to assess the scalability of our mapper.
The experiments are carried out with the architecture template taken from [20]. Since their target FPGA is the Xilinx Virtex-6, we assume the same chip throughout the article. We collect a set of 50 real-world hardware accelerators, synthesized by Xilinx Vivado HLS version 2016.3 and ISE version 14.7, from different publicly available sources [9, 12, 21, 31]. The results show that the best system we can find is the one with 11 PRRs. It can maintain a serviceability of more than 91% for up to 12 concurrent users. In contrast, only 52% serviceability can be achieved for the homogeneous system with 4 PRRs. The average time the tenants have to wait to be served is reduced by 6.3X. In addition, the cumulative wasted FPGA resources are reduced by 50%, which could potentially improve the energy efficiency of the system. More importantly, during the DSE process, we find that the serviceability only increases with the number of PRRs up to a certain point; after that, it degrades. Therefore, it is very important to perform our proposed DSE to find that sweet spot for a specific system and usage scenario.
The remainder of the paper is organized as follows. The recent works on PR for shared FPGAs in the cloud are discussed in Section 2. The proposed approach is presented in Section 3, followed by experimental results in Section 4. Finally, the conclusions and future work are presented in Section 5.
RELATED WORKS
The integration of FPGAs in cloud computing to accelerate computation is becoming a major trend in both industry and academia. In this section, the related works are discussed in more detail.
Chen et al. [4] discuss four major issues in bringing FPGAs into the cloud: abstraction layers, sharing of resources, compatibility between FPGA tool chains and, finally, security. The main contribution of [4], similar to [1, 18, 27, 34], is to propose a general framework and guidelines to tackle those issues, focusing on the high-level layer where the resource manager, scheduler and hypervisor run. However, the authors overlook the partitioning details of the accelerator slots, i.e., the PRRs, which also affect the serviceability of the system when users request the resources.
Published at the same time as [4] , the work by Byma et al. [3] also proposes a general framework to integrate FPGA with PR hardware accelerators into existing cloud computing models. The PRRs are considered as generic cloud resources in OpenStack to provide seamless FPGA virtualization similar to regular virtual machines. Even though the authors acknowledge the need to have mixed size PRRs to improve the flexibility of the system and propose it as future work, no further analysis has been done.
The cloud management and hypervisor named RC3E developed by Knodel et al. [14] offers multiple models for utilizing FPGA, either full access to the reconfigurable resource (entire FPGA), Reconfigurable Silicon as a Service; or only part of it with the introduction of PR virtual-FPGAs, Reconfigurable Accelerators as a Service and Background Acceleration as a Service. In the last two cases, an FPGA can host up to 4 virtual-FPGAs. The work by Weerasinghe et al. [29] takes an alternative approach by not only utilizing virtual FPGA concept but also proposing an infrastructure to allow large-scale deployment of FPGAs across the cloud. Nevertheless, as in the case of [14] , further information about the virtual FPGA physical implementation is not provided.
Session: High-Level Abstractions and Tools I FPGA '20, February 23-25, 2020, Seaside, CA, USA
Fahmy et al. [10] present a framework for cloud computing with virtualized FPGA accelerators in a similar method as [14]. The authors do suggest partitioning the FPGA into various-sized PRRs and use a greedy approach to allocate the hardware accelerators to PRRs. For each accelerator usage request, the smallest PRR that can host it is reserved. The purpose is to maximize the possibility of configuring the larger accelerators for the later requests. If there is no such free PRR, the request is rejected and processed in software. Unfortunately, the impact of the size and the number of PRRs on the success rate of serving the requests is not analyzed. The authors also do not discuss how the PRMs are mapped to the PRRs.
The hypervisor proposed in [30] provides insight into how the PRMs/PRRs can be managed with a similar approach to the software-based system. The performance and operation of the hypervisor are assessed on a PR system with 3 PRRs. These PRRs have different sizes. Each of them can only host a subset of the PRMs used in the experiments. Similar to [10], the authors do not study the impact of the PRRs on the overall performance of the system.
Zhao et al. [33] introduce the hardware project management and building tool called hCode2.0. It provides an easy-to-use framework to map and generate partial bitstreams for the accelerators within a shell. A shell is a system architecture with placeholders for PRRs. When a new accelerator is imported to a particular shell, it will be mapped to all of the available PRRs which have sufficient resources. Unfortunately, the tool needs a pre-defined system architecture as input with already-placed fixed number of PRRs.
The work [13] approaches the problem from a different perspective. Instead of having multiple PRRs on one FPGA for the tenants to share, they target the case where the tenants need one common accelerator. They propose a load balancing and monitoring framework to manage the bandwidth and the request rates from the tenants. However, this method is only suitable for the servers with one dedicated acceleration service. It may not be applicable for a general microservice environment.
In the embedded systems domain, there are several similar attempts in trying to find a suitable system architecture for the applications [5, 7, 23] . These works start by analyzing the task graph of the application. After that, they optimize the mapping and scheduling of those tasks (with the corresponding PRMs) on either software/FPGA or FPGA-only systems with different numbers of PRRs. However, their methods stop mapping the PRMs to PRRs once a feasible schedule is found. Our method, on the other hand, does not work at the task graph because it is not known at design time. We instead optimize the distribution of PRMs to PRRs to maximize the chance of finding a compatible PRR for a PRM at runtime.
All these works discuss many interesting high-level and system-architecture aspects of integrating FPGAs as virtualized resources into the existing cloud infrastructure. Still, the idea of improving the success rate in allocating accelerators upon requests from users by optimizing the number of PRRs as well as their sizes is untouched.
PROPOSED APPROACH
3.1 Design Space Exploration
The proposed DSE flow chart is illustrated in Figure 2. There are six major steps. It starts by obtaining the PRMs' resource requirements (Step 1). The requests from each tenant are generated randomly in Step 2 (described in Section 3.4). The tenants in the cloud environment are independent of each other [30]. Thus, we can mimic the behaviors of multiple tenants by simply combining their requests. Thereafter, in Step 3, the PR-HMPSoC template provided by Nguyen et al. [20] is used to create the PR systems with different numbers of PRRs. PR-HMPSoC is chosen for experimental purposes; it does not represent an actual cloud-based system. None of the works in [3, 4, 10, 14, 29] is designed as a flexible template that yields systems with a varying number of PRRs. Therefore, we are unable to use them in the experiments. Nevertheless, our DSE approach is agnostic to the underlying architecture.
Figure 2: Our proposed DSE flow used to find the optimal configurations for PRRs from the accelerator pool and system template. Our contributions are in light orange. [Flow chart labels: Step 1 - Pool of Accelerators; Step 2 - Generate Requests; Step 3 - Gen.&Synth. PR System; Step 4 - PRM-PRR Mapper; Step 5 - Floorplan the Design; Step 6 - Simulate the System; Increase the Number of PRRs; Failed; Stop]
Afterwards, each PR system is synthesized for the Virtex-6 by Xilinx ISE 14.7 to get the real resource requirements of each component from Step 3. This information and that obtained from Step 1 are fed to our PRM-PRR mapper, presented in Sections 3.2 and 3.3, to calculate the best mapping of PRMs to the PRRs (Step 4). The tenants' requests are not known a priori by the PRM-PRR mapper.
Next, we use the PRFloor tool [19] to floorplan the system (Step 5). This step is required to make sure that the resulting system can physically fit in the FPGA. It may happen that the PRM-PRR mapper is not able to determine a proper mapping. PRFloor may also fail to find a feasible floorplan. In both cases, the cause is the limited FPGA resource capacity. The PRM-PRR mapper always tries to map at least 2 PRMs to each PRR. If the number of PRRs is too high, the total resources required by the system may exceed those of the FPGA. In floorplanning, the more PRRs are requested, the larger the system becomes, with many more supporting static components, and the more difficult it is for PRFloor to find a floorplan.
Step 6 is executed when both PRM-PRR mapper and PRFloor have run successfully. Our simulator takes the requests and system specification (list of PRMs, PRRs and PRM-PRR mapping) generated from Step 2 and 5 respectively to simulate the system. The quality metrics (serviceability, waiting time and resource wastage) of the current system are then recorded as one design point to compare with the others. The simulator is presented in Section 3.4. Finally, the requested number of PRRs is increased to go through another iteration with a new system architecture.
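The iteration over Steps 3 to 6 can be summarized in a short sketch. The names `explore`, `build_system`, `map_prms`, `floorplan` and `simulate` are hypothetical placeholders, not the interfaces of the actual tools, and the sketch assumes that a failed mapping or floorplan ends the exploration, as the flow chart indicates.

```python
def explore(max_prrs, build_system, map_prms, floorplan, simulate, min_prrs=1):
    """Collect one (num_prrs, metrics) design point per feasible PR system."""
    design_points = []
    for num_prrs in range(min_prrs, max_prrs + 1):
        system = build_system(num_prrs)           # Step 3: generate and synthesize
        mapping = map_prms(system)                # Step 4: PRM-PRR mapper
        if mapping is None:                       # no feasible mapping: stop
            break
        if floorplan(system, mapping) is None:    # Step 5: no feasible floorplan: stop
            break
        design_points.append((num_prrs, simulate(system, mapping)))  # Step 6
    return design_points


# Toy stand-ins (assumptions): the mapping becomes infeasible at 4 PRRs.
points = explore(
    max_prrs=10,
    build_system=lambda k: {"num_prrs": k},
    map_prms=lambda s: "mapping" if s["num_prrs"] < 4 else None,
    floorplan=lambda s, m: "floorplan",
    simulate=lambda s, m: {"serviceability": 0.1 * s["num_prrs"]},
)
print([k for k, _ in points])
```

Each recorded design point can then be compared on serviceability, waiting time and resource wastage.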
When the system architecture and the PRM-PRR mapping are finalized, a TCL script can be used to instruct the Xilinx tool to generate the partial bitstreams. This process may take a significantly long time to complete. This problem can be alleviated by using a bitstream-relocation-aware PRM-PRR mapper, floorplanner and related low-level techniques. However, it is not considered in this work.
There might be a concern that having too many processing elements will affect the overall performance of the system. Specifically, the memory/peripheral access contention may increase. Our approach is designed not to depend on the internal architecture. Since a system generator knows its own architecture best, it should provide its own performance analysis. When those metrics worsen, it should report the system as invalid, and our DSE flow will stop processing further. Those metrics can also be combined with ours to explore a Pareto multi-objective optimization.
PRM-PRR Mapper
The problem we are trying to solve in this work is to serve as many hardware accelerator requests as possible. However, since the FPGA resource is limited, it is impossible to implement all accelerators on the FPGA. Conventionally, the architect tried to analyze the application use-cases to generate a set of predefined FPGA configurations; each contains a set of accelerators. When a request for the accelerators is received, one bitstream is chosen to reconfigure the whole device [15] . This method is impractical in the cloud environment where the behaviors of the tenants are nondeterministic [30] ; even if they are, the number of use-cases will explode exponentially. Furthermore, triggering the full reconfiguration will interrupt other tenants who are sharing the same FPGA.
As a result, PR systems are the most suitable platforms in this case. The FPGA is partitioned into multiple slots (called PRRs), and the accelerators (or PRMs) can be dynamically loaded into those slots when needed. Unfortunately, there are several technical challenges. The PRRs must be specified at design time [32]. The architect has to decide which PRRs a PRM can run on in order to define the sizes and required resources of the PRRs appropriately. The mapping decision is affected by many factors, such as how often the PRMs are used, how many PRRs each PRM should be mapped to, what types of resources each PRM requires, how many resources are actually occupied by the PRRs after floorplanning, etc. For this reason, we propose the automatic PRM-PRR mapper. The optimization goal of the mapper is to map one PRM to as many PRRs as possible to maximize the chance of finding a suitable PRR when that PRM is requested. Since the PRRs are heterogeneously mapped, there are more PRRs compared to the homogeneous case. Hence, more tenants can be served at the same time.
Based on the characteristics of the goal, we model it as an Integer Linear Programming (ILP) problem [28]. We define the function $F(PRM_i) = F_i$ to represent the number of PRRs that $PRM_i$ is mapped to. The naive objective is to maximize the sum of these $F_i$ as shown in Equation 1.
Maximize: $\sum_{i=1}^{n} F(PRM_i)$ (1)
Figure 3: The issue of one PRM being mapped to too many PRRs compared to the others. In a), PRM1 is mapped to PRR1-4 while PRM3 is only mapped to PRR3. In b), a better mapping, PRM1 is mapped to PRR1-3 and PRM3 is mapped to PRR4-5.
There are three issues with this objective.
• Issue 1: the architect, after analyzing the tenant usage behaviors, may observe that several PRMs are more common than the others. It would be better to increase their $F_i$.
• Issue 2: some PRMs can be mapped to too many PRRs compared to others, affecting the fair share of FPGA resources between PRMs, as illustrated in Figure 3. In Figure 3a, PRR1-4 can host PRM1 and PRM2; PRR5 can host PRM2 and PRM3. If PRM1 is mapped to PRR5, there will not be enough DSP because PRR5 must cover a larger region. The total of $F(PRM_i)$ in both cases is 10. However, in Figure 3b, PRM3 has a better chance of finding a suitable PRR to load into.
• Issue 3: if the numbers of PRMs mapped to each PRR are not balanced, some PRRs will become hot-spots that are used more frequently than others. This will potentially cause more resource contention.
Consequently, the final ILP program is built as follows. The objective function is presented in Equation 2. The most important part of the ILP constraint section is described by Equations 3 to 5.
Minimize:
Subject to:
The first term of Eqn. 2 introduces a new measure of priority for each PRM, $priority_i$, to solve the aforementioned Issue 1. The highest possible $priority_i$ is 1. Therefore, the lower the priority (higher number) is, the smaller the impact of the corresponding PRM on the summation of $F(PRM_i)$.
The second and third terms of Eqn. 2 address the last two issues respectively. $DEV\_PRM_{ij}$ measures the difference, or deviation, between the numbers of PRRs that $PRM_i$ and $PRM_j$ are mapped to. The underlying idea is that minimizing $DEV\_PRM_{ij}$ ($\forall i \neq j$; $i, j = 1 \rightarrow n$) balances the values of $F()$ between all PRMs. In this way, the second issue can be avoided. For instance, in Figure 3, the total deviation of the PRMs in example a is 8, while it is just 6 in b. The same method is applied to the third issue, related to PRRs, by using $DEV\_PRR_{ij}$ ($\forall i \neq j$; $i, j = 1 \rightarrow m$, where m is the number of PRRs). Equations 6 to 9 illustrate how to compute these deviations as constraints in the ILP program.
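The Figure 3 numbers can be reproduced with a few lines. This is a minimal sketch: the dictionaries below encode the two example mappings as described in the text, and the full mapping of PRM2 to PRR1-5 in case b is inferred from the stated total of 10 rather than given explicitly.

```python
from itertools import combinations


def prr_counts(mapping):
    """F(PRM_i): number of PRRs each PRM is mapped to."""
    return {prm: len(prrs) for prm, prrs in mapping.items()}


def total_deviation(values):
    """Sum of |a - b| over all unordered pairs of values."""
    return sum(abs(a - b) for a, b in combinations(values, 2))


# PRM -> set of PRR indices (case b for PRM2 is an inferred assumption).
fig3a = {"PRM1": {1, 2, 3, 4}, "PRM2": {1, 2, 3, 4, 5}, "PRM3": {5}}
fig3b = {"PRM1": {1, 2, 3}, "PRM2": {1, 2, 3, 4, 5}, "PRM3": {4, 5}}

for name, mapping in (("a", fig3a), ("b", fig3b)):
    f = prr_counts(mapping)
    print(name, sum(f.values()), total_deviation(f.values()))
```

Both mappings have the same total of $F(PRM_i)$, so only the deviation term distinguishes them.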
Eqn. 3 shows the calculation of the number of PRRs that $PRM_i$ is mapped to. The number of PRMs that $PRR_j$ can host is described in Eqn. 4. The set of binary variables $MAP\_PRM_iPRR_j$ ($\forall i = 1 \rightarrow n$, $j = 1 \rightarrow m$) indicates the mapping of PRM to PRR. If $MAP\_PRM_iPRR_j = 1$, then $PRM_i$ is mapped to $PRR_j$. We rely on these variables to determine the final PRM-PRR mapping.
Eqn. 5 is used to restrict the total number of CLBs occupied by all PRRs from exceeding the available CLBs (after deducting the static modules). The requested number of CLBs of $PRR_i$, $PRR_i^{CLB}$, is the largest requested number of CLBs among all PRMs that are mapped to $PRR_i$. Eqn. 10-11 present the computation of $PRR_i^{CLB}$ from two representative PRMs, $PRM_j$ and $PRM_k$. The same computation is applied for CLBM, BRAM and DSP.
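Since the objective and constraints are described here only in words, the following is one plausible way to write them down, using the standard linearization of the maximum and of the absolute value. It is a reading of the description, not a transcription of Eqn. 2 to 11: the symbol $CLB_{avail}$, the summation over $i<j$ and the exact form of every line are assumptions.

```latex
\begin{align*}
\text{minimize}\quad
  & -\alpha \sum_{i=1}^{n} \frac{F(PRM_i)}{priority_i}
    + \beta \sum_{i<j} DEV\_PRM_{ij}
    + \gamma \sum_{i<j} DEV\_PRR_{ij} \\
\text{subject to}\quad
  & F(PRM_i) = \sum_{j=1}^{m} MAP\_PRM_iPRR_j, \quad i = 1 \rightarrow n \\
  & G(PRR_j) = \sum_{i=1}^{n} MAP\_PRM_iPRR_j, \quad j = 1 \rightarrow m \\
  & \sum_{j=1}^{m} PRR_j^{CLB} \le CLB_{avail} \\
  & DEV\_PRM_{ij} \ge F(PRM_i) - F(PRM_j), \quad
    DEV\_PRM_{ij} \ge F(PRM_j) - F(PRM_i) \\
  & DEV\_PRR_{ij} \ge G(PRR_i) - G(PRR_j), \quad
    DEV\_PRR_{ij} \ge G(PRR_j) - G(PRR_i) \\
  & PRR_j^{CLB} \ge MAP\_PRM_iPRR_j \cdot PRM_i^{CLB}
\end{align*}
```

Because the deviation and PRR-size variables only appear with pressure to shrink (in the objective or the capacity constraint), the inequalities are enough to model the absolute value and the maximum without extra binary variables.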
In Eqn. 2, there are three weight parameters, α, β and γ. They are used to balance the preference of the architect over the three objectives. To make it easier to adjust these parameters, each objective should be normalized to its corresponding maximum possible value. The first objective, maximizing all $F(PRM_i)$, reaches its maximum when each PRM can be mapped to all PRRs; hence the value is $m \sum_{i=1}^{n} priority_i^{-1}$. Calculating the maximum values for the deviation metrics is not as straightforward. Eqn. 12-13 are the simplified functions used to calculate the total deviations of $DEV\_PRM_{ij}$ and $DEV\_PRR_{ij}$. It is assumed that $F(PRM_i)$ is sorted in decreasing order of value for $i = 1 \rightarrow n$. The same assumption is applied to $G(PRR_i)$.
In our empirical analysis, the total deviation of all PRMs reaches its maximum value when half of the PRMs are mapped to all PRRs and the other half are mapped to only 1 PRR. A similar observation can be drawn for $DEV\_PRR_{ij}$. Therefore, Eqn. 12-13 are simplified to Eqn. 14 and 15.
$MAX(\text{Eq. 12}) = 0.25\,n^2 (m - 1)$, n is even (14)
$MAX(\text{Eq. 13}) = 0.25\,m^2 (n - 1)$, m is even (15)
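The closed form for the PRM deviation can be sanity-checked by brute force on small instances. The sketch below assumes that the total deviation is the sum of $|F(PRM_i) - F(PRM_j)|$ over unordered pairs and that every PRM is mapped to between 1 and m PRRs; the function names are illustrative.

```python
from itertools import combinations, product


def total_deviation(values):
    """Sum of |a - b| over all unordered pairs of values."""
    return sum(abs(a - b) for a, b in combinations(values, 2))


def brute_force_max(n, m):
    """Largest total deviation over all assignments F_i in 1..m for n PRMs."""
    return max(total_deviation(f) for f in product(range(1, m + 1), repeat=n))


def closed_form_max(n, m):
    """0.25 * n^2 * (m - 1), stated for even n."""
    return 0.25 * n * n * (m - 1)


for n in (2, 4, 6):
    for m in (2, 3, 4):
        assert brute_force_max(n, m) == closed_form_max(n, m)
print("closed form matches brute force for the tested sizes")
```

The maximum is reached with n/2 PRMs at m PRRs and n/2 PRMs at 1 PRR, giving (n/2)(n/2) pairs that each differ by m - 1. The same check applies to the PRR deviation with the roles of n and m swapped.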
After calculating the maximum values for the three objectives, the parameters α, β and γ are then adjusted as in Eqn. 16. The architect balances the weights via α′, β′ and γ′.
Estimate Occupied Resources
In the PRFloor work [19], the authors explain that the actual FPGA resource occupation of the PRRs can be very different from the initial requirements. This is due to the heterogeneity and non-uniform distribution of the FPGA resources. If the PRM-PRR mapper is not aware of this issue, it may unknowingly over-assign the PRMs to the PRRs (assign more than it should). Consequently, the floorplanner will fail to find a feasible floorplan for the system because the PRRs become too big. The DSE process described in Section 3.1 may then terminate prematurely, and potentially optimal design points later in the exploration are lost. We propose a method to estimate the actual occupied resources of the PRRs based on the initial requirements. The pseudo-code used to estimate the resource occupation of a PRR is shown in Algorithm 1. This algorithm is only run once for each type of FPGA. The basic idea is to apply curve fitting algorithms to determine the best-fit functions whose inputs are the numbers of requested resources. In line 4, a placement of a PRR is the smallest rectangular region (starting from any location of the FPGA) that covers at least the amount of resources requested by that PRR. The algorithm finds all placements for the PRR from every possible location of the FPGA. In this work, the fitting functions found in lines 9-12 are obtained by using the Matlab Curve Fitting Toolbox [17]. These estimations are used in the constraint section of the PRM-PRR mapper (Eqn. 10-11) for the PRMs instead of the resources from the synthesis reports. The reason we can apply the estimation directly to the PRMs instead of the PRRs is as follows. The required CLB resource for the PRR to which $PRM_i$ ($i = 1 \rightarrow k$) are mapped is $PRR^{req\_CLB} = \max_i PRM_i^{req\_CLB}$. The estimated resource occupation of the PRR is $est\_PRR^{CLB} = est(\max_i PRM_i^{req\_CLB}) = \max_i est(PRM_i^{req\_CLB})$, where the last equality holds when the fitted function $est$ is monotonically non-decreasing. The accuracy of the algorithm is discussed in Section 4.3.
6:   end for
7: end for
8: determine 4 fitting functions to estimate the occupied CLB
9: got: est_clb_from_clb(num_clb) ≈ mean_clb_{num_clb}
10: got: est_clb_from_clbm(num_clbm) ≈ mean_clb_{num_clbm}
11: got: est_clb_from_bram(num_bram) ≈ mean_clb_{num_bram}
12: got: est_clb_from_dsp(num_dsp) ≈ mean_clb_{num_dsp}
13: // estimate the CLB from req_clb, req_clbm, req_bram, req_dsp
14: est_clb(req_clb, req_clbm, req_bram, req_dsp) = max(est_clb_from_clb(req_clb), est_clb_from_clbm(req_clbm), est_clb_from_bram(req_bram), est_clb_from_dsp(req_dsp), req_clb)
15: similar calculation is used to compute the other resources est_clbm, est_bram and est_dsp
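To illustrate why the occupied resources exceed the requested ones, the sketch below enumerates placements on a toy device. It is not the authors' implementation: the device is simplified to a single row of resource columns, the per-column yields in `COLUMN_YIELD` are invented numbers, and the curve-fitting step is omitted, with `fit_functions` standing in for whatever fitted functions are available.

```python
from statistics import mean

# Hypothetical resources contributed by one column of each type.
COLUMN_YIELD = {"CLB": 40, "BRAM": 8, "DSP": 16}


def smallest_placement(columns, start, request):
    """Extend right from `start` until the request is covered; None if impossible."""
    have = {kind: 0 for kind in COLUMN_YIELD}
    for end in range(start, len(columns)):
        have[columns[end]] += COLUMN_YIELD[columns[end]]
        if all(have[k] >= request.get(k, 0) for k in have):
            return have
    return None


def mean_occupied_clb(columns, request):
    """Mean CLB count over the smallest placements from every start column."""
    placements = [smallest_placement(columns, s, request) for s in range(len(columns))]
    clbs = [p["CLB"] for p in placements if p is not None]
    return mean(clbs) if clbs else None


def estimate_clb(fit_functions, request):
    """Final step: the largest per-resource estimate, never below the CLB request."""
    return max([f(request.get(k, 0)) for k, f in fit_functions.items()] + [request.get("CLB", 0)])


device = ["CLB", "CLB", "BRAM", "CLB", "CLB", "DSP", "CLB", "CLB"]
request = {"CLB": 40, "BRAM": 8}
print(mean_occupied_clb(device, request))
```

On this toy device the request for 40 CLBs and 8 BRAMs occupies more than 40 CLBs on average, because some placements must span extra CLB columns to reach a BRAM column.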
Request Generator and Simulator
In the microservice-based environment, when the tenants request services with a specified workload, performance requirement and cost preference, the hypervisor takes responsibility for processing the requests. It decides whether to scale out horizontally, allocating conventional virtual machines (VMs) to serve the requests, or to scale out vertically, choosing a VM with suitable processing elements such as an FPGA (Figure 1). This process is transparent to the tenants. Each tenant works independently of the others. The hypervisor can assign multiple FPGA slots (PRRs) to the tenants. It may happen that there is no available PRR left. The hypervisor then has to either wait for the other tenants to release the PRRs within a predefined timeout period, or allocate conventional VMs [10]. These behaviors of the tenants and the hypervisor are considered here.
3.4.1 Request Generator. We propose a request generator, shown in Algorithm 2, to mimic the behaviors of the tenants. The requests from each tenant are generated independently of the others. For each request, the required service (or PRM), the start time when the request must be served and the duration for which the tenant will use the service upon successful allocation are randomly generated. Additionally, the generator can specify the timeout period during which the hypervisor is allowed to postpone the request to wait for a suitable PRR. The maximum number of PRMs that one tenant is allowed to ask for at any particular time is configurable (the concur_prm parameter in line 4). The request parameters are uniformly generated. However, in practice, it is uncommon for the tenants to use all kinds of services from different application domains. Each tenant is primarily interested in one particular domain. Our request generator takes that into account by offering an option to specify the primary domain for each tenant. In that case, most of the services requested by the tenant will be from that domain (line 10 of Algorithm 2).
Algorithm 2 The Request Generator
Require: num_req, priority, primary_domain, duration_min, duration_max, timeout_max, concur_prm, window_length
1: i = 0; window_start = 1; start_time_prev = 1
2: while i < num_req do
3:   latest_end_time = 0
4:   for j = 1 to concur_prm do
5:     // gen_req() randomly decides whether to generate a new request
6:     if gen_req() == 0 then continue
7:     gen_start_time(req_i, window_start, start_time_prev)
8:     // start_time >= max(window_start, start_time_prev)
9:     generate duration, timeout
10:     generate app_domain
11:     generate req_prm ∈ app_domain
12:     expected_end = start_time + duration + timeout
13:     if latest_end_time < expected_end then update it
14:     i = i + 1
15:     start_time_prev = start_time
16:   end for
17:   window_start = window_start + window_length
18:   if window_start < latest_end_time + 1 then update it
19: end while
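A runnable sketch of Algorithm 2 follows. Several details are assumptions because the pseudocode leaves them open: the probability used by `gen_req()` (`gen_prob`), the share of requests drawn from the primary domain (`primary_bias`), the upper bound on the start time (one window length past the lower bound), and the early exit once `num_req` requests exist.

```python
import random


def generate_requests(num_req, domains, primary_domain, duration_min, duration_max,
                      timeout_max, concur_prm, window_length,
                      gen_prob=0.5, primary_bias=0.8, seed=0):
    """Generate one tenant's requests; `domains` maps a domain name to its PRMs."""
    if concur_prm < 1:
        raise ValueError("concur_prm must be at least 1")
    rng = random.Random(seed)
    requests, window_start, start_prev = [], 1, 1
    while len(requests) < num_req:
        latest_end = 0
        for _ in range(concur_prm):
            if len(requests) >= num_req:
                break                                   # assumption: never overshoot
            if rng.random() >= gen_prob:                # gen_req() returned 0
                continue
            earliest = max(window_start, start_prev)    # line 8 of Algorithm 2
            start = rng.randint(earliest, earliest + window_length - 1)
            duration = rng.randint(duration_min, duration_max)
            timeout = rng.randint(0, timeout_max)
            if primary_domain is not None and rng.random() < primary_bias:
                domain = primary_domain
            else:
                domain = rng.choice(sorted(domains))
            requests.append({"prm": rng.choice(domains[domain]), "domain": domain,
                             "start": start, "duration": duration, "timeout": timeout})
            latest_end = max(latest_end, start + duration + timeout)
            start_prev = start
        # Lines 17-18: advance the window past every request issued so far.
        window_start = max(window_start + window_length, latest_end + 1)
    return requests


pool = {"dsp": ["fir", "fft"], "crypto": ["aes"]}
reqs = generate_requests(50, pool, "dsp", 10, 400, 200, 3, 500)
print(len(reqs), reqs[0]["prm"])
```

Fixing the seed makes a request trace reproducible, so that different PR systems can be compared on the same workload.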
Algorithm 3 The Simulator
Require: list_of_tenants, list_of_prm, list_of_fpga
1: global_tick = 0; queue_cur_req ← ∅
2: while all requests are not served do
7:     while (req = pop_next_req(cur_tenant, global_tick)) do
8:       push_to_queue(queue_cur_req, req)
9:     end while
10:   end for
11:   serve_requests(queue_cur_req)
12: end while
13: report quality metrics
3.4.2 Simulator. Our simulator is developed to act as a simplified hypervisor serving the requests from the tenants. Algorithm 3 describes how it works. One unit of simulation time is called a tick.
The list_of_tenants and the corresponding requests are provided by the request generator. The list_of_fpga, the detailed information on the PRRs and the PRM-PRR mapping are obtained from the previous steps of the DSE flow. The remove_all_timed_out_requests() function in line 4 removes all timed-out requests, i.e., requests that the hypervisor deferred because there was no suitable PRR at the time they arrived and whose timeout has since expired. The timeout periods are generated randomly by the request generator. The function serve_requests() serves the requests in order of (1) increasing start_time and (2) tenant's priority (the list_of_tenants is sorted by priority). On the mapping side, the PRR assigned to a request is chosen such that (1) it can host the requested service and (2) it is the smallest available PRR. The mapper and scheduler could be developed further with a more sophisticated strategy that takes other metrics into account, such as application performance or energy efficiency.
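The serving policy can be sketched as a small tick-based loop. This is a simplified illustration, not the authors' simulator: it handles a single merged request list, ignores tenant priorities and resource wastage, and the request fields (`prm`, `start`, `duration`, `timeout`) and the scalar `prr_sizes` are assumptions.

```python
def simulate(requests, prr_sizes, mapping):
    """Return (serviceability, mean waiting time of the served requests)."""
    if not requests:
        return 1.0, 0.0
    pending = sorted(requests, key=lambda r: r["start"])
    busy_until = {prr: 0 for prr in prr_sizes}     # tick at which each PRR is free
    queue, served, total_wait = [], 0, 0
    tick, nxt = 0, 0
    while nxt < len(pending) or queue:
        # Admit every request whose start time has been reached.
        while nxt < len(pending) and pending[nxt]["start"] <= tick:
            queue.append(pending[nxt])
            nxt += 1
        # Drop deferred requests whose timeout has expired.
        queue = [r for r in queue if tick <= r["start"] + r["timeout"]]
        waiting = []
        for req in queue:                          # queue is ordered by start time
            free = [p for p in mapping.get(req["prm"], ()) if busy_until[p] <= tick]
            if free:
                prr = min(free, key=lambda p: (prr_sizes[p], p))  # smallest free PRR
                busy_until[prr] = tick + req["duration"]
                served += 1
                total_wait += tick - req["start"]
            else:
                waiting.append(req)
        queue = waiting
        tick += 1
    return served / len(requests), (total_wait / served if served else 0.0)


sizes = {"A": 10, "B": 20}
hosts = {"small": {"A", "B"}, "big": {"B"}}
trace = [
    {"prm": "small", "start": 0, "duration": 5, "timeout": 0},
    {"prm": "big", "start": 0, "duration": 5, "timeout": 0},
    {"prm": "big", "start": 1, "duration": 5, "timeout": 2},
]
print(simulate(trace, sizes, hosts))
```

In this toy trace, choosing the smallest PRR for the first request keeps the larger PRR free for the second one; the third request times out because the only PRR that can host it stays busy until tick 5.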
RESULTS
Experiment setup
All of our experiments are run on a computer with an Intel Core™ i5 CPU, 2.5 GHz x4 (2 physical cores with hyper-threading), and 12GB of memory. The operating system is Ubuntu 14.04 LTS 64-bit. The PR-HMPSoC template and PRFloor are provided by [19, 20]. Even though our method is made general enough for all kinds of Xilinx FPGAs, the one we are experimenting with (due to the restriction of [20]) is the Virtex-6 XC6VLX240T. The Gurobi Solver [11] is used to solve the PRM-PRR mapper with default settings. The weight parameters in the PRM-PRR mapper objective function are α′ = 5, β′ = γ′ = 1 (unless stated otherwise).
The request generator generates the requests from tenants based solely on the information about the PRMs (the IP pool) and the tenants' configuration, as discussed in Section 3.4.1. In our experiments, all PRMs have equal priority unless stated otherwise. Each tenant issues 10000 requests and can use up to 3 PRRs at any one time. The PRMs are chosen randomly with a uniform distribution, regardless of the application domain. The usage duration of each PRM upon successful allocation is randomly generated between 10 and 400 simulation ticks. The window_length parameter in Algorithm 2 is set to 500 ticks. The number of tenants in each simulation ranges from 2 to 12 in increments of 2. We have two sets of tenant requests: the first disables the timeout mechanism for requests that cannot be served immediately, as discussed in Section 3.4.2; the second enables it with random timeouts of up to 200 ticks.
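To make these settings concrete, a per-tenant request stream with the stated parameters could be produced roughly as below. This is a hedged sketch, not Algorithm 2: the function name, the dictionary fields, the `max_gap` arrival spacing and the lower timeout bound of 0 are assumptions, and the role of window_length is not modeled.

```python
import random


def generate_requests(prm_names, n_requests=10000, min_dur=10, max_dur=400,
                      max_timeout=None, max_gap=50, seed=0):
    """Generate one tenant's requests.

    max_gap is a placeholder for the arrival process (an assumption).
    max_timeout=None disables timeouts, as in the first request set;
    max_timeout=200 corresponds to the second request set.
    """
    rng = random.Random(seed)
    tick, requests = 0, []
    for _ in range(n_requests):
        tick += rng.randint(0, max_gap)               # assumed arrival spacing
        requests.append({
            "start_time": tick,
            "prm": rng.choice(prm_names),             # uniform over the IP pool
            "duration": rng.randint(min_dur, max_dur),
            "timeout": None if max_timeout is None else rng.randint(0, max_timeout),
        })
    return requests
```

Fixing `seed` makes a request set reproducible, so that every generated system can be simulated against exactly the same requests.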
IP Pool
We collect 50 real-world hardware accelerators, or PRMs, from CHStone [12], OpenCores [21], EPFL [9] and the Xilinx XPS IP core library [31]. These PRMs are categorized into 8 application domains: digital signal processing (DSP), cryptography, arithmetic, communication, soft-core processing unit, image processing, video processing and others. Figure 4 depicts the resource requirements of the PRMs. As seen, the sizes and types of resources of the PRMs vary quite significantly, which reflects microservice scenarios.
Accuracy of the Resource Estimation
In this section, the accuracy of the resource estimation method presented in Algorithm 1 is assessed. We randomly generate 10000 PRRs whose sizes can be up to 80% of the device. Afterwards, the mean resources occupied by each PRR are computed by the floorplanner. The error histogram of the requested resources compared against the actual occupation on the device is provided in Figure 5a. Algorithm 1 is then used to estimate the occupied resources. The corresponding error histogram is illustrated in Figure 5b. The results in the two figures indicate that our algorithm offers a highly accurate estimation of the final resource occupation. The errors of the CLB and CLBM estimations are mostly within 5% and at most 10%. In the case of BRAM and DSP, there is a small number of outliers that are more than 20% off from the expected values. This high error is due to the irregular distribution of the resources, especially BRAM and DSP, on the FPGA. This irregularity causes large variations in the size of the placements.

The architecture of the baseline system is obtained naturally by just executing the DSE flow. If it is possible to map all PRMs to every PRR in the system, the PRM-PRR mapper will converge to that point, thanks to the formulation of the objective function shown in Eqn. 2. The baseline system attains the best possible objective value among the systems with the same number of PRRs, namely −m * n, in which m and n are the numbers of PRRs and PRMs, respectively. All generated systems are later simulated with the same set of tenant requests.
We generate the PR system from the PR-HMPSoC template starting with 3 PRRs. In these experiments, the PRM-PRR mapper restricts the CLB (including CLBM), BRAM and DSP utilization of both the PRRs and the static modules to 85% of the Virtex-6. We keep increasing the number of PRRs until the PRM-PRR mapper or PRFloor fails to find a feasible floorplan. However, during the DSE, we notice a decline in the serviceability of the systems and decide to stop the DSE sooner. At this point, we have already obtained systems with up to 15 PRRs. Table 1 presents the time the PRM-PRR mapper takes for each system. The floorplanning time and the average values of F(PRM) and G(PRR) are also given. From the table, the baseline system has 4 PRRs, the same number as our most similar related work [10]. During the DSE, there are only a few cases where the final resource occupation after floorplanning exceeds the 85% constraint. All mappings given by the PRM-PRR mapper, except those for the systems with more than 12 PRRs, are proved optimal, i.e., Gurobi proves mathematically that the objective function attains its lowest possible value. For the first 6 systems, the PRM-PRR mapper takes less than 5 seconds to find an optimal solution. However, in the subsequent cases with 9 to 12 PRRs, the mapper needs nearly 2 minutes. For the systems with 13 to 15 PRRs, the ILP solver stops before proving optimality because the timeout is set to 5 minutes. A thorough investigation of this issue, with the aim of optimizing the ILP program, is left for future work.
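The outer exploration loop described above can be summarized by the following sketch. It is only an outline under stated assumptions: `map_prms`, `floorplan` and `simulate` are hypothetical caller-supplied stand-ins for the PRM-PRR mapper, PRFloor and the simulator, and the loop runs until a step fails rather than stopping early on a serviceability decline.

```python
def explore(map_prms, floorplan, simulate, start=3, max_prrs=64):
    """Grow the number of PRRs until mapping or floorplanning fails.

    map_prms(n)        -> a mapping for n PRRs, or None if infeasible
    floorplan(mapping) -> a floorplan, or None if infeasible
    simulate(plan)     -> serviceability in [0, 1]
    All three callables are placeholders supplied by the caller.
    """
    results = {}
    for n in range(start, max_prrs + 1):
        mapping = map_prms(n)
        if mapping is None:
            break                      # mapper found no feasible mapping
        plan = floorplan(mapping)
        if plan is None:
            break                      # floorplanner found no feasible floorplan
        results[n] = simulate(plan)
    best = max(results, key=results.get) if results else None
    return best, results
```

The returned `results` dictionary plays the role of a per-system summary from which the architect picks a configuration.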
We also run the experiments with the resource estimation turned off. In that case, PRFloor fails to floorplan the design with only 6 PRRs because the PRM-PRR mapper over-assigns PRMs to PRRs. From our empirical results, even though each PRR can host more PRMs in these experiments, the serviceability of these systems is almost identical to that of the systems with the same number of PRRs and resource estimation turned on. The PRM-PRR mapper over-assigns only one or two PRMs to some PRRs, which does not have much impact on the overall serviceability.

Increasing the number of PRRs improves the serviceability of the systems significantly. For instance, comparing the baseline system in Figure 6 with the 11-PRR system, the average serviceability increases from 0.74 to 0.95, which is a 28% improvement. If we only consider the case of 12 tenants, the improvement is much higher, at 75%. Additionally, having more tenants decreases the serviceability of the baseline system drastically, by about 11% for every increment of 2 tenants. With the 11-PRR system, the decrease is just 1%. Another interesting observation from Figure 6 is the decrease in serviceability when there are more than 11 PRRs. This behavior is better observed in Figure 7, in which the systems are rearranged based on the number of tenants. The serviceability of the systems with more than 11 PRRs is even worse than that of the baseline when there are 2 and 4 tenants. This can be explained as follows. As the number of PRRs increases, the resources left for each PRR become smaller. Therefore, each PRR now hosts fewer PRMs (Table 1). Some PRRs become hot-spots where multiple tenants request PRMs that can only be configured into those regions. As a result, the overall serviceability of these systems goes down. Figure 7 also helps the architect choose the best configuration based on the expected number of tenants. If there are fewer than 8 tenants, the 9-PRR system delivers the best serviceability. If more tenants need to be served, the 11-PRR system is the best all-round choice.
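The improvement percentages quoted above are relative gains over the baseline. The short check below reproduces them; the function names are ours, and the 12-tenant values (52% and 91%) are the ones given in the abstract.

```python
def serviceability(served, total):
    """Fraction of requests that were successfully allocated a PRR."""
    return served / total


def relative_improvement(before, after):
    """Gain of `after` over `before`, expressed as a fraction of `before`."""
    return (after - before) / before


avg_gain = relative_improvement(0.74, 0.95)   # averaged over all tenant counts
gain_12 = relative_improvement(0.52, 0.91)    # 12 tenants only
print(f"{avg_gain:.0%}, {gain_12:.0%}")       # prints: 28%, 75%
```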
We also run an extended experiment to evaluate the chosen system under unknown conditions. Its purpose is to make sure that, after the DSE, the system of interest can still maintain its quality of service as long as the distribution of the requested accelerators is the same. Figure 8 illustrates the average serviceability offered by the same set of systems with three other sets of requests from 12 tenants. The first one is created with the same configuration as stated in Section 4.1. In the second set, each tenant uses the PRMs for a longer duration, up to 800 ticks instead of 400. The third one allows the tenants to request up to 5 PRMs at any time instead of 3. Each request set is generated 5 times with different random seeds, and the reported results are the average of these runs. As shown in the figure, the 11-PRR system still outperforms the others across the three new sets of requests.

Figure 9: The serviceability differences when the reconfiguration overhead is considered versus the cases without that overhead. The requests are kept in the queue for at most "wait time" ticks.
4.4.3 The Serviceability - With Request Timeout. In PR systems, when a PRM is requested, there is a small reconfiguration overhead, as presented in [2]. In their experiments, this overhead is about 4% of the execution time of the PRM. It depends on how big the PRRs are and how long the PRMs are in use. However, bigger PRMs tend to be used longer than smaller ones, and they also take longer to reconfigure. These assumptions may not be entirely true in general, but they adequately reflect the magnitude of the latency. Therefore, we run another set of experiments in which the reconfiguration overhead is 8 ticks, i.e., 4% of the average 200-tick execution time of the PRMs. The requests from tenants are kept the same. We also allow the requests to stay in the queue for 0, 8, 16, or 24 ticks while waiting for their turn to be served. The ratio of the differences in the average serviceability of the systems is shown in Figure 9. As expected, the serviceability decreases when the requests are only allowed to wait for a short time. However, the change is very subtle, at most 2.7%. When the requests wait for a longer time, the systems indeed achieve better serviceability. Thus, if the tenants do not strictly require their requests to be processed immediately, they will have more requests served.
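The overhead and waiting-window settings above amount to a little arithmetic, sketched here. The helper names are hypothetical, and charging the overhead before the PRM starts executing is an assumption about how the simulator accounts for it.

```python
RECONFIG_FRACTION = 0.04   # overhead reported in [2], relative to PRM execution time
AVG_EXEC_TICKS = 200       # approximate mean of the 10 to 400 tick usage durations

overhead_ticks = round(RECONFIG_FRACTION * AVG_EXEC_TICKS)   # 8 ticks


def can_be_served(arrival, wait_time, prr_free_at):
    """True if a compatible PRR frees up before the waiting window closes."""
    return prr_free_at <= arrival + wait_time


def release_tick(start, duration, overhead=overhead_ticks):
    """Tick at which the PRR is free again: reconfigure first, then run the PRM."""
    return start + overhead + duration
```

For example, a request arriving at tick 100 for a PRR that frees up at tick 105 is dropped with a wait time of 0 but served with a wait time of 8.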
To assess our observation, we run the second set of tenant requests, in which the larger timeout is enabled. This time, we measure the average wait time, i.e., the time that each request has to wait to be served. The reconfiguration overhead is still 8 ticks.

The number of PRRs is 11. All PRMs have the same priority. Each system obtained is then simulated with 12 tenants, using the same set of requests as the experiment presented in Figure 10. As expected, F(PRM) decreases when α′ decreases (with respect to β′ and γ′). When the PRM-PRR mapper tries to even out the F(PRM_i) in favor of high β′ and γ′, it has to lower the F(PRM) of some PRMs significantly to increase the F(PRM) of larger PRMs. As a result, the standard deviation becomes smaller, as shown in the figure. β′ and γ′ also have an impact on the serviceability. As mentioned earlier, we include the DEV_PRM calculation in the objective function to provide a means to increase the chance of finding a compatible PRR for some PRMs. These PRMs initially can only be mapped to a smaller number of PRRs even though all of them have the same priority. As seen, when we increase β′:γ′ up to 4:4, the serviceability improves slightly, by around 1.5%. Beyond this point, the serviceability drops by up to 8%. In this case, on average, each PRM has only 2 compatible PRRs. This number is too small considering that 12 tenants share the FPGA and each tenant can request up to 3 PRMs at a time. DEV_PRR also helps reduce the hot-spot issue that arises when too many PRMs are mapped to only a small number of PRRs. The downward trend of the normalized standard deviation of the PRRs' usage time shows that the PRRs are being utilized more fairly. The aforementioned effects of α′, β′ and γ′ emphasize the ability of the PRM-PRR mapper to control the behavior of the system. The architect can tune the weights to suit the requirements.
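The quantities discussed above can be read off a binary PRM-PRR mapping matrix. The sketch below assumes that F(PRM_i) counts the PRRs able to host PRM i and that G(PRR_j) counts the PRMs that PRR j can host, and it uses the population standard deviation divided by the mean as the normalized spread; the exact DEV_PRM and DEV_PRR terms of the objective function are not reproduced here.

```python
from statistics import mean, pstdev


def f_prm(mapping):
    """F(PRM_i): number of PRRs that can host PRM i (row sums)."""
    return [sum(row) for row in mapping]


def g_prr(mapping):
    """G(PRR_j): number of PRMs that PRR j can host (column sums)."""
    return [sum(col) for col in zip(*mapping)]


def normalized_std(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# Toy example: 4 PRMs x 3 PRRs; mapping[i][j] == 1 if PRM i fits in PRR j.
mapping = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
]
f_values = f_prm(mapping)   # [3, 2, 1, 2]
g_values = g_prr(mapping)   # [4, 3, 1]
```

A smaller `normalized_std(f_values)` means the PRMs have a more even number of compatible PRRs, which is the effect that larger β′ and γ′ are meant to encourage.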
4.4.5 Assigning priorities to PRMs. In our PRM-PRR mapper objective function, we consider the situation where some PRMs are used more often than others. These common PRMs should be mapped to more PRRs. In this section, we present the effect of assigning a higher priority to some specific groups of PRMs. Our pool of PRMs is classified into application domains. These are Arithmetic (multiplication, addition, sine, division), Cryptography (SHA, MD5, AES, RC4), Image/Video Processing (JPEG decoder, image statistics, color filtering array, stream scaler), and many more. The experiment is set up such that, first, all PRMs under each of the three aforementioned domains are in turn assigned a higher priority than the others. Then, the PRM-PRR mapper is executed for each case to find the optimal mapping in the systems with 3 to 15 PRRs. Finally, the ratio of F(PRM_i) over the available PRRs in each system is calculated and averaged per application domain across all systems. Figure 12 presents the results. It can be seen that the corresponding F_i is improved for each application domain (arithmetic, cryptography and image/video processing) when its priority parameter is set higher. The F_i values of the other PRMs are automatically reduced to compensate for the PRMs under consideration. This flexibility gives the architect the freedom to adjust the mapping based on the statistics of the applications. Each system in the server farm can be tuned individually to offer even better performance for tenants with a suitable runtime mapping/scheduling strategy.
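The per-domain averaging used for this comparison could be computed along the following lines. This is one plausible reading of the procedure, in which the ratios of all PRMs of a domain are pooled across systems before averaging; the function name and the data layout are assumptions.

```python
from collections import defaultdict


def coverage_by_domain(systems, domain_of):
    """Average F(PRM_i) / (number of PRRs) per application domain.

    systems:   list of (n_prrs, {prm_name: F value}) pairs, one per explored system
    domain_of: {prm_name: application domain}
    """
    ratios = defaultdict(list)
    for n_prrs, f_values in systems:
        for prm, f in f_values.items():
            ratios[domain_of[prm]].append(f / n_prrs)
    return {domain: sum(r) / len(r) for domain, r in ratios.items()}


# Toy data: two systems (4 and 8 PRRs) and two PRMs.
systems = [(4, {"aes": 4, "sine": 2}), (8, {"aes": 6, "sine": 2})]
domain_of = {"aes": "cryptography", "sine": "arithmetic"}
coverage = coverage_by_domain(systems, domain_of)
```

In this toy example the cryptography PRM is compatible with 87.5% of the PRRs on average, against 37.5% for the arithmetic PRM, which is the kind of gap a higher priority is expected to produce.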
Wasted resources.
We introduce the wasted resource cost metric to further inspect the usefulness of having a larger number of PRRs. If all PRRs must be large enough to accommodate the largest PRMs, which may not be used regularly, valuable FPGA resources will be wasted. During the simulation, we compute the wasted resource cost by accumulating the difference between the resources of the requested PRMs and those of their allocated PRRs. Each type of resource (CLB, BRAM or DSP) is assigned a different weight, similar to [19]. The results are reported in Figure 13. The state of the system is sampled 100 times at fixed intervals. As shown, the wasted resources in the 11-PRR system are almost half of those of the baseline. This implies that the 11-PRR system uses the FPGA resources much more efficiently than the baseline.

Figure 14: The utilization of each PRR throughout the simulation. For each PRR, the red/blue boxes mean that it is occupied/free.
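As an illustration of the wasted resource cost, the weighted difference between a PRR and the PRM configured in it can be accumulated as shown below. The weights are placeholders chosen for the example, since the actual values follow [19] and are not listed here, and summing once per allocation is an assumption about how the accumulation is done.

```python
# Placeholder weights (assumption); the paper takes its weights from [19].
WEIGHTS = {"clb": 1, "bram": 4, "dsp": 2}


def wasted_cost(prm_res, prr_res, weights=WEIGHTS):
    """Weighted resources of a PRR left unused by the PRM configured in it."""
    return sum(w * (prr_res[k] - prm_res[k]) for k, w in weights.items())


def accumulated_waste(allocations, weights=WEIGHTS):
    """Sum the waste over all (PRM resources, PRR resources) allocations."""
    return sum(wasted_cost(prm, prr, weights) for prm, prr in allocations)


prm = {"clb": 100, "bram": 2, "dsp": 4}
prr = {"clb": 160, "bram": 4, "dsp": 8}
cost = wasted_cost(prm, prr)   # 1*60 + 4*2 + 2*4 = 76
```

A system with many small PRRs tends to lower this cost because a small PRM no longer has to occupy a region sized for the largest PRM.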
The number of PRRs and how they are utilized during the simulation are captured in Figure 14. In this experiment, α′ : β′ : γ′ = 5 : 1 : 1. The result suggests that utilizing the 11-PRR system could also improve the energy efficiency. In the figure, there are more blue gaps in the 11-PRR system than in the baseline. It is possible to disable the PRRs during the free periods to save dynamic power. In the baseline system, all PRRs are active most of the time, yet it can only serve half of the requests from tenants, as discussed in Section 4.4.2.
CONCLUSION AND FUTURE WORKS
In this work, we propose a DSE process to find the best possible PR system configuration with respect to serviceability. The DSE process is composed of our novel ILP-based PRM-PRR mapper, the resource estimation method, the request generator and simulator engines, and the integration with an automatic floorplanner.
In the future, we will extend the ILP program and the DSE process to determine the best number of not only PRRs but also FPGAs for specific QoS requirements. A bitstream-relocation-aware approach will be explored to generate partial bitstreams more efficiently. Real-world cloud applications will also be examined.
