Online machine learning (OML) algorithms do not need any training phase and can be deployed directly in an unknown environment. OML includes multi-armed bandit (MAB) algorithms that can identify the best arm among several arms by achieving a balance between exploration of all arms and exploitation of the optimal arm. The Kullback-Leibler divergence based upper confidence bound (KLUCB) is the state-of-the-art MAB algorithm that optimizes the exploration-exploitation tradeoff, but it is complex due to its underlying optimization routine. This limits its usefulness for robotics and radio applications, which demand integration of KLUCB with the PHY on the system on chip (SoC). In this paper, we efficiently map the KLUCB algorithm on SoC by realizing the optimization routine via an alternative synthesizable computation without compromising on the performance. The proposed architecture is dynamically reconfigurable such that the number of arms, as well as the type of algorithm, can be changed on-the-fly. Specifically, after initial learning, an on-the-fly switch to light-weight UCB offers around a 10-fold improvement in latency and throughput. Since the learning duration depends on the unknown arm statistics, we embed intelligence in the architecture to decide the switching instant. We validate the functional correctness and usefulness of the proposed architecture via a realistic wireless application, and a detailed complexity analysis demonstrates its feasibility in realizing intelligent radios.
I. INTRODUCTION
Online machine learning (OML) algorithms such as multi-armed bandit (MAB) and reinforcement learning offer a simple but very powerful framework to enable decision making over time in an unknown, uncertain environment [1], [2]. The MAB algorithm aims to identify the best arm among several arms, and various extensions such as multi-play, multi-player, adversarial, contextual, and linear MABs have been explored to cater to a wide range of applications [1]. A few of them include content (news or advertisement) selection to maximize the number of clicks, dynamic pricing to maximize the total profit, medical trials to identify a suitable drug, and resource selection in robotics, data centers, IoT and wireless networks [1]-[3].
An optimal MAB algorithm guarantees logarithmic regret (i.e. loss due to sub-optimal arm selection) by achieving a balance between exploration of all arms to gain knowledge that may improve future performance and exploitation of the arm that maximizes the immediate performance. The Kullback-Leibler divergence based upper confidence bound (KLUCB) is the state-of-the-art MAB algorithm that optimizes this trade-off, but it is computationally complex due to its underlying optimization routine [4]. Other algorithms include UCB, Bayes UCB and Thompson Sampling (TS), which incur higher regret than KLUCB [1]-[3]. From an architecture perspective, none of these algorithms has been realized in hardware. The usefulness of MAB algorithms in robotics, IoT and wireless applications, together with strict latency constraints, demands efficient mapping to an area- and power-efficient architecture and tight integration with the physical layer (PHY) algorithms [5].
In this paper, we explore synthesizable computation to replace the optimization routine in KLUCB, along with a hardware-software co-design approach, to efficiently map the KLUCB algorithm on the Zynq system on chip (ZSoC) and validate its performance. To the best of our knowledge, this work is the first attempt towards the hardware realization and performance analysis of MAB algorithms. Next, we explore dynamic partial reconfiguration (DPR) to realize a reconfigurable architecture that allows on-the-fly configuration of the number of arms as well as the type of algorithm. We demonstrate around a 10-fold improvement in latency and throughput by enabling an on-the-fly switch from KLUCB to light-weight UCB after initial learning. Since the learning duration depends on the unknown arm statistics, intelligence embedded in the architecture offers the capability to optimize the switching instant. We validate the functional correctness and usefulness of the proposed architecture via a realistic wireless application, and a detailed complexity analysis demonstrates its feasibility in realizing intelligent radios/robots. Please refer to [6] for a supplementary tutorial and source code.
II. SYNTHESIZABLE UCB AND KLUCB ALGORITHMS
In the MAB setup, each experiment consists of N sequential slots, indexed by n ∈ {1, 2, ..., N}, with K arms, indexed by k ∈ {1, 2, ..., K}, and the aim is to select the arm with the highest reward as many times as possible. However, the reward distribution of the arms is unknown and needs to be learned. In this paper, we limit our discussion to the Bernoulli reward distribution, though the proposed architectures can be tuned for Exponential and Poisson distributions as well. We consider single-play MAB, where the algorithm can select only one arm in each slot. The arm selected in slot n is denoted by $I_n$, and $R_n$ denotes the reward received for the selected arm (i.e. only one feedback in each slot). Both UCB and KLUCB algorithms select each arm once in the beginning (i.e. first K slots). Thereafter, in each subsequent time slot, a quality factor (QF), Q(k, n), is calculated for each arm. In UCB, the value of $Q_u(k, n)$ is given by [1],
where
$$X(k, n) = X(k, n-1) + R_{n-1} \cdot \mathbb{1}_{\{I_{n-1} = k\}} \quad \forall k \tag{2}$$
$$T(k, n) = T(k, n-1) + \mathbb{1}_{\{I_{n-1} = k\}} \quad \forall k \tag{3}$$
Here, $\mathbb{1}_{\text{cond}}$ is an indicator function that is equal to 1 (or 0) if the condition cond is TRUE (or FALSE). The parameter α is an exploration factor that can take any value between 0.5 and 2. Based on the calculated QFs, the arm with the highest QF is selected, and it is denoted by $I_n$ [1].
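The update rules and the UCB selection step can be sketched in software as follows. This is only an illustrative sketch and not the hardware description used in this work. Since the QF expression itself is not reproduced in this text, the sketch assumes the commonly used UCB index, i.e. the empirical mean X/T plus the exploration term sqrt(α log(n)/T). The function names and the 0-based arm indices are hypothetical.

```python
import math

def update_counts(x, t, prev_arm, prev_reward):
    """Update X and T for the arm played in the previous slot (0-based index)."""
    x, t = list(x), list(t)
    x[prev_arm] += prev_reward   # reward accumulates only for the played arm
    t[prev_arm] += 1             # play count of the played arm
    return x, t

def ucb_qf(x_k, t_k, n, alpha=2.0):
    """Assumed standard UCB index: empirical mean plus exploration bonus."""
    return x_k / t_k + math.sqrt(alpha * math.log(n) / t_k)

def select_arm_ucb(x, t, n, alpha=2.0):
    """Return the 0-based index of the arm with the highest QF."""
    qf = [ucb_qf(xk, tk, n, alpha) for xk, tk in zip(x, t)]
    return max(range(len(qf)), key=qf.__getitem__)
```

With equal play counts the exploration bonus is identical for all arms, so the arm with the higher empirical mean is selected; an arm that has been played less often receives a larger bonus.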
In the literature, various extensions of UCB such as UCB V and UCB T have been discussed, and they differ slightly in the number of arithmetic operations in the QF calculation. We have realized all three UCB algorithms on ZSoC and compared their performance in Section V. However, due to space constraints, we limit the discussion to UCB and KLUCB, since the KLUCB QF calculation is significantly different and needs a computationally complex optimization routine along with the KL divergence, d, as shown below [4].
$$Q_{kl}(k, n) = \max\left\{ q \in [0, 1] : d\left(\frac{X(k, n)}{T(k, n)},\, q\right) \le Y(k, n) \right\} \tag{5}$$
To realize Eq. 5, we need to compute the KL divergence for a large number of possible values of q ∈ [0, 1], which makes the QF computation extremely expensive to realize in hardware. To overcome this limitation, we present an alternative heuristic approach, shown in Algorithm 1, for the Bernoulli reward distribution. The main idea is to dynamically and intelligently refine the range of q by comparing the KL divergence between the learned arm statistic, i.e. X(k, n)/T(k, n), and the expected arm statistic, $m_{id}$ (line 8), with the exploration factor, $S_2$ (line 9). The number of iterations of the for loop, i.e. the parameter β, depends on ∆ > 0, which is the minimum gap between the statistics of any two arms, and we set β = 1/∆. We denote $\hat{\mu}(k, n) = X(k, n)/T(k, n)$, which is the learned mean of the reward distribution of the k-th arm till slot n, and µ(k) is its actual value, which is unknown. Then, we have,
We assume > ∆ and hence, β = 1 .
Algorithm 1 Modified $Q_{kl}(k, n)$ Calculation in KLUCB
1: Input: X(k, n), T(k, n), n
2: Parameter: β
3: Output: Q(k, n)
4: $l_{id} = S_1 = X(k, n)/T(k, n)$
5: $S_2 = [\log n + c \log(\log n)]/T(k, n)$
6: $u_{id} = \min(1, S_1 + S_2/2)$
7: for i = 1 : 1 : β do
8:
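The range-refinement idea of Algorithm 1 can be sketched in software as a bisection. The listing above stops at line 8 in this text, so the loop body below is an assumption based on the description in this section: compute the midpoint of the current range, compare the Bernoulli KL divergence d(S_1, m_id) with S_2, and shrink the range accordingly. Two further assumptions are made. The initial upper bound uses the Pinsker-type bound S_1 + sqrt(S_2/2), which differs from the expression in line 6 as printed, and the lower end of the final range is returned so that the constraint in Eq. 5 always holds. The function names and the default c = 0 are hypothetical.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def klucb_qf(x_k, t_k, n, beta=16, c=0.0):
    """Bisection sketch of the modified KLUCB quality factor (requires n >= 2)."""
    s1 = x_k / t_k                                         # learned mean, also l_id
    s2 = (math.log(n) + c * math.log(math.log(n))) / t_k   # exploration factor
    lower = s1
    upper = min(1.0, s1 + math.sqrt(s2 / 2.0))             # assumed Pinsker-type bound
    for _ in range(beta):                                  # beta sequential refinements
        mid = (lower + upper) / 2.0
        if bernoulli_kl(s1, mid) > s2:
            upper = mid                                    # mid violates the constraint
        else:
            lower = mid                                    # mid is still feasible
    return lower
```

Each iteration halves the search range, so the loop needs β KL-divergence evaluations, and each iteration depends on the result of the previous one.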
III. PROPOSED ARCHITECTURE
We first discuss the KLUCB architecture details followed by a brief discussion on modifications needed for UCB realization. Each slot in the KLUCB algorithm consists of three tasks: 1) Initialization (only for first K slots) and parameter update based on the reward feedback, 2) QF calculation for each arm, and 3) Arm selection using calculated QF values.
A. Initialization and Parameter Update Block
The initialization (INIT) and parameter update block is identical in both algorithms. At the beginning of a new experiment (n = 0), the algorithm enters into the INIT phase and its duration is K slots. The aim is to select each arm only once and in our architecture, this is accomplished using a pseudo-random sequence generator of length K.
In each slot, the values of {X, T} are updated based on the feedback as shown in Eq. 2 and 3 in Section II, and the value of n is incremented by 1. The parameter update can be done at the end of the current slot after receiving the reward for the chosen arm or at the beginning of the subsequent slot. We choose the latter approach and hence, the feedback signal contains the information $I_{n-1}$ and $R_{n-1}$, i.e. the arm selected in the (n − 1)-th slot and the reward received from it. The format of the feedback signal is shown in Fig. 1. The first bit is the reward (0 or 1 for the Bernoulli case), the second is the restart bit (1 to begin a new experiment and 0 to continue the same experiment) and the remaining bits indicate the arm selected in the previous time slot, resulting in a total of $\log_2(K_{max}) + 2$ bits. For easier understanding, the arms are shown to be selected in deterministic order in the INIT phase. For reward distributions with non-integer rewards, additional bits are needed. The architecture for the first task is shown in Fig. 2. In all figures, a double-headed arrow indicates the AXI4 protocol, where the M and S symbols denote master and slave ports, respectively. In Fig. 2, the input decoder decodes the AXI4 feedback signal and generates various enable signals. For instance, n_en is generated once every slot, which increments n by 1 using the update block as shown in Fig. 2. T_k_en is generated only if the k-th arm was selected in the previous slot. Similarly, X_k_en is generated only if the k-th arm was chosen and its reward is 1.
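The feedback word and the enable signals described above can be modelled in a few lines. This is a hedged sketch only: it assumes that the "first bit" is the least significant bit, that arm indices are 0-based, and that K_max is rounded up to a whole number of index bits. The function names are hypothetical and do not come from the released source code.

```python
def feedback_width(k_max):
    """Number of bits in the feedback word: arm-index bits plus reward and restart."""
    return (k_max - 1).bit_length() + 2

def pack_feedback(reward, restart, prev_arm, k_max):
    """Pack reward (bit 0), restart (bit 1) and previous arm index (upper bits)."""
    if reward not in (0, 1) or restart not in (0, 1):
        raise ValueError("reward and restart must be single bits")
    if not 0 <= prev_arm < k_max:
        raise ValueError("arm index out of range")
    return reward | (restart << 1) | (prev_arm << 2)

def decode_enables(word, k_max):
    """Model of the input decoder: derive n_en, T_k_en and X_k_en for each arm."""
    reward, restart, prev_arm = word & 1, (word >> 1) & 1, word >> 2
    t_en = [int(k == prev_arm) for k in range(k_max)]   # only the played arm counts
    x_en = [int(e and reward == 1) for e in t_en]       # and only if it was rewarded
    return {"n_en": 1, "restart": restart, "T_en": t_en, "X_en": x_en}
```

For K_max = 4, the word is 4 bits wide, and a reward of 1 on the fourth arm without a restart is encoded as 0b1101.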
B. QF Calculation
In each slot after the INIT phase, the algorithm calculates the value of QF for each arm using the updated parameters, X(k, n), T (k, n) and n. We map the QF calculation steps of the modified KLUCB discussed in Algorithm 1 to the suitable architecture shown in Fig. 3 . Since the QF calculation is identical for each arm and can be done in parallel without any interdependence, we limit our discussion to a single arm.
The QF calculation needs three AXI4 stream inputs, which are pre-processed to get $S_1$, $S_2$, $l_{id}$ and $u_{id}$ using Steps 4-6 in Algorithm 1. Note that all operations are performed using IPs with the AXI4 stream interface and, to maintain the clarity of the architecture, we have omitted the AXI4 signals in Fig. 3. Furthermore, though the reward, X(k, n), can have only integer (1 and 0) values for the Bernoulli distribution, the architecture supports the single-precision floating-point arithmetic needed for exponential and Poisson distributions.
After pre-processing, the QF calculation needs β sequential loop iterations. Note that, due to the interdependence between iterations, they cannot be executed in parallel. For illustration, we have shown the architecture depicting the various arithmetic operations and the KL divergence (Eq. 6) calculation in each iteration of the loop. Care has been taken to generate appropriate valid signals for each intermediate output so that the arithmetic blocks are enabled only when needed. Since only one iteration is active at a time, the same hardware is reused for all β iterations in KLUCB.
Using the reconfigurable and intelligent architecture discussed later in Section IV, the proposed architecture can switch to a light-weight UCB algorithm after the initial learning period. To highlight the need for such a switch from a complexity perspective, the difference in the computational complexity of the QF calculations in the two algorithms is highlighted in Fig. 3. It can be observed that the UCB QF is obtained by disabling all iteration blocks of KLUCB along with the logarithm, addition and minimum number identification sub-blocks in the pre-processing block (shown using the shaded pattern). Though the savings in power and latency are evident, we need embedded intelligence and reconfigurability to auto-enable such a switch.
C. Arm Selection
In each slot, the arm with the maximum QF value is selected by comparing the updated QF values as shown in Eq. 4. The corresponding architecture for K = 4 is shown in Fig. 4, where the output is $I_n$, i.e. the index of the arm having the highest QF. Note that the QF and arm selection blocks are bypassed in the INIT phase. Using these three blocks, the proposed reconfigurable and intelligent architecture is presented in the next section.
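One way to picture the arm selector is as a tree of pairwise comparators that reduces K QF values to a single winning index. The sketch below is an assumption about the structure, since Fig. 4 is not reproduced here. Ties are resolved in favour of the lower index, which is also an assumption, and indices are 0-based.

```python
def compare(a, b):
    """Return the (index, qf) pair with the larger QF; ties go to the first input."""
    return a if a[1] >= b[1] else b

def select_arm_tree(qf):
    """Reduce the QF values pairwise, level by level, to the winning arm index."""
    level = list(enumerate(qf))
    while len(level) > 1:
        nxt = [compare(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an unpaired entry passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0][0]
```

For K = 4 this corresponds to two comparators in the first level and one in the second, so the comparison depth grows with log2(K) instead of K.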
IV. INTELLIGENT AND RECONFIGURABLE ARCHITECTURE
The proposed intelligent and reconfigurable architecture on ZSoC, consisting of an ARM processor (processing system, i.e. PS) and FPGA (programmable logic, i.e. PL), is shown in Fig. 5. The PL contains three blocks (1-3) corresponding to the three tasks discussed in the previous section. The fourth task of generating the feedback signal with the appropriate reward is realized in the ARM processor (PS), thereby making the PS act as the environment. The INIT and parameter update block in the PL can be integrated with the QF calculation block. However, the resultant architecture demands four AXI4 handshakes between PS and PL in each slot, which in turn incurs a significant penalty due to K AXI write transactions compared to one transaction for the architecture in Fig. 5. Other realizations such as 1) Only ARM, and 2) ARM+NEON Co-processor, are also considered; please refer to Section V for details.
The proposed architecture is made reconfigurable via the DPR property of ZSoC. Specifically, the PS is responsible for generating appropriate DPR signals for changing the number of arms and the type of algorithm, i.e., on-the-fly configuration to the UCB, UCB T, UCB V or KLUCB algorithm. To enable such reconfiguration, we have incorporated PS-controlled DPR via the Processor Configuration Access Port (PCAP). For our architecture in Fig. 5 with K max = 4, we have four reconfiguration regions (RR), i.e. regions whose functionality can be changed on-the-fly. Since each region can be configured with a blank, UCB, UCB V, UCB T or KLUCB QF block, these five partial bit-streams are stored in the main memory or SD card. Via a bare-metal application deployed on the ARM processor, the desired bit-streams are sent to the FPGA for the appropriate RR configuration using the device configuration (DevC) direct memory access (DMA).
Reconfiguration of the number of arms depends on the environment and hence can be user-controlled. Similarly, the user can decide the type of algorithm at the beginning of the experiment. The proposed architecture offers additional intelligence to automatically switch between the algorithms in an ongoing experiment to optimize latency and power without compromising on performance. For example, KLUCB is optimal because it reduces exploration by identifying the optimal arm more quickly than UCB. This means that, though both KLUCB and UCB are asymptotically optimal, i.e. they both can identify the optimal arm, KLUCB is better as it identifies the optimal arm with less exploration than UCB. Based on this observation, we embed additional intelligence in our architecture to deploy KLUCB in the initial slots and automatically switch on-the-fly to light-weight UCB after an initial learning period. As shown in Fig. 3, we obtain $Q_u(k, n)$ as well as $Q_{kl}(k, n)$ simultaneously. These two values are then compared in the arm selector block to see whether both lead to the selection of the same arm. Based on this comparison, $C_n$ is generated, which is 1 when the same arm is selected and 0 otherwise. The intelligence unit in the ARM processor regularly checks $C_n$ over a suitably chosen window period and enables a switch to UCB if $C_n$ is observed to be 1 for the majority of the slots in the window period (indicating completion of the KLUCB exploration).
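The switching rule of the intelligence unit can be sketched as a sliding-window majority check on $C_n$. The window length and the majority threshold below are hypothetical parameters chosen for illustration; the values used on the ARM processor are not stated in this text. The class name is also hypothetical.

```python
from collections import deque

class SwitchMonitor:
    """Track C_n over a sliding window and report when to switch to UCB."""

    def __init__(self, window=100, majority=0.5):
        self.window = window        # assumed window length in slots
        self.majority = majority    # assumed fraction of agreeing slots required
        self.history = deque(maxlen=window)

    def observe(self, arm_klucb, arm_ucb):
        """Record C_n for this slot; return True once a switch is warranted."""
        self.history.append(1 if arm_klucb == arm_ucb else 0)
        if len(self.history) < self.window:
            return False            # not enough slots observed yet
        return sum(self.history) > self.majority * self.window
```

A short window reacts quickly but may switch before the KLUCB exploration is complete, while a long window delays the latency and power savings of UCB.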
In Fig. 6 , we demonstrate the functioning of the proposed architecture. As shown, the user can add a new arm or remove the arm by choosing the appropriate option. When a user runs a new experiment with 3 machines, it can be observed that the switch between KLUCB to UCB happens at slot number 1526 and arm 3 with the highest reward is chosen maximum number of times. Next, the user adds new arm on-the-fly via DPR and in an experiment with four arms, algorithm needs more time for exploration which means KLUCB to UCB switch is delayed till slot 1809. As expected, arm 4 is chosen highest number of times. Please refer to [6], [7] containing a supplementary tutorial explaining detailed block design, DPR steps, and source codes.
V. PERFORMANCE AND COMPLEXITY ANALYSIS
To begin with, we compare the reward performance of the modified KLUCB in Algorithm 1 (referred to as KLUCB+UCB) with β = 16, KLUCB [4] and UCB [1] algorithms realized on ZSoC. We consider K = 4 arms with a horizon consisting of N = 10000 slots. The arms offer Bernoulli rewards with two different sets of means: 1) µ 1 = {0.2, 0.4, 0.6, 0.8} and 2) µ 2 = {0.51, 0.52, 0.53, 0.54}. For easier analysis, the last arm has been chosen as the best arm, i.e. the arm with the highest reward. The ideal average reward per slot for µ 1 and µ 2 is 0.8 and 0.54, respectively, and this is achieved when the algorithm consistently selects the fourth arm. However, algorithms need exploration to learn the arm distributions before converging to the ideal average reward, as shown in Fig. 7. As expected, optimization-based KLUCB offers the highest reward, while the proposed KLUCB+UCB (modified KLUCB with an intelligent switch to UCB after exploration) closely matches KLUCB and significantly outperforms UCB.

Next, we compare the resource utilization of various architectures in Table I. First, we consider two architectures for the modified KLUCB discussed in Algorithm 1: 1) Reconfigurable via DPR, and 2) Velcro (conventional) approach. In the Velcro approach, all arms and hence all QF calculation blocks are active at all times, while the proposed reconfigurable architecture allows dynamic activation and deactivation of each arm. Note that none of these algorithms has been mapped to an architecture yet in the literature. As shown in Rows 1 and 2 of Table I, the reconfigurable approach offers lower resource utilization and power consumption except for a small increase in LUTs when K = K max. Similarly, we consider two more architectures for the proposed KLUCB+UCB algorithm Table I ) offer lower resource utilization and power consumption.
Furthermore, when K max = 4 and K = 2, the proposed approach consumes only 1.648 W compared to 1.826 W in the Velcro approach, thereby offering around 5-10% power saving for a small-to-medium (K/K max) ratio. Also, if we replace algorithm X in Rows 5 and 6 with more complex algorithms such as Bayesian UCB, TS or a KLUCB extension, or when the number of arms is large, i.e. K max > 20, there will be further improvement in these savings due to the proposed reconfigurable approach. In Row 7, we consider the ARM-only realization of the algorithm. Though its power consumption, around 1.567 W, is the lowest due to the hard processor, it has poor latency and hence poor throughput, as discussed next.
In Table II, we compare the execution time of various algorithms on three platforms: 1) Complex ZSoC (ARM + FPGA) as shown in Fig. 5, 2) Only ARM, and 3) ARM+NEON Co-processor. The execution time on the ARM is the highest, followed by the ARM+NEON platform, while ZSoC offers the best performance, validating the proposed hardware-software co-design approach. Between the KLUCB and KLUCB+UCB approaches, the latter offers more than 88% reduction in execution time over the former. Thus, the proposed KLUCB+UCB approach offers lower execution time (Table II) without compromising on reward performance (Fig. 7). In wireless applications, MAB algorithms are realized in upper layers (MAC/Network), i.e. in ARM or other processors, while the PHY is present in the SoC [5], [8]. The proposed architecture enables the shifting of the MAB algorithms from the MAC to the PHY layer, along with an embedded intelligence unit, and offers an acceleration factor ranging from 50 to 100. Based on the results, we can say that the acceleration factor increases with the increase in K. Next, we highlight the effect of β on the performance of the KLUCB algorithm. As shown in Table III, as the value of β increases, the execution time increases due to the sequential iterations in Algorithm 1. However, the rate of increase is substantially lower in the proposed hardware-software co-design approach. In addition, we can see that the reward improves with β and thus, the appropriate value of β should be chosen to meet the desired trade-off between execution time and performance. In terms of learning performance, we observed that the error between the actual and learned statistics decreases with an increase in β.
We also analyze the usefulness of the proposed architecture for cognitive ad-hoc wireless networks, where a radio user aims to select the optimum channel for throughput maximization [5]. Here, throughput refers to the number of bits transmitted per second (bps). We assume K = 4, N = 10000 and consider two different types of channels with statistics, µ, randomly generated. As shown in Table IV, the proposed ZSoC based architecture transmits a higher number of data bits and achieves higher throughput than the ARM+NEON based architecture. Another interesting observation is that KLUCB leads to the transmission of a higher number of bits, but the proposed KLUCB+UCB offers significantly higher throughput due to its lower execution time. For applications where an FPGA is not available, the throughput of the ARM+NEON based KLUCB+UCB realization is close to that of ZSoC based KLUCB and at least 10 times higher than that of ARM-based KLUCB. We may be able to achieve further improvement in throughput if we select an appropriate UCB algorithm (UCB, UCB V or UCB T). Embedding such intelligence to select an algorithm is a challenging problem and the focus of future work.
VI. CONCLUSIONS AND FUTURE DIRECTIONS
A novel intelligent, reconfigurable, fast and computationally efficient architecture for the Kullback-Leibler based Upper Confidence Bound (KLUCB) algorithm is presented in this paper. The performance analysis based on average reward, execution time, resource utilization and throughput highlights the advantages of the proposed architecture and its suitability for applications such as intelligent radio based wireless networks. Building on this state-of-the-art platform, future work will focus on open research problems such as intelligence to select the algorithm as well as an optimal adaptation strategy in dynamic and uncertain environments.
Introduction
In this lab, you will use Vivado IPI and the Software Development Kit to create a reconfigurable peripheral using the ARM Cortex-A9 processor system on Zynq. You will use Vivado IPI to create a top-level design, which includes the Zynq processor system as a sub-module. During the PR flow, you will define four Reconfigurable Partitions having three Reconfigurable Modules (UCB_V, UCB and UCB_tuned). You will create multiple Configurations and run the Partial Reconfiguration implementation flow to generate full and partial bitstreams. You will use the ZC706 to verify the design in hardware, using an SD card to initially configure the FPGA, and then partially reconfigure the device using the PCAP under user software control.
Github Link: https://github.com/Sai-Santosh-99/PR_UCB
Design Description
The purpose of this lab exercise is to implement a design that can be dynamically reconfigured using the PCAP resource and the PS sub-system. The system consists of four peripherals (arms) having three unique function calculation capabilities (UCB_V, UCB and UCB_tuned). The output (Quality-factor) of these machines goes into a Comparator IP, which selects the machine with the largest Q-factor value. The user verifies the functionality using a user application. The dynamic modules are reconfigured using the PCAP resource available through the Device Configuration block. The design is shown in Figure 1.
Figure 1. The design
The Sources directory provides the machine cores, the source files for UCB_V, UCB and UCB_tuned, the software application (C code TestApp.c), and a placeholder for the floorplan constraints (floorPlan.xdc). The Synth directory and its sub-directories will hold the synthesized checkpoints, the Implement directory and its sub-directories will hold the implemented configurations, the Checkpoint directory will hold the static and the two configuration checkpoints, and the Bitstreams directory will hold the generated full and partial bitstreams. In the home directory, there are several Tcl scripts that perform tasks including the processor system creation and the bottom-up synthesis of the reconfigurable modules.
Procedure
This lab is separated into steps that consist of general overview statements that provide information on the detailed instructions that follow. Follow these detailed instructions to progress through the lab.
General Flow for this Lab

Step 1: Generate DCP for Static and RM modules
Step 2: Load Static and one RM for each RP
Step 3: Define Reconfigurable Properties
Step 4: Define Reconfigurable Partitions
Step 5: Run Design Rule Checker
Step 6: Create and Implement First Configuration
Step 7: Create Other Configurations
Step 8: Run PR_Verify
Step 9: Generate Bit Files
Step 10: Generate Software Application
Step 11: Test the Design

Generate DCPs for the Static Design and RM Modules
Step 1
This script will create the block design called system and instantiate the ZYNQ PS with the SD 0 and UART 1 interfaces enabled. It will also enable the GP0 interface along with the FCLK0 and RESET0_N ports. The provided machines IP, PR decoupler and the Comparator will also be instantiated. It will then create a top-level wrapper file called system_wrapper.v, which instantiates system.bd (the block design).
1-2. Synthesize the design to generate the dcp for the static logic of the design.
1-2-1. Click Run Synthesis under the Synthesis group in the Flow Navigator to run the synthesis process. Wait for the synthesis to complete. When done, click Cancel.
1-2-2. Using Windows Explorer, copy the system_wrapper.dcp file from tutorial\tutorial.runs\synth_1 into the Synth\Static directory under the current lab directory.
1-2-3. Copy the design checkpoints for the auto_pc, machine_arms, comparator, pr_decoupler_0, xbar_0, rst_ps7_0_20M, and processing_system7_0 instances to Synth\Static to sit alongside system_wrapper.dcp.
1-2-4. Close the project by typing the close_project command in the Tcl console or selecting File > Close Project.
1-3. Since we have RMs in HDL format, we need to synthesize them and generate the dcp for each of the RMs. The generated dcps should be stored in appropriate directories so they can be accessed correctly; in particular, the dcp files for the RMs must be in separate directories, as their dcp file names will be the same for a given RP.
1-3-1. The HDL files for all the algorithms corresponding to each machine have been provided. Synthesize each RM by creating a separate Vivado project. For the algorithms, synthesize the KL_UCB IP as kl_ucb, the UCB_V IP as q_variance, the UCB_T IP as q_tuned, and the UCB IP as Q_function. You can also see these IPs instantiated in the given HDL files.
1-3-2. Synthesize each of the RMs and write the design checkpoint (dcp) in the respective destination folder under the Synth directory. After each RM's dcp is generated, close the design.
1-3-3. At this point, the directory content will look as shown below. Here, UCB_1 denotes that this dcp corresponds to the UCB algorithm for the first machine. You just need to synthesize each algorithm for each machine separately and add the corresponding .dcp files to the respective folders.
Figure 7. Synth directory hierarchy and content
Load Static and one RM for the RP in Vivado
Step 2
Since all required netlist files (dcp) for the design are now available, you will use Vivado to floorplan the design, define Reconfigurable Partitions, add Reconfigurable Modules, run the implementation tools, and generate the full and partial bitstreams.
2-1. In this step you will load the static and one RM designs for the RP.
2-1-1. In the Tcl Shell window enter the following command to change to the lab directory and hit Enter.
cd c:/Summer/Tutorial
2-1-2. Execute the following Tcl script to load the static design checkpoint.
source load_design_checkpoints.tcl
The script will do the following:
• Load the static design using the open_checkpoint command. You can now see the design structure in the Netlist pane with an RM loaded for each of the u1, u2, u3 and u4 modules.
Define Reconfigurable Properties on each RM
Step 3
3-1. In this design you have four Reconfigurable Partitions (u1, u2, u3 and u4), each of which can hold any of the Reconfigurable Modules. Define the reconfigurable properties on each loaded RM.
Define the Reconfigurable Partition Region
3-2. Next you must floorplan the RP region. Depending on the type and amount of resources used by all the RMs for the given RP, the RP region must be appropriately defined so it can accommodate any RM variant.
3-2-1. Execute the following command to define the region for each RP, and then perform the DRC.
read_xdc floorplan.xdc
Create and Implement First Configuration
4-1. Create and implement the first Configuration.
4-1-1. Execute the following command to implement the first configuration, the UCB variant.
source create_first_configuration.tcl
The script will do the following tasks:
• The script will optimize, place and route the design by executing the following commands.
opt_design
place_design
route_design
• Save the full design checkpoint.
write_checkpoint -force Implement/UCB/top_route_design.dcp
At this point, a fully implemented partial reconfiguration design from which full and partial bitstreams can be generated is ready. The static portion of this configuration must be used for all subsequent configurations, and to isolate the static design, the current reconfigurable module must be removed.
4-2.
After the first configuration is created, the static logic implementation will be reused for the rest of the configurations. So it should be saved. But before you save it, the loaded RM should be removed.
4-2-1.
Execute the following command to update the design with the blackbox and write the checkpoint.
source lock_placement_with_blackbox.tcl
• Clear out the existing RMs executing the following commands.
update_design -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u1 -black_box
update_design -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u2 -black_box
update_design -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u3 -black_box
update_design -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u4 -black_box
Issuing these commands will result in design changes: the number of Fully Routed nets (green) decreases, the number of Partially Routed nets (yellow) increases, and rp_instance may appear in the Netlist view as empty.
• Lock down all placement and routing by executing the following command.
lock_design -level routing
Because no cell was identified in the lock_design command, the entire design in memory (currently consisting of the static design with black boxes) is affected.
• Write out the remaining static-only checkpoint by executing the following command.
write_checkpoint -force Checkpoint/static_route_design.dcp
This static-only checkpoint would be used for any future configuration, but here, you simply keep this design open in memory.
Create Other Configurations
5-1. Read next set of RM dcp, create and implement the second configuration.
5-1-1.
Execute the following command to create and implement the second configuration, the UCB_T variant.
source create_second_configuration.tcl
The script will do the following tasks, starting with the blanking configuration:
• Open the static route checkpoint.
open_checkpoint Checkpoint/static_route_design.dcp
• For creating the blanking configuration, use the update_design -buffer_ports command to insert LUTs tied to constants to ensure the outputs of the reconfigurable partition are not left floating.
update_design -buffer_ports -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u1
update_design -buffer_ports -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u2
update_design -buffer_ports -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u3
update_design -buffer_ports -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u4
• Now place and route the design. There is no need to optimize the design.
place_design
route_design
The base (or blanking) configuration bitstream, when it is generated in a later section, will contain no logic in any reconfigurable partition, only outputs driven by ground. Outputs can be tied to VCC instead, if desired, using the HD.PARTPIN_TIEOFF property.
• Save the checkpoint in the BLANK directory.
write_checkpoint -force Implement/BLANK/top_route_design.dcp 
source verify_configurations.tcl
The script will perform the following tasks:
• Execute the pr_verify command and then close the project:
pr_verify -initial Implement/TUNED/top_route_design.dcp -additional {Implement/BLANK/top_route_design.dcp Implement/UCB/top_route_design.dcp Implement/UCBV/top_route_design.dcp}
Introduction
In this lab, you will use Vivado IPI and the Software Development Kit (SDK) to create a reconfigurable peripheral using the ARM Cortex-A9 processor system on Zynq. You will use Vivado IPI to create a top-level design, which includes the Zynq processor system as a sub-module. During the PR flow, you will define four Reconfigurable Partitions having three Reconfigurable Modules (KL_UCB, UCB and Comparator). You will create multiple Configurations and run the Partial Reconfiguration implementation flow to generate full and partial bitstreams. You will use the ZC706 board to verify the design in hardware, using an SD card to initially configure the FPGA and then partially reconfiguring the device through the PCAP under user software control.
Design Description
The purpose of this lab exercise is to implement a design that can be dynamically reconfigured using the PCAP resource and the PS sub-system. The system consists of four peripherals (arms), having two unique function calculation capabilities (KL_UCB and UCB), and a reconfigurable Comparator design. The architecture is designed to switch automatically to the much more resource-efficient UCB algorithm after learning the arm statistics using KL-UCB. The outputs (quality factors, or Q-factors) of these machines go into a Comparator IP, which selects the machine with the largest Q-factor value. The user verifies the functionality using a user application. The dynamic modules are reconfigured using the PCAP resource available through the Device Configuration block. The design is shown in Figure 1.
Figure 1. The design
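The data path described above (each machine produces a Q-factor and the Comparator picks the largest) can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware implementation: rewards are assumed to be Bernoulli, the exploration constant alpha and the arm statistics are hypothetical, and the KL-UCB index is found by plain bisection as a software stand-in for the synthesizable computation used in the architecture.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def q_ucb(mean, pulls, t, alpha=2.0):
    """UCB Q-factor: empirical mean plus an exploration bonus (alpha is an assumption)."""
    return mean + math.sqrt(alpha * math.log(t) / pulls)

def q_kl_ucb(mean, pulls, t, iters=30):
    """KL-UCB Q-factor: largest q in [mean, 1] with pulls * KL(mean, q) <= log(t)."""
    budget = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):  # bisection: KL(mean, q) increases with q on [mean, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def comparator(q_values):
    """Return the index of the machine with the largest Q-factor."""
    return max(range(len(q_values)), key=lambda i: q_values[i])

# Hypothetical statistics for the four arms after t = 100 rounds.
means = [0.2, 0.5, 0.8, 0.4]
pulls = [10, 20, 60, 10]
t = 100
q_kl = [q_kl_ucb(m, n, t) for m, n in zip(means, pulls)]
q_u = [q_ucb(m, n, t) for m, n in zip(means, pulls)]
print("KL-UCB selects arm", comparator(q_kl))
print("UCB selects arm", comparator(q_u))
```

The KL-UCB index needs an iterative search per arm, whereas the UCB index is a closed-form expression, which is consistent with the motivation for switching to UCB once the arm statistics have been learned.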
The Sources directory provides the machine cores, the source files for KL-UCB and UCB, the software application (C code, TestApp.c), and a placeholder for the floorplan constraints (floorPlan.xdc). The Synth directory and its sub-directories will hold the synthesized checkpoints, the Implement directory and its sub-directories will hold the implemented configurations, the Checkpoint directory will hold the static checkpoint and the two configuration checkpoints, and the Bitstreams directory will hold the generated full and partial bitstreams. In the home directory, there are several Tcl scripts that perform tasks including the processor system creation and the bottom-up synthesis of the reconfigurable modules.
Procedure
General Flow for this Lab
[Flow diagram not recovered. Surviving labels: Step 1, Generate DCPs for the Static Design and RM Modules; Create Other Configurations; Step 11, Test the Design.]
Generate DCPs for the Static Design and RM Modules
This script will create the block design called system and instantiate the Zynq PS with the SD 0 and UART 1 interfaces enabled. It will also enable the GP0 interface along with the FCLK0 and RESET0_N ports. The provided machine IPs, the PR decoupler and the two Comparator IPs will also be instantiated. It will then create a top-level wrapper file called system_wrapper.v, which instantiates system.bd (the block design).
1-1-6. Select File > Save Block Design.
1-2.
Synthesize the design to generate the dcp for the static logic of the design. 
1-3.
1-3-1.
The HDL files for all the algorithms have been provided for each machine. Synthesize each RM by creating a separate Vivado project. For the algorithms, synthesize the KL_UCB IP as kl_ucb and the UCB IP as Q_function. You can also see these IPs instantiated in the given HDL files.
1-3-2.
1-3-3.
At this point the directory content will look as shown in Figure 7. Here, UCB_1 denotes that this dcp corresponds to the UCB algorithm for machine 1. You only need to synthesize each algorithm separately and add the corresponding .dcp files to the respective folders.
Figure 7. Synth directory hierarchy and content
1-3-4. Also synthesize both comparator modules to generate their dcps. One of the modules will be switched to the blanking configuration once the architecture switches to the resource-efficient UCB algorithm.
Load Static and one RM for the RP in Vivado
2-1.
In this step you will load the static and one RM designs for the RP.
2-1-1.
In the Tcl Shell window, enter the following command to change to the lab directory and press Enter.
source load_design_checkpoints.tcl
• Load the static design using the open_checkpoint command.
source create_first_configuration.tcl
write_checkpoint -force Implement/KL/top_route_design.dcp
4-2.
4-2-1.
source lock_placement_with_blackbox.tcl
update_design -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u1 -black_box
update_design -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u2 -black_box
update_design -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u3 -black_box
update_design -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u4 -black_box
update_design -cell system_i/ComparePR/inst/comparatorPR_v1_0_S00_AXI_inst/u1 -black_box
lock_design -level routing
write_checkpoint -force Checkpoint/static_route_design.dcp
Create Other Configurations
5-1. Read next set of RM dcp, create and implement the second configuration.
5-1-1.
source create_second_configuration.tcl
• First, it will open the blanking configuration using the tcl command:
write_checkpoint -force Implement/UCB/top_route_design.dcp
• Close the project
close_project
5-2. Create the blanking configuration.
5-2-1.
Execute the following command to create and implement the blanking configuration.
source create_blanking_configuration.tcl
update_design -buffer_ports -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u1
update_design -buffer_ports -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u2
update_design -buffer_ports -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u3
update_design -buffer_ports -cell system_i/machine/inst/machine_arms_v1_0_S00_AXI_inst/u4
update_design -buffer_ports -cell system_i/ComparePR/inst/comparatorPR_v1_0_S00_AXI_inst/u1
place_design
route_design
write_checkpoint -force Implement/BLANK/top_route_design.dcp
• Close the project
close_project
Run PR_Verify
6-1. You must ensure that the static implementation, including interfaces to reconfigurable regions, is consistent across all Configurations. To verify this, run the PR_Verify utility.
6-1-1.
Run the pr_verify command from the Tcl Console.
source verify_configurations.tcl
pr_verify -initial Implement/KL/top_route_design.dcp -additional {Implement/BLANK/top_route_design.dcp Implement/UCB/top_route_design.dcp}
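To make explicit what this call checks, the short sketch below assembles the same pr_verify invocation and lists the checkpoint pairs it compares: the -initial checkpoint against each checkpoint given to -additional. The helper names pr_verify_command and compared_pairs are hypothetical and exist only for this illustration; the checkpoint paths are the ones used in this lab.

```python
def pr_verify_command(initial, additional):
    """Assemble a pr_verify call that checks `initial` against each additional checkpoint."""
    return "pr_verify -initial {} -additional {{{}}}".format(initial, " ".join(additional))

def compared_pairs(initial, additional):
    """List the (initial, other) checkpoint pairs that the call compares."""
    return [(initial, other) for other in additional]

initial = "Implement/KL/top_route_design.dcp"
additional = [
    "Implement/BLANK/top_route_design.dcp",
    "Implement/UCB/top_route_design.dcp",
]

print(pr_verify_command(initial, additional))
for first, second in compared_pairs(initial, additional):
    print(first, "vs", second)
```

Adding a further configuration to the check only requires appending its routed checkpoint to the additional list.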
You should see a message indicating that the KL configuration is compatible with BLANK and that the KL configuration is compatible with UCB. Execute the following command to close the project.
close_project
Generate Bit Files
7-1. After all the Configurations have been validated by PR_Verify, full and partial bit files must be generated for the entire project.
