Run-Time Management (RTM) systems are used in embedded systems to dynamically adapt hardware performance to minimise energy consumption.
implementations in C. Formal verification is used to ensure correctness of the Event-B models. The portability offered by our methodology is validated by modelling a Reinforcement Learning (RL) based RTM for two embedded applications and generating implementations for three different platforms (ARM Cortex-A8, A7 and A15) that all achieve energy savings on the respective platforms.
Keywords: Run-Time Management, Code Generation, Formal Methods, Verification
Introduction
Dynamic Voltage and Frequency Scaling (DVFS) has been used to reduce the energy consumption of mobile and embedded systems at run-time, while maintaining a required Quality of Service (QoS) [1, 2, 3]. In a cross-layer approach to DVFS control, a Run-Time Management (RTM) system interacts with both the application (to ensure that QoS requirements are met) and the hardware platform (to monitor and control core activities). RTM typically includes workload prediction and machine learning algorithms [4, 5].
There are a number of complexities associated with the RTM system implementation. One of the challenges is that it is coupled with the hardware platform specifications, and is implemented individually for each specific platform. Hardware specifications vary from one platform to another, and include a number of characteristics including performance parameters and interaction interfaces. Performance parameters include the range of Voltage and Frequency (VF) settings, the range of workload types to execute an application, and the DVFS latency. Interaction interfaces define the connection between the hardware platform, the application and the RTM. The former influence the RTM algorithms. The latter influence the way the RTM interacts with the application and the hardware, e.g. to read QoS from the application, to control VF in the core, and to monitor the workload from the core.
In this paper, we present a framework for developing RTM systems in a way that is independent of the platform specification diversity, making RTM designs portable across different platforms. Our approach uses a formal method to design a high level model of the RTM system, and generates the implementation automatically from the formal model. Formal methods are mathematically-based techniques used for specifying and reasoning about software and hardware systems [6]. We use the Event-B formal method [7] to model and verify RTM systems. The performance parameters and interaction interfaces are instantiated for a specific platform in order to generate a platform-specific RTM implementation from a platform-independent design model. Code generation has been introduced in the Event-B formal method to bridge the gap between abstract specifications and implementation [8].
While the design model is independent of the platforms, the generated code is specific to each platform.
The other challenge associated with the RTM system is its correctness.
An RTM mechanism should not compromise the reliability or performance of the platform it is managing. Formal modelling is associated with the verification techniques which can ensure the correctness of the RTM design.
The use of formal methods helps to reduce costs by identifying specification and design errors at early development stages when they are cheaper to fix [7].
To validate the portability offered by our approach, we have modelled a platform-independent Reinforcement Learning (RL) based RTM for deadline-based applications and generated platform-specific implementations for three different platforms: ARM Cortex-A8, A7 and A15. The impact analysis shows energy savings on the respective platforms. The run-time algorithms are based on the work of [9], which uses prediction for estimating the workload and RL to select the VF setting; however, [9] does not include a formal approach or a design model.
In [10] we presented work on automatic code generation of an RTM implementation from a platform-dependent Event-B design model for a specific platform (ARM Cortex-A8) and a video decoder application. In this paper we present a general model-based framework for RTM generation that deals with platform diversity through model parameterisation and customised code generation, and we present a more comprehensive validation through experimentation with three different platforms and a wider range of applications. To the best of our knowledge, no work on a formal design of platform-independent RTM systems followed by automatic generation of RTM implementation has been reported.
The paper is structured as follows: Section 2 outlines the platform architectural diversity that motivates our research and an approach to address platform-independent design. Background knowledge, including learning-based power management and the Event-B formal method, is presented in Section 3. Section 4 explains our model-based framework for embedded RTM in detail. Finally, Section 5 presents the experimental results for three platforms and Section 6 concludes and outlines the future work.
Motivation: Addressing Platform Specification Diversity
Our RTM approach uses Reinforcement Learning (RL) [11] to achieve optimal decisions for VF settings. An RL-based RTM is highly dependent on hardware platform specifications. The objective of RL is to learn and make better decisions under workload variation. In the exploration stage, random actions are taken and the corresponding responses (rewards or penalties) are recorded in a lookup table called a QTable. In the exploitation stage, the decisions that can achieve the highest rewards are applied. Decisions in RL terminology are known as actions and the workloads are known as states.
This information is stored as rows and columns in the QTable.
To implement the RL correctly on various platforms, the differences between the platforms need to be identified. We have implemented a video decoding application on three different platforms: ARM Cortex-A8, A7 and A15 processors. Table 1 shows the platform-specific parameters, including the number of VF pairs, the DVFS switching latency and the relative performance of each platform measured by Cycles Per Instruction (CPI) normalised to that of the A8. CPI can differ between applications (due to different instructions), and the CPI data presented in Table 1 represents average CPI measurements based on the video decoder application. These platform-specific parameters can influence the implementation of the RL algorithm; for example, the size of the QTable is different for each platform because of the difference in the number of VF pairs (2nd column) and the CPI (last column). In addition, the switching between different VF pairs is not instant, and the DVFS switching latency (3rd column) needs to be subtracted from the deadline when calculating rewards and penalties. The other difference concerns the interfacing functions between the application, the run-time manager and the hardware, such as reading the deadline, controlling VF settings and monitoring the workload. All these factors affect the implementation of the RTM code. To ensure the correct functionality of the RTM code, a systematic approach is needed to identify the differences in platform parameters and generate the correct implementation for each platform.
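As a rough illustration of how such parameters could enter an implementation, the following C sketch groups them in one structure and derives the QTable size and the deadline left after the DVFS switching latency. The structure, field and function names are hypothetical and are not taken from the generated code; the numbers used with it are placeholders, not the values of Table 1.

```c
#include <stddef.h>

/* Hypothetical container for the platform-specific parameters discussed
 * above; the field names are illustrative only. */
typedef struct {
    size_t   num_vf_pairs;      /* number of VF settings (QTable columns)  */
    size_t   num_workload_bins; /* number of workload states (QTable rows) */
    unsigned dvfs_latency_us;   /* time needed to switch between VF pairs  */
} platform_params;

/* Number of QTable entries needed for a given platform. */
size_t qtable_entries(const platform_params *p)
{
    return p->num_workload_bins * p->num_vf_pairs;
}

/* Deadline left for execution once the DVFS switching latency has been
 * subtracted. Returns 0 if the latency alone exceeds the deadline. */
unsigned effective_deadline_us(const platform_params *p, unsigned deadline_us)
{
    return (deadline_us > p->dvfs_latency_us)
               ? deadline_us - p->dvfs_latency_us
               : 0u;
}
```

With placeholder values of 4 VF pairs, 5 workload bins and a 100 microsecond latency, the table has 20 entries and a 40000 microsecond deadline leaves 39900 microseconds for execution.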
To address platform-independent RTM development, we propose a framework in which the RTM design model is independent of the platform diversity, and the RTM implementation can be automatically generated specifically for each platform. Figure 1 shows an example of the framework being used for two different platforms: a Cortex-A8 and A7. The generic framework, and details of the steps, will be explained later in Section 4.
[Figure 1, from top to bottom: Requirements (high-level description of the EWMA prediction algorithm and the RL algorithm; platform parameters: frequencies, DVFS latency); Design Model (an action from the Event-B model); Step 1; Step 2; Step 3.]
Both implementations are automatically generated from the same model, even though one has 4 branches and the other has 13. In contrast to automatic generation, manually modifying one version of an implementation to a different number of branches would require re-coding and can be error-prone.
The presented model-based framework to build the RTM is intended to achieve increased productivity of RTM software in embedded systems. Our previous work [10] presents our initial effort to apply formal methods in the embedded software area, and the resulting model was specific to one platform, whereas the framework proposed in this paper demonstrates a general platform-independent model. The Platform Independent (PI) design model is reusable across different platforms with diverse core characteristics. Platform Specific (PS) core characteristics are used to instantiate the PI design model, which is then transformed into the PS executable software. Moreover, the framework addresses the correctness of the RTM design; the Event-B formal model is verified using theorem proving and model checking to ensure the correctness of the modelled properties and consistency between different refinement levels of the design model.
Background

Learning-based RTM
In this paper, we apply our approach to an RTM that manages applications with epochs of varying workload, e.g., each frame in a video decoder is an epoch and workload varies between frames. Our RTM algorithm [9] works in two phases, Prediction and Decision Making. For each frame, the RTM first predicts the workload to be executed, and then it decides the VF setting so that the predicted workload can finish execution before the epoch deadline. After the epoch has completed, the RTM learns by using feedback to update its parameters for computing future frames. To achieve the first objective, predictions of the workload for the next frame are performed using an Exponential Weighted Moving Average (EWMA) [12]. For the Decision Making, Reinforcement Learning (RL) is used [11], using the Q-Learning algorithm. The objective of RL is to learn how to make better decisions under variations. Decisions in RL terminology are known as actions, and the environment is represented as states.
The RTM algorithm is shown in Algorithm 1. For every new epoch, the RTM first predicts the workload and, based on this prediction, selects a VF value. After processing the frame, the performance is determined to fine-tune the prediction and the decision algorithms.
Exploration and Exploitation Phases: initially there is no knowledge of the system workloads, so the decision algorithm must start exploring decisions [...] It is important to note that the VF pairs are discrete, so the best decision [...] This phase of the algorithm is called the Exploitation phase. The learning algorithm [11] also penalises in case of system overload, even at the highest frequency. However, the penalty is proportional to the deadline miss time; therefore, even though running at the highest frequency incurs a penalty, it will be smaller than running at lower frequencies. In this paper, we have not modelled the system overload and so we do not support penalisation for it.
Event-B Formal Method
Event-B [7] is a formal method for system-level modelling and analysis which allows us to produce a precise formal model of the RTM algorithms that abstracts away from platform-dependent parameters and interfaces. Key features of Event-B are the use of set theory and first order logic as a modelling notation, the use of refinement to represent systems at different abstraction levels and the use of mathematical proof to verify correctness of models and consistency between refinement levels. Instead of building a single big model which can be complex and error-prone, Event-B refinement allows us to build the model gradually by introducing details of the system in each refinement level. Therefore we can verify the correctness of a model step by step. The Rodin platform [13] is an Eclipse-based IDE for Event-B that provides support for modelling and mathematical proof.
The verified design level: Our design methodology uses the Event-B formal method to create a verified PI model of the RTM system using incremental refinement.
Code generation: The gap between the design level and implementation level is bridged by the code generation tool, which automatically transforms the instantiated design model into executable C code. The code generation is performed by the code generation plugin [8] in the Rodin platform.
The implementation level: The generated RTM implementation is specific to each platform, as are the interfaces to access the QoS, control knobs such as the VF setting, and monitor core activities.
The execution level: Finally the RTM implementation is compiled by the GCC compiler and executed on the platforms using Hardware Abstraction Layers (HAL) in the Linux operating system.
We have applied our framework to develop a Reinforcement Learning (RL) RTM system for applications with soft deadlines. The Event-B design model of the RL RTM system is instantiated per HW platform and the RL algorithm is automatically generated from the instantiated Event-B design model of the RTM system for each platform. This section presents details of each framework level separately.
Requirement Level
According to the framework illustrated in Figure 2 , our requirement level includes the high level descriptions of the platform-independent RTM algorithms and platform parameters. We outline requirements on the RTM algorithms in this section and the corresponding design model is outlined in the next section.
An overview of the RTM is illustrated in Figure 3. First the application provides the required deadline, e.g., frames-per-second (FPS) for video decoding, to the RTM; then the optimal value of VF is decided by the RTM. The RTM controls the VF in the hardware and the frame is executed in the hardware. After that the actual value of workload to decode the frame is monitored.
[Figure 3: RTM overview. Application Layer: Soft Deadline Application; OS Layer: Run-Time Manager; Hardware Layer: CPU. Signals: epoch deadline, VF setting, CPU cycles.]
To achieve the required deadline set by the application, we use the learning approach outlined in Section 3.1. Details of the prediction and learning algorithms are explained and modelled next.
Design Level
The top level of Figure 4 illustrates our design architecture for Event-B modelling of the RTM. This figure presents details of the verified design level in Figure 2.
As shown in Figure 4 , the Event-B model of the RTM system comprises an abstraction level and two refinement levels. In the abstract model we focus on the main functionalities of the RTM system including the variables and actions modelling the interaction of the RTM with the application in a platform-independent way. The abstract model is followed by two levels of refinement, where the workload prediction and the RL algorithms are introduced respectively.
To manage the complexity of the final refinement, and also to prepare the model for code generation, the model is decomposed into two sub-models: a controller and an environment.
Prediction Refinements
In the abstract level, we do not model details of the workload prediction nor the decision making. In a refinement level (the middle region of Figure 5), the details of the prediction algorithm are added to the abstract events select_vf and monitor_workload.
The select_vf event is refined into two concrete events: predict_workload, where the workload is predicted, and select_vf, where the value of VF is decided based on our prediction. The monitor_workload event is also refined into two events: monitor_workload (monitoring the actual workload) and update_prediction (updating the prediction factors). In ERS, the line types indicate whether the corresponding event is a refining event (solid line) or a new event (dashed line). In refining the select_vf event, predict_workload is a new event and the concrete select_vf event refines the abstract select_vf.
The prediction algorithm estimates the workload for the next frame using a modified form of Exponential Weighted Moving Average (EWMA). The EWMA algorithm is widely used in the literature [17, 1, 18] because of its lightweight implementation.
The EWMA predictor is modelled in two levels of refinements. In the first level, the predictor is defined in terms of the full history of measured workloads; and in the second level, the predictor generates a prediction of the future value based on the average of the previous values weighted exponentially, where the most recent values have greater weight than the older ones. In Section 4.2.4, it is proved that the second definition is a correct refinement of the first one.
In the first refinement of the prediction, the predict_workload and update_prediction events are specified as follows:
  Event predict_workload =
    act1 : pwl := predict(l, n, wl_hst)
  Event update_prediction =
    act1 : wl_hst := wl_hst ∪ {n ↦ w}
    act2 : n := n + 1
Here l is a constant specifying the weighting factor, and n and wl_hst (wl_hst : (1..n) → INT) are variables specifying a frame counter and the history of measured workloads respectively. In the predict_workload event, the predicted workload variable (pwl) is assigned the predicted value through the predict operator from the EWMA theory. In the update_prediction event, the history of the workloads (the wl_hst variable) is updated to include the last (n-th) monitored actual workload (the w variable).
A theory is an Event-B component where we can introduce new mathematical operators. In this development, we have defined a theory of EWMA where the prediction operators are defined. The predict operator is defined in terms of the full history of measured workloads, with three arguments as follows (ℤ is the set of integers):
Here w(i) is the actual workload (for the i-th frame).
In the second refinement of the prediction, the predict_workload and update_prediction events are specified as follows:
  Event predict_workload = refines predict_workload
    act1 : pwl := avgwl
  Event update_prediction = refines update_prediction
    act1 : avgwl := update(l, w, avgwl)
In the predict_workload event, the pwl variable is assigned the value of the average workload variable (avgwl), where avgwl is updated in the update_prediction event according to the definition of the update operator in the EWMA theory:
Using the Event-B proof techniques (Section 4.2.4), we verify that the abstract definition, based on the full history of actual workloads, is correctly refined by maintaining a running average. The abstract definition is clearer and thus easier to validate. The refined definition is much more efficient to implement.
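The relationship between the two definitions can be illustrated with a small C sketch. It assumes a standard EWMA with an initial average of 0, because the exact operator definitions of the EWMA theory are not reproduced in the text above; the function names are hypothetical. The first function recomputes the prediction from the whole history, and the second updates a running average in constant time.

```c
#include <stddef.h>

/* Abstract form: prediction from the full history h[0..count-1].
 * ASSUMPTION: standard EWMA with initial average 0; this is not
 * necessarily the exact operator defined in the EWMA theory. */
double ewma_predict_full(double l, const double *h, size_t count)
{
    double sum = 0.0;
    double weight = l;               /* weight of the most recent sample */
    for (size_t i = count; i > 0; --i) {
        sum += weight * h[i - 1];
        weight *= (1.0 - l);         /* older samples decay exponentially */
    }
    return sum;
}

/* Refined form: constant-time update of a running average
 * (same assumption as above). */
double ewma_update(double l, double w, double avg)
{
    return l * w + (1.0 - l) * avg;
}
```

Applying ewma_update once per measured workload, starting from 0, gives the same value as ewma_predict_full over the same history, without storing the history. This is the property that makes the refined form cheaper to implement.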
The value of freq is calculated based on the predicted workload in the select_vf event:
  Event select_vf = refines select_vf
    act1 : freq := pwl * fps
This event is refined in the next refinement (decision making), where freq is selected based on the decision making algorithm.
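As a worked example of this assignment, a predicted workload of 20,000,000 cycles per frame at 25 frames per second requires 500,000,000 cycles per second, i.e. 500 MHz. The C sketch below computes this value and, as an added assumption that is not part of the event above, maps it to the lowest discrete frequency that is sufficient. The function names and the frequency table used with them are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Required clock rate in Hz: predicted cycles per frame times frames per
 * second, mirroring freq := pwl * fps. */
uint64_t required_freq_hz(uint64_t predicted_cycles, uint32_t fps)
{
    return predicted_cycles * (uint64_t)fps;
}

/* ASSUMPTION: index of the lowest available frequency that meets the
 * requirement. freqs_hz must be sorted in ascending order and n must be
 * greater than 0; if no setting is high enough, the highest one is used. */
size_t lowest_sufficient_vf(const uint64_t *freqs_hz, size_t n,
                            uint64_t required_hz)
{
    for (size_t i = 0; i < n; ++i) {
        if (freqs_hz[i] >= required_hz)
            return i;
    }
    return n - 1;
}
```

With a placeholder table of 300, 600, 800 and 1000 MHz, the 500 MHz requirement selects the 600 MHz entry.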
Decision Making (Reinforcement Learning) Refinement
The bottom region of Figure 5 shows a further refinement, where details of RL are modelled. The select_vf event and monitor_workload event are refined to include the details of the RL.
At the bottom region of Figure 5, the select_vf event is refined to spec[...]ing is needed, ε may be reduced to allow for more exploration to take place.
Below is the Event-B description of the explore and exploit events. These events are guarded based on the value of the random variable (nondeterministically chosen in the ranGenerator event). If random is greater than the exploration-exploitation ratio (ε), explore executes; otherwise exploit executes. In the body of the explore event, freq is assigned a random VF value (generated in the VFGenerator event). The exploit event assigns freq the optimal value of VF according to the predicted workload (pwl [...]
where re_pe is the reward/penalty, t is the runtime and d is the deadline.
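The guarded choice between the explore and exploit events can be sketched in C as follows. This is an illustration only: the function names are hypothetical, the random value and the random action are passed in as arguments to mirror the values produced by the ranGenerator and VFGenerator events, and the comparison direction follows the text (explore when random is greater than the exploration-exploitation ratio, called epsilon here).

```c
#include <stddef.h>

/* Exploit: column index with the highest value in one QTable row.
 * n must be greater than 0; ties resolve to the lowest index. */
size_t best_action(const double *q_row, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (q_row[i] > q_row[best])
            best = i;
    }
    return best;
}

/* Guarded choice: explore when random > epsilon, otherwise exploit. */
size_t choose_action(const double *q_row, size_t n,
                     double random, double epsilon, size_t random_action)
{
    if (random > epsilon)
        return random_action % n;    /* explore: take the random VF index */
    return best_action(q_row, n);    /* exploit: take the best known index */
}
```

Under this convention a smaller epsilon makes the explore branch more likely, which matches the remark that the ratio may be reduced to allow more exploration.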
At the bottom region of Figure 5 , the monitor workload event is refined to 470 include the update qTable event, where the workload is rewarded or penalised.
The Event-B specification of the update qTable event is as follows:
Event update qTable = 25 any i when The Ouptput specifies the translation of the Event-B Formula to the appropriate syntax in the C programming language.
26
The value of the re_pe variable is assigned in the cost_reward_assign and cost_penalty_assign events. The cost_reward_assign event is as follows:
  Event cost_reward_assign =
    when grd : (w / freq) ≤ d
    then act1 : re_pe := min(1, cost_reward(w, freq, d))
The cost_reward_assign event can execute only when its guard (grd) holds. The guard specifies that the finish time is less than or equal to the deadline, meaning the deadline is achieved and the QTable needs to be rewarded. The cost_reward is defined as an operator in the RL theory based on Equation 2.
Figure 6 shows the evolution of the QTable. Initially, the values in the QTable are all zeros (Figure 6(A)). In the exploration phase the QTable will be filled with values indicating rewards or penalties (Equation 2). In the exploitation phase, the 'best' actions are determined based on the QTable entries with the highest rewards (highlighted in Figure 6(B)).
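The reward assignment and the filling of the QTable can be sketched in C as below. Two parts are assumptions made purely for illustration: the reward and penalty formulas stand in for Equation 2, which is not reproduced in the text above, and the update rule is a simplified running update rather than the rule of the update qTable event. Only the guard (finish time no later than the deadline) and the min(1, ...) clamp come from the event shown earlier. All names and table dimensions are hypothetical.

```c
#include <stddef.h>

#define N_STATES  3   /* hypothetical number of workload states */
#define N_ACTIONS 4   /* hypothetical number of VF pairs        */

/* ASSUMED stand-in for Equation 2: reward t/d when the deadline is met,
 * penalty proportional to the miss time otherwise. */
double reward_or_penalty(double w_cycles, double freq_hz, double d_sec)
{
    double t = w_cycles / freq_hz;       /* finish time for this epoch   */
    if (t <= d_sec) {                    /* guard of the reward event    */
        double r = t / d_sec;
        return r < 1.0 ? r : 1.0;        /* min(1, reward)               */
    }
    return -(t - d_sec) / d_sec;         /* penalty grows with miss time */
}

/* ASSUMED update rule: move the entry towards the new reward/penalty. */
void qtable_update(double q[N_STATES][N_ACTIONS], size_t s, size_t a,
                   double re_pe, double alpha)
{
    q[s][a] += alpha * (re_pe - q[s][a]);
}
```

Starting from an all-zero table, each call changes only the entry for the visited state and action, so the table fills in gradually during exploration.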
Model Decomposition: as shown in Figure 4, the final refinement is divided into two smaller sub-models. The controller sub-model includes the RTM actions: predict_workload, ranGenerator, explore, exploit, VFGenerator, updateE and update_qTable. The environment sub-model includes the actions to interact with the application and hardware: set_fps, execute_frame and monitor_workload.
Verification
The Event-B model of the RTM was verified using Rodin theorem proving. In the last refinement before model decomposition, 76 proof obligations (POs) were generated, of which 96% were proved automatically, mostly associated with correct sequencing of events. A manually proved PO is presented here as an example of verification.
As presented in Section 4.2.2, the prediction refinement consists of two levels. The following invariant captures the relationship between the avgwl variable of the refinement and the workload history (wl_hst) of the abstract model:
  inv1: avgwl = predict(l, n, wl_hst)
This invariant is required to prove that the action of the refined prediction event correctly implements the abstract event. To prove this invariant, we introduce the following theorem, which shows the algebraic connection between the concrete update operator and the abstract predict operator:
  thm1: ∀n, w · n > 0 ∧ w ∈ ℤ ⇒ update(l, w, predict(l, n, wl_hst)) = predict(l, n + 1, wl_hst ∪ {n ↦ w})
The invariant and theorem are proved interactively with the Rodin theorem prover.
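The algebra behind the theorem can be illustrated under an assumed standard EWMA definition. The operator definitions of the EWMA theory are not reproduced in the text above, so the two definitions on the first lines below are assumptions for illustration, with h standing for the workload history and h' for the history extended with the pair (n, w).

```latex
% Assumed definitions (standard EWMA with initial average 0):
%   predict(l, n, h) = \sum_{i=1}^{n-1} l (1-l)^{n-1-i} h(i)
%   update(l, w, a)  = l w + (1-l) a
\begin{align*}
\mathrm{update}(l, w, \mathrm{predict}(l, n, h))
  &= l\,w + (1-l) \sum_{i=1}^{n-1} l\,(1-l)^{\,n-1-i}\, h(i) \\
  &= l\,(1-l)^{0}\, w + \sum_{i=1}^{n-1} l\,(1-l)^{\,n-i}\, h(i) \\
  &= \sum_{i=1}^{n} l\,(1-l)^{\,n-i}\, h'(i),
     \qquad h' = h \cup \{ n \mapsto w \} \\
  &= \mathrm{predict}(l, n+1, h')
\end{align*}
```

Under these assumptions, one application of update to the previous prediction gives the prediction over the extended history, which is the shape of the equality stated in thm1.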
We also analysed our model using ProB to ensure that the model is deadlock free and convergent. At any point during model checking, at least one of the events of the model should be enabled to ensure that the model is deadlock free. For each new event added in the refinements, we have verified that it would not take control forever (convergence). In addition, INV POs ensure that the new events keep the existing ordering constraints between the abstract events. The ordering between events is specified as invariants, and the PO associated with each invariant ensures that its condition is preserved by each event.
Code Generation and Implementation Level
The Event-B model of the RTM system is automatically translated to executable C code using the code generation plugin of the Rodin toolset.
The bottom level of Figure 4 illustrates the procedure of generating the RTM software to be executed on the hardware. To generate code, the controller is instantiated with the platform-specific parameters for one platform and translated to the "Controller.c" file. The platform-specific parameters are translated to C variable definitions in the "Common.c" file. The environment, modelling the interactions, is translated to the signatures of C functions representing the interactions. Since in the platform-independent design model we abstracted from details of the interactions between the RTM and the application and HW layers, the specific interaction APIs for each platform need to be called in the generated environment file. Our experimental results from executing the generated code on various hardware platforms are presented in Section 5.
How Code Generation works: As shown in Figure 4, after decomposition, the sub-models are refined to be prepared for translation into C code.
Tasking Event-B sub-models define the control flows between events. [...] Part is translated to a set of "if then else" branches, one per frequency (the number of qTable columns), to modify the qTable.
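The shape of such a branch chain can be sketched in C for a hypothetical platform with four frequencies. The frequency values, the function name and the update performed in each branch are assumptions for illustration and are not copied from the generated Controller.c.

```c
#define N_FREQS 4   /* hypothetical platform with four VF pairs */

/* One branch per frequency, in the style of an unrolled if-then-else
 * chain. The branch matching the current frequency (hypothetical values,
 * in MHz) modifies its own qTable column. Returns the column index that
 * was modified, or -1 if the frequency is not known. */
int update_qtable_column(double q_row[N_FREQS], unsigned freq_mhz,
                         double re_pe)
{
    if (freq_mhz == 300u) {
        q_row[0] += re_pe;
        return 0;
    } else if (freq_mhz == 600u) {
        q_row[1] += re_pe;
        return 1;
    } else if (freq_mhz == 800u) {
        q_row[2] += re_pe;
        return 2;
    } else if (freq_mhz == 1000u) {
        q_row[3] += re_pe;
        return 3;
    } else {
        return -1;   /* unknown frequency: nothing is modified */
    }
}
```

A platform with 13 frequencies would need 13 such branches, which is why regenerating the chain from the instantiated model is less error-prone than editing it by hand.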
As shown in Figure 1, the generated environment is similar for both platforms [...] and read_cpu_A7_cycles use platform-specific assembly instructions to read the cycle counter for Cortex-A8 and Cortex-A7 respectively. These API functions need to be implemented specifically for each platform to address the differences in both OS and hardware controls. The next section will describe the adopted architecture for the generated RTM at the OS layer.
Execution Level and Hardware Abstraction Architecture
As discussed in previous sections, the model of the RTM is automatically translated into C for its implementation. To keep the RTM model generic, the Controller sub-model does not take into account the hardware/application-specific calls needed to interact with the hardware and application layers (included in the Environment sub-model). Dividing the model into Controller and Environment sub-models is in line with the HAL (Hardware Abstraction Layers) principle [19]: the Controller sub-model is abstracted from platform dependencies, while the Environment sub-model provides platform-specific calls to get/set monitors/knobs. An interface to provide these functions has been designed. Figure 7 shows the modified RTM diagram from Figure 3, where the box in the centre represents the generic auto-generated RTM code, and the highlighted boxes provide the interactions with the hardware and application layers. The translated environmental functions call these interaction interfaces.
[Figure 7: Hardware abstraction architecture. Application Layer: Soft Deadline Application; OS Layer: Code Generated Run-Time Manager with App. Annotations, Freq Changer and Perf. Monitor interfaces; Hardware Layer: CPU. Signals: FPS, VF setting, CPU cycles.]
Hardware Abstraction Architecture: In order for the generated RTM to sit at the OS layer, it has been implemented as a Linux Governor [20] through a Loadable Kernel Module (LKM), which provides the interface and drivers to make the VF changes and monitor the workload. This Governor provides the three interfaces needed for the algorithm: the Frequency Changer, the Performance Monitor and the Application Annotations.
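The separation between a platform-independent controller and platform-specific interfaces can be sketched in C with a table of function pointers. Everything below is hypothetical: the type and function names are invented for illustration, and the simulated back end replaces the real kernel module and hardware so that the sketch can run on a host machine.

```c
#include <stdint.h>

/* Hypothetical environment interface with one entry per interface
 * named in the text. */
typedef struct {
    uint32_t (*read_deadline_fps)(void);     /* Application Annotations */
    int      (*set_frequency)(uint32_t khz); /* Frequency Changer       */
    uint64_t (*read_cycles)(void);           /* Performance Monitor     */
} rtm_env;

/* Simulated back end for host testing; a real platform would provide its
 * own implementations of these three functions. */
static uint32_t sim_freq_khz = 0;
static uint64_t sim_cycles = 0;

static uint32_t sim_read_deadline_fps(void) { return 25u; }
static int sim_set_frequency(uint32_t khz) { sim_freq_khz = khz; return 0; }
static uint64_t sim_read_cycles(void) { sim_cycles += 1000000u; return sim_cycles; }

const rtm_env sim_env = {
    sim_read_deadline_fps, sim_set_frequency, sim_read_cycles
};

/* Platform-independent controller step: it only talks to the interface,
 * so the same code runs against any back end. Returns the number of
 * cycles observed across the epoch. */
uint64_t controller_epoch(const rtm_env *env, uint32_t khz)
{
    uint64_t before = env->read_cycles();
    env->set_frequency(khz);
    /* ... the frame would be executed here ... */
    uint64_t after = env->read_cycles();
    return after - before;
}
```

Porting to another platform then amounts to supplying a different rtm_env instance, while controller_epoch stays unchanged.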
As part of the HAL in Linux, the RTM implementation uses sysfs for [...] After the RTM C code is generated, it is cross-compiled with the installed Linux and processor architecture to create the respective LKM. When the LKM is loaded, it waits for the read_deadline call from the application and the start_governor call to start working. The deadline t_deadline is given by the frame rate fps, so t_deadline = 1 / fps.
The time taken to process the frame (t_frame) is obtained by reading timestamp(n) from the global system clock, given in microseconds, every frame, and taking the difference from the previous frame's timestamp(n − 1), i.e. t_frame = timestamp(n) − timestamp(n − 1). The t_frame is then compared with t_deadline to decide whether the deadline was passed (t_frame ≤ t_deadline) or not.
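The deadline check can be sketched in C as follows, assuming that the deadline is the reciprocal of the frame rate expressed in microseconds and that a deadline counts as passed when the frame time does not exceed it. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-frame deadline in microseconds for a given frame rate (fps > 0). */
uint64_t deadline_us(uint32_t fps)
{
    return 1000000u / fps;
}

/* Frame time as the difference between two consecutive timestamps,
 * both in microseconds. */
uint64_t frame_time_us(uint64_t ts_now, uint64_t ts_prev)
{
    return ts_now - ts_prev;
}

/* A deadline is passed (achieved) when the frame finished in time. */
bool deadline_passed(uint64_t t_frame, uint64_t t_deadline)
{
    return t_frame <= t_deadline;
}
```

For example, at 25 frames per second the deadline is 40000 microseconds, so a frame that took 35000 microseconds passes and one that took 45000 microseconds does not.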
Experimental Results and Evaluation
Our experiments demonstrate that we can automatically generate different platform-specific software for different architectures from the same platform-independent model, and observe the effectiveness of the generated implementations in terms of energy management. We validate our work experimentally for three different platforms and two applications in terms of performance and power consumption. Experiments were conducted on the BeagleBoard-xM with a Cortex-A8 processor and the ODROID-XU3 with both Cortex-A7 and Cortex-A15 processors. Both platforms were running the [...]
We also used the RTM for an application with different characteristics to the video decoder on an ODROID-XU3 Cortex-A15 platform: a Jacobian matrix solver followed by least-squares solution computation, targeting 10 solutions per second with a 1024x1024 randomly seeded matrix. This application demonstrates higher compute load but lower frame rate than the video decoder.
Table 2 compares the average performance and power saving between the code generated RTM and the ondemand governor for the video decoding application on three platforms; e.g., for Cortex-A7 the generated RTM achieves 98% of the performance of ondemand while using 61% of the power used with ondemand. It can be seen that across the 3 different processors the generated RTM provides better power and energy savings while maintaining similar performance. The amount of saving varies between platforms due to the difference in the number of VF pairs and in the relationship between power, voltage and frequency. The power/energy factors are calculated by using system calls to measure time, and current is measured using a current sensor by setting the operating voltage. Performance is the percentage of deadlines that are passed (achieved). This is computed by dividing the number of frames with deadlines passed by the total number of frames processed.
Regarding the Jacobian matrix solver experiment on Cortex-A15, we achieved 100% of the performance of ondemand while using 18% of the power used with ondemand. We experimented with the matrix solver at exactly 10 times a second. The results are different to the video decoder experiment, since ondemand chooses close to the maximum frequency whilst the RTM, monitoring the application throughput, recognises that the second lowest frequency is sufficient, i.e. 2000 MHz vs 300 MHz.
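The metrics used in this comparison can be written down as a few C helper functions. This is a sketch of the arithmetic only: the function names are hypothetical, power is assumed to be the product of the operating voltage and the measured current, and the values used with the functions are placeholders rather than measurements from the experiments.

```c
/* Percentage of frames whose deadline was passed (achieved). */
double performance_percent(unsigned passed, unsigned total)
{
    return total ? 100.0 * passed / total : 0.0;
}

/* Power in watts from the operating voltage (V) and the measured
 * current (A). */
double power_watts(double voltage_v, double current_a)
{
    return voltage_v * current_a;
}

/* Energy in joules from average power (W) and elapsed time (s). */
double energy_joules(double avg_power_w, double seconds)
{
    return avg_power_w * seconds;
}

/* A value expressed as a percentage of a baseline, e.g. the power of the
 * generated RTM relative to that of the ondemand governor. */
double relative_percent(double value, double baseline)
{
    return 100.0 * value / baseline;
}
```

For instance, 98 passed deadlines out of 100 frames give a performance of 98%, and a power draw of 61 units against a baseline of 100 units gives 61% of the baseline.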
The code generation that we used performs a fairly direct translation of the refined Event-B models to C, so the generated algorithms will be as efficient in terms of complexity as the source Event-B model. We had one manually-written RTM implementation of the same RTM algorithm for the Cortex-A7 to compare against in terms of size and complexity. The generated code had fewer lines than the manually written code (475 lines versus 1175 lines); the difference is largely down to coding style. Both implementations have similar algorithmic structure and thus similar algorithmic complexity.
Conclusion
We presented a model-based framework addressing complexity in RTM software programming due to the diversity of hardware platform characteristics. Although the designer needs to know the formal language and the associated toolset, the formal design model is built once and specific RTM software for different platforms is automatically generated from an identical formal design model. This can result in time saving compared to manual adjustment of the RTM implementation.
In addition to the automatic code generation, formal modelling is augmented by verification techniques. The correctness of the RTM design specifications and consistency of the refinement levels can be ensured by theorem proving and model checking.
We have validated our framework by applying it to develop an RL-based RTM system for a deadline-based application. The Event-B formal language is used to develop a single design model supporting platforms with different characteristics, and the RTM implementations are generated in the C programming language specifically for each platform.
We instantiated the RTM design model for three platforms with different characteristics and performed code generation for each of them; this is followed by evaluation of the effectiveness of the generated implementations in terms of power consumption. In all three experiments, energy saving is achieved compared to the Linux ondemand governor.
To the best of our knowledge, this is the first reported investigation into automatic generation of embedded RTM and verification using high level model specification. The focus of this paper is evaluating the support for portability of embedded RTM across multiple hardware platforms. We envisage the framework working for wider experiments. In our ongoing work the Event-B models are being refined to support RTM algorithms for multi-core architectures and concurrent applications.
