Abstract This work presents a formal verification process based on the Systerel Smart Solver (S3) toolset for the development of safety-critical embedded software. In order to guarantee that the implementation of a set of textual requirements is correct, the process integrates different verification techniques (inductive proof, bounded model checking, test case generation, and equivalence proof), each applied to the types of properties it handles best. It is aimed at the verification of properties at system, design, and code levels. To handle floating-point arithmetic (FPA) in both the design and the code, an FPA library is designed and implemented in S3. This work is illustrated on an Automatic Rover Protection system implemented onboard a robot. Focus is placed on the verification of safety and functional properties and on the equivalence proof between the design model and the generated code.
Introduction
Even though significant progress has been made toward the integration of formal methods in the safety-critical systems industry, their usability is still impaired by their cost. It makes sense to formally verify the safety-critical parts of a system by combining different verification techniques, each used where it performs best. Moreover, the hope is that once the initial integration is done, subsequent verifications can be achieved at significantly lower cost. This work investigates how this could be achieved using a formal verification toolset, Systerel Smart Solver (S3), and draws some lessons from our experience.
S3 [13] is built around a synchronous language and a model checker (S3-core) based on SAT [3] techniques. As the proof engine, S3-core relies on Bounded Model Checking (BMC) [2] and K-induction [5, 39] techniques. S3 supports different activities of a software development process (property proof, equivalence proof, automatic test case generation, simulation, and debugging) and provides the elements necessary to comply with software certification processes. S3 is usually applied in development processes relying on SCADE [10]/Lustre [25] design models and implementations in C and Ada. It has been used for years by various industrial companies as a solution to formally verify railway signaling systems.
Critical applications used to rely on fixed-point arithmetic, which requires less memory and less processor time than floating-point arithmetic (FPA) to perform non-integer computations on processors with no floating-point unit (FPU), at the price of limited precision. Floating-point numbers support a trade-off between range and precision thanks to their formulaic representation, which approximates a real number, and to their standardization based on solid mathematical grounds [27]. Nowadays, FPA is increasingly used in the space, aeronautics, and automotive industries, as required by the increasing complexity of the computations and because FPUs are becoming standard on most processors. However, a common problem for safety-critical software is the errors caused by rounding and exceptions in floating-point computations [32]. It is therefore necessary to provide verification means to guarantee the correctness of critical software that uses floating-point arithmetic.
This article is aimed at implementing a formal verification process (shown in Fig. 1) and at providing approaches and tools to guarantee the correctness of safety-critical embedded software. In this process, the textual requirements defined by the end-users are specified as formal design models by the designer. The verification engineer is responsible for verifying the conformance between the textual requirements and the design model. Once the design model is verified, the target code is either implemented by the developer or generated using a specific code generator. In both cases, the compliance between the design model and the code shall be verified by the verification engineer. In this work, we rely on the Lustre language as the design modeling language. For the verification activities, we focus on bit-level reasoning on the Lustre design model. Every Lustre model, before being executed in a concrete context (i.e., on a machine), must first be translated to a machine-executable language. Almost every such translation uses floating-point numbers as the target machine type for real numbers. Our bit-level reasoning thus follows the same principle. Both the Lustre design model and the implemented/generated code are translated to bit-level models with floating-point number semantics, making it possible to prove the compliance between them.
Our main contributions are twofold. On the one hand, we have designed and implemented in S3 a library of floating-point arithmetic (FPA-Lib) that is compliant with the IEEE Standard for FPA [27], and suitably optimized the library to obtain the bit-level model. To evaluate the performance of this implementation, we show experimental results on a triplex sensor voter using our FPA-Lib and other SAT/SMT solvers. On the other hand, we present how to use the integrated verification process in a typical development process using S3 on an Automatic Rover Protection (ARP) system deployed on a three-wheeled robot. We demonstrate a set of significant activities including specifying textual requirements, making design choices, formalizing and proving properties, and proving equivalence. Focus is placed on three main activities: (1) formally specifying the textual requirements of the embedded software, (2) ensuring the conformance between the design model and the textual requirements by proving properties using appropriate formal techniques, and (3) guaranteeing the compliance of the generated code with respect to the design model by proving the equivalence between the design model and the generated code. We have drawn some lessons about the equivalence proof and about proof-driven design guidance from this experiment. This work provides guidance to engineers who need to work on the formal development process, including the proof work. An additional purpose of this work is to make the ARP use case publicly available to the research community.
This article is organized as follows: Sect. 2 presents the S3 toolset; Sect. 3 depicts the design and implementation of floating-point arithmetic in S3, shows experimental results on the triplex sensor voter, and discusses the important issue on the solving performance; Sect. 4 describes the ARP use case; Sect. 5 exposes the verification of safety and functional properties using inductive proof, BMC, and test cases generation techniques to guarantee the correctness of the design model; Sect. 6 illustrates the process of equivalence proof to guarantee the correctness of the generated C code; Sect. 7 discusses the experience derived from the verification activities, and Sect. 8 gives some concluding remarks and discusses perspectives.
The HLL modeling language and S3 toolset
This work relies on the S3 verification toolset and its modeling language, called high-level language (HLL). The architecture of S3 is depicted in Fig. 2. S3 facilitates the construction of formal verification solutions compliant with certification standards, e.g., DO-178C [34]. Toward this goal, S3 is organized as a set of small, independent components, of which the most critical ones (an equivalence model constructor and a tool to verify the validity of the proof) are developed according to the highest integrity levels. We briefly introduce these components below:
Fig. 2 The S3 toolset
-A synchronous declarative language similar to the Lustre language [25], HLL, that is used to model the system, its environment constraints, and the expected properties of the system. To give an overview of the language constructs, Fig. 3 shows the HLL model of a saturated counter defined in the namespace Counter, and its properties on the range of output values defined in the namespace Counter_Verif. The counter reacts to the input (Command in, modeled as an HLL enumeration), which is an incrementation (INC), a decrementation (DEC), or a reset (RESET), and outputs the counting value (cnt). The saturation range is defined by HLL integer constants (C_MIN and C_MAX). The counter is initialized to zero and is then periodically updated according to the command. The effect of the commands INC and RESET is directly defined in the schedule. The effect of the command DEC is defined as a function (Fun_dec(cnt)), the contract of which is defined using an intermediate variable dec_input of the function and two HLL constraints.
-Several translators to convert models (SCADE/Lustre) and code (C and Ada) to HLL. Some restrictions are imposed on the code: dynamically allocated memory (usually forbidden by the code generator, e.g., SCADE Suite KCG), function pointers, and some C99 features such as inline functions and variable-length arrays are excluded. The reason is that these features are not recommended in embedded system development.
-Two expanders to translate the HLL models into a bit-level language, called LLL (low-level language), that only preserves Boolean streams and is restricted to three bitwise operators: negation, implication, and equivalence.
-A SAT-based proof engine, named S3-core, to check LLL models. The performance of this proof engine allows users to manage the proof of industrial-size problems: the size of those models routinely reaches ten million variables and several hundred million clauses.
-Tools to build equivalence proofs between models, or between models and code.
-Tools to animate and debug models (cex-simulator, why) that provide counterexamples. The counterexamples allow the users to define user-level simulators, which requires the definition of a mapping semantics between the user-level model and the bit-level or HLL-level model.
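The saturated counter of Fig. 3 is not reproduced in this text. As a rough illustration only, its behavior as described above can be mimicked in Python rather than HLL. The bounds chosen for C_MIN and C_MAX, the assumption that RESET restores zero, and the helper names run and range_property_holds are all hypothetical; the exhaustive check is a brute-force stand-in for the range property of Counter_Verif, not the S3 proof.

```python
from enum import Enum
from itertools import product

# Hypothetical saturation bounds (the values of Fig. 3 are not given in the text).
C_MIN, C_MAX = -2, 5


class Command(Enum):
    INC = 0
    DEC = 1
    RESET = 2


def fun_dec(cnt):
    """Decrement with saturation at C_MIN (stand-in for Fun_dec)."""
    return max(cnt - 1, C_MIN)


def step(cnt, command):
    """One synchronous step of the counter."""
    if command is Command.INC:
        return min(cnt + 1, C_MAX)
    if command is Command.DEC:
        return fun_dec(cnt)
    return 0  # RESET: assumed to restore the initial value


def run(commands):
    """Return the sequence of counter values, starting from zero."""
    cnt = 0
    trace = [cnt]
    for command in commands:
        cnt = step(cnt, command)
        trace.append(cnt)
    return trace


def range_property_holds(depth):
    """Check C_MIN <= cnt <= C_MAX on every command sequence of the given length."""
    return all(
        C_MIN <= value <= C_MAX
        for commands in product(Command, repeat=depth)
        for value in run(commands)
    )
```

Enumerating all command sequences only scales to tiny depths, which is precisely why a SAT-based engine is used on the real models.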
S3 supports the following activities of a typical formal development process:
-Static detection of runtime errors and standard conformance check, including array bounds check, range check, division by zero check, overflow and underflow check, output and constraint initialization check, etc. Proof obligations are also generated to ensure that the generated HLL models show no undefined behavior with respect to the semantics of the source language.
-Property proof: Fig. 4 presents the workflow of property proof. The design model, e.g., Lustre, is translated into an HLL model. Combined with properties also expressed in HLL, it is then expanded to an LLL model that is fed to the S3-core. If a property is falsified, a generated counterexample can be simulated at the HLL level. This activity will be detailed in Sect. 5.
-Equivalence proof: Fig. 5 presents the process of proving the equivalence between the design, e.g., the Lustre model, and the generated/implemented code, e.g., the C code. Models and code are translated into HLL models, which are then expanded to LLL models using diversified expanders (see footnote 2). Equivalence models are constructed at the HLL level and at the LLL level, respectively. The equivalence proof is performed on one of the equivalence models or on both. This activity will be detailed in Sect. 6.
-Test case generation: Test scenarios are generated from properties expressed as test goals using BMC. This activity will be detailed in Sect. 5.3.
Floating-point arithmetic library in S3
Floating-point numbers are not real numbers. Floating-point operations behave quite differently from their real counterparts, due, for instance, to rounding and cancellations [32]. Consequently, a software implementation of a mathematical expression usually provides results that are not mathematically exact. As it is often difficult to foresee the behavior of floating-point programs, formal verification of floating-point programs is mandatory. This part presents a new component in S3, the FPA library (FPA-Lib). This library addresses the verification of embedded software with FPA by means of bit-blasting (also called bit-flattening), a classic method that translates bit-vector formulas into propositional logic expressions. We first present the principles followed by our FPA-Lib implementation in Sect. 3.1, then give some evaluation data about the FPA-Lib on a triplex sensor voter case in Sect. 3.2, and finally discuss the performance of SAT/SMT solving in Sect. 3.3.
2 The diversified expanders are designed and implemented by different teams using different programming languages.
Implementation of floating-point arithmetic
The basic approaches to the formal verification of floating-point programs include abstract interpretation, static analysis, formal proof, and bit-blasting. Abstract interpretation partially executes a program on an abstract domain. This approach performs well in the static analysis of programs with floating-point variables to ensure that critical software never executes an instruction with "undefined" or "fatal error" behavior, such as out-of-bounds array accesses, overflows, or divisions by zero [6]. Formal proof is supported by proof assistants, which are very powerful but require guidance from highly skilled experts to direct the reasoning toward the target properties. Interactive theorem provers such as ACL2, Coq, HOL Light, and PVS have been applied to floating-point verification [26]. Both the abstract interpretation and the formal proof approaches lack the ability to generate counterexamples when a property does not hold. The approaches based on bit-blasting represent floating-point operations as circuits, which are then translated to Boolean formulas with bitwise operators to be solved by SAT solvers. Bit-blasting thus provides fully automatic reasoning for verifying floating-point programs. This method has been implemented in tools such as Z3 [18], MathSAT 5 [12], SONOLAR [28], and CBMC [9]. Bit-blasting is the elementary but most significant part of other floating-point verification strategies that use SAT solvers as the backend. The work in [36] aimed at a theory of floating-point arithmetic for the SMT-LIB 2.0 standard. It defined an SMT-LIB format for encoding floating-point numbers and a set of operations on them following the IEEE standard.
Since the publication of [36], solvers have started to support it using advanced QF_FP solving strategies, such as mixed abstraction in CBMC [9], non-conservation approximations in Z3 [21], abstraction into interval arithmetic in MathSAT [7, 8], and translation into nonlinear reals in Realizer [29]. In our approach, in addition to defining an HLL format for encoding FPA, we have implemented the floating-point operations using bit-blasting and optimized the resulting bit-level model.
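To make the idea of bit-blasting concrete, the following Python sketch flattens an n-bit addition into a ripple-carry circuit built only from negation, implication, and equivalence, the three operators that Sect. 2 attributes to LLL. It is an illustrative toy, not the S3 expander: the function names and the derived AND/OR/XOR encodings are choices made here, and a real flow would emit clauses for a SAT solver instead of evaluating the circuit.

```python
# Primitive operators (the three operators LLL is restricted to).
def NOT(a):
    return not a


def IMP(a, b):
    return (not a) or b


def EQV(a, b):
    return a == b


# Derived gates, expressed only with the primitives above.
def AND(a, b):
    return NOT(IMP(a, NOT(b)))


def OR(a, b):
    return IMP(NOT(a), b)


def XOR(a, b):
    return NOT(EQV(a, b))


def to_bits(value, width):
    """Little-endian list of Booleans for an unsigned integer."""
    return [bool((value >> i) & 1) for i in range(width)]


def from_bits(bits):
    return sum(1 << i for i, bit in enumerate(bits) if bit)


def add_bits(xs, ys):
    """Ripple-carry adder over Boolean lists; the result wraps modulo 2**width."""
    carry = False
    out = []
    for x, y in zip(xs, ys):
        partial = XOR(x, y)
        out.append(XOR(partial, carry))
        carry = OR(AND(x, y), AND(carry, partial))
    return out
```

Floating-point operations are flattened in the same spirit, with shifters, adders, and rounding logic all reduced to Boolean gates, which explains why the resulting bit-level models grow quickly.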
We have implemented a standalone FPA library for S3 to enable the verification of floating-point programs by means of bit-blasting. This optimized FPA library establishes a solid foundation and a basic strategy for our future investigation of advanced FPA strategies in S3. The implementation of the FPA library in S3 is based on the IEEE FPA standard 754-2008 [27]. A single (double) precision floating-point number is expressed as a binary word of 32 (64) bits that contains a 1-bit sign, a biased exponent of 8 (11) bits, and a mantissa of 23 (52) bits. We briefly present the implementation of addition/subtraction to illustrate the principles of encoding. As shown in Fig. 6, this implementation follows three steps: (1) align, to shift the mantissas and render the exponents equal; (2) addition/subtraction, to add/subtract the resulting mantissas; and (3) round, to shorten the mantissa and obtain a number in F.
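The three steps can be sketched on single-precision bit patterns in Python. This is a simplified illustration under stated assumptions, not the FPA-Lib encoding: it only handles positive normal operands whose sum does not overflow, it assumes round-to-nearest-even, and it aligns by shifting the larger operand left on unbounded integers (a hardware-style circuit would shift the smaller mantissa right and keep guard and sticky bits). The function names are hypothetical.

```python
import struct


def f32_bits(x):
    """Bit pattern of x rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b):
    """Value of a single-precision bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def decode(b):
    """Split a 32-bit pattern into sign (1 bit), biased exponent (8), mantissa (23)."""
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF


def add_f32(a_bits, b_bits):
    """Add two positive normal single-precision numbers (no overflow handling)."""
    _, ea, fa = decode(a_bits)
    _, eb, fb = decode(b_bits)
    ma, mb = fa | (1 << 23), fb | (1 << 23)  # restore the implicit leading 1
    if ea < eb:
        ea, eb, ma, mb = eb, ea, mb, ma
    # (1) align: express both mantissas at the smaller exponent eb
    ma <<= ea - eb
    # (2) add the aligned mantissas exactly
    total = ma + mb
    # (3) round: keep 24 significant bits, round to nearest, ties to even
    shift = total.bit_length() - 24
    exponent = eb + shift
    kept = total >> shift
    if shift > 0:
        remainder = total & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if remainder > half or (remainder == half and kept & 1):
            kept += 1
            if kept == 1 << 24:  # rounding carried into a 25th bit
                kept >>= 1
                exponent += 1
    return (exponent << 23) | (kept & 0x7FFFFF)
```

For example, `bits_f32(add_f32(f32_bits(1.0), f32_bits(2.0)))` evaluates to `3.0`. Signs, subnormals, infinities, and NaNs are deliberately left out to keep the sketch short.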
The FPA library in S3 includes the constructs in Table 1 . The square root and trigonometric operations are implemented using both the interpolation table and the function proposed by Cody and Waite [14] .
Evaluation of FPA-Lib on triplex sensor voter
The triplex sensor voter (TSV) is used in a common form of redundant aircraft system, triple modular redundancy (TMR), in which a voter computes an output value from the input values of three identical sensors, as shown in Fig. 7, taken from [19]. The TSV subtracts an equalization value from each of the three inputs. The voter output is the middle value of the equalized values. The equalization is based on integration (three memories of floating-point type) and is implemented using linear arithmetic operations as well as conditional expressions (i.e., saturation here).
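The selection part of the voter can be sketched as follows. This Python fragment only covers what the paragraph above states (equalization values are subtracted from the inputs and the middle equalized value is output); the integrator and saturation logic that update the equalization values are not described in this text and are therefore left out. The function names are hypothetical.

```python
def mid_value(a, b, c):
    """Middle value of three numbers, written without sorting."""
    return max(min(a, b), min(max(a, b), c))


def voter_output(inputs, equalizations):
    """Subtract each equalization value from its input, then select the middle value."""
    equalized = [value - eq for value, eq in zip(inputs, equalizations)]
    return mid_value(*equalized)
```

A call such as `voter_output([4.0, 2.5, 3.0], [0.5, 0.0, 0.25])` compares the equalized values 3.5, 2.5, and 2.75 and selects the middle one.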
The TSV case is a representative avionics model in the industrial context. It is used as a benchmark for the formal analysis of functional and non-functional properties such as stability and absence of runtime errors. In order to help the formal analysis, some parts of the model are sometimes parameterized. The formal analysis of the TSV was first studied by Dajani-Brown et al. [17], where real values were abstracted by integer values and integrators were not used. The work [19] analyzed the Simulink model with real numbers by both simulation and formal verification, and then estimated the impact of rounding errors caused by the floating-point implementation using SMT solvers and abstract interpretation. The work [11] strengthened the stability property by generating lemmas using a property-directed approach.
In our experiment, we target the stability property expressed in Eq. 1. We start from a Lustre model of the TSV and translate it to HLL using the SCADE(Lustre)-translator in S3. The FPA library is then applied to the verification of the stability property by S3. Relying on this use case, we also evaluate several SMT solvers, including Z3 v4.4. The experimental results in Table 2 show that, to the best of our solver tuning expertise, none of Z3 v4.4 (bit-blasting strategy, floating-point strategy), MathSAT 5 (bit-blasting strategy, abstract CDCL algorithm), or SONOLAR is able to handle the inductive (also called step) instance of the k-induction proof, be it in single or double precision. By combining SONOLAR bit-blasting to CNF with a pure SAT solver, we managed to prove the inductive instance in 6 min using glucose 4.0 and in 5 min using S3's own solver for the single-precision instance, and in 9 h 32 min using glucose 4.0 for the double-precision instance.
Discussion on solving performance
The performance of SAT/SMT solvers depends heavily on the tuning techniques and on the characteristics of the target problems. Therefore, our evaluation on the TSV case does not lead to any conclusion about the performance of the solvers concerned.
Our objective is to demonstrate that the optimized bit-blasting strategy for floating-point analysis used by our solver is comparable with other solvers. It is thus worth investigating novel methods based on it. That is why we do not emphasize the TSV challenge by giving a full benchmark. In addition, we aim to highlight that, among a set of state-of-the-art solvers capable of floating-point analysis, there are still performance differences whose causes are unknown. An industrial user who wants to deploy one of these solvers can thus form a first opinion.
Regarding the tuning expertise, we have worked with experts in SAT/SMT solvers when handling the TSV challenge. These experts are not developers of Z3/Glucose/MathSAT/SONOLAR, but they have rich experience in developing and tuning other solvers. The strategy we used is in fact the "best one to our knowledge" for each solver on the TSV challenge.
Unfortunately, the root reason why some solvers that generally behave well (e.g., Z3) cannot scale up to solve the TSV is still not clear. One guess is that, in a floating-point context, the inner abstraction strategy used during the reasoning is not well guided by, or combined with, the bit-blasting-oriented metrics.
The bit-level method indeed has potential performance limitations compared to other, more abstract forms of reasoning. However, it has advantages when debugging counterexamples and searching for lemmas for the induction proof. Our future work is to investigate how to improve the bit-blasting method by introducing abstract forms of reasoning that use feedback information to guide the abstraction.
The contribution of the FPA library in S3 is that it has been suitably optimized for the expander (HLL-to-LLL, also a part of the S3 tool chain), especially for single-precision floating point, since this type is widely used in industrial (legacy) systems. Compared to the version without optimization, the reduction is significant. However, since such optimization is specific to a commercial expander, and since it is difficult to generalize the same method to other SAT/SMT preprocessing, we choose not to address this "self-improvement" aspect in this article and to focus on the audience more interested in the integrated verification methodology.
Although we give no details, one conclusion may be of interest: during our optimization, we noticed that if we are sure that the backend is bit-blasting-only, we can more easily alter the implementation of FPA by using some simple but dependable metrics (e.g., state). In our experience, if we manage to decrease these metrics, the solving performance is systematically improved, regardless of the problem type. In other words, there seems to be a trade-off between "bit-blasting-only + state-oriented-reduction" and "bit-blasting + adequate abstraction + appropriate mixed strategy." So far, in our experience, the first one is easier to handle.
Specification and design of the automatic rover protection system
The context of use case
The verification approach presented in this article has been applied to the Automatic Rover Protection (ARP) system embedded in TwIRTee, a small three-wheeled robot (or "rover") used as the demonstrator of the INGEQUIP project (see footnote 6). It is used to experiment with and evaluate various methods and tools in the domains of hardware/software co-design, virtual integration of equipment [15], and formal verification [13, 20, 22-24]. TwIRTee's architecture and its software and hardware components are representative of typical aeronautical, space, and automotive systems [16]. The overall system is composed of a single stationary supervision station and a set of TwIRTee rovers moving in a controlled environment (Fig. 8). The architecture of a rover is composed of a mission subsystem and a power control subsystem. The power control subsystem is in charge of power supply, motor control, and sensor acquisition. The mission subsystem is composed of a pair of redundant channels A and B. Each channel contains a monitoring unit (MON) in charge of monitoring the data and a command unit (COM) in charge of calculating commands for the rover. The mission and power control subsystems communicate via a CAN bus. In the nominal case, each rover moves autonomously on a set of predefined tracks so as to perform its missions, i.e., moving from a start waypoint to a target waypoint under speed and positioning constraints. In this system, the ARP function is aimed at preventing collisions between the rovers. It generates the maximal accelerations and minimal decelerations that are taken into account by the rover trajectory management function. The communication between rovers is carried out via WIFI.
6 The INGEQUIP project is conducted at the IRT-Saint Exupéry. Main partners include academic members from LAAS, IRIT, ONERA, and ISAE; and industrial members from Airbus, Thales, Continental, Airbus D & S, ACTIA, SAGEM, Systerel, etc.
Here, we introduce several terms used in the paper. A mission is defined by a list of waypoints to be passed by the rover. A segment, defined by a pair of waypoints on the track, corresponds to a straight path. Segments only intersect at waypoints. The set of all waypoints and segments constitutes a map. Dedicated monitoring mechanisms ensure that if the rover leaves the track, it is placed in a stopped safe mode and the supervisor is alerted. Accordingly, we consider that all displacements of rovers comply with the map. In the use case, we consider 3 rovers moving on a map of 45 waypoints and 150 segments. A mission contains at most 20 waypoints.
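These terms map naturally onto simple data structures. The Python sketch below is one possible representation, given purely for illustration: waypoint identifiers are arbitrary strings, segments are unordered waypoint pairs, and the rule that consecutive mission waypoints must be joined by a segment is an assumption made here to express that displacements comply with the map. Only the bound of 20 waypoints per mission comes from the text.

```python
MAX_MISSION_WAYPOINTS = 20  # bound stated for the use case


def make_map(segments):
    """Build the segment set of a map; each segment is an unordered waypoint pair."""
    return {frozenset(segment) for segment in segments}


def mission_is_valid(mission, segment_set):
    """Check the mission length and that consecutive waypoints are joined by a segment.

    The adjacency rule is an assumption of this sketch, not a quoted requirement.
    """
    if not 1 <= len(mission) <= MAX_MISSION_WAYPOINTS:
        return False
    return all(
        frozenset((start, end)) in segment_set
        for start, end in zip(mission, mission[1:])
    )
```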
System-level safety and functional requirements
The requirements of ARP use case come from the industrial partners of the INGEQUIP project. The safety requirements are defined in Table 3 .
The ARP is expected to ensure the system-level safety requirement (REQ-SAF-1) in Table 3. This requirement states that at any time, the minimal distance between the centers of two rovers shall be greater than or equal to 0.4 m. It is split into two subsets of requirements: one concerns the exclusive access to segments (REQ-SAF-1-1) and the other concerns the design of the map (IR-F1, IR-F2, and IR-F3; for instance, IR-F3, "No intersection", states that there shall not be any intersection between two segments). Compliance with the requirements on map data is the map supplier's responsibility. In Jackson/Zave terminology [40], IR-F1/2/3 are assumptions about the environment rather than requirements. Classically, the "moving part" (here the rover) is intuitively referred to as the system, and the "static part" (here the map) as the environment, which often cannot be influenced by the system design. In our case, the map is considered a component of the transportation system and can be changed to help guarantee the safety of the system if such a safety requirement is difficult to ensure by adjusting the rover design. The requirements we present form a balanced list with respect to engineering criteria, because some aspects of the rover design (e.g., the physical size) are harder to change than the map. To ensure the main safety requirements, separation of rovers is implemented. Missions are elaborated off-line and transmitted via the supervision station. They are considered to be validated onboard (according to REQ-F1 in Table 4). The waypoints are considered as global resources shared by all rovers, and the reservation is ensured at system level. "System" here includes all rovers and the map. A rover may only enter a segment if it has been granted exclusive access to both the beginning and the end waypoints of the segment. The reservation mechanism is a protocol executed among the rovers to ensure that each segment (the shared resource) can be reserved by at most one rover at a time.
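The entry rule can be illustrated with a small Python sketch. It is a centralized toy, assuming a single shared table of waypoint owners, whereas the actual mechanism is a distributed protocol between rovers whose messages are not detailed here. The class and method names are hypothetical.

```python
class WaypointReservations:
    """Toy table recording which rover holds exclusive access to each waypoint."""

    def __init__(self):
        self._owner = {}

    def request(self, rover, waypoint):
        """Grant the waypoint if it is free or already held by the same rover."""
        holder = self._owner.setdefault(waypoint, rover)
        return holder == rover

    def release(self, rover, waypoint):
        """Free the waypoint, but only if the caller actually holds it."""
        if self._owner.get(waypoint) == rover:
            del self._owner[waypoint]

    def may_enter(self, rover, segment):
        """A rover may enter a segment only if it holds both end waypoints."""
        start, end = segment
        return self._owner.get(start) == rover and self._owner.get(end) == rover
```

Since two rovers can never hold the same waypoint, at most one of them can satisfy `may_enter` for a given segment, which is the exclusivity the requirement asks for.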
Our system is designed as globally asynchronous and locally synchronous. Usually, the synchronous programming scheme used in synchronous languages, such as Lustre and HLL, supposes that time is defined as a sequence of instants. To preserve determinism, these languages use the concept of instantaneous broadcast [1] when several parallel processes communicate, which means that message reception is synchronous (or simultaneous) with emission. The choice of synchronization protocol is rather an engineering consideration: some protocols need less communication (bandwidth), some are more robust to ad hoc point failures, and some perform better (less decision delay). Our project focuses on the system development method rather than on system design; thus, our choice of protocol may not be the most appropriate one for autonomous vehicles.
To support the synchrony hypothesis, we rely on two components:
-A distributed clock synchronization protocol to guarantee a common clock for all units. This protocol has been modeled and verified by model checking with TINA (TIme petri Net Analyzer), a toolbox for the editing and analysis of Petri Nets (possibly with inhibitor and read arcs), of Time Petri Nets (possibly with priorities and stopwatches), and of an extension of Time Petri Nets with data handling called Time Transition Systems.
-An implementation of the physically asynchronous logically synchronous (PALS) protocol [38]. The PALS protocol is aimed at providing logical synchrony on a physically asynchronous system. From a programming perspective, under the PALS protocol, a system composed of n units behaves as if all units were ideally synchronized. The correctness of the PALS protocol for distributed real-time systems has been proved in [31].
The functional & performance requirements are defined in Table 4 . The system-level functional requirements (REQ-F1 and REQ-F2) are mainly about excluding trivial implementations that would prevent collisions by, e.g., freezing all rovers. In the same manner, REQ-QoS-1 is introduced to guarantee the performance of the design, and to prevent trivial solutions of anti-collision, e.g., by performing missions sequentially. Note that the ARP is not to schedule the movement of the rovers but to ensure safety. Accordingly, if missions are schedulable, they shall remain schedulable with the ARP.
High-level software requirements and software design
During the software design process, the system-level requirements are refined into a set of high-level software requirements (HLRs), given in Table 5. The HLRs represent "what" is to be implemented, while the low-level requirements (LLRs) represent "how" to implement it. In this work, the LLRs are expressed by a Lustre model. Some figures about the size of the design are as follows. For an ARP system containing 3 rovers and missions of at most 20 waypoints performed on a map of 45 waypoints and 150 segments, there are about 50 variables and 1700 lines of code in the Lustre model. For space reasons, the Lustre model is not presented in the paper. We briefly describe some of its key points. Note that the model is generated by a tool from a set of parameters, including the number of rovers and of other components of the system (waypoints and segments). Because Lustre is a data flow language, manually revising the Lustre model whenever some parameters change takes a lot of effort. The ARP model generator is thus necessary.
In order to validate the design by simulation, the C code generated from the Lustre design has been embedded in a simulation model developed in the Scicos language, a graphical dynamical system modeler and simulator that allows the user to compile models into executable code. Figure 9 shows the Scicos model. Here we use this simulation model to explain the architecture of our design. The ARP is split into two parts: the decision model, which manages segment reservation, and the behavior model, which calculates the speed and position of the rover with respect to the reservation decision. As mentioned in Sect. 4.2, the problem of reserving a track segment can be reduced to the problem of managing access to critical sections in a distributed system. In our design, this problem is solved by decomposing time into "time-slots" and allocating a dedicated reservation slot to each rover. In this multi-agent system, the time-slot allocation is composed of a low-level time-triggered scheduling, which is self-contained in each rover to "tic-toc" its own slot, and of a high-level clock synchronization protocol, which is a system-level mechanism ensuring that the slot counting in each rover is synchronized, so that only one rover at a time can perform a reservation. Each time-slot is split into four sub-slots, respectively, for request, reply, reservation, and empty tasks. For example, if there are two rovers (R 1 and R 2) in the system, the first time-slot (sub-slots t0-t3) is assigned to R 1, while the second time-slot (sub-slots t4-t7) is assigned to R 2.
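The slot allocation in the two-rover example can be written as simple index arithmetic. The Python sketch below assumes that the allocation repeats cyclically once every rover has had its time-slot, which the example suggests but does not state; rovers are numbered from 0 here (R 1 is rover 0), and the function names are hypothetical.

```python
# Order of the four sub-slots inside one time-slot, as listed in the text.
SUB_SLOT_TASKS = ("request", "reply", "reservation", "empty")


def slot_owner(t, n_rovers):
    """0-based index of the rover owning sub-slot t (cyclic allocation assumed)."""
    return (t // len(SUB_SLOT_TASKS)) % n_rovers


def sub_slot_task(t):
    """Task performed in sub-slot t."""
    return SUB_SLOT_TASKS[t % len(SUB_SLOT_TASKS)]
```

With two rovers, sub-slots t0 to t3 belong to rover 0 and t4 to t7 to rover 1, matching the example above.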
Property verification against Lustre design
We have specified the system and produced a candidate validated Lustre design model in Sect. 4. Before generating C code from the Lustre model with the lus2c generator, we must guarantee that this design actually complies with the set of requirements. S3 allows users to formally express these requirements and verify them against the design model. The verification process combines inductive proof, BMC, test cases generation, and equivalence proof techniques. The first three techniques are used to verify properties of the design model. The equivalence proof technique is applied to prove the correctness of the generated code against the set of requirements by checking the equivalence between the design model and the generated code. We illustrate the property verification in this section and present the equivalence proof in Sect. 6. Figure 10 depicts the workflow of property verification using S3. The Lustre model is translated into an HLL model, to which properties and environment constraints expressed in HLL are concatenated. 11 The HLL model is then expanded into the LLL model used as the input of S3-core. This verification workflow can be split into two phases. First, BMC checks the properties for a certain time length n. If no property is violated, n is increased until either a counterexample (cex) is found or some pre-known upper bound of n is reached. If a safety property 12 fails, a cex in the form of a sequence of states is generated, where the last state contradicts the property. The cex trace is then directly exploited to debug the property, the design model, or the environment constraints. BMC is only a partial decision procedure for a model checking problem: it is not complete. Second, a complete proof of a safety property can be obtained by k-induction, strengthened if needed with inductive invariants (also referred to as lemmas hereafter). 13 K-induction relies on an iterative process that searches for lemmas by analyzing the repeatedly produced step-counterexamples until the proof is complete. Examples of k-induction proofs and BMC verification are given in Sects. 5.2 and 5.3, respectively.
11 It is the verification engineer's duty to translate the natural language requirements into HLL properties.
12 Usually, the safety referred to by requirements means that the system is safe, while the safety referred to by properties is related to the deterministic process. Here it is the latter case.
13 Lemma searching is not mandatory: a property may already be k-inductive.
Fig. 10 The workflow of property verification
Fig. 11 Step-counterexample in inductive proof
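The first phase of this workflow (check up to a time length n, then increase n) can be sketched as follows. This is an explicit-state stand-in written in Python for illustration only; S3-core performs the unrolling symbolically with a SAT solver, and the function bmc and the toy counter used to exercise it are hypothetical.

```python
def bmc(init_states, successors, prop, max_bound):
    """Look for a violation of `prop` on executions of increasing length.

    Returns a counterexample trace (list of states whose last state
    violates `prop`), or None if no violation exists within `max_bound`
    transitions. None means "unknown beyond the bound", not "proved".
    """
    frontier = [[s] for s in init_states]
    for _ in range(max_bound + 1):
        for trace in frontier:
            if not prop(trace[-1]):
                return trace
        # Unroll the transition relation by one more step.
        frontier = [trace + [nxt]
                    for trace in frontier
                    for nxt in successors(trace[-1])]
    return None


# Toy system: a counter modulo 16 that must never reach 5.
def counter_next(x):
    return [(x + 1) % 16]


def never_five(x):
    return x != 5
```

Here bmc([0], counter_next, never_five, 4) finds nothing because the bound is too small, while a bound of 20 yields the trace 0, 1, 2, 3, 4, 5, whose last state contradicts the property.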

K-inductive proof of safety property
Many works have shown that k-induction often gives good results in practice when implemented by SAT- or SMT-based model checking [5, 39]. Mathematical induction is the classical proof technique that consists of proving a base case (Eq. 2) and an inductive step case (Eq. 3). Let a transition system S be specified by an initial state condition I(x) and a transition relation T(x, x'), where x and x' are vectors of state variables. A state property P(x) is invariant for S, i.e., satisfied by every reachable state of S, if the entailments in Eqs. 2 and 3 hold for some k ≥ 0.
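In the standard presentation of k-induction, the base and step entailments take the following form. This is a sketch under that assumption, with x_0, ..., x_{k+1} denoting successive copies of the state vector.

```latex
\begin{align}
  % Base case: P holds in the first k+1 states of every execution
  I(x_0) \wedge T(x_0, x_1) \wedge \dots \wedge T(x_{k-1}, x_k)
    &\models P(x_0) \wedge \dots \wedge P(x_k) \tag{2} \\
  % Inductive step: k+1 consecutive states satisfying P force P in the next one
  T(x_0, x_1) \wedge \dots \wedge T(x_k, x_{k+1}) \wedge P(x_0) \wedge \dots \wedge P(x_k)
    &\models P(x_{k+1}) \tag{3}
\end{align}
```

Note that the step entailment does not mention I: its models may start in any state, reachable or not.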
A counterexample trace for the base entailment indicates that the property P is falsifiable in a reachable state of S. This is similar to the counterexamples produced by BMC. In contrast, a counterexample trace for the induction step entailment may start from an unreachable state or an over-approximated reachable state of S. In Fig. 11, we distinguish the reachable part of the state space and the over-approximated reachable state space. The transition T(x_n, x_{n+1}) starts from an over-approximated reachable state in step n and ends in an unreachable state in step n + 1. One way to rule out such step-counterexamples is to increase the depth k of the induction. However, some invariant properties are not inductive for any k. So, instead of increasing k, the method to enhance k-induction of a property is to strengthen the induction hypothesis using new lemmas that reduce the over-approximation of the reachable state space.
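The effect of a lemma on the induction step can be reproduced on a toy system. The Python sketch below enumerates states explicitly instead of calling a SAT solver; the system (a 3-bit counter incremented by 2), the property, and the lemma are hypothetical and unrelated to the ARP design.

```python
from itertools import product

STATES = range(8)  # toy state space: one 3-bit counter


def init(x):
    return x == 0


def trans(x, y):
    return y == (x + 2) % 8


def paths(n_states, first_ok):
    """All transition paths with n_states states whose first state is allowed."""
    for p in product(STATES, repeat=n_states):
        if first_ok(p[0]) and all(trans(a, b) for a, b in zip(p, p[1:])):
            yield p


def k_induction(prop, k):
    """Return (status, trace) with status valid, violated or indeterminate."""
    # Base case: every execution of k+1 states from an initial state satisfies prop.
    for p in paths(k + 1, init):
        if not all(prop(x) for x in p):
            return "violated", p
    # Step case: the path may start anywhere, reachable or not.
    for p in paths(k + 2, lambda x: True):
        if all(prop(x) for x in p[:-1]) and not prop(p[-1]):
            return "indeterminate", p  # step-counterexample
    return "valid", None


def prop(x):
    return x != 3


def lemma(x):
    return x % 2 == 0  # only even values are reachable


def strengthened(x):
    return prop(x) and lemma(x)
```

In this sketch, k_induction(prop, 1) is indeterminate with the step-counterexample (7, 1, 3), which starts in the unreachable state 7, whereas the strengthened property is already proved with k = 0.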
Many recent efforts are dedicated to the automatic generation of invariants (used as lemmas in this work): automatic invariant checking based on BDDs [37]; unbounded model checking using interpolation [30]; property-directed reachability (PDR) [4]; quadratic invariant generation using templates based on abstract interpretation [33]. S3 provides a lemma generation tool based on a speculation strategy that searches for equivalent variables at bit level. In our experience, it is still very difficult for those tools to generate all necessary lemmas for an arbitrary system, and manual elaboration of lemmas to complete the proof remains important. So, to keep the approach as generic as possible, we do not apply invariant generation methods. Instead, we show how lemmas can be found "manually" on the basis of the step-counterexamples. This is discussed in detail in the following paragraphs. We pick the property HLR-06-1 in Table 5 as an example to illustrate the process of inductive proof.
Example 1 HLR-06-1 states that the rover position shall be in front of or at the initial position of the reserved area.
It is formally expressed in Eq. 4, where pos_r(t) is the position of the rover r at time t, and pos_r(init_rsv) is the initial position of the reserved area of rover r at time t. This requirement is expressed in HLL using the FPA operators, given by Eq. 5, where i is the id of rover r, and FLT_ge() is the floating-point greater-or-equal operator. The notion of time cycle does not appear in Eq. 5, because time is implicit in HLL. To simplify the explanation, we suppose that the mission of each rover contains at most 5 waypoints.
$$\forall r \in \mathit{Rovers},\ \forall t \in \mathit{Time}:\quad \mathit{pos}_r(t) \geq \mathit{pos}_r(\mathit{init}_{rsv}) \tag{4}$$
Following the workflow defined in Fig. 10, BMC is executed first, and no counterexample is found within a time length of 20 cycles. Then k-induction is executed. With k = 1, a step-counterexample is found in the next inductive depth (depth = 2), shown in Fig. 12. The FPA-lib of S3 follows the IEEE 754 FPA standard; thus, a variable of float type (here variables pos1 and init_rsv1) is composed of a sign, an exponent, and a mantissa. To facilitate the explanation, the converted decimal values of floating-point numbers are given in Fig. 12. The Boolean variable rsv1[i] represents the reservation status (by the local rover) of waypoint i of a rover's mission. Values of variables pos1, init_rsv1, and rsv1 are given for steps 0-3, where a step-counterexample is produced in step 2.
Fig. 12 Step-counterexample of Property HLR-06-1
This step-counterexample contradicts the property HLR-06-1 because pos1 < init_rsv1 in step k = 2. This means that the rover is located outside the reserved area. The reserved area is in fact a set of continuous 14 reserved waypoints of the rover's mission; therefore, the calculation of init_rsv1 depends on the reservation status of the waypoints (variable rsv1). We notice that in step k = 1, the waypoints P0 and P2 in the mission are reserved (i.e., rsv1[0]=t and rsv1[2]=t), but the waypoint P1 is not (i.e., rsv1[1]=f), which means that the reserved area is not continuous. This step-counterexample does not indicate a design error. Indeed, HLR-09 in Table 5 requires that any positive reply to a reservation request shall contain a set of continuous waypoints. Unfortunately, we cannot use it as a lemma for this property, because its own inductive proof also produces step-counterexamples that need to be analyzed. We thus have two solutions: (1) express and prove a property about the continuity of the reserved area and, if it is valid, use it as a lemma to prove HLR-06-1; (2) investigate the step-counterexamples of HLR-09 until it is proved, and then use HLR-09 as a new lemma to prove HLR-06-1.
For the first solution, the added lemma is expressed in HLL as Eq. 6, where N is the number of waypoints in a mission. Using this additional lemma, HLR-06-1 as well as all other indeterminate 15 properties are proved by 1-induction. Although this step-counterexample is not due to any missing or wrong property in the specification, we still suggest reporting it to the designer, who might then decide to add the new lemma to the specification as a complementary requirement about the continuity of reserved areas. This may reduce the re-verification effort. In this case, the designer considers this implicit property important and decides to add it to the specification as a derived requirement from the development process.
14 As explained by REQ-01-4 in Table 5, we use continuous (continuity) hereafter for the fact that each waypoint has a unique preceding waypoint in a mission or in a reserved area, unless it is the initial one.
15 Indeterminate means neither valid nor violated, i.e., unknown.
For the second solution, we can first consider HLR-09 as an axiom. Inductive proof demonstrates that even if HLR-09 were proved, HLR-06-1 would remain indeterminate and a step-counterexample similar to the one in Fig. 12 would be produced again. Following the same idea, if we assume that all indeterminate properties except HLR-06-1 are valid, all the step-counterexamples indicate that step k + 1 contains non-continuous reserved areas. This leads the verification engineer to add the same lemma as the one in the first solution.
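The continuity condition that both solutions converge on can be illustrated with a small Python predicate. This is a sketch of one plausible reading of continuity (the reserved waypoints form a single unbroken run of indices) and not the HLL lemma of Eq. 6; the function name is_continuous and the treatment of an empty reservation are assumptions.

```python
def is_continuous(rsv):
    """True if the reserved waypoints (True entries) form one unbroken run.

    `rsv[i]` is the reservation status of waypoint i of a mission.
    Assumption: an empty reservation counts as continuous.
    """
    reserved = [i for i, flag in enumerate(rsv) if flag]
    if not reserved:
        return True
    # A single run of indices has no gap between its first and last element.
    return reserved[-1] - reserved[0] + 1 == len(reserved)
```

For a 5-waypoint mission, the reservation status of the step-counterexample (P0 and P2 reserved, P1 not; the remaining two waypoints are assumed unreserved here) gives is_continuous([True, False, True, False, False]) == False, so a lemma of this kind excludes that state from the induction hypothesis.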
BMC and test cases generation
Formal verification cannot completely replace testing. Model-based test cases generation has been developed as an important technique to perform software testing and system testing. Usually, we derive the abstract test suite from the formal specification of requirements against the model of the system under test. The executable test suite is then derived from the abstract one through a mapping that converts model-level APIs, e.g., rover.mission(mis1, map) in an abstract test suite, into platform-level APIs, e.g., rover_move(wp1, wp2, map), rover_move(wp2, wp3, map) in an executable test suite.
In general, properties are classified as safety or liveness properties. The former declares what should not happen (or should always happen), while the latter declares what should eventually happen. The vast majority of properties in the ARP system are safety ones, except the system-level functional property REQ-F2 in Table 4 and the software-level functional property HLR-13 in Table 5 .
Example 2 REQ-F2 states that at any time, if the definitions of schedulable missions are free of deadlock, a deadlock shall not occur due to the ARP. HLR-13 states that the ARP shall ensure that the schedulable mission is completed within worst case mission time (WCMT).
HLR-13 is a bounded liveness property because an overapproximated WCMT can be used as the upper bound of checking. Hence, it is a good candidate for BMC. If no counterexample 16 is found before the time bound, the property is valid. In the case of HLR-13, a counterexample is easily produced using BMC. A precondition of HLR-13 is REQ-F2, because a rover may not complete its mission when deadlocks occur. The validation of REQ-F2 requires that missions are schedulable; otherwise, it is possible that deadlocks occur, and HLR-13 fails. As we cannot check these two properties considering the actual mission schedules, we use BMC to generate test case scenarios containing deadlocks due to unschedulable missions. These test cases can be used later to check the verification tool of mission schedules.
To explain the generation of deadlock scenarios, we consider a system with two rovers. REQ-F2 is satisfied if the property expressed in Eq. 7 is false, where rovers r_i and r_j are both stopped, r_i requests waypoint p_j while p_i is reserved by r_i, and r_j requests waypoint p_i while p_j is reserved by r_j. Both rovers thus wait for a locked resource.
We launch BMC for this property for some time length, and test case scenarios are extracted from the generated counterexamples.
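The deadlock condition described above for two rovers can be written as a plain predicate. The Python sketch below is illustrative only: the Rover record, its field names, and the waypoint labels are hypothetical, and the actual property of Eq. 7 is expressed in HLL on the design variables.

```python
from dataclasses import dataclass, field


@dataclass
class Rover:
    name: str
    stopped: bool
    requested: str                               # waypoint the rover asks for
    reserved: set = field(default_factory=set)   # waypoints it currently holds


def mutual_deadlock(ri, rj):
    """Both rovers are stopped and each requests a waypoint held by the other."""
    return (ri.stopped and rj.stopped
            and ri.requested in rj.reserved
            and rj.requested in ri.reserved)


# Hypothetical scenario: R1 holds p1 and wants p2, R2 holds p2 and wants p1.
r1 = Rover("R1", stopped=True, requested="p2", reserved={"p1"})
r2 = Rover("R2", stopped=True, requested="p1", reserved={"p2"})
```

A BMC counterexample to the negation of such a predicate is a trace that ends in a state where it holds, which is exactly a deadlock test scenario.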
Safety property and map data validation
Once the design is delivered to the verification engineer, it is his/her duty to express and verify the properties. S/He might have several ways to express one property. Some safety properties can hardly be verified by induction or BMC due to, for example, the complexity of calculation. In that case, we may benefit from a divide-and-conquer strategy by decomposing the property into a set of simpler ones, even static ones that depend only on static data, e.g., the map data. Take REQ-SAF-1 in Table 3 as an example.
Example 3 REQ-SAF-1 states that at any time, the minimal distance between the centers of two rovers shall be greater than or equal to 0.4 m.
This property can be verified by calculating the distance between two rovers at any time and then checking its value; unfortunately, this solution is expensive due to the nonlinear floating-point arithmetic. To alleviate this problem, REQ-SAF-1 is split into another safety property about the reservation of waypoints (REQ-SAF-1-1) and a set of properties about the map data (IR-F1, IR-F2, and IR-F3) in Table 3. REQ-SAF-1-1 is proved by k-induction using a process similar to the one described in Sect. 5.2. IR-F1, IR-F2, and IR-F3 are requirements about the length of a segment, the distance between a waypoint and a segment, and the absence of intersection between segments. In this work, the map data are modeled in Lustre, in the same way as the software. In fact, these static requirements on the map data could easily be checked using a dedicated verification program. However, when these map properties are used as sub-properties of the safety property REQ-SAF-1, they need in any case to be re-verified in the Lustre model. Our work uses a single tool chain for the data validation. This approach allows the users to reuse the properties expressed on the map data for the verification of the software.
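To make the two kinds of check concrete, the Python sketch below shows the direct distance test of REQ-SAF-1 and one static geometric computation of the sort a map-data requirement relies on (the distance between a waypoint and a segment). It is an illustration under assumptions: the function names and the 2D coordinate representation are hypothetical, and the exact statements and thresholds of IR-F1 to IR-F3 are those of Table 3, not reproduced here.

```python
import math

MIN_CENTER_DISTANCE = 0.4  # metres, from REQ-SAF-1


def centers_far_enough(p, q, d_min=MIN_CENTER_DISTANCE):
    """Direct check: Euclidean distance between two rover centers (x, y)."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) >= d_min


def point_segment_distance(p, a, b):
    """Distance from waypoint p to the segment [a, b], all given as (x, y)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Projection of p on the supporting line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))
```

The first function involves a square root over values that change at every cycle, which is what makes the direct proof costly; the second operates only on constant map data, so it is evaluated once per waypoint and segment pair.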
Property verification results
The safety, functional, and performance properties of ARP are formally expressed. As shown in Table 6, some safety properties can be directly proved by 0-induction or 1-induction, while some others need additional lemmas. REQ-QoS-1 is a system-level performance property. Since it is difficult to verify at system level without a software design, it is expressed as HLR-12 and verified at software level by inductive proof.
Equivalence proof between design and generated code
The property verification activities depicted in Sect. 5 demonstrate that the design model complies with the set of requirements. However, there is still a gap between the design model and the code embedded in the system. The code can be either implemented by the developer or be generated automatically from the Lustre model. In our case, we use the lus2c translator 17 to generate the C code from the Lustre model. However, as this translator is not qualified, 18 it is still unknown whether this C code satisfies the requirements.
To prove the correctness of the generated C code, two approaches are applicable. The first one follows the strategies presented in Sect. 5. We first translate the C code into the HLL model using the C translator in S3, and verify that this HLL model satisfies all requirements defined in Sect. 4. The second approach demonstrates that the code is equivalent to the design model, i.e., the same inputs generate the same outputs. This guarantees that the properties (related to inputs and outputs) satisfied by the design model will be satisfied by the code. Figure 13 presents several verification activities (A) in the process of equivalence proof: A1 generates C code from Lustre model with a qualified translator; A2 translates Lustre models into HLL models, where properties are combined and verified; A3 translates C code into HLL models, where properties are combined and verified; A4 proves that the HLL models generated from the Lustre model and the C code are equivalent; A5 proves that the LLL models generated from the Lustre model and the C code (through the HLL model) are equivalent. Based on different development contexts and the activities in Fig. 13 , we summarize a set of possible strategies (S) for the verification of the C code as follows, in which the strategy S2b is actually taken for the ARP development.
- S1: The code generator is qualified as a development tool at the same level as the application. Properties are verified on the Lustre model (A2). Thanks to the qualified translation (A1), properties are preserved in the generated C code.
- S2: The code generator is not qualified at the same level as the application.
  - S2a: Properties are directly verified on the C code (A3).
  - S2b: Properties are verified on the Lustre model (A2). The C code is proved to be equivalent to the Lustre model (A4 or A5). Thanks to the equivalence proof, properties are preserved by the C code.
    • S2b1: The equivalence is proved at HLL level (A4).
    • S2b2: The equivalence is proved at LLL level (A5).
  - S2c: Properties are verified at both Lustre and C code level (A2 and A3).
In our case study, we have proved the equivalence between the Lustre model (including the map data) and the generated C code using the strategy S2b. The reasons for choosing this strategy are explained as follows:
1. The C code generator lus2c is not qualified (rules out S1).
2. It is reasonable to assume that only a subset of the requirements will be formally expressed and verified. One will probably use other, more classical approaches, such as testing. The cost of testing increases as the abstraction level decreases; thus, testing is less expensive at Lustre level than at C level (rules out S2a and S2c).
3. Specific formal verification techniques can be applied to Lustre thanks to its abstract semantics, which is lost once the C code is generated. This implies that proving properties at Lustre level is simpler than at C level (rules out S2a and S2c).
4. S2b supports two complementary approaches to equivalence proof, S2b1 and S2b2. S2b1 allows debugging counterexamples at the HLL level (as counterexamples can be simulated at this level), but might need additional lemmas in most cases. Manual search for lemmas is possible, but it is easier to introduce advanced methods that automatically search for and add the necessary lemmas at bit level (LLL level), such as the speculation techniques in S3. However, once counterexamples are met at the LLL level, it is more difficult to exploit them and debug the design than at the HLL level. For these reasons, S2b2 is performed first; if a property is falsifiable or indeterminate, S2b1 is then applied to analyze the (step-)counterexample (keeps S2b).
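The notion of equivalence used in this section (the same inputs generate the same outputs) can be illustrated with a miter-style comparison of two step functions. The Python sketch below is a bounded, exhaustive check on a toy rising-edge detector; it is not the inductive equivalence proof performed by S3, and model_step, code_step, and first_mismatch are hypothetical names.

```python
from itertools import product


def model_step(prev, inp):
    """Reference model: rising-edge detector, returns (new state, output)."""
    return inp, (inp and not prev)


def code_step(prev, inp):
    """Stand-in for generated code: same function via integer arithmetic."""
    return inp, (int(inp) - int(prev)) == 1


def buggy_step(prev, inp):
    """Faulty variant that also fires on falling edges."""
    return inp, inp != prev


def first_mismatch(step_a, step_b, init_a, init_b, max_len):
    """Feed identical Boolean input sequences to both machines.

    Returns the first input sequence on which some output differs,
    or None if the outputs agree on all sequences up to max_len.
    """
    for n in range(1, max_len + 1):
        for inputs in product((False, True), repeat=n):
            sa, sb = init_a, init_b
            for i in inputs:
                sa, oa = step_a(sa, i)
                sb, ob = step_b(sb, i)
                if oa != ob:
                    return inputs
    return None
```

On this toy example the two correct versions agree on every sequence up to the bound, while the faulty variant is distinguished by the input sequence (True, False); such a distinguishing sequence plays the role of the counterexample analyzed at HLL level in S2b1.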
Discussion
In this part, we provide more details and discuss some lessons learnt from the experiments.
Proof of generated code
The strategies of equivalence proof discussed in Sect. 6 have pros and cons. One can select appropriate strategies under the development contexts and the available resources.
- S1 requires a qualified code generator. This was not an option in our case, but this is the usual strategy in the domain of safety critical applications where the cost of a failure largely exceeds the cost of qualification. A qualified code generator saves a lot of effort, but is very expensive.
- S2a and S2c require properties and lemmas to be expressed and verified at code level. As the code is less abstract and more complex than the Lustre model, property verification requires more effort.
- S2c seems redundant, as property proofs are performed at both Lustre and C level. However, it might be useful to determine the origin of an error: a property satisfied in Lustre but falsifiable in C probably reveals an error during translation.
- S2b is "S1 without qualified generator." The equivalence proof between the Lustre model and the C code ensures that the generated C code implements exactly the properties expressed in Lustre. S2b does not need an expensive qualified generator, but needs more effort to carry out the equivalence proof. Each time the Lustre model is modified and new code is generated, the equivalence needs to be re-proved.
Fig. 13 Activities in the process of equivalence proof
In the industrial context, a frequently asked question concerns the effort put into the qualification of tools. A Lustre-to-C generator needs to be qualified in a certification process, and one solution is to prove the C code against the Lustre model. In a similar way, a C-to-HLL translator would also require qualification, as would the whole equivalence checking tool. The effort for this might be even larger. The methodology proposed in S3 is nevertheless a globally cost-effective solution for tool qualification, for the following two main reasons.
1. In fact, three distinct objectives are required for the tool chain: (a) to verify the Lustre design model; (b) to verify the C program; and (c) to facilitate the qualification of the Lustre-to-C translator. In current industrial development, according to the tool qualification document DO-330 [35], the qualification effort for tools used in the V-upside phase (at most TQL-4) is much less than that for tools used in the V-downside phase (at least TQL-3/4). Among the above three objectives, (a) and (b) are required in the V-upside phase, while (c) is required in the V-downside phase. Therefore, the qualification of the C-to-HLL translator and the equivalence checking tools is much less expensive than that of the Lustre-to-C translator.
2. In addition, the equivalence checking saves effort in the qualification of the Lustre-to-C generator, compared to the classic testing approach.
Proof-driven design guidance
The formal verification of a system could fail because of the complexity of the system, the lack of complete requirements to support formal verification, etc. For instance, in Sect. 5.2, HLR-06-1 is proved by k-induction after searching for and adding a lemma. If the verification engineer does not have complete or deep knowledge of the design, s/he reports a scenario that contains the step-counterexample to the designer. If necessary, the designer may then decide to add a complementary requirement derived from this lemma to the initial specification, in order to reduce the cost of subsequent verification. The other way round, the verification engineer may ask the designer to state as many detailed requirements as possible about the system. These properties may be written as comments and/or assertions to be checked at runtime. Sometimes, a lemma may not be provable from the initial hypotheses. This may happen when some environment hypotheses have been taken for granted by the designer without ever being explicitly expressed. This case can be handled either by a modification of the requirements to make the hypothesis explicit or by a modification of the design to make it independent of these hypotheses.
Conclusion and perspectives
The contributions of this work are twofold. On the one hand, we have designed and implemented a library of floating-point arithmetic (FPA-Lib) in S3 following the IEEE FPA Standard, and suitably optimized the library for the expander (HLL-to-LLL translator). On the other hand, we present a typical formal development process using S3 on an ARP system deployed on a three-wheeled robot, which is representative of safety critical embedded systems. The work shows how multiple formal verification techniques (inductive proof, BMC, test cases generation, and equivalence proof) can be integrated to verify an actual system with an industrial grade toolset. Some significant activities in the development process have been addressed, from the specification and design to the formal verification. Focus has been placed on the verification of safety and functional properties and on the equivalence proof between the design model and the generated code. We have drawn some lessons about the equivalence proof and the proof-driven design guidance from this experiment. This verification process is classic when the proof of properties is based on SAT/SMT solvers. The main effort lies in searching for lemmas for the property proof using k-induction. This requires a good understanding of the proof techniques. As our verification tool provides step-counterexample feedback, the debugging process can be seen as engineering work. This case study is built on the Lustre modeling language and the S3 toolset. A similar property proof process can be applied to other modeling languages and SAT/SMT tools. This work provides guidance to engineers who need to work on a formal development process, including the proof work.
In the future, we aim to investigate how well this process works on a wider range of industrial designs from the automotive, aeronautic, industry, and energy domains, among others, and to establish a benchmark (or reuse existing SMT benchmarks) to evaluate the performance of floating-point verification by S3. The bit-level reasoning method indeed has potential performance limitations compared to other, more abstract forms of reasoning. One conclusion of our work may be of wider interest: during our optimization, we found that if we are sure that the backend is bit-blasting-only, we can more easily alter the implementation of FPA guided by some simple but dependable metrics (e.g., state). In our experience, whenever we managed to decrease these metrics, the solving performance improved systematically, regardless of the problem type. In other words, it seems that there is a trade-off between "bit-blasting-only + state-oriented-reduction" and "bit-blasting + adequate abstraction + appropriate mixed strategy." So far, according to our experience, the first one is easier to handle. Our future work is to investigate how to improve the bit-blasting method by introducing abstract forms of reasoning, using feedback information to guide the abstraction.
