Abstract
Introduction
Testing consists of executing a program or a physical system with the intention of finding undiscovered errors. Conformance testing is a black-box approach that aims at checking that the behavior of an implementation complies with the behavior of the specification. Testing is the dominant validation activity used by industry today. However, there are two well-documented problems with the current state of the art. First, testing is very expensive; depending on the application, as much as 20% to 50% of the total development time is spent on testing. Second, despite this effort and the hard work of dedicated test engineers, a significant number of errors remain.
A potential improvement that is being examined by researchers is to provide theoretically well-founded tools that automate test generation and execution. Test generation tools are usually much faster than humans, and have the potential to be more systematic, thereby generating important tests that could easily be overlooked. This approach has experienced some level of success. However, these tools do not address real-time systems, or only provide limited support for testing the timing aspects. They often abstract away the actual time at which events are supplied or expected, or do not select these time instances thoroughly and systematically. Exhaustive testing is usually infeasible, and because a real-time system has an enormous number of time instances that could be relevant to test, it is not likely that an arbitrary or random choice of such time instances will result in good coverage. It is therefore important to make good decisions about when to deliver an input to the system, and when to expect an output. This paper presents a tool for automatic generation of timed tests from a restricted class of dense, but possibly non-deterministic, timed automata specifications. The technique is applied to a realistic case: a Philips Audio Protocol specification. The test cases are generated systematically from a coverage criterion of the specification. The state space of the specification is partitioned into coarse-grained equivalence classes which preserve essential timing and synchronization information, and a few tests for each class are generated. This approach is inspired by sequential black-box testing techniques frequently referred to as domain or partition testing [3]. We regard the clocks of a timed specification as (oddly behaving) input parameters.
Our technique guarantees that every reachable equivalence class will be covered by a set of relevant tests, and that every generated test is sound, i.e., failure to pass the test implies that the system under test is non-conforming.
Analyzing a timed specification is no easy task, and nearly impossible to do by hand, even for a moderate-size specification. We therefore employ efficient automatic symbolic reachability techniques based on constraint solving that have recently been developed for model checking of timed automata [9, 13]. The emphasis of this paper is our test generation tool and an application thereof. The underlying algorithms and testing theory, which is based on Hennessy's work [16], are described in detail in [18, 17]. Section 2 summarizes the related work. Section 3 presents the specification language, and Section 4 gives an overview of the techniques implemented in our tool, which is further described in Section 5.
Section 6 provides a first, but realistic case study. Section 7 concludes the paper and suggests future work.
Related Work
Springintveld et al. proved in [22] that exhaustive testing with respect to trace equivalence of deterministic timed automata with a dense time interpretation is theoretically possible, but highly infeasible in practice. Another result, which generates checking sequences for a discretized deterministic timed automaton, is presented by En-Nouaary et al. in [10]. Although the required discretization step size ($1/(|X|+2)$, where $|X|$ is the number of clocks) in [10] is more reasonable than that of [22], it still appears to be too small for most practical applications because too many tests are generated.
Clarke and Lee [8], like us, propose domain-based test selection for real-time systems. Although their primary goal of using testing as a means of approximating verification to reduce the state explosion problem is different from ours, their generated tests could potentially be applied to physical systems as well. Compared to timed automata, their language for specification of time requirements appears very restricted.
Castanet et al. present in [7] an approach where timed test traces can be generated from timed automata specifications. Test selection must be done manually through engineer-specified test purposes (one for each test), themselves given as deterministic acyclic timed automata. Such explicit test selection reduces the state explosion problem during test generation, but leaves a significant burden on the engineer. Our goal has been fully automatic test generation.
Cardell-Oliver and Glover showed in [6] 
Event Recording Automata
Two of the surprising undecidability results from the theoretical work on timed languages described by timed automata are that 1) a non-deterministic timed automaton cannot in general be converted into a deterministic (trace) equivalent timed automaton, and 2) trace (language) inclusion between two non-deterministic timed automata is undecidable [2]. Thus, unlike the untimed case, deterministic and non-deterministic timed automata are not equally 
Test Generation Technique

Partitioning
Since exhaustive testing is generally infeasible, it is important to systematically select and generate a limited number of tests. A test selection criterion (or coverage criterion) is a rule describing what behavior or requirements should be tested. Coverage is a metric of completeness with respect to a test selection criterion. Our stable edge set criterion partitions the state space of the specification into coarse equivalence classes, and requires that the test suite for each class makes a set of required observations of the implementation when it is expected to be in a state in that class.
The states (pairs consisting of automaton locations and clock valuations) of the automaton are partitioned such that two clock valuations belong to the same equivalence class iff they enable precisely the same edges from the set of states that the automaton can currently occupy, i.e., the states are equivalent with respect to the enabled edges. We justify this partitioning by the following observations:
• Because the enabled edges change when the specification moves from one equivalence class to another by executing an action or letting time pass, the implementation must somehow perform a matching action or an internal timeout to change the enabled edges, and it must consequently be checked that the implementation responds correctly after that internal action or timeout.
• Because the enabled edges are the same, we believe that it is reasonable to expect that the implementation treats these states uniformly. Interior and extreme clock values for the equivalence class can be used to support this hypothesis.
• The partitioning has the nice formal property that all states in the same equivalence class also satisfy the same basic Hennessy observations.
• This partitioning is based on the guards that actually occur in a specification, and is therefore much coarser than, e.g., the region partitioning, which is based on the guards that could possibly occur in an automaton according to the syntax in Definition 1.
In conclusion, it is more important to test different classes than it is to test the same one many times.
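The partitioning idea can be illustrated with a deliberately simplified sketch. It assumes a single location, a single clock, and guards given as closed intervals with integer bounds; the names `enabled`, `partition`, and the example edges are hypothetical and are not taken from the tool, which works on zones rather than on sampled values.

```python
from fractions import Fraction

def enabled(edges, x):
    """Return the set of edge names whose guard lo <= x <= hi holds."""
    return frozenset(name for name, (lo, hi) in edges.items() if lo <= x <= hi)

def partition(edges, horizon):
    """Group sample clock values in [0, horizon] by their enabled-edge set.

    Samples are taken at every integer and half-integer, which is enough
    to separate the classes when all guard bounds are integers.
    """
    classes = {}
    for k in range(2 * horizon + 1):
        x = Fraction(k, 2)
        classes.setdefault(enabled(edges, x), []).append(x)
    return classes

# Hypothetical guards: edge a is enabled for 0 <= x <= 2, edge b for 1 <= x <= 3.
edges = {"a": (0, 2), "b": (1, 3)}
classes = partition(edges, 4)
for edge_set, values in classes.items():
    print(sorted(edge_set), [str(v) for v in values])
```

Two clock values land in the same class exactly when they enable the same edges, so a test suite built on this partition needs only a few representative values per class rather than one test per time instant.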
Symbolic Analysis
Densely timed automata cannot be analyzed by enumerative finite state techniques, but must rather be analyzed symbolically [1]. We employ the so-called zone and difference bound matrix techniques [9] that have proven efficient [13] for model checking of timed automata.
A zone $z$ over clocks $x \in X$ is a constraint system consisting of conjunctions of linear inequalities on clock variables of the form

$x_i - x_j \prec c_{ij}$, $a_i \prec x_i$, and $x_i \prec b_i$,

where $\prec \in \{<, \leq\}$, $c_{ij}$, $a_i$, $b_i$ are integers including $\infty$, and $x_i, x_j \in X$.
Zones can be represented efficiently by the difference bound matrix (DBM) data structure [9]. A DBM represents clock difference constraints of the form $x_i - x_j \leq c_{ij}$ by an $(n+1) \times (n+1)$ matrix such that $c_{ij}$ equals matrix element $(i, j)$.
• Given a symbolic path to a symbolic state, a concrete trace leading to it can be computed.
To ensure soundness of the produced tests, symbolic (reachability) analysis is needed to select only states for testing that are reachable, and to compute only timed traces actually in the specification.
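A minimal sketch of the DBM idea follows. It is an illustration only, not the implementation reused from UPPAAL: it assumes non-strict bounds exclusively, uses index 0 for a reference clock fixed at zero, and all function names are hypothetical.

```python
INF = float("inf")

def new_dbm(n):
    """Unconstrained zone over n clocks; index 0 is the constant-zero clock."""
    m = [[INF] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        m[i][i] = 0
        m[0][i] = 0  # 0 - x_i <= 0, i.e. clocks are non-negative
    return m

def constrain(m, i, j, c):
    """Add the constraint x_i - x_j <= c (only non-strict bounds here)."""
    m[i][j] = min(m[i][j], c)

def canonicalize(m):
    """Tighten all bounds with Floyd-Warshall shortest paths."""
    n = len(m)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if m[i][k] + m[k][j] < m[i][j]:
                    m[i][j] = m[i][k] + m[k][j]
    return m

def is_empty(m):
    """After canonicalization, a negative diagonal entry means no solution."""
    return any(m[i][i] < 0 for i in range(len(m)))

# Zone: x1 <= 5, x2 >= 2, x2 - x1 <= 1.
z = new_dbm(2)
constrain(z, 1, 0, 5)
constrain(z, 0, 2, -2)
constrain(z, 2, 1, 1)
canonicalize(z)
print(z[2][0], -z[0][1])  # derived upper bound on x2, derived lower bound on x1
```

Canonicalization makes implied bounds explicit, which is what allows reachability analysis to compare zones and detect empty ones.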
Timed Trace Computation
When a desired target symbolic state is reached, it can be concluded that all concrete states in the symbolic target state are reachable. However, it is not ensured that every state along the path of symbolic states used to reach it will end in a state in the target symbolic state, only that some of the states traversed along the way will. Therefore, when a trace leading to the desired target is to be computed, the trace must pass only through states that can reach the desired target. It is relatively straightforward to compute the preconditions for the required subsets by back-propagating the zone constraints of the target state along the path used to reach it. Back-propagation results in a strengthened symbolic trace representing a possibly infinite set of concrete traces leading to the target.
From this set the tester can choose a specific trace by controlling when actions are offered and observed, i.e., by choosing the specific delay to wait between actions. This process is started at the initial state. The possible delays which can be chosen are defined by the strengthened symbolic states. Let D be the set of possible delays before an action. There are three immediate strategies for choosing delays:
Tool Facilities
We have implemented our approach and algorithms in a prototype tool called RTCAT.
• Choose the delay to be the largest delay in D. This tests the patience of the system, i.e., that the succeeding action is also executable at the latest required enabled time.
Of the above strategies, it seems most important to check the promptness of the system as this checks for missed deadline errors, which are common in real-time systems. But also the patience may be important, since this may detect errors where a timer times out prematurely.
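The delay strategies can be sketched as follows. This is a hedged illustration: it assumes D is a closed interval [lo, hi], takes "prompt" to mean the smallest delay in D (an assumption based on the name and on the remark about missed deadlines), and uses the midpoint as one possible interior value. The function name `choose_delay` is hypothetical.

```python
from fractions import Fraction

def choose_delay(lo, hi, strategy):
    """Pick one delay from the closed interval D = [lo, hi]."""
    if lo > hi:
        raise ValueError("empty delay interval")
    if strategy == "prompt":      # assumed: earliest possible delay
        return lo
    if strategy == "patience":    # largest delay in D, as in the text
        return hi
    if strategy == "interior":    # assumed: midpoint as an interior value
        return (lo + hi) / 2
    raise ValueError(f"unknown strategy: {strategy}")

D = (Fraction(1), Fraction(4))
delays = {s: choose_delay(*D, s) for s in ("prompt", "interior", "patience")}
for name, d in delays.items():
    print(name, d)
```

With strict zone bounds the interval would be open at one or both ends, and an extreme value would have to be approached rather than attained; the sketch ignores that case.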
Overall Algorithm
The test generation procedure, outlined in Algorithm 3, first constructs the equivalence classes and stores them in a data structure which we refer to as the equivalence class graph. It preserves all timed traces of the specification, and furthermore preserves the required synchronization information for our timed Hennessy tests. All timed Hennessy tests that the specification passes can thus be generated from this graph.
Algorithm 3 Overall Test Case Generation Algorithm:
input: ERA specification S.
output: A complete covering set of timed Hennessy tests.
Compute S, = Equivalence Class Graph(S).
Compute S, = Reachability(S,
We have also implemented a few pragmatic strategies for handling specifications whose reachability or equivalence class graphs are too large to be completely computed, stored, or tested. Construction of both graphs can also be terminated by specifying a maximum trace depth, using bit-state hashing, or both. Bit-state hashing [11] is a technique that limits the number of nodes in a graph, and is believed to result in a better (under) approximation of the state space than random exploration, which has a tendency to confine itself to small parts of the state space.
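The two termination strategies, a maximum trace depth and bit-state hashing, can be sketched together as below. This is a generic illustration of the technique, not RTCAT's code: the function `explore`, the use of CRC-32 as the hash, and the toy successor function are all assumptions.

```python
from collections import deque
import zlib

def explore(initial, successors, max_depth, table_bits=16):
    """Breadth-first exploration with a depth bound and a bit-state visited set.

    Each state is hashed to one bit in a fixed-size table. Two different
    states may collide, in which case the second one is skipped, so the
    result is an under-approximation of the reachable states.
    """
    size = 1 << table_bits
    bits = bytearray(size // 8)

    def seen_or_mark(state):
        h = zlib.crc32(repr(state).encode()) % size
        byte, mask = h // 8, 1 << (h % 8)
        if bits[byte] & mask:
            return True
        bits[byte] |= mask
        return False

    visited = []
    queue = deque([(initial, 0)])
    seen_or_mark(initial)
    while queue:
        state, depth = queue.popleft()
        visited.append(state)
        if depth == max_depth:
            continue  # depth bound reached: do not expand further
        for nxt in successors(state):
            if not seen_or_mark(nxt):
                queue.append((nxt, depth + 1))
    return visited

# Toy state space: a counter modulo 10, explored to depth 3.
result = explore(0, lambda s: [(s + 1) % 10], max_depth=3)
print(result)
```

Memory use is fixed by `table_bits` regardless of how many states exist, which is the trade-off bit-state hashing makes in exchange for possibly missing states.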
Construction Order: Both breadth first and depth first construction of the equivalence class and reachability graphs are implemented. The tests for a given equivalence class are generated the first time it is reached during forward reachability analysis. Consequently, the traversal order may affect the number and length of tests generated.
Our experience suggests that breadth first construction results in the most economical test suite in terms of length. Depth first results in slightly fewer test cases but much longer test suites, and should be used when a covering test suite is wanted that also tests the behavior after relatively long sequences of actions.
Test Structure: Tests can be constructed either as individual timed Hennessy tests (Algorithm 3) or as test trees which merge the individual tests when possible.
Trace Generation: Timed traces can be generated using prompt, interior, or patience selection as described in Section 4.3.
Extreme value selection is currently not supported, but can easily be implemented. The prototype operates in four distinct phases, each of which must be completed before the next is started: parsing and initialization, equivalence class graph construction, reachability graph construction, and finally timed trace computation and output of the test suite to a file in DOT format [12]. RTCAT comprises about 22K lines of C++ code, and is based on code from a simulator for timed automata (part of an old version of the UPPAAL toolkit [14]). Its AUTOGRAPH file format parser was reused with some minor modifications to accommodate the ERA syntax. Its DBM implementation was also reused, with some added operations for zone extrapolation and clock scaling.
A Case Study
Example 1
The ERA example in Figure 4a demonstrates that computing test cases from a timed automata specification by hand is non-trivial, even for very small specifications. For example, to compute a test that visits the edge s1 → s0, the edge s0 → s0 must be visited at least three times in succession for the guard on the b edge to become satisfiable. Furthermore, the b edge must be visited before X, equals 1 time unit; otherwise, the guard on the succeeding s1 → s0 edge is not satisfiable. The tool generates the test automaton shown in Figure 4b; its locations are labeled with the visited location of the specification, and the test verdict (p=pass, f=fail) to be given if the test execution stops in that location. A total of 12 such tests are generated to cover the specification. It is thus easy to overlook an important scenario.
Example 2: Philips Audio Protocol
The Philips Audio Protocol is a dedicated protocol for exchanging control information between audio/visual consumer electronic units. Consequently, the protocol must be simple and cheap to implement. The data is Manchester encoded, and transmitted on a shared bus implemented as a single wire. There are two interesting aspects of this protocol. One is that a certain tolerance is permitted on the timing of events to compensate for drift of hardware clocks and CPU contention. Philips permits a ±5% tolerance on all the timing, while still being able to decode the transmitted signal correctly. The second aspect is that the collisions of messages on the bus must be detected. The protocol was first studied by Bosscher et al. in [4]. It was here proven formally that the signals can be correctly decoded if tolerances are less than A. The protocol has since been studied numerous times in the context of model checking.
The goal of generating tests for the protocol is to compute a test suite that can be used to determine if a given audio component implements the Manchester encoding and collision detection correctly, and within the allowed tolerances. A station is equipped with a module for encoding and transmitting data on the bus, and a module for receiving and decoding the data. An overview of the protocol is shown in Figure 5. The sender obtains the bit stream to be transmitted via three actions: in0, in1, and empty, respectively representing a zero-bit, a one-bit, and an end of message delimiter. The sender Manchester encodes these bits, and uses the actions up and dn to drive the bus voltage high and low respectively.
The bus works as a logical or, so whenever a station drives the bus high, the bus will be high even if other stations have previously set it low. A sender can detect collision by checking that the bus is indeed low when it is itself sending a low. The isUP action is used for this purpose. If a collision is detected, the upper protocol layer is informed via the coll action.
The receiver informs the upper layer of the decoded bits via the out1, out0, and end actions. Philips uses rising edge triggering to decode the electrical signal. A rising edge is indicated to the receiver by the vUP action. To decode the signal using only rising edge triggering as required by Philips, messages must start with a logical one, and be odd in length.
Using Manchester encoding, illustrated in Figure 6, the time axis is divided into equal-sized bit slots. In every bit slot one bit can be sent. A bit slot is further halved into two intervals. A logical zero is represented by a low voltage on the wire during the first interval of a bit slot, a rising edge at half the bit slot, and high voltage during the last interval. A logical one is represented by a high during the first interval, followed by a falling edge, and a low through the last half. A bit slot in the Philips protocol is 888 µs long. In the modeling we use quarters of bit slots, denoted q, equaling 222 µs. The basic constants used in the model, and the derived tolerance levels, are summarized in Table 7.
[Table 7, partially recovered: detection 'just' before up (20 µs); detection 'around' 25% and 75% of the bit slot (22 µs); station silence (8 ms).]
The basic operating principle of the sender, shown in Figure 8, is that it inputs a new bit while encoding the current bit, i.e., it has read a bit ahead. The important states are labeled SXtoY, where X represents the bit currently being generated, and Y the bit to be generated next. Observe that whenever X and Y differ, the sender waits twice the normal duration before changing the status of the wire. The ERA for the receiver can be found in [17].
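The Manchester encoding rule described above can be sketched as a small, nominal-timing example. It ignores the ±5% tolerance and collision detection, assumes the bus is low when idle, and uses hypothetical function names; it is not the sender ERA of Figure 8.

```python
BIT_SLOT_US = 888            # bit slot length given in the text
HALF_US = BIT_SLOT_US // 2   # 444, i.e. two quarters of 222 microseconds

def manchester_levels(bits):
    """Return (time_us, level) pairs, two half-slot levels per bit."""
    levels = []
    for k, bit in enumerate(bits):
        start = k * BIT_SLOT_US
        # One: high then low. Zero: low then high.
        first, second = (1, 0) if bit == 1 else (0, 1)
        levels.append((start, first))
        levels.append((start + HALF_US, second))
    return levels

def to_events(levels, idle=0):
    """Turn the level sequence into timed 'up'/'dn' actions on the bus."""
    events, current = [], idle
    for t, level in levels:
        if level != current:
            events.append((t, "up" if level == 1 else "dn"))
            current = level
    return events

events = to_events(manchester_levels([1, 0, 0, 1]))
for t, action in events:
    print(t, action)
```

For the bit string 1001 the gap between consecutive events is one half slot when neighbouring bits are equal and a full slot when they differ, which matches the observation that the sender waits twice the normal duration in that case.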
To detect collisions the bus must, according to Philips, be sampled 'around' three specific time points, namely after a quarter of a bit slot after starting a low signal, again after three quarters (if still transmitting a low as in the one-to-zero transition), and 'just' before setting the bus high.
The generated tests are exemplified in Figure 9. Test case 1 produces the bit string '1001', and checks whether the implementation can produce this sequence, and whether it, like the specification, refuses all actions at the state and time entered thereafter (s4). If one of the offered actions is accepted, the test execution will terminate in a state with the fail verdict. Test case 2 checks whether collision detection is performed after, in this case, transmission of the single bit message '1'.
Using breadth first traversal, RTCAT generates a test suite consisting of 68 test cases with a combined length of 393 steps. Using depth first traversal, it generates 67 test cases with a combined length of 487 steps. The timed traces are generated using prompt selection. Generating these tests and writing them to a file took less than 2 seconds, and required less than 5 MB of memory in total. The resulting test suite is so small that it can easily be executed, and there is plenty of room for generating longer test suites, and further extreme values.
The case study illustrates that test cases can be generated from a real-life case, but it has also revealed a point where our current approach can be improved. For example, in our modeling of collision detection, the sender is required to be able to synchronize with the isUP action at all instances in the &g interval. This is probably not what the Philips engineers have in mind. Rather, they intend to sample the bus at some point in this interval. However, this form of timing uncertainty cannot be readily modeled in the current ERA language. It is possible to change the specification (by using a non-deterministic choice) such that the proper verdict (inconclusive) is assigned to the tests, but executing them will most likely result in a large number of inconclusive verdicts, because the action could not be observed at the chosen time.
Also it should be noted that the timing tolerances are modeled by permitting the upper protocol layer to deliver the next bit to be transmitted at some point in the "window of opportunity". The sender is therefore required to accept bits at any time within the tolerance interval. If the interface of the actual Philips components is different from this, the test cases will not be directly executable as is. An important lesson learned is that the specification model used for test generation must accurately reflect the behavior at the interface of the component to be tested.
We conclude that our technique is applicable "as is" for strictly timed embedded controllers that are deterministic with respect to time, but that it will be important to add support for timing uncertainty.
Conclusions and Future Work
We have presented a prototype tool for automatic and systematic generation of test cases, and have demonstrated its applicability for generating tests via a real-life example.
The number of tests generated, and the size of systems that can be handled, suggest that the basic technique is sound and practically relevant, but we have also identified a number of areas which can enlarge the application domain of our technique. It will be important to handle systems with timing uncertainty more effectively than is presently done.
Timing uncertainty means that an action may occur at some time in an interval, but which instant is non-deterministic. Effective support will affect our technique in two areas. First, the testing theory and algorithms need to distinguish between actions that are inputs controlled by the tester or environment, and actions that are outputs controlled by the system itself. Our modeling effort suggests that timing uncertainty is typically associated with outputs. Second, because the time instances of actions with timing uncertainty will not be known until runtime, and because this time affects when the next action in the test case is to be offered/observed, test cases will need to be symbolic. The timed trace will thus be instantiated at test execution time rather than, as presently done, at test generation time. Fortunately, both aspects appear to require only a moderate effort to incorporate, e.g., the time constraints needed to distinguish when to produce pass, fail and inconclusive verdicts are available from the computed symbolic path.
A second aspect is to generate and select test cases through manually stated test purposes, and to generate very long test cases using a guided random simulation where the probability of choosing a given delay between a pair of actions is guided by the equivalence class partitioning.
Finally, our work has focused on generating test cases. It would be very interesting to also execute them against real implementations. This will provide valuable insight into what a good communication model between tester and implementation would be in practice.
