Abstract. Designing highly reliable embedded software is a challenge, and several approaches are known to improve the reliability of such software. However, every approach has its advantages and disadvantages, which makes empirical evaluations of their potentials necessary. In this paper, different approaches to software reliability improvement for embedded systems are compared on the basis of experiments conducted at our institute. The first approach is an instance of N-version programming based on forced diversity. Two fundamentally diverse hardware platforms (microcontroller and CPLD/FPGA) were used to force diversity. Another experiment was conducted in which participants designed their software on one hardware platform only. The second half of this experiment was used for review and testing. Based on our experiments, the potentials of our application of N-version programming, review and testing are compared with respect to different fault categories (specification, implementation, application) identified during evaluation.
Introduction
Designing highly reliable embedded systems is a challenge. Many developers see the greatest challenge in the software part of embedded systems [Sto96]. In the domain of embedded systems, more and more systems require certain levels of reliability, for example future drive-by-wire systems in the automotive industry. These embedded systems are subject to strong development constraints. Aspects such as low cost, variant management, and (hard) real-time requirements have to be considered. In this context, applications of specific software reliability improvement measures are needed. In this report, we will have a closer look at the application of three important approaches, namely N-version programming (NVP), testing and review.
The approach of NVP, first introduced in [CA77], seems very promising, but the well-known experiment of Knight and Leveson [KL86] showed that developers tend to make the same faults. Different approaches modeling this dependency structure (e.g. [LM89]) and corresponding empirical studies [BBvdM04, CL04] are known which allow certain (model-based) predictions of failure probabilities in NVP systems. Other approaches try to decrease the dependencies between the different software versions. One of these approaches is "forced diversity", introduced in [LM89] with an empirical evaluation in [LH93]. The basic assumption is that different development methodologies lead to diversity in decisions and thus to diversity in the behavior of the resulting product.
However, recent publications [LPS99, SWK06] show that even these improved approaches can lead to undesired dependencies between the diverse software versions. The approach of diverse NVP used for the evaluation in this report is based on different hardware platforms used in today's embedded systems. Besides classical platforms such as microcontrollers (MCU), programmable logic devices (PLD) such as CPLDs and FPGAs nowadays offer interesting alternatives for several applications, alternatives which were not available before at reasonable conditions. Designing a system on the basis of PLDs differs considerably from implementing the same task on MCUs, which offers possibilities to force a certain amount of diversity. Therefore, the effect of this additional diversity on NVP has been analyzed with the help of experiments conducted at our institute, which are described in Section 2.
Testing, as the second approach investigated, can be applied at different stages in the design cycle and is currently one of the most important means of verification in embedded systems [PvSK90]. The disadvantage of all testing approaches is that in typical applications not every possible input combination can be checked. Embedded systems are often real-time systems, which complicates the testing process further. Not only the values and the timing of the actual inputs but also previous inputs (values and timing) can affect the outputs, increasing the number of test cases needed. Additionally, at least the system test has to be done at hardware interfaces, which makes particular test interfaces necessary. Special approaches such as design for testability [VWVW96] try to overcome these problems; however, real-time properties are still a challenge for many testing approaches. The evaluation in this report applies black box testing with an automatic test environment designed for this purpose and focuses on the general potentials of testing (see Section 2).
Different approaches of reviews are well known [Fag76, PvSK90] and applied in industry. The idea is to inspect code, written by another person, to reveal problems in the code (insufficient documentation, bad code structure, potential risks as division by zero, undesired overflows, etc.) and to identify inconsistencies with the specification. Reviews offer chances to identify problems with respect to functional requirements, but also with respect to non-functional requirements as maintainability and reliability. However, the result of each review process depends a lot on the persons used as reviewers and their review performance is hard to quantify. If real-time properties have to be considered, as often necessary in embedded systems, additional challenges have to be met in the review process. Especially the time, which is needed for the execution of sequential code, is hard to determine not to speak of program parts which can be disrupted by interrupts at any time. In this report we investigate the potentials of code inspection. Details of the review-technique used can be found in Section 2.
The option of using as many of these approaches as possible seems promising but results in high costs (development time and human resources). Since in many embedded applications reasonable costs of the resulting products have to be achieved (as in the mentioned automotive domain), evaluations are needed which investigate the potentials of different approaches and their combinations to prevent failures. Such an investigation is done for example in [LPS00, PSL00], in which an NVP approach is compared with one "good version" based on a reliability growth model. According to these results, both approaches could lead to better results. To achieve further results with respect to these problems of software reliability, we conducted controlled experiments based on an automotive real-time application, investigating applications of diverse hardware NVP, review and testing, and evaluated the potentials for reliability improvement of these different approaches.
The remainder of this report is structured as follows: In Section 2, the designs of the experiments used for the comparative analyses are presented, while Section 3 presents the results of these experiments. Based on these results, Section 4 categorizes the faults handled by the different approaches and discusses the potentials of the approaches. Finally, we discuss the validity of the results in Section 5 and conclude in Section 6.
Design of Experiments
In this section, the experiments conducted at our institute are presented briefly. They were used to obtain the empirical data needed for the comparative evaluation of the different approaches to reliability improvement carried out in later sections of this report.
NVP-Experiments
A first experiment with respect to diverse hardware NVP has been conducted at our institute and has been described in [SWK06] . In order to validate the results and to investigate additional aspects, the experiment was replicated in a modified form. This second experiment took place in winter term 2005/2006 with 24 computer science students (5th semester or higher) forming 12 groups. In this experiment, MCUs and FPGAs were used as diverse hardware platforms and each group had to program both hardware platforms starting with a platform picked randomly (6 groups started with MCUs, 6 groups started with FPGAs). While the task of the first experiment was mainly the frequency measurement of four independent speed signals and a communication via CAN bus (see [SWK05, SWK06] for details), the task of the second experiment had been extended with 6 additional tasks in order to increase the complexity of the application. Those tasks contained the generation and integration of additional information in the CAN message (mark old values, identify certain scenarios) and interaction with the user via three buttons (BTNA, BTNB, BTNC) resulting in an output of certain process values on LEDs in real-time (BTNA) or additional CAN messages (BTNB: a test message including a test counter which had to be incremented with each test message sent, BTNC: a message containing peak values of the input signals). For more information, see also device under test in Fig. 1 .
Test&Review Experiment
The aim of this experiment was to identify which types of faults could be identified by review and by testing respectively. The experiment took place in winter term 2006/2007 during a lab course with 19 computer science students (5th semester or higher) forming 12 groups. Each group had to program the same task as used in the second NVP-experiment described above. However, only the microcontroller hardware was used for implementation, which took place in the first half of the lab course. The second half of the lab course was used for review and testing, which was organized as follows: Each group was randomly assigned one version for review and another version for testing. The versions were anonymised to avoid interaction between the implementation and the verification group, and it was assured that no group checked their own version. Half of the groups started with review while the other half started with testing in order to mask out effects of execution order and to reduce the amount of test equipment needed for the experiment. After three weeks (three appointments of 3h each) the groups changed from testing to review and vice versa. In the last two weeks, all groups had the chance to improve their own version on the basis of the review and test reports. In the following, the test and review process used in the experiment will be presented briefly.
For testing, all six test groups were equipped with their own automated test environment similar to the one used for evaluation (see Fig. 1 and Section 2.3). In this case, the functions of the microcontroller (MCU) and the FPGA were combined on a single FPGA, and the test signal generation excluded the signals for the three buttons (which had to be activated manually) to allow a simpler hardware platform for the test environment. Each group received documentation for the test environment, the machine code to be tested and an empty test report form. The following aspects had to be filled out in every test report:
- requirements tested + results
- requirements not tested + reasons
- scenarios identified which could lead to problems + results
- quality of version tested (range 1..5, subjective opinion of the reviewers)
- final remarks concerning implementation and specification

This given form of the report helped the students during test case creation and eased the later analysis of all test reports for the evaluation.
As in the case of testing, every review group received an empty review report form, a short review instruction and the source code to review. Additionally they received the corresponding documentation which should help them to understand the code if necessary. The review report form covered the same aspects as in case of the test report and an additional grading of the reviewability.
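The random assignment of versions to verification groups described above can be sketched as follows. This is only an illustration in Python and not the procedure actually used in the experiment: the function name `assign_versions` and the cyclic-shift scheme are assumptions, chosen because they guarantee that no group receives its own version and that each group reviews and tests two different versions.

```python
import random


def assign_versions(groups, rng):
    """Assign each group one foreign version to review and another to test.

    Hypothetical scheme: the groups are shuffled into a random cycle; each
    group reviews the version of its successor in the cycle and tests the
    version of the group after that. At least three groups are required.
    """
    if len(groups) < 3:
        raise ValueError("need at least three groups")
    order = list(groups)
    rng.shuffle(order)  # random cycle through all groups
    n = len(order)
    return {
        order[i]: {
            "review": order[(i + 1) % n],
            "test": order[(i + 2) % n],
        }
        for i in range(n)
    }


if __name__ == "__main__":
    # Twelve groups as in the experiment; the seed is arbitrary.
    plan = assign_versions(list(range(1, 13)), random.Random(2006))
    for group, targets in sorted(plan.items()):
        print(group, targets)
```

With this scheme every version is reviewed exactly once and tested exactly once, which matches the setup in which each group was assigned one version per verification activity.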
Experiment Evaluation
Two different kinds of tests were applied during all experiments: acceptance tests and evaluation tests.
In order to obtain versions with a certain minimum level of quality, each version had to pass an automated acceptance test. While this test consisted of 20 randomly generated frequencies in the first experiment (equal values for all inputs), a semi-automated acceptance test was used for the second and third experiment (8 different representative input combinations generated by an FPGA programmed for this purpose). If a group failed this test, the group members had the chance to improve their implementation and try another acceptance test. However, a total of 5 groups did not pass the acceptance tests. For this reason only 55 of the 60 overall versions were considered for the evaluations presented in this report (see Fig. 2 for details).
The following comprehensive evaluation test was used to determine the failures in all accepted versions. The failures identified during this evaluation test were used for the later reliability evaluation. The versions were tested by an automatic real-time test environment designed for this purpose (see Fig. 1). The basic function of the test environment was to generate the physical signals (reset, 4 individual frequencies and 3 button signals) needed as an input for the device under test (DUT) and to record the CAN output of the DUT as shown in Fig. 1. Different test cases based on the black-box approach (random values, extreme values, values representing certain scenarios and environment conditions) have been used during evaluation with an overall number of more than 54000 lines of test data. The black box approach was used, since all versions had to be treated in the same way to avoid external influences with respect to the experiment results. Further details of the evaluation and challenges resulting from real-time requirements and specific properties of embedded systems can be found in [SK07c].
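To make the idea of black-box test data more concrete, the following sketch generates stimulus vectors of the kind listed above (extreme values and random values for the four frequency inputs and the three buttons). It is not the test environment used in the experiments: the names `StimulusVector`, `extreme_vectors` and `random_vectors` as well as the frequency range are hypothetical and serve for illustration only.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StimulusVector:
    """One line of test data: four input frequencies and three button states."""
    frequencies_hz: Tuple[float, ...]
    buttons: Tuple[bool, ...]  # order: BTNA, BTNB, BTNC


def extreme_vectors(f_min: float, f_max: float) -> List[StimulusVector]:
    """Boundary cases: all inputs low, all inputs high, alternating low/high."""
    no_buttons = (False, False, False)
    return [
        StimulusVector((f_min,) * 4, no_buttons),
        StimulusVector((f_max,) * 4, no_buttons),
        StimulusVector((f_min, f_max, f_min, f_max), no_buttons),
    ]


def random_vectors(count: int, f_min: float, f_max: float,
                   rng: random.Random) -> List[StimulusVector]:
    """Random frequencies within [f_min, f_max] and random button states."""
    vectors = []
    for _ in range(count):
        freqs = tuple(rng.uniform(f_min, f_max) for _ in range(4))
        buttons = tuple(rng.random() < 0.5 for _ in range(3))
        vectors.append(StimulusVector(freqs, buttons))
    return vectors


if __name__ == "__main__":
    # The range 1 Hz to 1 kHz is an assumption, not taken from the specification.
    rng = random.Random(0)
    for vector in extreme_vectors(1.0, 1000.0) + random_vectors(3, 1.0, 1000.0, rng):
        print(vector)
```

Because the generator does not look at any implementation detail of the device under test, the same data can be applied to MCU and PLD versions alike, which is the property the black box approach was chosen for.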

Experiment Results
As already mentioned before, all versions which passed the acceptance test were evaluated by using the automatic test environment described in section 2.3. The failures found during this evaluation are presented in the following.
First experiment (NVP)
The results of our first NVP experiment showed high numbers of dependent failures [SWK06]. For the analysis in this report we took a closer look at the most recent common mode failures and listed them in Fig. 2 (Exp.1). The first failure mentioned in this table was present in the first messages after reset in all MCU versions and one CPLD version. The behavior after reset was not explicitly specified (resulting in the same requirements as during runtime), which could have led to this failure. Another reason could be improper initialization methods, especially in the case of the MCUs. Owing to the parallel structure of the CPLD and the fact that fewer initializations are needed on this type of hardware (no interrupt handlers, timers, etc. have to be initialized), this specific fault occurred in only one CPLD version. Therefore, forced diversity by different hardware platforms was successful in this case.
The following failures (No. 2-4) result from not considering certain input situations like very low, very high or changing inputs. These failures occurred in many MCU and CPLD versions, however, differences can be seen between both hardware platforms. In CPLD versions the use of different measurement intervals was usually avoided (most probably because it was complicated to implement on this hardware platform). For this reason, changes of the input frequencies were handled faster with CPLDs (No. 4) coming at the cost that the accuracy in case of low input frequencies is insufficient (No. 2).
Although these failures occurred in MCUs and CPLDs with different intensity, it cannot be concluded that they guarantee diversity. Depending on the combination of versions, any failure mentioned could occur on both hardware platforms.
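The trade-off between fast response to input changes (No. 4) and accuracy at low input frequencies (No. 2) can be illustrated with a simple model. The sketch below assumes that the frequency is measured by counting complete signal periods within a fixed measurement interval (gate time); the report does not state which measurement method the versions used, and the function names and numeric values are hypothetical.

```python
import math


def resolution_hz(gate_s: float) -> float:
    """Smallest frequency step that a gate of gate_s seconds can resolve."""
    return 1.0 / gate_s


def counted_frequency(true_hz: float, gate_s: float) -> float:
    """Frequency estimate from counting complete periods within one gate."""
    periods = math.floor(true_hz * gate_s)  # partial periods are not counted
    return periods / gate_s


if __name__ == "__main__":
    # A short gate reacts quickly but cannot resolve a 3 Hz input;
    # a long gate resolves it but delivers a new value only every 2 s.
    for gate_s in (0.25, 2.0):
        print(gate_s, resolution_hz(gate_s), counted_frequency(3.0, gate_s))
```

Under this model a single short interval gives a resolution of 4 Hz and reports 0 Hz for a 3 Hz input, whereas a longer interval measures the same input correctly but reacts more slowly. Using different measurement intervals for different frequency ranges is one way to address both requirements.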
Second experiment (NVP, extended task)
A more complex task was used for the second NVP experiment as presented in Section 2. To avoid the comparatively simple errors (overflows, etc.) found in the first experiment, we applied a stricter acceptance test in this second experiment. The effect can be seen clearly in Fig. 2 (Exp.2, No. 2 and 3): overflows were detected by the acceptance test, as were most of the problems with low frequency input. As before, a high number of dependent failures occurred, as presented in Fig. 2 (Exp.2, No. 4, 5, 6). It seems that the additional functionality increased the problems of response times and accuracy, especially in FPGA versions (compared to CPLD, No. 4). Furthermore, a common mode failure could be identified in the additional tasks (No. 10): several versions did not implement the test message counter as specified and started with an incorrect value. The second failure within the test message (No. 11) was hardware dependent and occurred in FPGA versions only.
Obviously, some fault sources identified are implementation dependent (No. 1, 6-12) while others are implementation independent (No. 2, 3-5). According to our results, the NVP approach applied might be more suitable to mitigate implementation specific faults. These issues are examined more closely in Section 4.
Third experiment (Test & Review, extended task)
The failures present in the majority of the versions created in the NVP experiments seem to be found easily as soon as they have been identified once. Furthermore, we had added ambiguous statements to the specification which had not been identified by any experiment participant. For this reason we conducted a third experiment including a verification part consisting of review and testing as described above. The most common failures/problems identified during this verification part are listed in Fig. 3. A high number of the failures result from ambiguous statements in the specification (No. 1-4, Fig. 3). These problems occurred in every review and testing process and one might wonder why they were not identified by all of the groups. However, ambiguities were usually identified only if the verification group understood the specification differently than the implementation group, which occurred only in the minority of the cases.

Fig. 3. Problems identified by review and test
The problem related to fast changes of frequency values at the inputs (No. 6) has been identified by 5 of the 6 initial review groups as well as by 5 of the 6 initial test groups. These results might have influenced the second verification part, but surprisingly the problem was identified in fewer versions in the second part of the verification (only 3/6 of review groups and 4/6 of test groups identified the problem). However, it has to be mentioned that the failures resulting from this problem differ in intensity and might be easier to find in some versions. Additionally, a few review groups claimed that statements about the timing behavior of the code were not possible by review.
Further implementation faults in one or more versions have been identified (No. 5, 7, 8). Of special interest is problem No. 5, as it occurred in several versions: a test counter had to be implemented which starts with 0x00 and is incremented by 1 with every CAN message sent. In 5 of the 12 undebugged versions the test counter did not start with 0x00 but with 0x01. One reason for this failure was that the line of code for the increment was placed incorrectly (the increment took place before the counter value was read for the first time). This problem was identified in 4 of these 5 versions, mostly by review.
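The misplaced increment described above can be reproduced in a few lines. The sketch below is an illustration in Python rather than the C code written by the participants; the class names are hypothetical, and the 8-bit wrap-around is an assumption suggested by the start value 0x00 but not stated in the text.

```python
class FaultyTestCounter:
    """Increments before the value is read: the first message carries 0x01."""

    def __init__(self) -> None:
        self.value = 0x00

    def next_message_value(self) -> int:
        self.value = (self.value + 1) & 0xFF  # increment placed too early
        return self.value


class CorrectTestCounter:
    """Reads the value first and increments afterwards: the first message carries 0x00."""

    def __init__(self) -> None:
        self.value = 0x00

    def next_message_value(self) -> int:
        current = self.value
        self.value = (self.value + 1) & 0xFF  # assumed 8-bit wrap-around
        return current


if __name__ == "__main__":
    faulty, correct = FaultyTestCounter(), CorrectTestCounter()
    print([faulty.next_message_value() for _ in range(3)])
    print([correct.next_message_value() for _ in range(3)])
```

Both variants produce a counter that increases by 1 per message, so the fault is only visible in the very first message after reset. This may explain why it was found mostly by review of the code rather than by testing.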
In a last step, all experiment participants had the opportunity to improve their versions. The results are listed in the last column of Fig. 2 (Exp.3). While many failures were identified and removed, the mitigation of some failures (such as No. 4 and 5 of Fig. 2) would have required major changes in the software architecture. Owing to the limited time available for these changes, many final versions of experiment 3 still contain faults (mostly: the implementation cannot deal with fast changes of the input values within the specified time).
Fault Classification & Potentials of the Approaches
During evaluation of the NVP experiment results, it became obvious that different sources exist for the failures found. Those failure sources (faults) have been identified as follows:
- Specification specific faults: the specification was misleading or ambiguous.
- Application specific faults: application specific problems and challenges have not been understood and thus have not been handled sufficiently (e.g. forgetting to handle a certain scenario/input constellation).
- Implementation specific faults: specification and application specific problems have been identified correctly, but faults have been made during implementation (e.g. incomplete case structure).
The failures found in the NVP experiment have been analyzed with respect to these fault categories, and many implementation specific faults identified in Fig. 2 could be mitigated by diverse hardware NVP. However, even implementation specific faults such as No. 10 (test message counter, see Section 3.3) occurred in several versions developed independently on diverse hardware platforms. While it is stated in [BBvdM04] that the specification must be correct to allow successful NVP, other publications see potentials of NVP to deal with specification problems [GR01]. Some specification specific faults could be tolerated by diverse hardware NVP in our experiments if one hardware platform guided the developers towards the correct implementation, as was the case for example for failure No. 1 (Fig. 2) in the first experiment. Other specification specific faults can only be found if different teams interpret the specification differently. According to our results, different developers do interpret the specification differently, but we see no hint that the majority of the resulting versions is correct. For this reason, NVP might uncover specification problems, but redundancy concepts based on majority voting, for example a two out of three (2oo3, TMR) system, would probably not be a solution for masking the specification specific faults present in our experiment data.
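The limitation of majority voting mentioned above can be shown with a minimal 2oo3 voter. The function name `vote_2oo3` and the example values are hypothetical; the sketch only demonstrates the general voting principle and is not part of the systems built in the experiments.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote_2oo3(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by at least two of three versions, else None."""
    if len(outputs) != 3:
        raise ValueError("a 2oo3 voter needs exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None


if __name__ == "__main__":
    intended = 0x00
    # One faulty version is outvoted by two correct ones.
    print(vote_2oo3([intended, intended, 0x01]))
    # Two versions sharing the same fault outvote the correct one.
    print(vote_2oo3([0x01, 0x01, intended]))
    # Three different interpretations: no majority, the fault is at least detected.
    print(vote_2oo3([0x00, 0x01, 0x02]))
```

The voter masks a fault only if the majority of versions is correct. If two versions share the same misinterpretation of the specification, the voter delivers the wrong value with full agreement, which is the situation the experiment data suggest for specification specific faults.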
As a third fault category, we introduced application specific faults which are, besides specification specific faults, a challenging problem in NVP. Despite the immense effort put into different development processes, languages and programming styles by using completely different hardware platforms in our NVP experiments, several problems remained the same in all implementations, leading to identical wrong results in several cases. These problems (No. 2-5, Fig. 2) result from the application itself, are implementation independent and thus cannot be avoided by this approach of NVP. A solution might be functional diversity as described in [LPS99], which offers higher independence of failure behavior because different functionalities are implemented in the diverse software versions (although this approach still comprises certain risks of common mode failures [LPS99]).
In the following, the results of our application of NVP are compared with those of the third experiment, in which review and testing were applied for reliability improvement. The comparison is based on the failures presented in Fig. 2 and Fig. 3, and the results, presented qualitatively in Fig. 4, are discussed below.
As expected, diverse hardware NVP made it possible to mitigate most of the implementation specific faults (including No. 11 and 12 in Fig. 2). One exception was the implementation of the test counter (No. 10). This easy task was faulty in several versions for the reasons already described. In the case of review and testing, many, but not all, implementation specific problems were identified. For non-real-time tasks, reviews uncovered more faults, while testing was more successful for real-time functionalities. For this reason, high potentials of implementation specific fault mitigation are assigned to NVP (not the best value because of the common mode failure described), closely followed by test and review, since many implementation specific problems had not been identified by all reviewers/testers.
Application specific faults were present in the majority of all versions, even in those created on different hardware platforms. Some of these application specific faults occurred less often on the first hardware platform while others were present less often on the other platform. However, these differences were usually small (exceptions are No. 2, Exp. 1 and No. 6, Exp. 2 in Fig. 2). For this reason, only low to medium potentials are seen for diverse hardware NVP to mitigate application specific faults. On the other hand, testing and review discovered most application specific problems such as No. 6 in Fig. 3. With respect to application specific faults, testing showed a slight advantage over review, since unexpected faults were also identified during testing while reviewers typically concentrated on finding known problems in the code.
Finally, specification specific faults were a problem for all three approaches. In case of diverse hardware NVP, only few specification specific faults could be avoided as described above. In case of testing and review, all known specification problems had been identified, but in several cases only by a minority of the groups (No.1-4 in Fig. 3 ). Review seemed to have the highest potentials to reveal specification specific problems, since the specification had been analyzed closely for review, while it was used only for test case generation in the test process.
Summarizing, diverse hardware NVP, review and testing showed different potentials of fault mitigation with respect to the three categories of failure sources as depicted in Fig. 4 . While further empirical results are needed to help designers of safety critical embedded software to apply the optimal combination of reliability improvement approaches, the results of this report dealing with three important approaches are a first step in this direction. Further on, the categorization into specification specific, application specific and implementation specific faults allows a useful and systematic comparison of the different approaches.
Threats to Validity
In the following two subsections, the results presented in this report are analyzed with respect to internal and external validity. In this context, internal validity represents the correctness of the experiment itself while external validity represents portability of the results to other applications [WRH+00].
Threats to Internal Validity
With respect to the first two experiments (NVP), two threats to internal validity have to be considered. According to questionnaires handed out at the beginning and the end of all lab courses, students were more experienced in C/microcontrollers than in VHDL/FPGAs. We tried to compensate for this difference in prior knowledge with a two-day introductory course (described in [SWK05]) prior to the experiment. The different skills in the programming languages might have influenced the structure and complexity of the resulting code and thus might have influenced the results. Moreover, one compilation of VHDL code took up to 2 min while C code was compiled in a few seconds. This aspect might have influenced the development. Nevertheless, we do not consider the compilation time a threat to internal validity since it is a given difference between FPGA and MCU programming.
Although the students were asked not to exchange information about their way of reviewing and testing, such exchange could not be ruled out completely and has to be considered a threat to internal validity in the third experiment. Accordingly, the number of groups which found a specific fault might have been smaller if information exchange between the groups could have been avoided completely.
Threats to External Validity
As in any other experiment, the results could depend on the type and the complexity of the task used. The task used in our experiments was an automotive real-time application which, in our opinion, had typical properties of small to medium sized embedded applications. However, additional experiments are desirable to minimize potential dependencies between the task and the results.
The faults found by review and testing might have been influenced by the review technique used. In contrast to formalized reviews with several reviewers as for example code inspection described in [Fag76] we used only two reviewers and the guidelines for the reviewers were limited to the review report form. For testing, black box testing was applied which is the most popular testing technique for the verification of real-time systems, but other test techniques might be possible. In both cases, review and testing, it has to be considered that the students had implemented this application by themselves before and thus had a good understanding of the problems which might have eased the review and testing process.
Finally, an important point concerning external validity is the quality of participants regarding experience and development knowledge. Although this problem certainly applies to this experiment, the objective here was to show a general difference of an effect. According to [SAA+03], using students for empirical evaluations is a viable approach. In addition, in [BDW02] it is stated that no difference of programming expertise between professional and non-professional developers could be found, while [HRW00] states that at least last-year software engineering students and professionals have a comparable assessment ability.
Conclusion and Future Work
Conclusion
Experiments have been conducted at our institute investigating the potentials of applications of diverse hardware NVP, testing and reviews for embedded systems.
During the evaluation of our empirical results, three major fault categories were identified, namely implementation specific, application specific and specification specific faults. The three different approaches analyzed in this report showed different potentials with respect to these three fault categories.
During the analysis of the two NVP experiments, high numbers of dependent faults were found which resulted from application specific faults. In other words, the specification was correct and was understood by the developers, but during implementation similar or identical faults were introduced into many of the versions developed independently on different hardware platforms. The reason for these common mode failures is application specific difficulties, which exist independently of the implementation. A similar problem exists for specification specific problems. Although it is often stated that a correct specification is mandatory for the NVP approach, several ambiguities could be uncovered by NVP, or even mitigated by an implementation on one hardware platform (No. 1, Exp. 1, Fig. 2). However, our results show no hint that the majority of implementations delivers correct results with respect to the intended behavior (leading to the mentioned problem for approaches based on majority voting). For the third fault category, namely implementation specific faults, it had been expected that diverse hardware NVP would allow maximum diversity between the faults in the different versions. However, even in this case, in which the specification was correct and easy to understand and no application specific difficulties were present, at least one common mode failure was introduced in several versions (No. 10 in Fig. 2). Review and testing showed medium to high potentials with respect to all three fault categories. With respect to specification specific faults, reviews showed the highest potentials. It seems that during review the specification is read more intensely than during testing, in which the specification is used only for test case generation. In the case of application specific faults, testing showed the highest potentials. A reason might be that, while review focuses on finding known problems in the code, testing could reveal unexpected faults.
In the case of implementation specific faults, testing and review showed lower potentials for reliability improvement than our NVP approach. These implementation specific faults were often related to real-time requirements which are comparatively hard to detect by testing and review.
Finally, the results have been discussed with respect to internal and external validity. According to this discussion, it has to be mentioned that the aim of this report is to emphasize the special advantages and disadvantages of the different approaches presented in this report with respect to different sources of faults. However, additional experiments are needed to support the results presented in this report and investigations of other constructive (design processes, functional diversity, specific architectures, etc.) and analytical (specific testing approaches, model checking, etc.) reliability measures are desirable.
Summarizing, diverse hardware NVP was able to mitigate many (but not all) implementation specific faults present in our experiments while potentials to mitigate application and specification specific faults were comparatively low. Reviews were especially useful to uncover problems in the specification while testing was especially useful to detect application specific faults. To apply the results, each safety critical application has to be analyzed carefully to determine the application specific needs and potentials of fault mitigation. Based on this analysis, in combination with empirical results as those presented in this report, a combination of suitable approaches can be applied to achieve highly reliable embedded software.
Future Work
The potentials of review and testing might be different for FPGAs. Approaches of testing and of formal verification could benefit from the program structure of these devices (e.g. no problems caused by interrupts). Reviews might benefit from the possible separation of concerns in FPGAs (parallel structure) on the one hand, but interaction between several parallel units in real-time might be harder to understand on the other hand. These potentials are currently being analyzed at our institute [SK07b].
With the application of automatic code generation in the automotive industry, the idea arises of using the automatically generated code as one version in an NVP approach. The second version would be implemented manually, allowing a certain diversity (an approach with two fail-silent units programmed with diverse software to avoid a shutdown of both units due to a software fault). This approach might have advantages and should be evaluated further.
[PvSK90] David L. Parnas, John van Schouwen, and Shu Po Kwan. Evaluation of safety-critical software.
