Are found defects an indicator of software correctness? An investigation in a controlled case study
In quality assurance programs, we want indicators of software quality, especially software correctness. The number of defects found during inspection and testing is often used as the basis for indicators of software correctness. However, there is a paradox in this approach: it is the remaining defects that impact software correctness negatively, not the found ones. In order to investigate the validity of using found defects or other product or process metrics as indicators of software correctness, a controlled case study was launched. 57 sets of 10 different programs from the PSP course are assessed using acceptance test suites for each program. In the analysis, the number of defects found during the acceptance test is compared to the number of defects found during development, code size, the share of development time spent on testing, etc. It is concluded from a correlation analysis that 1) fewer defects remain in larger programs, 2) more defects remain when a larger share of development effort is spent on testing, and 3) no correlation exists between found defects and correctness. We interpret these observations as follows: 1) the smaller programs do not fulfill the expected requirements, 2) a large share of effort spent on testing indicates a "hacker" approach to software development, and 3) more research is needed to elaborate this issue.
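The correlation analysis behind conclusions 1) to 3) can be sketched in a few lines. The defect and size figures below are hypothetical stand-ins, not the study's data; they merely show the kind of computation involved:

```python
# Minimal sketch of a correlation analysis between a product metric and
# remaining defects. All numbers below are hypothetical, not the study's data.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Per-program measurements: code size (LOC) and defects found in
# acceptance testing (i.e. defects that remained after development).
loc               = [120, 150, 200, 260, 310, 400]
remaining_defects = [5, 4, 4, 3, 2, 1]

r = pearson(loc, remaining_defects)
print(f"r = {r:.2f}")  # negative r: fewer defects remain in larger programs
```

A rank correlation such as Spearman's would be the more robust choice for skewed defect counts; Pearson is used here only to keep the sketch self-contained.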
The Effectiveness of Software Diversity
This research exploits a collection of more than 2,500,000 programs, written to over 1,500 specifications (problems), from the Online Judge programming competition. This website invites people all over the world to submit programs to these specifications, and automatically checks them against a benchmark. The submitter receives feedback, the most frequent responses being "Correct" and "Wrong Answer".
This enormous collection of programs gives the opportunity to test common assumptions in software reliability engineering about software diversity, with a high confidence level and across several problem domains. Previous research has used collections of up to several dozen programs for only one problem, which is large enough for exploratory insights but not for statistical significance.
For this research, a test harness was developed which automatically extracts, compiles, runs and checks the programs against a benchmark. For most problems, this benchmark consists of 2,500 or 10,000 demands. The demands are random, or cover the entirety or a contiguous section of the demand space.
For every program the test harness calculates: (1) The output for every demand in the benchmark. (2) The failure set, i.e. the demands for which the program fails. (3) The internal software metrics: Lines of Code, Halstead Volume, and McCabe's Cyclomatic Complexity. (4) The dependability metrics: number of faults and probability of failure on demand (pfd).
The only manual intervention in the above is the selection of a correct program. The test harness calculates the following for every problem: (1) Various characteristics: the number of programs submitted to the problem, the number of authors submitting, the number of correct programs in the first attempt, the average number of attempts until a correct solution is reached, etc. (2) The grouping of programs in equivalence classes, i.e. groups of programs with the same failure set. (3) The difficulty function for the problem, i.e. the average probability of failure of the program for different demands. (4) The gain of multiple-version software diversity in a 1-out-of-2 pair as a function of the pfd of the set of programs. (5) The additional gain of language diversity in the pair. (6) Correlations between internal metrics, and between internal metrics and dependability metrics.
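Step (2), grouping programs into equivalence classes by failure set, can be sketched as follows. The program names and failure sets are hypothetical; the key idea is that a failure set, stored as an immutable set of demands, can serve directly as a dictionary key:

```python
from collections import defaultdict

# Hypothetical failure sets: for each submitted program, the set of benchmark
# demands on which it fails. Programs with identical failure sets fall into
# the same equivalence class.
failure_sets = {
    "prog_a": frozenset(),           # correct on the whole benchmark
    "prog_b": frozenset({3, 17}),
    "prog_c": frozenset({3, 17}),    # fails on exactly the same demands as prog_b
    "prog_d": frozenset({3, 17, 42}),
}

# Group programs whose failure sets are identical.
classes = defaultdict(list)
for name, fails in failure_sets.items():
    classes[fails].append(name)

for fails, members in classes.items():
    print(sorted(fails), members)
```

Here prog_b and prog_c land in one equivalence class, while the correct program forms a class of its own (the empty failure set).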
This research confirms many of the insights gained from earlier studies: (1) Software diversity is an effective means to decrease the probability of failure on demand of a 1-out-of-2 system. It decreases the probability of undetected failure on demand by on average around two orders of magnitude for reliable programs. (2) For unreliable programs the failure behaviour appears to be close to statistically independent. (3) Run-time checks reduce the probability of failure. However, the gain of run-time checks is much lower and far less consistent than that of multiple-version software diversity, i.e. it is fairly random whether a run-time check matches the faults actually made in programs. (4) Language diversity in a diverse pair provides an additional gain in the probability of undetected failure on demand, but this gain is not very high when choosing from the programming languages C, C++ and Pascal. (5) There is a very strong correlation between the internal software metrics.
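The diversity gain in point (1) can be made concrete with a toy calculation over hypothetical failure sets. On a uniform demand profile, a 1-out-of-2 pair fails on a demand only where both versions fail, so its pfd is the fraction of shared failure demands; comparing that with the product of the individual pfds shows how far the pair is from statistical independence:

```python
from itertools import combinations

N_DEMANDS = 100  # size of the hypothetical benchmark

# Hypothetical failure sets over demands 0..99, one per program version.
programs = {
    "p1": {4, 17},
    "p2": {17, 55, 80},
    "p3": {4, 99},
}

def pfd(fails):
    """Probability of failure on demand under a uniform demand profile."""
    return len(fails) / N_DEMANDS

for (a, fa), (b, fb) in combinations(programs.items(), 2):
    joint = len(fa & fb) / N_DEMANDS  # pair fails only where both versions fail
    indep = pfd(fa) * pfd(fb)         # what statistical independence would predict
    print(f"{a},{b}: pair pfd = {joint:.4f}, independent prediction = {indep:.4f}")
```

In this toy data the pairs share a failure demand, so their joint pfd exceeds the independence prediction, illustrating correlated failures; a two-orders-of-magnitude gain, as reported above, corresponds to failure sets that overlap very little.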
The programs in the Online Judge do not behave according to some common assumptions: (1) For a given specification, there is no correlation between selected internal software metrics (Lines of Code, Halstead Volume and McCabe's Cyclomatic Complexity) and dependability metrics (pfd and Number of Faults) of the programs implementing it. This result could imply that for improving dependability, it is not effective to require programmers to write programs within given bounds for these internal software metrics. (2) Run-time checks are still effective for reliable programs. Often, programmers remove run-time checks at the end of the development process, probably because the program is then deemed to be correct. However, the benefit of run-time checks does not correlate with pfd, i.e. they are as effective for reliable as for unreliable programs. They are also shown to have a negligible adverse impact on the program's dependability.