Jorgensen, Magne

Mair, Carolyn

Shepperd, Martin

English

Solent Electronic Archive

	
	

 	!∀
#∃%&∋()∗
+,−.
(	#/001		∀120−+340
	
	
	
	
			
	
	
An Analysis of Data Sets Used to Train and Validate Cost Prediction Systems
Carolyn Mair and Martin Shepperd
Bournemouth University, UK
{cmair, mshepper}@bmth.ac.uk
Magne Jørgensen
Simula Labs., Norway
magnej@simula.no
Abstract
OBJECTIVE - the aim of this investigation is to build up
a picture of the nature and type of data sets being used to
develop and evaluate different software project effort pre-
diction systems. We believe this to be important since there
is a growing body of published work that seeks to assess
different prediction approaches. Unfortunately, results – to
date – are rather inconsistent so we are interested in the ex-
tent to which this might be explained by different data sets.
METHOD - we performed an exhaustive search from 1980
onwards from three software engineering journals for re-
search papers that used project data sets to compare cost
prediction systems.
RESULTS - this identified a total of 50 papers that used, one
or more times, a total of 74 unique project data sets. We ob-
served that some of the better known and publicly accessi-
ble data sets were used repeatedly making them potentially
disproportionately influential. Such data sets also tend to
be amongst the oldest with potential problems of obsoles-
cence. We also note that only about 70% of all data sets
are in the public domain and this can be particularly prob-
lematic when the data set description is incomplete or lim-
ited. Finally, extracting relevant information from research
papers has been time consuming due to different styles of
presentation and levels of contextual information.
CONCLUSIONS - we believe there are two lessons to learn.
First, the community needs to consider the quality and ap-
propriateness of the data set being utilised; not all data sets
are equal. Second, we need to assess the way results are
presented in order to facilitate meta-analysis and whether a
standard protocol would be appropriate.
1. Introduction
The problem of how to generate useful software cost¹ predictions
at an early stage in a project has been the subject of a
considerable amount of research since the pioneering work of
Benington [1] almost 50 years ago. Subsequently, researchers such
as Kitchenham [8] and Kemerer [5] identified the need for
empirical validation of the different, and in many senses
competing, prediction systems that were being proposed. This has
led to some hundreds of studies that have used different (usually
industrially derived) data sets in order to conduct comparative
empirical studies of the relative performance of different cost
prediction systems. For review articles see [2, 4].
¹ Strictly speaking we mean effort prediction, since the
non-labour costs tend to be ignored in this type of research;
however, cost is the more commonly used term.
Whilst it is clearly a positive development that cost esti-
mation researchers are active in empirically evaluating pre-
diction systems, this has resulted in a number of new prob-
lems. On the whole results have tended to be inconclusive in
the sense that study A using data set B finds prediction sys-
tem X is to be preferred to prediction system Y, whilst study
C using data set D finds the reverse. Potential explanations
include use of different evaluation procedures and accuracy
indicators [7] which can lead to rank reversal problems. An-
other, probably more significant area lies in the use of dif-
ferent data sets and their influence upon prediction system
performance [10]. This is the motivation for this paper. We
wish to investigate the nature and type of data sets being
used to develop and evaluate different software project ef-
fort prediction systems. This could prove useful for future
researchers considering how best to evaluate cost prediction
systems. It is also a foundation for meta-analysis when re-
searchers seek to systematically combine results from more
than one study.
The remainder of this paper is organised as follows. The
next section sets out the method of how we identified the re-
search papers for our analysis. We then present our findings
both by data set and by research study. We then conclude by
considering the implications of these results for future em-
pirical research studies and for those endeavouring to per-
form meta-analyses.
2. Method
In order to perform the analysis of data sets used to train and
validate cost prediction systems, we defined the following
inclusion criteria:
1. the papers were concerned with software cost estima-
tion, and not, for example, size or productivity estima-
tion;
2. the data set(s) were used to evaluate prediction systems
(including expert judgement);
3. the data were ‘real’, not simulated;
4. each dataset comprised at least 2 projects (this ex-
cluded case studies).
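The four inclusion criteria above can be read as a simple predicate applied to each candidate use of a data set in a paper. The following is a minimal sketch only: the `CandidateUse` record and its field names are hypothetical and are not taken from the study's actual catalogue.

```python
from dataclasses import dataclass


@dataclass
class CandidateUse:
    """One use of a data set in a paper (field names are hypothetical)."""
    topic: str                   # e.g. "cost", "size", "productivity"
    evaluates_predictions: bool  # used to evaluate prediction systems?
    simulated: bool              # True if the data are simulated, not 'real'
    n_projects: int              # number of projects in the data set


def meets_inclusion_criteria(use: CandidateUse) -> bool:
    """Apply the four inclusion criteria in the order they are listed."""
    return (
        use.topic == "cost"            # 1. cost (effort) estimation only
        and use.evaluates_predictions  # 2. used to evaluate prediction systems
        and not use.simulated          # 3. real, not simulated, data
        and use.n_projects >= 2        # 4. at least 2 projects
    )


if __name__ == "__main__":
    print(meets_inclusion_criteria(CandidateUse("cost", True, False, 15)))
    print(meets_inclusion_criteria(CandidateUse("cost", True, False, 1)))
```

A single-project case study fails the fourth test, which matches the exclusion of case studies noted in the list.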
Given the size of the literature we decided to adopt a sam-
pling procedure. We decided to focus upon journals since
one would expect more mature and heavily refereed re-
search studies to be published in such outlets. Results
from this search identified three journals as those which
had most prolifically published relevant papers according
to our criteria over the past 25 years. The selected journals
were Information & Software Technology (IST), the Jour-
nal of Systems & Software (JSS) and IEEE Transactions
on Software Engineering (TSE) all of which have featured
in other software engineering literature reviews, e.g. Glass
and Chen [3]. Empirical Software Engineering (ESE) was
not included since it is not presently included within the
Thomson-ISI Scientific Citation Index.
The search began by using a personal informal bibliographic
database², and continued using the Web of Knowledge
(wok.mimas.ac.uk/), ScienceDirect (sciencedirect.com), IEEE
Xplore (ieeexplore.ieee.org) and Google (google.co.uk), using the
search terms ‘cost’, ‘estimation’ and ‘effort’ within the three
selected journals.
Details from each paper were catalogued according to
information availability within each journal paper. For each
paper we identified those data sets that were utilised. And
for each data set we collected the following:
• data set name
• version (if any)
• public availability
• contact person (useful for resolving queries concerning
the data set)
• start and completion date
• nationality
• number of organisations
• application domain (business sector)
• number of projects
• project type (new or enhancement or mixed)
• number of features
• presence of missing values
² The database formed part of Magne Jørgensen’s (Simula Labs,
Norway) BEST project.
Publicly available   Count      %
Yes (Y)                 54   73.0
No (N)                   6    8.1
Unknown (?)             14   18.9
Table 1. Software Project Cost Data Sets in
the Public Domain
Additional information was also collected, since this pilot is
in fact part of a larger study to conduct a meta-analysis of all
empirical cost prediction results; however, this is beyond the
scope of this paper. We also note that this exercise was far
from straightforward and often involved reference to other
papers, analysis of the data directly (when available) and
discussions with those responsible for collecting the data.
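The attributes collected for each data set could be held in a record along the following lines. This is an illustrative sketch rather than the catalogue format actually used in the study: the class name, field names and types are assumptions, and `None` is used here to stand for "not reported".

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DataSetRecord:
    """Catalogue entry for one data set (names and types are assumptions)."""
    name: str
    version: Optional[str] = None
    publicly_available: Optional[bool] = None  # None = unclear ("?")
    contact_person: Optional[str] = None
    start_date: Optional[str] = None
    completion_date: Optional[str] = None
    nationality: Optional[str] = None
    n_organisations: Optional[int] = None
    application_domain: Optional[str] = None
    n_projects: Optional[int] = None
    project_type: Optional[str] = None         # "new", "enhancement" or "mixed"
    n_features: Optional[int] = None
    has_missing_values: Optional[bool] = None


def unreported_fields(record: DataSetRecord) -> List[str]:
    """List the attributes a paper left unstated for this data set."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


if __name__ == "__main__":
    # "Example-A" is a made-up data set used only to exercise the record.
    example = DataSetRecord(name="Example-A", n_projects=20, n_features=8)
    print(unreported_fields(example))
```

Recording missing context explicitly, rather than omitting the field, makes it straightforward to count how often basic information such as nationality or availability is not reported.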
3. Findings
Next we consider our findings, first in terms of the data sets
(some of which are used more than once) and then in terms
of research studies, many of which use more than one data
set; i.e. there is a many-to-many relationship.
3.1 Data Sets
As indicated our search for empirical studies from the three
journals identified a total of 74 distinct data sets, though
many of them were used more than once. Of these data
sets, just over 70% are in the public domain (see Table 1).
In some cases, particularly for older data sets, we were un-
clear whether the data is available. Overall, something over
a quarter of data sets used are not easily available which
has clear implications for replication and transparency. It is
something of a moot point as to whether studies using con-
fidential data should be published since software develop-
ment organisations are subject to commercial pressures and
we do not wish to hinder the flow of data made available for
research. One possibility is, of course, the use of sanitisa-
tion procedures though this is at the expense of making the
research context less precise and the resultant danger that
data is used inappropriately.
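As a check on the arithmetic, the percentages in Table 1 and the "something over a quarter" figure can be reproduced from the counts. The sketch below uses only the counts reported in Table 1; the function and variable names are illustrative.

```python
def percentages(counts):
    """Return each group's share of the total as a percentage (one decimal)."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


# Counts from Table 1: Y = public, N = not public, ? = unclear.
table1 = {"Y": 54, "N": 6, "?": 14}

shares = percentages(table1)

# Data sets that are not easily available: not public plus unclear.
not_easily_available = table1["N"] + table1["?"]
not_easily_available_share = round(
    100 * not_easily_available / sum(table1.values()), 1
)

if __name__ == "__main__":
    print(shares)                      # {'Y': 73.0, 'N': 8.1, '?': 18.9}
    print(not_easily_available_share)  # 27.0
```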
These data sets varied in age³ from 1979 onwards (see
Figure 1). Of the 74 data sets, only 21 have exact start and end
dates detailed in any study which has used them. Some other
studies reported collection dates, often relative to publication.
Whilst better than nothing, this does not give information on
when the projects actually completed (which for some data sets
can span a considerable period of time). Of course one can also
estimate dates by simply assuming the completion date to be prior
to the publication date of the paper in which they were used.
However, this does not indicate how long prior to publication
date the projects were completed.
³ By age we mean the date of the last completed project as
opposed to when the research was actually published.
Figure 1. Project Data Sets By Age
Figure 2. Histogram of Data Set Size (Number of Projects)
It is also instructive to observe that the data sets varied
considerably in size (the number of cases or projects, see
Figure 2) and the richness of information to describe each
project (the number of features or variables, see Figure 3).
One suspects that the patterns that might be discovered and
the prediction systems evolved for a data set of 3 features
differ somewhat from those for a data set of 40+ features. Both
histograms indicate a strong tendency towards smaller data
sets. As a community, we need to consider what impact this
may have upon our results and recommendations to practitioners.
Figure 3. Histogram of Data Set Size (Number of Features)
Group                 Count      %
Single organisation      37   50.0
Multi-organisation       18   24.3
Unknown (?)              19   25.7
Table 2. Single / Multi-Organisation Data Sets
Another area that has been promoting debate recently
concerns the use of single or multi-organisation data. For
example, some large benchmark data sets such as ISBSG
contain data from many organisations whereas other data
sets contain projects from a single company only. Table 2
indicates that half of the data sets comprise projects
from a single organisation and a disturbing quarter of all
data sets fail to make this information clear at all.
Finally, we look at the country of origin of these data
sets (see Table 3). It is clear that Europe and North America
dominate; however, it is also striking that for almost 20% of
the data sets we are not even provided with what might
be regarded as quite basic information.
3.2 Research Studies
The systematic search described in the previous section
identified a total of 50 papers that used a total of 74 unique
project data sets with some data sets being used repeatedly
and some in combination.
Country            Count      %
USA                   16   21.6
UK                    12   16.2
Other European        11   14.9
Australian / NZ        7    9.5
Japanese               6    8.1
Canadian               4    5.4
Multi-national         4    5.4
Unknown (?)           14   18.9
Table 3. Software Project Cost Data Sets by
Country of Origin
Journal   Count   Dates
JSS          19   1981 - 2003
TSE          18   1987 - 2004
IST          13   1994 - 2005
Total        50
Table 4. Research Studies by Journal and
Date
Table 4 shows the distribution of papers between the
three journals identified from 1981 to present. The publication
trends are shown in Figure 4 and broadly indicate an increase
in the number of research papers that use data sets to evaluate
cost prediction systems.
Figure 4. Line Plot of Publications Over Time
Figure 5. Histogram of Frequency of Data Set Utilisation
Figure 5 shows that the majority of data sets are used
only once. This is for two reasons. First, our analysis is
limited to only three journals, so the majority of studies are
excluded. Second, and less expectedly, there are many
variants and versions of data sets. Examples are the ISBSG
and Finnish data sets that grow over time with new versions
being released, often on an annual basis. Clearly it is
important for researchers to be specific about which version they
are using. We also observed on occasions that researchers
combined two existing data sets or removed / added a small
number of data points. Moreover, there is no unambiguous
naming convention so it is possible that use of synonyms
has caused additional confusion.
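One way to reduce the confusion caused by versions and synonyms when counting how often a data set is used is to map each reported name to a canonical key before tallying. The sketch below is an assumption about how such a tally might be organised, not the procedure used in this study; the alias table and the data set names ("Alpha", "Beta") are invented for illustration.

```python
from collections import Counter

# Hypothetical synonym table: reported name (lower case) -> canonical name.
ALIASES = {
    "alpha data set": "alpha",
    "alpha projects": "alpha",
}


def canonical_key(name, version=None, aliases=ALIASES):
    """Normalise whitespace and case, resolve synonyms, keep the version."""
    cleaned = " ".join(name.split()).lower()
    return (aliases.get(cleaned, cleaned), version)


def usage_frequency(uses):
    """Count uses per (canonical name, version) pair."""
    return Counter(canonical_key(name, version) for name, version in uses)


if __name__ == "__main__":
    # Each tuple is one use of a data set in a paper: (reported name, version).
    uses = [
        ("Alpha", None),
        ("alpha  data set", None),
        ("Beta", "R8"),
        ("Beta", "R9"),
    ]
    per_data_set = usage_frequency(uses)
    # Number of data sets used once, twice, and so on (cf. Figure 5).
    histogram = Counter(per_data_set.values())
    print(per_data_set)
    print(histogram)
```

Keeping the version in the key treats successive releases of a growing data set as distinct, which is one reason a frequency histogram can be dominated by data sets used only once.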
We noted that the most heavily used data sets (COCOMO,
Desharnais, Kemerer, and Albrecht and Gaffney) are amongst
the oldest data sets, dating from the 1970s or 1980s. In one
sense this is to be expected since these data sets have had
the most opportunity for use. However, when conducting
meta-analyses or other forms of overall analysis we do need
to be somewhat cautious about their age in an industry char-
acterised by rapid change.
4. Discussion
In this study of 50 published empirical studies of cost pre-
diction systems from three software engineering journals
we have uncovered some interesting characteristics of data
sets that are used to train and evaluate software cost predic-
tion systems.
We observed that some of the better known and publicly
accessible data sets were used repeatedly making them po-
tentially disproportionately influential. Such data sets also
tend to be amongst the oldest with potential problems of
obsolescence. We also note that only about 70% of all data
sets are in the public domain and this can be particularly
problematic when the data set description is incomplete or
limited.
Data sets varied considerably in terms of size, number
of features, age, nationality, number of organisations, treat-
ment of missing data and so forth. This means we need to
be much more systematic in exploring the relation between
data set characteristics and prediction system performance.
We also need to avoid using data sets that are no longer rep-
resentative of modern software development practices and
current data collection opportunities. Since availability of
data sets is clearly a factor, we need to consider making some
of the more modern and complex data sets widely available.
For this reason initiatives such as the PROMISE repository [9]
are very welcome. Having said this, there is the danger that
more complex data sets are more easily misunderstood, so
detailed protocols and dialogue with those associated with
collection are essential.
In addition, the process of extracting relevant information
from research papers has been time consuming due
to different styles of presentation and levels of contextual
information. Again, we consider initiatives such as the
PROMISE repository [9] helpful.
A possible threat to our findings is the question of
how representative the studies that we have identified are.
Clearly it would be useful to continue this work in order to
construct a more complete picture. Nonetheless we believe
we have examined a considerable number of studies over a
period of almost 25 years from three international, refereed
and archival journals.
Overall we feel our pilot analysis highlights the need to
give very careful consideration to three issues. First, the data
sets we use are extremely varied, so we need to consider which
data sets we use for training and validation; for instance, is it
appropriate to use an old data set or to study mixed (new and
enhancement) project types? Second, given this variation,
context is important, so when publishing data sets it is
essential to provide enough contextual information to support
meaningful generalisation. Lastly, meta-analyses and
systematic reviews [6] will be greatly facilitated by the use of
standard protocols.
Acknowledgment
This work was funded by the UK Engineering and Physical
Sciences Research Council under grant GR/S45119.
References
[1] Benington, H. “Production of large computer programs,”
presented at Symp. on Advanced Computer Programs for Digital
Computers, Washington, D.C., 1956.
[2] Briand, L. and Wieczorek, I. “Resource Modeling in
Software Engineering,” in Encyclopedia of Software
Engineering, J.J. Marciniak, Ed., 2nd ed. New York:
John Wiley, 2002.
[3] Glass, R. and Chen, T.Y. “An assessment of systems and
software engineering scholars and institutions (1999-2003),”
Journal of Systems & Software, 76(1), pp91-97, 2005.
[4] Jørgensen, M. “A review of studies on expert estimation
of software development effort,” Journal of Systems &
Software, 70(1-2), pp37-60, 2004.
[5] Kemerer, C. “An empirical validation of software cost
estimation models,” Communications of the ACM, 30,
pp416-429, 1987.
[6] Kitchenham, B. “Procedures for performing systematic
reviews,” Keele University, UK, Technical Report
TR/SE-0401 - ISSN:1353-7776, July 2004.
[7] Kitchenham, B., MacDonell, S., Pickard, L. and Shepperd, M.
“What accuracy statistics really measure,”
IEE Proceedings - Software Engineering, 48, pp81-85,
2001.
[8] Kitchenham, B. and Taylor, N. “Software project
development cost estimation,” Journal of Systems & Software,
5(4), pp267-278, 1985.
[9] Sayyad Shirabad, J. and Menzies, T.J. “The PROMISE
Repository of Software Engineering Databases,”
School of Information Technology and Engineering,
University of Ottawa, Canada. Available:
http://promise.site.uottawa.ca/SERepository [Last
accessed 21 February, 2005].
[10] Shepperd, M. and Kadoda, G. “Using simulation to
evaluate prediction techniques,” IEEE Trans. on Softw.
Eng., 27(11), pp987-998, 2001.


http://ssudl.solent.ac.uk/1385/1/An_Analysis_of_Data_Sets_Used_to_Train_and_Validate_Cost_Prediction_Systems.pdf
