
    Relationship between size, effort, duration and number of contributors in large FLOSS projects

    This contribution presents initial results from a study of the relationship between size, effort, duration and number of contributors in eleven evolving Free/Libre Open Source Software (FLOSS) projects, ranging from approximately 650,000 to 5,300,000 lines of code. Our initial motivation was to estimate how much effort is involved in producing a large FLOSS system. Software cost estimation for proprietary projects has been an active area of study for many years, but to our knowledge no similar research has previously been conducted on FLOSS effort estimation. This research can help in planning the evolution of future FLOSS projects and in comparing them with proprietary systems. Companies that are actively developing FLOSS may benefit from such estimates, which may also help to identify the productivity 'baseline' for evaluating improvements in process, methods and tools for FLOSS evolution.
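    The abstract does not specify a model form, but a common way to relate effort to size in studies of this kind is a COCOMO-style power law fitted on log-transformed data. The sketch below illustrates that idea only; it is not the paper's method, and the size/effort pairs are made up rather than taken from the eleven projects.

```python
# Illustrative sketch (not the paper's method): fit a power law
# effort = a * size^b by ordinary least squares on log-transformed data.
import numpy as np

# Hypothetical (size in KLOC, effort in person-months) pairs.
size_kloc = np.array([650.0, 900.0, 1500.0, 2100.0, 3000.0, 5300.0])
effort_pm = np.array([2400.0, 3100.0, 5200.0, 6900.0, 9800.0, 16500.0])

# Fit log(effort) = log(a) + b * log(size); polyfit returns [b, log(a)].
b, log_a = np.polyfit(np.log(size_kloc), np.log(effort_pm), 1)
a = np.exp(log_a)
print(f"effort ~ {a:.2f} * size^{b:.2f}")

# Effort implied by the fitted model for a hypothetical 4,000 KLOC system.
print(a * 4000.0 ** b)
```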

    Effort estimation of FLOSS projects: A study of the Linux kernel

    Empirical research on Free/Libre/Open Source Software (FLOSS) has shown that developers tend to cluster around two main roles: "core" contributors differ from "peripheral" developers in terms of a larger number of responsibilities and a higher productivity pattern. A further, cross-cutting characterization of developers could be achieved by associating developers with "time slots", and different patterns of activity and effort could be associated with such slots. Such analysis, if replicated, could be used not only to compare different FLOSS communities and to evaluate their stability and maturity, but also to determine how effort is distributed within a project over a given period, and to estimate future needs with respect to key points in the software life-cycle (e.g., major releases). This study analyses the activity patterns within the Linux kernel project, first focusing on the overall distribution of effort and activity within weeks and days, then dividing each day into three 8-hour time slots and focusing on effort and activity around major releases. These analyses aim to evaluate effort, productivity and types of activity both globally and around major releases. They enable a comparison of these releases and patterns of effort and activity with traditional software products and processes and, in turn, the identification of company-driven projects (i.e., those working mainly during office hours) among FLOSS endeavors. The results of this research show that, overall, effort within the Linux kernel community is constant (albeit at different levels) throughout the week, signalling the need for updated estimation models, different from those used in traditional 9am-5pm, Monday-to-Friday commercial companies. It also becomes evident that activity before a release differs greatly from activity after a release, and that the changes show an increase in code complexity in specific time slots (notably the late-night hours), which will later require additional maintenance effort.
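    As a rough illustration of the time-slot analysis described above (not the paper's actual pipeline), the sketch below buckets commit timestamps by weekday and by three 8-hour slots and counts the activity per bucket; the timestamps are hypothetical.

```python
# Bucket commit timestamps into weekdays and three 8-hour slots
# (00-08, 08-16, 16-24) and count activity per bucket.
from collections import Counter
from datetime import datetime

commits = [
    datetime(2011, 5, 18, 23, 40),  # late-night commit
    datetime(2011, 5, 19, 10, 5),   # office-hours commit
    datetime(2011, 5, 21, 14, 30),  # weekend commit
]

def time_slot(ts: datetime) -> str:
    """Map the hour of day to one of three 8-hour slots."""
    return ("00-08", "08-16", "16-24")[ts.hour // 8]

activity = Counter((ts.strftime("%A"), time_slot(ts)) for ts in commits)
for (day, slot), n in sorted(activity.items()):
    print(f"{day:<9} {slot}: {n} commit(s)")
```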

    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after applying the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute missing values first and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5, and that both C4.5 and k-NN are little affected by the missingness mechanism, whereas the missing data pattern and the missing data percentage have a strong negative impact upon prediction (or imputation) accuracy, particularly when the missing data percentage exceeds 40%.
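    A minimal sketch of the comparison idea, assuming scikit-learn: missing values are simulated completely at random, imputed with k-NN, and a decision tree is then trained on the completed data. Note that scikit-learn's CART regressor stands in for C4.5 here, and the dataset is synthetic rather than one of the six project databases.

```python
# k-NN imputation followed by tree-based effort/cost prediction (sketch).
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(40, 4))                           # hypothetical project features
y = X @ np.array([2.0, 0.5, 1.5, 3.0]) + rng.normal(0, 5, 40)   # hypothetical "cost"

# Simulate roughly 20% missing-completely-at-random entries.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan

# Impute with k-NN, then fit the tree on the completed data.
X_imputed = KNNImputer(n_neighbors=3).fit_transform(X_missing)
model = DecisionTreeRegressor(max_depth=4).fit(X_imputed, y)
print("training R^2 after imputation:", round(model.score(X_imputed, y), 3))
```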

    Integrate the GM(1,1) and Verhulst models to predict software stage effort

    Software effort prediction clearly plays a crucial role in software project management. In keeping with more dynamic approaches to software development, it is not sufficient to predict only the whole-project effort at an early stage. Rather, the project manager must also dynamically predict the effort of different stages or activities during the software development process. This can assist the project manager in re-estimating effort and adjusting the project plan, thus avoiding effort or schedule overruns. This paper presents a method for software physical-time stage-effort prediction based on the grey models GM(1,1) and Verhulst. The method establishes models dynamically according to particular types of stage-effort sequences, and can adapt to particular development methodologies automatically by using a novel grey feedback mechanism. We evaluate the proposed method on a large-scale real-world software engineering dataset and compare it with the linear regression method and the Kalman filter method, showing that accuracy is improved by at least 28% and 50%, respectively. The results indicate that the method can be effective and has considerable potential. We believe that stage predictions could be a useful complement to whole-project effort prediction methods. This work was supported by the National Natural Science Foundation of China and the Hi-Tech Research and Development Program of China.
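    The paper combines GM(1,1) and Verhulst models through a grey feedback mechanism; the sketch below shows only a standard GM(1,1) fit-and-forecast step on a hypothetical stage-effort sequence, to make the grey-model idea concrete.

```python
# Standard GM(1,1) grey model: accumulate the series, fit the developing
# coefficient and grey input by least squares, forecast, then difference back.
import numpy as np

def gm11_forecast(x0, steps=1):
    """Fit GM(1,1) to a positive series and forecast `steps` future values."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    x1 = np.cumsum(x0)                           # accumulated generating operation (AGO)
    z1 = 0.5 * (x1[1:] + x1[:-1])                # background values
    B = np.column_stack((-z1, np.ones(n - 1)))
    Y = x0[1:]
    a, b = np.linalg.lstsq(B, Y, rcond=None)[0]  # developing coefficient, grey input
    k = np.arange(n + steps)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a
    x0_hat = np.concatenate(([x1_hat[0]], np.diff(x1_hat)))  # inverse AGO
    return x0_hat[n:]

# Hypothetical per-stage effort values (e.g. person-days per completed stage).
stage_effort = [120.0, 135.0, 150.0, 170.0, 185.0]
print(gm11_forecast(stage_effort, steps=2))
```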

    Investigating effort prediction of web-based applications using CBR on the ISBSG dataset

    As web-based applications become more popular and more sophisticated, so does the requirement for early, accurate estimates of the effort required to build such systems. Case-based reasoning (CBR) has been shown to be a reasonably effective estimation strategy, although it has not been widely explored in the context of web applications. This paper reports on a study carried out on a subset of the ISBSG dataset to examine the optimal number of analogies that should be used in making a prediction. The results show that it is not possible to select such a value with confidence and that, in common with other findings in different domains, the effectiveness of CBR is hampered by other factors, including the characteristics of the underlying dataset (such as the spread of data and the presence of outliers) and the calculation employed to evaluate the distance function (in particular, the treatment of numeric and categorical data).
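    A minimal sketch of analogy-based (CBR) estimation as discussed above: numeric features are min-max normalised, categorical features contribute a 0/1 mismatch to the distance, and the estimate is the mean effort of the k nearest analogies. The feature names and values are hypothetical, not the ISBSG schema.

```python
# Analogy-based effort estimation with a mixed numeric/categorical distance.
import numpy as np

def cbr_estimate(cases, target, k=3):
    """cases: list of (numeric_vector, category, effort); target: (numeric_vector, category)."""
    num = np.array([c[0] for c in cases], dtype=float)
    lo, span = num.min(axis=0), np.ptp(num, axis=0) + 1e-9
    norm = (num - lo) / span
    t = (np.asarray(target[0], dtype=float) - lo) / span
    # Distance: Euclidean on normalised numerics plus 1 for a category mismatch.
    dists = np.sqrt(((norm - t) ** 2).sum(axis=1)) + np.array(
        [0.0 if c[1] == target[1] else 1.0 for c in cases]
    )
    nearest = np.argsort(dists)[:k]
    return float(np.mean([cases[i][2] for i in nearest]))

projects = [
    ([120, 5], "web", 900.0),   # (size in FP, team size), platform, effort in hours
    ([300, 8], "web", 2100.0),
    ([150, 4], "desktop", 1100.0),
    ([90, 3], "web", 700.0),
]
print(cbr_estimate(projects, ([140, 5], "web"), k=2))
```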

    An Analysis of Data Sets Used to Train and Validate Cost Prediction Systems

    OBJECTIVE - The aim of this investigation is to build up a picture of the nature and type of data sets being used to develop and evaluate different software project effort prediction systems. We believe this to be important since there is a growing body of published work that seeks to assess different prediction approaches. Unfortunately, results to date are rather inconsistent, so we are interested in the extent to which this might be explained by different data sets. METHOD - We performed an exhaustive search, from 1980 onwards, of three software engineering journals for research papers that used project data sets to compare cost prediction systems. RESULTS - This identified a total of 50 papers that used, one or more times, a total of 74 unique project data sets. We observed that some of the better-known and publicly accessible data sets were used repeatedly, making them potentially disproportionately influential. Such data sets also tend to be amongst the oldest, with potential problems of obsolescence. We also note that only about 70% of all data sets are in the public domain, and this can be particularly problematic when the data set description is incomplete or limited. Finally, extracting relevant information from research papers has been time consuming due to differing styles of presentation and levels of contextual information. CONCLUSIONS - We believe there are two lessons to learn. First, the community needs to consider the quality and appropriateness of the data set being utilised; not all data sets are equal. Second, we need to assess the way results are presented in order to facilitate meta-analysis, and whether a standard protocol would be appropriate.

    The UK Benchmark Network - Designation, Evolution and Application

    The UK has one of the densest gauging station networks in the world – a necessary response to its diversity in terms of climate, geology, land use and patterns of water utilisation. This diversity and, particularly, the compelling impact of artificial influences on natural flow regimes across most of the country imply a considerable challenge in identifying, interpreting and indexing changes in river flow regimes. Quantifying and interpreting trends in river flows – in particular separating climate-driven changes from those resulting from other driving mechanisms – is a necessary prerequisite to the development of improved river and water management strategies. It is also a primary strategic objective of many national and international river flow monitoring programmes. This paper charts the development of the UK Benchmark Network from its initial promotion phase – involving key institutional partners in both the hydrometric data acquisition and user communities – through to its exploitation across a wide range of policy, scientific and engineering design applications. Particular consideration is given to the criteria used to appraise and select candidate catchments and gauging stations. Spatial characterisations (particularly physiographic, geological and land use) are used to determine the representativeness of individual candidate catchments, while hydrometric performance (especially in the extreme flow ranges), together with record length, is of primary importance in relation to gauging station selection. Indexing the degree to which artificial influences disturb the natural flow regime is also a necessary prerequisite for selection across much of the UK. Descriptions are given of a number of network and data review mechanisms developed to maximise the utility of the Benchmark Network, and of the burgeoning range of applications which have capitalised on it – embracing both national and international monitoring programmes. The review finishes with an overview of the strategic benefits deriving from the operation of the Benchmark Network and examines some of the enduring issues which require further work – including the continuing focus on operationally driven gauging activities, meeting the more stringent data demands of the Benchmark Network, and the need for further integration of catchment monitoring activities embracing a wider range of hydrometeorological variables.
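    Purely as an illustration of the trend quantification mentioned above (the network's own analytical methods are not described in this abstract), the sketch below applies a Mann-Kendall test, a common choice for detecting monotonic trends in annual river flows, to made-up data; ties are ignored in the variance for brevity.

```python
# Mann-Kendall trend test on a short annual-mean-flow series (illustrative).
import math

def mann_kendall_z(series):
    """Return the Mann-Kendall Z statistic for a 1-D sequence (no tie correction)."""
    n = len(series)
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s == 0:
        return 0.0
    return (s - 1) / math.sqrt(var_s) if s > 0 else (s + 1) / math.sqrt(var_s)

annual_mean_flow = [31.2, 29.8, 33.1, 30.5, 34.0, 35.2, 33.8, 36.1]  # m3/s, hypothetical
print(round(mann_kendall_z(annual_mean_flow), 2))  # |Z| > 1.96 suggests a trend at ~5%
```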

    Quantifying fisher responses to environmental and regulatory dynamics in marine systems

    Thesis (Ph.D.), University of Alaska Fairbanks, 2017.
    Commercial fisheries are part of an inherently complicated cycle. As fishers have adopted new technologies and larger vessels to compete for resources, fisheries managers have adapted regulatory structures to sustain stocks and to mitigate unintended impacts of fishing (e.g., bycatch). Meanwhile, the ecosystems that are targeted by fishers are affected by a changing climate, which in turn forces fishers to further adapt, and subsequently, will require regulations to be updated. From the management side, one of the great limitations for understanding how changes in fishery environments or regulations impact fishers has been a lack of sufficient data for resolving their behaviors. In some fisheries, observer programs have provided sufficient data for monitoring the dynamics of fishing fleets, but these programs are expensive and often do not cover every trip or vessel. In the last two decades, however, vessel monitoring systems (VMS) have begun to provide vessel location data at regular intervals such that fishing effort and behavioral decisions can be resolved across time and space for many fisheries. I demonstrate the utility of such data by examining the responses of two disparate fishing fleets to environmental and regulatory changes. This study was one of "big data" and required the development of nuanced approaches to process and model millions of records from multiple datasets. I thus present the work in three components: (1) How can we extract the information that we need? I present a detailed characterization of the types of data and an algorithm used to derive relevant behavioral aspects of fishing, like the duration and distances traveled during fishing trips; (2) How do fishers' spatial behaviors in the Bering Sea pollock fishery change in response to environmental variability; and (3) How were fisher behaviors and economic performances affected by a series of regulatory changes in the Gulf of Mexico grouper-tilefish longline fishery? I found a high degree of heterogeneity among vessel behaviors within the pollock fishery, underscoring the role that markets and processor-level decisions play in facilitating fisher responses to environmental change. In the Gulf of Mexico, my VMS-based approach estimated unobserved fishing effort with a high degree of accuracy and confirmed that the regulatory shift (e.g., the longline endorsement program and catch share program) yielded the intended impacts of reducing effort and improving both the economic performance and the overall harvest efficiency for the fleet. Overall, this work provides broadly applicable approaches for testing hypotheses regarding the dynamics of spatial behaviors in response to regulatory and environmental changes in a diversity of fisheries around the world.
    Contents: General introduction -- Chapter 1 Using vessel monitoring system data to identify and characterize trips made by fishing vessels in the United States North Pacific -- Chapter 2 Paths to resilience: Alaska pollock fleet uses multiple fishing strategies to buffer against environmental change in the Bering Sea -- Chapter 3 Vessel monitoring systems (VMS) reveal increased fishing efficiency following regulatory change in a bottom longline fishery -- General Conclusions.
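    A simplified sketch of the kind of VMS processing described above (not the thesis's algorithm): position pings are split into trips wherever the gap between consecutive pings exceeds a threshold, and each trip's duration and track length are then computed with the haversine formula. The pings and the gap threshold are fabricated.

```python
# Segment VMS pings into trips and summarise duration and distance per trip.
import math
from datetime import datetime, timedelta

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def split_trips(pings, max_gap=timedelta(hours=12)):
    """pings: list of (timestamp, lat, lon) sorted by time -> list of trips."""
    trips, current = [], [pings[0]]
    for ping in pings[1:]:
        if ping[0] - current[-1][0] > max_gap:
            trips.append(current)
            current = []
        current.append(ping)
    trips.append(current)
    return trips

pings = [
    (datetime(2012, 2, 1, 6, 0), 57.0, -170.0),
    (datetime(2012, 2, 1, 7, 0), 57.1, -170.2),
    (datetime(2012, 2, 3, 9, 0), 57.0, -170.0),  # long gap -> new trip
    (datetime(2012, 2, 3, 10, 0), 56.9, -169.8),
]
for trip in split_trips(pings):
    duration = trip[-1][0] - trip[0][0]
    dist = sum(haversine_km(a[1], a[2], b[1], b[2]) for a, b in zip(trip, trip[1:]))
    print(f"duration={duration}, distance={dist:.1f} km")
```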