4 research outputs found

    Towards an Intelligent System for Software Traceability Datasets Generation

    Software datasets and artifacts play a crucial role in advancing automated software traceability research. They can be used by researchers in different ways to develop or validate new automated approaches. Software artifacts, other than source code and issue tracking entities, can also provide a great deal of insight into a software system and facilitate knowledge sharing and information reuse. The diversity and quality of the datasets and artifacts within a research community have a significant impact on the accuracy, generalizability, and reproducibility of the results and consequently on the usefulness and practicality of the techniques under study. Collecting and assessing the quality of such datasets are not trivial tasks and have been reported as an obstacle by many researchers in the domain of software engineering. In this dissertation, we report our empirical work that aims to automatically generate and assess the quality of such datasets. Our goal is to introduce an intelligent system that can help researchers in the domain of software traceability in obtaining high-quality “training sets”, “testing sets” or appropriate “case studies” from open source repositories based on their needs. In the first project, we present a first-of-its-kind study to review and assess the datasets that have been used in software traceability research over the last fifteen years. It presents and articulates the current status of these datasets, their characteristics, and their threats to validity. Second, this dissertation introduces a Traceability-Dataset Quality Assessment (T-DQA) framework to categorize software traceability datasets and assist researchers to select appropriate datasets for their research based on different characteristics of the datasets and the context in which those datasets will be used. Third, we present the results of an empirical study with limited scope to generate datasets using three baseline approaches for the creation of training data. 
These approaches are (i) Expert-Based, (ii) Automated Web-Mining, which generates training sets by automatically mining tactic's APIs from technical programming websites, and lastly, (iii) Automated Big-Data Analysis, which mines ultra-large-scale code repositories to generate training sets. We compare the trace-link creation accuracy achieved using each of these three baseline approaches and discuss the costs and benefits associated with them. Additionally, in a separate study, we investigate the impact of training set size on the accuracy of recovering trace links. Finally, we conduct a large-scale study to identify which types of software artifacts are produced by a wide variety of open-source projects at different levels of granularity. Then we propose an automated approach based on Machine Learning techniques to identify various types of software artifacts. Through a set of experiments, we report and compare the performance of these algorithms when applied to software artifacts. Finally, we conducted a study to understand how software traceability experts and practitioners evaluate the quality of their datasets. In addition, we aim at gathering experts’ opinions on all quality attributes and metrics proposed by T-DQA.

    Efficient Information Retrieval for Software Bug Localization

    Software systems are often shipped with defects. When a bug is reported, developers use the information available in the associated report to locate source code fragments that need to be modified to fix the bug. However, as software systems evolve in size and complexity, bug localization can become a tedious and time-consuming process. Contemporary bug localization tools utilize Information Retrieval (IR) methods for automated support to minimize the manual effort. IR methods exploit the textual content of bug reports to capture and rank relevant buggy source files. However, for an IR-based bug localization tool to be useful, it must achieve adequate retrieval accuracy. Lower precision and recall can leave developers with large amounts of incorrect information to wade through. Motivated by these observations, in this dissertation, we propose a new paradigm of information-theoretic IR methods to support bug localization tasks in software systems. These methods exploit the co-occurrence patterns of code terms in software systems to reveal latent semantic information that other methods often fail to capture. We further investigate the impact of combining various IR methods on the retrieval accuracy of bug localization engines. The main assumption is that different IR methods, targeting different dimensions of similarity between software artifacts, can enhance the confidence in each other's results. Furthermore, we propose a novel approach for enhancing the performance of IR-enabled bug localization methods in the context of Open-Source Software (OSS). The proposed approach exploits knowledge from previously resolved bugs to help localize new bugs. Our analysis uses multiple datasets generated for multiple open-source and closed-source projects. 
Our results show that a) information-theoretic IR methods can significantly outperform classical IR methods in bug localization tasks, b) optimized IR-hybrids can significantly outperform individual IR methods, and near-optimal global configurations can be determined for different combinations of IR methods, and c) information extracted from previously resolved bug reports can significantly enhance the accuracy of IR-enabled bug localization methods in OSS.
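As a rough illustration of the general idea of ranking source files against a bug report, the sketch below uses a classical TF-IDF and cosine-similarity baseline. It is not the information-theoretic method proposed in the dissertation; the function names, the tokenizer, and the sample files are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_files(bug_report, files):
    """Rank files (name -> text) by TF-IDF cosine similarity to a bug report."""
    counts = {name: Counter(tokenize(text)) for name, text in files.items()}
    n_docs = len(counts)

    # Document frequency: in how many files each term appears.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in df}

    vectors = {
        name: {t: tf * idf[t] for t, tf in c.items()}
        for name, c in counts.items()
    }

    # Query terms that never occur in the corpus carry no weight.
    query = {
        t: tf * idf[t]
        for t, tf in Counter(tokenize(bug_report)).items()
        if t in idf
    }

    scores = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample_files = {
        "parser.py": "def parse token stream syntax error",
        "net.py": "socket connect timeout retry",
    }
    for name, score in rank_files("crash with syntax error on token", sample_files):
        print(name, round(score, 3))
```

A hybrid approach of the kind the abstract mentions would combine scores from several such rankers; how the dissertation weights them is not stated here.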

    Atmospheric profiles of CO₂ as integrators of regional scale exchange

    The global climate is changing due to the accumulation of greenhouse gases (GHGs) in the atmosphere, primarily due to anthropogenic activity. The dominant GHG is CO₂, which originates from combustion of fossil fuels, land use change and management. The terrestrial biosphere is a key driver of climate and biogeochemical cycles at regional and global scales. Furthermore, the response of the Earth system to future drivers of climate change will depend on feedbacks between biogeochemistry and climate. Therefore, understanding these processes requires a mechanistic approach in any model simulation framework. However, ecosystem processes are complex and nonlinear, and consequently models need to be validated against observations at multiple spatial scales. In this thesis the Weather Research and Forecasting model (WRF) has been coupled to the mechanistic terrestrial ecosystem model Soil-Plant-Atmosphere (SPA), creating WRF-SPA. The thesis is split into three main chapters: i. WRF-SPA model development and validation at multiple spatial scales, scaling from surface fluxes of CO₂ and energy to aircraft profiles and tall tower observations of atmospheric CO₂ concentrations. ii. Investigation of ecosystem contributions to observations of atmospheric CO₂ concentrations made at tall tower Angus, Dundee, Scotland using ecosystem-specific CO₂ tracers at seasonal and interannual time scales. iii. An assessment of detectability of a policy-relevant national-scale afforestation by observations made at a tall tower. Detectability of changes in atmospheric CO₂ concentrations was assessed through a comparison of a control simulation, using current day forest extent, and an experimentally afforested simulation using WRF-SPA. WRF-SPA performs well at both site and regional scales, accurately simulating aircraft profiles of CO₂ concentration magnitudes (error within ±4 ppm), indicating appropriate source-sink distribution and realistic atmospheric transport. 
Hourly observations made at tall tower Angus were also well simulated by WRF-SPA (R² = 0.67, RMSE = 3.5 ppm, bias = 0.58 ppm). Analysis of CO₂ tracers at tall tower Angus shows an increase in the seasonal error between WRF-SPA simulated atmospheric CO₂ and observations, which coincides with simulated cropland harvest. WRF-SPA does not simulate uncultivated land associated with agriculture, which in Scotland represents 36 % of agricultural holdings. Therefore, uncultivated land components may provide an explanation for the increase in model-data error. Interannual variation in weather is indicated to have a greater impact on ecosystem-specific contributions to atmospheric CO₂ concentrations at Angus than variation in surface activity. In a model experiment, afforestation of Scotland was simulated to test the impact on Scotland’s carbon balance. The changes were shown to be potentially detectable by observations made at tall tower Angus. Afforestation results in a reduction in atmospheric CO₂ concentrations by up to 0.6 ppm at seasonal time scales at tall tower Angus. Detection of changes in forest surface net CO₂ uptake flux due to afforestation was improved through the use of a network of tall towers (R² = 0.83) compared to tall tower Angus alone (R² = 0.75).
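The model-evaluation statistics quoted above (R², RMSE, bias) can be computed along the following lines. This is a minimal sketch under stated assumptions: R² is taken to be the squared Pearson correlation and bias the mean of simulated minus observed values. The thesis abstract does not specify its exact definitions, and the function name and sample values are hypothetical.

```python
import math


def evaluation_stats(simulated, observed):
    """Return R^2, RMSE and bias for paired simulated/observed values.

    Assumptions (not confirmed by the source): R^2 is the squared Pearson
    correlation, and bias is mean(simulated) - mean(observed).
    """
    n = len(observed)
    if n != len(simulated) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")

    mean_s = sum(simulated) / n
    mean_o = sum(observed) / n

    bias = mean_s - mean_o
    rmse = math.sqrt(sum((s - o) ** 2 for s, o in zip(simulated, observed)) / n)

    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(simulated, observed))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_s == 0.0 or var_o == 0.0:
        raise ValueError("R^2 is undefined for a constant series")

    r2 = (cov * cov) / (var_s * var_o)
    return {"r2": r2, "rmse": rmse, "bias": bias}


if __name__ == "__main__":
    # Made-up CO2 concentrations in ppm, for illustration only.
    obs = [400.0, 402.0, 405.0, 399.0]
    sim = [401.0, 403.0, 406.0, 400.0]
    print(evaluation_stats(sim, obs))
```

A constant offset between model and observations, as in the sample data, leaves R² at 1 while RMSE and bias both equal the offset, which is why the three statistics are usually reported together.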